A summary of the feature selection intuition behind tree-based and regression models, plus some common feature selection strategies such as Recursive Feature Elimination, Feature Permutation, and Feature Dropout.
Complete Feature Selection Techniques
Tree-Based Model
During the growing (splitting) process, a Decision Tree evaluates all features and selects the feature (and its split value) that, after splitting the node, gives the largest improvement in the splitting criterion, such as Gini impurity or Information Gain for classification, or Variance reduction for regression.
So, for each feature we can aggregate
how many times the feature is used for node splitting, or
how much the splitting criterion improves when nodes are split on that feature,
as the feature's importance score, and then select the high-scoring features.
The same intuition applies to all tree-based models such as Random Forest, XGBoost, etc.
Below is an example in Python.
import xgboost as xgb
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston   # note: load_boston was removed in scikit-learn 1.2
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

boston = load_boston()
data = pd.DataFrame(boston.data, columns=boston.feature_names)
data['PRICE'] = boston.target

X, y = data.iloc[:, :-1], data.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=22)

xg_reg = xgb.XGBRegressor(
    n_estimators=50,
    max_depth=4,
    learning_rate=0.3)
xg_reg.fit(X_train, y_train)

preds = xg_reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))
# RMSE: 3.509045

# Aggregate each feature's total gain (splitting-criterion improvement) across all splits
feature_imp = xg_reg.get_booster().get_score(importance_type='gain')
sorted_feature_imp = sorted(feature_imp.items(), key=lambda kv: kv[1])

plt.barh([name for name, _ in sorted_feature_imp],
         [score for _, score in sorted_feature_imp])
plt.xlabel("Feature Importance")
Regression Model (LASSO Linear Regression)
In linear regression, the objective function with L1 regularization is as below.
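A standard way to write it, where n is the number of samples, p the number of features, and λ the strength of the L1 penalty on the coefficients β:

\min_{\beta}\ \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert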
The intuition is that we use the coefficients β as feature importance:
If a feature's coefficient β is far away from 0, the feature is important.
If a feature's coefficient β is very close to 0, the feature has little importance, and we may remove it from model training.
One note here: feature values need to be normalized or scaled so that the coefficients can be compared fairly.
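As a minimal sketch using scikit-learn's Lasso on the same Boston data (alpha=0.1 is an arbitrary regularization strength), features whose coefficients are driven to zero would be dropped:

from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Scale the features so their coefficients can be compared fairly
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

lasso = Lasso(alpha=0.1)          # alpha controls the L1 penalty strength (illustrative value)
lasso.fit(X_train_scaled, y_train)

# Keep only the features whose coefficients stay away from zero
print(X_train.columns[np.abs(lasso.coef_) > 1e-6])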
Recursive Feature Elimination
The strategy's intuition: train a model on all features, rank the features by importance, remove the least important feature, and repeat the process until only the desired number of features remains.
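A minimal sketch with scikit-learn's RFE, assuming a LinearRegression estimator and keeping 8 features (both choices are illustrative):

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Recursively drop the least important feature until 8 remain
rfe = RFE(estimator=LinearRegression(), n_features_to_select=8, step=1)
rfe.fit(X_train, y_train)

print(X_train.columns[rfe.support_])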
Feature Dropout
The feature dropout strategy is similar to Backward Feature Elimination: drop one feature at a time, retrain the model, and measure how much performance degrades; features whose removal barely hurts performance are candidates for removal, as in the sketch below.
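As a rough sketch of this idea (reusing the XGBoost regressor and RMSE metric from the earlier example; the model and metric are just illustrative choices), each column is dropped in turn, the model is retrained, and the test error is compared against the full-feature baseline:

# Baseline test error with all features
baseline_rmse = np.sqrt(mean_squared_error(y_test, xg_reg.predict(X_test)))

for col in X_train.columns:
    model = xgb.XGBRegressor(n_estimators=50, max_depth=4, learning_rate=0.3)
    model.fit(X_train.drop(columns=[col]), y_train)
    preds = model.predict(X_test.drop(columns=[col]))
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    # A small (or negative) increase over the baseline suggests the feature adds little value
    print(col, round(rmse - baseline_rmse, 4))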
Feature Permutation
Feature Permutation means that, for one feature, we randomly shuffle that feature's values, which breaks the relationship between the feature and the target variable. An example follows below.
The basic intuition for Feature Permutation is:
We assume a shuffled feature should be less important than a useful feature, so if a real feature scores lower than its shuffled counterpart, that feature is a candidate to be removed.
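As a minimal sketch, scikit-learn's permutation_importance helper shuffles each feature a number of times and records how much the test score drops; here it is applied to the XGBoost model fitted earlier (n_repeats=10 is an arbitrary choice):

from sklearn.inspection import permutation_importance

# Shuffle each feature n_repeats times and measure the average drop in test score
result = permutation_importance(xg_reg, X_test, y_test, n_repeats=10, random_state=22)

for name, mean_drop in sorted(zip(X_test.columns, result.importances_mean),
                              key=lambda kv: kv[1]):
    print(name, round(mean_drop, 4))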
Based on this idea, the Boruta project came up with an implementation to run this feature selection.
The general process of Boruta is as below:
1. Create randomly shuffled copies of the features (shadow features SF_X) and attach them to the original dataset.
2. Train a Random Forest model on this augmented dataset.
3. Calculate every feature's importance.
4. Set Xi = 1 (i = 1, 2, 3, 4) if the importance of feature Xi > max(importance of SF_X1, SF_X2, SF_X3, SF_X4), else Xi = 0.
5. Accumulate the hits: Xi_Sum = Xi_Sum + Xi (i = 1, 2, 3, 4).
6. Rerun steps 1 to 5, N times.
7. Check each Xi_Sum against a Binomial distribution with N trials, and based on a preset confidence level classify every feature as strong, weak, or eliminated.
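Step 7 can be illustrated with a small numerical sketch. This is only the idea behind the test, not Boruta's exact implementation; N and the hit counts below are made-up illustrative numbers, and the null hypothesis is that a feature beats the best shadow feature in half of the iterations:

from scipy.stats import binom

N = 100                                            # number of Boruta iterations (hypothetical)
hits = {'X1': 85, 'X2': 55, 'X3': 12, 'X4': 48}    # hypothetical Xi_Sum hit counts

for name, k in hits.items():
    p_better = binom.sf(k - 1, N, 0.5)   # P(at least k hits) under the null: small -> strong
    p_worse = binom.cdf(k, N, 0.5)       # P(at most k hits) under the null: small -> eliminated
    if p_better < 0.05:
        label = 'strong'
    elif p_worse < 0.05:
        label = 'eliminated'
    else:
        label = 'weak / undecided'
    print(name, label)

The BorutaPy package implements the full procedure; applying it to the same Boston data: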
# pip install Boruta   (or: conda install -c conda-forge boruta_py)
from boruta import BorutaPy
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(
    n_jobs=-1,
    max_depth=5
)
boruta = BorutaPy(
    estimator=forest,
    alpha=0.05,    # p-value threshold for the binomial test
    max_iter=100   # number of Boruta iterations (N)
)
boruta.fit(X_train.values, y_train.values)

print('Strong Features')
print(boston.feature_names[boruta.support_])
print('Weak Features')
print(boston.feature_names[boruta.support_weak_])

# Strong Features
# ['CRIM' 'NOX' 'RM' 'DIS' 'TAX' 'PTRATIO' 'LSTAT']
# Weak Features
# ['AGE' 'B']
REFERENCES
- Feature Selection Methods for Data Science (just a few)
- Boruta Explained Exactly How You Wished Someone Explained to You
- Permutation Importance
- Automated feature selection with boruta