Complete Feature Selection Techniques 4-4: Model Driven

Summer Hu
Feb 25, 2022


This article summarizes the feature selection intuition behind tree-based and regression-based models, plus some common model-driven strategies such as Recursive Feature Elimination, Feature Permutation and Feature Dropout.

Hopi, Native American Tribe

Complete Feature Selection Techniques

  1. Statistical Test & Analysis
  2. Correlation Analysis
  3. Dimension Reduction
  4. Model Driven

Tree Based Model

In the decision tree growing (splitting) process, the tree evaluates all features and selects the feature (and its split value) that, after splitting a node, gives the maximum improvement in the splitting criterion, such as Gini or Information Gain for classification, or Variance for regression.

So, for each feature we can aggregate

  - how many times the feature is used to split a node, or
  - how much the splitting criterion improves when a node is split on that feature

as the feature’s importance score, then select the high-score features.

The same intuition applies to all tree-based models such as Random Forest, XGBoost, etc.

Below is an example in Python.

import xgboost as xgb
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
# Load the Boston housing data and build the train/test split
boston = load_boston()
data = pd.DataFrame(boston.data)
data['PRICE'] = boston.target
X, y = data.iloc[:, :-1], data.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=22)
xg_reg = xgb.XGBRegressor(
    n_estimators=50,
    max_depth=4,
    learning_rate=0.3)
xg_reg.fit(X_train, y_train)
preds = xg_reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))
RMSE: 3.509045
# Importance by total gain contributed by each feature across all splits
feature_imp = xg_reg.get_booster().get_score(importance_type='gain')
sorted_feature_imp = np.array(sorted(feature_imp.items(), key=lambda kv: kv[1], reverse=False))
plt.barh(
    boston.feature_names[sorted_feature_imp[:, 0].astype(int)],
    sorted_feature_imp[:, 1].astype(float))
plt.xlabel("Feature Importance")

Regression Model (LASSO Linear Regression)

In linear regression, the objective function with L1 regularization is as below.
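In standard notation, LASSO minimizes the squared error plus an L1 penalty on the coefficients (the scaling of the error term varies between references):

\min_{\beta} \left\{ \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\}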

https://en.wikipedia.org/wiki/Lasso_(statistics)

The intuition is to use the coefficients β as feature importance:

If a feature’s coefficient β is far away from 0, then the feature is more important.

If a feature’s coefficient β is very close to 0, then the feature has little importance, and we may remove it from model training.

One note here is that feature values need to be normalized or scaled so that their coefficients can be compared fairly.
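A minimal sketch of coefficient-based selection with scikit-learn’s Lasso, reusing X_train, y_train and boston.feature_names from the XGBoost example above; the alpha value is an arbitrary choice for illustration:

from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Scale features so the fitted coefficients are comparable across features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# alpha controls the strength of the L1 penalty (0.1 is an arbitrary choice here)
lasso = Lasso(alpha=0.1)
lasso.fit(X_train_scaled, y_train)

# Features with coefficients at or near zero are candidates to drop
for name, coef in zip(boston.feature_names, lasso.coef_):
    print(f"{name}: {coef:.4f}")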

Recursive Feature Elimination

The strategy’s intuition: train a model on all features, rank the features by importance, drop the least important one(s), then retrain and repeat until the desired number of features is left. It is illustrated below:

https://medium.com/analytics-vidhya/feature-selection-methods-for-data-science-just-a-few-fca3086eb445
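A minimal sketch of RFE with scikit-learn, again reusing the earlier Boston split; the choice of a random forest estimator and a target of 6 features is just for illustration:

from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor

# Recursively drop the least important feature until 6 remain
rfe = RFE(estimator=RandomForestRegressor(n_estimators=100),
          n_features_to_select=6)
rfe.fit(X_train, y_train)

print(boston.feature_names[rfe.support_])  # selected features
print(rfe.ranking_)                        # rank 1 = kept, larger = eliminated earlier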

Feature Dropout

The feature dropout strategy is similar to Backward Feature Elimination: drop one feature at a time, retrain the model, and measure how much the validation error increases; features whose removal barely hurts performance are candidates for elimination, as in the sketch below.
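A minimal dropout sketch, assuming the earlier variables (X_train, X_test, y_train, y_test, rmse) are still in scope; it retrains the same XGBoost regressor once per dropped column:

baseline = rmse  # RMSE of the XGBoost model trained on all features (computed above)

for col in X_train.columns:
    model = xgb.XGBRegressor(n_estimators=50, max_depth=4, learning_rate=0.3)
    model.fit(X_train.drop(columns=[col]), y_train)
    preds_drop = model.predict(X_test.drop(columns=[col]))
    rmse_drop = np.sqrt(mean_squared_error(y_test, preds_drop))
    # A small increase (or a decrease) versus the baseline suggests the feature adds little
    print(f"drop {boston.feature_names[col]}: RMSE {rmse_drop:.3f} vs baseline {baseline:.3f}")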

Feature Permutation

Feature Permutation means that, for one feature, we randomly shuffle the feature’s values, which breaks the relationship between the feature and the target variable. Below is an example:

https://www.kaggle.com/dansbecker/permutation-importance

The basic intuition for Feature Permutation is:

We assume a shuffled feature should have less importance than a good feature, so if a feature has lower importance than the shuffled features, then that feature is a candidate to be removed.
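A minimal sketch using scikit-learn’s permutation_importance on the fitted xg_reg model from the earlier example (n_repeats and random_state are illustrative):

from sklearn.inspection import permutation_importance

# Shuffle each feature n_repeats times and measure the drop in the test score
perm = permutation_importance(xg_reg, X_test, y_test, n_repeats=10, random_state=22)

for i in perm.importances_mean.argsort()[::-1]:
    print(f"{boston.feature_names[i]}: "
          f"{perm.importances_mean[i]:.4f} +/- {perm.importances_std[i]:.4f}")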

Based on this idea, the Boruta project came up with an implementation for feature selection.

The general process of Boruta is as follows:

1. Create randomly shuffled copies of the original features (shadow features SF_X) and attach them to the original dataset.

2. Train a Random Forest model on the extended dataset.

3. Calculate every feature’s importance.

4. For each original feature Xi (i = 1, 2, 3, 4 in this illustration), set Xi = 1 if the Feature Importance of Xi > max(Feature Importance of SF_X1, SF_X2, SF_X3, SF_X4), otherwise Xi = 0.

5. Accumulate the hits: Xi_Sum = Xi_Sum + Xi (i = 1, 2, 3, 4).

6. Rerun steps 1 to 5, N times.

7. Check each Xi_Sum against the Binomial distribution for N trials, and based on a preset confidence level classify every feature as strong, weak, or eliminated (see the sketch after this list).
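A rough sketch of the step-7 decision, assuming each iteration gives a feature a 50% chance of beating the best shadow feature by luck alone; the exact tail correction used inside boruta_py may differ:

from scipy.stats import binom

N = 100        # number of Boruta iterations
alpha = 0.05   # preset confidence level

# Under pure chance the hit count follows Binomial(N, 0.5)
upper = binom.ppf(1 - alpha, N, 0.5)  # hits above this -> strong (confirmed)
lower = binom.ppf(alpha, N, 0.5)      # hits below this -> eliminated

def classify(hit_sum):
    if hit_sum > upper:
        return 'strong'
    if hit_sum < lower:
        return 'eliminated'
    return 'weak (tentative)'

print(classify(80), classify(50), classify(20))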

https://towardsdatascience.com/boruta-explained-the-way-i-wish-someone-explained-it-to-me-4489d70e154a
pip install Boruta or conda install -c conda-forge boruta_py
from boruta import BorutaPy
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(
    n_jobs=-1,
    max_depth=5
)
boruta = BorutaPy(
    estimator=forest,
    alpha=0.05,   # p-value threshold
    max_iter=100
)
boruta.fit(X_train.values, y_train.values)
print('Strong Features')
print(boston.feature_names[boruta.support_])
print('Weak Features')
print(boston.feature_names[boruta.support_weak_])
Strong Features
['CRIM' 'NOX' 'RM' 'DIS' 'TAX' 'PTRATIO' 'LSTAT']
Weak Features
['AGE' 'B']

REFERENCE

  1. Feature Selection Methods for Data Science (just a few)
  2. Boruta Explained Exactly How You Wished Someone Explained to You
  3. Permutation Importance
  4. Automated feature selection with boruta


Written by Summer Hu

Data Scientist & Engineer from Sydney
