2.4.4. Hybrid methods
2.4.4.1. Feature shuffling
Shuffle the values of each feature, one feature at a time, and measure how much the permutation decreases the accuracy, the roc-auc, the mse, or any other performance metric of the machine learning model. If a variable is important, a random permutation of its values will noticeably degrade these metrics. Conversely, if a variable is unimportant, shuffling its values should have little to no effect on the performance metric we are assessing.
The procedure goes more or less like this:
1. Build a machine learning model and store its performance metric.
2. Shuffle the values of one feature and make a new prediction with the model from step 1.
3. Determine the performance of this prediction.
4. Determine the change in performance between the prediction made with the shuffled feature and the original prediction.
5. Repeat steps 2-4 for each feature.
To select features, we choose those whose shuffling induces a drop in model performance beyond an arbitrarily set threshold.
Note: Feature importance determined this way is specific to the algorithm used; different algorithms may therefore return different subsets of important features.
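Before using the library implementation below, a minimal from-scratch sketch of the shuffling loop may help clarify the idea. The helper name and the data are illustrative assumptions, not part of feature-engine:

import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def shuffling_importance(model, X, y, random_state=0):
    # illustrative helper: drop in roc-auc caused by shuffling each feature
    rng = np.random.RandomState(random_state)
    baseline = roc_auc_score(y, model.predict_proba(X)[:, 1])
    drops = {}
    for feature in X.columns:
        X_shuffled = X.copy()
        # permute the values of a single feature, leave the rest untouched
        X_shuffled[feature] = rng.permutation(X_shuffled[feature].values)
        score = roc_auc_score(y, model.predict_proba(X_shuffled)[:, 1])
        drops[feature] = baseline - score
    return pd.Series(drops).sort_values(ascending=False)

# usage (assuming a classifier already fitted on X_train, y_train):
# shuffling_importance(rf, X_train, y_train)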
from feature_engine.selection import SelectByShuffling
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=50, max_depth=2, n_jobs=4)
sel = SelectByShuffling(
    variables=None,     # automatically examine all numerical variables
    estimator=rf,       # the ML model
    scoring='roc_auc',  # the metric to evaluate
    threshold=0,        # select features whose shuffling drops performance by more than this value
    cv=3,               # cross-validation
    random_state=1,     # seed for the shuffling
)
sel.fit(X_train, y_train)
SelectByShuffling(estimator=RandomForestClassifier(max_depth=2, n_estimators=50, n_jobs=4), random_state=1, threshold=0)
# performance of model trained with all features
sel.initial_model_performance_
0.6884136580496532
# drop in performance caused by shuffling each feature
sel.performance_drifts_
# bar plot of the performance drop per feature
import pandas as pd

pd.Series(sel.performance_drifts_).sort_values(ascending=False).plot.bar(figsize=(18, 4))

# remove features
X_train_sl = sel.transform(X_train)
X_test_sl = sel.transform(X_test)
X_train_sl.shape, X_test_sl.shape
((35000, 8), (15000, 8))
X_train_sl.columns
Index(['var_16', 'var_34', 'var_48', 'var_55', 'var_69', 'var_70', 'var_91',
'var_108'],
dtype='object')
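To inspect the features that were discarded rather than kept, the fitted selector also stores them; a quick check, assuming the attribute name documented in feature-engine's selection API:

# features whose shuffling did not degrade performance beyond the threshold
sel.features_to_drop_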
2.4.4.2. Recursive feature elimination
This method consists of the following steps:
1. Rank the features according to their importance derived from a machine learning algorithm: it can be tree-based importance or the coefficients of a linear model.
2. Remove the least important feature and build a new machine learning model with the remaining features.
3. Calculate a performance metric of choice: roc-auc, mse, rmse, accuracy, etc.
4. If the metric decreases by more than an arbitrarily set threshold, the removed feature is important and should be kept. Otherwise, it can be removed for good.
5. Repeat steps 2-4 until all features have been evaluated.
This is a hybrid method because:
it derives the importance of the features from a machine learning model, like embedded methods
it builds several machine learning models, like wrapper methods.
Advantages:
This method is faster than wrapper methods, because it evaluates far fewer feature combinations, and it often selects better feature subsets than embedded methods. In practice, it works extremely well.
Note:
The minimum drop in performance required to keep a feature is set arbitrarily. The smaller the threshold, the more features will be selected, and vice versa.
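For intuition, here is a minimal from-scratch sketch of the elimination loop described above, before turning to the library implementation. The helper name is an illustrative assumption and the loop uses cross-validated roc-auc:

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def recursive_elimination(X, y, threshold=0.0005, cv=2):
    # illustrative helper: drop features one at a time, least important first,
    # keeping those whose removal degrades the roc-auc by more than the threshold
    model = GradientBoostingClassifier(n_estimators=10, max_depth=2, random_state=10)
    model.fit(X, y)
    # rank features from least to most important
    ranking = pd.Series(model.feature_importances_, index=X.columns).sort_values()
    selected = list(X.columns)
    baseline = cross_val_score(model, X[selected], y, cv=cv, scoring='roc_auc').mean()
    for feature in ranking.index:
        trial = [f for f in selected if f != feature]
        score = cross_val_score(model, X[trial], y, cv=cv, scoring='roc_auc').mean()
        if baseline - score < threshold:
            # removing the feature barely hurts: drop it and update the baseline
            selected, baseline = trial, score
    return selected

# usage (hypothetical): selected = recursive_elimination(X_train, y_train)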
from feature_engine.selection import RecursiveFeatureElimination
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.metrics import roc_auc_score, r2_score
# the ML model for which we want to select features
model = GradientBoostingClassifier(
    n_estimators=10,
    max_depth=2,
    random_state=10,
)
# set up the RFE selector
sel = RecursiveFeatureElimination(
    variables=None,     # automatically evaluate all numerical variables
    estimator=model,    # the ML model
    scoring='roc_auc',  # the metric we want to evaluate
    threshold=0.0005,   # the maximum performance drop allowed to remove a feature
    cv=2,               # cross-validation
)
# this may take quite a while, because
# we are building a lot of models with cross-validation
sel.fit(X_train, y_train)
RecursiveFeatureElimination(cv=2, estimator=GradientBoostingClassifier(max_depth=2, n_estimators=10, random_state=10), threshold=0.0005)
# select features
X_train_sl = sel.transform(X_train)
X_test_sl = sel.transform(X_test)
X_train_sl.shape, X_test_sl.shape
((35000, 2), (15000, 2))
# importance of all features based on the initial model
import matplotlib.pyplot as plt

sel.feature_importances_.plot.bar(figsize=(20, 6))
plt.xlabel('Features')
plt.ylabel('Importance')
plt.show()

# performance drift per feature during the elimination process
pd.Series(sel.performance_drifts_).plot.bar(figsize=(20, 6))
plt.xlabel('Features')
plt.ylabel('Performance change when the feature was removed')
plt.show()

# check model performance with all vs selected features
def get_scoring(model, X_train, y_train, X_test, y_test, score_func):
    # fit the model with the given features
    model.fit(X_train, y_train)
    # make predictions
    y_pred_test = model.predict_proba(X_test)[:, 1]
    # calculate the roc-auc
    score = score_func(y_test, y_pred_test)
    print('Test selected features score = %f' % (score))

# all features
get_scoring(model, X_train, y_train, X_test, y_test, roc_auc_score)

# selected features only
get_scoring(model, X_train_sl, y_train, X_test_sl, y_test, roc_auc_score)
Test selected features score = 0.690742
Test selected features score = 0.688795
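In other words, the model trained on only the 2 selected features reaches roughly the same roc-auc on the test set (0.689) as the model trained on all the features (0.691), with a much simpler feature space.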
2.4.4.3. Recursive feature addition
This method consists of the following steps:
1. Rank the features according to their importance derived from a machine learning algorithm: it can be tree-based importance or the coefficients of a linear model.
2. Build a machine learning model with only the most important feature and calculate its performance metric.
3. Add the next most important feature and build a new machine learning model with it and all the features kept in previous rounds.
4. Calculate a performance metric of your choice: roc-auc, mse, rmse, accuracy, etc.
5. If the metric increases by more than an arbitrarily set threshold, the added feature is important and should be kept. Otherwise, it can be removed.
6. Repeat steps 3-5 until all features have been evaluated.
This is a hybrid method because:
it derives the importance of the features from a machine learning model, like embedded methods
it builds several machine learning models, like wrapper methods.
Advantages:
This method is faster than wrapper methods, because it evaluates far fewer feature combinations, and it often selects better feature subsets than embedded methods. In practice, it works extremely well.
Note:
The minimum increase in performance required to keep a feature is set arbitrarily. The smaller the threshold, the more features will be selected, and vice versa.
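As with elimination, a minimal from-scratch sketch of the addition loop may help before the library implementation; the helper name is an illustrative assumption:

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def recursive_addition(X, y, threshold=0.0001, cv=2):
    # illustrative helper: add features one at a time, most important first,
    # keeping those that improve the cross-validated roc-auc by more than the threshold
    model = GradientBoostingClassifier(n_estimators=10, max_depth=2, random_state=10)
    model.fit(X, y)
    # rank features from most to least important
    ranking = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
    selected = [ranking.index[0]]  # start with the single most important feature
    baseline = cross_val_score(model, X[selected], y, cv=cv, scoring='roc_auc').mean()
    for feature in ranking.index[1:]:
        trial = selected + [feature]
        score = cross_val_score(model, X[trial], y, cv=cv, scoring='roc_auc').mean()
        if score - baseline > threshold:
            # the feature improves performance enough: keep it
            selected, baseline = trial, score
    return selected

# usage (hypothetical): selected = recursive_addition(X_train, y_train)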
from feature_engine.selection import RecursiveFeatureAddition
# the ML model for which we want to select features
model = GradientBoostingClassifier(
    n_estimators=10,
    max_depth=2,
    random_state=10,
)
# set up the RFA selector
rfa = RecursiveFeatureAddition(
    variables=None,     # automatically evaluate all numerical variables
    estimator=model,    # the ML model
    scoring='roc_auc',  # the metric we want to evaluate
    threshold=0.0001,   # the minimum performance increase needed to select a feature
    cv=2,               # cross-validation
)
rfa.fit(X_train, y_train)
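Once fitted, the selector can reduce the datasets in the same way as the previous transformers:

# keep only the selected features
X_train_sl = rfa.transform(X_train)
X_test_sl = rfa.transform(X_test)
X_train_sl.shape, X_test_sl.shape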