2.4.2. Wrapper methods#

  • Use an ML model to score a feature subset by training the model on that subset.

1. Advantages:

  • Often gives the best-scoring feature subset for a given ML model

  • Detects interactions between features

2. Disadvantages:

  • More computationally expensive than other methods.

  • May not give the best feature combination for a different ML model

3. Usage:

  • Appropriate for selecting the feature subset for a single model that will be built on that subset of features

4. Procedure:

  • Search for a subset of features

  • Build an ML model on the feature subset and evaluate the model

  • Repeat for other subsets (a minimal sketch of this loop is shown below)
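
The loop below is a minimal sketch of this procedure, assuming X_train and y_train are the pandas DataFrame/Series used in the examples further down; restricting the candidates to 2-feature combinations from the first 5 columns is only to keep the toy search space small.

from itertools import combinations

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def score_subset(subset):
    # build a model on the candidate feature subset and evaluate it with cross-validation
    model = RandomForestClassifier(n_estimators=10, random_state=0)
    return cross_val_score(model, X_train[list(subset)], y_train, cv=2, scoring='roc_auc').mean()

# toy search space: every 2-feature combination drawn from the first 5 columns
candidates = combinations(X_train.columns[:5], 2)
best_subset = max(candidates, key=score_subset)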

5. How to search?

  • Forward selection: start with no features; at each iteration, add the feature from the remaining pool of unselected features that most improves model performance, until a predefined criterion is met (see the sketch after the next list).

  • Backward selection: start with all features in the pool; at each iteration, remove the least significant feature (the one whose removal yields the best-performing model), until a predefined criterion is met.

  • Exhaustive search: evaluates all possible k-feature combinations, with k from 1 to N (the total number of features), then selects the combination that performs best (often impractical).

6. How to stop the search?

  • The performance improvement (or drop) between iterations falls below a predefined threshold

  • The predefined maximum number of features is reached
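
The snippet below is a rough sketch of forward selection combining both stopping rules (a tolerance on the score improvement and a maximum number of features); it assumes the same X_train/y_train used in the examples that follow, and the tol and max_features values are arbitrary.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

selected = []
remaining = list(X_train.columns)
best_score, tol, max_features = 0.0, 0.001, 10

while remaining and len(selected) < max_features:
    # score every candidate feature when added to the current subset
    scores = {}
    for feature in remaining:
        model = RandomForestClassifier(n_estimators=10, random_state=0)
        scores[feature] = cross_val_score(
            model, X_train[selected + [feature]], y_train,
            cv=2, scoring='roc_auc').mean()

    best_feature = max(scores, key=scores.get)

    # stop when the best candidate no longer improves the score by at least tol
    if scores[best_feature] - best_score < tol:
        break

    selected.append(best_feature)
    remaining.remove(best_feature)
    best_score = scores[best_feature]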

2.4.2.1. Step-forward#

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector as SFS

# within the SFS we indicate:
# 1) the algorithm we want to use, in this case RandomForests (note that I use few trees to speed things up)
# 2) the stopping criteria: see the sklearn documentation for more details
# 3) whether to perform step forward or step backward selection
# 4) the evaluation metric: in this case the roc_auc
# 5) the cross-validation scheme

# this is going to take a while, do not despair

sfs = SFS(
    estimator=RandomForestClassifier(n_estimators=10, n_jobs=4, random_state=0),
    n_features_to_select=10,  # the number of features to retain
    tol=None,  # the minimum change in performance required to keep adding features (only used when n_features_to_select='auto')
    direction='forward',  # the direction of the selection procedure
    scoring='roc_auc',  # the metric to evaluate
    cv=2,  # the cross-validation folds
    n_jobs=4,  # for parallelization
)

sfs = sfs.fit(X_train, y_train)
sfs.get_feature_names_out()
array(['var_16', 'var_17', 'var_21', 'var_44', 'var_45', 'var_48',
       'var_55', 'var_91', 'var_103', 'var_108'], dtype=object)
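
Once fitted, the selector can be used to reduce the datasets to the retained features (X_test is assumed to exist alongside X_train); if you prefer the search to stop automatically instead of fixing the number of features, recent sklearn versions also accept n_features_to_select='auto' together with a tol threshold.

# reduce the train and test sets to the selected features
X_train_sfs = sfs.transform(X_train)
X_test_sfs = sfs.transform(X_test)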

2.4.2.2. Step-backward#

sfs = SFS(
    estimator=RandomForestClassifier(
        n_estimators=10, n_jobs=4, random_state=0),
    n_features_to_select=65,  # the number of features to retain
    tol=None,  # the minimum change in performance required to keep removing features (only used when n_features_to_select='auto')
    direction='backward',  # the direction of the selection procedure
    scoring='roc_auc',  # the metric to evaluate
    cv=2,  # the cross-validation fold
    n_jobs=4,  # for parallelization
)

sfs = sfs.fit(X_train, y_train)
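
As with the forward run, the retained features can then be inspected:

# the 65 features kept after backward elimination
sfs.get_feature_names_out()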

2.4.2.3. Exhaustive#

from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS

##################################

# in order to shorten the search time for this demonstration
# I will ask the algorithm to try all possible 1- and 2-feature
# combinations

# if you have access to a multicore or distributed computing
# system you can try broader searches

###################################

# within the EFS we indicate:

# 1) the algorithm we want to create, in this case RandomForests
# (note that I use few trees to speed things up)

# 2) the number of minimum features we want our model to have

# 3) the number of maximum features we want our model to have

# with 2 and 3 we regulate the number of possible feature combinations to
# be evaluated by the model.

# 4) the evaluation metric: in this case the roc_auc
# 5) the cross-validation

# this is going to take a while, do not despair

efs = EFS(RandomForestClassifier(n_estimators=5, n_jobs=4, random_state=0, max_depth=2),
          min_features=1,
          max_features=2,
          scoring='roc_auc',
          print_progress=True,
          cv=2)

# search features
efs = efs.fit(X_train, y_train)

selected_feat = X_train.columns[list(efs.best_idx_)]
selected_feat
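
A short usage sketch with the fitted selector: mlxtend exposes the best cross-validated score, and the winning combination can be used to reduce the dataset.

# best cross-validated roc_auc found during the exhaustive search
efs.best_score_

# keep only the winning feature combination
X_train_efs = X_train[selected_feat]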