2.4.2. Wrapper methods#
Wrapper methods use an ML model to score feature subsets: a model is trained on each candidate subset, and its performance becomes the subset's score.
1. Advantages:
Often finds the best-performing feature subset for the chosen ML model
Can detect interactions between features
2. Disadvantages:
More computationally expensive than other methods
May not give the best feature combination for a different ML model
3. Usage:
Appropriate for selecting a feature subset for a single model that will then be built on that subset
4. Procedure (a from-scratch sketch follows this list):
Search for a subset of features
Build an ML model on that subset and evaluate it
Repeat with another subset
5. How to search?
Forward selection: start with no features; at each iteration, add the feature from the remaining pool whose inclusion yields the best-performing model, until a predefined criterion is met.
Backward selection: start with all features in the pool; at each iteration, remove the least significant feature (the one whose removal yields the best-performing model), until a predefined criterion is met.
Exhaustive search: evaluate all possible k-feature combinations, with k from 1 to N (the total number of features), then select the combination that performs best (often impractical).
6. How to stop the search?
The performance improvement falls below (or the degradation exceeds) a predefined threshold
The predefined maximum number of features is reached
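The whole wrapper loop is short enough to sketch from scratch. Below is a minimal forward-selection sketch, not part of any library: the function name forward_select, the min_gain threshold, and the roc_auc/cv=2 settings are illustrative assumptions that mirror the search and stopping criteria above.
from sklearn.base import clone
from sklearn.model_selection import cross_val_score

def forward_select(estimator, X, y, min_gain=1e-4, max_features=None):
    """Greedy forward selection: at each step, add the feature that most
    improves the cross-validated score; stop when the gain drops below
    min_gain or max_features is reached."""
    selected, best_score = [], 0.0
    remaining = list(X.columns)
    max_features = max_features or len(remaining)
    while remaining and len(selected) < max_features:
        # score every candidate subset obtained by adding one more feature
        scores = {
            f: cross_val_score(clone(estimator), X[selected + [f]], y,
                               scoring='roc_auc', cv=2).mean()
            for f in remaining
        }
        best_f = max(scores, key=scores.get)
        if scores[best_f] - best_score < min_gain:
            break  # stopping criterion: performance no longer improves
        selected.append(best_f)
        remaining.remove(best_f)
        best_score = scores[best_f]
    return selected, best_score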
2.4.2.1. Step-forward#
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector as SFS
# within the SFS we indicate:
# 1) the algorithm we want to create, in this case RandomForests (note that I use few trees to speed things up)
# 2) the stopping criteria: see sklearn documentation for more details
# 3) whether to perform step forward or step backward
# 4) the evaluation metric: in this case the roc_auc
# 5) the cross-validation
# this is going to take a while, do not despair
sfs = SFS(
estimator=RandomForestClassifier(n_estimators=10, n_jobs=4, random_state=0),
n_features_to_select=10, # the number of features to retain
tol=None, # if set (with n_features_to_select='auto'), stop when the score change between iterations is smaller than tol
direction='forward', # the direction of the selection procedure
scoring='roc_auc', # the metric to evaluate
cv=2, # the cross-validation fold
n_jobs=4, # for parallelization
)
sfs = sfs.fit(X_train, y_train)
sfs.get_feature_names_out()
array(['var_16', 'var_17', 'var_21', 'var_44', 'var_45', 'var_48',
'var_55', 'var_91', 'var_103', 'var_108'], dtype=object)
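Once fitted, the selector can reduce a dataset to the chosen columns via its transform method. A minimal sketch; X_test is assumed to exist alongside X_train:
# keep only the selected features
X_train_sel = sfs.transform(X_train)
X_test_sel = sfs.transform(X_test)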
2.4.2.2. Step-backward#
sfs = SFS(
estimator=RandomForestClassifier(
n_estimators=10, n_jobs=4, random_state=0),
n_features_to_select=65, # the number of features to retain
tol=None, # if set (with n_features_to_select='auto'), stop when the score change between iterations is smaller than tol
direction='backward', # the direction of the selection procedure
scoring='roc_auc', # the metric to evaluate
cv=2, # the cross-validation fold
n_jobs=4, # for parallelization
)
sfs = sfs.fit(X_train, y_train)
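As in the forward case, the names of the retained features can be inspected on the fitted selector:
# the 65 features that survived backward elimination
sfs.get_feature_names_out()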
2.4.2.3. Exhaustive#
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
##################################
# to shorten the search time for the demonstration
# I will ask the algorithm to try all possible 1- and 2-
# feature combinations
# if you have access to a multicore or distributed computing
# system you can try larger feature combinations
###################################
# within the EFS we indicate:
# 1) the algorithm we want to create, in this case RandomForests
# (note that I use few trees to speed things up)
# 2) the number of minimum features we want our model to have
# 3) the number of maximum features we want our model to have
# with 2 and 3 we regulate the number of possible feature combinations to
# be evaluated by the model.
# 4) the evaluation metric: in this case the roc_auc
# 5) the cross-validation
# this is going to take a while, do not despair
efs = EFS(RandomForestClassifier(n_estimators=5, n_jobs=4, random_state=0, max_depth=2),
min_features=1,
max_features=2,
scoring='roc_auc',
print_progress=True,
cv=2)
# search features
efs = efs.fit(X_train, y_train)
selected_feat = X_train.columns[list(efs.best_idx_)]
selected_feat
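Besides best_idx_, the fitted mlxtend selector also exposes the best cross-validated score and, when fitted on a DataFrame, the winning feature names directly:
efs.best_score_ # best cross-validated roc_auc found
efs.best_feature_names_ # names of the best feature combination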