2.4.3. Embedded methods#

  • Perform feature selection as part of the ML process, by considering the interaction between the model and the feature subset.

1. Advantages:

  • Less computationally expensive than wrapper methods, because the model is fitted only once and the feature importance is derived from that single fit.

  • More accurate than filter methods

  • Detect the interactions between features

2. Disadvantages:

  • Constrained to the limitations of the chosen algorithm.

3. Usage:

  • Typically, embedded methods are the approach of choice when feature selection should happen as part of model training.

4. Procedure:

  • Train a ML model

  • Derive the feature importances, then remove the non-important features (a minimal sketch follows the list of algorithms below)

5. Algorithms:

  • Lasso

  • Tree importance

  • Regression coefficients

2.4.3.1. Linear coefficients#

  • Rely on a simple model (LinearRegression, LogisticRegression, RandomForestClassifier, RandomForestRegressor) to evaluate the importance of the variables, then select the most important ones

  • The magnitude of the coefficients is directly influenced by the scale of the features. Therefore, to compare coefficients across features, it is important that all features are on a similar scale. This is why normalisation is important for variable importance and feature selection in linear models.
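To see this effect, here is a small, self-contained sketch (on synthetic data, so it does not depend on this section's X_train) that inflates the scale of one feature and compares the coefficients before and after standardisation.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
X[:, 0] = X[:, 0] * 100  # inflate the scale of the first feature

# without scaling, the first coefficient looks artificially small
coef_raw = LogisticRegression(C=1000, max_iter=1000).fit(X, y).coef_[0]

# after standardisation, the coefficients are comparable across features
coef_scaled = LogisticRegression(C=1000, max_iter=1000).fit(
    StandardScaler().fit_transform(X), y).coef_[0]

print(np.round(coef_raw, 3))
print(np.round(coef_scaled, 3))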

Linear Regression assumptions:

  • There is a linear relationship between the predictors Xs and the outcome Y

  • The residuals follow a normal distribution centered at 0

  • There is little or no multicollinearity among predictors (Xs should not be linearly related to one another)

  • Homoscedasticity (the variance of the residuals should be constant across the range of predictions); a quick sketch for eyeballing these assumptions follows the list
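A rough sketch (on synthetic regression data, so every name here is illustrative) of how some of these assumptions can be eyeballed before trusting the coefficients:

import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=1000, n_features=4, noise=10, random_state=0)
X = pd.DataFrame(X, columns=['x1', 'x2', 'x3', 'x4'])

lin = LinearRegression().fit(X, y)
residuals = y - lin.predict(X)

# residuals centred at zero?
print('residual mean:', np.round(residuals.mean(), 3))

# multicollinearity: inspect the pairwise correlations between predictors
print(X.corr().round(2))

# homoscedasticity: residual spread should be similar across the prediction range
preds = lin.predict(X)
low, high = preds < np.median(preds), preds >= np.median(preds)
print('residual std (low / high predictions):',
      np.round(residuals[low].std(), 2), np.round(residuals[high].std(), 2))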

Advantages

  • The ability to assess the importance of variables quite accurately

  • Useful to interpret the output of the model

Disadvantages

  • Features need to be transformed and scaled to meet the linear model assumptions. There are a lot of assumptions that need to be met in order to make a fair comparison of the features using only their regression coefficients.

  • When regularisation applies a penalty on the coefficients, different penalty values may yield different feature subsets, because regularisation masks the true relationship between the predictors X and the outcome Y (see the sketch below).
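A hedged sketch of that last point, assuming the X_train / y_train used elsewhere in this section: with L1-regularised logistic regression, changing the penalty strength C changes how many (and which) features survive the selection.

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)

# with an L1 penalty, SelectFromModel keeps the features with non-zero coefficients
for C in [0.05, 0.5, 5]:
    sel_ = SelectFromModel(
        LogisticRegression(C=C, penalty='l1', solver='liblinear', random_state=10))
    sel_.fit(X_train_scaled, y_train)
    print('C={}: {} features selected'.format(C, sel_.get_support().sum()))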

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel


# remember that here I want to evaluate the coefficient magnitude itself and not whether lasso shrinks coefficients to zero
# ideally, I want to avoid regularisation at all, so the coefficients are not affected (modified) by the penalty of the regularisation
# in order to do this in sklearn, I set the parameter C really high, which is basically like fitting a non-regularised logistic regression
# then I use the SelectFromModel object from sklearn to automatically select the features
# set C to 1000, to avoid regularisation

lr = LogisticRegression(C=1000, penalty='l2', max_iter=300, random_state=10)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('ft_slmodel', SelectFromModel(estimator = lr)),
        ])
pipe.fit(X_train, y_train)
Pipeline(steps=[('scaler', StandardScaler()),
                ('ft_slmodel',
                 SelectFromModel(estimator=LogisticRegression(C=1000,
                                                              max_iter=300,
                                                              random_state=10)))])

# this command lets me visualise the features that were kept.
# sklearn selects the features whose coefficients are greater than the mean of all the coefficients.
# it compares absolute values of the coefficients. More on this below.
selected_feat = X_train.columns[pipe.named_steps['ft_slmodel'].get_support()]
selected_feat
Index(['var_3', 'var_11', 'var_21', 'var_23', 'var_24', 'var_26', 'var_32',
       'var_33', 'var_39', 'var_40', 'var_48', 'var_50', 'var_52', 'var_55',
       'var_56', 'var_60', 'var_63', 'var_69', 'var_70', 'var_72', 'var_74',
       'var_77', 'var_80', 'var_83', 'var_84', 'var_88', 'var_89', 'var_91',
       'var_93', 'var_98', 'var_100', 'var_102', 'var_106'],
      dtype='object')
# and now, let's compare the number of selected features
# with the number of features whose coefficient is above the
# mean coefficient, to make sure we understand the output of
# SelectFromModel

print('total features: {}'.format((X_train.shape[1])))

print('selected features: {}'.format(len(selected_feat)))

print(
    'features with coefficients greater than the mean coefficient: {}'.format(
        np.sum(
            np.abs(pipe.named_steps['ft_slmodel'].estimator_.coef_) > np.abs(
                pipe.named_steps['ft_slmodel'].estimator_.coef_).mean())))
total features: 108
selected features: 33
features with coefficients greater than the mean coefficient: 33
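The cut-off itself is configurable: SelectFromModel takes a threshold parameter that accepts a float, 'mean', 'median', or a scaled string such as '1.25*mean'. A quick sketch with a stricter threshold, re-using the scaler and logistic regression fitted above:

sel_strict = SelectFromModel(estimator=lr, threshold='1.25*mean')
sel_strict.fit(pipe.named_steps['scaler'].transform(X_train), y_train)

print('features selected with threshold 1.25*mean: {}'.format(
    sel_strict.get_support().sum()))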

2.4.3.2. Lasso#

  • Regularisation consists in adding a penalty to the different parameters of the machine learning model, to reduce its freedom and avoid overfitting. In linear model regularisation, the penalty is applied to the coefficients that multiply each of the predictors. The Lasso (l1) regularisation has the property that it can shrink some of the coefficients to exactly zero. Therefore, those features can be removed from the model.

  • Keep in mind that increasing the penalisation will increase the number of features removed. Therefore, you will need to keep an eye and monitor the final model performance to ensure that you don’t set a penalty too high so it removes a lot of features, or too low, and thus useless features are retained.

  • Ridge regularisation does not shrink coefficients to zero; only Lasso does

# for logistic regression, sklearn exposes the `penalty` parameter directly;
# with an l1 penalty, SelectFromModel keeps the features whose coefficients are non-zero
sel_ = SelectFromModel(
    LogisticRegression(C=0.5, penalty='l1', solver='liblinear', random_state=10)
)

# the LinearRegression object from sklearn does not allow regularisation,
# so for a regularised linear regression we need to import Lasso specifically
from sklearn.linear_model import Lasso

# here, again, I will train a Lasso linear regression and select
# the non-zero features in one line.

# alpha is the penalisation, so I set it high
# to force the algorithm to shrink some coefficients

# scale the features first, because the penalty acts on the coefficient magnitudes
scaler = StandardScaler().fit(X_train)

sel_ = SelectFromModel(Lasso(alpha=100, random_state=10))
sel_.fit(scaler.transform(X_train), y_train)
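As a quick check of the last bullet point, the sketch below (with an illustrative alpha) counts how many coefficients each penalty shrinks exactly to zero; Ridge typically keeps all of them non-zero, while Lasso drops some.

import numpy as np
from sklearn.linear_model import Ridge

lasso_coefs = Lasso(alpha=100, random_state=10).fit(
    scaler.transform(X_train), y_train).coef_

ridge_coefs = Ridge(alpha=100).fit(
    scaler.transform(X_train), y_train).coef_

print('Lasso coefficients shrunk to zero: {}'.format(np.sum(lasso_coefs == 0)))
print('Ridge coefficients shrunk to zero: {}'.format(np.sum(ridge_coefs == 0)))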

2.4.3.3. Tree importance#

The more a feature decreases the impurity, the more important the feature is. In random forests, the impurity decrease elicited by each feature is averaged across trees to determine the final importance of the variable.

In general, features that are selected at the top of the trees are more important than features that are selected at the end nodes of the trees, as generally the top splits lead to bigger information gains.

Advantages: A very straightforward, fast and generally accurate way of selecting good features for machine learning, especially if you are going to build tree-based models.

Note

  • Random Forests and decision trees in general give preference to features with high cardinality

  • Correlated features will be given equal or similar importance, but overall reduced importance compared to the same tree built without correlated counterparts.

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# we fit Random Forests and select features in 2 lines of code

# first I specify the Random Forest instance and its parameters

# Then I use the selectFromModel class from sklearn
# to automatically select the features

# SelectFromModel will select those features whose importance
# is greater than the mean importance of all the features
# by default, but you can alter this threshold if you want to

sel_ = SelectFromModel(RandomForestClassifier(n_estimators=10, random_state=10))

sel_.fit(X_train, y_train)
SelectFromModel(estimator=RandomForestClassifier(n_estimators=10,
                                                 random_state=10))
# and now, let's compare the number of selected features
# with the number of features whose importance is above the
# mean of all features, to make sure we understand the output of
# SelectFromModel

selected_feat = X_train.columns[(sel_.get_support())]
print('total features: {}'.format((X_train.shape[1])))

print('selected features: {}'.format(len(selected_feat)))

print(
    'features with importance greater than the mean importance of all features: {}'.format(
        np.sum(sel_.estimator_.feature_importances_ >
               sel_.estimator_.feature_importances_.mean())))
total features: 108
selected features: 27
features with importance greater than the mean importance of all features: 27
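To see the ranking behind this selection, we can sort the impurity-based importances of the forest fitted inside the selector (a quick sketch using the estimator_ attribute of the fitted SelectFromModel):

import pandas as pd

importances = pd.Series(sel_.estimator_.feature_importances_,
                        index=X_train.columns)
importances.sort_values(ascending=False).head(10)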

Selecting features using tree-derived feature importance is a very straightforward, fast and generally accurate way of selecting good features for machine learning, especially if you are going to build tree-based models.

However, as I said, correlated features will show similar importance in a tree, but lower than the importance they would have if the tree were built without their correlated counterparts.

In situations like this, it is better to select features recursively, rather than altogether like we are doing in this lecture.

2.4.3.3.1. Recursively tree importance#

Random Forests assign equal or similar importance to features that are highly correlated. In addition, the importance assigned to each correlated feature is lower than the importance it would receive if the tree were built without its correlated counterparts.

Therefore, instead of eliminating features based on importance by brute force, as we did in the previous notebook, we could get a better selection by removing one feature at a time and recalculating the importance on each round. This procedure is called Recursive Feature Elimination (RFE).

RFE is a hybrid between embedded and wrapper methods: it is based on computation derived when fitting the model, but it also requires fitting several models.

The cycle is as follows:

  • Build Random Forests using all features

  • Remove least important feature

  • Build Random Forests and recalculate importance

  • Repeat until a criterion is met

In this situation, when a feature that is highly correlated with another one is removed, the importance of the remaining feature increases. This may lead to a better feature space selection. On the downside, building several Random Forests is quite time- and compute-intensive, in particular if the dataset contains a high number of features.

# we do model training and feature selection in 2 lines of code

# first I specify the Random Forest and its parameters

# then RFE from sklearn removes features recursively:
# at each iteration it drops the least important feature,
# builds another random forest, and repeats until a criterion is met

# in sklearn, the stopping criterion is an arbitrary number of features
# to select, which we need to decide beforehand
# not the best solution, but a solution

from sklearn.feature_selection import RFE

sel_ = RFE(RandomForestClassifier(n_estimators=10, random_state=10),
           n_features_to_select=27)
sel_.fit(X_train, y_train)
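We can then inspect which features RFE kept and, as a rough check of the correlation argument, compare the selection with the tree-based SelectFromModel selection stored in selected_feat earlier (this assumes both selectors were fitted as shown above):

selected_rfe = X_train.columns[sel_.get_support()]

print('features selected by RFE: {}'.format(len(selected_rfe)))
print('overlap with SelectFromModel: {}'.format(
    len(set(selected_rfe) & set(selected_feat))))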

In my opinion, the RFE from sklearn does not bring a massive advantage with respect to the SelectFromModel method; personally, I tend to use RFE to select features when I am not sure about the correlation between the features.