2.3.3. Ensemble approaches#

Combine learners through bagging and boosting to make more robust predictions:

  • Bagging/Boosting + Data-level (under- and over-sampling)

  • Bagging/Boosting + Cost-sensitive (higher misclassification costs)

2.3.3.1. Bagging ensemble#

Build models of the same type but train each one on a different subsample of the raw sample; the subsamples are generated by bootstrapping with replacement. The models are independent of each other and can be trained in parallel, and the final prediction is the average of the decision rules or the majority class. The goal is to reduce variance, so bagging suits models that already have low bias but suffer from high variance.

  • When combined with imbalanced resampling, each generated subsample is a balanced dataset (in contrast to the imbalanced raw dataset):

  1. Balanced Random Forest = Random Undersampling + Bagging Decision Trees

  • Randomly undersample the majority class and keep all of the minority class

  2. OverBagging = Random Oversampling (\(R(x) = 1\)) + Bagging

  • Bootstrap with replacement both the majority and the minority class to reach a final ratio \(R(x) = 1\)

  3. SMOTEBagging = SMOTE + Bagging

  • Majority class bootstrapped with replacement

  • Minority class oversampled with SMOTE in each bag

from imblearn.ensemble import BalancedBaggingClassifier, BalancedRandomForestClassifier  # with balanced sampling
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, AdaBoostClassifier  # without balanced sampling
from sklearn.linear_model import LogisticRegression  # base estimator for the bagging examples below


# balanced random forest (bagging), with balanced resampling
BalancedRandomForestClassifier(
        n_estimators=20,
        criterion='gini',
        max_depth=3,
        sampling_strategy='auto',
        n_jobs=4,
        random_state=2909,
    )

# bagging of Logistic regression, with balanced resampling
BalancedBaggingClassifier(
        base_estimator=LogisticRegression(random_state=2909),
        n_estimators=20,
        max_samples=1.0,  # The number of samples to draw from X to train each base estimator
        max_features=1.0,  # The number of features to draw from X to train each base estimator
        bootstrap=True,
        bootstrap_features=False,
        sampling_strategy='auto',
        n_jobs=4,
        random_state=2909,
    )

# bagging of Logistic regression, no balanced resampling
BaggingClassifier(
        base_estimator=LogisticRegression(random_state=2909),
        n_estimators=20,
        n_jobs=4,
        random_state=2909,
    )
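
A minimal usage sketch on synthetic imbalanced data (the dataset and the balanced-accuracy scoring are illustrative choices, not prescribed above):

from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# synthetic data with a 10% minority class
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=2909)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=2909)

clf = BalancedBaggingClassifier(
    base_estimator=LogisticRegression(random_state=2909),
    n_estimators=20, sampling_strategy='auto', n_jobs=4, random_state=2909)
clf.fit(X_train, y_train)  # each bag is resampled to be balanced
print(balanced_accuracy_score(y_test, clf.predict(X_test)))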

2.3.3.2. Boosting ensemble#

Boosting trains a sequence of models on the full dataset; observations misclassified by each classifier are given a higher weight when training the next classifier (a minimal sketch follows the list):

  • 1st classifier: all observations are given the same weight

  • 2nd classifier: observations misclassified by the previous classifier get a higher weight, and the new weights are used to train the next classifier

  • 3rd classifier: …
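
A minimal sketch of this reweighting loop, assuming decision stumps as the weak learners (the function name and the SAMME-style update are illustrative, not from any library API used here):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_sketch(X, y, n_rounds=3):
    """Illustrative AdaBoost-style loop (binary labels)."""
    w = np.full(len(y), 1 / len(y))            # 1st classifier: equal weights
    learners, alphas = [], []
    for _ in range(n_rounds):
        clf = DecisionTreeClassifier(max_depth=1)
        clf.fit(X, y, sample_weight=w)          # train with current weights
        miss = clf.predict(X) != y
        err = np.sum(w[miss]) / np.sum(w)       # weighted error
        alpha = np.log((1 - err) / (err + 1e-10))  # learner's vote weight
        w[miss] *= np.exp(alpha)                # up-weight misclassified obs
        w /= w.sum()                            # renormalise for next round
        learners.append(clf)
        alphas.append(alpha)
    return learners, alphas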


1. RUSBoost = Random Undersampling + Boosting

Boosting trained on a random undersample at each iteration (instead of all the data, as in vanilla boosting); the error computation and weight adjustment are still performed on all observations, as sketched below.
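
A sketch of one RUSBoost iteration under that description (RandomUnderSampler is imblearn's; the weight-update details follow the AdaBoost sketch above):

import numpy as np
from imblearn.under_sampling import RandomUnderSampler
from sklearn.tree import DecisionTreeClassifier

def rusboost_round(X, y, w):
    """One illustrative RUSBoost round: train on an undersampled subset,
    but compute the error and weight update on ALL observations."""
    X_rus, y_rus = RandomUnderSampler(sampling_strategy='auto').fit_resample(X, y)
    clf = DecisionTreeClassifier(max_depth=1).fit(X_rus, y_rus)
    miss = clf.predict(X) != y                  # evaluated on the full data
    err = np.sum(w[miss]) / np.sum(w)
    alpha = np.log((1 - err) / (err + 1e-10))
    w = w * np.where(miss, np.exp(alpha), 1.0)  # up-weight the misses
    return clf, alpha, w / w.sum()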


2. SMOTEBoost = SMOTE + Boosting

Boosting trained on a SMOTE oversample at each iteration (instead of all the data, as in vanilla boosting); the error computation and weight adjustment are still performed on all observations (see the resampling sketch after the list):

  • adds more instances of the minority class

  • improves classifier accuracy

  • requires quite a bit of computation time to perform SMOTE at each iteration
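
The per-iteration resampling step, sketched with imblearn's SMOTE (only the resampler changes relative to the RUSBoost sketch above):

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=2909)

# inside each boosting round: oversample the minority class with SMOTE ...
X_res, y_res = SMOTE(sampling_strategy='auto', random_state=2909).fit_resample(X, y)
# ... fit the weak learner on (X_res, y_res); the error and weight update
# are still computed on the full (X, y), as in the RUSBoost sketch above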


3. RAMOBoost = ADASYN (adaptation) + Boosting

Boosting trained on an ADASYN-style adaptive oversample at each iteration (instead of all the data, as in vanilla boosting); the error computation and weight adjustment are still performed on all observations:

  • adds more instances of the minority class

  • improves classifier accuracy

  • requires quite a bit of computation time to perform the adaptive oversampling at each iteration

from imblearn.ensemble import RUSBoostClassifier

# boosting + undersampling
RUSBoostClassifier(
        base_estimator=None,
        n_estimators=20,
        learning_rate=1.0,
        sampling_strategy='auto',
        random_state=2909,
    )

2.3.3.3. Hybrid ensemble#

(Bagging + Boosting + Balanced resampling)

Here:

  • The subsamples are created by bagging with balanced resampling

  • Each submodel is itself a sequence of models (as in boosting)


1. EasyEnsemble = Random Undersampling + Bagging of AdaBoost (Boosting)

from imblearn.ensemble import EasyEnsembleClassifier

# bagging + boosting + under-sampling
EasyEnsembleClassifier(
        n_estimators=20,
        sampling_strategy='auto',
        n_jobs=4,
        random_state=2909,
    )
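
A minimal evaluation sketch on synthetic imbalanced data (the dataset and the balanced-accuracy scoring are illustrative choices):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# synthetic data with a 5% minority class
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=2909)
clf = EasyEnsembleClassifier(n_estimators=20, n_jobs=4, random_state=2909)
print(cross_val_score(clf, X, y, scoring='balanced_accuracy', cv=5).mean())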