Regression metrics

2.5.3. Regression metrics

Note: Logistic regression is at heart a regression problem, because the model outputs a predicted probability (a continuous value) that is then compared with a cutpoint to assign the binary Positive/Negative class. The target itself is binary (0/1), however, so in practice the cutpoint is tuned and the model is assessed with several metrics in combination.
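To make the role of the cutpoint concrete, here is a minimal sketch in plain Python. The probabilities, labels and cutpoint values are hypothetical, chosen only for illustration, and precision and recall stand in for "several metrics"; they are not prescribed by the note above.

```python
# Hypothetical predicted probabilities and 0/1 targets (illustration only).
probs = [0.10, 0.40, 0.55, 0.90]
labels = [0, 1, 0, 1]


def classify(probs, cutpoint):
    """Turn continuous probabilities into 0/1 predictions at a given cutpoint."""
    return [1 if p >= cutpoint else 0 for p in probs]


def precision_recall(y_true, y_hat):
    """Compute precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, h in zip(y_true, y_hat) if t == 1 and h == 1)
    fp = sum(1 for t, h in zip(y_true, y_hat) if t == 0 and h == 1)
    fn = sum(1 for t, h in zip(y_true, y_hat) if t == 1 and h == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# The same probabilities give different classification metrics per cutpoint.
results = {}
for cut in (0.3, 0.5, 0.7):
    results[cut] = precision_recall(labels, classify(probs, cut))
    print(cut, results[cut])
```

Moving the cutpoint trades recall for precision on this toy data, which is why a single metric at a single cutpoint rarely tells the whole story.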

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

2.5.3.1. Variance metrics

2.5.3.1.1. R square

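R-square / adjusted R-square: the proportion of the variation in the target that the model explains, i.e. what percentage of the variation the regression accounts for. It lies between 0 and 1 for a least-squares fit with an intercept evaluated on its own training data; r2_score can return a negative value when the predictions are worse than always predicting the mean. Adjusted R-square corrects R2 for the number of variables in the model: adding variables never lowers the training R2, even when they are collinear or uninformative and only push the model toward overfitting, so R2 has to be penalized. $$R^2 = 1-\frac{\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2}}{\sum_{i=1}^{n}(y_{i}-\bar{y})^{2}}$$

\[R_{adj}^2 = 1-(1-R^2)\frac{n-1}{n-p-1}\]

Here $\bar{y}$ is the mean of the observed values, n is the number of observations and p is the number of predictors (not counting the intercept).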

from sklearn.metrics import r2_score

r2_score(y_true, y_pred)

0.9486081370449679
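scikit-learn does not ship an adjusted R-square function, so it has to be derived from R2. The sketch below recomputes R2 by hand from its definition and then applies the adjustment. The helper names `r2` and `adjusted_r2` are hypothetical, and `p=1` is an assumed number of predictors, since the toy data does not come from an actual fitted model.

```python
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]


def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, computed directly from the definition."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r2(r2_value, n, p):
    """Adjusted R^2 for n observations and p predictors (intercept excluded)."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1 for adjusted R^2 to be defined")
    return 1 - (1 - r2_value) * (n - 1) / (n - p - 1)


r2_value = r2(y_true, y_pred)
# p=1 is an assumption made only for this illustration.
adj_value = adjusted_r2(r2_value, n=len(y_true), p=1)
print("R2", r2_value)
print("adjusted R2", adj_value)
```

With only four observations the penalty is visible even for a single predictor: the adjusted value is lower than the plain R2.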

2.5.3.1.2. Explained variance score

The Explained Variance score is similar to the R^2 score, with the notable difference that it does not account for systematic offsets in the prediction. Most often the R^2 score should be preferred.

from sklearn.metrics import explained_variance_score

explained_variance_score(y_true, y_pred)

0.9571734475374732
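The "systematic offset" point can be checked with a small experiment. The sketch below uses plain Python and assumes the usual definition of explained variance, 1 - Var(y - y_hat) / Var(y). The shifted predictions are hypothetical: every prediction is off by the same constant +1.

```python
from statistics import pvariance

y_true = [3, -0.5, 2, 7]
# Hypothetical predictions that are biased by a constant offset of +1.
y_shifted = [y + 1.0 for y in y_true]


def explained_variance(y_true, y_pred):
    """1 - Var(residuals) / Var(y_true); a constant bias cancels in the variance."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1 - pvariance(residuals) / pvariance(y_true)


def r2(y_true, y_pred):
    """1 - SS_res / SS_tot; a constant bias inflates SS_res and lowers the score."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


ev_shifted = explained_variance(y_true, y_shifted)
r2_shifted = r2(y_true, y_shifted)
print("explained variance", ev_shifted)
print("R2", r2_shifted)
```

The biased predictions get a perfect explained variance score while R2 drops, which is the reason R2 is usually the safer choice.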

2.5.3.2. Error metrics

In a forecasting problem we want the error between the predicted and the actual values to be as small as possible. The metrics usually chosen are:

MSE: the mean of the squared errors between predicted and actual values. $$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i-\hat{y}_i)^2$$
RMSE: the square root of MSE; it represents the typical deviation between predicted and actual values, in the same units as the target. $$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i-\hat{y}_i)^2}$$
MAE: the mean of the absolute errors between predicted and actual values. $$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i-\hat{y}_i|$$
MAPE: the mean of the absolute errors expressed relative to the actual values. It is undefined when some $y_i = 0$, and mean_absolute_percentage_error returns it as a fraction rather than multiplied by 100. $$\text{MAPE} = \frac{1}{n}\sum_{i=1}^{n} \left|\frac{y_i-\hat{y}_i}{y_i}\right|$$

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error

# MSE
print('MSE', mean_squared_error(y_true, y_pred))

# RMSE: take the square root explicitly, because the squared=False argument
# is deprecated since scikit-learn 1.4
print('RMSE', np.sqrt(mean_squared_error(y_true, y_pred)))

# MAE
print('MAE', mean_absolute_error(y_true, y_pred))

# MAPE
print('MAPE', mean_absolute_percentage_error(y_true, y_pred))

MSE 0.375
RMSE 0.6123724356957945
MAE 0.5
MAPE 0.3273809523809524
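As a cross-check of the formulas, the same four numbers can be reproduced by hand. This is a minimal sketch in plain Python that follows the definitions term by term; the variable names are chosen for illustration.

```python
import math

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
n = len(y_true)

# Per-observation errors y_i - y_hat_i.
errors = [t - p for t, p in zip(y_true, y_pred)]

mse = sum(e ** 2 for e in errors) / n
rmse = math.sqrt(mse)
mae = sum(abs(e) for e in errors) / n
# MAPE divides by y_i, so it only works when no actual value is zero.
mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / n

print("MSE", mse)
print("RMSE", rmse)
print("MAE", mae)
print("MAPE", mape)
```

The single observation with a small actual value (-0.5) contributes 1.0 of the 1.31 total in the MAPE sum, which illustrates how sensitive MAPE is to targets near zero.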

2.5.3.3. Model selection

On the training data, a model's R2 and RMSE never get worse when more variables are added, but that does not mean the model itself is better. We need a metric that assesses the effect of adding a variable and tells us whether using every available variable is really the best choice.

When there are too many candidate variables to evaluate every possible model, a common approach is stepwise regression. In its backward form, start from the full model and drop, one at a time, the variables that do not contribute meaningfully.
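The backward procedure can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions, not a reference implementation: the drop rule (remove the variable whose removal lowers AIC the most, stop when no removal helps), the helper names and the synthetic data are all assumptions, and AIC is computed as $2P+n\log(\mathrm{RSS}/n)$ with P counting the intercept.

```python
import math

import numpy as np


def aic(rss, n, n_params):
    """AIC = 2P + n*log(RSS/n) for a least-squares fit."""
    return 2 * n_params + n * math.log(rss / n)


def fit_rss(X, y):
    """Ordinary least squares with an intercept; returns the residual sum of squares."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)


def backward_elimination(X, y):
    """Start from the full model and drop variables while AIC keeps decreasing."""
    n = len(y)
    kept = list(range(X.shape[1]))
    best = aic(fit_rss(X[:, kept], y), n, len(kept) + 1)
    while kept:
        # Score every model obtained by removing exactly one remaining variable.
        trials = []
        for j in kept:
            cols = [k for k in kept if k != j]
            trials.append((aic(fit_rss(X[:, cols], y), n, len(cols) + 1), cols))
        score, cols = min(trials, key=lambda t: t[0])
        if score >= best:
            break  # no single removal improves AIC
        best, kept = score, cols
    return kept, best


# Synthetic data (assumption): only columns 0 and 1 influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

full_score = aic(fit_rss(X, y), len(y), X.shape[1] + 1)
kept, score = backward_elimination(X, y)
print("kept columns:", kept, "AIC:", score)
```

The two informative columns always survive, because removing either one raises the RSS sharply. Whether a pure-noise column is dropped depends on the sample, since AIC's penalty is fairly mild.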

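Criteria that can be used to compare candidate models:

adjusted R-square: R2 penalized for the number of variables; higher is better.
$\mathrm{AIC}=2P+n\log(\mathrm{RSS}/n)$, where P is the number of fitted parameters, n the number of observations and RSS the residual sum of squares; lower is better.
- AICc adds a small-sample correction to AIC and is suitable for small datasets.
- BIC applies a stronger penalty per parameter than AIC ($P\log n$ instead of $2P$).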
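The three information criteria can be compared side by side in a few lines. The sketch below uses the AIC formula from the text; the AICc and BIC expressions are the standard least-squares forms, which the text does not spell out, so treat them as assumptions. The two candidate models and their RSS values are hypothetical.

```python
import math


def aic(rss, n, p):
    """AIC = 2P + n*log(RSS/n)."""
    return 2 * p + n * math.log(rss / n)


def aicc(rss, n, p):
    """AIC plus the small-sample correction 2P(P+1)/(n-P-1) (standard form, assumed)."""
    return aic(rss, n, p) + 2 * p * (p + 1) / (n - p - 1)


def bic(rss, n, p):
    """BIC = P*log(n) + n*log(RSS/n) (standard least-squares form, assumed)."""
    return p * math.log(n) + n * math.log(rss / n)


n = 50
# Hypothetical candidates: the larger model fits slightly better (lower RSS).
small = {"p": 3, "rss": 120.0}
large = {"p": 6, "rss": 112.0}

for name, m in (("small", small), ("large", large)):
    print(name,
          "AIC", round(aic(m["rss"], n, m["p"]), 3),
          "AICc", round(aicc(m["rss"], n, m["p"]), 3),
          "BIC", round(bic(m["rss"], n, m["p"]), 3))
```

On these numbers all three criteria prefer the smaller model, and the gap between the two models is widest under BIC, reflecting its heavier penalty on the extra parameters.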