## 2.2.3. Variable Overview
1. Types of variables
   - Numeric
     - Discrete
     - Continuous
   - Categorical
     - Ordinal
     - Nominal
   - Datetime
     - Date only
     - Time only
     - Date and time
   - Mixed type
2. Variable Characteristics

2.1 Missing values

- Causes
  - Forgotten, lost, or not stored
  - The value does not exist
  - Unknown / unidentified values
- Missing data mechanisms
  - Missing completely at random (MCAR)
  - Missing at random (MAR)
  - Missing not at random (MNAR)
- Imputation techniques (a sketch follows this list)
  - Numerical variables
    - Mean/median imputation
    - Arbitrary value imputation
    - End-of-tail imputation
  - Categorical variables
    - Frequent-category imputation
    - Adding a new "missing" category
  - Both
    - Complete case analysis
    - Adding a new "missing" category
    - Random sample imputation
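A minimal sketch of a few of these imputers using plain pandas; the DataFrame `df` and its columns `age` and `city` are hypothetical:

```python
import numpy as np
import pandas as pd

# hypothetical data with missing values in both variable types
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0, np.nan],
    "city": ["HN", "HCM", None, "HN", "HN"],
})

# mean/median imputation for a numerical variable
df["age_median"] = df["age"].fillna(df["age"].median())

# end-of-tail imputation: place missing values beyond the distribution's tail
df["age_tail"] = df["age"].fillna(df["age"].mean() + 3 * df["age"].std())

# frequent-category imputation for a categorical variable
df["city_freq"] = df["city"].fillna(df["city"].mode()[0])

# adding a new "missing" category instead
df["city_missing"] = df["city"].fillna("Missing")
```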
2.2 Cardinality of categorical variables

High cardinality causes several problems (a quick check follows this list):

- The model may not cover all labels and can be dominated by only a few of them (especially in decision-tree-based models)
- It adds noise
- The label distributions of the train and test sets tend to differ, so the model fails to capture labels it never saw, does not recognize new patterns, and overfits
- The variable is harder to transform during preprocessing
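A quick way to surface high-cardinality columns; the helper name `report_cardinality` is ours:

```python
import pandas as pd

def report_cardinality(df: pd.DataFrame) -> pd.Series:
    """Number of unique labels per categorical column, highest first."""
    cat_cols = df.select_dtypes(include=["object", "category"]).columns
    return df[cat_cols].nunique().sort_values(ascending=False)
```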
2.3 Rare labels in categorical variables

- Rare labels can lead to overfitting in tree-based models
- They may add noise, which again encourages overfitting
- They make the train and test distributions differ, since a rare label may appear in only one of the two sets
- A common fix is to group all rare labels into a single new label (sketched below)
- Alternatively, remove rare labels when they are noisy or non-representative
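A minimal sketch of the grouping approach; the function name and the 5% frequency threshold are our choices:

```python
import pandas as pd

def group_rare_labels(s: pd.Series, tol: float = 0.05, new_label: str = "Rare") -> pd.Series:
    """Replace labels whose relative frequency is below `tol` with `new_label`."""
    freq = s.value_counts(normalize=True)
    frequent = freq[freq >= tol].index
    return s.where(s.isin(frequent), new_label)
```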
2.4 Outliers

Identify them with extreme value analysis (a sketch follows this list):

- For a normal distribution, outliers lie outside mean ± 3 × std
- For a skewed distribution, outliers fall outside the upper/lower boundaries:
  - Upper boundary = 75th quantile + 1.5 × IQR
  - Lower boundary = 25th quantile − 1.5 × IQR
- For extreme cases:
  - Upper boundary = 75th quantile + 3 × IQR
  - Lower boundary = 25th quantile − 3 × IQR
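A minimal sketch of the IQR proximity rule (the helper name is ours):

```python
import pandas as pd

def iqr_boundaries(s: pd.Series, factor: float = 1.5):
    """Return (lower, upper); factor=1.5 for outliers, factor=3 for extreme outliers."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr

# values outside [lower, upper] are flagged as outliers
lower, upper = iqr_boundaries(pd.Series([1, 2, 2, 3, 3, 4, 50]))
```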
2.5 Linear model assumptions

Assumptions:

- Linearity: there is a linear relationship between the predictors and the target
- No perfect multicollinearity: there is no perfect (or very high) linear relationship between two or more of the predictors
- Normally distributed errors: the residuals are random and normally distributed with a mean of 0
- Homoscedasticity: at each level of the predictor variables, the variance of the errors is constant

Linear models:

- Linear regression
- Logistic regression
- Linear discriminant analysis (LDA)
- …

Evaluating model fit (sketched below):

- Check that the residuals follow a zero-mean normal distribution, using a Q-Q plot or a KS test
- Check that the relationship between X and y is linear, and inspect linear relationships among the predictors themselves
- Use the variance inflation factor (VIF) to check for multicollinearity
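A minimal sketch of these checks, assuming `residuals` and a predictor matrix `X` come from a model you have already fit:

```python
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
from statsmodels.stats.outliers_influence import variance_inflation_factor

def check_residuals(residuals):
    """Q-Q plot plus a KS test of the standardized residuals against N(0, 1)."""
    stats.probplot(residuals, dist="norm", plot=plt)
    plt.show()
    z = (residuals - np.mean(residuals)) / np.std(residuals)
    print(stats.kstest(z, "norm"))

def vif_table(X):
    """One VIF per predictor column; values above ~5-10 usually signal multicollinearity."""
    X = np.asarray(X, dtype=float)
    return [variance_inflation_factor(X, i) for i in range(X.shape[1])]
```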
2.6 Variable magnitude

Why does it matter?

- The regression coefficients are directly affected by the scale of the variables
- Variables with a larger magnitude dominate those with a smaller magnitude
- Gradient descent and SVMs converge faster with scaled variables
- Euclidean distances are sensitive to feature magnitude

ML models (a scaling sketch follows this list):

- Affected by magnitude:
  - Linear/logistic regression
  - Neural networks
  - SVMs
  - kNN
  - K-means
  - LDA
  - PCA
- Not affected:
  - Tree-based models:
    - Classification trees
    - Regression trees
    - Random forests
    - Gradient boosted trees
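A minimal scaling sketch with scikit-learn on toy data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# toy data: two features on very different scales
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])

# standardization: zero mean and unit variance per column
X_std = StandardScaler().fit_transform(X)

# min-max scaling: rescale each column to [0, 1]
X_mm = MinMaxScaler().fit_transform(X)
```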
### 2.2.3.1. Datatypes detection
```python
import numpy as np
import pandas as pd

def get_list_datatypes(df, discrete_nunique_max=10, mixed_detection=None):
    """Split the columns of df into discrete, continuous, categorical and mixed-type variables."""
    mixed_detection = mixed_detection or []  # avoid a mutable default argument
    # numerical: discrete vs continuous, split on the number of unique values
    num = df.select_dtypes(include=[np.number, bool]).columns.tolist()
    discrete = [var for var in num if df[var].nunique() <= discrete_nunique_max]
    continuous = [var for var in num if var not in discrete]
    # categorical: everything that is neither numerical nor flagged as mixed-type
    categorical = [var for var in df.columns if var not in (mixed_detection + num)]
    res = (discrete, continuous, categorical, mixed_detection)
    for t, n in zip(res, ['discrete', 'continuous', 'categorical', 'mixed']):
        print(f'{len(t)} {n} variables: ', ", ".join(t))
    return discrete, continuous, categorical, mixed_detection
```
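A hypothetical call on a tiny DataFrame (with so few rows, a low `discrete_nunique_max` keeps `price` continuous):

```python
df = pd.DataFrame({
    "price": [1.5, 2.0, 3.5],     # continuous
    "rooms": [2, 3, 2],           # discrete
    "city": ["HN", "HCM", "HN"],  # categorical
})
discrete, continuous, categorical, mixed = get_list_datatypes(df, discrete_nunique_max=2)
```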
### 2.2.3.2. Correlation
```python
import matplotlib.pyplot as plt
import seaborn as sns

def PairGridCorr(X, y=None, corr='pearson'):
    """Pair grid with scatter plots below the diagonal, histograms on it,
    and correlation coefficients (dot size/color ~ strength) above it."""
    def corrdot(*args, **kwargs):
        # correlation between the two variables of the current grid cell
        corr_r = args[0].corr(args[1], corr)
        corr_text = f"{corr_r:2.2f}".replace("0.", ".")
        ax = plt.gca()
        ax.set_axis_off()
        # dot size and color encode the strength and sign of the correlation
        marker_size = abs(corr_r) * 10000
        ax.scatter([.5], [.5], marker_size, [corr_r], alpha=0.6, cmap="coolwarm",
                   vmin=-1, vmax=1, transform=ax.transAxes)
        font_size = abs(corr_r) * 50 + 5
        ax.annotate(corr_text, [.5, .5], xycoords="axes fraction",
                    ha='center', va='center', fontsize=font_size)

    def annotate_colname(x, **kws):
        # write the variable name on top of its diagonal histogram
        ax = plt.gca()
        ax.annotate(x.name, xy=(0.5, 0.9), xycoords=ax.transAxes, fontweight='bold')

    sns.set(style='white', font_scale=1.3)
    g = sns.PairGrid(X, aspect=1.4, diag_sharey=False)
    g.map_lower(sns.scatterplot, hue=y)
    g.map_diag(sns.histplot)
    g.map_diag(annotate_colname)
    g.map_upper(corrdot)
    return g
```
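A hypothetical call on seaborn's bundled iris data (downloaded by `load_dataset`), dropping the non-numeric column first:

```python
import seaborn as sns

iris = sns.load_dataset("iris")
g = PairGridCorr(iris.drop(columns="species"), corr="spearman")
```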