2.2.3. Variable Overview#

1. Types of variables

  • Numeric

    • discrete

    • continuous

  • Categorical

    • Ordinal

    • Nominal

  • Datetime

    • Date only

    • Time only

    • Datetime

  • Mixed-type

2. Variable Characteristics

2.1 Missing values

  • Causes

    • Forgotten, lost, or not stored

    • Value does not exist

    • Unknown/unidentified values

  • Missing data mechanisms

    • Missing completely at random (MCAR)

    • Missing at random (MAR)

    • Missing not at random (MNAR)

  • Imputation Techniques (a pandas sketch follows this list)

    • Numerical Variables

      • Mean/median imputation

      • Arbitrary value imputation

      • End of tail imputation

    • Categorical Variables

      • Frequent category

      • Adding new “missing” category

    • Both

      • Complete Case Analysis

      • Adding new “missing” category

      • Random sample imputation
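
A minimal sketch of the techniques above in plain pandas; the frame `df`, its column names, and the arbitrary value 999 are illustrative assumptions:

import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, np.nan, 40, 31, np.nan],
                   'city': ['Hanoi', None, 'Hue', 'Hanoi', None]})

# numerical: mean/median imputation
df['age_median'] = df['age'].fillna(df['age'].median())

# numerical: arbitrary value imputation (999 is an arbitrary choice)
df['age_arbitrary'] = df['age'].fillna(999)

# numerical: end-of-tail imputation (mean + 3 * std for roughly normal data)
df['age_tail'] = df['age'].fillna(df['age'].mean() + 3 * df['age'].std())

# categorical: most frequent category
df['city_frequent'] = df['city'].fillna(df['city'].mode()[0])

# categorical: add an explicit "missing" category
df['city_missing'] = df['city'].fillna('missing')

# both: complete case analysis drops every row with a missing value
df_cca = df.dropna(subset=['age', 'city'])

# both: random sample imputation draws from the observed values
observed = df['age'].dropna()
sampled = observed.sample(df['age'].isna().sum(), replace=True, random_state=0)
sampled.index = df.index[df['age'].isna()]
df['age_random'] = df['age'].fillna(sampled)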

2.2 Cardinality for categorical variables

  • High cardinality problems (a quick check follows this list)

    • The training set may not cover all labels, and tree-based models tend to be dominated by only a few of them

    • Introduces noise

    • The label distributions of train and test may differ, so the model fails to capture unseen labels or new patterns and overfits

    • Makes preprocessing transformations difficult
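
A quick check, a minimal sketch assuming `df` is a pandas DataFrame, `train`/`test` are its splits, and `city` is an illustrative column:

# unique labels per categorical column, highest first
cardinality = (df.select_dtypes(include=['object', 'category'])
                 .nunique()
                 .sort_values(ascending=False))
print(cardinality)

# labels that appear in train but never in test (splits are assumed)
unseen = set(train['city'].dropna()) - set(test['city'].dropna())
print(unseen)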

2.3 Rare labels for categorical variables

  • Can lead to overfitting in tree-based models

  • May add noise, which in turn causes overfitting

    • Group all rare labels into a single new label (a grouping sketch follows this list)

    • Or remove rare labels if they add noise or are not representative

  • Rare labels can make the train and test distributions differ (a label may appear in only one of the two)
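
A grouping sketch, assuming "rare" means a label below a 5% frequency threshold; the threshold and the column name are illustrative:

def group_rare_labels(s, threshold=0.05, new_label='rare'):
    # share of rows taken by each label
    freq = s.value_counts(normalize=True)
    rare = freq[freq < threshold].index
    # replace every rare label with the single grouped label
    return s.where(~s.isin(rare), new_label)

df['city_grouped'] = group_rare_labels(df['city'])  # df is assumed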

2.4 Outliers

  • Identify by extreme value analysis (a boundary sketch follows this list)

    • For a normal distribution

      • outliers lie outside mean +/- 3 * std

    • For a skewed distribution

      • outliers fall outside the upper/lower boundary

        • Upper boundary = 75th percentile + (IQR * 1.5)

        • Lower boundary = 25th percentile - (IQR * 1.5)

      • extreme cases

        • Upper boundary = 75th percentile + (IQR * 3)

        • Lower boundary = 25th percentile - (IQR * 3)
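
A boundary sketch of both rules; `df['age']` reuses the illustrative frame from the imputation sketch, and the 1.5/3 folds come from the list above:

def iqr_boundaries(s, fold=1.5):
    # IQR rule for skewed distributions (use fold=3 for the extreme case)
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - fold * iqr, q3 + fold * iqr

def gaussian_boundaries(s, fold=3):
    # mean +/- fold * std rule for roughly normal distributions
    return s.mean() - fold * s.std(), s.mean() + fold * s.std()

lower, upper = iqr_boundaries(df['age'])
outliers = df[(df['age'] < lower) | (df['age'] > upper)]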

2.5 Linear model assumption

  • Assumption

    • Linearity: There is a linear relationship between predictors and target

    • No perfect multicollinearity: there is no perfect or near-perfect linear relationship between two or more of the predictors

    • Normally distributed errors: The residuals are random and normally distributed with a mean of 0

    • Homoscedasticity: at each level of the predictor variables, the variance of the error should be constant

  • linear models

    • Linear regression

    • Logistic regression

    • Linear discriminant analysis - LDA

  • Checking the assumptions (a sketch follows this list)

    • Check that the residuals are zero-mean and normally distributed, with a Q-Q plot or a KS test

    • Check that the relationship between each predictor and the target, and among the predictors, is linear

    • Use the variance inflation factor (VIF) to check for multicollinearity
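
A minimal sketch of these checks with scipy and statsmodels; the fitted `residuals` and the predictor frame `X` are assumed to exist:

import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Q-Q plot of the residuals against a fitted normal distribution
sm.qqplot(residuals, line='45', fit=True)  # residuals is assumed

# KS test of the standardized residuals against the standard normal
standardized = (residuals - residuals.mean()) / residuals.std()
print(stats.kstest(standardized, 'norm'))

# VIF per predictor; values above ~5-10 usually flag multicollinearity
X_const = sm.add_constant(X)  # X (predictors) is assumed
vif = {col: variance_inflation_factor(X_const.values, i)
       for i, col in enumerate(X_const.columns)}
print(vif)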

2.6 Variable magnitude

  • Why does it matter?

    • The regression coefficients are directly affected by the scale of the variables

    • Variables with a larger magnitude dominate those with a smaller magnitude

    • Gradient descent and SVMs converge faster with scaled variables (a scaling sketch follows this section)

    • Euclidean distances are sensitive to feature magnitude

  • ML models

    • Models that are affected

      • Linear/Logistic Regression

      • Neural Networks

      • SVMs

      • kNN

      • K-means

      • LDA

      • PCA

    • Models that are not affected

      • Tree-based models

        • Classification tree

        • Regression tree

        • Random Forests

        • Gradient Boosted trees
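
A scaling sketch with scikit-learn; `X_train` and `X_test` are assumed splits, and fitting on the training data only avoids leaking test statistics:

from sklearn.preprocessing import StandardScaler, MinMaxScaler

# fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()  # or MinMaxScaler() to squeeze into [0, 1]
X_train_scaled = scaler.fit_transform(X_train)  # X_train is assumed
X_test_scaled = scaler.transform(X_test)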

2.2.3.1. Datatypes detection#

import numpy as np
import pandas as pd

def get_list_datatypes(df, discrete_nunique_max=10, mixed_detection=None):
    # avoid the mutable-default-argument pitfall
    mixed_detection = mixed_detection if mixed_detection is not None else []

    # numerical: discrete vs continuous, split on the number of unique values
    num = df.select_dtypes(include=[np.number, bool]).columns.tolist()
    discrete = [var for var in num if df[var].nunique() <= discrete_nunique_max]
    continuous = [var for var in num if var not in discrete]

    # categorical: anything that is neither numerical nor user-flagged as mixed
    # (note: datetime columns also land here unless listed in mixed_detection)
    categorical = [var for var in df.columns if var not in (mixed_detection + num)]

    res = (discrete, continuous, categorical, mixed_detection)
    for t, n in zip(res, ['discrete', 'continuous', 'categorical', 'mixed']):
        print(f'{len(t)} {n} variables: ', ", ".join(t))
    return res
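
A minimal usage sketch on a toy frame; the column names are illustrative:

df = pd.DataFrame({
    'age': [23, 45, 31, 52],                    # continuous numeric
    'n_children': [0, 2, 1, 0],                 # discrete numeric
    'city': ['Hanoi', 'Hue', 'Hanoi', 'Hue'],   # categorical
    'cabin': ['A12', 'B3', 'C45', 'B7'],        # mixed letters + numbers
})
discrete, continuous, categorical, mixed = get_list_datatypes(
    df, discrete_nunique_max=3, mixed_detection=['cabin'])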

2.2.3.2. Correlation#

import matplotlib.pyplot as plt
import seaborn as sns

def PairGridCorr(X, y=None, corr='pearson'):
    # scatter plots below the diagonal, histograms on it,
    # and correlation coefficients above it
    def corrdot(*args, **kwargs):
        # correlation between the two variables of the current cell
        corr_r = args[0].corr(args[1], method=corr)
        corr_text = f"{corr_r:2.2f}".replace("0.", ".")
        ax = plt.gca()
        ax.set_axis_off()
        # dot size and colour encode the strength and sign of the correlation
        marker_size = abs(corr_r) * 10000
        ax.scatter([.5], [.5], marker_size, [corr_r], alpha=0.6, cmap="coolwarm",
                   vmin=-1, vmax=1, transform=ax.transAxes)
        font_size = abs(corr_r) * 50 + 5
        ax.annotate(corr_text, [.5, .5], xycoords="axes fraction",
                    ha='center', va='center', fontsize=font_size)

    def annotate_colname(x, **kws):
        # write the variable name inside its diagonal cell
        ax = plt.gca()
        ax.annotate(x.name, xy=(0.5, 0.9), xycoords=ax.transAxes, fontweight='bold')

    sns.set(style='white', font_scale=1.3)
    g = sns.PairGrid(X, aspect=1.4, diag_sharey=False)
    g.map_lower(sns.scatterplot, hue=y)
    g.map_diag(sns.histplot)
    g.map_diag(annotate_colname)
    g.map_upper(corrdot)
    return g
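
A minimal usage sketch, reusing the numeric columns of the illustrative frame above:

g = PairGridCorr(df[['age', 'n_children']])
plt.show()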