2.2.3. Variable Overview#

1. Types of variables

  • Numeric

    • discrete

    • continuous

  • Categorical

    • Ordinal

    • Nominal

  • Datetime

    • Date only

    • Time only

    • Datetime

  • Mixed-type

2. Variable Characteristics

2.1 Missing values

  • Causes

    • Forgotten, lost, or not stored

    • Value does not exist

    • Unknown/unidentified values

  • Missing data mechanisms

    • Missing completely at random (MCAR)

    • Missing at random (MAR)

    • Missing not at random (MNAR)

  • Imputation Techniques (a pandas sketch follows this list)

    • Numerical Variables

      • Mean/median imputation

      • Arbitrary value imputation

      • End of tail imputation

    • Categorical Variables

      • Frequent category

      • Adding new “missing” category

    • Both

      • Complete Case Analysis

      • Adding new “missing” category

      • Random sample imputation
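
A minimal sketch of the techniques above in plain pandas; the frame `df`, its column names, and the arbitrary value 999 are illustrative assumptions:

import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, np.nan, 40, 31, np.nan],
                   'city': ['Hanoi', None, 'Hue', 'Hanoi', None]})

# numerical: mean/median imputation
df['age_median'] = df['age'].fillna(df['age'].median())

# numerical: arbitrary value imputation (999 is an arbitrary choice)
df['age_arbitrary'] = df['age'].fillna(999)

# numerical: end-of-tail imputation (mean + 3 * std for roughly normal data)
df['age_tail'] = df['age'].fillna(df['age'].mean() + 3 * df['age'].std())

# categorical: most frequent category
df['city_frequent'] = df['city'].fillna(df['city'].mode()[0])

# categorical: add an explicit "missing" category
df['city_missing'] = df['city'].fillna('missing')

# both: complete case analysis drops every row with a missing value
df_cca = df.dropna(subset=['age', 'city'])

# both: random sample imputation draws from the observed values
observed = df['age'].dropna()
sampled = observed.sample(df['age'].isna().sum(), replace=True, random_state=0)
sampled.index = df.index[df['age'].isna()]
df['age_random'] = df['age'].fillna(sampled)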

2.2 Cardinality for categorical variables

  • High cardinality problems (a quick check follows this list)

    • The training set may not cover all labels, and tree-based models tend to be dominated by only a few of them

    • Introduces noise

    • The label distributions of train and test may differ, so the model fails to capture unseen labels or new patterns and overfits

    • Makes preprocessing transformations difficult
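
A quick check, a minimal sketch assuming `df` is a pandas DataFrame, `train`/`test` are its splits, and `city` is an illustrative column:

# unique labels per categorical column, highest first
cardinality = (df.select_dtypes(include=['object', 'category'])
                 .nunique()
                 .sort_values(ascending=False))
print(cardinality)

# labels that appear in train but never in test (splits are assumed)
unseen = set(train['city'].dropna()) - set(test['city'].dropna())
print(unseen)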

2.3 Rare labels for categorical variables

  • Can lead to overfitting in tree-based models

  • May add noise, which in turn causes overfitting

    • Group all rare labels into a single new label (a grouping sketch follows this list)

    • Or remove rare labels if they add noise or are not representative

  • Rare labels can make the train and test distributions differ (a label may appear in only one of the two)
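
A grouping sketch, assuming "rare" means a label below a 5% frequency threshold; the threshold and the column name are illustrative:

def group_rare_labels(s, threshold=0.05, new_label='rare'):
    # share of rows taken by each label
    freq = s.value_counts(normalize=True)
    rare = freq[freq < threshold].index
    # replace every rare label with the single grouped label
    return s.where(~s.isin(rare), new_label)

df['city_grouped'] = group_rare_labels(df['city'])  # df is assumed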

2.4 Outliers

  • Identify by extreme value analysis (a boundary sketch follows this list)

    • For a normal distribution

      • outliers lie outside mean +/- 3 * std

    • For a skewed distribution

      • outliers fall outside the upper/lower boundary

        • Upper boundary = 75th percentile + (IQR * 1.5)

        • Lower boundary = 25th percentile - (IQR * 1.5)

      • extreme cases

        • Upper boundary = 75th percentile + (IQR * 3)

        • Lower boundary = 25th percentile - (IQR * 3)
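
A boundary sketch of both rules; `df['age']` reuses the illustrative frame from the imputation sketch, and the 1.5/3 folds come from the list above:

def iqr_boundaries(s, fold=1.5):
    # IQR rule for skewed distributions (use fold=3 for the extreme case)
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - fold * iqr, q3 + fold * iqr

def gaussian_boundaries(s, fold=3):
    # mean +/- fold * std rule for roughly normal distributions
    return s.mean() - fold * s.std(), s.mean() + fold * s.std()

lower, upper = iqr_boundaries(df['age'])
outliers = df[(df['age'] < lower) | (df['age'] > upper)]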

2.5 Linear model assumption

  • Assumption

    • Linearity: There is a linear relationship between predictors and target

    • No perfect multicollinearity: there is no perfect or near-perfect linear relationship between two or more of the predictors

    • Normally distributed errors: The residuals are random and normally distributed with a mean of 0

    • Homoscedasticity: at each level of the predictor variables, the variance of the error should be constant

  • linear models

    • Linear regression

    • Logistic regression

    • Linear discriminant analysis - LDA

  • Checking the assumptions (a sketch follows this list)

    • Check that the residuals are zero-mean and normally distributed, with a Q-Q plot or a KS test

    • Check that the relationship between each predictor and the target, and among the predictors, is linear

    • Use the variance inflation factor (VIF) to check for multicollinearity
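
A minimal sketch of these checks with scipy and statsmodels; the fitted `residuals` and the predictor frame `X` are assumed to exist:

import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Q-Q plot of the residuals against a fitted normal distribution
sm.qqplot(residuals, line='45', fit=True)  # residuals is assumed

# KS test of the standardized residuals against the standard normal
standardized = (residuals - residuals.mean()) / residuals.std()
print(stats.kstest(standardized, 'norm'))

# VIF per predictor; values above ~5-10 usually flag multicollinearity
X_const = sm.add_constant(X)  # X (predictors) is assumed
vif = {col: variance_inflation_factor(X_const.values, i)
       for i, col in enumerate(X_const.columns)}
print(vif)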

2.6 Variable magnitude

  • Why does it matter?

    • The regression coefficients are directly affected by the scale of the variables

    • Variables with a larger magnitude dominate those with a smaller magnitude

    • Gradient descent and SVMs converge faster with scaled variables (a scaling sketch follows this section)

    • Euclidean distances are sensitive to feature magnitude

  • ML models

    • Models that are affected

      • Linear/Logistic Regression

      • Neural Networks

      • SVMs

      • kNN

      • K-means

      • LDA

      • PCA

    • Models that are not affected

      • Tree-based models

        • Classification tree

        • Regression tree

        • Random Forests

        • Gradient Boosted trees
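
A scaling sketch with scikit-learn; `X_train` and `X_test` are assumed splits, and fitting on the training data only avoids leaking test statistics:

from sklearn.preprocessing import StandardScaler, MinMaxScaler

# fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()  # or MinMaxScaler() to squeeze into [0, 1]
X_train_scaled = scaler.fit_transform(X_train)  # X_train is assumed
X_test_scaled = scaler.transform(X_test)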

2.2.3.1. Datatypes detection#

import numpy as np
import pandas as pd

def get_list_datatypes(df, discrete_nunique_max=10, mixed_detection=None):
    # avoid the mutable-default-argument pitfall
    mixed_detection = mixed_detection if mixed_detection is not None else []

    # numerical: discrete vs continuous, split on the number of unique values
    num = df.select_dtypes(include=[np.number, bool]).columns.tolist()
    discrete = [var for var in num if df[var].nunique() <= discrete_nunique_max]
    continuous = [var for var in num if var not in discrete]

    # categorical: anything that is neither numerical nor user-flagged as mixed
    # (note: datetime columns also land here unless listed in mixed_detection)
    categorical = [var for var in df.columns if var not in (mixed_detection + num)]

    res = (discrete, continuous, categorical, mixed_detection)
    for t, n in zip(res, ['discrete', 'continuous', 'categorical', 'mixed']):
        print(f'{len(t)} {n} variables: ', ", ".join(t))
    return res
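
A minimal usage sketch on a toy frame; the column names are illustrative:

df = pd.DataFrame({
    'age': [23, 45, 31, 52],                    # continuous numeric
    'n_children': [0, 2, 1, 0],                 # discrete numeric
    'city': ['Hanoi', 'Hue', 'Hanoi', 'Hue'],   # categorical
    'cabin': ['A12', 'B3', 'C45', 'B7'],        # mixed letters + numbers
})
discrete, continuous, categorical, mixed = get_list_datatypes(
    df, discrete_nunique_max=3, mixed_detection=['cabin'])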

2.2.3.2. Correlation#

import matplotlib.pyplot as plt
import seaborn as sns

def PairGridCorr(X, y=None, corr='pearson'):
    # scatter plots below the diagonal, histograms on it,
    # and correlation coefficients above it
    def corrdot(*args, **kwargs):
        # correlation between the two variables of the current cell
        corr_r = args[0].corr(args[1], method=corr)
        corr_text = f"{corr_r:2.2f}".replace("0.", ".")
        ax = plt.gca()
        ax.set_axis_off()
        # dot size and colour encode the strength and sign of the correlation
        marker_size = abs(corr_r) * 10000
        ax.scatter([.5], [.5], marker_size, [corr_r], alpha=0.6, cmap="coolwarm",
                   vmin=-1, vmax=1, transform=ax.transAxes)
        font_size = abs(corr_r) * 50 + 5
        ax.annotate(corr_text, [.5, .5], xycoords="axes fraction",
                    ha='center', va='center', fontsize=font_size)

    def annotate_colname(x, **kws):
        # write the variable name inside its diagonal cell
        ax = plt.gca()
        ax.annotate(x.name, xy=(0.5, 0.9), xycoords=ax.transAxes, fontweight='bold')

    sns.set(style='white', font_scale=1.3)
    g = sns.PairGrid(X, aspect=1.4, diag_sharey=False)
    g.map_lower(sns.scatterplot, hue=y)
    g.map_diag(sns.histplot)
    g.map_diag(annotate_colname)
    g.map_upper(corrdot)
    return g
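
A minimal usage sketch, reusing the numeric columns of the illustrative frame above:

g = PairGridCorr(df[['age', 'n_children']])
plt.show()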