2.2.7. Discretisation#

  • The process of binning/transforming continuous variables into discrete variables by creating a set of contiguous intervals (bins) that span the range of the variable's values.

  • Discretisation can also handle outliers, because extreme values simply fall into the lowest or highest bin.

1. Why use discretisation?

  • Reducing complexity: Discretization can simplify the feature space by reducing the number of possible values that a feature can take on. This can make it easier to build models and reduce the risk of overfitting.

  • Meeting model assumptions: Some models, such as decision trees and naive Bayes classifiers, are designed to work with categorical variables. Discretizing continuous variables can help these models perform better.

  • Dealing with sparsity: In some datasets, certain values of a continuous variable may be rare or non-existent. Discretization can help address this issue by grouping together similar values and reducing the sparsity of the data.

  • Interpretability: Discretization can make it easier to interpret the relationship between a feature and the target variable. For example, if we discretize age into categories like “young”, “middle-aged”, and “old”, we can more easily see how age affects the outcome variable.

It’s worth noting that discretization is not always necessary or appropriate. In some cases it may be better to keep continuous variables as they are, or to use other techniques such as scaling or mathematical transformations. The choice of whether to discretize or not depends on the specific problem and the characteristics of the data.

2. Approaches

  • Unsupervised

    • Equal-width

    • Equal-frequency

    • K-means

  • Supervised (requires the target variable)

    • Decision Trees

  • Others

    • Domain Knowledge

Plus encoding: if a linear model is used and the bins may not hold a linear relationship with the target, encoding the variable after discretisation may improve performance: treat the bins as categories and one-hot encode them, or use target-guided encodings such as mean encoding, weight of evidence, or target-guided ordinal encoding (see the sketch below).
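A minimal sketch of this idea, treating the bins as categories and one-hot encoding them inside a scikit-learn pipeline (X_train and y_train refer to the train/test split created further down; KBinsDiscretizer can also one-hot encode directly via encode='onehot'):

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder

# bin each continuous column, then one-hot encode the bin ids so the linear
# model learns one coefficient per bin instead of assuming a linear effect
binned_linear = make_pipeline(
    KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile'),
    OneHotEncoder(handle_unknown='ignore'),
    LogisticRegression(max_iter=1000),
)
# binned_linear.fit(X_train, y_train)  # using the split defined below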

import numpy as np
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.model_selection import train_test_split

def bin_hist(listvar, train, test=None, cols=3):
    # plot percentage histograms of the given columns for train (and optionally test)
    listvar = [listvar] if isinstance(listvar, str) else listvar
    cols = min(cols, len(listvar))
    rows = (len(listvar) // cols) + 1 if (len(listvar) % cols) != 0 else (len(listvar) // cols)
    fig = make_subplots(rows=rows, cols=cols, subplot_titles=listvar)
    for i, var in enumerate(listvar):
        fig.add_trace(go.Histogram(x=train[var], name="train", marker_color='#656FF4',
                                   bingroup=i, histnorm='percent'),
                      row=i//cols + 1, col=i%cols + 1)
        if test is not None:
            fig.add_trace(go.Histogram(x=test[var], name="test", marker_color='#F85341',
                                       bingroup=i, histnorm='percent'),
                          row=i//cols + 1, col=i%cols + 1)
    fig.update_layout(autosize=True, height=rows*400, barmode='group', bargap=0.2,
                      bargroupgap=0.05, showlegend=False, yaxis_title="percentage")
    fig.update_xaxes(categoryorder='category ascending')
    return fig
data = pd.read_csv('Datasets/titanic.csv',usecols = ['age', 'fare', 'survived']).dropna()

X_train, X_test, y_train, y_test = train_test_split(
    data[['age', 'fare']],
    data['survived'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape
((731, 2), (314, 2))

2.2.7.1. Equal-width#

Equal-width discretisation divides the range of possible values into N bins of the same width. The width is determined by the range of values in the variable and the number of bins we wish to use to divide the variable:

width = (max value - min value) / N

where N is the number of bins or intervals, which is something to determine experimentally.

  • Does not improve value spread

  • Handles outliers

  • Creates discrete variable

  • Good to combine with categorical encodings

# pandas cut (cut by specific interval)
import matplotlib.pyplot as plt
X_train_copy = X_train.copy()

def gen_intervals(sr, N):
    # now let's capture the lower and upper boundaries
    min_value = int(np.floor( sr.min()))
    max_value = int(np.ceil( sr.max()))

    # let's round the bin width
    inter_value = int(np.round((max_value - min_value) / N))

    # interval list
    intervals = [i for i in range(min_value, max_value+inter_value, inter_value)]
    labels = ['Bin_' + str(i) for i in range(1, len(intervals))]
    return intervals, labels

intervals, labels = gen_intervals(X_train_copy['age'], N = 8)
X_train_copy['Age_disc_labels'] = pd.cut(x=X_train_copy['age'],bins=intervals, labels=labels, include_lowest=True)
X_train_copy['Age_disc'] = pd.cut(x=X_train_copy['age'], bins=intervals, include_lowest=True)

# plot
X_train_copy.groupby('Age_disc')['age'].count().plot.bar()
plt.xticks(rotation=45)
plt.ylabel('Number of observations per bin')

X_train_copy.head(10)
age fare Age_disc_labels Age_disc
943 37.0 9.5875 Bin_4 (30.0, 40.0]
195 16.0 86.5000 Bin_2 (10.0, 20.0]
1257 9.0 15.2458 Bin_1 (-0.001, 10.0]
1266 36.0 24.1500 Bin_4 (30.0, 40.0]
440 48.0 65.0000 Bin_5 (40.0, 50.0]
1113 26.0 13.7750 Bin_3 (20.0, 30.0]
1091 20.0 8.6625 Bin_2 (10.0, 20.0]
907 20.0 9.8250 Bin_2 (10.0, 20.0]
42 59.0 51.4792 Bin_6 (50.0, 60.0]
1131 32.0 8.0500 Bin_4 (30.0, 40.0]
[bar chart: number of observations per equal-width age bin]
# sklearn
from sklearn.preprocessing import KBinsDiscretizer

disc = KBinsDiscretizer(n_bins=8, encode='ordinal', strategy='uniform').set_output(transform="pandas")  # strategy='uniform' -> equal-width bins
disc.fit(X_train)

train_t = disc.transform(X_train)
test_t = disc.transform(X_test)

bin_hist('age', train_t, test_t )
[histograms: age after equal-width discretisation, train vs test]
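To see where the cut points ended up, KBinsDiscretizer stores one array of boundaries per input column in its bin_edges_ attribute; a minimal sketch, assuming the fitted disc from the cell above and a recent scikit-learn (for feature_names_in_):

for col, edges in zip(disc.feature_names_in_, disc.bin_edges_):
    # each entry of bin_edges_ holds the learned boundaries for one column
    print(col, np.round(edges, 2))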

2.2.7.2. Equal-frequency#

  • Equal frequency discretisation divides the scope of possible values of the variable into N bins, where each bin carries the same amount of observations.

  • This is particularly useful for skewed variables as it spreads the observations over the different bins equally. We find the interval boundaries by determining the quantiles.

  • Improves the value spread of skewed variables

  • Handles outliers

  • Creates discrete variable

  • Good to combine with categorical encodings

# pandas qcut (quantile-based cut)
N = 10
labels = ['Q{:02.0f}'.format(i) for i in range(1, N+1)]
Age_discretised, intervals = pd.qcut(X_train['age'], N, labels=labels,
                                     retbins=True,  # also return the bin edges, to reuse with pd.cut on the test set
                                     precision=3, duplicates='raise')
X_train_copy = X_train.copy()
X_train_copy['Age_discretised_bin'] = Age_discretised
bin_hist('Age_discretised_bin', X_train_copy)
[histogram: Age_discretised_bin distribution in the train set]
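Because retbins=True returns the train-set quantile boundaries, the same edges can be reused to bin the test set consistently; a minimal sketch (test values outside the train range would become NaN):

X_test_copy = X_test.copy()
X_test_copy['Age_discretised_bin'] = pd.cut(X_test_copy['age'], bins=intervals,
                                            labels=labels, include_lowest=True)
X_test_copy['Age_discretised_bin'].value_counts()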
# sklearn 
disc = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile').set_output(transform="pandas")  # strategy='quantile' -> equal-frequency bins
disc.fit(X_train)

train_t = disc.transform(X_train)
test_t = disc.transform(X_test)

bin_hist('age', train_t, test_t )
[histograms: age after equal-frequency discretisation, train vs test]

2.2.7.3. K-means#

  • With the number of clusters K defined by the user, this method consists of applying k-means clustering to the continuous variable and using the resulting clusters as bins.

  • Unless you have reason to believe that the values of the variable are organised in clusters, equal-width discretisation is usually a simpler alternative to this method.

disc = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='kmeans')
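A fuller sketch following the same pattern as the previous strategies (assuming X_train, X_test and bin_hist defined above):

# k-means discretisation: scikit-learn places the bin edges at the midpoints
# between consecutive centroids found by 1-D k-means on each column
disc = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='kmeans').set_output(transform="pandas")
disc.fit(X_train)

train_t = disc.transform(X_train)
test_t = disc.transform(X_test)

bin_hist('age', train_t, test_t)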

2.2.7.4. Decision Tree discretisation#

  • Discretisation with Decision Trees consists in using a decision tree to identify the optimal bins.

  • The optimal bins are the end nodes (leaves) of a tree of limited depth (2, 3 or 4) fitted using the single variable to predict the target.

1. Advantages

  • The output returned by the decision tree is monotonically related to the target.

  • The tree end nodes, or bins in the discretised variable show decreased entropy: that is, the observations within each bin are more similar among themselves than to those of other bins.

2. Limitations

  • Prone to over-fitting

  • More importantly, some tuning of the tree parameters is needed to obtain the optimal number of splits (e.g., tree depth, minimum number of samples per partition, maximum number of partitions, and minimum information gain). This can be time consuming.

from sklearn.tree import DecisionTreeClassifier, export_graphviz

X_train_copy = X_train[['age']].copy()
X_test_copy = X_test[['age']].copy()

tree_model = DecisionTreeClassifier(max_depth=3)
tree_model.fit(X_train_copy, y_train)

train_t = pd.DataFrame(tree_model.predict_proba(X_train_copy)[:,1], columns = ['age'], index = y_train.index) # probability for class 1
test_t = pd.DataFrame(tree_model.predict_proba(X_test_copy)[:,1], columns = ['age'], index = y_test.index) # probability for class 1

bin_hist('age', train_t, test_t )
[histograms: tree-derived age bins (leaf probabilities), train vs test]
# check monotonic in train and test
y_train.to_frame().groupby(train_t['age']).mean().plot(title = 'Monotonic age - target (train)', ylabel = 'mean target')
y_test.to_frame().groupby(test_t['age']).mean().plot(title = 'Monotonic age - target (test)', ylabel = 'mean target')
# test is not monotonic ==> overfitting
<AxesSubplot: title={'center': 'Monotonic age - target (test)'}, xlabel='age', ylabel='mean target'>
[line plots: mean target per discretised age value, train and test]
# optimize the tree model
# choose the depth that generates the best roc-auc

from sklearn.model_selection import cross_val_score
score_ls = []  # here we store the roc auc
score_std_ls = []  # here we store the standard deviation of the roc_auc

for tree_depth in [1, 2, 3, 4]:

    # call the model
    tree_model = DecisionTreeClassifier(max_depth=tree_depth)

    # train the model using 3 fold cross validation
    scores = cross_val_score(tree_model, X_train_copy, y_train, cv=3, scoring='roc_auc')
    
    # save the parameters
    score_ls.append(np.mean(scores))
    score_std_ls.append(np.std(scores))

    
# capture the parameters in a dataframe
temp = pd.concat([pd.Series([1, 2, 3, 4]), pd.Series(
    score_ls), pd.Series(score_std_ls)], axis=1)

temp.columns = ['depth', 'roc_auc_mean', 'roc_auc_std']
temp
depth roc_auc_mean roc_auc_std
0 1 0.532756 0.009545
1 2 0.519149 0.015073
2 3 0.514044 0.007280
3 4 0.519165 0.012087
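To recover the actual age cut points implied by the tree (rather than working with leaf probabilities), the split thresholds of the fitted tree can be read off directly; a minimal sketch, using max_depth=2 purely for illustration:

best_tree = DecisionTreeClassifier(max_depth=2)
best_tree.fit(X_train_copy, y_train)

# internal nodes have feature >= 0, leaves are marked with -2,
# so the remaining thresholds are the bin boundaries on age
age_cut_points = np.sort(best_tree.tree_.threshold[best_tree.tree_.feature >= 0])
age_cut_points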

2.2.7.5. Domain knowledge discretisation#

Frequently, when engineering variables in a business setting, business experts determine the intervals into which they think the variable should be divided so that it makes sense for the business.

# feature_engine
from feature_engine.discretisation import ArbitraryDiscretiser

dis = ArbitraryDiscretiser(binning_dict={'age': [0, 20, 40, 60, 200]})
dis.fit(X_train)
dis.transform(X_train)
age fare
943 1 9.5875
195 0 86.5000
1257 0 15.2458
1266 1 24.1500
440 2 65.0000
... ... ...
1290 2 7.0000
846 0 9.5000
952 1 7.7750
614 1 7.8875
745 1 6.9500

731 rows × 2 columns

# pandas
pd.cut(X_train['age'], bins=[0,20,40,60,200], include_lowest=True)
943       (20.0, 40.0]
195     (-0.001, 20.0]
1257    (-0.001, 20.0]
1266      (20.0, 40.0]
440       (40.0, 60.0]
             ...      
1290      (40.0, 60.0]
846     (-0.001, 20.0]
952       (20.0, 40.0]
614       (20.0, 40.0]
745       (20.0, 40.0]
Name: age, Length: 731, dtype: category
Categories (4, interval[float64, right]): [(-0.001, 20.0] < (20.0, 40.0] < (40.0, 60.0] < (60.0, 200.0]]