[Machine Learning] Imbalanced dataset

10 minute read


Imbalance problem

Most machine learning algorithms assume that the classes are roughly equally distributed. When there is a class imbalance, the classifier tends to be biased towards the majority class, resulting in poor classification of the minority class.
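
To see why this is a problem, here is a minimal sketch (not part of the original post) using sklearn's DummyClassifier: a model that always predicts the majority class already reaches about 99% accuracy on a 99:1 dataset while never detecting the minority class, so accuracy alone is misleading.

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# 99:1 imbalanced toy dataset (same parameters used later in this post)
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0,
                           random_state=1)

# baseline that always predicts the majority class (0)
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

print("Accuracy:", accuracy_score(y, y_pred))        # ~0.99, looks great
print("Recall (class 1):", recall_score(y, y_pred))  # 0.0, never finds class 1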



SMOTE (Synthetic Minority Oversampling Technique)

  • A random example from the minority class is first chosen. Then the k nearest neighbors of that example are found (typically k=5).
  • A neighbor is selected at random, and a synthetic example is created at a randomly chosen point on the line segment between the two examples in feature space (see the sketch after this list).
  • Combining SMOTE with RandomUnderSampler usually works well in practice.
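
As an illustration of the interpolation step described above, here is a minimal NumPy sketch (a simplification, not imblearn's actual implementation) that generates a single synthetic minority sample:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_one_sample(X_minority, k=5, seed=0):
    # generate one synthetic sample by SMOTE-style interpolation
    rng = np.random.default_rng(seed)
    x = X_minority[rng.integers(len(X_minority))]   # random minority example
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, idx = nn.kneighbors(x.reshape(1, -1))        # k+1 because x itself is returned
    neighbor = X_minority[rng.choice(idx[0][1:])]   # pick one of the k neighbors
    return x + rng.random() * (neighbor - x)        # random point on the segment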


Resampling

UnderSampling

(figure: random undersampling illustration)

Oversampling

(figure: random oversampling illustration)


SMOTE

(figures: SMOTE interpolation between minority neighbors)


An issue with SMOTE

(figure: illustration of the issue with SMOTE)


Borderline SMOTE

(figure: Borderline-SMOTE illustration)

funMV: Imbalanced learning in machine learning (in Korean)



Setup

from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)
!pip install imbalanced-learn  # provides the imblearn package


import imblearn
print(imblearn.__version__)
0.8.1


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc



Make dataset

# define dataset
from sklearn.datasets import make_classification
X_org, y_org = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                                   n_clusters_per_class=1, weights=[0.99], flip_y=0,
                                   random_state=1)
X_org.shape, y_org.shape
((10000, 2), (10000,))


# Imbalanced dataset
len(y_org[y_org==0]), len(y_org[y_org==1])
(9900, 100)
markers = ['o', '+']
for i in range(2):
    xs = X_org[:, 0][y_org == i]
    ys = X_org[:, 1][y_org == i]
    plt.scatter(xs, ys, marker=markers[i], label=str(i), s=3)
plt.legend()
plt.show()

(figure: scatter plot of the imbalanced dataset)


# Utility function for printing result

def print_result(X_train, X_test, y_train, y_test):

    model = RandomForestClassifier(n_estimators=100, max_depth=5)
    print("Shapes: ", X_train.shape, X_test.shape, y_train.shape, y_test.shape)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    print("Static performance: ", "\n", confusion_matrix(y_test, y_pred))
    print()
    print(classification_report(y_test, y_pred))

    y_pred_proba = model.predict_proba(X_test)
    fpr, tpr, _ = roc_curve(y_test, y_pred_proba[:,1])
    print("AUC score: ", auc(fpr, tpr))
    print()


Classification on original dataset

# classification on original dataset

X, y = X_org.copy(), y_org.copy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)
print("original distribution: ", len(y[y==0])/len(y), len(y[y==1])/len(y))
print("Train data distribution: ", len(y_train[y_train==0])/len(y_train), len(y_train[y_train==1])/len(y_train))
print("Test data distribution: ", len(y_test[y_test==0])/len(y_test), len(y_test[y_test==1])/len(y_test))

print_result(X_train, X_test, y_train, y_test)
original distribution:  0.99 0.01
Train data distribution:  0.99 0.01
Test data distribution:  0.99 0.01
Shapes:  (7000, 2) (3000, 2) (7000,) (3000,)
Static performance:  
 [[2969    1]
 [  18   12]]

              precision    recall  f1-score   support

           0       0.99      1.00      1.00      2970
           1       0.92      0.40      0.56        30

    accuracy                           0.99      3000
   macro avg       0.96      0.70      0.78      3000
weighted avg       0.99      0.99      0.99      3000

AUC score:  0.9161672278338945
  • Note the poor recall for class 1: only 12 of the 30 positive test examples are detected, i.e. recall = 12 / (12 + 18) = 0.40.
    • This is a consequence of the class imbalance. (Think of class 1 as something critical to detect, such as threatening objects like guns and knives.)

Performance on the minority class (1) is markedly worse.


Resampling

SMOTEed dataset

oversampling

from imblearn.over_sampling import SMOTE
oversample = SMOTE()
X, y = oversample.fit_resample(X_org, y_org)
X.shape, y.shape , len(y[y==0]), len(y[y==1])
((19800, 2), (19800,), 9900, 9900)
markers = ['o', '+']
for i in range(2):
    xs = X[:, 0][y == i]
    ys = X[:, 1][y == i]
    plt.scatter(xs, ys, marker=markers[i], label=str(i), s=3)
plt.legend()
plt.show()

(figure: scatter plot after SMOTE oversampling)


Undersampled dataset

Undersampling

from imblearn.under_sampling import RandomUnderSampler

under = RandomUnderSampler()
X, y = under.fit_resample(X_org, y_org)
print(len(y[y==0]), len(y[y==1]))
100 100
markers = ['o', '+']
for i in range(2):
    xs = X[:, 0][y == i]
    ys = X[:, 1][y == i]
    plt.scatter(xs, ys, marker=markers[i], label=str(i))
plt.legend()
plt.show()

(figure: scatter plot after random undersampling)


Original paper on SMOTE

  • Combine SMOTE with RandomUnderSampler.

  • The original paper on SMOTE suggested combining SMOTE with random undersampling of the majority class.
  • We can update the example to first oversample the minority class to 10 percent of the majority class (9900 × 0.1 = 990 examples), then use random undersampling so that the majority class keeps twice as many examples as the minority class (990 / 0.5 = 1980 examples).
over = SMOTE(sampling_strategy=0.1)
X, y = over.fit_resample(X_org, y_org)
print("oversampled: ", len(y[y==0]), len(y[y==1]))
oversampled:  9900 990
under = RandomUnderSampler(sampling_strategy=0.5)  # ratio
X, y = under.fit_resample(X, y)
print("under-sampled: ", len(y[y==0]), len(y[y==1]))
under-sampled:  1980 990
markers = ['o', '+']
for i in range(2):
    xs = X[:, 0][y == i]
    ys = X[:, 1][y == i]
    plt.scatter(xs, ys, marker=markers[i], label=str(i))
plt.legend()
plt.show()

(figure: scatter plot after SMOTE + random undersampling)

  • Note that SMOTE can create a "bridge" of synthetic points through the majority-class region, because it interpolates between minority examples regardless of what lies between them.


Borderline-SMOTE

  • Ignore noise points (all of whose neighbors belong to the majority class) and safe minority points (whose neighbors are mostly minority class).
  • Resample only from borderline points (those that have both majority- and minority-class neighbors).
  • As a result, more attention is given to the hard examples near the decision boundary.
# Borderline-SMOTE
from imblearn.over_sampling import BorderlineSMOTE
oversample = BorderlineSMOTE(sampling_strategy=0.1)
X, y = oversample.fit_resample(X_org, y_org)
print(len(y[y==0]), len(y[y==1]))

markers = ['o', '+']
for i in range(2):
    xs = X[:, 0][y == i]
    ys = X[:, 1][y == i]
    plt.scatter(xs, ys, marker=markers[i], label=str(i))
plt.legend()
plt.show()
9900 990

(figure: scatter plot after Borderline-SMOTE)

  • The plot shows that those examples far from the decision boundary are not oversampled. This includes both examples that are easier to classify (those orange points toward the top left of the plot) and those that are overwhelmingly difficult to classify given the strong class overlap (those orange points toward the bottom right of the plot).



Classification on resampled dataset

SMOTEed dataset

Wrong Example

Do not resample the whole dataset up front.

Resampling must be applied only to the training set; the test set should contain only real (original) data.

# resampling first - No !

print("*** Resample First *** \n")
oversample = SMOTE()
X, y = oversample.fit_resample(X_org, y_org)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)
print_result(X_train, X_test, y_train, y_test)
print("Don't be confused...")
*** Resample First *** 

Shapes:  (13860, 2) (5940, 2) (13860,) (5940,)
Static performance:  
 [[2859  111]
 [ 323 2647]]

              precision    recall  f1-score   support

           0       0.90      0.96      0.93      2970
           1       0.96      0.89      0.92      2970

    accuracy                           0.93      5940
   macro avg       0.93      0.93      0.93      5940
weighted avg       0.93      0.93      0.93      5940

AUC score:  0.9809971771587933
Don't be fooled - the test set now contains synthetic samples!


Correct Example

First split the dataset into train and test sets.

Then apply resampling only to the training set.

# resample only on Train dataset ! - this is right !
print("*** Split First *** \n")
X_train, X_test, y_train, y_test = train_test_split(X_org, y_org, test_size=0.3, stratify=y_org)
oversample = SMOTE()
X_train, y_train = oversample.fit_resample(X_train, y_train)
print_result(X_train, X_test, y_train, y_test)
*** Split First *** 

Shapes:  (13860, 2) (3000, 2) (13860,) (3000,)
Static performance:  
 [[2840  130]
 [   5   25]]

              precision    recall  f1-score   support

           0       1.00      0.96      0.98      2970
           1       0.16      0.83      0.27        30

    accuracy                           0.95      3000
   macro avg       0.58      0.89      0.62      3000
weighted avg       0.99      0.95      0.97      3000

AUC score:  0.9396801346801347

Recall for class 1 improved to 83%.


Undersampled dataset

The training set becomes very small, so there is a risk of overfitting.

# undersampling
print("*** Undersampling *** \n")
X_train, X_test, y_train, y_test = train_test_split(X_org, y_org, test_size=0.3, stratify=y_org)
under = RandomUnderSampler()
X_train, y_train = under.fit_resample(X_train, y_train)
print_result(X_train, X_test, y_train, y_test)
*** Undersampling *** 

Shapes:  (140, 2) (3000, 2) (140,) (3000,)
Static performance:  
 [[2567  403]
 [   2   28]]

              precision    recall  f1-score   support

           0       1.00      0.86      0.93      2970
           1       0.06      0.93      0.12        30

    accuracy                           0.86      3000
   macro avg       0.53      0.90      0.52      3000
weighted avg       0.99      0.86      0.92      3000

AUC score:  0.967996632996633


Use both SMOTE & Undersampling

# use both SMOTE and RandomUnderSampler
X_train, X_test, y_train, y_test = train_test_split(X_org, y_org, test_size=0.3, stratify=y_org)
over = SMOTE(sampling_strategy=0.1)
X_train, y_train = over.fit_resample(X_train, y_train)
under = RandomUnderSampler(sampling_strategy=0.5)
X_train, y_train = under.fit_resample(X_train, y_train)
print_result(X_train, X_test, y_train, y_test)
Shapes:  (2079, 2) (3000, 2) (2079,) (3000,)
Static performance:  
 [[2887   83]
 [   7   23]]

              precision    recall  f1-score   support

           0       1.00      0.97      0.98      2970
           1       0.22      0.77      0.34        30

    accuracy                           0.97      3000
   macro avg       0.61      0.87      0.66      3000
weighted avg       0.99      0.97      0.98      3000

AUC score:  0.9530695847362514


# Use stratified K-Fold 

from sklearn.model_selection import StratifiedKFold

X, y = X_org.copy(), y_org.copy()
cv = StratifiedKFold(n_splits=5, shuffle=True)
score = []

for train_idx, test_idx in cv.split(X_org, y_org): # 5 stratified folds
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]
   
    over = SMOTE(sampling_strategy=0.1)
    X_train, y_train = over.fit_resample(X_train, y_train)
    under = RandomUnderSampler(sampling_strategy=0.5)  # ratio
    X_train, y_train = under.fit_resample(X_train, y_train)

    print_result(X_train, X_test, y_train, y_test)

Shapes:  (2376, 2) (2000, 2) (2376,) (2000,)
Static performance:  
 [[1920   60]
 [   3   17]]

              precision    recall  f1-score   support

           0       1.00      0.97      0.98      1980
           1       0.22      0.85      0.35        20

    accuracy                           0.97      2000
   macro avg       0.61      0.91      0.67      2000
weighted avg       0.99      0.97      0.98      2000

AUC score:  0.9767424242424242

Shapes:  (2376, 2) (2000, 2) (2376,) (2000,)
Static performance:  
 [[1937   43]
 [   5   15]]

              precision    recall  f1-score   support

           0       1.00      0.98      0.99      1980
           1       0.26      0.75      0.38        20

    accuracy                           0.98      2000
   macro avg       0.63      0.86      0.69      2000
weighted avg       0.99      0.98      0.98      2000

AUC score:  0.9520454545454544

Shapes:  (2376, 2) (2000, 2) (2376,) (2000,)
Static performance:  
 [[1938   42]
 [   4   16]]

              precision    recall  f1-score   support

           0       1.00      0.98      0.99      1980
           1       0.28      0.80      0.41        20

    accuracy                           0.98      2000
   macro avg       0.64      0.89      0.70      2000
weighted avg       0.99      0.98      0.98      2000

AUC score:  0.9510858585858586

Shapes:  (2376, 2) (2000, 2) (2376,) (2000,)
Static performance:  
 [[1939   41]
 [   6   14]]

              precision    recall  f1-score   support

           0       1.00      0.98      0.99      1980
           1       0.25      0.70      0.37        20

    accuracy                           0.98      2000
   macro avg       0.63      0.84      0.68      2000
weighted avg       0.99      0.98      0.98      2000

AUC score:  0.9346464646464646

Shapes:  (2376, 2) (2000, 2) (2376,) (2000,)
Static performance:  
 [[1939   41]
 [   4   16]]

              precision    recall  f1-score   support

           0       1.00      0.98      0.99      1980
           1       0.28      0.80      0.42        20

    accuracy                           0.98      2000
   macro avg       0.64      0.89      0.70      2000
weighted avg       0.99      0.98      0.98      2000

AUC score:  0.9304671717171717
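
The same combination of SMOTE, undersampling and stratified cross-validation can also be written more compactly with imblearn's Pipeline, which applies the samplers to the training folds only. The sketch below is not part of the original notebook; it assumes the same X_org, y_org and scores a RandomForest with ROC AUC.

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# samplers in an imblearn Pipeline are applied during fit only, i.e. on the training folds
pipeline = Pipeline(steps=[
    ('over', SMOTE(sampling_strategy=0.1)),
    ('under', RandomUnderSampler(sampling_strategy=0.5)),
    ('model', RandomForestClassifier(n_estimators=100, max_depth=5)),
])

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=1)
scores = cross_val_score(pipeline, X_org, y_org, scoring='roc_auc', cv=cv)
print("Mean ROC AUC: %.3f" % scores.mean())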


Different values of k (the number of nearest neighbors) in SMOTE

X_org, y_org = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                                   n_clusters_per_class=1, weights=[0.99], flip_y=0,
                                   random_state=1)
k_values = [1,10,20,30,40]

for k in k_values:
    X_train, X_test, y_train, y_test = train_test_split(X_org, y_org, test_size=0.3, stratify=y_org)
    over = SMOTE(sampling_strategy=0.1, k_neighbors=k) # k_neighbors default=5
    X_train, y_train = over.fit_resample(X_train, y_train)
    under = RandomUnderSampler(sampling_strategy=0.5)
    X_train, y_train = under.fit_resample(X_train, y_train)
    
    print_result(X_train, X_test, y_train, y_test)

Shapes:  (2079, 2) (3000, 2) (2079,) (3000,)
Static performance:  
 [[2872   98]
 [   6   24]]

              precision    recall  f1-score   support

           0       1.00      0.97      0.98      2970
           1       0.20      0.80      0.32        30

    accuracy                           0.97      3000
   macro avg       0.60      0.88      0.65      3000
weighted avg       0.99      0.97      0.98      3000

AUC score:  0.9604545454545456

Shapes:  (2079, 2) (3000, 2) (2079,) (3000,)
Static performance:  
 [[2908   62]
 [   7   23]]

              precision    recall  f1-score   support

           0       1.00      0.98      0.99      2970
           1       0.27      0.77      0.40        30

    accuracy                           0.98      3000
   macro avg       0.63      0.87      0.69      3000
weighted avg       0.99      0.98      0.98      3000

AUC score:  0.9728451178451178

Shapes:  (2079, 2) (3000, 2) (2079,) (3000,)
Static performance:  
 [[2869  101]
 [   3   27]]

              precision    recall  f1-score   support

           0       1.00      0.97      0.98      2970
           1       0.21      0.90      0.34        30

    accuracy                           0.97      3000
   macro avg       0.60      0.93      0.66      3000
weighted avg       0.99      0.97      0.98      3000

AUC score:  0.92172278338945

Shapes:  (2079, 2) (3000, 2) (2079,) (3000,)
Static performance:  
 [[2902   68]
 [   9   21]]

              precision    recall  f1-score   support

           0       1.00      0.98      0.99      2970
           1       0.24      0.70      0.35        30

    accuracy                           0.97      3000
   macro avg       0.62      0.84      0.67      3000
weighted avg       0.99      0.97      0.98      3000

AUC score:  0.9804545454545455

Shapes:  (2079, 2) (3000, 2) (2079,) (3000,)
Static performance:  
 [[2919   51]
 [   6   24]]

              precision    recall  f1-score   support

           0       1.00      0.98      0.99      2970
           1       0.32      0.80      0.46        30

    accuracy                           0.98      3000
   macro avg       0.66      0.89      0.72      3000
weighted avg       0.99      0.98      0.98      3000

AUC score:  0.9852020202020202


Borderline-SMOTE

from sklearn.model_selection import RepeatedStratifiedKFold
from imblearn.over_sampling import BorderlineSMOTE

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=1)

for train_idx, test_idx in cv.split(X_org, y_org):
    X_train, y_train = X_org[train_idx], y_org[train_idx]
    X_test, y_test = X_org[test_idx], y_org[test_idx]
   
    over = BorderlineSMOTE(sampling_strategy=0.1)
    X_train, y_train = over.fit_resample(X_train, y_train)
    under = RandomUnderSampler(sampling_strategy=0.5)  # ratio
    X_train, y_train = under.fit_resample(X_train, y_train)
    
    print_result(X_train, X_test, y_train, y_test)

Shapes:  (2376, 2) (2000, 2) (2376,) (2000,)
Static performance:  
 [[1954   26]
 [   5   15]]

              precision    recall  f1-score   support

           0       1.00      0.99      0.99      1980
           1       0.37      0.75      0.49        20

    accuracy                           0.98      2000
   macro avg       0.68      0.87      0.74      2000
weighted avg       0.99      0.98      0.99      2000

AUC score:  0.9435606060606061

Shapes:  (2376, 2) (2000, 2) (2376,) (2000,)
Static performance:  
 [[1936   44]
 [   0   20]]

              precision    recall  f1-score   support

           0       1.00      0.98      0.99      1980
           1       0.31      1.00      0.48        20

    accuracy                           0.98      2000
   macro avg       0.66      0.99      0.73      2000
weighted avg       0.99      0.98      0.98      2000

AUC score:  0.9959469696969697

Shapes:  (2376, 2) (2000, 2) (2376,) (2000,)
Static performance:  
 [[1960   20]
 [   7   13]]

              precision    recall  f1-score   support

           0       1.00      0.99      0.99      1980
           1       0.39      0.65      0.49        20

    accuracy                           0.99      2000
   macro avg       0.70      0.82      0.74      2000
weighted avg       0.99      0.99      0.99      2000

AUC score:  0.9078030303030303

Shapes:  (2376, 2) (2000, 2) (2376,) (2000,)
Static performance:  
 [[1960   20]
 [   8   12]]

              precision    recall  f1-score   support

           0       1.00      0.99      0.99      1980
           1       0.38      0.60      0.46        20

    accuracy                           0.99      2000
   macro avg       0.69      0.79      0.73      2000
weighted avg       0.99      0.99      0.99      2000

AUC score:  0.9350883838383839

Shapes:  (2376, 2) (2000, 2) (2376,) (2000,)
Static performance:  
 [[1956   24]
 [   5   15]]

              precision    recall  f1-score   support

           0       1.00      0.99      0.99      1980
           1       0.38      0.75      0.51        20

    accuracy                           0.99      2000
   macro avg       0.69      0.87      0.75      2000
weighted avg       0.99      0.99      0.99      2000

AUC score:  0.961590909090909
