[Machine Learning] Classification Performance
Metrics
Static performance
Confusion matrix - accuracy, precision, recall (sensitivity), f1
Dynamic performance
ROC/AUC
Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# performance evaluation library
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc
from sklearn.preprocessing import StandardScaler, LabelEncoder
%matplotlib inline
Static performance and confusion matrix
Make dataset
# evaluation (prediction) score: score or probability
y_score = np.linspace(99, 60, 20).round(1)
print(y_score)
[99. 96.9 94.9 92.8 90.8 88.7 86.7 84.6 82.6 80.5 78.5 76.4 74.4 72.3
70.3 68.2 66.2 64.1 62.1 60. ]
# Prediction classes
y_pred=[1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0]
len(y_pred)
y_pred.count(1), y_pred.count(0)
(14, 6)
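These hard labels are consistent with thresholding the scores: any cutoff between 70.3 and 72.3 reproduces y_pred. A minimal sketch (the cutoff 71 below is an arbitrary illustrative choice):
# y_pred is equivalent to cutting y_score at ~71 (arbitrary cutoff between 70.3 and 72.3)
y_pred_from_score = (y_score >= 71).astype(int).tolist()
print(y_pred_from_score == y_pred)   # True: 14 ones followed by 6 zeros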
# Real classes
y_test=[1,1,0,1,0,1,1,1,0,0,1,0,1,1,0,1,0,0,0,0]
y_test.count(1), y_test.count(0)
(10, 10)
pd.DataFrame({'y_test': y_test, 'y_pred': y_pred})
 | y_test | y_pred |
---|---|---|
0 | 1 | 1 |
1 | 1 | 1 |
2 | 0 | 1 |
3 | 1 | 1 |
4 | 0 | 1 |
5 | 1 | 1 |
6 | 1 | 1 |
7 | 1 | 1 |
8 | 0 | 1 |
9 | 0 | 1 |
10 | 1 | 1 |
11 | 0 | 1 |
12 | 1 | 1 |
13 | 1 | 1 |
14 | 0 | 0 |
15 | 1 | 0 |
16 | 0 | 0 |
17 | 0 | 0 |
18 | 0 | 0 |
19 | 0 | 0 |
Confusion Matrix
confusion_matrix(y_test, y_pred)
array([[5, 5],
[1, 9]], dtype=int64)
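In scikit-learn's confusion_matrix the rows are the true classes and the columns are the predicted classes (label order 0, 1), so the four cells can be unpacked with ravel(). A quick check of the cell meanings:
# rows = true class, columns = predicted class (label order: 0, 1)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(tn, fp, fn, tp)                         # 5 5 1 9
print('accuracy =', (tp + tn) / len(y_test))  # (9 + 5) / 20 = 0.7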
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.83      0.50      0.62        10
           1       0.64      0.90      0.75        10

    accuracy                           0.70        20
   macro avg       0.74      0.70      0.69        20
weighted avg       0.74      0.70      0.69        20
- Precision = of the samples the model predicted as positive, the fraction that are actually positive: TP / (TP + FP)
- Recall = of the actual positives, the fraction the model predicted as positive: TP / (TP + FN)
- f1-score = the harmonic mean of Precision and Recall: (2 x Precision x Recall) / (Precision + Recall)
- support = the number of true samples of each class in y_test
- precision_0 = 5/(5+1) = 0.83
- precision_1 = 9/(5+9) = 0.64
- macro average precision = (0.83 + 0.64)/2 = 0.735
- micro average precision = (5+9)/(6+14) = 0.7
- weighted average precision = 0.83x10/20 + 0.64x10/20 = 0.735, same as the macro average here because both classes have support 10 (see the cross-check below)
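The averages above can be reproduced directly with sklearn's precision_score; a minimal sketch whose values should match the classification report:
from sklearn.metrics import precision_score
print(precision_score(y_test, y_pred, average='macro'))     # ~0.74
print(precision_score(y_test, y_pred, average='micro'))     # 0.7 (= accuracy)
print(precision_score(y_test, y_pred, average='weighted'))  # ~0.74 (equal supports)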
Dynamic performance
- Also called ranking-based or score-based evaluation
Make dataset
result = pd.DataFrame(list(zip(y_score, y_pred, y_test)),
columns=['score', 'predict', 'real'])
result['correct'] = (result.predict == result.real)
result.head(20)
 | score | predict | real | correct |
---|---|---|---|---|
0 | 99.0 | 1 | 1 | True |
1 | 96.9 | 1 | 1 | True |
2 | 94.9 | 1 | 0 | False |
3 | 92.8 | 1 | 1 | True |
4 | 90.8 | 1 | 0 | False |
5 | 88.7 | 1 | 1 | True |
6 | 86.7 | 1 | 1 | True |
7 | 84.6 | 1 | 1 | True |
8 | 82.6 | 1 | 0 | False |
9 | 80.5 | 1 | 0 | False |
10 | 78.5 | 1 | 1 | True |
11 | 76.4 | 1 | 0 | False |
12 | 74.4 | 1 | 1 | True |
13 | 72.3 | 1 | 1 | True |
14 | 70.3 | 0 | 0 | True |
15 | 68.2 | 0 | 1 | False |
16 | 66.2 | 0 | 0 | True |
17 | 64.1 | 0 | 0 | True |
18 | 62.1 | 0 | 0 | True |
19 | 60.0 | 0 | 0 | True |
ROC and AUC
Evaluating performance with ROC (it evaluates how well the scores rank positives ahead of negatives)
- tpr = TP/P = TP/(TP+FN): the fraction of actual positives that are predicted positive (= recall)
- fpr = FP/N = FP/(FP+TN): the fraction of actual negatives that are predicted positive
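Each point on the ROC curve comes from one threshold. As a sanity check, here is a hand computation at a single threshold (80.0 is an arbitrary choice between the scores 80.5 and 78.5); it reproduces the point that roc_curve reports at threshold 80.5:
# tpr/fpr at one arbitrary threshold of 80.0
y_true = np.array(y_test)
pred_at_t = (y_score >= 80.0).astype(int)
tp = ((pred_at_t == 1) & (y_true == 1)).sum()
fp = ((pred_at_t == 1) & (y_true == 0)).sum()
print(tp / (y_true == 1).sum(), fp / (y_true == 0).sum())   # tpr = 0.6, fpr = 0.4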
# fpr = dict()
# tpr = dict()
# roc_auc = dict()
fpr, tpr, thresholds1 = roc_curve(y_test, y_score)
roc_auc = auc(fpr, tpr)
y_score, y_test
(array([99. , 96.9, 94.9, 92.8, 90.8, 88.7, 86.7, 84.6, 82.6, 80.5, 78.5,
76.4, 74.4, 72.3, 70.3, 68.2, 66.2, 64.1, 62.1, 60. ]),
[1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0])
pd.DataFrame([thresholds1, tpr, fpr], index=['threshold','tpr','fpr'])
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
threshold | 100.0 | 99.0 | 96.9 | 94.9 | 92.8 | 90.8 | 84.6 | 80.5 | 78.5 | 76.4 | 72.3 | 70.3 | 68.2 | 60.0 |
tpr | 0.0 | 0.1 | 0.2 | 0.2 | 0.3 | 0.3 | 0.6 | 0.6 | 0.7 | 0.7 | 0.9 | 0.9 | 1.0 | 1.0 |
fpr | 0.0 | 0.0 | 0.0 | 0.1 | 0.1 | 0.2 | 0.2 | 0.4 | 0.4 | 0.5 | 0.5 | 0.6 | 0.6 | 1.0 |
# just to see how many 1s and 0s are in the test set
total_p, total_n = (np.array(y_test)==1).sum(), (np.array(y_test)==0).sum()
total_p, total_n
(10, 10)
plt.figure(figsize=(6,6))
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.legend(loc="lower right")
<matplotlib.legend.Legend at 0x2374a4336d0>
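roc_auc_score gives the same number in one line as the trapezoidal auc(fpr, tpr) used above, so it serves as a quick cross-check:
from sklearn.metrics import roc_auc_score
print(roc_auc_score(y_test, y_score))   # should equal roc_auc above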
Comparing the ability of three people (the same scores evaluated against three different sets of true labels)
y_real=[[1,0,0,0,0,0,1,1,0,0,1,0,1,1,0,1,0,1,0,0],
[1,1,0,1,1,0,1,1,0,0,1,0,1,1,0,1,0,0,0,0],
[1,1,1,1,1,1,0,1,0,1,1,1,0,0,0,0,0,0,0,0]]
y_score, y_real[0]
(array([99. , 96.9, 94.9, 92.8, 90.8, 88.7, 86.7, 84.6, 82.6, 80.5, 78.5,
76.4, 74.4, 72.3, 70.3, 68.2, 66.2, 64.1, 62.1, 60. ]),
[1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0])
plt.figure(figsize=(6,6))
plt.plot([0, 1], [0, 1], linestyle='--')
my_color = ['r', 'b', 'k']
for i in range(3):
    fpr, tpr, _ = roc_curve(y_real[i], y_score)
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, c=my_color[i])
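To compare the three curves numerically as well, the AUC computed inside the loop can simply be printed; a minimal extension of the loop above:
# print the AUC for each label set (colors match the curves above)
for i in range(3):
    fpr, tpr, _ = roc_curve(y_real[i], y_score)
    print('person %d (%s): AUC = %.3f' % (i, my_color[i], auc(fpr, tpr)))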
Precision and Recall
- Precision = TruePositives / (TruePositives + FalsePositives)
- Recall = TruePositives / (TruePositives + FalseNegatives)
- Both the precision and the recall are focused on only the positive class (the minority class) and are unconcerned with the true negatives (majority class).
- precision-recall curve (PR curve): precision and recall at different probability thresholds
- Precision-recall curves (PR curves) are recommended for highly skewed domains where ROC curves may provide an excessively optimistic view of the performance.
y_pred
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
from sklearn.metrics import average_precision_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import plot_precision_recall_curve
precision, recall, thresholds2 = precision_recall_curve(y_test, y_score)
precision, recall, thresholds2[::-1], y_score
(array([0.625 , 0.6 , 0.64285714, 0.61538462, 0.58333333,
0.63636364, 0.6 , 0.66666667, 0.75 , 0.71428571,
0.66666667, 0.6 , 0.75 , 0.66666667, 1. ,
1. , 1. ]),
array([1. , 0.9, 0.9, 0.8, 0.7, 0.7, 0.6, 0.6, 0.6, 0.5, 0.4, 0.3, 0.3,
0.2, 0.2, 0.1, 0. ]),
array([99. , 96.9, 94.9, 92.8, 90.8, 88.7, 86.7, 84.6, 82.6, 80.5, 78.5,
76.4, 74.4, 72.3, 70.3, 68.2]),
array([99. , 96.9, 94.9, 92.8, 90.8, 88.7, 86.7, 84.6, 82.6, 80.5, 78.5,
76.4, 74.4, 72.3, 70.3, 68.2, 66.2, 64.1, 62.1, 60. ]))
thresholds1, thresholds2[::-1] # slightly different: roc_curve prepends a threshold above the maximum score (100.0) and drops collinear intermediate thresholds, while precision_recall_curve stops at the threshold where recall first reaches 1.0 (68.2)
(array([100. , 99. , 96.9, 94.9, 92.8, 90.8, 84.6, 80.5, 78.5,
76.4, 72.3, 70.3, 68.2, 60. ]),
array([99. , 96.9, 94.9, 92.8, 90.8, 88.7, 86.7, 84.6, 82.6, 80.5, 78.5,
76.4, 74.4, 72.3, 70.3, 68.2]))
auc_score = auc(recall, precision)
plt.plot(recall, precision, label='Precision-Recall curve (area = %0.2f)' % auc_score)
plt.legend(loc="upper right")
<matplotlib.legend.Legend at 0x2374a623f10>
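average_precision_score (imported above but not used so far) is another common one-number summary of the PR curve; it uses a step-wise sum rather than the trapezoidal rule, so it is close to, but not necessarily identical to, auc(recall, precision):
ap = average_precision_score(y_test, y_score)
print('Average precision: %.3f' % ap)   # compare with the trapezoidal PR AUC above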
An example
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt
X, y = make_classification(n_samples=1000, n_classes=2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=2)
model = LogisticRegression()
model.fit(X_train, y_train)
y_score = model.predict_proba(X_test)
plt.figure(figsize=(10,6))
# ROC curve
plt.subplot(1,2,1)
fpr, tpr, _ = roc_curve(y_test, y_score[:,1])
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, marker='.', label='Logistic (area = %0.2f)' % roc_auc)
plt.xlabel('fpr')
plt.ylabel('tpr')
plt.legend()
# PR curve
plt.subplot(1,2,2)
precision, recall, thresholds = precision_recall_curve(y_test, y_score[:,1])
auc_score = auc(recall, precision)
plt.plot(recall, precision, marker='.', label='Logistic (area = %0.2f)' % auc_score)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend()
plt.show()
print( thresholds[:10])
[0.0061375 0.00623691 0.0064424 0.00653738 0.00726041 0.00734682
0.00766567 0.00784434 0.00840961 0.00853147]
(y == 0).sum(), (y == 1).sum() # balanced
(501, 499)
In general, the higher the AUC score, the better the model. However, be very careful when the dataset is highly imbalanced.
Another example with a highly imbalanced dataset
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import auc
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.99, 0.01], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
print('Dataset: Class0=%d, Class1=%d' % (len(y[y==0]), len(y[y==1])))
print('Train: Class0=%d, Class1=%d' % (len(y_train[y_train==0]), len(y_train[y_train==1])))
print('Test: Class0=%d, Class1=%d' % (len(y_test[y_test==0]), len(y_test[y_test==1])))
Dataset: Class0=985, Class1=15
Train: Class0=492, Class1=8
Test: Class0=493, Class1=7
# roc curve and roc auc on an imbalanced dataset
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from matplotlib import pyplot
# plot the model's roc curve
def plot_roc_curve(y_test, model_score, auc):
    fpr, tpr, _ = roc_curve(y_test, model_score)
    plt.plot(fpr, tpr, marker='.', label='Logistic (area = %0.2f)' % auc)
    pyplot.xlabel('fpr')
    pyplot.ylabel('tpr')
    pyplot.legend()
    pyplot.show()
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.99, 0.01], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
model = LogisticRegression(solver='lbfgs')
model.fit(X_train, y_train)
y_score = model.predict_proba(X_test)
model_score = y_score[:, 1] # prob[yi=1]
roc_auc = roc_auc_score(y_test, model_score)
print('Logistic ROC AUC %.3f' % roc_auc)
plot_roc_curve(y_test, model_score, roc_auc)
Logistic ROC AUC 0.869
def plot_pr_curve(y_test, model_score, auc):
    precision, recall, _ = precision_recall_curve(y_test, model_score)
    plt.plot(recall, precision, marker='.', label='Logistic (area = %0.2f)' % auc)
    pyplot.xlabel('recall')
    pyplot.ylabel('precision')
    pyplot.legend()
    pyplot.show()
model = LogisticRegression(solver='lbfgs')
model.fit(X_train, y_train)
y_score = model.predict_proba(X_test)
model_score = y_score[:, 1] # prob[yi=1]
precision, recall, _ = precision_recall_curve(y_test, model_score)
auc_score = auc(recall, precision)
print('Logistic PR AUC: %.3f' % auc_score)
plot_pr_curve(y_test, model_score, auc_score )
Logistic PR AUC: 0.228
- The PR curve is a zig-zag line whose precision stays close to zero over much of the recall range, and the PR AUC (0.228) is far below the ROC AUC (0.869).
- Notice that the ROC and PR curves tell a different story.
- The PR curve focuses on the positive (minority) class, whereas the ROC curve covers both classes.
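For context, the no-skill baselines differ between the two curves: a random classifier scores about 0.5 in ROC AUC regardless of class balance, while the no-skill PR curve sits at the positive-class prevalence, so its area is roughly 7/500 ≈ 0.014 on this test set. A minimal sketch of that baseline (the variable name no_skill is just for illustration):
# no-skill PR baseline = prevalence of the positive class in the test set
no_skill = (y_test == 1).sum() / len(y_test)
print('No-skill PR baseline: %.3f' % no_skill)           # 7/500 = 0.014, far below the 0.228 above
plt.plot([0, 1], [no_skill, no_skill], linestyle='--')   # horizontal reference line for a PR plot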