[Machine Learning] Midterm Summary 2
week 6
Imbalance Problem
SMOTE (Synthetic Minority Oversampling Technique)
Making an imbalanced dataset
make_classification
: n_samples, n_features, n_redundant, n_classes, n_clusters_per_class, weights, flip_y, random_state
- n_samples: number of samples
- n_features: number of features
- n_redundant: number of redundant features
- n_classes: number of classes (labels)
- n_clusters_per_class: number of clusters per class
- weights: proportion of samples assigned to each class
- flip_y: fraction of samples whose class is assigned randomly. Larger values introduce noise in the labels and make the classification task harder.
from sklearn.datasets import make_classification
# Highly imbalanced dataset
X_org, y_org = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                                   n_clusters_per_class=1, weights=[0.99], flip_y=0,
                                   random_state=1)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc
from sklearn.model_selection import train_test_split

def print_result(X_train, X_test, y_train, y_test):
    # Train a fixed RandomForest and report confusion matrix, classification report, and AUC
    model = RandomForestClassifier(n_estimators=100, max_depth=5)
    print("Shapes: ", X_train.shape, X_test.shape, y_train.shape, y_test.shape)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print("Static performance: ", "\n", confusion_matrix(y_test, y_pred))
    print()
    print(classification_report(y_test, y_pred))
    y_pred_proba = model.predict_proba(X_test)
    fpr, tpr, _ = roc_curve(y_test, y_pred_proba[:, 1])
    print("AUC score: ", auc(fpr, tpr))
    print()

X, y = X_org.copy(), y_org.copy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)
print_result(X_train, X_test, y_train, y_test)
''' Bad recall for class 1!!! Performance on the minority class is markedly worse.
Shapes: (7000, 2) (3000, 2) (7000,) (3000,)
Static performance:
[[2969 1]
[ 18 12]]
precision recall f1-score support
0 0.99 1.00 1.00 2970
1 0.92 0.40 0.56 30
accuracy 0.99 3000
macro avg 0.96 0.70 0.78 3000
weighted avg 0.99 0.99 0.99 3000
AUC score: 0.9161672278338945
'''
SMOTE-resampled dataset (oversampling)
over_sampling.SMOTE
: sampling_strategy, k_neighbors, random_state
- fit_resample: X, y
from imblearn.over_sampling import SMOTE
oversample = SMOTE()
X, y = oversample.fit_resample(X_org, y_org)
Undersampled dataset
under_sampling.RandomUnderSampler
: sampling_strategy, random_state
- fit_resample: X, y
from imblearn.under_sampling import RandomUnderSampler
under = RandomUnderSampler()
X, y = under.fit_resample(X_org, y_org)
Borderline-SMOTE
Borderline-SMOTE generates new samples only in regions where majority-class and minority-class samples appear in similar proportions (i.e., near the class border), so noisy minority samples can be ignored.
BorderlineSMOTE
: sampling_strategy, k_neighbors, m_neighbors, random_state
from imblearn.over_sampling import BorderlineSMOTE
oversample = BorderlineSMOTE(sampling_strategy=0.1)
X, y = oversample.fit_resample(X_org, y_org)
Caution!!!
Resampling must be applied to the training set only; the test set must contain only real data (see the sketch below).
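A minimal sketch of that workflow, reusing X_org, y_org and print_result from above: split first, then resample only the training portion.
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Split first so the test set keeps the original (real) class distribution
X_train, X_test, y_train, y_test = train_test_split(X_org, y_org, test_size=0.3, stratify=y_org)
# Oversample only the training data
X_train_res, y_train_res = SMOTE().fit_resample(X_train, y_train)
# Train on the resampled data, evaluate on the untouched test set
print_result(X_train_res, X_test, y_train_res, y_test)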
Lab2: Server-based Machine Learning Deployment
# Saving and loading a model
import pickle
pickle.dump(model, open("path/to/model.pkl", "wb"))
model = pickle.load(open("path/to/model.pkl", "rb"))
# Rendering web pages
from flask import Flask, request, jsonify, render_template
Flask(__name__, template_folder)
- one_off.py
  - Returns a prediction immediately for each incoming input (see the minimal sketch below)
- batch_app.py
  - Returns a prediction immediately, but also stores each input separately
  - Once the stored inputs reach a certain count, retrains the model on that data
  - Saves the retrained model
- real_time.py
  - Returns a prediction immediately, but retrains on every single input
  - Saves the retrained model
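A minimal sketch of what a one_off-style endpoint might look like, assuming a pickled scikit-learn model saved as model.pkl and a JSON body with a "features" list (the route, file name, and field name here are illustrative, not the lab's actual code):
import pickle
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)
model = pickle.load(open("model.pkl", "rb"))  # hypothetical model file

@app.route("/predict", methods=["POST"])
def predict():
    # expects e.g. {"features": [5.1, 3.5, 1.4, 0.2]}
    features = np.array(request.json["features"]).reshape(1, -1)
    pred = model.predict(features)
    return jsonify({"prediction": pred.tolist()})

if __name__ == "__main__":
    app.run()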
week 7
OVR (One vs Rest)
For n classes, a binary classifier separating one class from the rest is trained n times (one per class), and together the n classifiers distinguish the n classes.
OneVsRestClassifier
: estimator (the base model)
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# SVC natively handles multiclass via one-vs-one; wrap it for one-vs-rest
classifier = OneVsRestClassifier(SVC(kernel='rbf', C=1000, gamma=0.1, probability=True))  # enable probability estimates
classifier = classifier.fit(X_train, y_train)
SVM
SVC (Classifier)
- Loss function: hinge loss (see the formula below)
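For reference (not written out in the original notes), the hinge loss for a label y ∈ {-1, +1} and decision value f(x) is:
L(y, f(x)) = max(0, 1 - y·f(x))
It is zero when the sample lies on the correct side of the margin (y·f(x) ≥ 1) and grows linearly otherwise.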
SVC
: kernel, degree, C, gamma, probability
- kernel: linear, poly, rbf, sigmoid, precomputed (default=rbf)
- degree: degree of the polynomial kernel function (Ignored by all other kernels)
- C: inverse of regularization strength (a larger C means weaker regularization, i.e., higher model complexity: more sensitive to noise, higher risk of overfitting)
- gamma: inverse of the kernel width (a larger gamma narrows the range of points treated as neighbors). 'auto' = 1 / n_features
- probability: Whether to enable probability estimates.
from sklearn.svm import SVC
svm_clf = SVC(kernel="linear", C=10)
svm_clf.fit(X_train, y_train)
svm_clf = SVC(kernel='rbf', C=1000, gamma=0.1, probability=True)
svm_clf.fit(X_train, y_train)
svm_clf = SVC(kernel='poly', degree=3, C=100, gamma=0.1, probability=True)
svm_clf.fit(X_train, y_train)
C and gamma: (figure removed: accuracy plotted against C and gamma for the train set and the test set)
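One way to reproduce such train/test curves is sklearn's validation_curve; a sketch assuming the X_train, y_train from above (the gamma range and cv value are illustrative):
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

gamma_range = np.logspace(-3, 2, 6)
train_scores, cv_scores = validation_curve(
    SVC(kernel='rbf', C=1000), X_train, y_train,
    param_name='gamma', param_range=gamma_range, cv=3)
# Training accuracy keeps climbing as gamma grows (overfitting),
# while the cross-validated accuracy peaks at an intermediate gamma
print(train_scores.mean(axis=1))
print(cv_scores.mean(axis=1))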
SVR (Regressor)
- Loss function: epsilon-insensitive loss (see the formula below)
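For reference, the epsilon-insensitive loss charges nothing for errors smaller than epsilon:
L_ε(y, f(x)) = max(0, |y - f(x)| - ε)
Points inside the epsilon "street" contribute no loss, which is why a larger epsilon gives a simpler, more error-tolerant model.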
SVR
: kernel, degree, C, gamma, epsilon
- kernel: linear, poly, rbf, sigmoid, precomputed (default=rbf)
- degree: degree of the polynomial kernel function (Ignored by all other kernels)
- C: inverse of regularization strength (a larger C means weaker regularization, i.e., higher model complexity: more sensitive to noise, higher risk of overfitting)
- gamma: inverse of the kernel width (a larger gamma narrows the range of points treated as neighbors). 'auto' = 1 / n_features
- epsilon: no training loss is charged for points predicted within a distance epsilon of the actual value (a larger epsilon widens the "street" of predictions considered correct, i.e., the model tolerates larger errors and becomes less complex)
from sklearn.svm import SVR
svr1 = SVR(kernel='linear', epsilon=0.1).fit(x, y)
svr2 = SVR(epsilon=0.01).fit(x, y)
svr3 = SVR(epsilon=1.).fit(x, y)
import matplotlib.pyplot as plt

f, axarr = plt.subplots(1, 3, sharex='col', sharey='row', figsize=(14, 6))
for idx, svr_n, tt in zip(range(3),
                          [svr1, svr2, svr3],
                          ['linear, epsilon=0.1', 'rbf, epsilon=0.01', 'rbf, epsilon=1.0']):
    axarr[idx].scatter(x, y, s=5, color="blue", label="original")
    yfit = svr_n.predict(x)
    axarr[idx].plot(x, yfit, lw=2, color="red")
    axarr[idx].set_title(tt)
plt.show()
print(svr1.score(x, y), svr2.score(x, y), svr3.score(x, y))
KNN (K Nearest Neighbors)
At test time, KNN computes the distances between a new data point and the neighboring training points.
Unusually, then, there is essentially nothing to do at training time, while prediction at test time is what takes long.
KNeighborsClassifier
: n_neighbors, weights
- n_neighbors: Number of neighbors to use by default for kneighbors queries.
- weights: Weight function used in prediction.
from sklearn.neighbors import KNeighborsClassifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print("kNN score: {:.2f}".format(knn.score(X_test, y_test))) # kNN score: 0.98
Decision Tree
- Try splitting on every feature, and use the feature that separates the data best for the final split
- Greedy algorithm: always takes the best decision available at the moment
- Loss function: impurity (see the formulas below)
- Gini index: values in [0, 0.5] (binary case)
- Entropy: values in [0, 1] (binary case)
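For reference, with p_k the proportion of class k in a node:
- Gini = 1 - Σ_k p_k²  (0 for a pure node; 0.5 for two perfectly mixed classes)
- Entropy = -Σ_k p_k log₂(p_k)  (0 for a pure node; 1 for two perfectly mixed classes)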
Classifier
DecisionTreeClassifier
: max_depth, min_samples_split, min_samples_leaf, max_features, max_leaf_nodes
- max_depth: The maximum depth of the tree.
- min_samples_split: The minimum number of samples required to split an internal node.
- min_samples_leaf: The minimum number of samples required to be at a leaf node.
- max_features: The number of features to consider when looking for the best split.
- max_leaf_nodes: The maximum number of leaf nodes of the tree.
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(max_depth=2)
clf.fit(X, y)
clf.score(X, y)
tree.plot_tree
: tree model, filled
from sklearn import tree
tree.plot_tree(clf, filled=True)  # filled=True -> color each node by its majority class
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.xlim(4, 8.5)
plt.ylim(1.5, 4.5)
markers = ['o', '+', '^']
for i in range(3):
    xs = X[:, 0][y == i]
    ys = X[:, 1][y == i]
    plt.scatter(xs, ys, marker=markers[i])
plt.legend(iris.target_names)
plt.xlabel("Sepal length")
plt.ylabel("Sepal width")
# Decision tree boundaries: solid line = root node split, dashed lines = child node splits
xx = np.linspace(5.45, 5.45, 3)
yy = np.linspace(1.5, 4.5, 3)
plt.plot(xx, yy, '-k')  # black solid line
xx = np.linspace(4, 5.45, 3)
yy = np.linspace(2.8, 2.8, 3)
plt.plot(xx, yy, '--b')  # blue dashed line
xx = np.linspace(6.15, 6.15, 3)
yy = np.linspace(1.5, 4.5, 3)
plt.plot(xx, yy, '--r')  # red dashed line
Regressor
DecisionTreeRegressor
: max_depth, min_samples_split, min_samples_leaf, max_features, max_leaf_nodes
from sklearn.tree import DecisionTreeRegressor
X = iris.data[:, :2]   # features: sepal length, sepal width
y = iris.data[:, 2]    # regression target: petal length
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
tr_reg1 = DecisionTreeRegressor(max_depth=2)
tr_reg1.fit(X_train, y_train)
tr_reg1.score(X_test, y_test)
Ensemble
Voting (Stacking)
- Hard voting: the final prediction is the class chosen by the largest number of models (majority vote)
- Soft voting: considers the class probabilities predicted by each model, averages them per class, and picks the class with the highest average (see the soft-voting sketch below)
VotingClassifier
: estimators, voting, weights
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

clf1 = LogisticRegression(multi_class='multinomial', random_state=1)
clf2 = RandomForestClassifier(n_estimators=50, random_state=1)
clf3 = GaussianNB()
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])
eclf1 = VotingClassifier(estimators=[
    ('lr', clf1), ('rf', clf2), ('gnb', clf3)], voting='hard')
eclf1 = eclf1.fit(X, y)
print(eclf1.predict(X)) # [1 1 1 2 2 2]
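A soft-voting variant of the same ensemble (the weights here are illustrative): the weighted average of the predicted class probabilities decides the final class.
eclf2 = VotingClassifier(estimators=[
    ('lr', clf1), ('rf', clf2), ('gnb', clf3)],
    voting='soft', weights=[2, 1, 1])
eclf2 = eclf2.fit(X, y)
print(eclf2.predict_proba(X[:1]))  # weighted-average class probabilities for the first sample
print(eclf2.predict(X))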
RandomForest (Bagging)
- Draws n samples with replacement (bootstrap), trains a model on each, and averages the results; this is repeated n times
- Reduces overfitting and yields a more general model
- Parallel: the base models are independent of one another and can be trained in parallel (a generic BaggingClassifier sketch follows the random forest example below)
![Ensemble Learning: Bagging & Boosting | by Fernando López | Towards Data Science](https://miro.medium.com/max/2000/1*zTgGBTQIMlASWm5QuS2UpA.jpeg)
RandomForestClassifier/RandomForestRegressor
: n_estimators, max_depth, min_samples_split, min_samples_leaf, max_features, max_leaf_nodes
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
np.random.seed(9)
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target)
rfc = RandomForestClassifier(n_estimators=500)
rfc.fit(X_train, y_train)
print(rfc.score(X_test, y_test))
# feature_importances_
list(zip(cancer.feature_names, rfc.feature_importances_.round(4)))[:10]
'''
[('mean radius', 0.0),
('mean texture', 0.0417),
('mean perimeter', 0.0),
('mean area', 0.0),
('mean smoothness', 0.0),
('mean compactness', 0.0),
('mean concavity', 0.0),
('mean concave points', 0.0426),
('mean symmetry', 0.0114),
('mean fractal dimension', 0.0)]
'''
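Bagging is not tied to random forests; a minimal sketch on the same cancer split with sklearn's generic BaggingClassifier (its default base estimator is a decision tree; the parameter values here are illustrative):
from sklearn.ensemble import BaggingClassifier

bag = BaggingClassifier(n_estimators=100, max_samples=0.8, bootstrap=True)
bag.fit(X_train, y_train)
print(bag.score(X_test, y_test))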
Boosting
- Bagging uses independent input data (sampled with replacement) and makes independent predictions, whereas in boosting each previous model influences the next one
- Focuses on getting the hard-to-predict samples right
- Sequential: the models must be built one after another (see the staged_predict check after the code below)
GradientBoostingClassifier/GradientBoostingRegressor
: n_estimators, learning_rate, max_depth, min_samples_split, min_samples_leaf, max_features, max_leaf_nodes
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier(n_estimators=500, learning_rate = 0.15)
gbc.fit(X_train, y_train)
print(gbc.score(X_test, y_test)) # 0.9790209790209791
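To see the sequential nature in action, staged_predict yields the ensemble's prediction after each boosting stage; a small illustrative check (not from the original lab):
from sklearn.metrics import accuracy_score

staged_acc = [accuracy_score(y_test, y_pred) for y_pred in gbc.staged_predict(X_test)]
print(staged_acc[0], staged_acc[49], staged_acc[-1])  # accuracy after 1, 50, and 500 stages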
week 8
Hyperparameter Tuning
Grid Search
GridSearchCV
: estimator, param_grid, cv
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

params = [{"max_depth": [10, 20, 30],
           "max_features": [0.3, 0.5, 0.9]}]
# grid search over every parameter combination with 3-fold cross-validation
clf = GridSearchCV(RandomForestRegressor(), params, cv=3, n_jobs=-1)
clf.fit(X_train, y_train)
print("best values: ", clf.best_estimator_)
print("best score: ", clf.best_score_)
# final evaluation on test data
score = clf.score(X_test, y_test)
print("final score: ", score)
'''
best values: RandomForestRegressor(max_depth=30, max_features=0.9)
best score: 0.9464442332854327
final score: 0.9589943180667339
'''
Randomized Search
RandomizedSearchCV
: estimator, param_distributions, n_iter, cv
from sklearn.model_selection import RandomizedSearchCV
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
max_features = ['log2', 'sqrt']
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
# Randomized search tries only a sampled subset of the settings (n_iter, default 10)
rf = RandomizedSearchCV(RandomForestRegressor(), random_grid,
                        cv=3, n_jobs=-1)
# Fit the random search model
rf.fit(X_train, y_train)
print(rf.best_params_, rf.best_estimator_, rf.best_score_)
score = rf.score(X_test, y_test)
print('final score:', score)
'''
{'n_estimators': 200, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_features': 'log2', 'max_depth': None, 'bootstrap': False} RandomForestRegressor(bootstrap=False, max_features='log2', min_samples_leaf=2,
min_samples_split=5, n_estimators=200) 0.9162766832621637
final score: 0.9365843295319143
'''