[Machine Learning] Hyperparameter Tuning



Hyperparameter Tuning

A Practical Guide To Hyperparameter Optimization.

  • to choose a set of optimal hyperparameters for a learning algorithm
  • example case: a bike renting analysis problem (from Kaggle)


Setup

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn import svm
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import make_scorer
%matplotlib inline


Import dataset

!curl -L https://goo.gl/s8qSL5  -o ./bike_train.csv
# !curl https://goo.gl/s8qSL5  -o ../Lab_M2/data/bike_train.csv


train = pd.read_csv("bike_train.csv")
train.dtypes
datetime       object
season          int64
holiday         int64
workingday      int64
weather         int64
temp          float64
atemp         float64
humidity        int64
windspeed     float64
casual          int64
registered      int64
count           int64
dtype: object
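
The datetime column is read in as a plain object (string) dtype. One option, sketched below, is to convert the existing column in place with pd.to_datetime; the next step instead simply re-reads the file with parse_dates, which achieves the same result.

# alternative to re-reading the CSV: convert the existing column in place
# (a minimal sketch; assumes the strings are parseable timestamps)
train["datetime"] = pd.to_datetime(train["datetime"])
train["datetime"].dtype     # datetime64[ns]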


Data read and Preprocessing

train = pd.read_csv("bike_train.csv", parse_dates=["datetime"])
train.dtypes
datetime      datetime64[ns]
season                 int64
holiday                int64
workingday             int64
weather                int64
temp                 float64
atemp                float64
humidity               int64
windspeed            float64
casual                 int64
registered             int64
count                  int64
dtype: object


train.head(3)
datetime season holiday workingday weather temp atemp humidity windspeed casual registered count
0 2011-01-01 00:00:00 1 0 0 1 9.84 14.395 81 0.0 3 13 16
1 2011-01-01 01:00:00 1 0 0 1 9.02 13.635 80 0.0 8 32 40
2 2011-01-01 02:00:00 1 0 0 1 9.02 13.635 80 0.0 5 27 32


train.shape     # (10886, 12)
(10886, 12)


train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   datetime    10886 non-null  datetime64[ns]
 1   season      10886 non-null  int64         
 2   holiday     10886 non-null  int64         
 3   workingday  10886 non-null  int64         
 4   weather     10886 non-null  int64         
 5   temp        10886 non-null  float64       
 6   atemp       10886 non-null  float64       
 7   humidity    10886 non-null  int64         
 8   windspeed   10886 non-null  float64       
 9   casual      10886 non-null  int64         
 10  registered  10886 non-null  int64         
 11  count       10886 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(8)
memory usage: 1020.7 KB


  • decompose ‘datetime’ feature
train["d-year"] = train["datetime"].dt.year
train["d-month"] = train["datetime"].dt.month
train["d-day"] = train["datetime"].dt.day
train["d-hour"] = train["datetime"].dt.hour
train["d-minute"] = train["datetime"].dt.minute
train["d-second"] = train["datetime"].dt.second
train["d-dayofweek"] = train["datetime"].dt.dayofweek   # monday(0), ... sunday(6)
train[["datetime", "d-year", "d-month", "d-day", "d-hour", 
       "d-minute", "d-second", "d-dayofweek"]].head()
datetime d-year d-month d-day d-hour d-minute d-second d-dayofweek
0 2011-01-01 00:00:00 2011 1 1 0 0 0 5
1 2011-01-01 01:00:00 2011 1 1 1 0 0 5
2 2011-01-01 02:00:00 2011 1 1 2 0 0 5
3 2011-01-01 03:00:00 2011 1 1 3 0 0 5
4 2011-01-01 04:00:00 2011 1 1 4 0 0 5


figure, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows=2, ncols=2)
figure.set_size_inches(12, 6)

sns.barplot(data=train, x="d-year", y="count", ax=ax1)
sns.barplot(data=train, x="d-month", y="count", ax=ax2)
sns.barplot(data=train, x="d-day", y="count", ax=ax3)
sns.barplot(data=train, x="d-hour", y="count", ax=ax4)

[Figure: bar plots of average count by d-year, d-month, d-day, and d-hour]



Data Analysis

average number of rentals by hour (weekdays vs. weekends)

  • pointplot(): shows point estimates and confidence intervals using scatter-plot glyphs
  • Keep in mind that a point plot shows only the mean (or another estimator) at each level; in many cases it is more informative to show the full distribution of values for each level of the categorical variable, in which case a box or violin plot may be more appropriate (a box-plot sketch follows the point plots below).
plt.figure(figsize=(12,4))
# sns.pointplot(data=train, x="d-hour", y="count", hue="workingday")
sns.pointplot(data=train, x="d-hour", y="count")

[Figure: point plot of average count by d-hour]


plt.figure(figsize=(12,4))
sns.pointplot(data=train, x="d-hour", y="count", hue="workingday")

[Figure: point plot of average count by d-hour, split by workingday]
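
As noted above, a point plot shows only the per-hour mean; a box plot of the same data exposes the spread at each hour. A small sketch reusing the d-hour column:

# distribution of rentals at each hour, not just the mean (sketch)
plt.figure(figsize=(12, 4))
sns.boxplot(data=train, x="d-hour", y="count")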


average number of rentals by day of the week

  • dayofweek: returns the day of the week, assuming the week starts on Monday (denoted by 0) and ends on Sunday (denoted by 6). It is available on Series with datetime values (via the dt accessor) and on DatetimeIndex.
print(train.shape)
train[["datetime", "d-dayofweek"]].head()
(10886, 19)
datetime d-dayofweek
0 2011-01-01 00:00:00 5
1 2011-01-01 01:00:00 5
2 2011-01-01 02:00:00 5
3 2011-01-01 03:00:00 5
4 2011-01-01 04:00:00 5


figure, (ax1, ax2) = plt.subplots(nrows=2, ncols=1)
figure.set_size_inches(18, 8)

sns.pointplot(data=train, x="d-hour", y="count", hue="workingday", ax=ax1)
sns.pointplot(data=train, x="d-hour", y="count", hue="d-dayofweek", ax=ax2)

[Figure: point plots of average count by d-hour, split by workingday (top) and by d-dayofweek (bottom)]


generating a new feature by combining year and month

def concatenate_year_month(datetime):
    return "{0}-{1}".format(datetime.year, datetime.month)

train["d-year_month"] = train["datetime"].apply(concatenate_year_month)

print(train.shape)
train[["datetime", "d-year_month"]].head()
(10886, 20)
datetime d-year_month
0 2011-01-01 00:00:00 2011-1
1 2011-01-01 01:00:00 2011-1
2 2011-01-01 02:00:00 2011-1
3 2011-01-01 03:00:00 2011-1
4 2011-01-01 04:00:00 2011-1
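
Using apply with a Python function works but runs row by row. A vectorized alternative, sketched below with a hypothetical d-year_month_alt column, uses the dt accessor directly; note that strftime zero-pads the month ("2011-01" instead of "2011-1"), which only changes the category labels.

# vectorized alternative (months are zero-padded, e.g. "2011-01")
train["d-year_month_alt"] = train["datetime"].dt.strftime("%Y-%m")
train[["d-year_month", "d-year_month_alt"]].head(3)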


figure, (ax1, ax2) = plt.subplots(nrows=1, ncols=2)
figure.set_size_inches(18, 4)

sns.barplot(data=train, x="d-year", y="count", ax=ax1)
sns.barplot(data=train, x="d-month", y="count", ax=ax2)

figure, ax3 = plt.subplots(nrows=1, ncols=1)
figure.set_size_inches(18, 4)

sns.barplot(data=train, x="d-year_month", y="count", ax=ax3)

[Figure: bar plots of average count by d-year and d-month]

[Figure: bar plot of average count by d-year_month]



Select features to use for training

train.columns
Index(['datetime', 'season', 'holiday', 'workingday', 'weather', 'temp',
       'atemp', 'humidity', 'windspeed', 'casual', 'registered', 'count',
       'd-year', 'd-month', 'd-day', 'd-hour', 'd-minute', 'd-second',
       'd-dayofweek', 'd-year_month'],
      dtype='object')


features = ["season", "holiday", "workingday", "weather",
            "temp", "atemp", "humidity", "windspeed",
            "d-year", "d-hour", "d-dayofweek"]
X = train[features]
y = train['count']
print(X.shape, y.shape)
X.head()
(10886, 11) (10886,)
season holiday workingday weather temp atemp humidity windspeed d-year d-hour d-dayofweek
0 1 0 0 1 9.84 14.395 81 0.0 2011 0 5
1 1 0 0 1 9.02 13.635 80 0.0 2011 1 5
2 1 0 0 1 9.02 13.635 80 0.0 2011 2 5
3 1 0 0 1 9.84 14.395 75 0.0 2011 3 5
4 1 0 0 1 9.84 14.395 75 0.0 2011 4 5



Random Forest model

np.random.seed(11)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

model = RandomForestRegressor(n_estimators= 30)
model.fit(X_train, y_train)
model.score(X_test, y_test)
0.9483758264489849
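
For a regressor, score() returns the coefficient of determination R² on the given data. A single train/test split can be noisy, so a cross-validated estimate of the same baseline is worth a look; a quick sketch reusing X and y from above:

# 5-fold cross-validated R^2 for the same baseline model (sketch)
cv_r2 = cross_val_score(RandomForestRegressor(n_estimators=30, random_state=11),
                        X, y, cv=5, scoring="r2")
print("CV R^2: {0:.4f} +/- {1:.4f}".format(cv_r2.mean(), cv_r2.std()))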


use a logarithmic transformation of the target

from sklearn.metrics import mean_squared_error
np.random.seed(11)
X_train, X_test, y_train, y_test = train_test_split(X, np.log(1+y), test_size = 0.2)   # y -> log(1 + y)

model = RandomForestRegressor(n_estimators= 30)
model.fit(X_train, y_train)
print("score: ", model.score(X_test, y_test))
print("MSE: ", mean_squared_error(y_test, model.predict(X_test)))
score:  0.9583998288123375
MSE:  0.08291535828416392
  • the model trained on log(1 + y) scores higher on the test split than the one trained on y directly.
# list(zip(y_test, model.predict(X_test)))[:10]
# X_train[:5]
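
Because the model is now trained on log(1 + y), its raw predictions live on the log scale; to report actual rental counts they must be mapped back with the inverse transform. A minimal sketch using the fitted model from above:

# predictions are log(1 + count); expm1 inverts the transform back to counts
pred_log = model.predict(X_test)
pred_counts = np.expm1(pred_log)     # equivalent to exp(pred_log) - 1
print(pred_counts[:5].round(1))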


Other models: linear and decision tree models

y_train
7560     4.709530
8090     6.304449
30       0.693147
7290     1.945910
8665     5.521461
           ...   
4023     5.913503
7259     5.817111
5200     5.153292
3775     5.192957
10137    6.144186
Name: count, Length: 8708, dtype: float64


model = LinearRegression()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))

model = DecisionTreeRegressor()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
0.4725950198506106
0.9152295630854631


change the performance metric (RMSE -> RMSLE)

  • more robust to outliers than RMSE
  • RMSLE penalizes underestimation of the actual value more heavily than overestimation.
  • This is especially useful for business cases where underestimating the target is not acceptable but overestimation can be tolerated.
  • to be used as the scoring function for the searches below (lower is better)
  • \[\sqrt{\frac{1}{n} \sum_{i=1}^n (\log(p_i + 1) - \log(a_i+1))^2 }\]
def rmsle(predict, actual):
    predict = np.array(predict)
    actual = np.array(actual)
    
    predict = np.log(predict + 1)
    actual = np.log(actual + 1)
    
    difference_square_mean = np.square(predict - actual).mean()
    score = np.sqrt(difference_square_mean)
    return score

rmsle_scorer = make_scorer(rmsle)    # wrap the custom metric as a scikit-learn scorer
rmsle_scorer
make_scorer(rmsle)
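
One caveat: make_scorer assumes by default that larger values are better. That is harmless here, because the scores below are only collected and sorted manually, but if the scorer were handed to GridSearchCV or RandomizedSearchCV it should be marked as a loss. A sketch:

# for use inside GridSearchCV / RandomizedSearchCV, mark RMSLE as a loss
# (the search then reports negated scores and effectively minimizes RMSLE)
rmsle_loss_scorer = make_scorer(rmsle, greater_is_better=False)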


selecting hyperparameters

  • Grid search
  • Random search
# Grid Search for the random forest

n_estimators = 30

max_depth_list = [10, 20, 30, 50, 100]
max_features_list = [0.1, 0.3, 0.5, 0.7, 0.9]   # fraction of features to consider at each split

hyperparameters_list = []

for max_depth in max_depth_list:
    for max_features in max_features_list:
        model = RandomForestRegressor(n_estimators=n_estimators,
                                      max_depth=max_depth,
                                      max_features=max_features,
                                      random_state=11,
                                      n_jobs=-1)

        score = cross_val_score(model, X_train, y_train, cv=5,
                                scoring=rmsle_scorer).mean()

        hyperparameters_list.append({
            'rmsle': score,
            'n_estimators': n_estimators,
            'max_depth': max_depth,
            'max_features': max_features,
        })

        print("Score = {0:.5f}".format(score))

hyperparameters_list
Score = 0.16782
...
Score = 0.08654

[{'rmsle': 0.1678247426392651,
  'n_estimators': 30,
  'max_depth': 10,
  'max_features': 0.1},
...
 {'rmsle': 0.08653948703002194,
  'n_estimators': 30,
  'max_depth': 100,
  'max_features': 0.9}]


hyperparameters_list = pd.DataFrame.from_dict(hyperparameters_list)  # make dataframe from dictionary
hyperparameters_list = hyperparameters_list.sort_values(by="rmsle")

print(hyperparameters_list.shape)
hyperparameters_list.head()
(25, 4)
rmsle n_estimators max_depth max_features
19 0.086539 30 50 0.9
24 0.086539 30 100 0.9
14 0.086558 30 30 0.9
9 0.086676 30 20 0.9
8 0.087629 30 20 0.7
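
Since only two hyperparameters vary, the grid results can also be pivoted into a max_depth x max_features table and drawn as a heatmap, which makes it easier to see that max_features drives most of the improvement here. A sketch using the DataFrame built above:

# visualize the grid: rows = max_depth, columns = max_features, values = RMSLE
pivot = hyperparameters_list.pivot(index="max_depth", columns="max_features",
                                   values="rmsle")
plt.figure(figsize=(8, 5))
sns.heatmap(pivot, annot=True, fmt=".4f", cmap="viridis_r")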


  • two stages: a broad random search, then fine tuning around the best region
# Random selection

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

hyperparameters_list = []

n_estimators = 30
num_epoch = 10

for epoch in range(num_epoch):
    max_depth = np.random.randint(low=2, high=100)
    max_features = np.random.uniform(low=0.1, high=1.0)

    model = RandomForestRegressor(n_estimators=n_estimators,
                                  max_depth=max_depth,
                                  max_features=max_features,
                                  random_state=37,
                                  n_jobs=-1)

    score = cross_val_score(model, X_train, y_train, cv=5,
                            scoring=rmsle_scorer).mean()

    hyperparameters_list.append({
        'rmsle': score,
        'n_estimators': n_estimators,
        'max_depth': max_depth,
        'max_features': max_features,
    })

    print("Score = {0:.5f}".format(score))

hyperparameters_list = pd.DataFrame.from_dict(hyperparameters_list)
hyperparameters_list = hyperparameters_list.sort_values(by="rmsle")

print(hyperparameters_list.shape)
hyperparameters_list.head()
Score = 0.08673
Score = 0.12227
Score = 0.12227
Score = 0.14013
Score = 0.09152
Score = 0.18747
Score = 0.08869
Score = 0.08962
Score = 0.08869
Score = 0.08761
(10, 4)
rmsle n_estimators max_depth max_features
0 0.086729 30 76 0.804363
9 0.087607 30 82 0.708922
6 0.088695 30 79 0.575713
8 0.088695 30 74 0.577014
7 0.089616 30 12 0.700990


Fine tuning

# fine search

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

hyperparameters_list = []

n_estimators = 30
num_epoch = 10

for epoch in range(num_epoch):
    max_depth = np.random.randint(low=30, high=90)
    max_features = np.random.uniform(low=0.5, high=1.0)

    model = RandomForestRegressor(n_estimators=n_estimators,
                                  max_depth=max_depth,
                                  max_features=max_features,
                                  random_state=37,
                                  n_jobs=-1)

    score = cross_val_score(model, X_train, y_train, cv=5,
                            scoring=rmsle_scorer).mean()

    hyperparameters_list.append({
        'rmsle': score,
        'n_estimators': n_estimators,
        'max_depth': max_depth,
        'max_features': max_features,
    })

    print("Score = {0:.5f}".format(score))

hyperparameters_list = pd.DataFrame.from_dict(hyperparameters_list)
hyperparameters_list = hyperparameters_list.sort_values(by="rmsle")

print(hyperparameters_list.shape)
hyperparameters_list
Score = 0.08640
Score = 0.08869
Score = 0.08658
Score = 0.08673
Score = 0.08761
Score = 0.08673
Score = 0.08640
Score = 0.08869
Score = 0.08658
Score = 0.08761
(10, 4)
rmsle n_estimators max_depth max_features
0 0.086401 30 59 0.847356
6 0.086401 30 39 0.830026
8 0.086583 30 83 0.930264
2 0.086583 30 54 0.949200
3 0.086729 30 48 0.765320
5 0.086729 30 64 0.728243
4 0.087607 30 85 0.661769
9 0.087607 30 65 0.688243
1 0.088695 30 89 0.598500
7 0.088695 30 65 0.578285


Final Selection

# final selection of hyperparameters (final model)

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=300,
                              max_depth=65,
                              max_features=0.9309,
                              random_state=37,
                              n_jobs=-1)
model.fit(X_train, y_train)
score = cross_val_score(model, X_test, y_test, cv=5,
                        scoring=rmsle_scorer).mean()
print("Score = {0:.5f}".format(score))
Score = 0.09163
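
For reference, the RMSLE of the fitted final model can also be computed directly on the held-out split with the rmsle function defined earlier (a quick sanity check, not a replacement for the cross-validated score):

# direct hold-out RMSLE of the fitted final model (sketch)
holdout_rmsle = rmsle(model.predict(X_test), y_test)
print("Hold-out RMSLE = {0:.5f}".format(holdout_rmsle))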


Most significant features

model.feature_importances_    # The higher the value, the more important the feature.
array([0.02895014, 0.00159298, 0.03771286, 0.01074947, 0.04924549,
       0.02964683, 0.02133926, 0.01119758, 0.03024815, 0.75071977,
       0.02859746])


df = pd.DataFrame({'feature': features, 'importance': model.feature_importances_})
df = df.sort_values('importance', ascending=False)
x = df.feature
y = df.importance
ypos = np.arange(len(x))
plt.figure(figsize=(10,7))
plt.barh(x, y)
plt.yticks(ypos, x)
plt.xlabel('Importance')
plt.ylabel('Variable')
plt.xlim(0, 1)
plt.ylim(-1, len(x))
plt.show()

[Figure: horizontal bar chart of feature importances]



GridSearchCV()

  • performs grid search and cross-validation at the same time
  • provides fit, predict, and score methods; calling fit runs cross-validation over every parameter combination
  • Exhaustive search over specified parameter values for an estimator.
  • It implements a “fit” and a “score” method. It also implements “predict”, “predict_proba”, “decision_function”, “transform” and “inverse_transform” if they are implemented in the estimator used.
  • The parameters of the estimator used to apply these methods are optimized by cross-validated grid-search over a parameter grid.
from sklearn.model_selection import GridSearchCV

params = [{"max_depth":[10, 20, 30], 
           "max_features":[0.3, 0.5, 0.9]}]

# grid search
clf = GridSearchCV(RandomForestRegressor(), params, cv=3, n_jobs=-1)
clf.fit(X_train, y_train)
print("best values: ", clf.best_estimator_)
print("best score: ", clf.best_score_)

# final evaluation on test data
score = clf.score(X_test, y_test)
print("final score: ", score)
best values:  RandomForestRegressor(max_depth=30, max_features=0.9)
best score:  0.9464442332854327
final score:  0.9589943180667339
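
The grid search above optimizes the estimator's default R² score. To optimize RMSLE instead, as motivated earlier, a loss-style scorer can be passed through the scoring argument; a sketch assuming the rmsle function and params grid from above:

# grid search that minimizes RMSLE instead of maximizing R^2 (sketch)
clf_rmsle = GridSearchCV(RandomForestRegressor(random_state=11), params, cv=3,
                         scoring=make_scorer(rmsle, greater_is_better=False),
                         n_jobs=-1)
clf_rmsle.fit(X_train, y_train)
print("best params: ", clf_rmsle.best_params_)
print("best (negated) RMSLE: ", clf_rmsle.best_score_)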



RandomizedSearchCV() function

  • In contrast to GridSearchCV, not all parameter values are tried out, but rather a fixed number of parameter settings is sampled from the specified distributions. The number of parameter settings that are tried is given by n_iter.
from sklearn.model_selection import RandomizedSearchCV

n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
max_features = ['log2', 'sqrt']
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

rf = RandomizedSearchCV(RandomForestRegressor(), random_grid, 
                               cv = 3, n_jobs = -1)
# Fit the random search model
rf.fit(X_train, y_train)

print(rf.best_params_, rf.best_estimator_, rf.best_score_)
score = rf.score(X_test, y_test)
print(score)
{'n_estimators': 200, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_features': 'log2', 'max_depth': None, 'bootstrap': False} RandomForestRegressor(bootstrap=False, max_features='log2', min_samples_leaf=2,
                      min_samples_split=5, n_estimators=200) 0.9162766832621637
0.9365843295319143
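
The dictionary above samples from fixed lists; RandomizedSearchCV also accepts scipy.stats distributions (anything with an rvs method), which lets it draw from continuous ranges while n_iter caps the search budget. A sketch:

from scipy.stats import randint, uniform

# sample hyperparameters from distributions instead of fixed lists (sketch)
random_dist = {'n_estimators': randint(100, 500),
               'max_depth': randint(10, 100),
               'max_features': uniform(0.1, 0.9)}   # uniform over [0.1, 1.0)

rf_dist = RandomizedSearchCV(RandomForestRegressor(), random_dist,
                             n_iter=20, cv=3, n_jobs=-1, random_state=11)
rf_dist.fit(X_train, y_train)
print(rf_dist.best_params_, rf_dist.best_score_)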

