[Machine Learning] End-to-End Machine Learning Process



The full code is at the bottom of this post.


Setup

Import necessary libraries.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import urllib.request
import tarfile
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import mean_squared_error, r2_score


Import dataset

Download the California housing dataset and load it as a pandas DataFrame.

url = "https://raw.githubusercontent.com/ageron/handson-ml2/master/datasets/housing/housing.tgz"
urllib.request.urlretrieve(url, "housing.tgz")
with tarfile.open("housing.tgz") as tar:
    tar.extractall()
housing = pd.read_csv("housing.csv")

housing


longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY
... ... ... ... ... ... ... ... ... ... ...
20635 -121.09 39.48 25.0 1665.0 374.0 845.0 330.0 1.5603 78100.0 INLAND
20636 -121.21 39.49 18.0 697.0 150.0 356.0 114.0 2.5568 77100.0 INLAND
20637 -121.22 39.43 17.0 2254.0 485.0 1007.0 433.0 1.7000 92300.0 INLAND
20638 -121.32 39.43 18.0 1860.0 409.0 741.0 349.0 1.8672 84700.0 INLAND
20639 -121.24 39.37 16.0 2785.0 616.0 1387.0 530.0 2.3886 89400.0 INLAND

20640 rows × 10 columns


Data Analysis

housing.describe()
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
count 20640.000000 20640.000000 20640.000000 20640.000000 20433.000000 20640.000000 20640.000000 20640.000000 20640.000000
mean -119.569704 35.631861 28.639486 2635.763081 537.870553 1425.476744 499.539680 3.870671 206855.816909
std 2.003532 2.135952 12.585558 2181.615252 421.385070 1132.462122 382.329753 1.899822 115395.615874
min -124.350000 32.540000 1.000000 2.000000 1.000000 3.000000 1.000000 0.499900 14999.000000
25% -121.800000 33.930000 18.000000 1447.750000 296.000000 787.000000 280.000000 2.563400 119600.000000
50% -118.490000 34.260000 29.000000 2127.000000 435.000000 1166.000000 409.000000 3.534800 179700.000000
75% -118.010000 37.710000 37.000000 3148.000000 647.000000 1725.000000 605.000000 4.743250 264725.000000
max -114.310000 41.950000 52.000000 39320.000000 6445.000000 35682.000000 6082.000000 15.000100 500001.000000


housing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
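
Two things stand out: ‘total_bedrooms’ has only 20433 non-null entries, so 207 districts are missing that value and will need imputing, and ‘ocean_proximity’ is the only non-numerical column, so it will need encoding.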



Split the Dataset into Training/Test Sets

X = housing.drop('median_house_value',axis=1)
y = housing['median_house_value']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    shuffle=True, 
                                                    random_state=42)

    
X_train.head(10)
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income ocean_proximity
14196 -117.03 32.71 33.0 3126.0 627.0 2300.0 623.0 3.2596 NEAR OCEAN
8267 -118.16 33.77 49.0 3382.0 787.0 1314.0 756.0 3.8125 NEAR OCEAN
17445 -120.48 34.66 4.0 1897.0 331.0 915.0 336.0 4.1563 NEAR OCEAN
14265 -117.11 32.69 36.0 1421.0 367.0 1418.0 355.0 1.9425 NEAR OCEAN
2271 -119.80 36.78 43.0 2382.0 431.0 874.0 380.0 3.5542 INLAND
17848 -121.86 37.42 20.0 5032.0 808.0 2695.0 801.0 6.6227 <1H OCEAN
6252 -117.97 34.04 28.0 1686.0 417.0 1355.0 388.0 2.5192 <1H OCEAN
9389 -122.53 37.91 37.0 2524.0 398.0 999.0 417.0 7.9892 NEAR BAY
6113 -117.90 34.13 5.0 1126.0 316.0 819.0 311.0 1.5000 <1H OCEAN
6061 -117.79 34.02 5.0 18690.0 2862.0 9427.0 2777.0 6.4266 <1H OCEAN


Data Preprocessing

Unlike the lecture, I do not convert the per-‘district’ data into per-‘house’ data.

Since each sample represents an entire district, converting to per-household averages could actually hurt performance on this dataset.

Instead, I impute the missing values in the ‘total_bedrooms’ feature, then scale the numerical features and one-hot encode the categorical ‘ocean_proximity’ column.

Imputing

Because the ‘total_bedrooms’ feature has missing values, we fill them using another feature that is highly correlated with it and has no missing values of its own.

X_train.corr(numeric_only=True)['total_bedrooms'].sort_values(ascending=False)
total_bedrooms        1.000000
households            0.980255
total_rooms           0.930489
population            0.878932
longitude             0.063064
median_income        -0.009141
latitude             -0.059998
housing_median_age   -0.320624
Name: total_bedrooms, dtype: float64
class MyImputer():
    """Impute missing values in one feature using a highly correlated reference feature.

    fit() learns the overall ratio feature/reference over rows where both values
    are present; transform() fills each missing feature value with
    round(reference * ratio).
    """
    def __init__(self):
        self.proportion = 0

    def fit(self, features, references, reset=True):
        # Sum both columns over rows where neither value is missing
        tot_feature, tot_reference = 0, 0
        for feature, reference in zip(features, references):
            if not np.isnan(feature) and not np.isnan(reference):
                tot_feature += feature
                tot_reference += reference

        # reset=True overwrites the learned ratio; reset=False averages it
        # with the previously learned value
        if reset: self.proportion = tot_feature / tot_reference
        else: self.proportion = (tot_feature / tot_reference + self.proportion) / 2

    def transform(self, features, references):
        # Fill each missing feature value with the scaled reference value
        imputed_features = []
        for feature, reference in zip(features, references):
            if np.isnan(feature) and not np.isnan(reference):
                imputed_features.append(round(reference * self.proportion))
            else:
                imputed_features.append(feature)
        return imputed_features

    def fit_transform(self, features, references, reset=True):
        self.fit(features, references, reset)
        return self.transform(features, references)
imputer = MyImputer()

# Impute the missing values using the highly correlated 'households' feature
X_train['total_bedrooms'] = imputer.fit_transform(X_train['total_bedrooms'], X_train['households'])
X_train.head(10)
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income ocean_proximity
14196 -117.03 32.71 33.0 3126.0 627.0 2300.0 623.0 3.2596 NEAR OCEAN
8267 -118.16 33.77 49.0 3382.0 787.0 1314.0 756.0 3.8125 NEAR OCEAN
17445 -120.48 34.66 4.0 1897.0 331.0 915.0 336.0 4.1563 NEAR OCEAN
14265 -117.11 32.69 36.0 1421.0 367.0 1418.0 355.0 1.9425 NEAR OCEAN
2271 -119.80 36.78 43.0 2382.0 431.0 874.0 380.0 3.5542 INLAND
17848 -121.86 37.42 20.0 5032.0 808.0 2695.0 801.0 6.6227 <1H OCEAN
6252 -117.97 34.04 28.0 1686.0 417.0 1355.0 388.0 2.5192 <1H OCEAN
9389 -122.53 37.91 37.0 2524.0 398.0 999.0 417.0 7.9892 NEAR BAY
6113 -117.90 34.13 5.0 1126.0 316.0 819.0 311.0 1.5000 <1H OCEAN
6061 -117.79 34.02 5.0 18690.0 2862.0 9427.0 2777.0 6.4266 <1H OCEAN


Scaling

Scale each numerical feature so that its values fall between 0 and 1.

num_columns = list(X_train.columns[:-1])
print(num_columns)
num_attribs = X_train.drop('ocean_proximity', axis=1)
['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income']
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_num_attribs = scaler.fit_transform(num_attribs)

scaled_num_attribs[:10]
array([[0.72908367, 0.01702128, 0.62745098, 0.0794547 , 0.09714463,
        0.06437961, 0.10228581, 0.19032151],
       [0.61653386, 0.12978723, 0.94117647, 0.08596572, 0.12197393,
        0.0367443 , 0.12415721, 0.22845202],
       [0.38545817, 0.22446809, 0.05882353, 0.04819675, 0.05121043,
        0.02556125, 0.05508962, 0.25216204],
       [0.72111554, 0.01489362, 0.68627451, 0.03609034, 0.05679702,
        0.03965918, 0.05821411, 0.09948828],
       [0.45318725, 0.45      , 0.82352941, 0.06053207, 0.06672874,
        0.02441212, 0.06232528, 0.21063847],
       [0.24800797, 0.51808511, 0.37254902, 0.12793123, 0.12523277,
        0.07545055, 0.13155731, 0.42225624],
       [0.63545817, 0.15851064, 0.52941176, 0.04283026, 0.06455618,
        0.03789344, 0.06364085, 0.13926015],
       [0.1812749 , 0.57021277, 0.70588235, 0.06414365, 0.0616077 ,
        0.02791558, 0.0684098 , 0.51649632],
       [0.64243028, 0.16808511, 0.07843137, 0.02858742, 0.04888268,
        0.0228706 , 0.05097846, 0.06897146],
       [0.65338645, 0.15638298, 0.07843137, 0.47530393, 0.4439789 ,
        0.26413296, 0.45650386, 0.40873229]])


Encoding

Perform one-hot encoding on the categorical data.

cat_attribs = X_train['ocean_proximity']
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
encoded_cat_attribs = encoder.fit_transform(cat_attribs.values.reshape(-1,1)).toarray()

one_hot_columns = list(encoder.categories_[0])
print(one_hot_columns)
encoded_cat_attribs[:10]
['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN']

array([[0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.]])


Put together

preprocessed_X_train_array = np.hstack([scaled_num_attribs,encoded_cat_attribs])

columns = num_columns + one_hot_columns
preprocessed_X_train = pd.DataFrame(preprocessed_X_train_array, columns=columns)   

preprocessed_X_train[:10]
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income <1H OCEAN INLAND ISLAND NEAR BAY NEAR OCEAN
0 0.729084 0.017021 0.627451 0.079455 0.097145 0.064380 0.102286 0.190322 0.0 0.0 0.0 0.0 1.0
1 0.616534 0.129787 0.941176 0.085966 0.121974 0.036744 0.124157 0.228452 0.0 0.0 0.0 0.0 1.0
2 0.385458 0.224468 0.058824 0.048197 0.051210 0.025561 0.055090 0.252162 0.0 0.0 0.0 0.0 1.0
3 0.721116 0.014894 0.686275 0.036090 0.056797 0.039659 0.058214 0.099488 0.0 0.0 0.0 0.0 1.0
4 0.453187 0.450000 0.823529 0.060532 0.066729 0.024412 0.062325 0.210638 0.0 1.0 0.0 0.0 0.0
5 0.248008 0.518085 0.372549 0.127931 0.125233 0.075451 0.131557 0.422256 1.0 0.0 0.0 0.0 0.0
6 0.635458 0.158511 0.529412 0.042830 0.064556 0.037893 0.063641 0.139260 1.0 0.0 0.0 0.0 0.0
7 0.181275 0.570213 0.705882 0.064144 0.061608 0.027916 0.068410 0.516496 0.0 0.0 0.0 1.0 0.0
8 0.642430 0.168085 0.078431 0.028587 0.048883 0.022871 0.050978 0.068971 1.0 0.0 0.0 0.0 0.0
9 0.653386 0.156383 0.078431 0.475304 0.443979 0.264133 0.456504 0.408732 1.0 0.0 0.0 0.0 0.0



Model Training

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(preprocessed_X_train, y_train)

housing_predictions = model.predict(preprocessed_X_train)
train_rmse = mean_squared_error(y_train, housing_predictions)**(1/2)
train_r2 = r2_score(y_train, housing_predictions)
# score() returns R² for regressors, so it needs the true labels, not the
# model's own predictions (which would trivially give 1.0)
train_score = model.score(preprocessed_X_train, y_train)

train_rmse, train_r2, train_score
(18040.92581263233, 0.975652280886504, 0.975652280886504)
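
The training RMSE of about 18,000 is far below the test RMSE reported in the next section (about 49,000), a sign that the forest overfits the training data. Cross-validation gives a more honest estimate without touching the test set; a minimal sketch using sklearn's cross_val_score (not part of the original notebook):

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation on the training set; sklearn reports negative MSE,
# so negate the scores before taking the square root
neg_mse = cross_val_score(model, preprocessed_X_train, y_train,
                          scoring="neg_mean_squared_error", cv=5)
cv_rmse = np.sqrt(-neg_mse)
print(cv_rmse.mean(), cv_rmse.std())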


Model Evaluation

# Impute the missing values using the highly correlated 'households' feature
X_test['total_bedrooms'] = imputer.transform(X_test['total_bedrooms'], X_test['households'])


columns = X_test.columns[:-1]
num_attribs, cat_attribs = X_test.drop('ocean_proximity', axis=1), X_test['ocean_proximity']


scaled_num_attribs = scaler.transform(num_attribs)


one_hot_columns = list(encoder.categories_[0])
encoded_cat_attribs = encoder.transform(X_test['ocean_proximity'].values.reshape(-1,1)).toarray()


preprocessed_X_test = pd.DataFrame(np.hstack([scaled_num_attribs,encoded_cat_attribs]),columns=list(columns)+list(one_hot_columns))


final_predictions = model.predict(preprocessed_X_test)

test_rmse = mean_squared_error(y_test, final_predictions)**(1/2)
test_r2 = r2_score(y_test, final_predictions)
test_score = model.score(preprocessed_X_test, y_test)

test_rmse, test_r2, test_score
(48969.517591501055, 0.8170026539070696, 0.8170026539070696)
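
The test RMSE of roughly $49,000 is far above the training RMSE, which confirms that the untuned forest overfits the training set; test_r2 and test_score match exactly because score() computes R² for regressors.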



Full code

'''1. Setup'''
import numpy as np
import pandas as pd
import urllib.request
import tarfile
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor


'''2. Import dataset'''
url = "https://raw.githubusercontent.com/ageron/handson-ml2/master/datasets/housing/housing.tgz"
urllib.request.urlretrieve(url, "housing.tgz")
with tarfile.open("housing.tgz") as tar:
    tar.extractall()
housing = pd.read_csv("housing.csv")


'''3. Split Train/Test dataset'''
X = housing.drop('median_house_value',axis=1)
y = housing['median_house_value']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    shuffle=True, 
                                                    random_state=42)


'''4. Data Preprocessing'''
# Imputing
class MyImputer():
    """Impute missing values in one feature using a highly correlated reference feature.

    fit() learns the overall ratio feature/reference over rows where both values
    are present; transform() fills each missing feature value with
    round(reference * ratio).
    """
    def __init__(self):
        self.proportion = 0

    def fit(self, features, references, reset=True):
        # Sum both columns over rows where neither value is missing
        tot_feature, tot_reference = 0, 0
        for feature, reference in zip(features, references):
            if not np.isnan(feature) and not np.isnan(reference):
                tot_feature += feature
                tot_reference += reference

        # reset=True overwrites the learned ratio; reset=False averages it
        # with the previously learned value
        if reset: self.proportion = tot_feature / tot_reference
        else: self.proportion = (tot_feature / tot_reference + self.proportion) / 2

    def transform(self, features, references):
        # Fill each missing feature value with the scaled reference value
        imputed_features = []
        for feature, reference in zip(features, references):
            if np.isnan(feature) and not np.isnan(reference):
                imputed_features.append(round(reference * self.proportion))
            else:
                imputed_features.append(feature)
        return imputed_features

    def fit_transform(self, features, references, reset=True):
        self.fit(features, references, reset)
        return self.transform(features, references)
    
imputer = MyImputer()
X_train['total_bedrooms'] = imputer.fit_transform(X_train['total_bedrooms'], X_train['households'])

# Scaling
num_columns = list(X_train.columns[:-1])
num_attribs = X_train.drop('ocean_proximity', axis=1)

scaler = MinMaxScaler()
scaled_num_attribs = scaler.fit_transform(num_attribs)

# Encoding
cat_attribs = X_train['ocean_proximity']

encoder = OneHotEncoder()
encoded_cat_attribs = encoder.fit_transform(cat_attribs.values.reshape(-1,1)).toarray()

one_hot_columns = list(encoder.categories_[0])

# Put together
preprocessed_X_train_array = np.hstack([scaled_num_attribs,encoded_cat_attribs])

columns = num_columns + one_hot_columns
preprocessed_X_train = pd.DataFrame(preprocessed_X_train_array, columns=columns) 



'''5. Model Training'''
model = RandomForestRegressor()
model.fit(preprocessed_X_train, y_train)



'''6. Model Evaluation'''
# Imputing
X_test['total_bedrooms'] = imputer.transform(X_test['total_bedrooms'], X_test['households'])

# Scaling/Encoding
num_columns = list(X_test.columns[:-1])
num_attribs, cat_attribs = X_test.drop('ocean_proximity', axis=1), X_test['ocean_proximity']

scaled_num_attribs = scaler.transform(num_attribs)

encoded_cat_attribs = encoder.transform(X_test['ocean_proximity'].values.reshape(-1,1)).toarray()
one_hot_columns = list(encoder.categories_[0])

columns = num_columns + one_hot_columns
preprocessed_X_test = pd.DataFrame(np.hstack([scaled_num_attribs,encoded_cat_attribs]),columns=columns)

# Prediction
final_predictions = model.predict(preprocessed_X_test)

test_rmse = mean_squared_error(y_test, final_predictions)**(1/2)
test_score = model.score(preprocessed_X_test, y_test)

print("rmse: {}\ttest score: {}%".format(round(test_rmse,2), round(test_score*100,2)))
rmse: 48841.13	test score: 81.8%


Discussion

Here is a brief walkthrough of the overall flow of the code.

1. Setup

Import the required modules and classes.

  • Basics: numpy, pandas
  • Import dataset: urllib.request, tarfile
  • Splitting: train_test_split
  • Preprocessing
    • Imputing: (custom MyImputer)
    • Scaling: MinMaxScaler
    • Encoding: OneHotEncoder
  • Model: RandomForestRegressor
  • Evaluation: mean_squared_error


2. Import dataset

Download the California housing dataset.


3. Split Train/Test dataset

Split the loaded housing dataset into training and test sets. shuffle=True is set so that the training and test sets end up with similar distributions.

test_size is set to 20% of the whole dataset.
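
If you want the income distribution preserved explicitly rather than relying on shuffling alone, a common alternative is to stratify the split on binned median_income. A minimal sketch of that variant (an assumption, not what this post does):

# Bin median_income into five categories and stratify the split on them,
# so the training and test sets share the same income distribution
income_cat = pd.cut(housing["median_income"],
                    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                    labels=[1, 2, 3, 4, 5])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=income_cat, random_state=42)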


4. Data Preprocessing

4.1 Imputing

Instead of using an existing imputer class, I implemented one myself.

  1. First, use the corr() method to find the feature that is most highly correlated with the feature to be imputed.
  2. Compute the proportion between the feature with missing values and the feature found in step 1.
  3. Use that proportion together with the reference feature to fill in the missing values.

This replaces the missing values with meaningful estimates.
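
For reference, the same proportional imputation can be written in a few vectorized pandas lines; this is a sketch equivalent to MyImputer, not the code used above:

# Ratio of total_bedrooms to households over rows where total_bedrooms is present
# (households itself has no missing values)
mask = X_train['total_bedrooms'].notna()
ratio = X_train.loc[mask, 'total_bedrooms'].sum() / X_train.loc[mask, 'households'].sum()

# Fill each missing bedroom count with the scaled household count
X_train['total_bedrooms'] = X_train['total_bedrooms'].fillna(
    (X_train['households'] * ratio).round())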

4.2 Scaling

Between standard scaling and min-max scaling, I chose min-max scaling, which is a bit better suited to this regression problem.
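
Min-max scaling maps each value x to (x - min) / (max - min), so the smallest value of a feature becomes 0 and the largest becomes 1. A sketch of what MinMaxScaler computes for a single column:

# Manual equivalent of MinMaxScaler for one column
col = X_train['median_income']
scaled = (col - col.min()) / (col.max() - col.min())  # every value now lies in [0, 1]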

4.3 Encoding

One-hot encode the categorical feature ‘ocean_proximity’.
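
One caveat: by default OneHotEncoder raises an error at transform time if the test set contains a category it never saw during fit. A defensive variant (an option, not used in this post) encodes unseen categories as all-zero rows:

# Unseen categories are encoded as all zeros instead of raising an error
encoder = OneHotEncoder(handle_unknown='ignore')
encoded = encoder.fit_transform(X_train['ocean_proximity'].values.reshape(-1, 1)).toarray()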


5. Model Training

Choose a model and train it with the fit() method.

I chose RandomForestRegressor as the model.
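
RandomForestRegressor is stochastic, which is why the RMSE printed by the full code (48,841) differs slightly from the earlier notebook run (48,969). Fixing the seed makes runs reproducible; a sketch:

# Fix the seed so repeated runs build identical forests and identical scores
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(preprocessed_X_train, y_train)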


6. Model Evaluation

Apply the same preprocessing used on the training set to the test set, then run prediction and evaluation with the trained model.
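
All of the manual steps above could also be folded into a single sklearn Pipeline with a ColumnTransformer, so that preprocessing is fitted on the training set and the test set is automatically transformed with the same learned statistics. A sketch under the assumption that sklearn's SimpleImputer (median strategy) stands in for the custom proportional imputer:

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

num_cols = ['longitude', 'latitude', 'housing_median_age', 'total_rooms',
            'total_bedrooms', 'population', 'households', 'median_income']
cat_cols = ['ocean_proximity']

preprocess = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', MinMaxScaler())]), num_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
])

full_model = Pipeline([('prep', preprocess),
                       ('rf', RandomForestRegressor(random_state=42))])
full_model.fit(X_train, y_train)          # preprocessing statistics learned on train only
print(full_model.score(X_test, y_test))   # test set reuses those statistics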
