[Machine Learning] End-to-end machine learning process
The full code is at the bottom of this post.
Setup
Import necessary libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import urllib.request
import tarfile
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import train_test_split
# from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
Import dataset
Download the California housing dataset and load it into a pandas DataFrame.
url = "https://raw.githubusercontent.com/ageron/handson-ml2/master/datasets/housing/housing.tgz"
urllib.request.urlretrieve(url, "housing.tgz")
tar = tarfile.open("housing.tgz")
tar.extractall()
tar.close()
housing = pd.read_csv("housing.csv")
housing
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
---|---|---|---|---|---|---|---|---|---|---|
0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
20635 | -121.09 | 39.48 | 25.0 | 1665.0 | 374.0 | 845.0 | 330.0 | 1.5603 | 78100.0 | INLAND |
20636 | -121.21 | 39.49 | 18.0 | 697.0 | 150.0 | 356.0 | 114.0 | 2.5568 | 77100.0 | INLAND |
20637 | -121.22 | 39.43 | 17.0 | 2254.0 | 485.0 | 1007.0 | 433.0 | 1.7000 | 92300.0 | INLAND |
20638 | -121.32 | 39.43 | 18.0 | 1860.0 | 409.0 | 741.0 | 349.0 | 1.8672 | 84700.0 | INLAND |
20639 | -121.24 | 39.37 | 16.0 | 2785.0 | 616.0 | 1387.0 | 530.0 | 2.3886 | 89400.0 | INLAND |
20640 rows × 10 columns
Data Analysis
housing.describe()
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
---|---|---|---|---|---|---|---|---|---|
count | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20433.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 |
mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.539680 | 3.870671 | 206855.816909 |
std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.385070 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
min | -124.350000 | 32.540000 | 1.000000 | 2.000000 | 1.000000 | 3.000000 | 1.000000 | 0.499900 | 14999.000000 |
25% | -121.800000 | 33.930000 | 18.000000 | 1447.750000 | 296.000000 | 787.000000 | 280.000000 | 2.563400 | 119600.000000 |
50% | -118.490000 | 34.260000 | 29.000000 | 2127.000000 | 435.000000 | 1166.000000 | 409.000000 | 3.534800 | 179700.000000 |
75% | -118.010000 | 37.710000 | 37.000000 | 3148.000000 | 647.000000 | 1725.000000 | 605.000000 | 4.743250 | 264725.000000 |
max | -114.310000 | 41.950000 | 52.000000 | 39320.000000 | 6445.000000 | 35682.000000 | 6082.000000 | 15.000100 | 500001.000000 |
housing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 20640 non-null float64
1 latitude 20640 non-null float64
2 housing_median_age 20640 non-null float64
3 total_rooms 20640 non-null float64
4 total_bedrooms 20433 non-null float64
5 population 20640 non-null float64
6 households 20640 non-null float64
7 median_income 20640 non-null float64
8 median_house_value 20640 non-null float64
9 ocean_proximity 20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
Split the dataset into training and test sets
X = housing.drop('median_house_value',axis=1)
y = housing['median_house_value']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
shuffle=True,
random_state=42)
X_train.head(10)
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity |
---|---|---|---|---|---|---|---|---|---|
14196 | -117.03 | 32.71 | 33.0 | 3126.0 | 627.0 | 2300.0 | 623.0 | 3.2596 | NEAR OCEAN |
8267 | -118.16 | 33.77 | 49.0 | 3382.0 | 787.0 | 1314.0 | 756.0 | 3.8125 | NEAR OCEAN |
17445 | -120.48 | 34.66 | 4.0 | 1897.0 | 331.0 | 915.0 | 336.0 | 4.1563 | NEAR OCEAN |
14265 | -117.11 | 32.69 | 36.0 | 1421.0 | 367.0 | 1418.0 | 355.0 | 1.9425 | NEAR OCEAN |
2271 | -119.80 | 36.78 | 43.0 | 2382.0 | 431.0 | 874.0 | 380.0 | 3.5542 | INLAND |
17848 | -121.86 | 37.42 | 20.0 | 5032.0 | 808.0 | 2695.0 | 801.0 | 6.6227 | <1H OCEAN |
6252 | -117.97 | 34.04 | 28.0 | 1686.0 | 417.0 | 1355.0 | 388.0 | 2.5192 | <1H OCEAN |
9389 | -122.53 | 37.91 | 37.0 | 2524.0 | 398.0 | 999.0 | 417.0 | 7.9892 | NEAR BAY |
6113 | -117.90 | 34.13 | 5.0 | 1126.0 | 316.0 | 819.0 | 311.0 | 1.5000 | <1H OCEAN |
6061 | -117.79 | 34.02 | 5.0 | 18690.0 | 2862.0 | 9427.0 | 2777.0 | 6.4266 | <1H OCEAN |
Data Preprocessing
Unlike the lecture, I do not convert the per-district data into per-house data.
Each sample is a single district, so converting the values to per-household averages could actually hurt performance on this dataset.
Instead, I fill in the NaN values of the 'total_bedrooms' feature, then scale the numerical data and one-hot encode the categorical 'ocean_proximity' column.
Imputing
Because the 'total_bedrooms' feature has missing values, I fill them using another feature that is highly correlated with it and has no missing values.
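# numeric_only=True skips the string column 'ocean_proximity' (needed in pandas 2.0+)
X_train.corr(numeric_only=True)['total_bedrooms'].sort_values(ascending=False)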
total_bedrooms 1.000000
households 0.980255
total_rooms 0.930489
population 0.878932
longitude 0.063064
median_income -0.009141
latitude -0.059998
housing_median_age -0.320624
Name: total_bedrooms, dtype: float64
class MyImputer():
    """Fill missing values of one feature using its overall ratio to another feature."""
    def __init__(self):
        self.proportion = 0

    def fit(self, features, labels, reset=True):
        # Sum both columns over the rows where neither value is missing.
        tot_feature, tot_label = 0, 0
        for feature, label in zip(features, labels):
            if not np.isnan(feature) and not np.isnan(label):
                tot_feature += feature
                tot_label += label
        if reset:
            self.proportion = tot_feature / tot_label
        else:
            # Average the new ratio with the previously fitted one.
            self.proportion = (tot_feature / tot_label + self.proportion) / 2
        return self

    def transform(self, features, labels):
        # Replace each missing feature value with round(label * proportion).
        imputed_features = []
        for feature, label in zip(features, labels):
            if np.isnan(feature) and not np.isnan(label):
                imputed_features.append(round(label * self.proportion))
            else:
                imputed_features.append(feature)
        return imputed_features

    def fit_transform(self, features, labels, reset=True):
        self.fit(features, labels, reset)
        return self.transform(features, labels)

imputer = MyImputer()
# Impute the missing values using the highly correlated 'households' feature
X_train['total_bedrooms'] = imputer.fit_transform(X_train['total_bedrooms'], X_train['households'])
X_train.head(10)
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity |
---|---|---|---|---|---|---|---|---|---|
14196 | -117.03 | 32.71 | 33.0 | 3126.0 | 627.0 | 2300.0 | 623.0 | 3.2596 | NEAR OCEAN |
8267 | -118.16 | 33.77 | 49.0 | 3382.0 | 787.0 | 1314.0 | 756.0 | 3.8125 | NEAR OCEAN |
17445 | -120.48 | 34.66 | 4.0 | 1897.0 | 331.0 | 915.0 | 336.0 | 4.1563 | NEAR OCEAN |
14265 | -117.11 | 32.69 | 36.0 | 1421.0 | 367.0 | 1418.0 | 355.0 | 1.9425 | NEAR OCEAN |
2271 | -119.80 | 36.78 | 43.0 | 2382.0 | 431.0 | 874.0 | 380.0 | 3.5542 | INLAND |
17848 | -121.86 | 37.42 | 20.0 | 5032.0 | 808.0 | 2695.0 | 801.0 | 6.6227 | <1H OCEAN |
6252 | -117.97 | 34.04 | 28.0 | 1686.0 | 417.0 | 1355.0 | 388.0 | 2.5192 | <1H OCEAN |
9389 | -122.53 | 37.91 | 37.0 | 2524.0 | 398.0 | 999.0 | 417.0 | 7.9892 | NEAR BAY |
6113 | -117.90 | 34.13 | 5.0 | 1126.0 | 316.0 | 819.0 | 311.0 | 1.5000 | <1H OCEAN |
6061 | -117.79 | 34.02 | 5.0 | 18690.0 | 2862.0 | 9427.0 | 2777.0 | 6.4266 | <1H OCEAN |
Scaling
Scale each feature so that its values fall between 0 and 1.
num_columns = list(X_train.columns[:-1])
print(num_columns)
num_attribs = X_train.drop('ocean_proximity', axis=1)
['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income']
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled_num_attribs = scaler.fit_transform(num_attribs)
scaled_num_attribs[:10]
array([[0.72908367, 0.01702128, 0.62745098, 0.0794547 , 0.09714463,
0.06437961, 0.10228581, 0.19032151],
[0.61653386, 0.12978723, 0.94117647, 0.08596572, 0.12197393,
0.0367443 , 0.12415721, 0.22845202],
[0.38545817, 0.22446809, 0.05882353, 0.04819675, 0.05121043,
0.02556125, 0.05508962, 0.25216204],
[0.72111554, 0.01489362, 0.68627451, 0.03609034, 0.05679702,
0.03965918, 0.05821411, 0.09948828],
[0.45318725, 0.45 , 0.82352941, 0.06053207, 0.06672874,
0.02441212, 0.06232528, 0.21063847],
[0.24800797, 0.51808511, 0.37254902, 0.12793123, 0.12523277,
0.07545055, 0.13155731, 0.42225624],
[0.63545817, 0.15851064, 0.52941176, 0.04283026, 0.06455618,
0.03789344, 0.06364085, 0.13926015],
[0.1812749 , 0.57021277, 0.70588235, 0.06414365, 0.0616077 ,
0.02791558, 0.0684098 , 0.51649632],
[0.64243028, 0.16808511, 0.07843137, 0.02858742, 0.04888268,
0.0228706 , 0.05097846, 0.06897146],
[0.65338645, 0.15638298, 0.07843137, 0.47530393, 0.4439789 ,
0.26413296, 0.45650386, 0.40873229]])
Encoding
Perform one-hot encoding on the categorical data.
cat_attribs = X_train['ocean_proximity']
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
# OneHotEncoder expects a 2D input, so reshape the single column to (n_samples, 1)
encoded_cat_attribs = encoder.fit_transform(cat_attribs.values.reshape(-1,1)).toarray()
one_hot_columns = list(*encoder.categories_)
print(one_hot_columns)
encoded_cat_attribs[:10]
['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN']
array([[0., 0., 0., 0., 1.],
[0., 0., 0., 0., 1.],
[0., 0., 0., 0., 1.],
[0., 0., 0., 0., 1.],
[0., 1., 0., 0., 0.],
[1., 0., 0., 0., 0.],
[1., 0., 0., 0., 0.],
[0., 0., 0., 1., 0.],
[1., 0., 0., 0., 0.],
[1., 0., 0., 0., 0.]])
Put together
preprocessed_X_train_array = np.hstack([scaled_num_attribs,encoded_cat_attribs])
columns = num_columns + one_hot_columns
preprocessed_X_train = pd.DataFrame(preprocessed_X_train_array, columns=columns)
preprocessed_X_train[:10]
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | <1H OCEAN | INLAND | ISLAND | NEAR BAY | NEAR OCEAN |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.729084 | 0.017021 | 0.627451 | 0.079455 | 0.097145 | 0.064380 | 0.102286 | 0.190322 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
1 | 0.616534 | 0.129787 | 0.941176 | 0.085966 | 0.121974 | 0.036744 | 0.124157 | 0.228452 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
2 | 0.385458 | 0.224468 | 0.058824 | 0.048197 | 0.051210 | 0.025561 | 0.055090 | 0.252162 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
3 | 0.721116 | 0.014894 | 0.686275 | 0.036090 | 0.056797 | 0.039659 | 0.058214 | 0.099488 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
4 | 0.453187 | 0.450000 | 0.823529 | 0.060532 | 0.066729 | 0.024412 | 0.062325 | 0.210638 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
5 | 0.248008 | 0.518085 | 0.372549 | 0.127931 | 0.125233 | 0.075451 | 0.131557 | 0.422256 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
6 | 0.635458 | 0.158511 | 0.529412 | 0.042830 | 0.064556 | 0.037893 | 0.063641 | 0.139260 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
7 | 0.181275 | 0.570213 | 0.705882 | 0.064144 | 0.061608 | 0.027916 | 0.068410 | 0.516496 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
8 | 0.642430 | 0.168085 | 0.078431 | 0.028587 | 0.048883 | 0.022871 | 0.050978 | 0.068971 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
9 | 0.653386 | 0.156383 | 0.078431 | 0.475304 | 0.443979 | 0.264133 | 0.456504 | 0.408732 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Model Training
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(preprocessed_X_train, y_train)
housing_predictions = model.predict(preprocessed_X_train)
train_rmse = mean_squared_error(y_train, housing_predictions)**(1/2)
train_r2 = r2_score(y_train, housing_predictions)
# score() returns R^2 against the true labels, so pass y_train rather than the predictions
train_score = model.score(preprocessed_X_train, y_train)
train_rmse, train_r2, train_score
(18040.92581263233, 0.975652280886504, 0.975652280886504)
Model Evaluation
# Impute the missing values using the highly correlated 'households' feature
X_test['total_bedrooms'] = imputer.transform(X_test['total_bedrooms'], X_test['households'])
columns = X_test.columns[:-1]
num_attribs, cat_attribs = X_test.drop('ocean_proximity', axis=1), X_test['ocean_proximity']
scaled_num_attribs = scaler.transform(num_attribs)
one_hot_columns = list(*encoder.categories_)
encoded_cat_attribs = encoder.transform(X_test['ocean_proximity'].values.reshape(-1,1)).toarray()
preprocessed_X_test = pd.DataFrame(np.hstack([scaled_num_attribs,encoded_cat_attribs]),columns=list(columns)+list(one_hot_columns))
final_predictions = model.predict(preprocessed_X_test)
test_rmse = mean_squared_error(y_test, final_predictions)**(1/2)
test_r2 = r2_score(y_test, final_predictions)
test_score = model.score(preprocessed_X_test, y_test)
test_rmse, test_r2, test_score
(48969.517591501055, 0.8170026539070696, 0.8170026539070696)
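The training RMSE above (about 18,041) is far lower than the test RMSE (about 48,970), which suggests the forest fits the training set much more closely than unseen data. One way to estimate that gap without touching the test set is k-fold cross-validation. The following is only a minimal sketch on synthetic data: the arrays `X` and `y`, the forest size and the fold count are assumptions made for illustration, not values from this post.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data (an assumption; stands in for the preprocessed training set).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))
y = 10 * X[:, 0] + rng.normal(0, 1, size=300)

model = RandomForestRegressor(n_estimators=30, random_state=42)

# scikit-learn maximizes scores, so RMSE is reported as a negative number.
scores = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
cv_rmse = -scores

# RMSE on the same data the model was trained on, for comparison.
model.fit(X, y)
train_rmse = np.sqrt(np.mean((y - model.predict(X)) ** 2))

print("per-fold RMSE:", cv_rmse.round(3))
print("mean CV RMSE:", cv_rmse.mean().round(3))
print("training RMSE:", train_rmse.round(3))
```

The cross-validated RMSE is usually a more honest preview of test performance than the training RMSE, because every fold is scored on rows the model did not see during fitting.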
Full code
'''1. Setup'''
import numpy as np
import pandas as pd
%matplotlib inline
import urllib.request
import tarfile
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
'''2. Import dataset'''
url = "https://raw.githubusercontent.com/ageron/handson-ml2/master/datasets/housing/housing.tgz"
urllib.request.urlretrieve(url, "housing.tgz")
tar = tarfile.open("housing.tgz")
tar.extractall()
tar.close()
housing = pd.read_csv("housing.csv")
'''3. Split Train/Test dataset'''
X = housing.drop('median_house_value',axis=1)
y = housing['median_house_value']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
shuffle=True,
random_state=42)
'''4. Data Preprocessing'''
# Imputing
class MyImputer():
    def __init__(self):
        self.proportion = 0

    def fit(self, features, labels, reset=True):
        tot_feature, tot_label = 0, 0
        for feature, label in zip(features, labels):
            if not np.isnan(feature) and not np.isnan(label):
                tot_feature += feature
                tot_label += label
        if reset:
            self.proportion = tot_feature / tot_label
        else:
            self.proportion = (tot_feature / tot_label + self.proportion) / 2
        return

    def transform(self, features, labels):
        imputed_features = []
        for feature, label in zip(features, labels):
            if np.isnan(feature) and not np.isnan(label):
                imputed_features.append(round(label * self.proportion))
            else:
                imputed_features.append(feature)
        return imputed_features

    def fit_transform(self, features, labels, reset=True):
        self.fit(features, labels, reset)
        return self.transform(features, labels)
imputer = MyImputer()
X_train['total_bedrooms'] = imputer.fit_transform(X_train['total_bedrooms'], X_train['households'])
# Scaling
num_columns = list(X_train.columns[:-1])
num_attribs = X_train.drop('ocean_proximity', axis=1)
scaler = MinMaxScaler()
scaled_num_attribs = scaler.fit_transform(num_attribs)
# Encoding
cat_attribs = X_train['ocean_proximity']
encoder = OneHotEncoder()
encoded_cat_attribs = encoder.fit_transform(cat_attribs.values.reshape(-1,1)).toarray()
one_hot_columns = list(*encoder.categories_)
# Put together
preprocessed_X_train_array = np.hstack([scaled_num_attribs,encoded_cat_attribs])
columns = num_columns + one_hot_columns
preprocessed_X_train = pd.DataFrame(preprocessed_X_train_array, columns=columns)
'''5. Model Training'''
model = RandomForestRegressor()
model.fit(preprocessed_X_train, y_train)
'''6. Model Evaluation'''
# Imputing
X_test['total_bedrooms'] = imputer.transform(X_test['total_bedrooms'], X_test['households'])
# Scaling/Encoding
num_columns = list(X_test.columns[:-1])
num_attribs, cat_attribs = X_test.drop('ocean_proximity', axis=1), X_test['ocean_proximity']
scaled_num_attribs = scaler.transform(num_attribs)
encoded_cat_attribs = encoder.transform(X_test['ocean_proximity'].values.reshape(-1,1)).toarray()
one_hot_columns = list(*encoder.categories_)
columns = num_columns + one_hot_columns
preprocessed_X_test = pd.DataFrame(np.hstack([scaled_num_attribs,encoded_cat_attribs]),columns=columns)
# Prediction
final_predictions = model.predict(preprocessed_X_test)
test_rmse = mean_squared_error(y_test, final_predictions)**(1/2)
test_score = model.score(preprocessed_X_test, y_test)
print("rmse: {}\ttest score: {}%".format(round(test_rmse,2), round(test_score*100,2)))
rmse: 48841.13 test score: 81.8%
Discussion
Here is a brief walkthrough of the overall flow of the code.
1. Setup
Import the required modules and classes.
- Basics: numpy, pandas
- Import dataset: urllib.request, tarfile
- Splitting: train_test_split
- Preprocessing
  - Imputing: (Customized)
  - Scaling: MinMaxScaler
  - Encoding: OneHotEncoder
- Model: RandomForestRegressor
- Evaluation: mean_squared_error
2. Import dataset
Load the California housing dataset.
3. Split Train/Test dataset
Split the loaded housing dataset into training and test sets.
I set shuffle=True so that the training set and the test set have similar distributions.
test_size is set to 20% of the full dataset.
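Shuffling makes the two sets similar on average, but it does not guarantee it. If a stricter match on one important feature is wanted, a stratified split is an alternative. The sketch below is not what this post does: it uses synthetic incomes, and the bin edges and the column name `income_cat` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 'median_income' column (an assumption, not the real data).
rng = np.random.default_rng(0)
toy = pd.DataFrame({"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=1000)})

# Hypothetical income bins; each bin becomes one stratum.
toy["income_cat"] = pd.cut(
    toy["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)

# stratify= keeps the share of each bin (almost) identical in both sets.
train_set, test_set = train_test_split(
    toy, test_size=0.2, stratify=toy["income_cat"], random_state=42
)

train_share = train_set["income_cat"].value_counts(normalize=True).sort_index()
test_share = test_set["income_cat"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"train": train_share, "test": test_share}))
```

Comparing the two share columns is a quick way to check how similar the distributions really are, whichever splitting method is used.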
4. Data Preprocessing
4.1 Imputing
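Instead of using an existing imputer class, I implemented one myself.
- First, use the corr() method to find the feature most highly correlated with the feature whose missing values need to be filled.
- Compute the proportion between the feature with missing values and the feature found in the previous step.
- Use that proportion and the correlated feature to fill in the missing values.
This replaces the missing values with meaningful estimates.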
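The same idea can be written without explicit loops. The following is a minimal vectorized sketch with pandas, assuming the proportion is defined as in MyImputer (the sum of one column divided by the sum of the other, over rows where both are present). The helper names `fit_ratio` and `impute_with_ratio` and the toy numbers are made up for illustration.

```python
import numpy as np
import pandas as pd


def fit_ratio(target: pd.Series, reference: pd.Series) -> float:
    """Return sum(target) / sum(reference) over rows where both values are present."""
    both_present = target.notna() & reference.notna()
    return target[both_present].sum() / reference[both_present].sum()


def impute_with_ratio(target: pd.Series, reference: pd.Series, ratio: float) -> pd.Series:
    """Fill missing target values with round(reference * ratio)."""
    estimate = (reference * ratio).round()
    return target.fillna(estimate)


# Toy data (made up; not taken from the housing dataset).
toy = pd.DataFrame({
    "total_bedrooms": [100.0, np.nan, 300.0, 50.0],
    "households": [90.0, 200.0, 270.0, 45.0],
})

ratio = fit_ratio(toy["total_bedrooms"], toy["households"])
toy["total_bedrooms"] = impute_with_ratio(toy["total_bedrooms"], toy["households"], ratio)

print(ratio)  # 450 / 405
print(toy)
```

As in the loop version, the ratio should be fitted on the training set only and then reused unchanged on the test set.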
4.2 Scaling
Between standard scaling and min-max scaling, I chose min-max scaling, which I judged to be slightly better suited to this regression task.
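For reference, min-max scaling maps each value to (x - min) / (max - min), where min and max come from the training data. With its default feature range, MinMaxScaler applies this formula per column. Here is a small NumPy sketch on made-up numbers; the arrays and the helper name `min_max` are assumptions for illustration.

```python
import numpy as np

# Toy single-column training and test data (made-up values).
train = np.array([[2.0], [4.0], [10.0]])
test = np.array([[1.0], [6.0], [12.0]])

# Learn the minimum and maximum from the training data only.
col_min = train.min(axis=0)
col_max = train.max(axis=0)


def min_max(x: np.ndarray) -> np.ndarray:
    """Apply x' = (x - min) / (max - min) using the training statistics."""
    return (x - col_min) / (col_max - col_min)


scaled_train = min_max(train)
scaled_test = min_max(test)

print(scaled_train.ravel())  # training values land in [0, 1]
print(scaled_test.ravel())   # test values may fall outside [0, 1]
```

Because the statistics come from the training set, test values below the training minimum or above the training maximum are mapped outside the 0 to 1 range. That is expected and is not a sign of a bug.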
4.3 Encoding
One-hot encode the categorical feature 'ocean_proximity'.
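One detail worth knowing: with default settings, OneHotEncoder raises an error if the test set contains a category it never saw during fit. A rare category could, in principle, be missing from the training split. The sketch below shows the `handle_unknown="ignore"` option on toy arrays; the category values are borrowed from this dataset, but the arrays themselves are made up.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy category columns (made up); 'ISLAND' appears only in the test data.
train_cats = np.array([["INLAND"], ["NEAR BAY"], ["INLAND"], ["<1H OCEAN"]])
test_cats = np.array([["NEAR BAY"], ["ISLAND"]])

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error at transform time.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_cats)

categories = list(encoder.categories_[0])
encoded_test = encoder.transform(test_cats).toarray()

print(categories)
print(encoded_test)
```

The second test row comes out as all zeros, so the model simply sees "none of the known categories" for that sample.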
5. Model Training
Select a model and train it with the fit() method.
I chose RandomForestRegressor as the model.
6. Model Evaluation
Apply exactly the same preprocessing to the test set as was applied to the training set, then predict and evaluate with the trained model.
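Repeating each preprocessing step by hand for the test set is easy to get wrong. One alternative is to bundle the steps with Pipeline and ColumnTransformer, so that fit() learns everything from the training set and predict() reapplies it automatically. The following is a rough sketch under several assumptions: the data is synthetic, and SimpleImputer with a median strategy stands in for the custom ratio imputer used in this post.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Synthetic stand-in for the housing data (an assumption for illustration).
rng = np.random.default_rng(0)
n = 200
households = rng.integers(50, 500, size=n).astype(float)
data = pd.DataFrame({
    "households": households,
    "total_bedrooms": households * 1.1,
    "ocean_proximity": rng.choice(["INLAND", "NEAR BAY", "NEAR OCEAN"], size=n),
})
data.loc[::10, "total_bedrooms"] = np.nan  # inject some missing values
target = households * 100 + rng.normal(0, 50, size=n)

num_cols = ["households", "total_bedrooms"]
cat_cols = ["ocean_proximity"]

# Numeric columns: impute, then scale. Categorical column: one-hot encode.
# Median imputation is a swapped-in technique, not the post's ratio imputer.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", MinMaxScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric_steps, num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestRegressor(n_estimators=20, random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.2, random_state=42
)
model.fit(X_train, y_train)            # statistics are learned from the training set only
predictions = model.predict(X_test)    # the same transforms are reapplied to the test set
test_r2 = model.score(X_test, y_test)
print(round(test_r2, 3))
```

With this layout the test set never influences the imputation values, the scaling range, or the category list, which is the same guarantee the manual steps above aim for.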