[Machine Learning] Dimension Reduction

6 minute read


Dimensionality Reduction

  • Feature Elimination: You reduce the feature space by eliminating features. The disadvantage is that you gain no information from the features you have dropped.
  • Feature Selection: You apply statistical tests to rank the features by importance and then select a subset of them for your work. This again suffers from information loss and is less stable, since different tests give different importance scores to the features.
  • Feature Extraction: You create new independent features, where each new feature is a combination of the old independent features. These techniques can be further divided into linear and non-linear dimensionality reduction techniques.
  • t-SNE and PCA are feature extraction techniques.
  • https://www.datacamp.com/community/tutorials/introduction-t-sne


PCA (Principal Component Analysis)

Project the data onto the direction in which the variance is largest.

[Figures: projecting data onto the directions of maximum variance]
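Before the scikit-learn example, here is a minimal sketch of that idea (not part of the original notebook): the principal directions are the eigenvectors of the data's covariance matrix, sorted by eigenvalue, and the projection is a matrix product with the top eigenvectors.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_breast_cancer().data)

cov = np.cov(X, rowvar=False)                 # 30 x 30 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]             # largest variance first

X_2d = X @ eigvecs[:, order[:2]]              # project onto the top-2 directions
print(X_2d.shape)                             # (569, 2); matches PCA(n_components=2) up to sign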

  • breast cancer example
  • use only the most important features, chosen with SelectPercentile
  • t-SNE

Setup

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier 

from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA 
from sklearn.manifold import TSNE
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline
cancer = load_breast_cancer()
X_all = cancer.data
y = cancer.target 
X_all = StandardScaler().fit_transform(X_all)  # standardize all 30 features
X_all.shape
(569, 30)
cancer.feature_names
array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error',
       'fractal dimension error', 'worst radius', 'worst texture',
       'worst perimeter', 'worst area', 'worst smoothness',
       'worst compactness', 'worst concavity', 'worst concave points',
       'worst symmetry', 'worst fractal dimension'], dtype='<U23')


Using all 30 features

# baseline: random forest with 5-fold cross-validation on all 30 standardized features
rfc = RandomForestClassifier(n_estimators=200)
cross_val_score(rfc, X_all, y, cv=5).mean().round(4)
0.9631


Feature Selection

Why feature selection is important:

  • It enables the machine learning algorithm to train faster.
  • It reduces the complexity of a model and makes it easier to interpret.
  • It improves the accuracy of a model if the right subset is chosen.
  • It reduces overfitting.

SelectPercentile()

  • SelectPercentile(score_func, percentile): Select features according to a percentile of the highest scores.
    • score_func : callable. A function taking two arrays X and y and returning either a pair of arrays (scores, pvalues) or a single array of scores. The default is f_classif, which only works with classification tasks.
    • percentile : int, optional (default=10). Percent of features to keep.

Chi-squared statistics

  • A method that measures association and tells you whether an observed relationship arose by chance or whether the variables are genuinely related.
  • Chi2 test:
    • A statistical technique for measuring the association between two categorical variables.
    • Determines how related the observed values (input feature) are to the expected values (derived from the label).
    • Used here for the feature selection problem.
    • X^2 = sum_{i=1..k} (O_i - E_i)^2 / E_i, where
      • O_i: observed frequency in category i (from the input feature)
      • E_i: expected frequency in category i (derived from the label)
      • k: number of categories
    • When a feature and the label are independent, the observed count is close to the expected count, so the Chi-Square value is small.
    • A high Chi-Square value therefore indicates that the hypothesis of independence is incorrect.
    • In simple words, the higher the Chi-Square value, the more dependent the feature is on the response, so it can be selected for model training (a short worked sketch follows this list).
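A toy sketch of how these scores are computed, with made-up numbers; it mirrors how sklearn's chi2 sums the (non-negative) feature values per class, and the manual scores should agree with the library's output.

import numpy as np
from sklearn.feature_selection import chi2

# toy data: 6 samples, 2 non-negative features, binary labels (made up for illustration)
X = np.array([[1, 4], [2, 5], [1, 6], [5, 1], [6, 2], [4, 1]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# observed: per-class sum of each feature value
observed = np.array([X[y == c].sum(axis=0) for c in np.unique(y)])
# expected: class frequency times the overall feature total
expected = np.outer(np.bincount(y) / len(y), X.sum(axis=0))

chi2_manual = ((observed - expected) ** 2 / expected).sum(axis=0)
print(chi2_manual)      # manual chi-square scores
print(chi2(X, y)[0])    # sklearn's scores -- should agree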

Using only 6 features

from sklearn.feature_selection import SelectPercentile, chi2
fs = SelectPercentile(chi2, percentile = 20) # keep only the top 20% of features
sc = StandardScaler()
X_P = fs.fit_transform(cancer.data, y)
X_P = sc.fit_transform(X_P)
fs.get_support()   # 20% of 30 features -> 6 features selected
array([False, False,  True,  True, False, False, False, False, False,
       False, False, False, False,  True, False, False, False, False,
       False, False,  True, False,  True,  True, False, False, False,
       False, False, False])
cancer.feature_names[fs.get_support()]
array(['mean perimeter', 'mean area', 'area error', 'worst radius',
       'worst perimeter', 'worst area'], dtype='<U23')
cross_val_score(rfc, X_P, y).mean().round(4)
0.9315
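To see why exactly these six features survive, you can inspect the chi-square score the selector assigned to every feature; a quick sketch using the fitted fs from above.

# rank all 30 features by chi-square score; the six selected features should be at the top
pd.Series(fs.scores_, index=cancer.feature_names).sort_values(ascending=False).head(10)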


Using only 2 features

# select only the top 6% of features
fs = SelectPercentile(chi2, percentile = 6)
X_P = fs.fit_transform(cancer.data, y)
X_P = sc.fit_transform(X_P)
cancer.feature_names[fs.get_support()]
array(['mean area', 'worst area'], dtype='<U23')
cross_val_score(rfc, X_P, y).mean().round(4)
0.9174


# plot the first two original features, marked and colored by class
m = ['v', 'o']
c = ['r','b']
plt.figure(figsize=(8,6))
for i in range(len(y)):
    plt.scatter(cancer.data[:,0][i],cancer.data[:,1][i], marker=m[y[i]], c=c[y[i]], s=5)
plt.show()

[Scatter plot: first two original features, colored by class]


# the same plot using only the two features selected by chi2
m = ['v', 'o']
c = ['r','b']
plt.figure(figsize=(8,6))
for i in range(len(y)):
    plt.scatter(X_P[:,0][i],X_P[:,1][i], marker=m[y[i]], c=c[y[i]], s=5)
plt.show()

[Scatter plot: the two selected features, colored by class]


Feature Extraction

  • PCA(n_components): dimension reduction using Principal Component Analysis
  • a linear method

Using only 2 dimensions with PCA

pca = PCA(n_components=2)
pca_result = pca.fit_transform(X_all)
pca_result #  after dimensionality reduction, there usually isn’t a particular 
           # meaning assigned to each principal component. The new components are 
           # just the two main dimensions of variation.
array([[ 9.19283683,  1.94858307],
       [ 2.3878018 , -3.76817174],
       [ 5.73389628, -1.0751738 ],
       ...,
       [ 1.25617928, -1.90229671],
       [10.37479406,  1.67201011],
       [-5.4752433 , -0.67063679]])
# plot the data projected onto the first two principal components
m = ['v', 'o']
c = ['r','b']
plt.figure(figsize=(8,6))
for i in range(len(y)):
    plt.scatter(pca_result[:,0][i],pca_result[:,1][i], marker=m[y[i]], c=c[y[i]], s=5)
plt.show()

[Scatter plot: first two principal components, colored by class]


pca.components_.round(3) # the weight each of the original 30 features receives in each component (the loadings)
array([[ 0.219,  0.104,  0.228,  0.221,  0.143,  0.239,  0.258,  0.261,
         0.138,  0.064,  0.206,  0.017,  0.211,  0.203,  0.015,  0.17 ,
         0.154,  0.183,  0.042,  0.103,  0.228,  0.104,  0.237,  0.225,
         0.128,  0.21 ,  0.229,  0.251,  0.123,  0.132],
       [-0.234, -0.06 , -0.215, -0.231,  0.186,  0.152,  0.06 , -0.035,
         0.19 ,  0.367, -0.106,  0.09 , -0.089, -0.152,  0.204,  0.233,
         0.197,  0.13 ,  0.184,  0.28 , -0.22 , -0.045, -0.2  , -0.219,
         0.172,  0.144,  0.098, -0.008,  0.142,  0.275]])
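The same loadings are easier to read as a heatmap; a small sketch using the seaborn import from the setup.

# heatmap of the weight each original feature contributes to the two components
plt.figure(figsize=(12, 3))
sns.heatmap(pca.components_, cmap='coolwarm',
            xticklabels=cancer.feature_names, yticklabels=['PC1', 'PC2'])
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()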
pca.explained_variance_ratio_, sum(pca.explained_variance_ratio_) # how much of the variance each
                                                                  # principal component explains
(array([0.44272026, 0.18971182]), 0.6324320765155942)
cross_val_score(rfc, pca_result, y, cv=5).mean().round(4)
0.9315


Using only 6 dimensions with PCA

pca = PCA(n_components=6)
pca_result = pca.fit_transform(X_all)
cross_val_score(rfc, pca_result, y, cv=5).mean().round(4)
0.949

Better performance than the earlier SelectPercentile result (0.949 vs. 0.9315, each using six features/components).
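Instead of fixing the number of components by hand, you can let PCA keep just enough components to reach a target explained-variance ratio; a minimal sketch, with 95% as an arbitrary target.

# a float n_components in (0, 1) is interpreted as a target explained-variance ratio
pca95 = PCA(n_components=0.95)
X_95 = pca95.fit_transform(X_all)
print(X_95.shape[1], pca95.explained_variance_ratio_.sum().round(4))  # components kept, variance covered
cross_val_score(rfc, X_95, y, cv=5).mean().round(4)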



tSNE

Maps the data to a lower-dimensional space so that its distribution resembles the high-dimensional one, i.e. so that the KL divergence between the two is minimized. Uses the t-distribution.

[Figures: t-SNE illustration]

  • TSNE(n_components, perplexity, n_iter)
  • a non-linear method
  • Reduces high-dimensional data to a lower dimension; mainly used for data visualization.
  • Converts Euclidean distances between data points in the high-dimensional space into conditional probabilities that represent their similarities.

tSNE visualization

  • n_components: Dimension of the embedded space
  • perplexity: float, optional (default: 30). The perplexity is related to the number of nearest neighbors used in other manifold learning algorithms. Larger datasets usually require a larger perplexity; consider selecting a value between 5 and 50. Different values can produce significantly different results (a perplexity sweep is sketched after the scaled visualization below). Roughly, it is a measure of the effective number of neighbors of a data point x_i.
tsne = TSNE(n_components=2, verbose=1, perplexity=40, n_iter=1000)
tsne_results = tsne.fit_transform(cancer.data)
[t-SNE] Computing 121 nearest neighbors...
[t-SNE] Indexed 569 samples in 0.002s...
[t-SNE] Computed neighbors for 569 samples in 0.019s...
[t-SNE] Computed conditional probabilities for sample 569 / 569
[t-SNE] Mean sigma: 33.679708
[t-SNE] KL divergence after 250 iterations with early exaggeration: 49.179726
[t-SNE] KL divergence after 1000 iterations: 0.216705
# plot the 2-D t-SNE embedding of the unscaled data
m = ['v','o']
c = ['r','b']
plt.figure(figsize=(8,6))
for i in range(len(y)):
    plt.scatter(tsne_results[:,0][i],tsne_results[:,1][i], marker=m[y[i]], c=c[y[i]], s=5)
plt.show()

[Scatter plot: t-SNE embedding of the unscaled data, colored by class]


Scaling and tSNE visualization

tsne = TSNE(n_components=2, verbose=1, perplexity=40, n_iter=1000)
tsne_results = tsne.fit_transform(X_all)
[t-SNE] Computing 121 nearest neighbors...
[t-SNE] Indexed 569 samples in 0.002s...
[t-SNE] Computed neighbors for 569 samples in 0.043s...
[t-SNE] Computed conditional probabilities for sample 569 / 569
[t-SNE] Mean sigma: 1.522404
[t-SNE] KL divergence after 250 iterations with early exaggeration: 63.951508
[t-SNE] KL divergence after 1000 iterations: 0.852838
# plot the 2-D t-SNE embedding of the standardized data
m = ['v','o']
c = ['r','b']
plt.figure(figsize=(8,6))
for i in range(len(y)):
    plt.scatter(tsne_results[:,0][i],tsne_results[:,1][i], marker=m[y[i]], c=c[y[i]], s=5)
plt.show()

[Scatter plot: t-SNE embedding of the standardized data, colored by class]
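Because the embedding is sensitive to perplexity, a quick sweep over a few values is often worthwhile; a sketch on the standardized data, with illustrative perplexity values.

# compare a few perplexity values side by side (5, 30, 50 are illustrative choices)
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, p in zip(axes, [5, 30, 50]):
    emb = TSNE(n_components=2, perplexity=p).fit_transform(X_all)
    ax.scatter(emb[:, 0], emb[:, 1], c=y, cmap='coolwarm', s=5)
    ax.set_title('perplexity = %d' % p)
plt.show()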



MNIST dataset dimension reduction

from tensorflow.keras.datasets import mnist
import numpy as np
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.astype('float32') / 255.
x_test = x_test.astype('float32') / 255.
plt.imshow(x_train[0])

[Image: the first MNIST training digit]

# flatten each 28x28 image into a 784-dimensional vector
x_train = x_train.reshape((len(x_train), np.prod(x_train.shape[1:])))
x_test = x_test.reshape((len(x_test), np.prod(x_test.shape[1:])))
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
(60000, 784) (10000, 784) (60000,) (10000,)


PCA

pca = PCA(n_components = 2) # reduce to 2 dimensions
pca_result = pca.fit_transform(x_train) 

plt.figure(figsize=(8,6))
plt.scatter(pca_result[:, 0], pca_result[:, 1], c=y_train, cmap='jet', alpha=0.5, s=3)
plt.colorbar()
plt.show()

[Scatter plot: MNIST projected to 2-D with PCA, colored by digit]


TSNE

x_train = x_train[:6000]   # the full 60,000 samples are too large for t-SNE; use a subset
y_train = y_train[:6000]

tsne = TSNE(n_components = 2, verbose=1, perplexity=40, n_iter=1000)
tsne_result = tsne.fit_transform(x_train)

plt.figure(figsize=(8,6))
plt.scatter(tsne_result[:, 0], tsne_result[:, 1], c=y_train, s=5)
plt.colorbar()
plt.show()
[t-SNE] Computing 121 nearest neighbors...
[t-SNE] Indexed 6000 samples in 0.004s...
[t-SNE] Computed neighbors for 6000 samples in 0.939s...
[t-SNE] Computed conditional probabilities for sample 1000 / 6000
[t-SNE] Computed conditional probabilities for sample 2000 / 6000
[t-SNE] Computed conditional probabilities for sample 3000 / 6000
[t-SNE] Computed conditional probabilities for sample 4000 / 6000
[t-SNE] Computed conditional probabilities for sample 5000 / 6000
[t-SNE] Computed conditional probabilities for sample 6000 / 6000
[t-SNE] Mean sigma: 2.277370
[t-SNE] KL divergence after 250 iterations with early exaggeration: 81.262482
[t-SNE] KL divergence after 1000 iterations: 1.476199

[Scatter plot: t-SNE embedding of 6,000 MNIST digits, colored by digit]

t-SNE takes much longer than PCA, but the resulting visualization separates the classes correspondingly better.
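One common way to cut the t-SNE runtime on data this wide is to compress it with PCA first and run t-SNE on the compressed vectors; a sketch, where the 50-component target is an arbitrary choice.

# reduce 784 dimensions to 50 with PCA, then embed with t-SNE (50 is an arbitrary choice)
x_compressed = PCA(n_components=50).fit_transform(x_train)   # x_train is the 6,000-sample subset
tsne_result = TSNE(n_components=2, verbose=1, perplexity=40).fit_transform(x_compressed)

plt.figure(figsize=(8,6))
plt.scatter(tsne_result[:, 0], tsne_result[:, 1], c=y_train, s=5)
plt.colorbar()
plt.show()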
