[Machine Learning] Linear Regression 3 (Regularization)

11 minute read


이번 포스팅에서는 회귀 문제와 함께 가중치 규제인 Ridge(L2)와 Lasso(L1) 규제에 대해 알아보도록 하겠습니다.


Setup

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

from sklearn.linear_model import LinearRegression, Ridge, Lasso



Linear Regression model (without Regularization)

  • n_features - number of features to generate
  • noise - standard deviation of the Gaussian noise added to the output
  • n_samples - number of samples


Make dataset

from sklearn.datasets import make_regression
X,y = make_regression(n_features=1, noise=10, n_samples=1000, random_state=42)   
X[:5], y[:5]
(array([[-1.75873949],
        [ 1.03184454],
        [-0.48760622],
        [ 0.18645431],
        [ 0.72576662]]),
 array([-32.77038605,   3.50459106, -17.93030767,  -3.99020124,
         13.10526434]))
plt.xlabel('Feature - X')
plt.ylabel('Target - y')
plt.scatter(X, y, s=5)

[그림: Feature(X)와 Target(y)의 산점도]


Regression

lr = LinearRegression()
lr.fit(X, y)
lr.coef_, lr.intercept_
(array([16.63354605]), 0.04526205905820929)
# Predicting using trained model
y_pred = lr.predict(X)


Plotting

  • Blue dots represent the actual target data
  • Orange dots represent the model's predictions
plt.scatter(X, y, s=5, label='training')
plt.scatter(X, y_pred, s=5, label='prediction')
plt.xlabel('Feature - X')
plt.ylabel('Target - y')
plt.legend()
plt.show()

[그림: 훈련 데이터(training)와 예측값(prediction) 산점도]



Regularized Regression Methods

여기서 살펴볼 Lasso(L1)와 Ridge(L2) 규제는 가중치 규제에 해당합니다.

가중치 규제가 필요한 이유는 무엇일까요?

이를 이해하기 위해서는 먼저 Overfitting(과대적합)과 Underfitting(과소적합)에 대해 알아야 합니다.


Overfitting/Underfitting

[그림: 과소적합과 과대적합 개념도]

쉽게 말하면, 과대적합과 과소적합은 다음과 같습니다.

  • 과대적합: 모델이 훈련 데이터에만 적응하여 높은 성능을 보이고, 테스트 데이터에 있어서는 낮은 성능을 보이는 경우 (분산이 크다 (high variance))
  • 과소적합: 모델이 훈련 데이터에 대한 적응도 부족하여 훈련 데이터와 테스트 데이터 모두에 있어서 낮은 성능을 보이는 경우 (편향이 크다 (high bias))
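
과대적합/과소적합 여부는 보통 훈련 데이터와 테스트 데이터의 성능을 비교해 판단합니다. 아래는 train_test_split으로 데이터를 나누어 점수를 비교하는 간단한 스케치입니다 (위에서 만든 X, y를 그대로 쓴다고 가정하며, 다항식 차수 등은 설명을 위한 예시 값입니다).

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for degree in (1, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    # 훈련 점수는 높은데 테스트 점수가 크게 낮으면 과대적합,
    # 둘 다 낮으면 과소적합을 의심할 수 있습니다.
    print(degree, model.score(X_train, y_train), model.score(X_test, y_test))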

[그림: 과소적합(high bias)과 과대적합(high variance) 예시]

이 중 더 중요한 이슈는 무엇일까요? 바로 과대 적합입니다.


과소 적합은 모델의 복잡도를 높이거나 특성을 추가하는 등 비교적 명확한 해결책이 제시되어 있지만, 과대 적합은 그것을 해결하는 데 명확한 정답이 존재하지 않습니다.

따라서 모델을 학습시키는 데 과대 적합 문제와 더 자주 마주하게 되고, 그것을 해결하기 위한 방법들도 많이 생겨나게 되었죠.

  • 더 다양한 데이터셋 수집 (More various data)
  • 모델을 단순화 (Simplify the model)
  • 일부 특성만을 이용 (Feature selection)
  • 데이터 보강 (Data augmentation)
  • 가중치 규제 (Regularization)
  • 조기 종료 (Early stopping)
  • 드롭아웃 (Dropouts)

이 중 여기서 살펴볼 방법은 가중치 규제입니다.


Regularization

[그림: 가중치 규제 (Ridge, Lasso, Elastic net)]

첫번째 규제는 Ridge(L2) 규제입니다. 이는 가중치의 L2 term을 최소화하는 것으로, 영향이 너무 큰 가중치를 줄이는 역할을 합니다.

두번째 규제는 Lasso(L1) 규제입니다. 이는 가중치의 L1 term을 최소화하는 것으로, 모든 가중치를 같은 크기만큼 줄입니다. 이 때 값이 작은 가중치들이 먼저 0이 되기 때문에, 영향이 작은 가중치를 무시하는 역할을 합니다.

세번째 규제는 Elastic net 규제로, L1과 L2 규제를 모두 사용하는 규제입니다.
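
세 규제의 목적 함수를 일반적으로 쓰이는 형태로 적어 보면 다음과 같습니다 (문헌에 따라 상수 배율이나 기호는 조금씩 다를 수 있습니다).

$$J_{\text{ridge}}(w) = \mathrm{MSE}(w) + \alpha \sum_{i=1}^{n} w_i^{2}$$

$$J_{\text{lasso}}(w) = \mathrm{MSE}(w) + \alpha \sum_{i=1}^{n} \lvert w_i \rvert$$

$$J_{\text{elastic}}(w) = \mathrm{MSE}(w) + \gamma\,\alpha \sum_{i=1}^{n} \lvert w_i \rvert + \frac{1-\gamma}{2}\,\alpha \sum_{i=1}^{n} w_i^{2}$$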


L1과 L2 규제의 경우 규제의 정도를 조절하기 위한 하이퍼파라미터 alpha가 존재하고, 엘라스틱 넷의 경우 두 규제의 혼합 비율을 조절하는 하이퍼파라미터(문헌에 따라 gamma, scikit-learn에서는 l1_ratio)가 추가로 존재합니다.
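
scikit-learn에서는 대략 다음과 같이 지정합니다 (아래 alpha, l1_ratio 값은 예시입니다).

from sklearn.linear_model import Ridge, Lasso, ElasticNet

# alpha가 클수록 규제가 강해집니다 (값은 예시)
ridge = Ridge(alpha=1.0)                     # L2 규제
lasso = Lasso(alpha=0.1)                     # L1 규제
enet  = ElasticNet(alpha=0.1, l1_ratio=0.5)  # L1:L2 혼합 비율을 l1_ratio로 지정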


그럼 이제 이 가중치 규제를 사용하면 예측이 어떤 식으로 변화하는 지 살펴보겠습니다.


Make outliers

outliers = y[950:] - 600; outliers
array([-620.72518918, -607.24456936, -602.35967987, -589.77927836,
       -606.97474711, -602.5249083 , -617.53354476, -581.41160958,
       -568.58829982, -588.48103465, -576.37267804, -608.20115427,
       -627.62758019, -629.16862648, -600.74874687, -603.71586107,
       -597.22691815, -589.89284288, -587.08855694, -612.90456844,
       -607.72930237, -606.38449017, -592.78147515, -564.34789926,
       -579.47960861, -596.20989757, -608.437806  , -595.54249235,
       -605.14184967, -585.14253937, -602.46852941, -591.20272709,
       -576.61995697, -624.29969481, -633.11859313, -584.0344489 ,
       -580.06411958, -602.36414388, -600.03658325, -598.97777085,
       -600.7449772 , -588.33620239, -610.44463741, -620.9629963 ,
       -613.84011222, -622.5064205 , -586.40905438, -591.93411712,
       -562.64219745, -611.96087644])
import numpy as np
y_out = np.append(y[:950], outliers)
plt.scatter(X, y_out, s=5)

[그림: 이상치가 추가된 데이터 산점도]


Regression without Regularization

lr = LinearRegression()
lr.fit(X, y_out)
y_out_pred = lr.predict(X)
plt.scatter(X, y_out, s=5, label='actual')
plt.scatter(X, y_out_pred, s=5, label='prediction with outliers')
plt.scatter(X, y_pred,s=5, c='k', label='prediction without outlier')
plt.legend() 
plt.title('Linear Regression')
Text(0.5, 1.0, 'Linear Regression')

[그림: 이상치 포함/미포함 선형 회귀 예측 비교]

lr.coef_, lr.intercept_
(array([14.75586098]), -29.918438428236247)


With Ridge(L2) Regularization

  • 영향이 너무 큰 가중치들을 줄인다.
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1000)
ridge.fit(X, y_out)
y_ridge_pred = ridge.predict(X)
plt.scatter(X, y_out,s=5,label='actual')
plt.scatter(X, y_out_pred,s=5, c='r' , label='LinearRegression with outliers')
plt.scatter(X, y_ridge_pred,s=5,c='k', label='RidgeRegression with outliers')
plt.legend()
plt.title('Ridge vs Linear Regression')
Text(0.5, 1.0, 'Ridge vs Linear Regression')

[그림: Ridge 규제 적용 시 예측 비교]

ridge.coef_, ridge.intercept_    # 기울기 coefficient(w) 가 값이 훨씬 작아짐.
(array([7.21930478]), -29.77274130320161)


With Lasso(L1) Regularization

  • 영향이 적은 가중치들을 무시한다.
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=1000)
lasso.fit(X, y_out)
y_lasso_pred = lasso.predict(X)
plt.scatter(X, y_out,s=5,label='actual')
plt.scatter(X, y_out_pred,s=5, c='r' , label='LinearRegression with outliers')
plt.scatter(X, y_ridge_pred,s=5,c='k', label='RidgeRegression with outliers')
plt.scatter(X, y_lasso_pred,s=5,c='y', label='LassoRegression with outliers')
plt.legend()
plt.title('Lasso vs Ridge vs Linear Regression')
Text(0.5, 1.0, 'Lasso vs Ridge vs Linear Regression')

[그림: Lasso 규제 적용 시 예측 비교]

lasso.coef_, lasso.intercept_
(array([0.]), -29.63317730014139)



Effects of alpha using Ridge on Coefficients

여기서는 가중치 규제의 하이퍼파라미터 alpha의 변화에 따라 가중치(계수)가 어떻게 달라지는지 보겠습니다.

Make dataset

X, y, w = make_regression(n_samples=1000, n_features=10, coef=True,
                          random_state=42, bias=3.5)
# w: The coefficient of the underlying linear model. It is returned only if coef is True.
w
array([32.12551734, 76.33080772, 33.6926875 ,  9.42759779,  5.16621758,
       58.28693612, 29.43481665,  7.18075454, 10.30191944, 75.31997019])


0.001 ~ 10^5 사이에서 로그 간격으로 200개의 alpha 값을 생성합니다.

alphas = np.logspace(-3, 5, 200)
alphas[:20], alphas[-20:]
(array([0.001     , 0.00109699, 0.00120338, 0.00132009, 0.00144812,
        0.00158857, 0.00174263, 0.00191164, 0.00209705, 0.00230043,
        0.00252354, 0.00276829, 0.00303677, 0.00333129, 0.00365438,
        0.00400881, 0.0043976 , 0.00482411, 0.00529198, 0.00580523]),
 array([ 17225.85965399,  18896.52339691,  20729.21779595,  22739.65752358,
         24945.0813523 ,  27364.39997075,  30018.35813576,  32929.71255097,
         36123.42699709,  39626.88638701,  43470.13158125,  47686.11697714,
         52310.99308056,  57384.41648302,  62949.88990222,  69055.13520162,
         75752.50258772,  83099.41949353,  91158.88299751, 100000.        ]))


릿지 규제를 적용합니다.

coefs = []
for a in alphas:
    ridge = Ridge(alpha=a) # 값이 큰 w 값을 작게!
    ridge.fit(X, y)
    coefs.append(ridge.coef_)
w
array([32.12551734, 76.33080772, 33.6926875 ,  9.42759779,  5.16621758,
       58.28693612, 29.43481665,  7.18075454, 10.30191944, 75.31997019])
  • alpha가 작을 때
coefs[:5]
[array([32.12547819, 76.33071941, 33.69264671,  9.42759217,  5.16620902,
        58.28688033, 29.43478769,  7.18074889, 10.30191419, 75.31989693]),
 array([32.12547439, 76.33071085, 33.69264275,  9.42759163,  5.16620819,
        58.28687492, 29.43478488,  7.18074834, 10.30191368, 75.31988982]),
 array([32.12547022, 76.33070145, 33.69263841,  9.42759103,  5.16620728,
        58.28686898, 29.4347818 ,  7.18074774, 10.30191312, 75.31988203]),
 array([32.12546566, 76.33069115, 33.69263365,  9.42759038,  5.16620628,
        58.28686247, 29.43477841,  7.18074708, 10.30191251, 75.31987348]),
 array([32.12546064, 76.33067984, 33.69262843,  9.42758966,  5.16620518,
        58.28685533, 29.43477471,  7.18074636, 10.30191184, 75.3198641 ])]
  • alpha가 클 때

alpha 값이 커지면 w 값이 굉장히 작아지는 것을 볼 수 있습니다.

coefs[-5:]
[array([0.36905501, 0.94856311, 0.38307061, 0.18174143, 0.03080575,
        0.87029109, 0.43753906, 0.11359349, 0.22092661, 1.11708021]),
 array([0.3367566 , 0.86565699, 0.34953102, 0.1659322 , 0.02807343,
        0.79439856, 0.3993917 , 0.10368337, 0.2017385 , 1.01966879]),
 array([0.30725841, 0.78992058, 0.31890168, 0.15147756, 0.02558406,
        0.72504039, 0.36452758, 0.09462722, 0.18418953, 0.93064419]),
 array([0.28032211, 0.72074665, 0.29093448, 0.1382649 , 0.02331595,
        0.66166797, 0.33267124, 0.08635322, 0.16814427, 0.84930233]),
 array([0.2557289 , 0.65757725, 0.26540173, 0.12619038, 0.02124935,
        0.60377642, 0.30356917, 0.07879532, 0.15347768, 0.77499522])]


alpha 값에 따른 가중치들의 값을 그래프로 그려보겠습니다.

ax = plt.gca()
# Get the current Axes instance on the current figure matching the given keyword 
# args, or create one.

ax.plot(alphas, coefs)
ax.set_xscale('log')
plt.xlabel('alpha')
plt.ylabel('weights')
plt.title('Ridge coefficients as a function of the regularization')
plt.show()

[그림: alpha에 따른 Ridge 계수(weights) 변화]

  • Conclusion
    • 큰 가중치들을 작게 만드는 것을 볼 수 있다.
    • As alpha tends toward zero the coefficients found by Ridge regression stabilize towards the randomly sampled vector w (similar to LinearRegression).
    • For big alpha (strong regularisation) the coefficients are smaller (eventually converging at 0) leading to a simpler and biased solution.



Effects of alpha using Lasso on Coefficients

라쏘 규제의 가중치 그래프도 그려보겠습니다.

# lasso 
X, y, w = make_regression(n_samples=1000, n_features=10, coef=True,
                          random_state=42, bias=3.5)

alphas = np.logspace(-3, 5, 200)
coefs = []
for a in alphas:
    lasso = Lasso(max_iter=10000, alpha=a) # 작은 값을 무시!
    lasso.fit(X, y)
    coefs.append(lasso.coef_)

ax = plt.gca()
# Get the current Axes instance on the current figure matching the given keyword 
# args, or create one.

ax.plot(alphas, coefs)
ax.set_xscale('log')
plt.xlabel('alpha')
plt.ylabel('weights')
plt.title('Lasso coefficients as a function of the regularization')
plt.show()

[그림: alpha에 따른 Lasso 계수(weights) 변화]

  • Conclusion
    • 작은 가중치들을 무시하는 것을 볼 수 있다.



Example

  • https://www.analyticsvidhya.com/blog/2016/01/ridge-lasso-regression-python-complete-tutorial/
  • RSS (residual sum of squares): the sum of squared errors
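
참고로 RSS를 수식으로 쓰면 다음과 같습니다.

$$\mathrm{RSS} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$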


Setup

import numpy as np
import pandas as pd
import random
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 12, 10


Make dataset

#Define input array with angles from 60deg to 300deg converted to radians
x = np.array([i*np.pi/180 for i in range(60,300,4)])
np.random.seed(10)  #Setting seed for reproducibility
y = np.sin(x) + np.random.normal(0,0.15,len(x))
data = pd.DataFrame(np.column_stack([x,y]),columns=['x','y'])
plt.plot(data['x'],data['y'],'.')

[그림: sin 함수에 노이즈를 더한 데이터 산점도]


Polynomial(Non-Linear) regression

with powers of x from 1 to 15

가중치 규제의 영향을 보기 위해 비선형 회귀를 진행합니다.

for i in range(2,16):  # power of 1 is already there
    colname = 'x_%d'%i      # new var will be x_power
    data[colname] = data['x'] ** i
data.head()
x y x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 x_10 x_11 x_12 x_13 x_14 x_15
0 1 1.1 1.1 1.1 1.2 1.3 1.3 1.4 1.4 1.5 1.6 1.7 1.7 1.8 1.9 2
1 1.1 1 1.2 1.4 1.6 1.7 1.9 2.2 2.4 2.7 3 3.4 3.8 4.2 4.7 5.3
2 1.2 0.7 1.4 1.7 2 2.4 2.8 3.3 3.9 4.7 5.5 6.6 7.8 9.3 11 13
3 1.3 0.95 1.6 2 2.5 3.1 3.9 4.9 6.2 7.8 9.8 12 16 19 24 31
4 1.3 1.1 1.8 2.3 3.1 4.1 5.4 7.2 9.6 13 17 22 30 39 52 69
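
참고로, 이렇게 거듭제곱 특성을 반복문으로 직접 만드는 대신 scikit-learn의 PolynomialFeatures를 사용할 수도 있습니다 (아래는 같은 특성을 만드는 대안 스케치입니다).

from sklearn.preprocessing import PolynomialFeatures

# x, x^2, ..., x^15 특성을 한 번에 생성 (절편 항은 제외)
poly = PolynomialFeatures(degree=15, include_bias=False)
X_poly = poly.fit_transform(data[['x']])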


from sklearn.linear_model import LinearRegression

def linear_regression(data, power, models_to_plot):
    # initialize predictors:
    predictors=['x']
    if power>=2:
        predictors.extend(['x_%d'%i for i in range(2,power+1)])
    
    # Fit the model
    # NOTE: normalize 인자는 scikit-learn 1.2에서 제거되었습니다. 최신 버전에서는
    # 인자를 빼거나 StandardScaler를 포함한 Pipeline으로 표준화하는 것이 좋습니다.
    linreg = LinearRegression(normalize=True)
    linreg.fit(data[predictors],data['y'])
    y_pred = linreg.predict(data[predictors])
    # Check if a plot is to be made for the entered power
    if power in models_to_plot:
        plt.subplot(models_to_plot[power])
        plt.tight_layout()
        plt.plot(data['x'],y_pred)
        plt.plot(data['x'],data['y'],'.')
        plt.title('Plot for power: %d'%power)
    
    # Return the result in pre-defined format
    rss = sum((y_pred-data['y'])**2)
    ret = [rss]
    ret.extend([linreg.intercept_])
    ret.extend(linreg.coef_)
    return ret
#Initialize a dataframe to store the results:
col = ['rss','intercept'] + ['coef_x_%d'%i for i in range(1,16)]
ind = ['model_pow_%d'%i for i in range(1,16)]
coef_matrix_simple = pd.DataFrame(index=ind, columns=col)

# Define the powers to plot
models_to_plot = {1:231,3:232,6:233,9:234,12:235,15:236}

#Iterate through all powers and assimilate results
for i in range(1,16):
    coef_matrix_simple.iloc[i-1,0:i+2] = linear_regression(data, power=i, models_to_plot=models_to_plot)

[그림: 차수(power)별 다항 회귀 적합 결과]

  • As the model complexity increases, the model tends to fit even smaller deviations in the training data set, possibly leading to overfitting.
  • The size of the coefficients increases exponentially with model complexity (this can be checked by printing coef_matrix_simple, as shown below).
  • What does a large coefficient signify? It means that we’re putting a lot of emphasis on that feature, i.e. the particular feature is a good predictor for the outcome. When it becomes too large, the algorithm starts modelling intricate relations to estimate the output and ends up overfitting to the particular training data.
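
위 경향은 앞에서 계수를 저장해 둔 coef_matrix_simple을 출력해 직접 확인할 수 있습니다 (출력 형식 지정은 가독성을 위한 예시입니다).

# 차수가 커질수록 계수의 절대값이 급격히 커지는 것을 볼 수 있습니다
pd.options.display.float_format = '{:,.2g}'.format
coef_matrix_simple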


Ridge

from sklearn.linear_model import Ridge
def ridge_regression(data, predictors, alpha, models_to_plot={}):
    #Fit the model
    # normalize 인자는 최신 scikit-learn에서 제거되었습니다 (위 NOTE 참고)
    ridgereg = Ridge(alpha=alpha,normalize=True)
    ridgereg.fit(data[predictors],data['y'])
    y_pred = ridgereg.predict(data[predictors])
    
    #Check if a plot is to be made for the entered alpha

    if alpha in models_to_plot:
        plt.subplot(models_to_plot[alpha])
        plt.tight_layout()
        plt.plot(data['x'],y_pred)
        plt.plot(data['x'],data['y'],'.')
        plt.title('Plot for alpha: %.3g'%alpha)
    
    #Return the result in pre-defined format
    rss = sum((y_pred-data['y'])**2)
    ret = [rss]
    ret.extend([ridgereg.intercept_])
    ret.extend(ridgereg.coef_)
    return ret
#Initialize predictors to be set of 15 powers of x
predictors=['x']
predictors.extend(['x_%d'%i for i in range(2,16)])

#Set the different values of alpha to be tested
alpha_ridge = [1e-15, 1e-10, 1e-8, 1e-4, 1e-3,1e-2, 1, 5, 10, 20]

#Initialize the dataframe for storing coefficients.
col = ['rss','intercept'] + ['coef_x_%d'%i for i in range(1,16)]
ind = ['alpha_%.2g'%alpha_ridge[i] for i in range(0,10)]
coef_matrix_ridge = pd.DataFrame(index=ind, columns=col)

models_to_plot = {1e-15:231, 1e-10:232, 1e-4:233, 1e-3:234, 1e-2:235, 1:236}
for i in range(10):
    coef_matrix_ridge.iloc[i,] = ridge_regression(data, predictors, alpha_ridge[i], models_to_plot)

[그림: alpha별 Ridge 회귀 적합 결과]

# Set the display format to be scientific for ease of analysis
pd.options.display.float_format = '{:,.2g}'.format
coef_matrix_ridge
rss intercept coef_x_1 coef_x_2 coef_x_3 coef_x_4 coef_x_5 coef_x_6 coef_x_7 coef_x_8 coef_x_9 coef_x_10 coef_x_11 coef_x_12 coef_x_13 coef_x_14 coef_x_15
alpha_1e-15 0.87 94 -3e+02 3.8e+02 -2.3e+02 65 0.56 -4.3 0.39 0.2 -0.028 -0.0069 0.0012 0.00019 -5.6e-05 4.1e-06 -7.8e-08
alpha_1e-10 0.92 11 -29 31 -15 2.9 0.17 -0.091 -0.011 0.002 0.00064 2.4e-05 -2e-05 -4.2e-06 2.2e-07 2.3e-07 -2.3e-08
alpha_1e-08 0.95 1.3 -1.5 1.7 -0.68 0.039 0.016 0.00016 -0.00036 -5.4e-05 -2.9e-07 1.1e-06 1.9e-07 2e-08 3.9e-09 8.2e-10 -4.6e-10
alpha_0.0001 0.96 0.56 0.55 -0.13 -0.026 -0.0028 -0.00011 4.1e-05 1.5e-05 3.7e-06 7.4e-07 1.3e-07 1.9e-08 1.9e-09 -1.3e-10 -1.5e-10 -6.2e-11
alpha_0.001 1 0.82 0.31 -0.087 -0.02 -0.0028 -0.00022 1.8e-05 1.2e-05 3.4e-06 7.3e-07 1.3e-07 1.9e-08 1.7e-09 -1.5e-10 -1.4e-10 -5.2e-11
alpha_0.01 1.4 1.3 -0.088 -0.052 -0.01 -0.0014 -0.00013 7.2e-07 4.1e-06 1.3e-06 3e-07 5.6e-08 9e-09 1.1e-09 4.3e-11 -3.1e-11 -1.5e-11
alpha_1 5.6 0.97 -0.14 -0.019 -0.003 -0.00047 -7e-05 -9.9e-06 -1.3e-06 -1.4e-07 -9.3e-09 1.3e-09 7.8e-10 2.4e-10 6.2e-11 1.4e-11 3.2e-12
alpha_5 14 0.55 -0.059 -0.0085 -0.0014 -0.00024 -4.1e-05 -6.9e-06 -1.1e-06 -1.9e-07 -3.1e-08 -5.1e-09 -8.2e-10 -1.3e-10 -2e-11 -3e-12 -4.2e-13
alpha_10 18 0.4 -0.037 -0.0055 -0.00095 -0.00017 -3e-05 -5.2e-06 -9.2e-07 -1.6e-07 -2.9e-08 -5.1e-09 -9.1e-10 -1.6e-10 -2.9e-11 -5.1e-12 -9.1e-13
alpha_20 23 0.28 -0.022 -0.0034 -0.0006 -0.00011 -2e-05 -3.6e-06 -6.6e-07 -1.2e-07 -2.2e-08 -4e-09 -7.5e-10 -1.4e-10 -2.5e-11 -4.7e-12 -8.7e-13


Lasso

from sklearn.linear_model import Lasso
def lasso_regression(data, predictors, alpha, models_to_plot={}):
    #Fit the model
    # normalize 인자는 최신 scikit-learn에서 제거되었습니다 (위 NOTE 참고)
    # max_iter는 정수로 지정합니다
    lassoreg = Lasso(alpha=alpha,normalize=True, max_iter=100000)
    lassoreg.fit(data[predictors],data['y'])
    y_pred = lassoreg.predict(data[predictors])
    
    #Check if a plot is to be made for the entered alpha
    if alpha in models_to_plot:
        plt.subplot(models_to_plot[alpha])
        plt.tight_layout()
        plt.plot(data['x'],y_pred)
        plt.plot(data['x'],data['y'],'.')
        plt.title('Plot for alpha: %.3g'%alpha)
    
    #Return the result in pre-defined format
    rss = sum((y_pred-data['y'])**2)
    ret = [rss]
    ret.extend([lassoreg.intercept_])
    ret.extend(lassoreg.coef_)
    return ret
#Initialize predictors to all 15 powers of x
predictors=['x']
predictors.extend(['x_%d'%i for i in range(2,16)])

#Define the alpha values to test
alpha_lasso = [1e-15, 1e-10, 1e-8, 1e-5,1e-4, 1e-3,1e-2, 1, 5, 10]

#Initialize the dataframe to store coefficients
col = ['rss','intercept'] + ['coef_x_%d'%i for i in range(1,16)]
ind = ['alpha_%.2g'%alpha_lasso[i] for i in range(0,10)]
coef_matrix_lasso = pd.DataFrame(index=ind, columns=col)

#Define the models to plot
models_to_plot = {1e-10:231, 1e-5:232,1e-4:233, 1e-3:234, 1e-2:235, 1:236}

#Iterate over the 10 alpha values:
for i in range(10):
    coef_matrix_lasso.iloc[i,] = lasso_regression(data, predictors, alpha_lasso[i], models_to_plot)

[그림: alpha별 Lasso 회귀 적합 결과]

coef_matrix_lasso
rss intercept coef_x_1 coef_x_2 coef_x_3 coef_x_4 coef_x_5 coef_x_6 coef_x_7 coef_x_8 coef_x_9 coef_x_10 coef_x_11 coef_x_12 coef_x_13 coef_x_14 coef_x_15
alpha_1e-15 0.96 0.22 1.1 -0.37 0.00089 0.0016 -0.00012 -6.4e-05 -6.3e-06 1.4e-06 7.8e-07 2.1e-07 4e-08 5.4e-09 1.8e-10 -2e-10 -9.2e-11
alpha_1e-10 0.96 0.22 1.1 -0.37 0.00088 0.0016 -0.00012 -6.4e-05 -6.3e-06 1.4e-06 7.8e-07 2.1e-07 4e-08 5.4e-09 1.8e-10 -2e-10 -9.2e-11
alpha_1e-08 0.96 0.22 1.1 -0.37 0.00077 0.0016 -0.00011 -6.4e-05 -6.3e-06 1.4e-06 7.8e-07 2.1e-07 4e-08 5.3e-09 2e-10 -1.9e-10 -9.3e-11
alpha_1e-05 0.96 0.5 0.6 -0.13 -0.038 -0 0 0 0 7.7e-06 1e-06 7.7e-08 0 0 0 -0 -7e-11
alpha_0.0001 1 0.9 0.17 -0 -0.048 -0 -0 0 0 9.5e-06 5.1e-07 0 0 0 -0 -0 -4.4e-11
alpha_0.001 1.7 1.3 -0 -0.13 -0 -0 -0 0 0 0 0 0 1.5e-08 7.5e-10 0 0 0
alpha_0.01 3.6 1.8 -0.55 -0.00056 -0 -0 -0 -0 -0 -0 -0 0 0 0 0 0 0
alpha_1 37 0.038 -0 -0 -0 -0 -0 -0 -0 -0 -0 -0 -0 -0 -0 -0 -0
alpha_5 37 0.038 -0 -0 -0 -0 -0 -0 -0 -0 -0 -0 -0 -0 -0 -0 -0
alpha_10 37 0.038 -0 -0 -0 -0 -0 -0 -0 -0 -0 -0 -0 -0 -0 -0 -0



정리

  • Conclusion
    • Ridge: It includes all (or none) of the features in the model. Thus, the major advantage of ridge regression is coefficient shrinkage and reducing model complexity.
    • Lasso: Along with shrinking coefficients, lasso performs feature selection as well. (Remember the ‘selection‘ in the lasso full-form) As we observed, some of the coefficients become exactly zero, which is equivalent to the particular feature being excluded from the model.
    • 하이퍼파라미터 alpha는 Ridge보다 Lasso에서 더 큰 영향을 보인다. Ridge에서는 1e-2 ~ 1e-4 정도의 값을 일반적으로 사용하는 데 반해, Lasso에서는 1e-2 이상의 값에서는 표현력을 잃는다.
  • Typical use cases:

    • Ridge: It is majorly used to prevent overfitting. Since it includes all the features, it is not very useful in case of exorbitantly high #features, say in millions, as it will pose computational challenges.
    • Lasso: Since it provides sparse solutions, it is generally the model of choice (or some variant of this concept) for modelling cases where the # of features are in millions or more. In such a case, getting a sparse solution is of great computational advantage as the features with zero coefficients can simply be ignored.
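
예를 들어, 위 예제의 데이터로 라쏘가 선택한(계수가 0이 아닌) 특성만 골라내려면 다음과 같이 할 수 있습니다 (alpha 값은 예시이며, normalize 없이 학습하므로 위 표의 결과와는 다를 수 있습니다).

from sklearn.linear_model import Lasso

lasso_sel = Lasso(alpha=1e-4, max_iter=100000)   # alpha 값은 예시입니다
lasso_sel.fit(data[predictors], data['y'])

# 계수가 0이 아닌 특성 = 라쏘가 선택한 특성
selected = [p for p, c in zip(predictors, lasso_sel.coef_) if c != 0]
print(selected)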
