Lasso regression, a linear regression technique that uses L1 regularization, adds a penalty term to the standard least squares objective to prevent overfitting and improve the model's interpretability. The penalty encourages the model to select only a subset of the available features by shrinking the coefficients of irrelevant or less important features toward zero.
Here's how lasso regression works:
- Standard Linear Regression: In linear regression, we aim to find a linear relationship between the input features (predictors) and the target variable. The model estimates the coefficients (weights) for each feature to minimize the sum of squared differences between the predicted and actual target values.
- Lasso Regularization: In lasso regression, an additional penalty term is added to the least squares objective function. This penalty is the sum of the absolute values of the coefficients multiplied by a regularization parameter (lambda or alpha). Mathematically, the lasso objective can be written as: Minimize (1/2) * RSS + lambda * sum(|coefficients|), where RSS is the residual sum of squares and the second term is the penalty.
- Coefficient Shrinkage: The presence of the penalty term in the objective function causes some of the coefficient estimates to shrink towards zero. As a result, lasso regression performs feature selection by driving the coefficients of irrelevant or less important features to zero, effectively eliminating them from the model.
- Regularization Parameter: The regularization parameter (lambda or alpha) controls the degree of regularization applied. A higher value of lambda increases the penalty, resulting in more aggressive coefficient shrinkage and feature elimination. Conversely, a lower value of lambda reduces the penalty, allowing more features to have non-zero coefficients.
- Sparse Models: One key advantage of lasso regression is that it tends to produce sparse models, automatically selecting a subset of the most relevant features. This is valuable when there are many predictors but only a few are truly informative (a short illustrative example follows this list).
- Interpretability: Another benefit of lasso regression is its interpretability. By shrinking the coefficients towards zero, it assigns less importance to irrelevant features, making the model easier to interpret and understand.
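To make the shrinkage and sparsity behavior concrete, here is a minimal sketch using scikit-learn's Lasso on synthetic data. The feature counts and the alpha values 0.1 and 10.0 are illustrative choices only and are not part of the analysis later in this post.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
# synthetic data: 10 predictors, only 3 of which actually influence the target
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)
# a larger alpha applies a stronger penalty and drives more coefficients to exactly zero
for alpha in (0.1, 10.0):
    coefs = Lasso(alpha=alpha).fit(X, y).coef_
    print('alpha =', alpha, '-> non-zero coefficients:', np.sum(coefs != 0))
    print(np.round(coefs, 2))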
Lasso regression has various applications, including feature selection, variable importance ranking, and regularization in high-dimensional data. It strikes a balance between prediction accuracy and model simplicity, making it a useful tool in situations where model interpretability and feature selection are important considerations.
Here is example code for running a lasso regression analysis.
# -*- coding: utf-8 -*-
"""
Created on Mon Dec 14 16:26:46 2015
@author: jrose01
"""
# from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoLarsCV
# Load the dataset
data = pd.read_csv("tree_addhealth.csv")
# upper-case all DataFrame column names
data.columns = map(str.upper, data.columns)
# Data Management
data_clean = data.dropna()
recode1 = {1: 1, 2: 0}
data_clean['MALE'] = data_clean['BIO_SEX'].map(recode1)
# select predictor variables and target variable as separate data sets
predvar = data_clean[['MALE', 'HISPANIC', 'WHITE', 'BLACK', 'NAMERICAN', 'ASIAN',
'AGE', 'ALCEVR1', 'ALCPROBS1', 'MAREVER1', 'COCEVER1', 'INHEVER1', 'CIGAVAIL', 'DEP1',
'ESTEEM1', 'VIOL1', 'PASSIST', 'DEVIANT1', 'GPA1', 'EXPEL1', 'FAMCONCT', 'PARACTV',
'PARPRES']]
target = data_clean.SCHCONN1
# standardize every predictor to have mean=0 and sd=1
from sklearn import preprocessing
predictors = predvar.copy()
for column in predictors.columns:
    predictors[column] = preprocessing.scale(predictors[column].astype('float64'))
# split data into train and test sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target,
test_size=.3, random_state=123)
# specify the lasso regression model
model = LassoLarsCV(cv=10, precompute=False).fit(pred_train, tar_train)
# print variable names and regression coefficients
print(dict(zip(predictors.columns, model.coef_)))
# plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')
# plot mean square error for each fold
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.mse_path_, ':')
plt.plot(m_log_alphascv, model.mse_path_.mean(axis=-1), 'k',
label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')
# MSE from training and test data
from sklearn.metrics import mean_squared_error
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print('training data MSE')
print(train_error)
print('test data MSE')
print(test_error)
# R-square from training and test data
rsquared_train = model.score(pred_train, tar_train)
rsquared_test = model.score(pred_test, tar_test)
print('training data R-square')
print(rsquared_train)
print('test data R-square')
print(rsquared_test)
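As a quick follow-up to the listing above, the fitted coefficients can be ranked by absolute magnitude to get a rough variable-importance ordering. This sketch assumes the model and predictors objects defined in the code above.
# rank predictors by the absolute size of their lasso coefficients;
# predictors whose coefficients were shrunk to exactly zero were dropped by the model
coef_table = pd.Series(model.coef_, index=predictors.columns)
print(coef_table.reindex(coef_table.abs().sort_values(ascending=False).index))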