ML: Supervised Learning - Regression Models

Introduction to Regression Models

Regression is one of the most fundamental techniques in supervised learning, where the goal is to predict a continuous target variable based on input features. Unlike classification, where outputs are discrete labels, regression outputs numerical values. It’s used in countless real-world applications such as predicting house prices, stock market trends, and customer spending behavior.

Key Assumptions of Regression Models

  • Linearity: The relationship between independent and dependent variables should be linear.
  • Independence: Observations should be independent of each other.
  • Homoscedasticity: The variance of residuals should be constant across all levels of the independent variable.
  • Normality: Residuals should be normally distributed (a quick residual check is sketched below).
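
A quick, informal way to check the homoscedasticity and normality assumptions is to inspect a fitted model's residuals. The sketch below is illustrative: model, X, and y are placeholder names for an already-fitted Scikit-learn regressor and its data (with a 1-D target), not objects defined above.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Residuals = observed - predicted
fitted = model.predict(X)
residuals = y - fitted

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Residuals vs. fitted values: an even, patternless scatter suggests homoscedasticity
ax1.scatter(fitted, residuals, alpha=0.5)
ax1.axhline(0, color="red", linestyle="--")
ax1.set_xlabel("Fitted values")
ax1.set_ylabel("Residuals")

# Normal Q-Q plot: points near the line suggest approximately normal residuals
stats.probplot(residuals, plot=ax2)

plt.tight_layout()
plt.show()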

Simple and Multiple Linear Regression

Linear regression is the most straightforward regression technique. With a single input feature it is called simple linear regression; with several features it is multiple linear regression. Both assume a linear relationship between the inputs (X) and the output (Y), which can be expressed as:

Y = b0 + b1X1 + b2X2 + … + bnXn + ε

Where b0 is the intercept, b1...bn are coefficients, and ε is the error term.

Implementing Linear Regression with Scikit-learn

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load dataset
data = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

# Train model
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluation
print(f"R² Score: {r2_score(y_test, y_pred):.2f}")
print(f"RMSE: {mean_squared_error(y_test, y_pred) ** 0.5:.2f}")
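
The fitted parameters map directly onto the regression equation above: model.intercept_ is b0, and model.coef_ holds b1 through bn, one coefficient per feature.

# Inspect the fitted parameters
print(f"Intercept (b0): {model.intercept_:.2f}")
print("Coefficients (b1..bn):", [round(c, 2) for c in model.coef_])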

Polynomial Regression

Linear regression often fails to capture non-linear relationships. Polynomial regression addresses this by extending linear regression with polynomial features: the model remains linear in its coefficients, but the transformed features allow it to fit curved trends.

Implementing Polynomial Regression in Python

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic data
np.random.seed(42)
X = np.sort(2 * np.random.rand(100, 1), axis=0)
y = 2 + X + X**2 + 0.5 * np.random.randn(100, 1)

# Train polynomial regression model
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)

# Predictions on a smooth grid for plotting
# (named X_plot so the diabetes X_test used later is not overwritten)
X_plot = np.linspace(0, 2, 100).reshape(-1, 1)
y_plot = poly_model.predict(X_plot)

# Plot
plt.scatter(X, y, label="Data")
plt.plot(X_plot, y_plot, color='red', label="Polynomial Regression")
plt.legend()
plt.show()
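
To make the motivation concrete, the degree-2 model can be compared against a straight-line fit on the same synthetic data. This is a rough in-sample sanity check rather than a proper evaluation:

from sklearn.metrics import mean_squared_error

# Straight-line fit vs. the degree-2 pipeline on the training data
lin_fit = LinearRegression().fit(X, y)
rmse_linear = mean_squared_error(y, lin_fit.predict(X)) ** 0.5
rmse_poly = mean_squared_error(y, poly_model.predict(X)) ** 0.5
print(f"Linear RMSE: {rmse_linear:.3f}, Polynomial RMSE: {rmse_poly:.3f}")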

Ridge and Lasso Regression for Regularization

Regularization techniques help prevent overfitting by adding penalty terms to the regression model.

  • Ridge Regression (L2 Regularization): Adds the squared magnitude of the coefficients as a penalty term.
  • Lasso Regression (L1 Regularization): Adds the absolute values of the coefficients as a penalty term, which can shrink some coefficients exactly to zero (implicit feature selection). Both objectives are written out below.
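
In scikit-learn's parameterization, with regularization strength \( \alpha \ge 0 \), the two objectives are:

Ridge: \( \min_{w} \; \|y - Xw\|_2^2 + \alpha \|w\|_2^2 \)

Lasso: \( \min_{w} \; \frac{1}{2n} \|y - Xw\|_2^2 + \alpha \|w\|_1 \)

Larger \( \alpha \) shrinks coefficients more aggressively; \( \alpha = 0 \) recovers ordinary least squares.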

Implementing Ridge and Lasso Regression

from sklearn.linear_model import Ridge, Lasso

# Reuses the diabetes train/test split from the linear regression example.
# alpha sets the regularization strength: larger values shrink coefficients more.
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)

lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)

print(f"Ridge R²: {ridge.score(X_test, y_test):.2f}")
print(f"Lasso R²: {lasso.score(X_test, y_test):.2f}")
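
Because the L1 penalty can drive coefficients exactly to zero, it is worth checking how many features the Lasso model actually kept:

# Count the surviving (non-zero) Lasso coefficients
kept = (lasso.coef_ != 0).sum()
print(f"Lasso kept {kept} of {len(lasso.coef_)} features")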

Model Evaluation Metrics

Evaluating regression models is crucial to understanding their performance. Commonly used metrics include:

Metric | Formula | Interpretation
------ | ------- | --------------
Mean Squared Error (MSE) | \( \frac{1}{n} \sum_i (y_i - \hat{y}_i)^2 \) | Penalizes large errors heavily
Root Mean Squared Error (RMSE) | \( \sqrt{\text{MSE}} \) | Easier to interpret, since it has the same unit as Y
R² Score | \( 1 - \frac{SS_{res}}{SS_{tot}} \) | Fraction of the variance in Y explained by the model
Adjusted R² | \( 1 - (1 - R^2) \frac{n - 1}{n - p - 1} \) | Adjusts R² for the number of predictors p; more reliable for multiple regression
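
All of these can be computed from the test-set predictions of the linear regression example above (reusing y_test, y_pred, and X_test from the diabetes split). Scikit-learn has no built-in adjusted R², so it is computed by hand here:

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

# Adjusted R² from n samples and p predictors
n, p = X_test.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"MSE: {mse:.2f}, RMSE: {rmse:.2f}, R²: {r2:.2f}, Adjusted R²: {adj_r2:.2f}")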

Hands-On Exercises

Exercise 1: Implementing Linear Regression

Objective: Train and evaluate a linear regression model using Scikit-learn.

Steps:

  1. Load a regression dataset (e.g., the diabetes dataset).
  2. Split the data into training and test sets.
  3. Fit a LinearRegression model on the training set.
  4. Evaluate the model on the test set using RMSE and R².

Exercise 2: Implementing Polynomial Regression

Objective: Train and compare polynomial regression models.

Steps:

  1. Generate a synthetic dataset with non-linear relationships.
  2. Transform features using polynomial features.
  3. Train polynomial regression models of different degrees.
  4. Compare performance using RMSE and R² (a solution sketch follows these steps).
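
One way Exercise 2 might be tackled is sketched below; the sine-based dataset and the particular degrees compared are illustrative choices, not prescribed by the exercise:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Step 1: synthetic data with a non-linear relationship
rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=(200, 1))
y = np.sin(2 * X).ravel() + 0.1 * rng.standard_normal(200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Steps 2-4: transform, train, and compare models of different degrees
for degree in (1, 2, 3, 5):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    rmse = mean_squared_error(y_te, pred) ** 0.5
    print(f"degree={degree}: RMSE={rmse:.3f}, R²={r2_score(y_te, pred):.3f}")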

Summary

  • Explored different types of regression models.
  • Implemented regularization techniques to prevent overfitting.
  • Evaluated model performance using key metrics.
