ML: Supervised Learning - Regression Models
Introduction to Regression Models
Regression is one of the most fundamental techniques in supervised learning, where the goal is to predict a continuous target variable based on input features. Unlike classification, where outputs are discrete labels, regression outputs numerical values. It’s used in countless real-world applications such as predicting house prices, stock market trends, and customer spending behavior.
Key Assumptions of Regression Models
- Linearity: The relationship between independent and dependent variables should be linear.
- Independence: Observations should be independent of each other.
- Homoscedasticity: The variance of residuals should be constant across all levels of the independent variable.
- Normality: Residuals should be normally distributed (a quick diagnostic for the last two assumptions is sketched below).
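These assumptions are worth checking before trusting a fitted model. Below is a minimal diagnostic sketch, assuming the scikit-learn diabetes dataset that is also used later in this section: a residuals-vs-predictions plot (a roughly constant spread suggests homoscedasticity) and a residual histogram (a roughly bell shape suggests normality).
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
# Fit a quick linear model and compute residuals
data = load_diabetes()
model = LinearRegression().fit(data.data, data.target)
predictions = model.predict(data.data)
residuals = data.target - predictions
# Left: residuals vs. predictions (check for constant spread)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(predictions, residuals, alpha=0.5)
ax1.axhline(0, color='red')
ax1.set(xlabel="Predicted value", ylabel="Residual", title="Residuals vs. Predictions")
# Right: residual histogram (check for a roughly normal shape)
ax2.hist(residuals, bins=30)
ax2.set(xlabel="Residual", title="Residual Distribution")
plt.show()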
Simple and Multiple Linear Regression
Linear regression is the most straightforward regression technique. It assumes a linear relationship between input variables (X) and output (Y), which can be expressed as:
\(Y = b_0 + b_1X_1 + b_2X_2 + \dots + b_nX_n + \varepsilon\)
where \(b_0\) is the intercept, \(b_1, \dots, b_n\) are the coefficients, and \(\varepsilon\) is the error term.
Implementing Linear Regression with Scikit-learn
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Load dataset
data = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
# Predictions
y_pred = model.predict(X_test)
# Evaluation
print(f"R² Score: {r2_score(y_test, y_pred):.2f}")
print(f"RMSE: {mean_squared_error(y_test, y_pred, squared=False):.2f}")Polynomial Regression
Polynomial Regression
Linear regression often fails to capture non-linear relationships. Polynomial regression addresses this by extending linear regression with polynomial features: the model remains linear in its coefficients, but it operates on powers of the original inputs.
Implementing Polynomial Regression in Python
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
import numpy as np
import matplotlib.pyplot as plt
# Generate synthetic data
np.random.seed(42)
X = np.sort(2 * np.random.rand(100, 1), axis=0)
y = 2 + X + X**2 + 0.5 * np.random.randn(100, 1)
# Train polynomial regression model
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)
# Predictions
X_plot = np.linspace(0, 2, 100).reshape(-1, 1)  # separate name so the earlier X_test split is not overwritten
y_plot = poly_model.predict(X_plot)
# Plot
plt.scatter(X, y, label="Data")
plt.plot(X_plot, y_plot, color='red', label="Polynomial Regression (degree 2)")
plt.legend()
plt.show()
Ridge and Lasso Regression for Regularization
Regularization techniques help prevent overfitting by adding a penalty on coefficient size to the least-squares objective.
- Ridge Regression (L2 Regularization): Penalizes the squared magnitude of the coefficients, shrinking them toward zero without eliminating any.
- Lasso Regression (L1 Regularization): Penalizes the absolute value of the coefficients, which can shrink some coefficients exactly to zero and thereby perform feature selection. The corresponding objective functions are sketched below.
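In equation form, both methods minimize the residual sum of squares plus a penalty scaled by a hyperparameter \(\alpha\). This is the standard textbook formulation; scikit-learn's implementations scale the terms slightly differently, so treat it as a sketch of the idea rather than the library's exact objective:
\[
\text{Ridge:} \quad \min_b \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{p} b_j^2
\]
\[
\text{Lasso:} \quad \min_b \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{p} |b_j|
\]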
Implementing Ridge and Lasso Regression
from sklearn.linear_model import Ridge, Lasso
# alpha controls the penalty strength: larger values shrink coefficients more
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
print(f"Ridge R²: {ridge.score(X_test, y_test):.2f}")
print(f"Lasso R²: {lasso.score(X_test, y_test):.2f}")Model Evaluation Metrics
Model Evaluation Metrics
Evaluating regression models is crucial for understanding their performance. Some commonly used metrics are:
| Metric | Formula | Interpretation |
|---|---|---|
| Mean Squared Error (MSE) | \(\frac{1}{n} \sum (y_i - \hat{y}_i)^2\) | Penalizes large errors heavily |
| Root Mean Squared Error (RMSE) | \(\sqrt{MSE}\) | Easier to interpret, as it has the same units as Y |
| R² Score | \(1 - \frac{SS_{res}}{SS_{tot}}\) | Fraction of variance explained by the model |
| Adjusted R² | \(1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}\) | Penalizes R² for the number of predictors p; more reliable for multiple regression |
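To connect the formulas to code, here is a minimal sketch computing each metric by hand with NumPy; the toy arrays and the predictor count p are made up purely for illustration:
import numpy as np
# Toy values purely for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.8, 5.3, 6.6, 9.4])
n, p = len(y_true), 2  # n samples, p (hypothetical) predictors
mse = np.mean((y_true - y_hat) ** 2)
rmse = np.sqrt(mse)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f"MSE: {mse:.3f}, RMSE: {rmse:.3f}, R²: {r2:.3f}, Adjusted R²: {adj_r2:.3f}")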
Hands-On Exercises
Exercise 1: Implementing Linear Regression
Objective: Train and evaluate a linear regression model using Scikit-learn.
Steps:
- Load a dataset and split it into training and test sets.
- Fit a LinearRegression model on the training data.
- Evaluate predictions on the test set using R² and RMSE.
Exercise 2: Implementing Polynomial Regression
Objective: Train and compare polynomial regression models.
Steps:
- Generate a synthetic dataset with non-linear relationships.
- Transform features using polynomial features.
- Train polynomial regression models of different degrees.
- Compare performance using RMSE and R² (a starter sketch follows below).
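A starter sketch for Exercise 2, reusing the synthetic data recipe from the polynomial regression example above (the degrees compared are an arbitrary choice):
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# Synthetic non-linear data, as in the polynomial regression example
rng = np.random.RandomState(42)
X = np.sort(2 * rng.rand(100, 1), axis=0)
y = (2 + X + X**2 + 0.5 * rng.randn(100, 1)).ravel()
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit and compare models of increasing degree
for degree in [1, 2, 5]:
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(X_tr, y_tr)
    y_hat = model.predict(X_te)
    rmse = mean_squared_error(y_te, y_hat) ** 0.5
    print(f"degree={degree}: RMSE={rmse:.3f}, R²={r2_score(y_te, y_hat):.3f}")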
Summary
- Explored different types of regression models.
- Implemented regularization techniques to prevent overfitting.
- Evaluated model performance using key metrics.