ML: Supervised Learning - Classification Models
Introduction to Classification Models
Classification is one of the most important tasks in supervised learning, allowing models to categorize data into predefined classes. It is widely used in applications such as spam detection, medical diagnosis, fraud detection, and sentiment analysis. In this module, we explore fundamental classification algorithms, their applications, and how to evaluate their performance effectively.
Logistic Regression
Logistic Regression is a fundamental classification algorithm that estimates the probability of a given input belonging to a particular class using the sigmoid function. Unlike linear regression, it is designed for classification tasks, producing outputs between 0 and 1.
- The sigmoid function maps predictions to probabilities, ensuring values remain within the [0,1] range.
- Decision boundary: The model makes predictions by setting a threshold (e.g., 0.5) for classification.
- Handling imbalanced datasets: Techniques such as class weighting and oversampling can improve performance on skewed datasets (see the sketch after this list).
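A minimal sketch of these ideas, assuming a synthetic imbalanced dataset from make_classification; the 0.5 threshold and the class_weight="balanced" setting are illustrative choices, not requirements:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    # Maps any real-valued score into the (0, 1) probability range.
    return 1.0 / (1.0 + np.exp(-z))

# Toy imbalanced dataset: roughly 90% negatives, 10% positives.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# class_weight="balanced" reweights examples inversely to class frequency.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X, y)

# Reproduce the positive-class probability as the sigmoid of the linear
# score, then apply the conventional 0.5 decision threshold.
scores = X @ model.coef_.ravel() + model.intercept_
probs = sigmoid(scores)
preds = (probs >= 0.5).astype(int)
print((preds == model.predict(X)).all())  # True: matches the built-in predict
```

Moving the threshold away from 0.5 trades precision against recall, which is often useful on skewed data.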
Decision Trees and Random Forests
Decision Trees are simple yet powerful models that split data based on feature thresholds, forming a tree-like structure.
- Decision Making: At each node, the best feature is selected based on impurity measures such as Gini Impurity or Entropy.
- Overfitting & Pruning: Decision Trees tend to overfit; pruning techniques such as restricting maximum depth or requiring a minimum number of samples per split keep them in check (see the sketch after this list).
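As a rough illustration, the sketch below contrasts an unconstrained tree with a pruned one on the breast-cancer dataset used in the exercises later in this module; the limits max_depth=3 and min_samples_split=10 are arbitrary illustrative values:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# An unconstrained tree grows until its leaves are pure and tends to overfit.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pruned tree: limit depth and require enough samples before splitting.
pruned_tree = DecisionTreeClassifier(max_depth=3, min_samples_split=10,
                                     random_state=42).fit(X_train, y_train)

print(f"full tree:   train={full_tree.score(X_train, y_train):.3f}  test={full_tree.score(X_test, y_test):.3f}")
print(f"pruned tree: train={pruned_tree.score(X_train, y_train):.3f}  test={pruned_tree.score(X_test, y_test):.3f}")
```

A large gap between train and test accuracy for the unconstrained tree is the usual symptom of overfitting.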
Random Forests improve upon Decision Trees by combining multiple trees (ensemble learning), reducing overfitting and improving generalization.
- Bagging (Bootstrap Aggregating): Each tree is trained on a random bootstrap sample of the data, and their predictions are aggregated (by majority vote for classification).
- Feature Importance: Random Forests provide feature importance scores, useful for understanding which features most influence predictions (see the sketch after this list).
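A short sketch of both points, again assuming the breast-cancer dataset; the choice of 100 trees and the top-5 cutoff are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()

# 100 trees, each fit on a bootstrap sample of the rows (bagging).
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(data.data, data.target)

# Impurity-based importances, averaged over all trees; they sum to 1.
top = sorted(zip(data.feature_names, forest.feature_importances_),
             key=lambda pair: pair[1], reverse=True)[:5]
for name, score in top:
    print(f"{name}: {score:.3f}")
```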
Support Vector Machines (SVM)
Support Vector Machines (SVM) classify data by finding the optimal hyperplane that best separates different classes.
- Hyperplanes & Support Vectors: The algorithm maximizes the margin between classes; the support vectors are the critical data points lying closest to the separating hyperplane that define that margin.
- Kernel Trick: For non-linearly separable data, SVM uses kernel functions (e.g., polynomial, radial basis function) to map data into higher dimensions.
- Parameter Tuning: Choosing the right kernel, regularization strength (C), and gamma value is crucial for performance (see the sketch after this list).
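The sketch below puts these pieces together on a synthetic non-linear dataset (make_moons), tuning C and gamma for an RBF-kernel SVC with a small grid search; the grid values are illustrative, not recommended defaults:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two interleaved half-moons: not linearly separable in the original space.
X, y = make_moons(n_samples=400, noise=0.2, random_state=42)

# Feature scaling matters for SVMs; the RBF kernel handles the curved boundary.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = GridSearchCV(pipe,
                    {"svc__C": [0.1, 1, 10], "svc__gamma": [0.1, 1, 10]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```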
K-Nearest Neighbors (KNN)
KNN is a non-parametric, instance-based learning algorithm that classifies data based on the majority vote of its nearest neighbors.
- Classification by Majority Voting: The model assigns a label based on the most common class among its k-nearest neighbors.
- Choosing K: A smaller K leads to high variance (overfitting), while a larger K increases bias (underfitting).
- Distance Metrics: Common distance measures include Euclidean, Manhattan, and Minkowski distance (see the sketch after this list).
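A compact sketch comparing a few values of K with Euclidean distance, assuming the breast-cancer dataset; the candidate values of K are arbitrary:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# KNN is distance-based, so features are standardized before neighbor lookup.
for k in (1, 5, 15):
    knn = make_pipeline(StandardScaler(),
                        KNeighborsClassifier(n_neighbors=k, metric="euclidean"))
    scores = cross_val_score(knn, X, y, cv=5)
    print(f"k={k:2d}: mean CV accuracy = {scores.mean():.3f}")
```

Very small K typically gives jagged, high-variance boundaries, while very large K smooths them toward the majority class.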
Model Evaluation Metrics
To assess classification model performance, several complementary evaluation metrics are used (a combined sketch follows this list):
- Confusion Matrix: A table showing True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN).
- Precision & Recall: Precision measures how many predicted positives were correct, while recall measures how many actual positives were captured.
- F1-score: The harmonic mean of precision and recall, useful for imbalanced datasets.
- ROC Curve & AUC: The Receiver Operating Characteristic curve plots the True Positive Rate against the False Positive Rate, and AUC measures the area under the curve.
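The sketch below computes each of these metrics on a tiny hand-made example; the labels and scores are made up purely for illustration:

```python
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Toy ground truth, hard predictions, and predicted positive-class scores.
y_true  = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred  = [0, 0, 1, 0, 1, 1, 0, 1]
y_score = [0.1, 0.2, 0.6, 0.3, 0.9, 0.8, 0.4, 0.7]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("f1       :", f1_score(y_true, y_pred))         # harmonic mean of the two
print("roc auc  :", roc_auc_score(y_true, y_score))   # area under the ROC curve
```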
Hands-On Exercises
Exercise 1: Implementing Logistic Regression in Scikit-learn
Objective: Train a logistic regression model on a classification dataset and evaluate its performance.
Steps:
- Load the dataset:

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target
```

- Split data into training and test sets:

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

- Train a logistic regression model:

```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
```

- Evaluate model performance:

```python
from sklearn.metrics import classification_report

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
```
Exercise 2: Implementing Decision Trees and Random Forests
Objective: Compare Decision Trees and Random Forests on the same dataset, reusing the train/test split from Exercise 1.
Steps:
- Train a Decision Tree model:

```python
from sklearn.tree import DecisionTreeClassifier

dt_model = DecisionTreeClassifier()
dt_model.fit(X_train, y_train)
```

- Train a Random Forest model:

```python
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=100)
rf_model.fit(X_train, y_train)
```

- Evaluate and compare both models:

```python
from sklearn.metrics import accuracy_score

dt_pred = dt_model.predict(X_test)
rf_pred = rf_model.predict(X_test)
print(f"Decision Tree Accuracy: {accuracy_score(y_test, dt_pred) * 100:.2f}%")
print(f"Random Forest Accuracy: {accuracy_score(y_test, rf_pred) * 100:.2f}%")
```
Summary
- Covered major classification algorithms: Logistic Regression, Decision Trees, Random Forests, SVM, and KNN.
- Explored evaluation techniques such as Confusion Matrix, Precision-Recall, and ROC-AUC.
- Implemented classification models hands-on using Scikit-learn.
References
- Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. O’Reilly Media.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
- Scikit-Learn Documentation: https://scikit-learn.org/stable/