ML: Supervised Learning - Classification Models
Introduction to Classification Models
Classification is one of the most important tasks in supervised learning, allowing models to categorize data into predefined classes. It is widely used in applications such as spam detection, medical diagnosis, fraud detection, and sentiment analysis. In this module, we explore fundamental classification algorithms, their applications, and how to evaluate their performance effectively.
Logistic Regression
Logistic Regression is a fundamental classification algorithm that estimates the probability of a given input belonging to a particular class using the sigmoid function. Unlike linear regression, it is designed for classification tasks, producing outputs between 0 and 1.
- The sigmoid function maps predictions to probabilities, ensuring values remain within the [0,1] range.
- Decision boundary: The model makes predictions by setting a threshold (e.g., 0.5) for classification.
- Handling imbalanced datasets: Techniques such as class weighting and oversampling can improve performance on skewed datasets (see the sketch after this list).
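A minimal sketch of these ideas, assuming a synthetic imbalanced dataset from make_classification; the 0.5 threshold and the class_weight="balanced" setting are illustrative choices, not requirements:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    # Maps any real-valued score into the (0, 1) probability range.
    return 1.0 / (1.0 + np.exp(-z))

# Toy imbalanced dataset: roughly 90% negatives, 10% positives.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# class_weight="balanced" reweights examples inversely to class frequency.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X, y)

# Reproduce the positive-class probability as the sigmoid of the linear
# score, then apply the conventional 0.5 decision threshold.
scores = X @ model.coef_.ravel() + model.intercept_
probs = sigmoid(scores)
preds = (probs >= 0.5).astype(int)
print((preds == model.predict(X)).all())  # True: matches the built-in predict
```

Moving the threshold away from 0.5 trades precision against recall, which is often useful on skewed data.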
Decision Trees and Random Forests
Decision Trees are simple yet powerful models that split data based on feature thresholds, forming a tree-like structure.
- Decision Making: At each node, the best feature is selected based on impurity measures such as Gini Impurity or Entropy.
- Overfitting & Pruning: Decision Trees tend to overfit; pruning techniques such as restricting maximum depth or requiring a minimum number of samples per split keep them in check (see the sketch after this list).
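As a rough illustration, the sketch below contrasts an unconstrained tree with a pruned one on the breast-cancer dataset used in the exercises later in this module; the limits max_depth=3 and min_samples_split=10 are arbitrary illustrative values:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# An unconstrained tree grows until its leaves are pure and tends to overfit.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pruned tree: limit depth and require enough samples before splitting.
pruned_tree = DecisionTreeClassifier(max_depth=3, min_samples_split=10,
                                     random_state=42).fit(X_train, y_train)

print(f"full tree:   train={full_tree.score(X_train, y_train):.3f}  test={full_tree.score(X_test, y_test):.3f}")
print(f"pruned tree: train={pruned_tree.score(X_train, y_train):.3f}  test={pruned_tree.score(X_test, y_test):.3f}")
```

A large gap between train and test accuracy for the unconstrained tree is the usual symptom of overfitting.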
Random Forests improve upon Decision Trees by combining multiple trees (ensemble learning), reducing overfitting and improving generalization.
- Bagging (Bootstrap Aggregating): Each tree is trained on a random bootstrap sample of the data, and their predictions are aggregated (by majority vote for classification).
- Feature Importance: Random Forests provide feature importance scores, useful for understanding which features most influence predictions (see the sketch after this list).
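A short sketch of both points, again assuming the breast-cancer dataset; the choice of 100 trees and the top-5 cutoff are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()

# 100 trees, each fit on a bootstrap sample of the rows (bagging).
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(data.data, data.target)

# Impurity-based importances, averaged over all trees; they sum to 1.
top = sorted(zip(data.feature_names, forest.feature_importances_),
             key=lambda pair: pair[1], reverse=True)[:5]
for name, score in top:
    print(f"{name}: {score:.3f}")
```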
Support Vector Machines (SVM)
Support Vector Machines (SVM) classify data by finding the optimal hyperplane that best separates different classes.
- Hyperplanes & Support Vectors: The algorithm maximizes the margin between classes; the support vectors are the critical data points lying closest to the separating hyperplane that define that margin.
- Kernel Trick: For non-linearly separable data, SVM uses kernel functions (e.g., polynomial, radial basis function) to map data into higher dimensions.
- Parameter Tuning: Choosing the right kernel, regularization strength (C), and gamma value is crucial for performance (see the sketch after this list).
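The sketch below puts these pieces together on a synthetic non-linear dataset (make_moons), tuning C and gamma for an RBF-kernel SVC with a small grid search; the grid values are illustrative, not recommended defaults:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two interleaved half-moons: not linearly separable in the original space.
X, y = make_moons(n_samples=400, noise=0.2, random_state=42)

# Feature scaling matters for SVMs; the RBF kernel handles the curved boundary.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = GridSearchCV(pipe,
                    {"svc__C": [0.1, 1, 10], "svc__gamma": [0.1, 1, 10]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```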
K-Nearest Neighbors (KNN)
KNN is a non-parametric, instance-based learning algorithm that classifies data based on the majority vote of its nearest neighbors.
- Classification by Majority Voting: The model assigns a label based on the most common class among its k-nearest neighbors.
- Choosing K: A smaller K leads to high variance (overfitting), while a larger K increases bias (underfitting).
- Distance Metrics: Common distance measures include Euclidean, Manhattan, and Minkowski distance (see the sketch after this list).
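A compact sketch comparing a few values of K with Euclidean distance, assuming the breast-cancer dataset; the candidate values of K are arbitrary:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# KNN is distance-based, so features are standardized before neighbor lookup.
for k in (1, 5, 15):
    knn = make_pipeline(StandardScaler(),
                        KNeighborsClassifier(n_neighbors=k, metric="euclidean"))
    scores = cross_val_score(knn, X, y, cv=5)
    print(f"k={k:2d}: mean CV accuracy = {scores.mean():.3f}")
```

Very small K typically gives jagged, high-variance boundaries, while very large K smooths them toward the majority class.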
Model Evaluation Metrics
To assess classification model performance, several complementary evaluation metrics are used (a combined sketch follows this list):
- Confusion Matrix: A table showing True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN).
- Precision & Recall: Precision measures how many predicted positives were correct, while recall measures how many actual positives were captured.
- F1-score: The harmonic mean of precision and recall, useful for imbalanced datasets.
- ROC Curve & AUC: The Receiver Operating Characteristic curve plots the True Positive Rate against the False Positive Rate, and AUC measures the area under the curve.
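The sketch below computes each of these metrics on a tiny hand-made example; the labels and scores are made up purely for illustration:

```python
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Toy ground truth, hard predictions, and predicted positive-class scores.
y_true  = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred  = [0, 0, 1, 0, 1, 1, 0, 1]
y_score = [0.1, 0.2, 0.6, 0.3, 0.9, 0.8, 0.4, 0.7]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("f1       :", f1_score(y_true, y_pred))         # harmonic mean of the two
print("roc auc  :", roc_auc_score(y_true, y_score))   # area under the ROC curve
```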
Hands-On Exercises
Exercise 1: Implementing Logistic Regression in Scikit-learn
Objective: Train a logistic regression model on a classification dataset and evaluate its performance.
Steps:
- Load the dataset:

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target
```

- Split data into training and test sets:

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

- Train a logistic regression model:

```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
```

- Evaluate model performance:

```python
from sklearn.metrics import classification_report

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
```
Exercise 2: Implementing Decision Trees and Random Forests
Objective: Compare Decision Trees and Random Forests on the same dataset, reusing the train/test split from Exercise 1.
Steps:
- Train a Decision Tree model:

```python
from sklearn.tree import DecisionTreeClassifier

dt_model = DecisionTreeClassifier()
dt_model.fit(X_train, y_train)
```

- Train a Random Forest model:

```python
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=100)
rf_model.fit(X_train, y_train)
```

- Evaluate and compare both models:

```python
from sklearn.metrics import accuracy_score

dt_pred = dt_model.predict(X_test)
rf_pred = rf_model.predict(X_test)
print(f"Decision Tree Accuracy: {accuracy_score(y_test, dt_pred) * 100:.2f}%")
print(f"Random Forest Accuracy: {accuracy_score(y_test, rf_pred) * 100:.2f}%")
```
Summary
- Covered major classification algorithms: Logistic Regression, Decision Trees, Random Forests, SVM, and KNN.
- Explored evaluation techniques such as Confusion Matrix, Precision-Recall, and ROC-AUC.
- Implemented classification models hands-on using Scikit-learn.
References
- Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. O’Reilly Media.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
- Scikit-Learn Documentation: https://scikit-learn.org/stable/