ML: Unsupervised Learning
Introduction to Unsupervised Learning
Unsupervised learning is a machine learning approach where models learn from unlabeled data to uncover patterns and structures. Unlike supervised learning, which relies on labeled data, unsupervised learning identifies inherent groupings or anomalies within the dataset. This technique is widely used in fields like market segmentation, fraud detection, and recommendation systems.
K-Means Clustering
K-Means is a widely used clustering algorithm that partitions data into K clusters based on similarity.
- How It Works: K-Means initializes centroids randomly, assigns points to the nearest centroid, and iteratively updates centroids until convergence.
- Choosing the Right K: The optimal number of clusters can be chosen with diagnostics such as the Elbow Method and the Silhouette Score; see the sketch after this list.
- Implementation: K-Means is easily implemented with Scikit-learn's KMeans class (see Exercise 1 below).
- Use Case: Customer segmentation in marketing helps businesses group customers based on purchasing behavior.
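To make these diagnostics concrete, here is a minimal sketch that scores a range of K values; the Iris data and the 2-6 range are illustrative choices, not part of the exercises below.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(load_iris().data)
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # inertia_ is the within-cluster sum of squares; the Elbow Method plots it against K
    # silhouette_score lies in [-1, 1]; higher means tighter, better-separated clusters
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))
In an elbow plot, inertia always decreases as K grows; the "elbow" is the K where the decrease levels off.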
Hierarchical Clustering
Hierarchical clustering is a method of grouping data into a hierarchy of nested clusters.
- Types:
- Agglomerative: A bottom-up approach where each data point starts as its own cluster and the closest clusters are merged step by step.
- Divisive: A top-down approach where all data points start in a single cluster and split recursively.
- Dendrograms: Tree-like diagrams that visualize how clusters merge and help determine the optimal number of clusters.
- Implementation: Scikit-learn's AgglomerativeClustering class applies agglomerative clustering; see the sketch after this list.
- Use Case: Document clustering to organize similar text documents.
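As a minimal sketch of the above, the clustering step can use Scikit-learn's AgglomerativeClustering, while the dendrogram is drawn with SciPy's linkage and dendrogram functions (using SciPy here is an assumption; the exercises below use only Scikit-learn).
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris
X = load_iris().data
# Agglomerative (bottom-up) clustering with Ward linkage
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
print(agg.fit_predict(X)[:10])  # cluster labels for the first ten samples
# Dendrogram of the same Ward merges, useful for picking the number of clusters
dendrogram(linkage(X, method='ward'))
plt.title("Ward Linkage Dendrogram")
plt.show()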
Principal Component Analysis (PCA) for Dimensionality Reduction
PCA is a technique for reducing the dimensionality of high-dimensional datasets while preserving variance.
- Why Use PCA? High-dimensional data is difficult to visualize and expensive to process. PCA projects the data onto fewer dimensions while retaining most of the variance.
- How It Works: PCA computes the eigenvectors and eigenvalues of the data's covariance matrix; the eigenvectors with the largest eigenvalues become the principal components (see the sketch after this list).
- Implementation: Scikit-learn's PCA class makes this straightforward (see Exercise 2 below).
- Use Case: Reducing thousands of gene expression features to two or three dimensions for analysis.
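The eigendecomposition step can be sketched directly in NumPy (illustrative only; Exercise 2 below uses Scikit-learn's PCA class instead).
import numpy as np
from sklearn.datasets import load_iris
X = load_iris().data
Xc = X - X.mean(axis=0)                 # center each feature
cov = np.cov(Xc, rowvar=False)          # covariance matrix of the features
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues come back in ascending order
order = np.argsort(eigvals)[::-1]       # rank components by explained variance
components = eigvecs[:, order[:2]]      # keep the top two principal components
X_2d = Xc @ components                  # project the data onto them
print(X_2d.shape)                       # (150, 2)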
Anomaly Detection with One-Class SVM
Anomaly detection identifies rare or suspicious patterns in data, often used in fraud detection.
- Concept: One-Class SVM is trained only on normal data; it learns a boundary around that data and flags points outside it as anomalies.
- Implementation: Scikit-learn allows easy application of One-Class SVM to detect fraudulent transactions or cyber threats.
- Use Case: Identifying fraudulent credit card transactions.
Hands-On Exercises
Exercise 1: Implementing K-Means Clustering
Objective: Perform K-Means clustering on a dataset.
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Load dataset
iris = load_iris()
X = iris.data
# Standardize data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply K-Means
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)  # n_init pinned for consistent results across scikit-learn versions
kmeans.fit(X_scaled)
# Plot clusters
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=kmeans.labels_, cmap='viridis')
plt.title("K-Means Clustering")
plt.show()
Exercise 2: Applying PCA for Dimensionality Reduction
Objective: Reduce dataset dimensionality and visualize it.
from sklearn.decomposition import PCA
import seaborn as sns
# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
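# Check how much of the original variance the two components retain
print("Explained variance ratio:", pca.explained_variance_ratio_)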
# Plot PCA-transformed data
sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=kmeans.labels_, palette='viridis')
plt.title("PCA Visualization")
plt.show()
Exercise 3: Detecting Anomalies with One-Class SVM
Objective: Identify anomalies in a dataset.
from sklearn.svm import OneClassSVM
import numpy as np
# Generate a synthetic dataset: many normal points plus a few uniform outliers
X_normal = np.random.normal(size=(200, 2))
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))
X_all = np.vstack([X_normal, X_outliers])
# Train One-Class SVM on the normal data only; nu bounds the fraction of
# training points allowed to fall outside the learned boundary
oc_svm = OneClassSVM(gamma='auto', nu=0.05)
oc_svm.fit(X_normal)
labels = oc_svm.predict(X_all)  # +1 for inliers, -1 for anomalies
# Visualize anomalies
plt.scatter(X_all[:, 0], X_all[:, 1], c=labels, cmap='coolwarm')
plt.title("Anomaly Detection with One-Class SVM")
plt.show()
Summary
- Learned the fundamentals of unsupervised learning.
- Implemented clustering techniques (K-Means, Hierarchical Clustering).
- Applied PCA for dimensionality reduction.
- Used One-Class SVM for anomaly detection.
References
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. Springer.
- Scikit-Learn Documentation: https://scikit-learn.org/stable/
- TensorFlow Documentation: https://www.tensorflow.org/