ML: Data Preprocessing and Feature Engineering

Handling Missing Data and Outliers

Data is messy—kind of like my workspace, but worse. Missing values can throw off machine learning models, and outliers can turn your carefully crafted algorithm into an absolute joke. So let’s fix them before they ruin our day.

  • Identifying missing data using Pandas: Missing values can creep into datasets like uninvited guests at a party. Using Pandas, we can detect them:
    import pandas as pd
    df = pd.read_csv("data.csv")
    print(df.isnull().sum())
  • Strategies for handling missing data:
    • Deletion: Drop rows or columns with missing values (not recommended unless you enjoy destroying valuable data).
    • Imputation: Fill in missing values using mean, median, or mode.
    df.fillna(df.mean(numeric_only=True), inplace=True)  # Mean imputation for numeric columns
  • Detecting outliers using statistical methods:
    • Interquartile Range (IQR): Flags values that fall more than 1.5×IQR outside the quartiles (see the sketch after this list).
    • Z-score: Flags data points that are way off the average, measured in standard deviations.
    from scipy import stats
    df = df[abs(stats.zscore(df['feature'])) < 3]  # Keep rows within 3 standard deviations of the mean
  • Using visualization techniques: Because sometimes you just need to see the chaos.
    import matplotlib.pyplot as plt
    df.boxplot()
    plt.show()
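
The IQR rule mentioned above takes only a few lines of Pandas. A minimal sketch, reusing the same placeholder column 'feature' as the z-score example:

q1 = df['feature'].quantile(0.25)   # first quartile
q3 = df['feature'].quantile(0.75)   # third quartile
iqr = q3 - q1                       # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df_no_outliers = df[df['feature'].between(lower, upper)]  # keep only rows inside the fences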

Feature Scaling and Normalization

Raw data comes in all shapes and sizes, and machine learning models? Well, they are divas. They demand everything to be on the same scale, or else they throw tantrums. One ground rule to remember (sketched after the list below): fit any scaler on the training data only, then reuse it on the test data.

  • Standardization (Z-score normalization): Centers data around zero.
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    df_scaled = scaler.fit_transform(df[['feature1', 'feature2']])
  • Min-Max Scaling: Squeezes values between 0 and 1.
    from sklearn.preprocessing import MinMaxScaler
    scaler = MinMaxScaler()
    df_scaled = scaler.fit_transform(df[['feature1', 'feature2']])
  • Robust Scaling: Perfect for datasets filled with outliers.
    from sklearn.preprocessing import RobustScaler
    scaler = RobustScaler()
    df_scaled = scaler.fit_transform(df[['feature1', 'feature2']])
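
Whichever scaler you pick, fit it on the training data only and then apply the same transformation to the test data, so test-set statistics never leak into training. A minimal sketch with StandardScaler and placeholder column names:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split first, then scale: the scaler learns its mean/std from the training rows only
X_train, X_test = train_test_split(df[['feature1', 'feature2']], test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn and apply on the training split
X_test_scaled = scaler.transform(X_test)        # reuse the same statistics on the test split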

Encoding Categorical Variables

Not everything in life is numeric—sometimes you have categories like “Cat”, “Dog”, and “Parrot” that need to be converted into numbers. Machines don’t understand words, after all.

  • One-hot encoding: Converts categories into binary vectors (0s and 1s).
    df = pd.get_dummies(df, columns=['category_column'])  # Replaces the column with one binary column per category
  • Label encoding: Assigns a unique number to each category (useful but risky for ML models that assume numerical relationships).
    from sklearn.preprocessing import LabelEncoder
    encoder = LabelEncoder()
    df['category_encoded'] = encoder.fit_transform(df['category_column'])
  • Target encoding and Frequency encoding: More advanced techniques for high-cardinality features (a frequency-encoding sketch follows this list).
    df['target_encoded'] = df.groupby('category_column')['target'].transform('mean')  # Target encoding: mean of the target per category
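
Frequency encoding, mentioned above, simply replaces each category with how often it appears. A minimal sketch, reusing the placeholder 'category_column':

# Frequency encoding: map each category to its relative frequency in the dataset
freq = df['category_column'].value_counts(normalize=True)
df['category_freq_encoded'] = df['category_column'].map(freq)

One caveat on the target-encoding one-liner above: it computes category means over the whole dataset, so in a real project you would compute those means on the training split only to avoid leaking the target into your features.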

Feature Selection and Dimensionality Reduction

The more features you have, the more complex your model becomes, and complexity invites overfitting, slower training, and harder debugging. Let’s keep it simple.

  • Feature Selection:
    • ANOVA and Chi-square tests: Identify statistically significant features.
    • Mutual Information: Measures the dependency between variables (a Scikit-learn sketch of ANOVA and mutual information follows this list).
  • Dimensionality Reduction:
    • Principal Component Analysis (PCA): Reduces dimensions while preserving variance.
      from sklearn.decomposition import PCA
      pca = PCA(n_components=2)
      df_pca = pca.fit_transform(df[['feature1', 'feature2', 'feature3']])
    • Linear Discriminant Analysis (LDA): Works well for classification tasks.
    • t-SNE: Helps visualize high-dimensional data.
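
As promised in the list above, here is a sketch of univariate feature selection with Scikit-learn. The feature and target column names are placeholders, and the data is assumed to be numeric with no missing values:

from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

X = df[['feature1', 'feature2', 'feature3']]  # placeholder feature columns
y = df['target']                              # placeholder class labels

# Keep the 2 features with the highest ANOVA F-scores
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
print(selector.get_support())  # boolean mask showing which columns survived

# Mutual information works the same way and also picks up non-linear dependencies
mi_selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_mi = mi_selector.fit_transform(X, y)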

Hands-On Exercises

Exercise 1: Handling Missing Data

Objective: Learn to detect and handle missing data using Pandas.

import pandas as pd
# Load data and inspect the number of missing values per column
df = pd.read_csv("dataset.csv")
print(df.isnull().sum())
# Fill missing numeric values with each column's mean
df.fillna(df.mean(numeric_only=True), inplace=True)

Exercise 2: Feature Scaling

Objective: Apply different feature scaling techniques using Scikit-learn.

from sklearn.preprocessing import StandardScaler
# Standardize the selected columns to zero mean and unit variance
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df[['feature1', 'feature2']])

Exercise 3: Dimensionality Reduction with PCA

Objective: Reduce feature dimensionality using PCA.

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
pca = PCA(n_components=2)
df_pca = pca.fit_transform(df[['feature1', 'feature2', 'feature3']])
# Plot
plt.scatter(df_pca[:, 0], df_pca[:, 1])
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA Visualization")
plt.show()
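
As a quick sanity check after fitting, PCA's explained_variance_ratio_ attribute reports how much of the original variance each component retains:

# How much of the original variance did each component keep?
print(pca.explained_variance_ratio_)        # one value per principal component
print(pca.explained_variance_ratio_.sum())  # total fraction of variance retained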

Summary

  • We tackled missing data and outliers like pros.
  • We scaled features so our models don’t throw a fit.
  • We encoded categorical variables to make them ML-friendly.
  • We used feature selection and dimensionality reduction to simplify our dataset.
