Pandas: Data Cleaning and Preprocessing

Module 3: Data Cleaning and Preprocessing

Handling Missing Data

Ah, missing data. The bane of every data analyst’s existence. Because clearly, whoever collected this data just gave up halfway through. Here’s how you deal with it:

  • Identifying Missing Values:
import pandas as pd
df = pd.read_csv("messy_data.csv")
print(df.isnull().sum())  # Count missing values per column
print(df.notnull().sum()) # Count non-missing values per column

The .isnull().sum() method gives a count of missing (NaN) values per column, while .notnull().sum() does the opposite—counting non-null values.

  • Removing Missing Values:
df_cleaned = df.dropna()  # Drops any row with missing data

This removes every row containing at least one NaN. Be careful, though: you might lose valuable data if entire rows get nuked. For gentler options, see the sketch below.
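
If blanket dropping is too aggressive, dropna also takes subset and thresh parameters. A minimal sketch on a toy DataFrame (the data and column names are invented for illustration):

import pandas as pd
import numpy as np

# Toy DataFrame with scattered NaNs (invented for illustration)
df = pd.DataFrame({
    "name": ["Ada", "Grace", None, "Linus"],
    "age": [36, np.nan, 29, np.nan],
    "city": ["London", "Arlington", None, None],
})

print(df.dropna(subset=["name"]))  # Drop rows only where "name" is missing
print(df.dropna(thresh=2))         # Keep rows with at least 2 non-null values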

  • Filling Missing Values:
df_filled = df.fillna(df.mean(numeric_only=True))  # Replace NaNs with column means

Rather than deleting data, this replaces NaNs in numeric columns with each column's mean (numeric_only=True stops non-numeric columns from raising an error in recent pandas). Other options include df.fillna(df.median(numeric_only=True)) or df.fillna("Unknown") for categorical data. You can also mix strategies per column, as sketched below.
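
A minimal sketch of per-column fill strategies, assuming a toy DataFrame with one numeric and one categorical column (both invented):

import pandas as pd
import numpy as np

# Toy mixed-type DataFrame (invented for illustration)
df = pd.DataFrame({
    "price": [10.0, np.nan, 14.0, np.nan],
    "category": ["A", None, "B", None],
})

# Pass a dict to fillna to choose a strategy per column
df_filled = df.fillna({"price": df["price"].mean(), "category": "Unknown"})
print(df_filled)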

  • Interpolating Missing Data:
df_interpolated = df.interpolate()  # Fill numeric gaps with linear interpolation

Interpolation fills in missing values based on surrounding data trends—great for time series data where you want a smooth transition between known values.
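
A minimal time-series sketch, assuming an invented daily series with a two-day gap:

import pandas as pd
import numpy as np

# Toy daily series with a gap (invented for illustration)
idx = pd.date_range("2024-01-01", periods=5, freq="D")
s = pd.Series([1.0, np.nan, np.nan, 4.0, 5.0], index=idx)

print(s.interpolate())               # Linear interpolation by position
print(s.interpolate(method="time"))  # Weights gaps by the datetime index spacing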

Data Type Conversions

Because apparently, “123” isn’t always a number. Pandas lets you fix that.

  • Checking Data Types:
print(df.dtypes)

This will show you what types each column contains—integers, floats, objects (a.k.a. strings), etc.

  • Converting Data Types:
df["date_column"] = pd.to_datetime(df["date_column"])

If your dates are stored as strings, this converts them into actual datetime objects, making them usable for time-based operations.
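
The same idea applies to numeric strings, which is where the "123" problem above comes in. A minimal sketch with an invented column; errors="coerce" (which pd.to_datetime also accepts) turns unparseable entries into NaN instead of raising:

import pandas as pd

# Toy column of string "numbers", one of them malformed (invented for illustration)
s = pd.Series(["123", "456", "not_a_number"])

numbers = pd.to_numeric(s, errors="coerce")
print(numbers)  # 123.0, 456.0, NaN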

  • Handling Categorical Data:
df["category_column"] = df["category_column"].astype("category")

Converting text-based categorical data into Pandas’ category type saves memory and speeds up operations.
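
A quick way to see the savings for yourself, using an invented low-cardinality column:

import pandas as pd

# Toy low-cardinality string column (invented for illustration)
s = pd.Series(["red", "green", "blue"] * 10_000)

print(s.memory_usage(deep=True))                     # Bytes as object dtype
print(s.astype("category").memory_usage(deep=True))  # Bytes as category dtype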

Renaming and Reordering Columns

Because column names like “Unnamed: 12” and “A” aren’t helping anyone.

  • Renaming Columns:
df.rename(columns={"old_name": "new_name"}, inplace=True)

This updates column names without modifying the data itself.

  • Reordering Columns:
df = df[["new_col1", "new_col2", "new_col3"]]

This reorders columns by listing them in the order you want. Note that any column you leave out is dropped from the result, so list every column you intend to keep.

  • Standardizing Column Names:
df.columns = df.columns.str.lower().str.replace(" ", "_")

This ensures consistent column names, replacing spaces with underscores and converting everything to lowercase.
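
Putting the three steps together, a minimal sketch on a toy DataFrame with invented headers:

import pandas as pd

# Toy DataFrame with awkward headers (invented for illustration)
df = pd.DataFrame({"First Name": ["Ada"], "Unnamed: 12": [0], "AGE": [36]})

df = df.rename(columns={"Unnamed: 12": "score"})           # Rename the cryptic column
df.columns = df.columns.str.lower().str.replace(" ", "_")  # Standardize names
df = df[["first_name", "age", "score"]]                    # Reorder
print(df.columns.tolist())  # ['first_name', 'age', 'score']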

Removing Duplicates and Filtering Data

Because why have one copy of bad data when you can have twenty?

  • Identifying Duplicates:
print(df.duplicated().sum())

This counts the number of duplicate rows.

  • Removing Duplicates:
df = df.drop_duplicates()

This removes any duplicate rows, keeping only the first occurrence.
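
drop_duplicates also takes subset and keep parameters, which matter when rows repeat on a key but differ elsewhere. A minimal sketch with an invented table:

import pandas as pd

# Toy table where id 1 appears twice with an updated value (invented for illustration)
df = pd.DataFrame({"id": [1, 1, 2], "value": [10, 11, 20]})

# Deduplicate on the key column, keeping the last (most recent) row
print(df.drop_duplicates(subset=["id"], keep="last"))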

  • Filtering Data:
df_filtered = df[df["age"] > 30]

This selects rows where the age column is greater than 30.
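
Conditions can be combined with & and | (with parentheses, since Python's plain and/or don't work element-wise). A minimal sketch on invented columns:

import pandas as pd

# Toy DataFrame (invented for illustration)
df = pd.DataFrame({"age": [25, 35, 45], "city": ["Oslo", "Lima", "Oslo"]})

mask = (df["age"] > 30) & (df["city"] == "Oslo")  # Both conditions must hold
print(df[mask])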

Handling Outliers

Because sometimes, a single data point can ruin everything.

  • Identifying Outliers:
print(df.describe())

The .describe() method provides a summary of numerical columns, which helps spot extreme values.

  • Using IQR to Detect Outliers:
Q1 = df["column"].quantile(0.25)
Q3 = df["column"].quantile(0.75)
IQR = Q3 - Q1
df_outliers = df[(df["column"] < (Q1 - 1.5 * IQR)) | (df["column"] > (Q3 + 1.5 * IQR))]

The interquartile range (IQR) is a robust way to detect outliers: anything more than 1.5 × IQR below Q1 or above Q3 gets flagged.
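
A self-contained sketch of the same recipe on an invented column, this time keeping only the inliers:

import pandas as pd

# Toy numeric column with one obvious outlier (invented for illustration)
df = pd.DataFrame({"column": [10, 12, 11, 13, 12, 95]})

Q1, Q3 = df["column"].quantile([0.25, 0.75])
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

print(df[df["column"].between(lower, upper)])  # Rows inside the IQR fences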

  • Capping Outliers:
df["column"] = df["column"].clip(lower=Q1, upper=Q3)

Instead of removing outliers, this method caps them at reasonable limits.
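
An alternative is percentile capping (winsorizing), sketched here on the same invented column; the 5th/95th cutoffs are a common but arbitrary choice:

import pandas as pd

# Toy column (invented for illustration); cap at the 5th/95th percentiles
df = pd.DataFrame({"column": [10, 12, 11, 13, 12, 95]})

low, high = df["column"].quantile([0.05, 0.95])
df["column"] = df["column"].clip(lower=low, upper=high)
print(df["column"].tolist())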

Hands-On Exercise

  1. Handle Missing Data: Load a dataset, check for missing values, and apply .dropna() and .fillna().
  2. Convert Data Types: Convert date strings to datetime format and categorical data to category type.
  3. Rename and Reorder Columns: Modify column names and reorder them for sanity’s sake.
  4. Remove Duplicates and Filter Data: Identify and remove duplicates, then filter data based on conditions.
  5. Detect and Handle Outliers: Use statistical methods to detect and handle outliers without losing your mind.
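
A hedged starter skeleton for the exercise. The file name and the commented-out column names are placeholders; swap in your own:

import pandas as pd

df = pd.read_csv("your_dataset.csv")  # Placeholder path: use your own file

# 1. Missing data
print(df.isnull().sum())
# df = df.dropna(...) or df = df.fillna(...)

# 2. Type conversions (replace "date" with a real column name)
# df["date"] = pd.to_datetime(df["date"], errors="coerce")

# 3. Standardize and reorder columns
df.columns = df.columns.str.lower().str.replace(" ", "_")

# 4. Duplicates
df = df.drop_duplicates()

# 5. Outliers: inspect summary stats, then apply the IQR recipe above
print(df.describe())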
