Pandas: Hands-on Projects

Data Cleaning

Because real-world data is messy, let’s fix it:

  • Load a real-world dataset (Airbnb, COVID-19 data, customer transactions)
  • Identify missing values using .isnull().sum()
  • Handle missing data with .dropna(), .fillna(), and interpolation
  • Convert data types (strings to datetime, categorical types for efficiency)
  • Remove duplicates and filter outliers

Example Code:

import pandas as pd

df = pd.read_csv("airbnb_listings.csv")
df.drop_duplicates(inplace=True)
df["price"] = pd.to_numeric(df["price"], errors="coerce")
df.fillna(df.median(), inplace=True)

Exploratory Data Analysis (EDA)

Before jumping into models, explore your data like a detective:

  • Load customer behavior data
  • Compute summary statistics with .describe()
  • Visualize distributions using histograms and box plots
  • Identify patterns and correlations with .groupby(), .corr()
  • Generate key business insights (e.g., which customers spend the most?)

Example Code:

import seaborn as sns
import matplotlib.pyplot as plt

sns.histplot(df["price"], bins=50)
plt.show()
print(df.corr())

Time Series Forecasting

Time-based data requires a different approach:

  • Load a financial or stock market dataset
  • Convert date columns to datetime and set as index
  • Perform resampling (.resample("M").mean()) and moving averages
  • Identify trends and seasonality with rolling statistics
  • Use Pandas for basic forecasting techniques

Example Code:

df["date"] = pd.to_datetime(df["date"])
df.set_index("date", inplace=True)
df["rolling_mean"] = df["price"].rolling(window=12).mean()
df["price"].plot()
df["rolling_mean"].plot()
plt.legend(["Original", "Rolling Average"])
plt.show()

Data Merging

In reality, data is never in one nice file. Merge it!

  • Load multiple datasets from CSV, SQL, or APIs
  • Merge datasets using .merge(), .concat(), and .join()
  • Handle duplicate and inconsistent records after merging
  • Create a unified dataset for analysis

Example Code:

df1 = pd.read_csv("sales.csv")
df2 = pd.read_csv("customers.csv")
merged_df = df1.merge(df2, on="customer_id", how="left")

Data Visualization

A good chart can explain what 1000 rows of data cannot.

  • Use Pandas and Seaborn for visualizations
  • Generate histograms, box plots, and correlation heatmaps
  • Customize plots with titles, labels, and legends
  • Combine multiple visualizations for reports

Example Code:

sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
plt.show()

Hands-On Exercise

  1. Perform Data Cleaning: Work on a raw dataset with missing and inconsistent values.
  2. Conduct EDA: Summarize, visualize, and extract insights from a dataset.
  3. Analyze Time Series Data: Identify trends, resample data, and apply rolling averages.
  4. Merge and Transform Datasets: Combine multiple data sources and clean merged data.
  5. Develop a Visualization Dashboard: Create a Pandas-based data dashboard with key insights.

References