Pandas: Hands-on Projects
Data Cleaning#
Because real-world data is messy, let’s fix it:
- Load a real-world dataset (Airbnb, COVID-19 data, customer transactions)
- Identify missing values using
.isnull().sum() - Handle missing data with
.dropna(), .fillna(), and interpolation - Convert data types (strings to datetime, categorical types for efficiency)
- Remove duplicates and filter outliers
Example Code:#
import pandas as pd
df = pd.read_csv("airbnb_listings.csv")
df.drop_duplicates(inplace=True)
df["price"] = pd.to_numeric(df["price"], errors="coerce")
df.fillna(df.median(), inplace=True)
Exploratory Data Analysis (EDA)#
Before jumping into models, explore your data like a detective:
- Load customer behavior data
- Compute summary statistics with
.describe() - Visualize distributions using histograms and box plots
- Identify patterns and correlations with
.groupby(), .corr() - Generate key business insights (e.g., which customers spend the most?)
Example Code:#
import seaborn as sns
import matplotlib.pyplot as plt
sns.histplot(df["price"], bins=50)
plt.show()
print(df.corr())
Time Series Forecasting#
Time-based data requires a different approach:
- Load a financial or stock market dataset
- Convert date columns to datetime and set as index
- Perform resampling (
.resample("M").mean()) and moving averages - Identify trends and seasonality with rolling statistics
- Use Pandas for basic forecasting techniques
Example Code:#
df["date"] = pd.to_datetime(df["date"])
df.set_index("date", inplace=True)
df["rolling_mean"] = df["price"].rolling(window=12).mean()
df["price"].plot()
df["rolling_mean"].plot()
plt.legend(["Original", "Rolling Average"])
plt.show()
Data Merging#
In reality, data is never in one nice file. Merge it!
- Load multiple datasets from CSV, SQL, or APIs
- Merge datasets using
.merge(), .concat(), and .join() - Handle duplicate and inconsistent records after merging
- Create a unified dataset for analysis
Example Code:#
df1 = pd.read_csv("sales.csv")
df2 = pd.read_csv("customers.csv")
merged_df = df1.merge(df2, on="customer_id", how="left")
Data Visualization#
A good chart can explain what 1000 rows of data cannot.
- Use Pandas and Seaborn for visualizations
- Generate histograms, box plots, and correlation heatmaps
- Customize plots with titles, labels, and legends
- Combine multiple visualizations for reports
Example Code:#
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
plt.show()
Hands-On Exercise#
- Perform Data Cleaning: Work on a raw dataset with missing and inconsistent values.
- Conduct EDA: Summarize, visualize, and extract insights from a dataset.
- Analyze Time Series Data: Identify trends, resample data, and apply rolling averages.
- Merge and Transform Datasets: Combine multiple data sources and clean merged data.
- Develop a Visualization Dashboard: Create a Pandas-based data dashboard with key insights.
References#