Pandas: Performance Optimization

Vectorization vs. Loops in Pandas

If you’re still using loops in Pandas, your DataFrame is laughing at you behind your back. Loops are slow—vectorization is fast. Let’s see why:

  • Understanding Why Loops Are Slow
import pandas as pd
import numpy as np
import time

df = pd.DataFrame({"A": np.random.randint(1, 100, 1000000)})

start = time.time()
df["B"] = [x * 2 for x in df["A"]]  # Loop-based transformation
print("Loop Time:", time.time() - start)

Using loops on large DataFrames makes your CPU cry. Instead, use vectorized operations:

  • Using Vectorized Operations for Speed
start = time.time()
df["B"] = df["A"] * 2  # Vectorized transformation
print("Vectorized Time:", time.time() - start)

Vectorized operations run in optimized C implementations inside NumPy and Pandas, which typically makes them orders of magnitude faster than Python-level loops.
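
If you need to shave off even more overhead, you can operate on the underlying NumPy array directly. A minimal sketch, reusing the df and imports from above (to_numpy() returns the column's raw array):

start = time.time()
df["B"] = df["A"].to_numpy() * 2  # work on the raw NumPy array, skipping some Pandas overhead
print("NumPy Array Time:", time.time() - start)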

Using .apply() Efficiently

.apply() is a handy tool, but if misused, it can drag performance down.

  • When to Use .apply() Over Loops
def square(x):
    return x ** 2

df["C"] = df["A"].apply(square)

If a vectorized operation can do what your .apply() call does, skip .apply() and go vectorized instead.
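
To see the difference, here is a quick timing sketch that pits the square example above against its vectorized equivalent, reusing the same df:

start = time.time()
df["C"] = df["A"].apply(square)  # calls the Python function once per element
print("apply Time:", time.time() - start)

start = time.time()
df["C"] = df["A"] ** 2  # vectorized: the whole column is squared in C
print("Vectorized Time:", time.time() - start)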

  • Optimizing .apply() for Row-Wise Operations
df["D"] = df.apply(lambda row: row["A"] * 2 if row["A"] > 50 else row["A"], axis=1)

Row-wise .apply() is slow because Pandas constructs a Series object for every row. If possible, avoid axis=1 and refactor with vectorized logic, as in the sketch below.
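
The conditional above maps directly onto np.where. A minimal vectorized refactor of the same logic:

df["D"] = np.where(df["A"] > 50, df["A"] * 2, df["A"])  # same result as the row-wise lambda, no axis=1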

Memory Optimization Techniques

Your DataFrame is bloated, and it’s not the pizza. Optimize memory usage to prevent crashes and sluggish performance.

  • Reducing Memory Usage by Changing Data Types
df["A"] = df["A"].astype(np.int16)  # Shrinking integer storage size
  • Optimizing Categorical Data
df["Category"] = df["Category"].astype("category")

Categorical types store each distinct value once and replace the strings with small integer codes, so they take up far less memory than repeated strings; the measurement sketch after this list shows the difference.

  • Dropping Unnecessary Columns and Downcasting Data Types
df = df.drop(columns=["Unnecessary_Column"])  # placeholder name: drop whatever you don't need
df["Float_Column"] = pd.to_numeric(df["Float_Column"], downcast="float")  # float64 -> float32 where values fit

Working with Large Datasets Using Dask

Pandas struggles with massive datasets, but Dask can help.

  • Introduction to Dask for Scalable Pandas Operations
import dask.dataframe as dd
ddf = dd.read_csv("large_dataset.csv")  # lazy: nothing is loaded into memory yet
print(ddf.head())                       # reads only the first partition for a preview
  • Loading and Processing Large Datasets with Dask
result = ddf.groupby("Category")["Value"].mean().compute()  # .compute() runs the lazy pipeline
print(result)

Dask builds a lazy task graph and only executes it, in parallel across partitions, when you call .compute(), which makes it well suited to datasets that don't fit in memory.
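
A small sketch of the lazy model, reusing the ddf from above and again assuming a numeric "Value" column:

doubled = ddf["Value"] * 2        # instant: only extends the task graph, nothing is read yet
print(doubled.head())             # computes just the first partition for a preview
print(doubled.mean().compute())   # walks every partition in parallel, returns a plain float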

Hands-On Exercise

  1. Compare Loop vs. Vectorized Operations: Measure execution time of loops vs. vectorized operations in Pandas.
  2. Optimize .apply() Usage: Refactor slow .apply() functions to improve efficiency.
  3. Reduce Memory Usage: Optimize a dataset’s memory by changing data types and reducing redundancy.
  4. Process Large Datasets with Dask: Load and manipulate a large dataset using Dask DataFrame.
