Customer Churn Prediction: A Hands-On Machine Learning Project

Introduction

Customer churn is a critical issue for businesses, as losing customers can significantly impact revenue. In this project, we will build a machine learning model to predict customer churn based on historical data. The project involves data preprocessing, feature engineering, model training, evaluation, and deployment as an API using Flask or FastAPI.


Step 1: Dataset Overview

For this project, we will use the Telco Customer Churn dataset, which is publicly available on Kaggle (kaggle.com).

Dataset Description

The dataset contains information about a fictional telecommunications company’s customers, including demographics, account information, and services subscribed. The target variable is Churn, indicating whether a customer has left the company.

Accessing the Dataset

  1. Download the Dataset:

    • Visit the Kaggle page: Telco Customer Churn Dataset
    • Click on the “Download” button to obtain the WA_Fn-UseC_-Telco-Customer-Churn.csv file.
  2. Load the Dataset:

import pandas as pd

# Load dataset
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')

# Display first five rows
print(df.head())
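
Before moving on, it can help to confirm the column types and the class balance. The following sketch assumes the standard Kaggle column names (notably Churn and TotalCharges):

# Inspect column types; TotalCharges is typically read as an object (string) column
print(df.info())

# Check how imbalanced the target variable is
print(df['Churn'].value_counts(normalize=True))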

Step 2: Data Preprocessing and Feature Engineering

Before training the model, we must clean and preprocess the data. This includes:

  • Handling missing values
  • Encoding categorical variables
  • Feature scaling and normalization
  • Feature selection
The snippet below applies these steps:

from sklearn.preprocessing import LabelEncoder, StandardScaler

# TotalCharges is read as a string because of blank entries; convert it to numeric
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

# Handle missing values in numeric columns
df.fillna(df.median(numeric_only=True), inplace=True)

# Encode all categorical variables except the customer identifier
le = LabelEncoder()
for col in df.select_dtypes(include='object').columns:
    if col != 'customerID':
        df[col] = le.fit_transform(df[col])

# Normalize numerical features
scaler = StandardScaler()
numeric_features = ['tenure', 'MonthlyCharges', 'TotalCharges']
df[numeric_features] = scaler.fit_transform(df[numeric_features])
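
Label encoding assigns an arbitrary integer to each category, which tree-based models tolerate well. If you later switch to a linear model, one-hot encoding is usually the safer choice; a minimal sketch using pandas (column names assumed from the Kaggle dataset):

# One-hot encode selected categorical columns instead of label encoding them
df = pd.get_dummies(df, columns=['InternetService', 'Contract', 'PaymentMethod'], drop_first=True)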

Step 3: Splitting Data into Training and Testing Sets

We will split the dataset into training and testing sets to evaluate the model’s performance.

from sklearn.model_selection import train_test_split

# Define features and target variable
X = df.drop(['customerID', 'Churn'], axis=1)
y = df['Churn']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
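
Churned customers are usually a minority class in this kind of dataset, so a plain random split can leave the test set with a slightly different churn rate than the training set. Passing stratify=y preserves the class proportions in both splits; a small variation on the split above:

# Stratified split keeps the churn/no-churn ratio identical in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)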

Step 4: Training a Classification Model

We will train a Random Forest Classifier to predict customer churn.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
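
Since accuracy can be misleading on an imbalanced target, it is also worth looking at the confusion matrix and at which features drive the predictions. A short sketch (pandas and the fitted model from above are assumed to be in scope):

from sklearn.metrics import confusion_matrix

# Rows are true labels, columns are predicted labels
print(confusion_matrix(y_test, y_pred))

# Rank features by the Random Forest's impurity-based importance
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))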

Step 5: Deploying the Model as an API using Flask

Once we have a trained model, we can deploy it as a web service using Flask.

Save the Trained Model

First, persist the fitted model together with the training column order (run this in the same session as Step 4):

import pickle

# Save the trained model and the feature columns it expects
with open('churn_model.pkl', 'wb') as f:
    pickle.dump({'model': model, 'columns': list(X.columns)}, f)

Create a Flask API

from flask import Flask, request, jsonify
import pandas as pd
import pickle

# Load the saved model and its expected feature columns
with open('churn_model.pkl', 'rb') as f:
    artifact = pickle.load(f)
model, columns = artifact['model'], artifact['columns']

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    # Build a single-row frame in the training column order;
    # features missing from the request default to 0 in this simple demo
    features = pd.DataFrame([data]).reindex(columns=columns, fill_value=0)
    prediction = model.predict(features)
    return jsonify({'Churn Prediction': int(prediction[0])})

if __name__ == '__main__':
    app.run(debug=True)

Testing the API

Save the API script above as app.py and run it, then send a request with Postman or cURL. In this simple demo, any features omitted from the JSON body default to 0; the example below supplies only the three numeric fields.

curl -X POST http://127.0.0.1:5000/predict -H "Content-Type: application/json" -d '{"tenure": 12, "MonthlyCharges": 50, "TotalCharges": 600}'
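
Note that the model above was trained on standardized numeric features, while the example request sends raw values. For consistent predictions in a real deployment you would persist the preprocessing along with the model, for example by bundling the scaler and classifier in a scikit-learn Pipeline (a sketch under the same assumptions as the training code, not the exact setup used above):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Fitting the scaler inside the pipeline lets the saved artifact accept raw inputs
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(n_estimators=100, random_state=42)),
])
pipeline.fit(X_train, y_train)  # pickle this pipeline instead of the bare model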

Conclusion

In this project, we successfully built and deployed a machine learning model to predict customer churn. We covered data preprocessing, feature engineering, model training, evaluation, and API deployment. This end-to-end pipeline can be further improved with advanced hyperparameter tuning, feature selection techniques, and real-time monitoring.