Customer Churn Prediction: A Hands-On Machine Learning Project
Introduction
Customer churn is a critical issue for businesses, as losing customers can significantly impact revenue. In this project, we will build a machine learning model to predict customer churn based on historical data. The project involves data preprocessing, feature engineering, model training, evaluation, and deployment as an API using Flask or FastAPI.
Step 1: Dataset Overview
For this project, we will use the Telco Customer Churn dataset, which is publicly available on Kaggle.
Dataset Description
The dataset contains information about a fictional telecommunications company’s customers, including demographics, account information, and services subscribed. The target variable is Churn, indicating whether a customer has left the company.
Accessing the Dataset
Download the Dataset:
- Visit the Kaggle page: Telco Customer Churn Dataset
- Click on the “Download” button to obtain the WA_Fn-UseC_-Telco-Customer-Churn.csv file.
Load the Dataset:
import pandas as pd
# Load dataset
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
# Display first five rows
print(df.head())
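Before cleaning anything, it is worth checking the dataset's size and class balance; churn datasets are typically imbalanced, which matters when interpreting accuracy later. A quick check:
# Dataset dimensions (rows, columns)
print(df.shape)
# Proportion of churned vs. retained customers
print(df['Churn'].value_counts(normalize=True))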
Step 2: Data Preprocessing and Feature Engineering
Before training the model, we must clean and preprocess the data. This includes:
- Handling missing values
- Encoding categorical variables
- Feature scaling and normalization
- Feature selection
from sklearn.preprocessing import LabelEncoder, StandardScaler
# TotalCharges is read as text because it contains blank strings; coerce it to numeric
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
# Handle missing values in numeric columns with the median
df.fillna(df.median(numeric_only=True), inplace=True)
# Encode binary categorical variables and the target
le = LabelEncoder()
df['gender'] = le.fit_transform(df['gender'])
df['Partner'] = le.fit_transform(df['Partner'])
df['Dependents'] = le.fit_transform(df['Dependents'])
df['Churn'] = le.fit_transform(df['Churn'])
# The dataset has several more categorical columns (e.g., Contract, InternetService);
# encode them the same way so the model receives only numeric inputs
for col in df.select_dtypes(include='object').columns.drop('customerID'):
    df[col] = le.fit_transform(df[col])
# Normalize numerical features
scaler = StandardScaler()
numeric_features = ['tenure', 'MonthlyCharges', 'TotalCharges']
df[numeric_features] = scaler.fit_transform(df[numeric_features])
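A quick sanity check after preprocessing is worthwhile, since any leftover NaN or string column will cause the model fit to fail:
# Verify no missing values remain
print(df.isnull().sum().sum())
# Verify every column except the customerID identifier is numeric
print(df.drop('customerID', axis=1).dtypes.unique())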
Step 3: Splitting Data into Training and Testing Sets
We will split the dataset into training and testing sets to evaluate the model’s performance.
from sklearn.model_selection import train_test_split
# Define features and target variable
X = df.drop(['customerID', 'Churn'], axis=1)
y = df['Churn']
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
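Because churned customers are the minority class, you may prefer a stratified split, which preserves the churn ratio in both sets; a variant of the call above using the same X and y:
# Stratify on the target so train and test keep the same churn rate
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)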
Step 4: Training a Classification Model
We will train a Random Forest Classifier to predict customer churn.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
# Train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
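With the model trained, it is often informative to see which features drive churn predictions; a Random Forest exposes impurity-based importances directly:
# Rank features by the forest's impurity-based importance
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))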
Step 5: Deploying the Model as an API using Flask
Once we have a trained model, we can deploy it as a web service using Flask.
Create a Flask API
from flask import Flask, request, jsonify
import pandas as pd
import pickle
# Save the trained model along with the training-time column order,
# so the API can align incoming requests with the features the model expects
with open('churn_model.pkl', 'wb') as f:
    pickle.dump({'model': model, 'columns': list(X.columns)}, f)
# Load the model and the expected feature columns
with open('churn_model.pkl', 'rb') as f:
    artifact = pickle.load(f)
model, columns = artifact['model'], artifact['columns']
app = Flask(__name__)
@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    # Build a one-row frame in the training column order; features missing
    # from the request are filled with 0
    features = pd.DataFrame([data]).reindex(columns=columns, fill_value=0)
    prediction = model.predict(features)
    return jsonify({'Churn Prediction': int(prediction[0])})
if __name__ == '__main__':
    app.run(debug=True)
Testing the API
Save the above script as app.py and run it, then test the API using Postman or cURL. With the alignment step above, any feature missing from the JSON payload defaults to 0; in a real deployment you would also persist the fitted StandardScaler and apply the same scaling to incoming numeric values before predicting.
curl -X POST http://127.0.0.1:5000/predict -H "Content-Type: application/json" -d '{"tenure": 12, "MonthlyCharges": 50, "TotalCharges": 600}'
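The introduction mentioned FastAPI as an alternative; here is a minimal sketch of the same endpoint, assuming the churn_model.pkl artifact saved above, run with uvicorn app:app:
from fastapi import FastAPI
import pandas as pd
import pickle
# Load the model and training-time column order saved earlier
with open('churn_model.pkl', 'rb') as f:
    artifact = pickle.load(f)
model, columns = artifact['model'], artifact['columns']
app = FastAPI()
@app.post('/predict')
def predict(data: dict):
    # Align the incoming JSON with the training columns; missing features become 0
    features = pd.DataFrame([data]).reindex(columns=columns, fill_value=0)
    return {'Churn Prediction': int(model.predict(features)[0])}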
Conclusion
In this project, we built and deployed a machine learning model to predict customer churn. We covered data preprocessing, feature engineering, model training, evaluation, and API deployment. This end-to-end pipeline can be further improved with hyperparameter tuning, more rigorous feature selection, and real-time monitoring.