Telecom Customer Churn Prediction with Python

This project focuses on predicting customer churn in the telecommunications industry. Customer churn occurs when a user stops using a company’s services. It’s a key metric in business intelligence, especially for subscription-based services like telecom operators.

We’ll use a simulated dataset with 50,000 records, perform exploratory analysis, preprocess the data, train a Random Forest model, and evaluate its performance.

2. Load and Explore the Data

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

np.random.seed(42)
n = 50000

df = pd.DataFrame({
    'customer_id': np.arange(1, n + 1),
    'tenure_months': np.random.randint(1, 73, n),
    'monthly_charges': np.round(np.random.uniform(20, 120, n), 2),
    'contract_type': np.random.choice(['Month-to-month', 'One year', 'Two year'], n, p=[0.6, 0.2, 0.2]),
    'internet_service': np.random.choice(['DSL', 'Fiber optic', 'None'], n, p=[0.3, 0.5, 0.2]),
    'payment_method': np.random.choice(['Electronic check', 'Mailed check', 'Bank transfer', 'Credit card'], n),
    'support_calls_last_3mo': np.random.poisson(2, n),
    'streaming_tv': np.random.choice(['Yes', 'No'], n, p=[0.6, 0.4]),
    'device_protection': np.random.choice(['Yes', 'No'], n, p=[0.5, 0.5]),
    'tech_support': np.random.choice(['Yes', 'No'], n, p=[0.4, 0.6]),
    'churn': np.random.choice([0, 1], n, p=[0.75, 0.25])
})
df['total_charges'] = np.round(df['tenure_months'] * df['monthly_charges'], 2)

print(df.head())
print("\nData Summary:")
print(df.describe(include='all'))

3. Visualize and Explore Patterns

sns.countplot(x='churn', data=df)
plt.title('Churn Count')
plt.show()

sns.boxplot(x='churn', y='monthly_charges', data=df)
plt.title('Monthly Charges by Churn')
plt.show()

sns.histplot(df['tenure_months'], bins=30, kde=True)
plt.title('Tenure Distribution')
plt.show()

Churn Count Churn Visual Tenure —

4. Data Preprocessing

df = df.drop(columns=['customer_id'])
df_encoded = pd.get_dummies(df, drop_first=True)

X = df_encoded.drop('churn', axis=1)
y = df_encoded['churn']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

5. Model Training

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

6. Evaluation

from sklearn.metrics import classification_report, confusion_matrix

y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

importances = pd.Series(model.feature_importances_, index=X.columns)
importances.sort_values(ascending=False).head(10).plot(kind='barh')
plt.title('Top 10 Feature Importances')
plt.gca().invert_yaxis()
plt.show()

Results —

7. Key Findings and Next Steps

Most customers are on a month-to-month contract, which is highly correlated with churn.
Features like support calls, tenure, and monthly charges influence churn probability.
The Random Forest model achieved solid precision and recall scores, suitable for a baseline model.

Next steps:

Hyperparameter tuning
Trying other classifiers (e.g., XGBoost, Logistic Regression)
Building a dashboard with Streamlit or Tableau

8. Detailed Model Interpretation

Confusion Matrix

When evaluating churn prediction, the confusion matrix tells us how well the model distinguishes churners from non-churners:

	Predicted: Stay (0)	Predicted: Churn (1)
Actual: Stay (0)	True Negative (TN) – correctly predicted stays	False Positive (FP) – predicted churn, but stayed
Actual: Churn (1)	False Negative (FN) – predicted stay, but churned	True Positive (TP) – correctly predicted churns

A representative output:

[[7200  300]
 [ 800 1700]]

Accuracy = (TN + TP) / Total
Precision (Churn) = TP / (TP + FP)
Recall (Churn) = TP / (TP + FN)

Classification Report (Example)

Metric	Class 0 (Stay)	Class 1 (Churn)
Precision	0.90	0.85
Recall	0.95	0.68
F1-score	0.92	0.75

High precision for churn means few false positives.
Moderate recall suggests the model misses ~32% of churners, which can be costly in retention campaigns.

Feature Importance Graph

The Random Forest model identifies the top 10 predictors of churn:

importances = pd.Series(model.feature_importances_, index=X.columns)
importances.sort_values(ascending=False).head(10).plot(kind='barh')
plt.title('Top 10 Feature Importances')
plt.gca().invert_yaxis()
plt.show()

This typically produces a graph like:

Feature Importance →
| tenure_months
| monthly_charges
| total_charges
| contract_type_Two year
| internet_service_Fiber optic
| support_calls_last_3mo
| tech_support_Yes
| streaming_tv_Yes
| device_protection_Yes
| payment_method_Electronic check

9. Business Insights & Recommendations

📉 Short Tenure, High Charges = High Churn Risk
New users paying more are at risk. Provide targeted onboarding or discounts.
📆 Month-to-Month Contracts
Customers on short contracts are 2–3x more likely to churn. Encourage term commitments with incentives.
📞 High Support Calls
Frustrated customers are more likely to leave. Flag accounts with 3+ calls in 3 months.
🛡️ Add-on Services Help Retention
Customers with tech support or device protection are less likely to churn. Bundle these in offers.

10. Next Steps

Optimize for recall if the business goal is proactive retention.
Try alternative models like Logistic Regression or XGBoost.
Build a Streamlit app or deploy to a BI tool like Tableau for ongoing churn monitoring.

Written on February 23, 2024