Correlation Analysis
To understand the underlying relationships between customer attributes and service usage, we perform a correlation analysis. This process helps identify which features are most strongly associated with customer churn and ensures that we are aware of any multi-collinearity (high correlation between independent variables) that might affect model performance.
Visualizing Feature Relationships
The project utilizes a correlation matrix, visualized through a heatmap, to represent the Pearson correlation coefficients between all numerical features.
To improve readability, the heatmap is rendered as a lower-triangle matrix. This removes redundant information and allows you to focus on the unique interactions between variables.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Compute the correlation matrix
corr = df.corr(numeric_only=True)
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))
# Plot the heatmap
plt.figure(figsize=(10,6))
sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", cmap='Spectral')
plt.title("Feature Correlation Heatmap")
plt.show()
Key Insights from the Matrix
By analyzing the coefficients (ranging from -1 to +1), we can draw several conclusions about customer behavior:
- Primary Churn Drivers: Features like SupportCalls and MonthlyBill typically show a positive correlation with the Churn target. As these values increase, the likelihood of a customer leaving the service also tends to rise.
- Retention Factors: TenureMonths and AutoPay often exhibit an inverse (negative) correlation with churn. This indicates that long-term customers and those with automated billing are more stable and less likely to cancel.
- Usage Patterns: You may observe correlations between DataUsageGB and MonthlyBill. While expected, monitoring these relationships ensures that no two features are so perfectly correlated that they provide redundant information to the Logistic Regression model.
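The insights above can be sketched in a few lines of pandas. This is a minimal, hypothetical example: the DataFrame and its column names (SupportCalls, MonthlyBill, TenureMonths, Churn) are stand-ins for the project's actual dataset, and the values are invented solely to illustrate the pattern of positive and negative coefficients.

```python
import pandas as pd

# Hypothetical sample mirroring the feature names discussed above
df = pd.DataFrame({
    "SupportCalls": [1, 5, 2, 7, 0, 6],
    "MonthlyBill":  [40, 90, 55, 95, 35, 85],
    "TenureMonths": [48, 3, 30, 5, 60, 8],
    "Churn":        [0, 1, 0, 1, 0, 1],
})

# Correlation of every numeric feature with the Churn target,
# sorted from strongest positive to strongest negative association
churn_corr = (
    df.corr(numeric_only=True)["Churn"]
      .drop("Churn")          # exclude the target's correlation with itself
      .sort_values(ascending=False)
)
print(churn_corr)
```

With data shaped like this, SupportCalls and MonthlyBill land at the top with positive coefficients while TenureMonths sits at the bottom with a negative one, matching the churn-driver and retention-factor pattern described above.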
Multi-Collinearity Assessment
In this analysis, we look for high correlation scores (typically > 0.80) between independent variables.
- Low Multi-Collinearity: Based on the dataset, most features provide distinct signals. This is ideal for the Logistic Regression model used in this project, as it allows the algorithm to assign stable coefficients to each predictor.
- Impact on Prediction: By identifying these relationships early, we can confirm that our "Feature Influence" horizontal bar chart is reflecting true impact rather than mathematical noise caused by overlapping features.
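A screening for the > 0.80 threshold can be sketched as follows. Again, the DataFrame here is hypothetical (DataUsageGB is constructed to be perfectly linear in MonthlyBill so that the check fires); the real analysis would run the same scan on the project's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: MonthlyBill is an exact linear function of DataUsageGB,
# so the pair should be flagged as multi-collinear
df = pd.DataFrame({
    "DataUsageGB":  [2, 10, 5, 12, 1, 11],
    "MonthlyBill":  [25, 105, 55, 125, 15, 115],
    "TenureMonths": [48, 3, 30, 5, 60, 8],
})

corr = df.corr(numeric_only=True).abs()
# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Pairs of independent variables whose absolute correlation exceeds 0.80
high_pairs = [
    (a, b, round(upper.loc[a, b], 2))
    for a in upper.index
    for b in upper.columns
    if upper.loc[a, b] > 0.80    # NaN entries (masked cells) compare False
]
print(high_pairs)
```

Any pair this scan surfaces is a candidate for closer inspection before fitting the Logistic Regression model.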
Practical Application
When using the Customer Churn Analyzer, you can use these correlations to:
- Identify High-Risk Segments: Focus retention efforts on groups where high-correlation triggers (like high support volume) are present.
- Refine Features: If two features are found to be highly redundant in future datasets, one can be dropped to simplify the model without losing predictive power.
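The feature-refinement step can be sketched like this. The frame and the TotalBill column are hypothetical, contrived so that TotalBill is fully redundant with MonthlyBill (exactly 12x); which member of a redundant pair to keep is a judgment call best made with domain knowledge.

```python
import pandas as pd

# Hypothetical frame where TotalBill is just 12 * MonthlyBill (fully redundant)
df = pd.DataFrame({
    "MonthlyBill":  [40, 90, 55, 95],
    "TotalBill":    [480, 1080, 660, 1140],
    "SupportCalls": [1, 5, 2, 7],
})

corr = df.corr(numeric_only=True)
# If two independent variables are near-perfectly correlated, drop one of them
if corr.loc["MonthlyBill", "TotalBill"] > 0.80:
    df = df.drop(columns=["TotalBill"])

print(list(df.columns))  # MonthlyBill and SupportCalls remain
```

Dropping the redundant column simplifies the model and stabilizes its coefficients without sacrificing predictive power.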