Model Evaluation Metrics
After training the Logistic Regression model, we evaluate its performance using standard classification metrics. These metrics help determine how effectively the model can distinguish between customers who stay and those who are likely to churn.
Understanding the Evaluation Output
The evaluation process focuses on three primary outputs generated from the test dataset (30% of the total data).
1. Confusion Matrix
The confusion matrix provides a summary of prediction results. For a churn model, it breaks down the results into four categories:
- True Negatives (TN): Customers predicted to stay who actually stayed.
- False Positives (FP): Customers predicted to churn who actually stayed (Type I Error).
- False Negatives (FN): Customers predicted to stay who actually churned (Type II Error).
- True Positives (TP): Customers predicted to churn who actually churned.
In the context of churn, False Negatives are particularly costly as they represent missed opportunities to retain a customer.
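The four quadrants can be counted directly from paired labels. The sketch below uses hypothetical labels (not from the project's dataset) and tallies each quadrant by hand, mirroring what `sklearn.metrics.confusion_matrix` reports:

```python
# Hypothetical example labels: 1 = churned, 0 = stayed
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_pred = [0, 1, 0, 1, 0, 1, 0, 1]

# Count each quadrant of the confusion matrix by hand
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # stayed, predicted stay
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # stayed, predicted churn
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # churned, predicted stay
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # churned, predicted churn

print(tn, fp, fn, tp)  # 3 1 1 3
```

With sklearn, the same four counts come from `confusion_matrix(y_true, y_pred).ravel()`, in the order TN, FP, FN, TP.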
2. Classification Report
The classification report provides a deeper dive into the model's precision and recall:
| Metric | Description |
| :--- | :--- |
| Precision | Out of all customers the model flagged as "Churn," how many actually left? High precision means fewer false alarms. |
| Recall | Out of all customers who actually left, how many did the model correctly identify? High recall means you are capturing most of the churn risk. |
| F1-Score | The harmonic mean of Precision and Recall. This is the best "all-around" metric, especially if the dataset has an imbalance between churners and non-churners. |
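All three metrics derive directly from the confusion matrix counts. A minimal sketch, using hypothetical counts (TP=3, FP=1, FN=1) rather than the project's actual results:

```python
# Hypothetical counts taken from a confusion matrix
tp, fp, fn = 3, 1, 1

precision = tp / (tp + fp)  # fraction of "churn" flags that were correct
recall = tp / (tp + fn)     # fraction of real churners the model caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, recall, f1)  # 0.75 0.75 0.75
```

These are the same formulas sklearn applies per class inside `classification_report`.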
3. F1 Score
The project explicitly calculates the F1 Score to provide a single performance headline. With a current baseline accuracy of approximately 54%, the F1 Score helps determine whether the model is performing better than random guessing and where the trade-offs between precision and recall lie.
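The reason F1 is preferred over raw accuracy on imbalanced data can be shown with a toy example. Assuming a hypothetical test set of 90 stayers and 10 churners, a degenerate model that always predicts "stay" scores 90% accuracy while catching zero churners:

```python
# Hypothetical imbalanced test set: 90 stayers, 10 churners
y_true = [0] * 90 + [1] * 10
y_naive = [0] * 100  # degenerate model: always predict "stay"

accuracy = sum(t == p for t, p in zip(y_true, y_naive)) / len(y_true)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_naive))

# With zero true positives, precision and recall are both 0, so F1 is 0
f1 = 0.0 if tp == 0 else 2 * tp / (2 * tp + 0 + 10)

print(accuracy, f1)  # 0.9 0.0
```

High accuracy here is an illusion created by the class imbalance; F1 exposes that the model provides no churn-detection value.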
How to Run Evaluation
To generate these metrics within the project environment, execute the evaluation cell in the notebook or run the following logic in your script:
from sklearn.metrics import classification_report, confusion_matrix, f1_score
# Generate predictions on the test set
y_pred = model.predict(X_test)
# Display evaluation results
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print(f"Final F1 Score: {f1_score(y_test, y_pred):.4f}")
Performance Interpretation
- Current State: The model serves as a baseline. An accuracy of ~54% indicates that while the model has learned certain patterns (like the impact of support calls and contract types), there is significant noise in the data.
- Business Impact: At this stage, the model should be used as a supplementary tool rather than an automated decision-maker. It is best used to highlight "High Risk" segments for manual review by customer success teams.
- Optimization Goal: Future iterations aim to increase the Recall score, ensuring that the business catches as many potential churners as possible, even at the cost of slight over-prediction (False Positives).
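One common way to trade precision for recall, without retraining, is to lower the decision threshold applied to the model's predicted churn probabilities. The sketch below uses hypothetical probabilities standing in for `model.predict_proba(X_test)[:, 1]`; the 0.35 cutoff is illustrative, not a project setting:

```python
# Hypothetical churn probabilities, as model.predict_proba(X_test)[:, 1] would return
probs = [0.10, 0.35, 0.48, 0.52, 0.71, 0.90]

flag_default = [1 if p >= 0.50 else 0 for p in probs]  # standard 0.5 cutoff
flag_lowered = [1 if p >= 0.35 else 0 for p in probs]  # more aggressive cutoff

print(sum(flag_default), sum(flag_lowered))  # 3 5
```

Lowering the threshold flags more customers as "High Risk," raising recall (fewer missed churners) at the cost of more false positives, which matches the stated optimization goal.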