Preprocessing Pipeline
Before training the model, the raw dataset undergoes a structured preprocessing pipeline to ensure the data is in a numerical format suitable for Logistic Regression. This process includes feature selection, data cleaning, and categorical encoding.
1. Feature Selection and Cleaning
The initial step removes unique identifiers that contribute nothing to the model's predictive power. The CustomerID column is dropped because a per-row identifier carries no signal and can lead the model to memorize arbitrary values.
# Drop non-predictive features
df.drop('CustomerID', axis=1, inplace=True)
2. Categorical Variable Encoding
The dataset contains several categorical variables (e.g., Contract types and Payment Methods). We use Label Encoding to convert these text-based categories into numerical values.
The pipeline specifically targets the following features:
- Contract (e.g., Monthly, One Year, Two Year)
- PaymentMethod (e.g., Credit Card, Bank Transfer)
- AutoPay (Yes/No)
- Churn (target variable)
To ensure consistency between training and future inference, the encoders are stored in a dictionary (le_dict):
from sklearn.preprocessing import LabelEncoder

label_cols = ['Contract', 'PaymentMethod', 'AutoPay', 'Churn']
le_dict = {}
for col in label_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    le_dict[col] = le
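Because the text above mentions reusing these encoders for future inference, it can help to serialize le_dict so a separate prediction process can load the exact same mappings. A minimal sketch using joblib (the filename and single-column toy data are illustrative assumptions, not part of the original pipeline):

```python
import joblib
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for the real dataset
toy_df = pd.DataFrame({'Contract': ['Monthly', 'One Year', 'Two Year', 'Monthly']})

le_dict = {}
le = LabelEncoder()
toy_df['Contract'] = le.fit_transform(toy_df['Contract'])
le_dict['Contract'] = le

# Persist the fitted encoders, then reload them as an inference service would
joblib.dump(le_dict, 'le_dict.joblib')
loaded = joblib.load('le_dict.joblib')
```

After reloading, `loaded['Contract'].transform(['Monthly'])` yields the same integer the model saw during training, which is exactly the consistency requirement stated above.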
3. Data Validation
The pipeline includes a check for missing values. In the current implementation, null entries are only counted and reported; any that appear should be imputed or dropped to preserve data integrity before the split.
# Check for missing values
print(df.isnull().sum())
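If the check does surface nulls, one simple remedy is to drop the incomplete rows before splitting. A minimal sketch on toy data (column names borrowed from the example below; the values and the deliberate gap are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value
toy_df = pd.DataFrame({
    'MonthlyBill': [110.0, np.nan, 95.5],
    'TenureMonths': [12, 24, 6],
})

# Count nulls per column, then drop incomplete rows
null_counts = toy_df.isnull().sum()
df_clean = toy_df.dropna().reset_index(drop=True)
```

Imputation (e.g., filling with a column median via `fillna`) is an alternative when dropping rows would discard too much data.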
4. Preparing for Prediction
When processing new customer data for real-time prediction, the pipeline requires the input to be transformed using the same encoders used during training. This ensures that a value like "Monthly" is always mapped to the same integer recognized by the model.
Example: Processing a new data point
import pandas as pd

new_customer = pd.DataFrame({
    'Contract': [le_dict['Contract'].transform(['Monthly'])[0]],
    'SupportCalls': [4],
    'MonthlyBill': [110.0],
    'PaymentMethod': [le_dict['PaymentMethod'].transform(['Credit Card'])[0]],
    'BillingIssues': [0],
    'DataUsageGB': [85.0],
    'TenureMonths': [12],
    'AutoPay': [le_dict['AutoPay'].transform(['Yes'])[0]]
})
5. Training/Test Split
The final stage of the pipeline splits the cleaned dataset into training and testing sets using a 70/30 ratio, ensuring the model's performance can be validated on unseen data.
- Training Set: 70% of data used to train the Logistic Regression model.
- Testing Set: 30% of data used for evaluation (Confusion Matrix, F1 Score).