2020
Churn Customer Analysis
DATA MINING AND ANALYTICS II – C744
UBALDO MARTINEZ III
Table of Contents
I: Tool Selection ... 2
Data Extraction ... 2
A. Tool Benefits ... 3
B. Goals ... 3
C. Selection of Analytical Methods Part I ... 4
II: Data Exploration and Preparation ... 4
D. Target Variable ... 4
E. Independent Variables ... 5
F. Data Manipulation ... 5
G. Statistical Identity ... 6
H. Data Cleaning ... 9
III: Data Analysis ... 15
I. Univariate Statistics ... 15
J. Bivariate Statistics ... 19
Distribution of Categorical Variables Based on Target Variable ... 21
Distribution of Numerical Variables Based on Target Variable ... 33
K. Analytic & Evaluative Method ... 35
Cluster Analysis ... 35
Decision Tree, Logistic Regression and Random Forest ... 47
L. Selection of Analytical Method Part II ... 56
M. Visual Method Selection ... 57
IV: Data Summary ... 57
N. Non-Discriminatory Data ... 57
O. Predictor Variables ... 58
P. Sources ... 59
I: Tool Selection
Data Extraction
A. Tool Benefits
Python is the tool I selected for extraction of data and exploratory analysis for its advantages in rendering customize graphical representations making it easy to discover insightful information. For model development, model comparison and model evaluation I choose R. As an intermediate Python user and with this specific scenario I found R performed predictive analysis quicker and required less code than Python.
B. Goals
The first goal is to identify key attributes in correlation to churn customers by applying univariate statistics, bivariate statistics and cluster analysis. The second goal is to build a model devised for accuracy and sensitivity to ensure reliable prediction of churn customers based on new data.
C. Selection of Analytical Methods Part I
Based on the data type and distribution of features as per churn the best descriptive method to utilize is cluster analysis with k-means. K-means will group data points into distinct non-overlapping subgroups thus providing visualization of key features contributing to customer attrition. Three different predictive models will be trained and tested to ensure accuracy and specificity. The best predictive models to utilize for a discrete outcome are Decision Tree, Logistic Regression and Random Forest. Data will split into training and test sets for model development and model comparison. Hyperparameter Tuning will be applied to Random Forest to ensure model efficiency.
II: Data Exploration and Preparation
D. Target Variable
The target variable, also known as the dependent variable in the data is Churn. Its data type is categorical (qualitative) and binary as the values only contain two possible discrete outcomes, 1 and 0 (MichaelF, 2019). The following code was executed in Python to support claims for target variable:
E. Independent Variables
An example of an independent predictor variable in the data is Gender. Its data type is categorical (qualitative) and nominal since it does not have any numerical significance. InternetServiceType,
PaymentMethod and ContractType are also examples of nominal variables. The following code was executed in Python to support claims for independent predictor variable:
F. Data Manipulation
The goal of manipulating the data is to sort and rearrange the data to identify patterns or trends without fundamentally changing it. Cleaning data will consist of renaming column titles to create consistency in name format, converting values from nominal to binary, creating new columns to consolidate variables and checking for missing values. Data will be grouped by categorical, numerical, target, binary and nominal columns for descriptive and predictive analysis. After cleaning and
transforming data set verify all variables and values have been modified to perform exploratory and cluster analysis. Below is the code on how the data was grouped for calling on different variables for descriptive and predictive analysis.
G. Statistical Identity
The three quantitative, continuous independent variables are Tenure, MonthlyCharges and TotalCharges. Average Tenure is 32 Months, average MonthlyCharges are 64 dollars and average TotalCharges are 2282 dollars.
There are seventeen categorical independent variables after transformation of dataset. Gender, Partner, Dependents, SeniorCitizen, PhoneService, MultipleLines, InternetServiceType, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies, IsContracted,
ContractType, PaperlessBilling and PaymentMethod.
Categorical Independent Variables 1-10
Categorical Independent Variables 11-17
The one categorical dependent variable (target) and essential criteria to be predicted is Churn.
CustomerID is the identifier for all individual records in the dataset.
H. Data Cleaning
To get an overview of the data set and begin the transformation process the number of rows, columns, features, missing values and unique values and data types were examined.
The first step in transformation was renaming columns to have a data frame with consistent formatting. Second, the InternetService column name was changed to InternetServiceType and Contract to ContractType to match its nominal values and remained as objects. Third, columns Partner,
Dependents, PhoneService, MultipleLines, OnlineSecurity, OnlineBackup, DeviceProtection,
TechSupport, StreamingTV, StreamingMovies, PaperlessBilling and Churn were converted from object to int64. The purpose of converting from object to int64 was to map out its values of ‘Yes’ to 1 and values starting with ‘No’ to 0. Fourth, two additional columns were added as int64, InternetService and IsContracted with values of 1 and 0. If a customer had any type of InternetService a value of 1 was assigned if he did not have InternetService a value of 0 was assigned. For IsContracted if a customer had one- or two-year contracts a value of 1 was assigned, if customer was on a month to month payment plan then a value of 0 was assigned. Fifth, to resolve the problem of missing values, TotalCharges column was converted from object to float64. After converting to float64 the product of MonthlyCharges and Tenure replaced missing values. Below is both the code used for transformation and the results.
Verified no anomalies existed for numerical columns (Tenure, MonthlyCharges, TotalCharges) by executing a code to produce boxplots.
The last step in the process was to export the transformed and clean file to excel as demonstrated in the code below.
III: Data Analysis
I. Univariate Statistics
J. Bivariate Statistics
From a correlation heat map patterns between the target and independent variables can be identified based on score between -1 and 1. A zero indicating no correlation between two variables, a number closer to -1 indicating a negative correlation and a number closer to 1 indicating a positive correlation (admin, 2020). Based on the correlation heat map a conclusion can be drawn of Churn (target variable) having a negative correlation with Tenure, IsContracted and a positive correlation with
PaperlessBilling, MonthlyCharges and Internet Service. It would be normal for MonthlyCharges and
TotalCharges to have a high positive correlation with each other and services. Tenure, a numerical variable has high positive correlations with IsContracted, TotalCharges and Partner.
Distribution of Categorical Variables Based on Target Variable
Binary Columns
Nominal Columns
Demographics Distribution
Demographic analysis demonstrates 25% of churn customers are SeniorCitizens and of existing customers only 13% are SeniorCitizens. Churn customers tend to be single with 64% having no partners and 83% having no dependents. For gender, it is roughly equal for male and female churn customers.
Services Distribution
Services based analysis demonstrates OnlineSecurity, OnlineBackup, DeviceProtection and TechSupport were not significant with less than 30% of churn customers having these services.
StreamingTV and StreamingMovies are slightly higher with 44% of churn customers having these services. One of the biggest factors is 91% of churn customers had phone services. The other major factor is 94% of churn customers had internet services and specifically 69% of those had fiber optic.
Account Based Distribution
Account based analysis demonstrates 89% of churn customers are on a month-to-month payment plan with no contract. 75% of churn customers had paperless billing enabled on their accounts and 57%
used electronic check as a payment method.
Distribution of Numerical Variables Based on Target Variable
K. Analytic & Evaluative Method
Cluster AnalysisBlue Cluster 0 – Low Tenure and Low Monthly Charges
Green Cluster 1 – High Tenure and High Monthly Charges
Red Cluster 2 – Low Tenure and High Monthly Charges
Decision Tree, Logistic Regression and Random Forest
For the predictive analysis I switched to R for its simple code instructions and faster calculations.
Split the data into training and testing then proceeded to convert the churn variable to a factor in order to perform predictive analysis.
Decision Tree
Based on our decision tree plot the most important variable is ContractType with 55% of customers likely to churn if they do not have a one-year or two-year contract. The second variable of importance is InternetServiceType indicating if a customer is on a month-to-month plan and does not have DSL they are 85% likely to churn. Lastly, the variable Tenure indicates if a customer has been with the company less than 15 months there is a 15% chance they might churn. If a customer has all three variables mentioned than they are 100% to churn. The two metrics of focus are sensitivity (TP / All Positives) and accuracy (TP + TN / All). Sensitivity score was 37.63% and accuracy 79.19%.
Logistic Regression
Logistic Regression results indicate Tenure and ContractType having significance for both the coefficients and analysis of deviance tables. Accuracy is 80.54%, slightly better than decision tree accuracy and sensitivity is 53.85% which is a good improvement from 37.63%.
Random Forest
Random Forest provided the best results out of the three models after hyperparameter tuning and refitting the model. The result was 80.61% accuracy and 92.34% sensitivity, beating the other two predictive analysis methods.
L. Selection of Analytical Method Part II
Cluster Analysis is the analytical method used for customer segmentation and identifying key attributes of Churn customers (Soetewey, 2020). Customer data consists of binary and nominal categories (only three variables are numerical) making it suitable for cluster analysis (“StatisticsSolutions”, n.d.). By subgrouping categories into binary and nominal variables data was ready for exploratory analysis. If PCA was used, then transformation and manipulation of data would have been extensive compared for cluster analysis.
Decision Tree, Logistic Regression and Random Forest are the algorithms chosen for predicting churn customers. Logistic Regression works best in finding relationships with data consisting of one binary dependent variable and one or more nominal independent variables (“StatisticsSolutions”, n.d.).
The Decision Tree method has the advantages of being able to handle both numerical and categorical data, requires minimal effort for data preparation and easy to interpret results (Gupta, 2017). Random Forest provides the capability of randomizing initial variables, bootstrapping data, creating multiple decisions tree, combining the outputs of the decision trees and generating a definitive result. The final output contains a confusion matrix and the top features contributing to churn customers (Srivastava, 2014).
M. Visual Method Selection
For univariate and bivariate statistics data is presented using bar charts. Bar charts are best for visualization of categorical variables with the x-axis showing the variable name and y-axis showing percentages of the counts. For numerical variables, density plots are used representing the distribution of numeric variables with x-axis for the value of the variable and y-axis for the density probability. The density plot in bivariate statistics is suitable for comparing an independent variable to a binary dependent variable (Koehrsen, 2018). In cluster analysis to show the relationships between variables and observe patters as a whole scatter plots are utilized (Yi, 2019).
IV: Data Summary
N. Non-Discriminatory Data
After cleaning the data and performing exploratory analysis I found the data to be non-
discriminatory. The most important observation for determining if data could be bias was missing values.
The data set only had 11 missing values under the TotalCharges column. The missing values were replaced with the product of tenure and monthly charges. No variable with correlation to the dependent variable was omitted from the evaluative or predictive analysis to ensure a non-bias outcome. The
phenomenon of churn customers was detectable by the variables found within the dataset. The three variables having the biggest impact on the phenomenon were Tenure, InternetServiceType and ContractType (Krishnamurthy, 2019).
O. Predictor Variables
Based on the correlation heat map a conclusion can be drawn of Churn (target variable) having a negative correlation with Tenure, IsContracted and a positive correlation with PaperlessBilling,
MonthlyCharges and Internet Service. It would be normal for MonthlyCharges and TotalCharges to have a high positive correlation with each other and services. Tenure, a numerical variable has high positive correlations with IsContracted, TotalCharges and Partner.
Cluster Analysis was the tool used for exploratory analysis and discovery of customer
segmentations based on common attributes (predictor variables). First, normalization was performed on MonthlyCharges and Tenure in order to run K-Means and produce a scatter plot. Second, an elbow curve scatter plot demonstrated 3 clusters as the best fit for the analysis. The customers were represented by the following groups, blue cluster of low tenure and low monthly charges, green cluster of high tenure and high monthly charges and red cluster of low tenure and high monthly charges. Blue cluster represented 24.88% of customers, green 24.29% and red 50.83%. In bivariate analysis and predictive analysis with decision tree low tenure was found to be an important factor contributing to churn customers. The other important factor found using bivariate analysis was churn customers tended to have high monthly charges. Validation by different methods confirms the predictor variables of tenure and monthly charges having correlation with churn. The focus was pinpointed to the red cluster of low tenure and high monthly charges with 50.83%. For demographics, the red cluster showed customers tended to be female and senior citizens. For account-based information red cluster customers were on a month-to-month payment plan, paid by electronic check and received statements via paperless billing. Lastly, for services red cluster customers tended to have streaming tv, streaming movies subscriptions and fiber optic internet.
In summary, evaluative and predictive analysis supported Tenure, MonthlyCharges,
ContractType and InternetServiceType as the important factors to consider for identifying key attributes in churn customers. Random Forest would be the best algorithm in this scenario for its accuracy of 81%
and sensitivity of 92%.
P. Sources
MichaelF. (2019). Predictive Process Step 1: Finding Your Target Variable. Retrieved from
https://community.alteryx.com/t5/Data-Science-Blog/Predictive-Process-Step-1-Finding-Your- Target-Variable/ba-p/401639
Koo Ping Shung. (2008). Accuracy, Precision, Recall or F1? Retrieved from https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9
Admin. (2020). How to Read a Correlation Matrix. Retrieved from https://statology.org/how-to-read-a- correlation-matrix
Antoine Soetewey. (2020). The complete guide to clustering analysis. Retrieved from
https://towardsdatascience.com/the-complete-guide-to-clustering-analysis-10fe13712787.
“StatisticsSolutions” (n.d.). Conduct and Interpret a Cluster Analysis. Retrieved from https://statisticssolutions.com/cluster-analysis-2/
“StatisticsSolutions” (n.d.). What is Logistic Regression. Retrieved from https://statisticssolutions.com/what-is-logistic-regression/
Prashant Gupta. (2017). Decision Trees in Machine Learning. Retrieved from
https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052
Will Koehrsen. (2018). Histograms and Density Plots in Python. Retrieved from
https://towardsdatascience.com/histograms-and-density-plots-in-python-f6bda88f5ac0
Mike Yi. (2019). A Complete Guide to Scatter Plots. Retrieved from https://chartio.com/learn/charts/what-is-a-scatter-plot/
Prabhakar Krishnamurthy. (2019). Understanding Data Bias. Retrieved from https://towardsdatascience.com/survey-d4f168791e57
Tavish Srivastava. (2014). Introduction to Random Forest – Simplified Retrieved from
https://www.analyticsvidhya.com/blog/2014/06/introduction-random-forest-simplified/