Volume 5, Issue 6, June 2019 (ISSN: 2394 – 6598)
©IJETIE 2019
CHURN PREDICTION IN TELECOM INDUSTRY USING STANDARD MACHINE
LEARNING TECHNIQUES
Pamina J, Selvakumari.M, Sowjanya.M, Sneha.S
Abstract
Churn prediction aims to identify customers who are likely to leave a service provider. Customer churn has emerged as a critical issue for Customer Relationship Management and customer retention in the telecom industry. Acquiring a new customer costs an organization 5 to 10 times more than retaining an existing one.
Predictive models can correctly identify likely churners in the near future so that a suitable retention measure can be offered. Because churn management is an important task for companies that want to keep valuable customers, the ability to predict customer churn is essential. Many issues can lead customers to leave a business, but a few are considered to be the leading causes of customer churn. There is a direct relationship between customer lifetime value and the ability to grow a business: the higher the customer churn rate, the lower the chances of business growth. Combining two or more algorithms has been shown to perform better than many single techniques across a number of problem domains. This project covers the following models: KNN Classification, Linear Regression, Random Forest, Decision Tree Classification, XGBoost, Naive Bayes, and Artificial Neural Networks. As expected, Artificial Neural Networks give higher accuracy than the other algorithms.
1. Introduction
Churn prediction has been a highly debated research area over the last ten years and is perhaps the biggest challenge in the telecom industry [1][2][3][4]. Data volume has grown tremendously over the last decades due to advancements in information technology [1][2]. A small step towards retaining an existing customer can lead to a significant increase in revenues and profits [13][18][19][20]. Retaining customers requires churn prediction models that are both accurate and comprehensible [5][7][8]. At the same time there has been enormous development in data mining [6][9][11][15]. Modern churn prediction relies mainly on classification algorithms [10]. Many new methods and techniques have been added to process data and gather information [14].
The data gathered from any source is raw data in which the valuable information is hidden [12][14][15]. Researchers from different disciplines have looked at this problem from their own perspectives to reach a clear understanding and to recommend effective solutions for churn in many business areas [16][18][19][20].
Although conventional churn prediction techniques have the advantage of being simple and robust with respect to defects in the input data, they have serious limitations when it comes to interpreting the reasons for churn [17][20]. Therefore, the effectiveness of a model also depends on how well its results can be interpreted to infer the possible reasons for churn [1][2][3][5]. The purpose of prediction is to anticipate the value that a random variable will assume in the future or to estimate the likelihood of future events [4][6]. This project consists of the executed algorithms and the accuracy obtained with each [6][7].
2. Modules Description
Predicting in advance which subscribers are going to churn helps the telecommunication industry identify who is going to leave the network and who is going to stay. The problem addressed in our work is to classify each subscriber as a potential churner or a potential non-churner. The framework compares six machine learning algorithms and one deep learning algorithm to find the one with the best accuracy.
Figure 2.1
Figure 2.1 represents the module description of the analysis.
3. Data Cleaning
Data cleaning fills in missing values, smooths noisy data, identifies or removes outliers, and resolves inconsistencies. Missing values can be handled in several ways. Ignore the tuple: this is usually done when the class label is missing.
Use the attribute mean (or the majority nominal value) to fill in the missing value. Use the attribute mean (or the majority nominal value) of all samples belonging to the same class. Predict the missing value with a learning algorithm: treat the attribute with the missing value as the dependent (class) variable and run a learning algorithm (usually Bayes or a decision tree) to predict it.
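The mean and majority-value strategies described above can be sketched as follows. This is a minimal illustration, not the cleaning code used in the study; the function names and the toy column values are hypothetical.

```python
from collections import Counter
from statistics import mean


def fill_missing(values, is_numeric=True):
    """Replace None with the attribute mean (numeric) or the majority value (nominal)."""
    observed = [v for v in values if v is not None]
    if is_numeric:
        fill = mean(observed)
    else:
        fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


def fill_missing_by_class(values, labels):
    """Replace None with the mean of the samples that share the same class label.

    Assumes every class has at least one observed (non-missing) value.
    """
    filled = list(values)
    for label in set(labels):
        indices = [i for i, current in enumerate(labels) if current == label]
        class_mean = mean(values[i] for i in indices if values[i] is not None)
        for i in indices:
            if filled[i] is None:
                filled[i] = class_mean
    return filled


# Toy, made-up columns used only for illustration.
charges = [20.0, None, 40.0, 60.0, None, 80.0]
churn = ["No", "No", "No", "Yes", "Yes", "Yes"]
contract = ["Month-to-month", "Two year", None, "Month-to-month"]

print(fill_missing(contract, is_numeric=False))
print(fill_missing_by_class(charges, churn))
```

The class-conditional variant fills each gap with the mean of its own class, which preserves differences between churners and non-churners that a single global mean would blur.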
Outliers can be identified and noisy data smoothed in the following ways.
Binning: sort the attribute values and partition them into bins, then smooth by bin means, bin medians, or bin boundaries. Clustering: group values into clusters and then detect and remove outliers (automatically or manually). Regression: smooth by fitting the data to regression functions. The next section deals with exploratory data analytics.
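The binning step described in this section can be sketched as follows. The sketch assumes equal-frequency bins of a fixed size, which is only one possible way to partition the values; the function name and the sample values are hypothetical.

```python
from statistics import mean


def smooth_by_bin_means(values, bin_size):
    """Sort the values, cut them into bins of bin_size, and replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed


# Made-up attribute values, used only for illustration.
values = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(values, 3))  # [9, 9, 9, 22, 22, 22, 29, 29, 29]
```

Smoothing by bin medians would replace `mean` with `statistics.median`; smoothing by bin boundaries would move each value to the nearer of the bin's minimum and maximum.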
4. Exploratory Data Analytics
Figure 4.1
Figure 4.1 shows the exploratory analysis of the churn attribute.
Figure 4.2
Figure 4.2 shows the exploratory analysis of the Internet service attribute.
Figure 4.3
Figure 4.3 shows the correlation matrix of the attributes Senior citizen, tenure, and Monthly charges.
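A correlation matrix such as the one in Figure 4.3 can be computed with the Pearson coefficient. The sketch below is an illustration under stated assumptions: the column spellings and all values are made up, and the paper does not say which correlation measure or tool was used.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def correlation_matrix(columns):
    """Nested dict with the pairwise Pearson correlation of every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up sample rows; the real dataset has 7043 records.
data = {
    "SeniorCitizen": [0, 0, 1, 0, 1],
    "tenure": [1, 34, 2, 45, 8],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 99.65],
}
matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Each diagonal entry is 1 and the matrix is symmetric, so only the entries above the diagonal carry information.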
5. Dataset
The dataset is from IBM and consists of 7043 records and 21 fields. It provides basic details of customers. 5174 are non-churners and 1869 are churners, which means that 73.5% are non-churners and 26.5% are churners. Hence, the dataset is unbalanced in terms of the proportion of churners versus non-churners. The attributes provided help us to predict customer churn by comparing several machine learning algorithms and deep learning algorithms.
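The class proportions quoted above follow directly from the record counts, as this short check shows. The variable names are illustrative.

```python
total = 7043
non_churners = 5174
churners = 1869

# The two classes account for every record.
assert non_churners + churners == total

non_churn_pct = round(100 * non_churners / total, 1)
churn_pct = round(100 * churners / total, 1)
print(non_churn_pct, churn_pct)  # 73.5 26.5
```

Because of this imbalance, a trivial classifier that always predicts "non-churner" would already reach about 73.5% accuracy, which is a useful baseline when reading the accuracy figures reported later.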
6. Algorithm
6.1 KNN Algorithm
K Nearest Neighbor is one of the most straightforward classification techniques. It belongs to the supervised learning domain and is widely applied in pattern recognition, data mining and intrusion detection. In KNN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its K nearest neighbors (K is a positive integer, typically small). If K = 1, then the object is simply assigned to the class of that single nearest neighbor.
The aim of K-NN is to classify a test instance by finding its neighborhood, which consists of the k closest instances in the training set. The class label of the test instance is based on the dominance of the classes in the neighborhood, that is, the test instance is assigned to the majority class of the k instances. In practice, it is difficult to choose the value of k. The accuracy obtained with this algorithm is 0.7915.
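The majority vote among the k closest training instances can be sketched as follows. This is a minimal illustration that assumes Euclidean distance and uses made-up points of the form (tenure, monthly charge); it is not the implementation used in the study.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Assign the query to the majority class among its k nearest training points."""
    # Rank training indices by Euclidean distance to the query.
    order = sorted(range(len(train_points)), key=lambda i: dist(train_points[i], query))
    nearest_labels = [train_labels[i] for i in order[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up training data, used only for illustration.
points = [(1, 70.0), (2, 80.0), (3, 75.0), (40, 30.0), (50, 25.0), (60, 20.0)]
labels = ["churn", "churn", "churn", "stay", "stay", "stay"]

print(knn_predict(points, labels, (4, 72.0), k=3))   # churn
print(knn_predict(points, labels, (45, 28.0), k=3))  # stay
```

In practice the features are usually rescaled first, because an attribute with a large numeric range would otherwise dominate the distance.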
6.2 Linear Regression
Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory variable, and the other is considered to be a dependent variable. A linear regression line has an equation of the form Y = a + bX, where X is the explanatory variable and Y is the dependent variable. The slope of the line is b, and a is the intercept (the value of Y when X = 0). Linear regression analysis is used to predict the value of a variable based on the value of another variable. The variable you want to predict is called the dependent variable. The variable you are using to predict the other variable's value is called the independent variable. This form of analysis estimates the coefficients of the linear equation, involving one or more independent variables, that best predict the value of the dependent variable. Linear regression fits a straight line or surface that minimizes the discrepancies between predicted and actual output values. There are simple linear regression calculators that use a “least squares” method to discover the best-fit line for a set of paired data. You then estimate the value of Y (dependent variable) from X (independent variable). The accuracy obtained with this algorithm is 0.267.
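The least squares fit of the line Y = a + bX can be sketched as follows. The function name and the paired data are hypothetical; the data were chosen to lie exactly on Y = 1 + 2X so that the result is easy to check.

```python
def fit_line(xs, ys):
    """Least squares estimates of intercept a and slope b for Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of X and Y divided by the variance of X.
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    a = mean_y - b * mean_x
    return a, b


xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
a, b = fit_line(xs, ys)
print(a, b)            # 1.0 2.0
print(a + b * 5)       # estimate of Y for X = 5
```

Linear regression produces a continuous value, so scoring it as a churn classifier requires turning that value into a yes/no label, for example with a cut-off. The paper does not state how this was done.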
6.3 Decision Tree
Decision trees are a type of supervised machine learning, that is, the training data specifies both the input and the corresponding output, in which the data is repeatedly split according to a certain parameter. Decision tree algorithms are often referred to as CART, or Classification and Regression Trees. Decision trees learn from data to approximate a target function with a set of if-then-else decision rules. The deeper the tree, the more complex the decision rules and the closer the model fits the training data. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. A decision tree is a classifier that recursively partitions the instance space. A typical decision tree is composed of internal nodes, edges and leaf nodes. Each internal node is called a decision node and represents a test on an attribute or a subset of attributes, and each edge is labeled with a specific value or range of values of the input attributes. In this way, internal nodes and their edges split the instance space into two or more partitions. Each leaf node is a terminal node of the tree with a class label. The tree can thus be described by two entities, decision nodes and leaves: the decision nodes are where the data is split, and the leaves are the decisions or final outcomes. To choose the best attribute, entropy is computed. Entropy is the measure of uncertainty of a random variable; it characterizes the impurity of an arbitrary collection of examples. The higher the entropy, the higher the information content. The accuracy obtained with this algorithm is 0.8765.
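Entropy, and the information gain that is commonly derived from it to rank candidate attributes, can be sketched as follows. Information gain is not named in the text and is introduced here as an assumption about how entropy is used to pick the best attribute; the attribute names and values are made up.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def information_gain(labels, feature_values):
    """Reduction in entropy obtained by splitting the labels on a feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(group) / n * entropy(group) for group in groups.values())
    return entropy(labels) - remainder


# Made-up examples, used only for illustration.
labels = ["churn", "churn", "stay", "stay"]
contract = ["monthly", "monthly", "yearly", "yearly"]  # separates the classes perfectly
gender = ["F", "M", "F", "M"]                          # carries no information

print(entropy(labels))                     # 1.0
print(information_gain(labels, contract))  # 1.0
print(information_gain(labels, gender))    # 0.0
```

A tree builder would place the attribute with the highest gain (here `contract`) at the root and repeat the procedure on each partition.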
6.4 XGBoost
XGBoost is an algorithm that has recently been dominating applied machine learning. It is an implementation of gradient boosted decision trees designed for speed and performance; the name stands for eXtreme Gradient Boosting. The library is focused on computational speed and model performance, so there are few frills.
Nevertheless, it does offer a number of advanced features. The implementation of the algorithm was engineered for efficient use of compute time and memory. A design goal was to make the best use of available resources to train the model.
Key implementation features include a sparse-aware implementation with automatic handling of missing data values, a block structure to support the parallelization of tree construction, and continued training, so that an already fitted model can be boosted further on new data.
The accuracy obtained with this algorithm is 0.813.
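The idea of gradient boosted trees that XGBoost implements can be illustrated with a deliberately small sketch. This is not XGBoost: it uses one-split trees (stumps) on a single feature with squared error, and it has none of the library's regularization, sparsity handling or parallelism. All names and data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on a 1-D feature, minimizing squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Fit stumps one after another, each on the residuals left by the previous ones."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


# Made-up data: the target switches from 0 to 1 between x = 3 and x = 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
model = gradient_boost(xs, ys)
print(round(model(2), 3), round(model(5), 3))
```

Each round corrects part of the error left by the previous rounds, which is the sense in which the trees are "boosted".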
6.5 Naive Bayes Algorithm
Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. Bayes’ Theorem gives the probability of an event occurring given that another event has already occurred. To create a classifier model, we compute the probability of the given set of inputs for every possible value of the class variable y and pick the output with the maximum probability. The technique is easiest to understand when described using binary or categorical input values. It is called naive Bayes or idiot Bayes because the calculation of the probabilities for each hypothesis is simplified to make it tractable. Rather than attempting to calculate the joint probability of the attribute values P(d1, d2, d3|h), the attributes are assumed to be conditionally independent given the target value, so the probability is calculated as P(d1|h) * P(d2|h) and so on. This is a very strong assumption that is most unlikely to hold in real data, i.e. that the attributes do not interact.
Nevertheless, the approach performs surprisingly well on data where this assumption does not hold.
The accuracy obtained with this algorithm is 0.7545.
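For categorical inputs, the calculation P(h) * P(d1|h) * P(d2|h) described in this section can be sketched as follows. The sketch estimates every probability as a plain frequency and omits the smoothing that practical implementations usually add; the function names and the toy rows are hypothetical.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and, per class, the frequency of each feature value."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    for row, label in zip(rows, labels):
        for index, value in enumerate(row):
            feature_counts[(index, label)][value] += 1
    return class_counts, feature_counts


def predict(model, row):
    """Return the class h that maximizes P(h) * P(d1|h) * P(d2|h) * ..."""
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(h)
        for index, value in enumerate(row):
            # Conditional P(d_i | h), estimated as a relative frequency.
            score *= feature_counts[(index, label)][value] / count
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up rows of the form (contract type, internet service).
rows = [
    ("monthly", "fiber"), ("monthly", "fiber"), ("monthly", "dsl"),
    ("yearly", "dsl"), ("yearly", "dsl"), ("yearly", "fiber"),
]
labels = ["churn", "churn", "churn", "stay", "stay", "stay"]
model = train_naive_bayes(rows, labels)

print(predict(model, ("monthly", "fiber")))  # churn
print(predict(model, ("yearly", "dsl")))     # stay
```

Without smoothing, a feature value never seen with a class drives that class's score to zero, which is why real implementations add a small count to every frequency.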
6.6 Artificial Neural Networks
The working of an ANN is inspired by the neural network in the human brain. An ANN works on something referred to as a hidden state, a transient form that has a more probabilistic behavior. A grid of such hidden states acts as a bridge between the input and the output. Before using the technique, an analyst must know how it really works. Even though the detailed derivation may not be required, one should know the framework of the algorithm. This knowledge serves multiple purposes. Firstly, it helps us understand the impact on computational time of increasing or decreasing the dataset vertically or horizontally. Secondly, it helps us understand the situations in which the model fits best. Thirdly, it helps us explain why a certain model works better in certain environments or situations. Every linkage calculation in an Artificial Neural Network (ANN) is similar. In general, we assume a sigmoid relationship between the input variables and the activation rate of the hidden nodes, or between the hidden nodes and the activation rate of the output nodes. Let us prepare the equation to find the activation rate of H1.
Logit(H1) = W(I1H1) * I1 + W(I2H1) * I2 + W(I3H1) * I3 + Constant = f
P(H1) = 1 / (1 + e^(-f))
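The activation rate of H1 can be computed directly from this equation. In the sketch below the weights, inputs and constant are made-up numbers chosen so that f is exactly 0; the function names are illustrative.

```python
from math import exp


def sigmoid(f):
    """Logistic function 1 / (1 + e^(-f))."""
    return 1.0 / (1.0 + exp(-f))


def activation_h1(inputs, weights, constant):
    """P(H1): sigmoid of the weighted sum of the inputs plus the constant."""
    f = sum(w * i for w, i in zip(weights, inputs)) + constant
    return sigmoid(f)


inputs = [1.0, 0.0, 1.0]      # I1, I2, I3 (made-up values)
weights = [0.5, 0.25, 0.25]   # W(I1H1), W(I2H1), W(I3H1) (made-up values)
constant = -0.75

print(activation_h1(inputs, weights, constant))  # 0.5, because f = 0
```

The same calculation is repeated for every hidden node and again for the output nodes, with the hidden activations taking the place of the inputs.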
The output computed in this way is compared with the actual value to obtain the error. Using these errors, we can recalibrate the weights of the linkages between the hidden nodes and the input nodes, in the same way as the weights between the hidden nodes and the output nodes. This calculation is repeated multiple times for each observation in the training set. ANN must be used with care for predictive modelling, because it tends to over-fit the relationship. ANN is generally used in cases where what has happened in the past is repeated almost exactly in the same way. For example, say we are playing the game of Black Jack against a computer. An intelligent opponent based on ANN would be a very good opponent in this case (assuming the computation time can be kept low). With time, the ANN will train itself on all possible cases of card flow. And given that we are not shuffling cards with a dealer, the ANN will be able to memorize every single call. Hence, it is a kind of machine learning technique which has enormous memory. But it does not work well when the scoring population is significantly different from the training sample.
For instance, if I plan to target customers for a campaign using their past responses with an ANN, I will probably be using the wrong technique, as it might have over-fitted the relationship between the response and the other predictors. Artificial Neural Networks (ANN) have many different coefficients that they can optimize. Hence, they can handle much more variability than traditional models.
7. Summary of results of state-of-the-art algorithms
Table 1.1
Table 1.1 lists the accuracy, precision and recall scores of the algorithms on the dataset.
S.no  Algorithm          Accuracy (training set)  Accuracy (test set)  Precision score  Recall score
1     KNN                0.823                    0.754                0.540            0.418
2     Linear regression  0.267                    0.278                0.501            0.390
3     Random forest      0.984                    0.788                0.624            0.440
4     Decision tree      0.998                    0.740                0.497            0.504
5     XGBoost            0.822                    0.808                0.688            0.491
6     Naive Bayes        0.7545                   0.755                0.524            0.7272
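Accuracy, precision and recall are standard measures derived from the counts of correct and incorrect predictions. The sketch below shows the usual definitions on made-up labels; it assumes that "churn" is treated as the positive class, which the paper does not state.

```python
def classification_scores(y_true, y_pred, positive="churn"):
    """Accuracy, precision and recall for a binary problem with the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Made-up labels, used only for illustration.
y_true = ["churn", "churn", "churn", "stay", "stay", "stay", "stay", "stay"]
y_pred = ["churn", "churn", "stay", "churn", "stay", "stay", "stay", "stay"]

accuracy, precision, recall = classification_scores(y_true, y_pred)
print(accuracy, round(precision, 3), round(recall, 3))  # 0.75 0.667 0.667
```

On an unbalanced dataset, recall on the churn class indicates how many actual churners a model finds, which accuracy alone does not show.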
8. Conclusion
In this study, we plotted different graphs to explore the data, trained several machine learning algorithms and one deep learning algorithm on it, and compared them to find the most accurate algorithm for predicting churners and non-churners in the telecom industry. Selecting the right combination of attributes and fixing proper threshold values may produce more accurate results.
This study limits itself to the prediction of churn. In future, we will extend the analysis using several deep learning algorithms.
References
[1] B. Huang, M.T. Kechadi, and B. Buckley, “Customer churn prediction in telecommunications,” Exp. Sys. Appl., vol. 39(1), pp. 1414-1425, 2012.
[2] Huang, Bingquan, Mohand Tahar Kechadi, and Brian Buckley. "Customer churn prediction in telecommunications." Expert Systems with Applications 39.1 (2012): 1414-1425.
[3] R. Mattison, Telecom Churn Management, APDG Publishing, Fuquay-Varina, NC, 2001.
[4] J. Hadden, A. Tiwari, R. Roy, and D. Ruta, “Churn prediction: Does technology matter?,” World Academy of Science, Engineering and Technology, vol. 16, pp. 193-199, 2006.
[5] Mattison, Rob. The telco churn management handbook. Lulu.com, 2006.
[6] Rosset, S., Neumann, E., Eick, U. and Vatnik, N. (2003) Customer lifetime value models for decision support. Data Min. Knowl. Discov., 7, 321–339.
[7] Yong, Zhou, Li Youwen, and Xia Shixiong. "An improved KNN text classification algorithm based on clustering." Journal of computers 4.3 (2009): 230-237.
[8] Soucy, P., & Mineau, G. W. (2001). A simple KNN algorithm for text categorization. In Proceedings 2001 IEEE International Conference on Data Mining (pp. 647-648). IEEE.
[9] Keramati, A., Jafari-Marandi, R., Aliannejadi, M., Ahmadian, I., Mozaffari, M., & Abbasi, U. (2014). Improved churn prediction in telecommunication industry using data mining techniques. Applied Soft Computing, 24, 994-1012.
[10] Keramati A, Jafari-Marandi R, Aliannejadi M, Ahmadian I, Mozaffari M, Abbasi U. Improved churn prediction in telecommunication industry using data mining techniques. Applied Soft Computing. 2014 Nov 1;24:994-1012.
[11] Kirui, C., Hong, L., Cheruiyot, W. and Kirui, H., 2013. Predicting customer churn in mobile telephony industry using probabilistic classifiers in data mining. International Journal of Computer Science Issues (IJCSI), 10(2 Part 1), p.165.
[12] Bi W, Cai M, Liu M, Li G. A big data clustering algorithm for mitigating the risk of customer churn. IEEE Transactions on Industrial Informatics. 2016 Jun;12(3):1270-81.
[13] Huang, Y. and Kechadi, T., 2013. An effective hybrid learning system for telecommunication churn prediction. Expert Systems with Applications, 40(14), pp.5635-5647.
[14] Huang Y, Kechadi T. An effective hybrid learning system for telecommunication churn prediction. Expert Systems with Applications. 2013 Oct 15;40(14):5635-47.
[15] Keramati, A., Jafari-Marandi, R., Aliannejadi, M., Ahmadian, I., Mozaffari, M. and Abbasi, U., 2014. Improved churn prediction in telecommunication industry using data mining techniques. Applied Soft Computing, 24, pp.994-1012.
[16] Hassouna, Mohammed, Ali Tarhini, Tariq Elyas, and Mohammad Saeed AbouTrab. "Customer churn in mobile markets a comparison of techniques." arXiv preprint arXiv:1607.07792 (2016).
[17] Hassouna M, Tarhini A, Elyas T, AbouTrab MS. Customer churn in mobile markets a comparison of techniques. arXiv preprint arXiv:1607.07792. 2016 Jan 18.
[18] Amin, A., Anwar, S., Adnan, A., Nawaz, M., Alawfi, K., Hussain, A. and Huang, K., 2017. Customer churn prediction in the telecommunication sector using a rough set approach. Neurocomputing, 237, pp.242-254.
[19] Amin A, Anwar S, Adnan A, Nawaz M, Alawfi K, Hussain A, Huang K. Customer churn prediction in the telecommunication sector using a rough set approach. Neurocomputing. 2017 May 10;237:242-54.
[20] Amin, Adnan, et al. "Customer churn prediction in the telecommunication sector using a rough set approach." Neurocomputing 237 (2017): 242-254.