RESULTS - A comparative study of tree-based models for churn prediction: a case study in the telecommunication sector

This section discusses the results obtained with the presented methodology. Section 4.1 presents the results of the preprocessing step, and Section 4.2 presents the final results. All results are assessed by the AUC of the ROC curve, as proposed by the challenge committee.
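
Because every result in this section is reported as the AUC of the ROC curve, a minimal sketch of that metric may help. The Python below is illustrative only (the study itself used R); the function name `roc_auc` and the toy labels and scores are assumptions, not taken from the study. It uses the rank interpretation of AUC: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    O(P * N) pairwise version, kept simple for clarity.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = churner, 0 = non-churner.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.8]

auc_model = roc_auc(labels, scores)
auc_constant = roc_auc(labels, [0.5] * len(labels))
```

A scorer that assigns the same score to every case gets an AUC of exactly 0.5 under this definition, which is the chance level against which the values reported in this section can be read.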

4.1. PREPROCESS RESULTS

The results of the preprocessing step are presented in this section. The dataset contains three different labels: Churn, Appetency and Up-Selling. Each label is used to construct a separate dataset, and each dataset is preprocessed in a different manner. For every step, 10-fold cross-validation was used; 80% of the dataset was used to train the model and the remaining 20% was used as a test set to assess performance.
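
The evaluation protocol described above (an 80/20 train/test split together with 10-fold validation) can be sketched as follows. This is an illustrative Python sketch, not the study's R code. The helper names `train_test_split_indices` and `k_fold_indices`, the fixed seed, and the choice to run the folds inside the 80% training part are all assumptions, since the text does not say how the two were combined.

```python
import random


def train_test_split_indices(n_rows, train_fraction=0.8, seed=0):
    """Shuffle row indices and split them into train and test index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = int(n_rows * train_fraction)
    return indices[:cut], indices[cut:]


def k_fold_indices(indices, k=10):
    """Yield (fit, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]  # k near-equal, disjoint folds
    for i in range(k):
        validation = folds[i]
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validation


# Hypothetical dataset size, chosen only to make the fold sizes easy to read.
train_idx, test_idx = train_test_split_indices(1000)
splits = list(k_fold_indices(train_idx, k=10))
```

Each of the ten splits fits on nine folds and validates on the tenth, while the 20% test indices are never touched until the final assessment.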

4.1.1. Imputing Missing Values for the three datasets

We start preprocessing by comparing different techniques for imputing missing values. Tables 3 to 5 show the performance of the imputation techniques in combination with three base classifiers, assessed by the AUC of the ROC curve after imputation. The results show that the techniques perform differently depending on the classifier and the dataset.

4.1.1.1. Imputing missing values for Churn dataset

Several imputation methods were applied to the Churn dataset. To assess each technique, the AUC of the ROC curve was measured with C4.5 DT, RF and GBM, as shown in Table 3. The highest AUC was achieved by hot-deck imputation (the Hot-Deck with Indicators variant scored highest for all three classifiers), so this technique will be used to impute the missing values in the Churn dataset.

Classifier   Mean     Hot-Deck   Hot-Deck with Indicators   MICE
C4.5         0.5061   0.5116     0.5231                     0.5169
RF           0.6407   0.6531     0.6888                     0.6398
GBM          0.6683   0.6794     0.7162                     0.6641

Table 3 The AUC results for Churn Dataset Missing Value Imputation
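
To make the compared techniques concrete, here is a minimal Python sketch of mean imputation, a random hot-deck, and missingness indicators for a single numeric column. It is an illustration under stated assumptions, not the study's R implementation: the text does not specify which hot-deck variant was used, so drawing a donor at random from the observed values is an assumption, as is reading "Hot-Deck with Indicators" as hot-deck imputation plus a binary flag column. MICE is not shown, and all function names are hypothetical.

```python
import random


def mean_impute(column):
    """Replace None with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


def hot_deck_impute(column, seed=0):
    """Replace None with a value drawn at random from the observed donors."""
    rng = random.Random(seed)
    donors = [v for v in column if v is not None]
    return [rng.choice(donors) if v is None else v for v in column]


def missing_indicator(column):
    """Binary flag column: 1 where the original value was missing."""
    return [1 if v is None else 0 for v in column]


# Hypothetical numeric feature with two missing entries.
feature = [4.0, None, 10.0, 7.0, None]

mean_filled = mean_impute(feature)
deck_filled = hot_deck_impute(feature)
flags = missing_indicator(feature)
```

Keeping the indicator column alongside the imputed feature lets a tree-based model split on the fact that a value was missing, which is one plausible reason for the indicator variant to score differently.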

4.1.1.2. Imputing missing values for Appetency dataset

The same procedure was applied to the Appetency dataset. The best performing technique was Hot-Deck imputation, so this technique will be used to impute the missing values in the Appetency dataset. The results are shown in Table 4.

Classifier   Mean     Hot-Deck   Hot-Deck with Indicators   MICE
C4.5         0.5      0.5        0.5                        0.5
RF           0.7464   0.7686     0.7445                     0.7407
GBM          0.8027   0.8107     0.7975                     0.8067

Table 4 The AUC results for Appetency Dataset Missing Value Imputation

4.1.1.3. Imputing missing values for Up-Selling dataset

The same procedure was applied to the Up-Selling dataset. The best performing technique for RF and GBM was MICE imputation, so this technique will be used to impute the missing values in the Up-Selling dataset. The results are shown in Table 5.

Classifier   Mean     Hot-Deck   Hot-Deck with Indicators   MICE
C4.5         0.6969   0.6465     0.6934                     0.6262
RF           0.8115   0.8334     0.8229                     0.8380
GBM          0.8394   0.8544     0.8445                     0.8579

Table 5 The AUC results for Up-Selling Missing Value Imputation

4.1.2. Balancing the datasets

This subsection compares algorithms for balancing the datasets. Three algorithms were used (SMOTE, TOMEK and OSS), and their results were assessed by the AUC after balancing. The algorithm with the highest AUC will be used to balance the dataset, provided it does not show signs of overfitting.

4.1.2.1. Balancing the Churn Dataset

Several balancing algorithms were applied to the dataset. To assess each algorithm, two classifiers (RF and GBM) were used and their AUC was measured before and after balancing. The initial AUC was 0.6796 for RF and 0.7032 for GBM. After applying SMOTE, the AUC increased by more than 10% (to 0.8565 for RF and 0.8414 for GBM), which may indicate that SMOTE led to overfitting, a known risk with this algorithm. OSS was therefore used to balance the dataset instead. The results are shown in Table 6.

Classifier   SMOTE    TOMEK    OSS
RF           0.8565   0.6533   0.6592
GBM          0.8414   0.6865   0.6818

Table 6 The AUC results for Churn Dataset Balancing
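
As an illustration of the undersampling idea behind the TOMEK column, the Python sketch below finds Tomek links (pairs of mutual nearest neighbours that carry different class labels) and drops the majority-class member of each pair. It is a simplified sketch, not the study's R code: the Euclidean distance, the brute-force neighbour search, the function names and the toy points are all assumptions. OSS, in its usual formulation, combines Tomek-link removal with a condensed nearest-neighbour step, which is not shown here.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbour(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    others = (j for j in range(len(points)) if j != i)
    return min(others, key=lambda j: squared_distance(points[i], points[j]))


def tomek_links(points, labels):
    """Pairs (i, j), i < j, of mutual nearest neighbours with different labels."""
    nn = [nearest_neighbour(i, points) for i in range(len(points))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and labels[i] != labels[j]
    ]


def remove_majority_in_links(points, labels, majority_label=0):
    """Drop the majority-class member of every Tomek link."""
    drop = set()
    for i, j in tomek_links(points, labels):
        drop.add(i if labels[i] == majority_label else j)
    keep = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in keep], [labels[k] for k in keep]


# Hypothetical 2-D toy data: label 0 is the majority (non-churn) class.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (5.0, 5.0), (5.2, 5.0)]
labels = [0, 0, 0, 1, 0, 0]

links = tomek_links(points, labels)
clean_points, clean_labels = remove_majority_in_links(points, labels)
```

In the toy data only the majority point sitting right next to the single minority point is removed; no synthetic rows are created, which is the main contrast with SMOTE.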

4.1.2.2. Balancing the Appetency Dataset

The same procedure was applied to the Appetency dataset. The initial AUC was 0.7760 for RF and 0.7814 for GBM. SMOTE behaved as it did on the Churn dataset, so OSS was used to balance the Appetency dataset. The results are shown in Table 7.

Classifier   SMOTE    TOMEK    OSS
RF           0.8830   0.7522   0.7608
GBM          0.8774   0.7744   0.7834

Table 7 The AUC results for Appetency Dataset Balancing

4.1.2.3. Balancing the Up-Selling dataset

The same procedure was applied to the Up-Selling dataset. The initial AUC was 0.8172 for RF and 0.8427 for GBM. SMOTE behaved as it did on the Churn dataset, so OSS was used to balance the Up-Selling dataset. The results are shown in Table 8.

Classifier   SMOTE    TOMEK    OSS
RF           0.8754   0.8358   0.8345
GBM          0.8692   0.8513   0.8583

Table 8 The AUC results for Up-Selling Dataset Balancing

4.1.3. Feature Selection

This subsection compares approaches for selecting the most important features in the dataset. The feature importance matrices of RF and GBM were used to decide which features are most important for the model. The set of features that scores the highest AUC will be used.

4.1.3.1. Feature Selection of the Churn Dataset

The initial AUC was 0.6796 for RF and 0.7032 for GBM. Random Forests was used to select the 25 most important features in the dataset, and RF was then run on these features to assess the AUC. GBM was likewise used to assess the predictive power of the selected features by running the algorithm on the original dataset and on the reduced dataset. The features selected using GBM achieved a higher AUC than those selected using RF, so GBM will be used for feature selection. The results are shown in Table 9.

Classifier   Feature Selection using Random Forests   Feature Selection using Gradient Boosting Machine
RF           0.6153                                   0.6635
GBM          0.6482                                   0.6935

Table 9 The AUC results for Churn Dataset after Feature Selection
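
The selection step itself (keep the features with the highest importance scores, then retrain on the reduced dataset) can be sketched in a few lines of Python. This is only an illustration: the importance values and feature names below are hypothetical, the helper names are not from the study, and `k=3` is used instead of the study's 25 so that the toy example stays readable.

```python
def top_k_features(importances, k=25):
    """Names of the k features with the highest importance scores."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return ranked[:k]


def reduce_rows(rows, selected):
    """Keep only the selected features in each row (rows are dicts)."""
    return [{name: row[name] for name in selected} for row in rows]


# Hypothetical importance scores, as a fitted RF or GBM might report them.
importances = {"Var1": 0.02, "Var2": 0.40, "Var3": 0.15, "Var4": 0.30, "Var5": 0.13}
rows = [{"Var1": 1, "Var2": 2, "Var3": 3, "Var4": 4, "Var5": 5}]

selected = top_k_features(importances, k=3)
reduced = reduce_rows(rows, selected)
```

Running this once with RF importances and once with GBM importances, and comparing the AUC of models trained on each reduced dataset, mirrors the comparison reported in the feature selection tables.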

4.1.3.2. Feature Selection of the Appetency Dataset

The initial AUC was 0.7760 for RF and 0.7814 for GBM. Random Forests was used to select the 25 most important features in the dataset, and RF was then run on these features to assess the AUC. GBM was likewise used to assess the predictive power of the selected features by running the algorithm on the original dataset and on the reduced dataset. The features selected using GBM achieved a higher AUC than those selected using RF, so GBM will be used for feature selection. The results are shown in Table 10.

Classifier   Feature Selection using Random Forests   Feature Selection using Gradient Boosting Machine
RF           0.6953                                   0.7579
GBM          0.7158                                   0.7712

Table 10 The AUC results for Appetency Dataset after Feature Selection

4.1.3.3. Feature Selection of the Up-Selling Dataset

The initial AUC was 0.8172661 for RF and 0.8427427 for GBM. Random Forests was used to select the 25 most important features in the dataset, and RF was then run on these features to assess the AUC. GBM was likewise used to assess the predictive power of the selected features by running the algorithm on the original dataset and on the reduced dataset. The features selected using Random Forests achieved a higher AUC with the GBM classifier (0.8423 versus 0.8380) and an almost identical AUC with RF (0.8251 versus 0.8256), so Random Forests will be used for feature selection. The results are shown in Table 11.

Classifier   Feature Selection using Random Forests   Feature Selection using Gradient Boosting Machine
RF           0.8251                                   0.8256
GBM          0.8423                                   0.8380

Table 11 The AUC results for Up-Selling Dataset after Feature Selection

4.2. FINAL MODEL AUC RESULTS

This section discusses the results obtained after applying Random Forests, GBM and a stacked combination of the two to the cleaned datasets produced by the preprocessing step. Every classifier was applied to the three datasets and the AUC of the ROC curve was measured. Table 12 shows that stacking increased the average AUC by about 2 percentage points: roughly 0.03 over Random Forest and 0.01 over GBM.

Model           Churn    Appetency   Up-Selling   Average
Random Forest   0.6635   0.7579      0.8251       0.7488
GBM             0.6935   0.7712      0.8423       0.769
Stacking        0.7111   0.78        0.845        0.7787

Table 12 The AUC results of the Final model
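
The averages in Table 12 and the size of the stacking gain can be checked with a short calculation. The Python below is only an arithmetic check on the published values; the variable names are illustrative, and "gain" is taken to mean the difference in average AUC, which is an assumption about how the reported improvement was computed.

```python
# AUC values copied from Table 12 (Churn, Appetency, Up-Selling).
auc = {
    "Random Forest": [0.6635, 0.7579, 0.8251],
    "GBM": [0.6935, 0.7712, 0.8423],
    "Stacking": [0.7111, 0.78, 0.845],
}

average = {model: sum(values) / len(values) for model, values in auc.items()}

# Gain of stacking over each base model, in AUC points.
gain_over_rf = average["Stacking"] - average["Random Forest"]
gain_over_gbm = average["Stacking"] - average["GBM"]
mean_gain = (gain_over_rf + gain_over_gbm) / 2

for model, value in average.items():
    print(f"{model}: {value:.4f}")
print(f"Mean gain from stacking: {mean_gain:.4f}")
```

The recomputed averages agree with the Average column, and the mean gain comes out at about 0.02 AUC points, with most of it coming from the comparison against Random Forest.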

4.3. ANALYSIS OF THE RESULTS

The machine used had a Core i5 processor and 8 GB of RAM, and all experiments were conducted using the R statistical software. The training-to-testing ratio for each dataset was fixed at 80:20. Class imbalance was handled only in the training datasets. The testing datasets were kept separate throughout the experiments and were used only at the end to assess the performance of the final classifiers.

The present work aims to identify an appropriate workflow for a challenging dataset that contains missing values, class imbalance, high-cardinality features and mixed data types. It also aims to compare the performance of different decision-tree-based ensemble techniques for detecting churn. The results show that ensemble techniques improve the performance of the classifiers and that stacking these ensembles improves the results further. The Churn dataset was the hardest of the three: its AUC was the lowest.

The imputation techniques improved the performance of the classifiers to different degrees. The best performing technique was hot-deck imputation, which yielded good results for the Churn and Appetency datasets, while MICE gave the best performance for the Up-Selling dataset. The best performing balancing technique was OSS, which gave a decent increase in the AUC without overfitting the datasets. The feature importance of GBM yielded clearly better results than that of Random Forests for the Churn and Appetency datasets, while Random Forests performed about the same as GBM for the Up-Selling dataset. Stacking the ensembles increased classifier performance further, raising the average AUC from 0.7488 (Random Forest) and 0.769 (GBM) to 0.7787.

The final average AUC was 0.7787, which would place us in the top 10% of this competition. The following table presents the top winners using the same dataset.

Table 13 The KDD 2009 winners (Guyon, Lemaire, Boullé, Dror, & Vogel, 2009)

4.4. KEY FINDINGS

• Multiple imputation methods produce better results than single imputation methods.

• SMOTE should be used with care, because in some cases it can cause overfitting.

• Wrapper methods for feature selection can take less computational effort and produce better results than filter methods.

• Tree-based ensembles produced better results than single trees by reducing the variance of the model.

• Churn can be predicted with high accuracy using tree-based ensembles.

• The H2O R package is a powerful tool that is more than capable of handling large, deep and heterogeneous datasets.
