4. Methodology
4.2. Tuning Model Hyper-Parameters
For the implementation of machine learning algorithms, generally two types of parameters are trained and used in model building. Hyper-parameters which are passed to the model for better performance and others which are learned by the algorithm during the execution phase. Hyper-parameters can be randomly selected and tested for a better model, but it is computationally exhaustive and time consuming to search for and find the best combination of parameters using the hit and try method. Grid search and random search methods are employed in our study for the selection of the hyper-parameters. Every combination of specified hyper-parameters is used for building the model in the grid search method and then the model is evaluated against the criteria specified. In random search, however, random combinations of parameters are used to find the best model. The subset of parameter values in the grid and random search are taken from the related work done in the past for similar solutions that have produced competitive results. Grid search is computationally more expensive than random search as it builds a model for
each combination of parameters and it is largely a affected by the number of features defining the data. In grid search, the parameter values are defined in the matrix form and combinations of these parameters are evaluated for a model to achieve the best performance.
In our experiments, we have applied grid and random search methods on the training data set only. To get unbiased estimates of our models’ predictive performance, the test data set was taken out as the first step with a 20-80 ratio, while keeping the balance of non-churn and churn customers in both data sets the same as in the full data set. Using the best parameters, the models were trained and evaluated with k-folds cross validation stratified sampling. To reproduce the experiments with the same results, a single seed value is used for all random factors where applicable.
Following tuned hyper-parameters were used in the experiments:
4.2.1 Random Forest
The following parameters are used for the binary classification of churner and non-churner customers using the random forest algorithm, which are replicated from Kaya et al. [2018] and taken as a base model:
• Number of Estimators (n_estimators): 500
This value defines the number of trees in the forest. • Split criteria (criterion): Gini
To measure the quality of split criteria, Gini is used. • Maximum depth of tree (max_depth): None
Default value is used for the maximum depth of the tree, which means nodes will be expanded till all leaves are pure.
• Number of features (max_features): 2
Two features are considered while considering each split.
• Minimum number of samples for internal split (min_samples_split): 2 Minimum of two samples are required for internal node split.
• Minimum Number of Samples at Leaf Node (min_samples_leaf): 1 Minimum one sample required to be at a leaf node.
• Limit on the Leaf Node Number (max_leaf_nodes): None Default value is used to grow trees with max leaf nodes. • Minimum Impurity for Split (min_impurity_split): 10-7
This is the threshold for early stopping in tree growth. This parameter is depre- cated; minimum impurity decrease (min_impurity_decrease) is used here.
4.2.2 Xtreme Gradient Boosting (XGBoost)
Using the grid search method, the following parameters are selected for the gradient boosting tree classification method:
• Maximum Tree Depth (max_depth): 2
This numerical value sets the maximum tree depth. • Learning rate (learning_rate): 0.01
Boosting learning rate of 0.01 is used. • Number of trees (n_estimators): 200
Number of trees to fit.
• Learning task (objective): binary:logistic
Binary logistic is defined as the learning task for classification task of churner and non-churners.
• Booster (booster): gbtree
In our case for tree based model gbtree is used. • (min_child_weight): 1
Minimum sum of instance weight needed in a child. • (gamma): 0
Default 0 value is used for gamma, which defines the minimum loss value required to make a further partition on a leaf node of the tree.
• (subsample): 0.5
This value is used for the sub-sampling of training data prior to growing the tree. • scale_pos_weight
This value controls the balance of classes in the imbalance data set. We have used equal weights for both the classes.
• (reg_alpha): 100
This is L1 regularization term on weights which is used after parameter tuning.
4.2.3 Recurrent Neural Networks (LSTM)
For the RNN model, the activation and optimizer functions which pertain to the design of the model were searched as a first step using the grid search method and then used for the tuning of the learning rate of model execution. For LSTM, a three-layered network is defined. The input layer of LSTM takes values of nine features which were calculated with a quarter time interval (i.e. four values calculated for an year). The cell-states are taken as the output from the LSTM network. Then three layered sequential neural networks takes the input of cell states (from LSTM) along with the demographic and calculated behavioral features. The neural network is designed with single hidden layer having 32 neurons. Dropout rate (of value 0.2) is used after the input and hidden layer, which reduces the network tendency to over-fit. Relu is used as the activation function at the input and hidden layer, and the sigmoid function is used at the output layer. The binary cross entropy loss function is used with adam function as an optimizer.
4.2.4 Graph Convolutional Network (GCN)
For the GCN model, a three-layered network with learning rate of 0.001 and drop out rate of 0.2 after the input layer is used for training the model with 200 epochs. Rectified linear unit (ReLU) is used as an activation function. This network is based on the graph network model defined by Kipf and Welling [2016].