Selecting and training a machine learning model

First, the data is split into training and testing data. The training data is used to train the model and the testing data is used to evaluate its performance. The data is split 70%/30% between training and testing. This assigns 10% more of the data to training than in the baseline model, because it became clear that the model needs more labeled data to perform better. The split results in 371 complexes for training and 160 complexes for testing.

Figure 4.6: Market value rented and Preferred rent scatter plot (x-axis: market_value_rented; y-axis: preferred_rent; legend: policy_sell, False/True)

To gain more insight into the training data, an algorithm for selecting the N best features for machine learning algorithms was used. The algorithm uses a metric to select the features in a data set that are good predictors of a particular target feature.
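As a quick check on the split sizes quoted at the start of this section, the following minimal sketch splits 531 placeholder complex IDs 70%/30%. The IDs, the seed and the helper name `split_train_test` are illustrative assumptions, not the code used in this work. Rounding the test size up is also an assumption, chosen because it reproduces the reported 371/160 split.

```python
import math
import random

def split_train_test(items, test_fraction=0.3, seed=0):
    """Shuffle items and split them into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_fraction)  # assumption: round up
    return shuffled[n_test:], shuffled[:n_test]

complex_ids = range(531)  # placeholder IDs; 371 + 160 = 531 complexes
train, test = split_train_test(complex_ids)
print(len(train), len(test))  # 371 160
```

Fixing the seed keeps the split reproducible, so that later model comparisons are made on the same training and testing complexes.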

The framework used for machine learning (scikit-learn [23]) provides three types of metrics for univariate selection: ANOVA, chi² and mutual information. The ANOVA metric considers every feature independently and does not take combinations with other features into account, whereas the data analysis concluded that a combination of features is required to build a satisfactory predictive model. The chi² metric requires all feature values to be non-negative, and the current data set contains negative values as well. For those reasons, the metric that was used is mutual information for classification. This metric measures the mutual dependence between two variables.
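To make "mutual dependence between two variables" concrete, here is a minimal sketch of mutual information for two discrete variables. It is a simplification: scikit-learn estimates mutual information for continuous features with a nearest-neighbor method, which this sketch does not reproduce. The toy feature values and the function name `mutual_information` are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

# Hypothetical toy data: two binned features and a binary sell policy.
policy_sell = [0, 0, 1, 1, 0, 0, 1, 1]
informative = ["low", "low", "high", "high", "low", "low", "high", "high"]
uninformative = ["a", "b", "a", "b", "a", "b", "a", "b"]

print(mutual_information(informative, policy_sell))    # log(2), about 0.693
print(mutual_information(uninformative, policy_sell))  # 0.0
```

A feature that fully determines the policy scores log(2) here, while a feature that is independent of the policy scores zero, which is the ordering a selector based on this metric relies on.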

The algorithm was set to find the 15 best features in the data set. It selected the following features:

rent, maximum rent, market rent, preferred rent, market value rented, taxation value, tax value, ratio rent market value, rental class cheap, rental class above rent allowance limit, dwelling size for three or more people, classification on separation date type not social, surface area, WOZ value, WWS points.

The selection includes features that were also identified during the data analysis. This confirms that those features are good predictors of the sell policy.

For the next step, the training data is fed to machine learning algorithms and models are created. The models that were used are Logistic Regression, Support Vector Machine, Random Forest Classifier and XGBoost.

linear classification machine learning problems. The last model (XGBoost) was selected because it is popular in online machine learning competitions, where it is one of the models most often used by winners.

The models were trained and their metrics were calculated using 10-fold cross-validation (KFold) with the following options:

• No further pre-processing. The data set is not further pre-processed.

• With optimal features. The algorithm used to select the optimal features is feature ranking with recursive feature elimination and cross-validated selection. As a scoring function, the ROC AUC³ [3] is used, which is the probability that the predictor ranks a randomly chosen positive instance higher than a randomly chosen negative one. That means that whenever a feature is removed, the algorithm inspects the ROC AUC and determines whether it is better than before.

• With optimal features, PCA⁴ [14] and normalized data. The normalizer rescales each instance in the data set independently so that its norm equals one.

• With optimal features, PCA and robust scaled data. As the data contains outliers, scaling based on statistics that are robust to outliers is used: the median and the interquartile range.

• With optimal features, PCA, normalized data and robust scaled data
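The two scaling options in the list above can be sketched in a few lines. This is a minimal illustration rather than the scikit-learn implementation: the L2 norm for the normalizer and the "inclusive" quartile method are assumptions, the sample values are made up, and the robust scaler assumes a non-zero interquartile range.

```python
import math
import statistics

def normalize_row(row):
    """Rescale one instance so that its Euclidean (L2) norm equals one."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else list(row)

def robust_scale(column):
    """Center a feature on its median and divide by its interquartile range."""
    median = statistics.median(column)
    q1, _, q3 = statistics.quantiles(column, n=4, method="inclusive")
    iqr = q3 - q1  # assumed to be non-zero
    return [(v - median) / iqr for v in column]

print(normalize_row([3.0, 4.0]))        # [0.6, 0.8]
print(robust_scale([1, 2, 3, 4, 100]))  # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

Note the difference in direction: normalization works per instance (row), while robust scaling works per feature (column). In the second call the outlier 100 stays an outlier but does not distort the scale of the other values, because the median and the interquartile range ignore it.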

To select the models with the best performance, the F1 score was used as the selection metric. The F1 score is the harmonic mean of Precision and Recall, which makes it a good indicator of a model’s performance. Among the options, the best-performing one for each algorithm is:

• Logistic Regression with optimal features, normalized data and robust scaled data (see Table 4.5)

Metric       Score
Accuracy     0.82
Precision    0.72
Recall       0.82
F1 score     0.77

Table 4.5: Logistic Regression classifier metrics

³ Area Under the Receiver Operating Characteristic curve

• Support Vector Machine⁵ without further pre-processing (only robust scaled data) (see Table 4.6)

Metric       Score
Accuracy     0.85
Precision    0.78
Recall       0.80
F1 score     0.79

Table 4.6: Support Vector Machine classifier metrics

• Random Forest Classifier without further pre-processing (see Table 4.7)

Metric       Score
Accuracy     0.85
Precision    0.84
Recall       0.74
F1 score     0.79

Table 4.7: Random Forest classifier metrics

• XGBoost with optimal features and robust scaled data (see Table 4.8)

Metric       Score
Accuracy     0.88
Precision    0.81
Recall       0.87
F1 score     0.84

Table 4.8: XGBoost classifier metrics

All the classifiers perform similarly. Although the difference is small, XGBoost shows the best trade-off between Precision and Recall, indicated by the highest F1 score, and it also has the highest accuracy. Therefore, it can be concluded that the algorithm that most accurately predicts the sell policy of a complex is XGBoost.
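The F1 scores in Tables 4.5 to 4.8 can be recomputed from the reported Precision and Recall using the usual definition, F1 = 2 · Precision · Recall / (Precision + Recall). The sketch below does only that arithmetic; the helper name `f1_score` is an illustrative choice and no model is trained.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) as reported in Tables 4.5 to 4.8.
reported = {
    "Logistic Regression": (0.72, 0.82),
    "Support Vector Machine": (0.78, 0.80),
    "Random Forest": (0.84, 0.74),
    "XGBoost": (0.81, 0.87),
}

for name, (precision, recall) in reported.items():
    print(f"{name}: F1 = {f1_score(precision, recall):.2f}")
```

Rounded to two decimals this gives 0.77, 0.79, 0.79 and 0.84, matching the tables. Because the harmonic mean is pulled towards the lower of the two values, a model cannot reach a high F1 score by doing well on only one of Precision and Recall.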

⁵ The algorithm was trained with robust scaled data only. Otherwise it is very slow due to its imple-

Figure 4.7: XGBoost - ROC curve, AUC score with 10-fold (KFold) CV (title: Receiver operating characteristic; x-axis: False Positive Rate; y-axis: True Positive Rate; AUC for folds 0 to 9: 0.93, 0.98, 0.86, 1.00, 0.98, 0.89, 0.85, 0.92, 0.86, 0.90; mean ROC AUC = 0.92 ± 0.05; other legend entries: Luck, ± 1 std. dev.)

By default, the model makes a prediction based on a predicted likelihood: whenever the model predicts that a certain complex should be sold with a likelihood above or equal to 50%, it is classified as such, and vice versa. In order to select a threshold, the ROC⁶ curve and the AUC⁷ score are analyzed. The ROC curve is a graphical plot of the True Positive Rate (Sensitivity) against the False Positive Rate (1 - Specificity) as the predictor's threshold is varied, and the AUC score is a metric for binary classification that also considers all possible thresholds. The ROC curve for the XGBoost model is shown in Figure 4.7. The optimal threshold, with Precision and Recall as high as possible, is calculated from the ROC curve. In this case it is 0.46, or 46%. That means that whenever the model predicts the sell policy with a likelihood greater than or equal to 46%, the complex is classified with the sell policy, and with the do not sell policy in all other cases.
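The effect of moving the threshold from the default 50% to 46% can be illustrated with a small sketch. The likelihoods and labels below are hypothetical and chosen only to show one complex changing class; the helper names `classify` and `roc_point` are not from the original work.

```python
def classify(likelihoods, threshold=0.46):
    """Label a complex 'sell' when its predicted likelihood reaches the threshold."""
    return [p >= threshold for p in likelihoods]

def roc_point(labels, predictions):
    """Return (false positive rate, true positive rate) for one threshold."""
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    return fp / negatives, tp / positives

# Hypothetical likelihoods and true sell policies, for illustration only.
likelihoods = [0.10, 0.40, 0.46, 0.48, 0.55, 0.90]
labels = [False, False, True, False, True, True]

for threshold in (0.50, 0.46):
    predictions = classify(likelihoods, threshold)
    print(threshold, roc_point(labels, predictions))
```

Each threshold yields one point on the ROC curve. In this toy data, lowering the threshold to 0.46 recovers a missed positive (the true positive rate rises from 2/3 to 1) at the cost of one extra false positive (the false positive rate rises from 0 to 1/3), which is the trade-off the threshold selection weighs.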

In the next step, the parameters of the algorithm are optimized. For that purpose, a greedy grid search was used, which trains the model with different parameter values and selects the best ones based on a scoring function. The following parameters were considered:

• Maximum depth. Indicates the maximum depth of a tree. (Default = 6)

• Minimum child weight. The minimum sum of the hessian⁸ that is required in a child. It is used to control over-fitting. Higher values prevent the model from learning relations that are highly specific to the training data. (Default = 1)

• Gamma. Minimum loss reduction that is required in order to make a split on a leaf node of the tree. (Default = 0)

⁶ Receiver operating characteristic

⁷ Area Under the Curve

• Subsample. The ratio of training instances sampled for every tree. It prevents overfitting, and lower values make the model more conservative. The sampling is done in order to make the trees as varied as possible. (Default = 1)

• Columns sample by tree. Denotes the fraction of the total columns that is randomly sampled for each tree. That introduces variance between the trees, as every tree is constructed from different columns. (Default = 1)

• Regularization alpha. L1 regularization term on weights. Increasing this value makes the model more conservative. (Default = 0)
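The idea of a grid search over the parameters listed above can be sketched as follows. This is an exhaustive variant under stated assumptions and may differ from the greedy search used in this work: the candidate values are hypothetical, and `toy_score` is only a stand-in for a cross-validated F1 score, constructed to peak at a known point so that the search itself can be checked.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Score every combination in param_grid and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical candidate values; the real grid is not given in the text.
param_grid = {
    "max_depth": [2, 4, 6],
    "min_child_weight": [1, 2, 3],
    "subsample": [0.5, 0.75, 1.0],
}

def toy_score(params):
    """Stand-in for a cross-validated F1 score (NOT a real model evaluation)."""
    return -(abs(params["max_depth"] - 2)
             + abs(params["min_child_weight"] - 2)
             + abs(params["subsample"] - 0.75))

best_params, best_score = grid_search(param_grid, toy_score)
print(best_params)
```

In a real run, `score_fn` would train the model with the given parameters and return its cross-validated F1 score. The number of combinations grows multiplicatively with each added parameter, which is one reason to tune parameters in smaller groups.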

The grid search uses the F1 score as a scoring function, in order to reduce the number of wrong predictions of the model. The parameters are tuned based on the scores on the development (training) data set. The highest F1 score was obtained with the following parameters:

• Max depth: 2

• Minimum child weight: 2

• Gamma: 0

• Columns sample by tree: 0.1

• Subsample: 0.75

• Regularization alpha: 0
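For readers who want to reuse these settings, the sketch below maps the tuned values from the list above onto the parameter names of XGBoost's scikit-learn interface (`XGBClassifier`). The mapping of the descriptive names to these keys is an assumption based on the parameter descriptions, and the import is guarded so that the snippet also runs where xgboost is not installed.

```python
# Tuned values from the list above, keyed by XGBoost's scikit-learn parameter names.
tuned_params = {
    "max_depth": 2,
    "min_child_weight": 2,
    "gamma": 0,
    "colsample_bytree": 0.1,
    "subsample": 0.75,
    "reg_alpha": 0,
}
THRESHOLD = 0.46  # decision threshold from the ROC curve analysis

try:
    from xgboost import XGBClassifier  # optional dependency
    model = XGBClassifier(**tuned_params)
except ImportError:
    model = None  # xgboost not installed; the dictionary is still usable

print(tuned_params)
```

After fitting, the threshold would be applied to the predicted likelihood of the positive class, for example `model.predict_proba(X)[:, 1] >= THRESHOLD`, where `X` is a hypothetical feature matrix for the complexes to classify.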

With the parameters described above and the threshold obtained during the ROC curve analysis, the following metrics for the final model are obtained:

Metric       Score
Accuracy     0.89
Precision    0.86
Recall       0.82
F1 score     0.84

Table 4.9: XGBoost classifier metrics with optimal features, tuned parameters and

4.6 Model interpretation
