CHAPTER 4. DATA ANALYSIS AND RESULTS 53
4.10. Develop Predictive Model 90
4.10.3. Decision tree regression predictive model 106
Nevertheless, in order to fully explore this problem, another algorithm, the decision tree regression, was applied and the results were analyzed before choosing the best model for the problem. A decision tree builds a regression model in the form of a tree-like structure to solve regression problems and is a good fit to handle the complex nonlinear relationship between features variables and target variable. A decision tree is a top-down approach where the processing breaks down a data set into smaller subsets while at the same time the tree moves down until the leaf node. The basic idea is to break down complex decisions into smaller subsets of simpler decisions so that it is easier to arrive at a solution. In a regression problem, the
decision tree considers features of data as predictor variables and the continuous variable as the target variable. The features with important information are chosen for the model, and features with no information are rejected automatically from the model, thus increasing the computational efficiency (Xu, Watanachaturaporn, Varshney, & Arora, 2005).
4.10.3.1. Tuning hyperparameters for decision tree regression model
A grid search algorithm was applied to find the optimal hyperparameters for the decision tree regression model. The following block of codes imported some required classes and data science packages for tuning hyperparameters:
# Import classes
from sklearn.tree import DecisionTreeRegressor from sklearn.model_selection import GridSearchCV from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split # Import Data Science package
import pandas as pd
The training data set was loaded and split into the train set and the test set. The following block of code performed the grid search on the train set only:
107 # Load the training data
training = pd.read_csv('..//NTD/training.csv') # Create the X arrays
X = training.set_index('Revenue Vehicle Inventory ID') # Create the y arrays
y = X.pop('Service Life')
# Split the training data into a train set (2/3) and a test set (1/3) X_train, X_test, y_train, y_test = train_test_split(X, y, random_state
= 42, test_size = 0.3)
A wide range of parameters was set in the dictionary ‘param_grid’. The code is as follows: # Define the search parameter values
param_grid = {'max_depth': [2, 3, 4, 5, 6], 'min_samples_leaf': [1, 2, 3, 4], 'min_samples_split': [1.0, 2, 3, 4], 'max_features': [1.0, 0.8, 0.6, 0.5, 0.4, 0.3, 0.1,'auto',None]}
The parameter grid setting included parameters as keys and a list of parameter values as values. The parameter grid was searched to find the best values for the model. After setting the parameter values, the decision tree regression model was instantiated with the random_state number of 42, which meant every time it was run the output would remain the same. Next, the grid search method was instantiated with the required parameters and was fit with the train set. The code is as follows:
# Instantiate the model
est = DecisionTreeRegressor(random_state = 42) # Instantiate and fit the grid search
gs_cv = GridSearchCV(est, param_grid, scoring = 'mean_squared_error', n_jobs = -1).fit(X_train, y_train)
Once the iterations were completed, the following code generated the optimal parameter values from the list of values:
# Best hyperparameter setting
108 The output is listed below.
Best Hyperparameters for train set: {'max_features': 0.8, 'min_samples_split': 2, 'max_depth': 4, 'min_samples_leaf': 2}
4.10.3.2. Developing and evaluating a decision tree regression predictive model
The 4 steps scikit-learn modeling was used on the decision tree regression model in the same way the previous models were built with the training set. The following block of codes was used to develop the decision tree regression model:
# Import classes
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
# Split the data set in a training set (2/3) and a test set (1/3) X_train, X_test, y_train, y_test = train_test_split(X, y, random_state
= 42, test_size = 0.3)
# Instantiate the decision tree regressor model
dtr_eval = DecisionTreeRegressor(random_state = 42, max_features = 0.8, min_samples_split = 2, max_depth = 4, min_samples_leaf = 2)
# Fit the model
dtr_eval.fit(X_train,y_train)
# Make the predictions on the train set y_pred_train = dtr_eval.predict(X_train) # Make the predictions on the test set y_pred_test = dtr_eval.predict(X_test)
The following block of code calculated the performance measures by decision tree regression model on the training set:
# Find the error rate on the train set
rms_train = np.sqrt(mean_squared_error(y_train, y_pred_train)) mae_train = mean_absolute_error(y_train, y_pred_train)
r2_train = r2_score(y_train,y_pred_train)
print('Root Mean Squared Error:\t\t%0.2f'% rms_train) print('Mean Absolute Error:\t\t\t%0.2f'% mae_train) print('R2 Score:\t\t%0.2f'% r2_train)
109
The performance results of predictive model on the training set are listed in Table 29.
Table 29. The Performance Measures with Decision Tree Regression on Training Set
Performance Measures Performance Scores
Root Mean Squared Error (RMSE) 3.47
Root Mean Squared Error (MAE) 2.17
R2 Score 0.76
Again, the following block of code calculated the performance measures by the decision tree regression model on the test set.
# Find the error rate on test set
rms_test = np.sqrt(mean_squared_error(y_test, y_pred_test)) mae_test = mean_absolute_error(y_test, y_pred_test)
r2_test = r2_score(y_test, y_pred_test)
print('Root Mean Squared Error:\t\t%0.2f' % rms_test) print('Mean Absolute Error:\t\t\t%0.2f' % mae_test) print('R2 Score:\t\t%0.2f' % r2_test)
The performance results of predictive model on the test set are listed in Table 30. The high performance scores on the train and test sets indicated that the decision tree regression model was not a good fit for the problem on a revenue vehicle inventory data set and will not predict well on unseen data. Therefore, the decision tree regression model was not considered as our predictive model.
Table 30. The Performance Measures with Decision Tree Regression on Test Set
Performance Measures Performance Scores
Root Mean Squared Error (RMSE) 3.48
Root Mean Squared Error (MAE) 2.20
R2 Score 0.77