Classification model training and optimization results

Chapter 5 Transportation Mode Classification Model

5.1 Developing the Classification model

5.1.6 Classification model training and optimization results

The input data for the model are user-supplied information about transportation mode segments (TMS) as provided previously in Table 4-2.

Feature Estimation and Selection

Next, the statistical properties of the attributes were computed. Average speed, maximum speed, minimum speed, acceleration and jerk were calculated; the standard deviations, the 98th percentiles and the differences between the 98th and 50th percentile values of the three parameters of motion (speed, acceleration and jerk) were also computed. These attributes were then ranked by their ADP, as shown in Equation (5-4).
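The attribute calculation described above can be sketched as follows. This is a minimal illustration, not the procedure used in this work: the function names, the uniform sampling interval `dt`, the finite-difference derivatives and the linear-interpolation percentile are all assumptions, and the sketch yields 12 attributes rather than reproducing the exact attribute set that was ranked.

```python
import math
import statistics


def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty sequence."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def diff(series, dt):
    """Finite-difference derivative of a uniformly sampled series (assumed sampling)."""
    return [(b - a) / dt for a, b in zip(series, series[1:])]


def segment_features(speed, dt=1.0):
    """Summary statistics for one segment; needs at least three speed samples."""
    accel = diff(speed, dt)   # acceleration derived from speed
    jerk = diff(accel, dt)    # jerk derived from acceleration
    features = {
        "speed_mean": statistics.fmean(speed),
        "speed_max": max(speed),
        "speed_min": min(speed),
    }
    # Standard deviation, 98th percentile and (98th - 50th) percentile spread
    # for each of the three parameters of motion.
    for name, series in (("speed", speed), ("accel", accel), ("jerk", jerk)):
        p98 = percentile(series, 98)
        p50 = percentile(series, 50)
        features[f"{name}_std"] = statistics.pstdev(series)
        features[f"{name}_p98"] = p98
        features[f"{name}_p98_minus_p50"] = p98 - p50
    return features
```

Calling `segment_features(speeds)` on the speed samples of one segment returns a dictionary of named attributes that could then be ranked.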

The features with an ADP of less than 90% were eliminated, as the inclusion of weakly differentiating attributes actually diminished the classifiers’ performance. For instance, the minimum speeds for all transportation modes were almost equal (nearly zero), which created a common attribute and precluded differentiation. Figure 5-8 illustrates the ranked features and their ADP; those shaded in grey, with values of less than 90%, were eliminated from the model. As a result of this step, the optimization method considers feature vectors of between 1 and 11 attributes; the remaining five attributes are eliminated.

Figure 5-8 ADP values of Ranked Attributes

Feature vector and classification model optimization

Given this list of attributes to be included in the classifier model, the next step is to investigate the optimal form of the model: α (the degree of correlation to be applied in calculating AADP); the data format; the classification technique and the parameters that define them; and the number of features (of the 11 brought forward from the previous step).

To this end, an iterative process was developed to calculate the MCR for a given combination of model parameters. Based on the size of the dataset, six independent values of MCR were calculated for each combination using the stratified cross validation technique described above. Figure 5-9 shows the pseudo code for calculating the MCR under the different combinations obtained by varying all parameters.
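As a rough illustration of how one combination yields six MCR values, the sketch below assigns samples to stratified folds and evaluates a plain k-NN classifier on each held-out fold. It is a simplified stand-in rather than the pseudo code of Figure 5-9: the function names, the fixed random seed, the Euclidean distance and the default `k` are assumptions, and it requires enough samples that every fold is non-empty.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, n_folds, seed=0):
    """Assign sample indices to folds so that each class is spread evenly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k nearest training points (squared Euclidean distance)."""
    nearest = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def cross_validated_mcr(features, labels, n_folds=6, k=3):
    """Return one misclassification rate per fold (assumes no fold is empty)."""
    rates = []
    for test_idx in stratified_folds(labels, n_folds):
        held_out = set(test_idx)
        train_x = [features[i] for i in range(len(labels)) if i not in held_out]
        train_y = [labels[i] for i in range(len(labels)) if i not in held_out]
        errors = sum(
            knn_predict(train_x, train_y, features[i], k) != labels[i]
            for i in test_idx
        )
        rates.append(errors / len(test_idx))
    return rates
```

Repeating `cross_validated_mcr` for every parameter combination would give the six MCR observations per combination that feed the later analysis.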

Figure 5-9 The pseudo code for calculating the MCR under different combinations

The number of combinations tested is the product of the sizes of the choice sets of all input variables. These sizes are: α = 6; NF = 8; CL = 3; Disc = 2; PCA = 2. Thus, the total is 6 × 8 × 3 × 2 × 2 = 576 combinations. Using the N-fold approach, in this case with N = 6, produces six MCR results for each combination of inputs. It is possible to determine the “optimal” model performance simply by finding the minimum MCR amongst all combinations. A more robust approach is to use a linear regression model to determine two outputs: the explanatory power of each combination and the interactions amongst a subset of the model’s parameters. The approach taken is to assign a new binary variable to each possible state of the input variables. For example, α_1 = 1 when the model is run with α = 0, and α_1 = 0 otherwise. Mathematically, this can be written:

when α_1 = 1, α_i = 0 ∀ i ≠ 1.

Similarly, CL_1 represents the state (type) of the classifier used in the model. When NB is used, CL_1 = 1 and CL_i = 0 ∀ i ≠ 1. The result is 21 binary variables representing the states of each input. Table 5-6 presents the notation used for the binary variables.

Table 5-6 Notation of binary variables x_{i,j}

Variable (i)    Levels (j)                     |x_i|
α               {0:0.2:1}                      6
NF              {1, 3, 5, 6, 7, 8, 10, 11}     8
CL              {NB, k-NN, QDA}                3
PCA             {0, 1}                         2
Disc            {0, 1}                         2
Total                                          21

Each variable is indexed by the subscript i; the individual values that each variable can take are indexed by the subscript j. The length of each variable, |x_i|, is the number of levels that are possible; for example, α may take on six values, so its length is six.
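The encoding in Table 5-6 can be illustrated with a short sketch. The level sets are taken from the table; the dictionary layout and the names `LEVELS` and `encode` are illustrative assumptions rather than part of the original implementation.

```python
from itertools import product

# Level sets from Table 5-6; {0:0.2:1} is expanded to its six values.
LEVELS = {
    "alpha": [0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
    "NF": [1, 3, 5, 6, 7, 8, 10, 11],
    "CL": ["NB", "k-NN", "QDA"],
    "PCA": [0, 1],
    "Disc": [0, 1],
}


def encode(combination):
    """Map one combination (variable -> chosen level) to its 0/1 indicator variables."""
    return {
        (name, level): int(combination[name] == level)
        for name, levels in LEVELS.items()
        for level in levels
    }


# Every combination of input levels, and the number of binary variables.
combinations = [dict(zip(LEVELS, values)) for values in product(*LEVELS.values())]
n_binary = sum(len(levels) for levels in LEVELS.values())

print(len(combinations), n_binary)  # prints: 576 21
```

For any single combination, exactly five of the 21 indicators equal one (one per input variable), which is the property stated above for α_1 and CL_1.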

Using this notation, a generalized form of the regression equation can be written. Equation (5-11) shows the equation to be solved. Here, Y represents the dependent variable, in this case the misclassification rate, MCR. The variable θ represents the regression constant and the coefficients of each term. The superscripts on θ indicate the “level” of interaction: level 1 is the main effect of the binary variables; level 2 is the pairwise interactions amongst all binary variables; and level 3 is the three-way interactions amongst select binary variables. For the three-way interactions, the model considers interactions amongst:

• α, NF, CL;
• NF, CL, PCA;
• NF, CL, Disc;
• CL, PCA, Disc.

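\[
Y = \theta_0 + \sum_{i=1}^{L} \sum_{j=1}^{|x_i|} \Bigg( \underbrace{\theta^{(1)}_{i,j}\, x_{i,j}}_{\text{main effects}} + \sum_{k=i+1}^{L} \sum_{l=1}^{|x_k|} \Bigg( \underbrace{\theta^{(2)}_{k,l}\, x_{i,j}\, x_{k,l}}_{\text{second-order interactions}} + \sum_{m=k+1}^{L} \sum_{n=1}^{|x_m|} \underbrace{\theta^{(3)}_{m,n}\, x_{i,j}\, x_{k,l}\, x_{m,n}}_{\text{third-order interactions}} \Bigg) \Bigg) \tag{5-11}
\]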
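To give a sense of the size of Equation (5-11) before any terms are eliminated, the sketch below counts the candidate terms implied by the variable lengths in Table 5-6 and the four three-way groups listed above. It assumes that every level of every variable receives its own indicator and that interactions are formed only between different variables; how redundant (collinear) indicators were handled in the actual calibration is not stated here, so the counts are an upper bound on the terms estimated.

```python
from itertools import combinations
from math import prod

# Number of levels |x_i| of each input variable (Table 5-6).
SIZES = {"alpha": 6, "NF": 8, "CL": 3, "PCA": 2, "Disc": 2}

# Three-way interaction groups considered in the model.
TRIPLES = [
    ("alpha", "NF", "CL"),
    ("NF", "CL", "PCA"),
    ("NF", "CL", "Disc"),
    ("CL", "PCA", "Disc"),
]

main_terms = sum(SIZES.values())
pair_terms = sum(SIZES[a] * SIZES[b] for a, b in combinations(SIZES, 2))
triple_terms = sum(prod(SIZES[v] for v in triple) for triple in TRIPLES)
total_terms = 1 + main_terms + pair_terms + triple_terms  # +1 for the constant

print(main_terms, pair_terms, triple_terms, total_terms)  # prints: 21 162 252 436
```

Under these assumptions, each of the 576 combinations activates the constant, 5 main-effect terms, 10 pairwise terms and 4 three-way terms.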

The regression model is calibrated iteratively. In the first iteration, all variables are included. After the first calibration, those variables that are statistically significant are retained in the model; those without statistical significance are eliminated. The model is then re-calibrated. The output of the model solution is a matrix of values for θ. The optimal model solution can be identified by finding the combination of variables whose summed θ values minimize the misclassification rate. Given the complexity present in the model outputs, with both pairwise and three-way interactions, it is difficult to find the best combination of variables through inspection. Therefore, a graphical illustration of the sums of the θ values, including interaction terms, is shown in Figure 5-10. The optimal model performance is shown in the top right diagram.

Figure 5-10 Results of the Relative Contribution of Optimizing Model’s Parameters

A number of observations can be made on the basis of the results shown in Figure 5-10 and provided in Appendix A-1:

• The coefficients associated with α were not statistically significant. Recall that the purpose of α is to reorder the attributes in the feature vector to account for correlation amongst these attributes. The interpretation of α being statistically insignificant is not that the variables were not reordered, but rather that the reordering did not influence the model’s performance significantly. This result can be explained by the high correlation between the attributes considered in the feature vector (FV). The calculated correlation coefficients are included in Appendix A-2.

• The performance of the NB classification model improves when feature discretization (Disc) is applied. This outcome is consistent with results found in the literature. However, feature discretization does not improve the performance of the QDA and k-NN models.

• When PCA is not applied, increasing the number of features in the feature vector either degrades or does not improve the models’ performance. It is speculated that the lack of improvement with an increased number of features is due to the high correlation between some of the included features.

• The best model performance was obtained when applying PCA to the continuous (non-discretized) attributes, using the whole set of features (11 features) and the k-NN classification model. Generally, PCA transforms the original correlated data into uncorrelated linear components, called principal components; it therefore overcomes the problem of the highly correlated set of features.

Final Model Formulation

Based on the model structure determined in the previous step, the parameters of the classification model are now calibrated. The results of the regression produce an optimized model with the following characteristics:

α = 0;

All 11 attributes are included in the model;

The best performing classification method is k-NN with k = 11; and

PCA is applied, retaining 98% of the variance in the data.

Although PCA is usually used for the purpose of dimensionality reduction, in this case PCA transforms the original data into linearly uncorrelated components, which provides the best results, perhaps because of the strong correlation amongst the attributes in the data set. As expected, retaining a higher number of principal components captures an increasing amount of the variance in the data. Using a retained-variance threshold of 98%, the first five ranked principal components have been selected. Figure 5-11 shows the accumulated variance; note that the first three principal components explain almost 95% of the total variance in the data.
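The component-selection rule described above (keep the smallest number of leading principal components whose cumulative explained variance reaches the threshold) can be sketched as follows. The function name is an assumption, and the variance shares in `example_ratio` are hypothetical values chosen only to exercise the rule; they are not the values plotted in Figure 5-11.

```python
from itertools import accumulate


def components_for_threshold(explained_ratio, threshold):
    """Smallest number of leading components whose cumulative variance share
    reaches the threshold; returns all components if it is never reached."""
    ranked = sorted(explained_ratio, reverse=True)
    for count, cumulative in enumerate(accumulate(ranked), start=1):
        # Small tolerance guards against floating-point round-off in the sum.
        if cumulative >= threshold - 1e-12:
            return count
    return len(ranked)


# Hypothetical variance shares for 11 components (illustrative only).
example_ratio = [0.60, 0.22, 0.12, 0.03, 0.015, 0.006,
                 0.004, 0.002, 0.0015, 0.001, 0.0005]

n_components = components_for_threshold(example_ratio, 0.98)
```

With real data, the explained-variance shares would come from the eigenvalues of the attribute covariance matrix, each divided by their sum.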

Figure 5-11 Cumulative and Individual Explained Variance by each Principal Component

As a preliminary test of the model formulation, the MCR was computed for the entire data set using cross validation with 10 folds. Averaged over the folds, this training and testing yields an MCR of 9%.

In document Automating and Optimizing a Transportation Mode Classification Model for use on Smartphone Data (Page 88-94)