Cluster analysis - Modelling drivers’ braking behaviour and comfort under normal driving

The next step is to create different scenarios based on human factors, reflecting the differences among drivers and in their braking patterns. To accomplish this, cluster analysis will be employed. Cluster analysis is a convenient method for identifying homogeneous groups of objects, called clusters, that share common characteristics (Sarstedt and Mooi, 2011). The two most widely used clustering techniques are hierarchical clustering and K-means clustering, which use hierarchical and partitioning algorithms respectively. A hierarchical algorithm forms the clusters successively: it is a stepwise algorithm which, at each step, merges the two objects (or clusters) with the least dissimilarity. Partitioning algorithms, on the other hand, determine all the clusters at the same time by building different partitions. The two methods are explained in more detail in the following paragraphs.

Hierarchical clustering is one of the most straightforward clustering methods (Norušis, 2011). Most hierarchical techniques fall into a category called agglomerative clustering, which starts with each object representing an individual cluster. The two most similar clusters are then merged to form a new one at the bottom of the hierarchy, and the process continues until all the objects are in one large cluster. A cluster hierarchy can also be formed by the opposite procedure (divisive clustering), i.e. all the observations form one cluster at the beginning and are then gradually split according to their similarity until every object forms its own cluster (Norušis, 2011; Sarstedt and Mooi, 2011). With hierarchical clustering, the number of clusters is chosen by the user, but it does not have to be specified before the clustering. Although the method is straightforward, it is not suitable for large datasets, since it requires a distance/similarity matrix between all pairs of cases, i.e. the distances between all pairs of cases must be calculated.
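The agglomerative procedure described above can be illustrated with a minimal sketch. It assumes single linkage (the distance between two clusters is the distance between their closest members) and plain Euclidean distance; neither choice is specified in the text, and the function name and data are hypothetical.

```python
from itertools import combinations
from math import dist


def agglomerative(points, n_clusters):
    """Merge the two closest clusters (single linkage) until n_clusters remain."""
    # Start with every object in its own cluster (stored as lists of indices).
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Single linkage (an assumption): distance between the closest members.
        return min(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the least dissimilarity.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        # Merge cluster b into cluster a; a < b, so index a stays valid after pop.
        clusters[a] = clusters[a] + clusters.pop(b)
    return clusters


# Hypothetical two-variable observations, for illustration only.
data = [(10.0, 1.0), (10.5, 1.1), (20.0, 2.5), (20.4, 2.4), (30.0, 4.0)]
print(agglomerative(data, 3))  # [[0, 1], [2, 3], [4]]
```

Every merge step compares all pairs of clusters, which in turn compares pairs of cases; this is the pairwise-distance cost that makes the method impractical for large datasets.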

The K-means algorithm, on the other hand, is a partitioning method and one of the most popular clustering algorithms (Wang, 2012). It is computationally simple and can deal with large datasets. The algorithm measures the dissimilarity between objects and assigns them to a pre-specified number of clusters, k. That the number of clusters is required before the clustering is one of the disadvantages of the K-means method. Different measures have been used to express the (dis)similarity between objects. The best known is the squared Euclidean distance, i.e. the square of the straight-line distance between two objects. Other measures are the angular and the Mahalanobis distance (Sarstedt and Mooi, 2011). The algorithm alternates between expectation and maximisation steps until it converges to a solution. In the first step, the algorithm assigns each object to the one of the k clusters whose centroid is closest to it. In the next step, it calculates for each cluster the point that minimises the sum of the squared distances between this point and the objects in the cluster (the cluster mean), which becomes the new centroid of that cluster. It then reclassifies all cases based on the new set of means, and so on. An object can therefore belong to a different cluster at each step, which is one more difference from the hierarchical method. The procedure is repeated until the cluster centroids no longer change much between successive steps (Norušis, 2011; Jung, 2012).
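The two alternating steps can be sketched as follows. This is a minimal illustration rather than the implementation of any particular package; the function name, the convergence tolerance and the example data are assumptions, and the initial centroids are simply supplied by the caller.

```python
from math import dist


def k_means(points, centroids, max_iter=100, tol=1e-9):
    """Alternate the assignment and centroid-update steps until convergence."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: each object joins the cluster with the nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members, which
        # minimises the sum of squared Euclidean distances to those members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = tuple(sum(x) / len(members) for x in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        # Stop once the centroids barely move between successive steps.
        shift = max(dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:
            break
    return labels, centroids


# Hypothetical data with k = 2; the first and third points seed the centroids.
data = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.8)]
labels, centres = k_means(data, [data[0], data[2]])
print(labels)
```

Because the labels are recomputed in every pass, an object can change cluster between iterations, unlike in the hierarchical procedure where a merge is never undone.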

Generally, K-means clustering has some advantages compared with hierarchical clustering: it is less influenced by outliers and by irrelevant clustering variables. Furthermore, as mentioned earlier, K-means clustering can handle very large datasets, in contrast to hierarchical clustering, since the procedure is less computationally demanding. On the other hand, the K-means algorithm is largely restricted to continuous variables (interval or ratio scaled data), owing to its use of the Euclidean distance. Finally, deciding the number of clusters in advance can be challenging.

To overcome the aforementioned disadvantages, the two-step cluster analysis was developed by Chiu et al. (2001). The two-step clustering method is a scalable cluster analysis algorithm designed to handle very large datasets, and it overcomes several difficulties of the classic clustering techniques. First, it can handle both categorical and continuous variables, since it is based on the log-likelihood distance measure, which assumes that all the variables are independent. In addition, all continuous variables are assumed to follow a normal distribution and categorical variables a multinomial one (SPSS Inc., 2001; Şchiopu, 2010; Norušis, 2011). Moreover, the method can automatically determine the optimal number of clusters by calculating and comparing measures of fit such as Akaike's Information Criterion (AIC) or the Bayesian Information Criterion (BIC); the smaller the value, the better the fit.
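As a reminder of how such criteria trade off fit against complexity, the sketch below uses the generic textbook forms, AIC = -2 ln L + 2p and BIC = -2 ln L + p ln n. The log-likelihoods and parameter counts are hypothetical, and the exact terms used inside SPSS are not reproduced here.

```python
from math import log


def aic(log_likelihood, n_params):
    """Akaike's Information Criterion in its generic form: -2 ln L + 2p."""
    return -2.0 * log_likelihood + 2.0 * n_params


def bic(log_likelihood, n_params, n_obs):
    """Bayesian (Schwarz) Information Criterion: -2 ln L + p ln n."""
    return -2.0 * log_likelihood + n_params * log(n_obs)


# Hypothetical candidates: number of clusters -> (log-likelihood, parameters).
candidates = {2: (-7700.0, 8), 3: (-6600.0, 12), 4: (-6590.0, 16)}
n_obs = 2700  # illustrative sample size

scores = {k: bic(ll, p, n_obs) for k, (ll, p) in candidates.items()}
best_k = min(scores, key=scores.get)  # the smaller the BIC, the better the fit
print(best_k)
```

In this made-up example the four-cluster solution fits slightly better than the three-cluster one, but not by enough to offset the penalty for its extra parameters.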

As its name suggests, this clustering technique consists of two steps: the pre-clustering step and the clustering step (SPSS Inc., 2001; Şchiopu, 2010; Norušis, 2011; Sarstedt and Mooi, 2011). In the first step, the algorithm creates pre-clusters by checking whether the current record should be merged with an existing pre-cluster or form a new one (similar to the K-means clustering procedure). This is accomplished by constructing a Cluster Features (CF) tree, in which the first case is placed at the root of the tree in a leaf node that contains summary information about that case. Subsequent cases are then either added to an existing node or form a new one, depending on their similarity to the existing nodes according to the distance measure. The construction of the CF tree includes an optional step for dealing with outliers, i.e. records that do not fit well into any cluster. The second step takes the resulting leaf nodes of the CF tree as input and groups them using an agglomerative hierarchical clustering algorithm, which allows a range of solutions with different numbers of clusters to be explored.
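The "join an existing pre-cluster or start a new one" decision can be illustrated with a deliberately simplified stand-in. The sketch below is not a CF tree: it keeps a flat list of pre-cluster summaries, uses Euclidean distance in place of the log-likelihood distance, and relies on a hypothetical fixed threshold. It is meant only to show how a single pass over the records can summarise them.

```python
from math import dist


def pre_cluster(records, threshold):
    """One sequential pass: join the nearest pre-cluster or start a new one."""
    # Each pre-cluster stores only summary statistics: count and per-variable sum.
    summaries = []
    for rec in records:
        best, best_d = None, None
        for s in summaries:
            centre = tuple(t / s["n"] for t in s["sum"])
            d = dist(rec, centre)  # Euclidean distance: a simplifying assumption
            if best_d is None or d < best_d:
                best, best_d = s, d
        if best is not None and best_d <= threshold:
            # Close enough: absorb the record into the existing pre-cluster.
            best["n"] += 1
            best["sum"] = tuple(t + x for t, x in zip(best["sum"], rec))
        else:
            # Too far from every pre-cluster: form a new one.
            summaries.append({"n": 1, "sum": tuple(rec)})
    return summaries


# Hypothetical records and threshold, for illustration only.
records = [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.0, 5.2), (0.1, 0.1)]
pre = pre_cluster(records, threshold=1.0)
print([s["n"] for s in pre])  # [3, 2]
```

Only the summaries, not the individual records, would be passed on to the second, hierarchical step, which is what keeps that step tractable for large datasets.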

Turning to the clustering procedure of this thesis, the human characteristics that will be included in the cluster analyses are gender and age category (19-30, 31-50, 51+). Specifically, the two-step cluster analysis in SPSS will be used, owing to two important advantages that were mentioned above and are essential for this analysis. First, it can handle large datasets by constructing a cluster features (CF) tree that summarises the records, in contrast to hierarchical clustering, which is inadequate for large datasets; the two datasets to be analysed consist of 2700 and 7160 observations. Second, it can handle both categorical and continuous variables, whereas K-means clustering can only handle continuous variables; the current clustering is based on human factors and on deceleration profiles, which are categorical variables. The other features of this method, i.e. that it automatically standardises all the variables, can handle outliers and insignificant variables, and selects the best number of clusters automatically, also played an essential part in its selection.

The procedure followed to select the best number of clusters is described below. Schwarz's Bayesian Criterion (BIC) is calculated for each number of clusters; the smaller the BIC, the better the cluster solution. The maximum number of clusters is set equal to the number of clusters at which the ratio of BIC changes (the Ratio of BIC Changes column in Table 4.4) falls below a threshold c1 for the first time. In Table 4.4, c1 has not been reached, and so SPSS stops at the maximum number of clusters set by the user, i.e. 15. Moreover, SPSS calculates the ratio change R(k) in distance for k clusters (the Ratio of Distance Measures column). To decide the best number of clusters, SPSS calculates the ratio R(k1)/R(k2) for the two largest values of R(k). If this ratio is larger than a threshold of 1.15, the number of clusters is set equal to k1; otherwise it is set to the larger of k1 and k2. In this case, the two largest values of R(k) are those for 2 and 3 clusters, R(2)/R(3) = 1.619/1.419 = 1.14 < 1.15, and SPSS therefore selects the 3-cluster solution as the best one (Table 4.4).

Table 4.4: Procedure for selecting the best number of clusters

Auto-Clustering

| Number of Clusters | Schwarz's Bayesian Criterion (BIC) | BIC Change | Ratio of BIC Changes | Ratio of Distance Measures |
|---|---|---|---|---|
| 1 | 19050.606 | | | |
| 2 | 15560.152 | -3490.454 | 1.000 | 1.619 |
| 3 | 13428.661 | -2131.492 | 0.611 | 1.419 |
| 4 | 11945.290 | -1483.371 | 0.425 | 1.175 |
| 5 | 10692.548 | -1252.742 | 0.359 | 1.037 |
| 6 | 9487.064 | -1205.484 | 0.345 | 1.325 |
| 7 | 8592.616 | -894.448 | 0.256 | 1.121 |
| 8 | 7801.248 | -791.368 | 0.227 | 1.157 |
| 9 | 7126.100 | -675.148 | 0.193 | 1.243 |
| 10 | 6595.160 | -530.940 | 0.152 | 1.052 |
| 11 | 6093.612 | -501.548 | 0.144 | 1.167 |
| 12 | 5672.884 | -420.728 | 0.121 | 1.055 |
| 13 | 5277.438 | -395.446 | 0.113 | 1.165 |
| 14 | 4947.107 | -330.331 | 0.095 | 1.086 |
| 15 | 4647.895 | -299.211 | 0.086 | 1.021 |
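The selection rule can be checked against the values in Table 4.4. The sketch below is a reading of the rule as described in the text, not SPSS code; the function names are hypothetical, and the 1.15 threshold is the one quoted above.

```python
# Rows of Table 4.4: number of clusters -> (BIC, ratio of distance measures R(k)).
table = {
    1: (19050.606, None), 2: (15560.152, 1.619), 3: (13428.661, 1.419),
    4: (11945.290, 1.175), 5: (10692.548, 1.037), 6: (9487.064, 1.325),
    7: (8592.616, 1.121), 8: (7801.248, 1.157), 9: (7126.100, 1.243),
    10: (6595.160, 1.052), 11: (6093.612, 1.167), 12: (5672.884, 1.055),
    13: (5277.438, 1.165), 14: (4947.107, 1.086), 15: (4647.895, 1.021),
}


def bic_change_ratios(rows):
    """BIC change for each k, relative to the change from 1 to 2 clusters."""
    ks = sorted(rows)
    changes = {k: rows[k][0] - rows[k - 1][0] for k in ks[1:]}
    first = changes[ks[1]]
    return {k: change / first for k, change in changes.items()}


def best_number_of_clusters(rows, threshold=1.15):
    """Compare the two largest R(k) values, following the rule in the text."""
    r = {k: v[1] for k, v in rows.items() if v[1] is not None}
    k1, k2 = sorted(r, key=r.get, reverse=True)[:2]
    ratio = r[k1] / r[k2]
    best = k1 if ratio > threshold else max(k1, k2)
    return best, ratio


ratios = bic_change_ratios(table)
best, ratio = best_number_of_clusters(table)
print(best, round(ratio, 2))  # 3 1.14
```

The recomputed ratios of BIC changes agree with the tabulated column to three decimals, and the smallest of them (about 0.086, at 15 clusters) is the value that is compared with c1.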

Two-step clustering also reports a goodness-of-fit measure, the silhouette measure of cohesion and separation, which is based on the average distances between objects. Its value ranges from -1 to +1, with values below 0.20 indicating poor solution quality, values between 0.20 and 0.50 fair quality, and values above 0.50 good quality. Finally, two-step clustering reports the importance of each variable included in the procedure, showing the user whether a variable is unnecessary.
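For reference, the sketch below computes the silhouette in its standard textbook form, s = (b - a) / max(a, b), where a is the mean distance from an object to the other members of its own cluster and b is the mean distance to the members of the nearest other cluster. SPSS may compute its measure differently (for example, from cluster centres), so this is an illustration only; the data are hypothetical, and the quality bands are those quoted above.

```python
from math import dist


def silhouette(points, labels):
    """Mean silhouette coefficient; assumes at least two clusters."""
    scores = []
    for i, p in enumerate(points):
        # Mean distance from p to the members of each cluster (excluding p).
        mean_d = {}
        for lab in set(labels):
            others = [dist(p, q) for j, q in enumerate(points)
                      if labels[j] == lab and j != i]
            if others:
                mean_d[lab] = sum(others) / len(others)
        a = mean_d.get(labels[i])  # cohesion: distance within own cluster
        b = min(d for lab, d in mean_d.items() if lab != labels[i])  # separation
        if a is None or max(a, b) == 0:
            scores.append(0.0)  # singleton or coincident points: score 0
        else:
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def quality(s):
    """Quality bands quoted in the text: <0.20 poor, 0.20-0.50 fair, >0.50 good."""
    if s < 0.20:
        return "poor"
    return "fair" if s <= 0.50 else "good"


# Hypothetical, well-separated clusters.
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
s = silhouette(pts, [0, 0, 1, 1])
print(round(s, 2), quality(s))  # 0.9 good
```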

