Describing website visitors
6.4 Model building
The first part of the analysis aims to identify the different behavioural segments within the sample of users. We use two different descriptive data mining techniques: cluster analysis and the unsupervised networks known as Kohonen maps. Both techniques allow us to partition the data to identify homogeneous groups or types possessing internal cohesion that differentiates them from the other groups.
Table 6.2 Frequency distribution.
Page Frequency
Initial 23492
Help 9287
Entertainment 2967
Office 15574
Windows 7328
Othersft 3046
Download 11320
Otherint 6237
Devolpment 8228
Hardware 2967
Business 2726
Information 2307
Area 3141
We use two techniques so we can compare their efficiency, but also to check that they produce consistent results.
6.4.1 Cluster analysis
Chapter 4 explained the main techniques of hierarchical cluster analysis as well as the non-hierarchical K-means method. A variation of the K-means method is used here. The basic idea is to introduce seeds, or centroids, to which statistical units may be attracted, forming a cluster. It is important to specify the maximum number of clusters, say G, in advance. As discussed in Section 4.2, hierarchical and non-hierarchical methods of cluster analysis do have some disadvantages. Hierarchical cluster analysis does not need to know the number of clusters in advance, but it may require too much computing power. For moderately large quantities of data, as in this case study, the calculations may take a long time. Non-hierarchical methods are fast, but they require us to choose the number of clusters in advance.
To avoid these disadvantages and to try to exploit the potential of both methods we follow a combined approach. First we run a non-hierarchical clustering procedure on the entire data set, having chosen a large value of G. We take the first G available observations as seeds. Then we run an iterative procedure; at each step we form temporary clusters, allocating each observation to the cluster with the seed nearest to it. Each time an observation is allocated to a cluster, the seed is replaced with the mean of that cluster (the centroid). We repeat the iterative process until convergence; that is, until no substantial changes in the cluster seeds are evident. At the end of the procedure, we have G clusters, with corresponding centroids.
This is the input to the next step, a hierarchical clustering procedure on a sample from the available data, the aim of which is to find the optimal number of clusters. The procedure is of course an agglomerative one, since the number of clusters cannot be greater than G.
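As an illustration of the agglomerative stage, the following is a minimal Python sketch (not the R procedure used for the case study) that repeatedly merges the pair of clusters whose fusion least increases the within-cluster sum of squares, which is Ward's criterion, the linkage adopted in this case study. It assumes the stage starts from a list of centroids with their cluster sizes; the function name `ward_merge_history` and its return format are hypothetical. The sketch only records the cost of each merge; deciding where to stop, and hence the number of clusters, is left to the analyst.

```python
def ward_merge_history(centroids, sizes):
    """Agglomerate clusters with Ward's criterion.

    Returns a list of (clusters_remaining, merge_cost) pairs, one per merge.
    """
    clusters = [(list(c), n) for c, n in zip(centroids, sizes)]
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                (ci, ni), (cj, nj) = clusters[i], clusters[j]
                d2 = sum((a - b) ** 2 for a, b in zip(ci, cj))
                # Increase in within-cluster sum of squares if i and j merge.
                cost = ni * nj / (ni + nj) * d2
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        cost, i, j = best
        (ci, ni), (cj, nj) = clusters[i], clusters[j]
        # The merged centroid is the size-weighted mean of the two centroids.
        merged = ([(ni * a + nj * b) / (ni + nj) for a, b in zip(ci, cj)], ni + nj)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        history.append((len(clusters), cost))
    return history
```

A sharp rise in the merge cost from one step to the next suggests that two genuinely distinct groups have just been fused, so the configuration before that merge is a natural candidate.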
Having ascertained the optimal number of clusters, we carry out a non-hierarchical clustering procedure to allocate the observations to the clusters, whose initial seeds are the centroids obtained in the previous step. The procedure is similar to the first non-hierarchical stage, and involves repeating the following two steps until convergence:
1. Scan the data and assign each observation to the seed that is nearest (in terms of Euclidean distance).
2. Replace each seed with the mean of the observations assigned to its cluster.
Here we choose G = 40. We carry out the hierarchical stage of the procedure on a sample of 2000 observations from the available data. Our distance function is the Euclidean distance, and we use Ward’s method to recompute the distances as the clusters are formed. To obtain valid cluster means for use as seeds in the third stage, we impose a minimum of 100 observations in each cluster.
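The two-step iteration listed above can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the implementation used for the case study: the function names, the tolerance `tol`, the iteration cap `max_iter` and the rule of keeping the old seed when a cluster becomes empty are all hypothetical choices.

```python
import math

def nearest(point, seeds):
    """Index of the seed closest to `point` in Euclidean distance."""
    return min(range(len(seeds)), key=lambda g: math.dist(point, seeds[g]))

def kmeans(data, seeds, tol=1e-9, max_iter=100):
    """Repeat the assign/update steps until the seeds stop moving."""
    seeds = [list(s) for s in seeds]
    labels = []
    for _ in range(max_iter):
        # Step 1: assign each observation to the nearest seed.
        labels = [nearest(x, seeds) for x in data]
        # Step 2: replace each seed with the mean of its cluster.
        new_seeds = []
        for g, old in enumerate(seeds):
            members = [x for x, lab in zip(data, labels) if lab == g]
            if members:
                new_seeds.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_seeds.append(old)  # assumption: an empty cluster keeps its seed
        shift = max(math.dist(a, b) for a, b in zip(seeds, new_seeds))
        seeds = new_seeds
        if shift <= tol:
            break
    return seeds, labels

# Toy data: two well-separated groups, seeded with the first two observations.
toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(toy, toy[:2])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

In the third stage of the combined approach, `seeds` would be the centroids produced by the hierarchical stage rather than the first observations.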
By applying Ward’s method, we find that the optimal number of clusters is 6. Applying the centroid method gives the same result. Running a final non-hierarchical procedure on the entire available data set, with six clusters, gives the results presented in Table 6.3. This shows the number of observations in each cluster. We have R² = 0.40 for the final configuration, which can be treated as a summary evaluation measure of the model.
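To make the summary measure concrete, here is a minimal Python sketch that computes R² for a given partition. It assumes R² is the usual share of total variability explained by the clusters, that is, one minus the ratio of the within-cluster to the total sum of squares; this section does not restate the formula, so that definition and the function name `r_squared` are our assumptions.

```python
def r_squared(data, labels):
    """1 - (within-cluster sum of squares) / (total sum of squares)."""
    n, p = len(data), len(data[0])
    overall = [sum(x[k] for x in data) / n for k in range(p)]
    total = sum((x[k] - overall[k]) ** 2 for x in data for k in range(p))
    # Group the observations by cluster label.
    groups = {}
    for x, lab in zip(data, labels):
        groups.setdefault(lab, []).append(x)
    within = 0.0
    for members in groups.values():
        mean = [sum(col) / len(members) for col in zip(*members)]
        within += sum((x[k] - mean[k]) ** 2 for x in members for k in range(p))
    return 1 - within / total
```

A value near 1 means that the clusters are tight relative to the overall spread of the data, while a value of 0 is obtained when all observations are placed in a single cluster.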
To better interpret the cluster configurations, Table 6.4 gives the means of each cluster for the most important variables. Note that clusters 1 and 6 have similar centroids, expressed by a similar mean number of visits to each page (especially Office, Entertainment and Windows). On the other hand, cluster 2 appears to have rather different behaviour, concentrated mainly on three pages (Help, Office and Windows).
Table 6.3 Cluster sizes for the final K-means cluster configuration.
Cluster Frequency
1 10725
2 60
3 19277
4 164
5 2325
6 160
Table 6.4 Cluster means for the final K-means cluster configuration.
Web page Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6
Help 0.25 2.41 0.26 0.70 0.44 0.61
Download 1.01 0.91 0 0.75 0.07 0.41
Office 0.70 1.63 0.26 1.14 1.11 0.70
Entertainment 0.10 0.75 0.08 0.18 0.08 0.13
Windows 0.30 1.7 0.06 1.64 1.02 0.26
6.4.2 Kohonen networks
Kohonen networks require us to specify the number of rows and the number of columns in the grid space characterising the map. Large maps are usually the best choice, as long as each cluster has a significant number of observations. The learning time increases significantly with the size of the map. The number of rows and the number of columns are usually established by conducting several trials until a satisfactory result is obtained. We will use the results of the cluster analysis to help us. Having identified 6 as the optimal number of clusters, we will consider a 3×2 map. The Kohonen mapping algorithm implemented in R essentially replaces the third step of the clustering algorithm with a procedure that repeats the following two steps until convergence:
1. Scan the data and assign each observation to the seed that is nearest (in terms of Euclidean distance).
2. Replace each seed with a weighted mean of the cluster means that lie in the grid neighbourhood of the seed’s cluster.
The weights correspond to the frequencies of each cluster. In this way the cluster configuration is such that any two clusters that are close to each other in the map grid will have centroids close to each other. The initial choice of the seeds can be made in different ways; we choose them at random. Alternatively, we could have used the centroids obtained from the second stage of the K-means clustering procedure.
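The batch update described above can be sketched as follows. This is a minimal Python illustration, not the R implementation referred to in the text, and it rests on several assumptions of ours: the grid neighbourhood of a cell is taken to be the cell itself plus the cells within Manhattan distance `radius` on the grid, the radius is held fixed, seeds are drawn at random from the data, and the names `kohonen_batch`, `tol`, `max_iter` and `rng_seed` are hypothetical.

```python
import math
import random

def kohonen_batch(data, rows, cols, radius=1, tol=1e-9, max_iter=50, rng_seed=0):
    """Batch Kohonen map on a rows x cols grid; returns seeds, labels, counts."""
    rng = random.Random(rng_seed)
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    seeds = [list(x) for x in rng.sample(data, len(grid))]  # random initial seeds
    p = len(data[0])
    labels, counts = [], [0] * len(grid)
    for _ in range(max_iter):
        # Step 1: assign each observation to the nearest seed.
        labels = [min(range(len(seeds)), key=lambda g: math.dist(x, seeds[g]))
                  for x in data]
        counts = [0] * len(grid)
        sums = [[0.0] * p for _ in grid]
        for x, lab in zip(data, labels):
            counts[lab] += 1
            for k in range(p):
                sums[lab][k] += x[k]
        # Step 2: replace each seed with the frequency-weighted mean of the
        # cluster means in its grid neighbourhood (sum of sums / sum of counts).
        new_seeds = []
        for g, (r, c) in enumerate(grid):
            neigh = [h for h, (r2, c2) in enumerate(grid)
                     if abs(r - r2) + abs(c - c2) <= radius]
            weight = sum(counts[h] for h in neigh)
            if weight == 0:
                new_seeds.append(seeds[g])  # assumption: keep seed if neighbourhood is empty
            else:
                new_seeds.append([sum(sums[h][k] for h in neigh) / weight
                                  for k in range(p)])
        shift = max(math.dist(a, b) for a, b in zip(seeds, new_seeds))
        seeds = new_seeds
        if shift <= tol:
            break
    return seeds, labels, counts
```

With `radius=0` each neighbourhood contains only the cell itself and the procedure reduces to the K-means iteration; a positive radius is what pulls the centroids of adjacent map cells towards each other.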
Table 6.5 reports, for each of the six chosen map clusters, the total number of observations in it (frequency). The groups obtained are now more homogeneous in terms of number of observations included. Table 6.6, which reports the cluster means, should be compared with Table 6.4 for the K-means procedure. R² is now 0.58, which is 0.18 higher than we obtained for the K-means procedure. From Table 6.6 we conclude that the findings in Table 6.4 are substantially confirmed.
Table 6.5 Cluster sizes for the final Kohonen map configuration.
Cluster Frequency
1 9572
2 5784
3 8301
4 1863
5 4995
6 2196
Table 6.6 Cluster means for the final Kohonen map configuration.
Web page Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6
Help 0.40 0.42 0.40 0.49 0.90 0.52
Download 0.64 0.43 0.39 0.44 0.49 0.48
Office 0.67 0.42 0.38 0.43 0.42 0.61
Entertainment 0.46 0.47 0.54 0.49 0.50 0.51
Windows 0.47 0.45 0.51 0.49 0.56 0.51