
5 Benchmarking and Energy Classification of the Hotel Sample

5.4. Clustering algorithms

Cluster analysis is another methodology for grouping objects. It partitions an initially unclassified data set into groups, or clusters, such that objects in the same cluster have similar characteristics, while the profiles of objects in different clusters are clearly distinct.

Clustering algorithms are generally divided into two categories: partitional and hierarchical.

Partitional clustering algorithms divide the data set into non-overlapping groups, i.e. separate clusters. Algorithms such as k-means and fuzzy k-means fall under this category. Partitional clustering algorithms employ an iterative approach to group the data into a pre-determined number k of clusters (Ahmad and Dey, 2007; Todd et al., 2009). The main drawbacks of the k-means algorithm are that the number of clusters k must be defined in advance and that the results are not unique, as the initial conditions may strongly influence the structure of the classes (De Smet et al., 2004; Foggia et al., 2009).

Hierarchical algorithms use the distance matrix as input and create a hierarchical set of clusters. Hierarchical clustering algorithms may be:

- Agglomerative – starting with each data element in its own cluster, a hierarchy of clusters is built by repeatedly merging the nearest clusters until only one final cluster remains, which contains all data elements (Ahmad and Dey, 2007; Seem, 2005).

- Divisive – the initial cluster containing all points is successively split into cohesive sub-clusters, until each point belongs to its own cluster or some other pre-defined termination condition is reached (Ahmad and Dey, 2007).
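The agglomerative variant can be illustrated with a minimal sketch. It assumes single linkage (the distance between two clusters is the distance between their closest members) and the Euclidean distance; neither choice is specified in the text. The function name `agglomerative`, the `n_clusters` stopping parameter and the sample points are hypothetical and serve only the illustration.

```python
import math

def agglomerative(points, n_clusters=1):
    """Repeatedly merge the two nearest clusters (single linkage, assumed).

    Returns the final clusters (lists of point indices) and the merge history.
    """
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    merges = []

    def linkage(c1, c2):
        # Single linkage: distance between the closest pair of members
        return min(math.dist(points[i], points[j]) for i in c1 for j in c2)

    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((list(clusters[a]), list(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return clusters, merges

# Hypothetical data: two tight pairs and one isolated point
pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
three, history = agglomerative(pts, n_clusters=3)
one, _ = agglomerative(pts, n_clusters=1)
```

Stopping at `n_clusters=1` reproduces the full hierarchy described above; a larger value simply cuts the merging early.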

Apart from the classical algorithms (partitional and hierarchical), other methods include graph-based algorithms. These algorithms do not require the number of clusters to be provided in advance and use properties of graphs (e.g. algorithms based on random walks or on spanning trees) in the clustering procedure (Foggia et al., 2009).

An important step in most clustering methods is the selection of the distance measure between data points. This choice influences the shape of the clusters, as some elements may be close to one another according to one distance and further away according to another. Two of the most common distance measures are the Euclidean distance (the most widely used), which corresponds to the length of the shortest path between two elements, and the city-block distance (the sum of the absolute differences along each dimension).
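A short sketch shows how the choice of distance can change which element counts as nearest. The function names and the three sample points are hypothetical and chosen only for illustration.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def city_block(p, q):
    """Sum of absolute differences along each dimension."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical points: a reference element and two candidates
ref = (0.0, 0.0)
a = (3.0, 3.0)
b = (0.0, 5.0)

# Under the Euclidean distance a is nearer to ref than b;
# under the city-block distance the ranking is reversed.
nearest_euclidean = min((a, b), key=lambda p: euclidean(ref, p))
nearest_city_block = min((a, b), key=lambda p: city_block(ref, p))
```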

5.4.1. The k-means algorithm

The clustering of the Greek hotels is carried out using the k-means algorithm, one of the most widely used and efficient clustering methods (Ahmad and Dey, 2007; Gaitani, 2011; Foggia et al., 2009; De Smet et al., 2004; Di Piazza et al., 2011). The k-means algorithm assigns each point to the cluster whose center (also called centroid) is nearest. It uses an iterative procedure that minimizes the sum of distances from each object to its cluster centroid, over all clusters. The procedure moves objects between clusters until this sum cannot be decreased further. The centroid of each cluster is the point for which the sum of distances from all objects in that cluster is minimized. The result is a set of clusters that are as compact and well-separated as possible (MATLAB Help Index, 2002).

The k-means algorithm follows a simple procedure to classify a given data set into a number of clusters k fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids should be placed carefully, because different initial locations lead to different results; a good choice is to place them as far away from each other as possible. The next step is to take each point of the data set and associate it with the nearest centroid. When all points have been assigned to clusters, the first step is completed. It is then necessary to re-calculate k new centroids as the barycenters of the clusters resulting from the previous step. Once these k new centroids are available, the points of the data set are assigned again, each to its nearest new centroid. This continues in a loop in which the k centroids change their location step by step until no more changes occur.
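The procedure described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the study: the farthest-first initialisation is only one possible reading of "as far away from each other as possible", the Euclidean distance is assumed, and the names `farthest_first`, `kmeans`, `max_iter` and the sample points are hypothetical.

```python
import math

def farthest_first(points, k):
    """Pick k starting centroids spread far apart (assumed initialisation)."""
    centroids = [points[0]]  # assumption: start from the first point
    while len(centroids) < k:
        # Take the point whose nearest already-chosen centroid is farthest away
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(nxt)
    return centroids

def kmeans(points, k, max_iter=100):
    """Alternate assignment and centroid update until assignments stop changing."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster, so the loop has converged
        labels = new_labels
        # Update step: each centroid becomes the barycenter of its members
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Hypothetical two-dimensional data with two visible groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(points, 2)
# labels -> [0, 0, 0, 1, 1, 1]
```

A different starting point in `farthest_first` may yield a different partition on less clearly separated data, which is the sensitivity to initial conditions noted earlier.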

Finally, this algorithm aims at minimizing an objective function, in this case a squared error function. The objective function

$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2$$

where $\left\| x_i^{(j)} - c_j \right\|^2$ is a chosen distance measure between a data point $x_i^{(j)}$ and the cluster centre $c_j$, is an indicator of the distance of the n data points from their respective cluster centres (A tutorial on clustering algorithms, available online; Di Piazza et al., 2011).
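For a given partition, the squared error objective can be evaluated directly. The sketch below assumes the squared Euclidean distance; the function name `objective` and the sample data are hypothetical.

```python
def objective(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its own centroid."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, centroids[lab]))
        for p, lab in zip(points, labels)
    )

# Hypothetical data: four points forming two vertical pairs
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# Compact partition: each pair is a cluster with its centroid in between
good = objective(points, [0, 0, 1, 1], [(0.0, 1.0), (10.0, 1.0)])

# Poor partition: clusters span both pairs, centroids lie far from members
poor = objective(points, [0, 1, 0, 1], [(5.0, 0.0), (5.0, 2.0)])
```

The compact partition gives the smaller value, which is the sense in which k-means prefers it.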

The squared Euclidean distance is taken as the default distance measure. Examples of data clustered using the k-means algorithm are shown in Figure 30.

Figure 30 Examples of data clustered with k-means algorithm (Di Piazza et al., 2011)

In general, determining the correct number of clusters is an issue. One way to compare candidate solutions is to look at the average silhouette values obtained for different numbers of clusters. Usually, several attempts are made over a range of values for k, and the value finally selected is the one that leads to positive and mostly high silhouette values (Di Piazza et al., 2011).

The silhouette value, displayed in the ‘Silhouette plot’, is a measure of how successfully the objects have been separated into distinct clusters. It ranges from ‘+1’, indicating objects that are well separated from the other clusters, through ‘0’, indicating objects that are not distinctly in one cluster or another, to ‘-1’, indicating objects that are probably assigned to the wrong cluster. By default the silhouette uses the squared Euclidean distance (Lletí et al., 2004; Gaitani, 2011; Di Piazza et al., 2011).
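The silhouette value of a single object is commonly computed as (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The sketch below follows that standard definition with the squared Euclidean distance. Assigning 0 to an object that is alone in its cluster is a convention assumed here, and the function names and sample data are hypothetical. At least two clusters are required.

```python
def sq_euclidean(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def silhouette_values(points, labels):
    """Silhouette value of every point for a given labelling (two or more clusters)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    values = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            values.append(0.0)  # assumed convention for a single-member cluster
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(sq_euclidean(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of another cluster
        b = min(
            sum(sq_euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values

# Hypothetical data with two visible groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

good = silhouette_values(points, [0, 0, 0, 1, 1, 1])  # groups kept intact
poor = silhouette_values(points, [0, 0, 1, 1, 1, 0])  # two points swapped

mean_good = sum(good) / len(good)
mean_poor = sum(poor) / len(poor)
```

Comparing the mean value across labellings obtained for several values of k is one way to carry out the selection of the number of clusters described above.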
