
2.3 Clustering

2.3.3 Partitioning Methods

Unlike hierarchical methods, partitioning methods create only one partition of the data, placing each of the n observations in one of k clusters, where k is chosen by the analyst.

Partitioning methods produce this partition by satisfying a criterion of optimality, generally expressed as the maximization of an objective function.

This approach is generally more efficient and more robust than hierarchical methods. In fact, many partitioning methods do not need to store the distance matrix, as hierarchical methods do, which gives a significant computational advantage. They are therefore better suited to large databases.

However, the large number of possible solutions means that the search is necessarily constrained, and the results often correspond to local maxima. In addition, the main difficulty of these methods is the choice of the value of k by the analyst. Since this choice is often difficult, the algorithms are run for several values of k and the results are evaluated using the indices introduced for agglomerative clustering methods.

Algorithms based on squared error

Algorithms based on squared error determine the partition of the data that minimizes the squared error.

Given a cluster $K_i$ with observations $t_{i1}, t_{i2}, \ldots, t_{im}$ and centroid $C_{K_i}$, the squared error is defined by:

$$se_{K_i} = \sum_{j=1}^{m} \lVert t_{ij} - C_{K_i} \rVert^2 \tag{2.72}$$

Considering a set of clusters $K = \{K_1, K_2, \ldots, K_k\}$, the squared error for $K$ is

$$se_K = \sum_{j=1}^{k} se_{K_j} \tag{2.73}$$

At the beginning of the procedure each observation is randomly assigned to a cluster. Then, at each step, each observation $t_i$ is assigned to the cluster whose centroid is closest to it. The centroids of the new clusters are recalculated and the squared error is computed for the new partition of the data. The procedure stops when the decrease in squared error between successive iterations falls below a pre-fixed threshold.
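The procedure just described can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a reference implementation: the function names (`squared_error_clustering`, `compute_centroids`), the default threshold, and the rule of re-seeding an empty cluster with a randomly chosen observation are all choices made for this example and are not prescribed by the text.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def squared_error(data, labels, centroids):
    """Total squared error se_K: the per-cluster errors summed over all clusters."""
    return sum(sq_dist(p, centroids[c]) for p, c in zip(data, labels))


def compute_centroids(data, labels, k, rng):
    """Centroid of each cluster (assumption: an empty cluster is re-seeded
    with a randomly chosen observation)."""
    result = []
    for c in range(k):
        members = [p for p, lab in zip(data, labels) if lab == c]
        result.append(centroid(members) if members else tuple(rng.choice(data)))
    return result


def squared_error_clustering(data, k, threshold=1e-9, seed=0):
    rng = random.Random(seed)
    # Random initial assignment of each observation to a cluster.
    labels = [rng.randrange(k) for _ in data]
    centroids = compute_centroids(data, labels, k, rng)
    previous = squared_error(data, labels, centroids)
    while True:
        # Assign each observation to the cluster with the closest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in data]
        # Recalculate the centroids and the squared error of the new partition.
        centroids = compute_centroids(data, labels, k, rng)
        current = squared_error(data, labels, centroids)
        # Stop when the decrease in squared error is below the threshold.
        if previous - current < threshold:
            return labels, centroids, current
        previous = current
```

A call such as `squared_error_clustering(data, k=2)` returns the cluster label of each observation, the final centroids and the final squared error. Because the starting assignment is random, different seeds may end in different local optima on less clearly separated data.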

K-means algorithm

The K-means algorithm is probably the best-known partitioning method. The algorithm chooses initial seeds as starting values for the k means, which represent the centroids of the clusters in the p-dimensional space. Seeds must be sufficiently dispersed in the variable space to guarantee adequate convergence of the algorithm. Specific sub-algorithms, which impose a minimum distance between seeds, have been developed to accomplish this task.

Once the initial seeds have been selected, the iterative structure of the algorithm begins:

• Assignment of each observation to the closest mean;

• Calculation of the mean of each cluster.

The algorithm ends when the maximum number of iterations is reached or when a certain convergence criterion (such as a minimum value of the squared error) is satisfied.

The K-means method is widely adopted for clustering, but it suffers from some shortcomings, in particular poor computational scalability, the need to specify the number of clusters a priori, and a search prone to local minima.

Several modifications of K-means have been proposed in recent years. Some of the most recent developments are:

• X-means (Pelleg and Moore 2000), which allows the identification of the optimal number of clusters using the Bayesian Information Criterion (BIC);

• K-modes (Chaturvedi, Green, and Carroll 2001), which is a nonparametric approach to derive clusters from categorical data, following an approach similar to K-means;

• K-means++ (Arthur and Vassilvitskii 2007), which chooses the seeds for K-means so as to avoid the sometimes poor clusterings found by the standard K-means algorithm.
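As an illustration of how dispersed seeds can be obtained, the following sketch follows the K-means++ seeding idea: the first seed is drawn uniformly at random, and each further seed is drawn with probability proportional to its squared distance from the nearest seed already chosen. The function name `kmeans_pp_seeds` and its parameters are hypothetical, and the sketch assumes that the data contain at least k distinct observations.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_seeds(data, k, seed=0):
    """Choose k seeds from `data` with K-means++ style weighting.

    Assumption: `data` holds at least k distinct observations, otherwise
    all weights eventually become zero and no further seed can be drawn.
    """
    rng = random.Random(seed)
    # First seed: uniformly at random among the observations.
    seeds = [rng.choice(data)]
    while len(seeds) < k:
        # Weight of each observation: squared distance to its nearest seed.
        weights = [min(sq_dist(p, s) for s in seeds) for p in data]
        # Observations far from every current seed are more likely to be drawn;
        # an observation already used as a seed has weight zero.
        seeds.append(rng.choices(data, weights=weights, k=1)[0])
    return seeds
```

The returned seeds can then be used as the initial means of the iterative assignment and update steps. Since the choice is random, repeated runs with different values of `seed` may give different starting points.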

PAM Algorithm

The Partitioning Around Medoids (PAM, also called K-medoids) algorithm is a clustering method which adopts medoids instead of centroids, where a medoid is an actual observation chosen to represent its cluster, obtaining a relevant advantage in the treatment of missing data.

Initially, k observations belonging to the dataset D are randomly chosen as medoids and the others are assigned to the cluster with the closest medoid (Build Step). At each iteration the non-medoid observations are analysed to test whether they can become new medoids and improve the quality of the partition, that is, reduce the sum of the dissimilarities of the observations to their closest medoid (Swap Step).

Considering a cluster $K_i$ represented by medoid $t_i$, the algorithm evaluates whether any other observation $t_h$ of the cluster can replace $t_i$ and become the new medoid. $C_{jih}$ is the change in cost for the observation $t_j$ associated with the change of medoid from $t_i$ to $t_h$. Repeating this process for all the observations of cluster $K_i$, the total change in cost is equal to the change in the sum of the distances of the observations to their medoids. As a consequence of the medoid change, four different conditions can occur:

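1. $t_j \in K_i$, but $\exists$ another medoid $t_m$ such that $\mathrm{dis}(t_j, t_m) \le \mathrm{dis}(t_j, t_h)$

2. $t_j \in K_i$, but $\mathrm{dis}(t_j, t_h) \le \mathrm{dis}(t_j, t_m)$ $\forall$ other medoid $t_m$

3. $t_j \in K_m$, $t_j \notin K_i$, and $\mathrm{dis}(t_j, t_m) \le \mathrm{dis}(t_j, t_h)$

4. $t_j \in K_m$, $t_j \notin K_i$, but $\mathrm{dis}(t_j, t_h) \le \mathrm{dis}(t_j, t_m)$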

Therefore the total cost associated with the medoid change, summed over the n observations $t_j$, becomes:

$$TC_{ih} = \sum_{j=1}^{n} C_{jih} \tag{2.74}$$
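The total cost of a candidate swap can be computed directly from its definition, as the following sketch suggests. It is an illustration under assumptions, not the original PAM code: the Euclidean distance stands in for the generic dissimilarity dis, medoids are identified by their positions in the list `data`, the sum runs over every observation, and the function names (`total_swap_cost`, `pam_swap`) are hypothetical. The sketch also assumes fewer medoids than observations.

```python
def dist(a, b):
    """Euclidean distance, standing in for the dissimilarity dis(., .)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def total_swap_cost(data, medoids, i, h):
    """TC_ih: change in the sum of dissimilarities to the closest medoid
    when the medoid at index i is replaced by the observation at index h."""
    new_medoids = [h if m == i else m for m in medoids]
    total = 0.0
    for t_j in data:
        before = min(dist(t_j, data[m]) for m in medoids)
        after = min(dist(t_j, data[m]) for m in new_medoids)
        total += after - before  # C_jih; the min() calls cover all four cases
    return total


def pam_swap(data, medoids):
    """Swap Step: apply the most favourable swap until none lowers the cost."""
    medoids = list(medoids)
    while True:
        candidates = [(total_swap_cost(data, medoids, i, h), i, h)
                      for i in medoids
                      for h in range(len(data)) if h not in medoids]
        cost, i, h = min(candidates)
        if cost >= -1e-12:  # no swap improves the partition any further
            return medoids
        medoids = [h if m == i else m for m in medoids]
```

A negative value of `total_swap_cost` means that the swap reduces the sum of dissimilarities, so it improves the partition. Every pass evaluates all pairs of medoid and non-medoid observations, which gives a sense of why the method becomes expensive on large datasets.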

Compared to K-means, the main improvement provided by PAM is a more robust structure; however, its use is not recommended for large datasets, because it is heavily penalized by its computational complexity.

Self-Organizing Neural Networks

Artificial Neural Networks (ANNs) can be used to solve clustering problems by adopting an unsupervised learning process. In this case ANNs are called Self-Organizing Neural Networks: only the input data are presented to the neural network, which self-organizes to detect significant groupings in the data. Unsupervised learning can be competitive or non-competitive.

In non-competitive learning the weight of the connection between two nodes of the network is proportional to the values of both nodes. The Hebb rule is used to update the values of the weights. Given the j-th neuron of the neural network, connected to input neurons with values $x_{ij}$ through weights $w_{ij}$, the Hebb rule is defined by:

$$\Delta w_{ij} = c \, x_{ij} \, y_j \tag{2.75}$$

where $y_j$ is the output of the j-th neuron and $c$ is the learning rate.
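A minimal sketch of one Hebbian update for a single neuron is given below. The function name `hebb_update` and the default learning rate are assumptions made for the example; the output of the neuron is passed in as an argument because the text does not specify how it is computed.

```python
def hebb_update(weights, inputs, output, c=0.1):
    """Apply the Hebb rule to every incoming connection of neuron j.

    weights: current weights w_ij of the connections into the neuron
    inputs:  input values x_ij carried by those connections
    output:  output y_j of the neuron
    c:       learning rate
    """
    # Each weight grows by c * x_ij * y_j, so a connection is strengthened
    # when its input and the neuron's output are active together.
    return [w + c * x * output for w, x in zip(weights, inputs)]
```

For example, with `c=0.1`, an input of 2.0 and an output of 1.0 change the corresponding weight by 0.1 × 2.0 × 1.0 = 0.2, while a connection whose input is 0 is left unchanged.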

In competitive learning, neurons compete with one another and only the winner can update its weights. Usually the network has two layers (input and output); the input layer has p neurons, one for each of the p explanatory variables which describe the observations, connected to the neurons of the output layer.

When an observation is fed to the neural network, each node in the output layer gives an output value, based on the values of the connection weights. The neuron whose weights are the most similar to the input values is the winner. Following the "Winner Takes All" rule, the output is set to 1 for the winner and 0 for the other neurons, and the weights of the winner are updated.

At the end of the learning process some relations are detected between observations and output nodes. These relations mean that some clusters have been identified in the dataset: the values of the weights of nodes grouped in a cluster are the mean values of the observations included in this cluster.
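One step of "Winner Takes All" learning might look like the sketch below. It rests on assumptions that the text leaves open: similarity is measured with the squared Euclidean distance, and the winner is moved towards the observation with an update of the same form as Eq. (2.77). The names `winner` and `competitive_step` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def winner(weights, x):
    """Index of the output neuron whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda j: sq_dist(weights[j], x))


def competitive_step(weights, x, lr=0.5):
    """Present observation x, update only the winner, return the outputs.

    weights: one weight vector per output neuron (modified in place)
    lr:      learning rate (an assumed value)
    """
    j_star = winner(weights, x)
    # Winner Takes All: output 1 for the winner, 0 for every other neuron.
    outputs = [1 if j == j_star else 0 for j in range(len(weights))]
    # Move the winner's weights towards the observation.
    weights[j_star] = [w + lr * (xi - w) for w, xi in zip(weights[j_star], x)]
    return outputs
```

Repeating this step over many observations pulls each weight vector towards the observations that it wins, which is consistent with the weights ending up close to the mean of the observations in the corresponding cluster.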

The best-known neural networks which adopt competitive learning are Self-Organizing Maps (SOM), or Kohonen Networks.

Self-Organizing Maps

Self-Organizing Maps (SOM) are Artificial Neural Networks based on unsupervised competitive learning. They are also known as Self-Organizing Feature Maps (SOFM) or Kohonen Networks, from the name of the mathematician who first proposed them (Kohonen 1982).

Kohonen Networks map each p-dimensional observation into a 1- or 2-dimensional space. In the latter case the output space is represented by a grid of output neurons (competitive layer), which guarantees the spatial correlation of clusters in the output space. This contiguity of similar clusters arises because the update is applied to the winner neuron and to a group of neurons in its neighbourhood. With this approach, at the end of learning, spatial partitions of neurons are obtained, which graphically represent the presence of clusters. The learning algorithm consists of the following steps:

1. The weights $w_{ij}$ between the i-th input neuron and the j-th output neuron are defined at iteration $t$ as $w_{ij}(t)$, $0 \le i \le n-1$, where $n$ is the number of inputs. The initial values of the weights are randomly chosen in the interval $[0, 1]$ and the value $N_j(0)$ is set, where $N_j(t)$ is the number of neurons in the neighbourhood of the j-th neuron at iteration $t$.

2. The observation $X = (x_0(t), x_1(t), \ldots, x_{n-1}(t))$ is fed to the neural network, where $x_i(t)$ is the i-th input.

3. The distances $d_j$ between the input and each output neuron $j$ are calculated. If the Euclidean distance is chosen:

$$d_j^2 = \sum_{i=0}^{n-1} \left( x_i(t) - w_{ij}(t) \right)^2 \tag{2.76}$$

4. The neuron with the minimum distance value is selected and called $j^*$.

5. The weights of node $j^*$ and of the nodes included in the neighbourhood defined by $N_{j^*}(t)$ are updated. The new weights are calculated by:

$$w_{ij}(t+1) = w_{ij}(t) + \eta(t) \left( x_i(t) - w_{ij}(t) \right) \tag{2.77}$$

for $j = j^*$ and its neighbours, with $0 \le i \le n-1$, where $\eta(t)$, $0 \le \eta(t) \le 1$, is the learning rate, which decreases with $t$. In this manner the adaptation of the weights is progressively slowed down. Similarly, the size of $N_j(t)$ decreases, stabilizing the learning process.

6. The algorithm goes back to step 2.

The learning rate $\eta$ is initially set to a value greater than 0.5 and decreases during the learning process, which usually needs 100 to 1000 iterations. Kohonen suggested a linear decrease as a function of the number of iterations. SOMs are effective and useful clustering techniques, in particular when it is important to preserve the spatial order of input and output vectors.
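The steps of the algorithm can be put together in a short sketch for a 1-dimensional map, where the output neurons lie on a line and the neighbourhood of a neuron is the set of neurons within a given radius of it. Several details are assumptions made for the example and are not fixed by the text: observations are presented in random order, the learning rate and the radius both decrease linearly with the iteration number, and the names `train_som` and `best_matching_unit` and all default values are hypothetical.

```python
import random


def train_som(data, n_outputs, n_iter=200, eta0=0.6, radius0=2, seed=0):
    """Train a 1-D Kohonen map; returns one weight vector per output neuron."""
    rng = random.Random(seed)
    n = len(data[0])  # number of inputs
    # Step 1: initial weights chosen at random in [0, 1].
    weights = [[rng.random() for _ in range(n)] for _ in range(n_outputs)]
    for t in range(n_iter):
        frac = 1.0 - t / n_iter              # linear decay factor
        eta = eta0 * frac                    # learning rate eta(t)
        radius = int(round(radius0 * frac))  # shrinking neighbourhood size
        # Step 2: present one observation (assumption: drawn at random).
        x = rng.choice(data)
        # Steps 3 and 4: the winner has the minimum squared distance to x.
        j_star = best_matching_unit(weights, x)
        # Step 5: update the winner and its neighbours along the line.
        low = max(0, j_star - radius)
        high = min(n_outputs, j_star + radius + 1)
        for j in range(low, high):
            weights[j] = [w + eta * (xi - w) for xi, w in zip(x, weights[j])]
        # Step 6: continue with the next observation.
    return weights


def best_matching_unit(weights, x):
    """Index of the output neuron closest to observation x."""
    return min(range(len(weights)),
               key=lambda j: sum((xi - w) ** 2 for xi, w in zip(x, weights[j])))
```

After training, calling `best_matching_unit(weights, x)` for each observation gives the output neuron it is mapped to, and observations mapped to the same or to adjacent neurons can be read as belonging to the same or to similar clusters. Since the initial weights lie in [0, 1], the inputs are assumed to be scaled to a comparable range.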