
Literature review - clustering methods


Optimal–transfer phase:

For each point xi, calculate R1(xi) using

\[ R_1(x_i) = \frac{n_j \, d(x_i, \bar{x}_j)^2}{n_j + 1} \tag{2.11} \]

where point xi is currently assigned to cluster j, nj is the number of points in cluster j, x̄j is the centre of cluster j and d(xi, x̄j) is the distance between point xi and x̄j. Assign I1(xi) = cluster j.
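As a minimal sketch, Eq. (2.11) can be evaluated as shown below. The function name `transfer_cost` is hypothetical, and the distance d is assumed to be Euclidean, which the text does not state at this point.

```python
def transfer_cost(point, centre, n_j):
    """Eq. (2.11) as given in the text: n_j * d^2 / (n_j + 1).

    Assumption: d is the Euclidean distance between point and centre.
    """
    d_squared = sum((p - c) ** 2 for p, c in zip(point, centre))
    return n_j * d_squared / (n_j + 1)


# A point at distance 5 from the centre of a cluster holding 4 points.
print(transfer_cost((3.0, 4.0), (0.0, 0.0), 4))  # 4 * 25 / 5 = 20.0
```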

Calculate Rj(xi) in a similar fashion for a subset of the other clusters, using Eq. (2.11). If cluster j is part of the live set, calculate Rj(xi) for all clusters. If cluster j is not part of the live set, only include the clusters in the live set. Assign the minimum value of Rj(xi) to R2(xi) and note the associated cluster as cluster k.

If R2(xi) < R1(xi), then re-assign point xi to cluster k and set I1(xi) = k and I2(xi) = j. Add both clusters k and j to the live set and recalculate their means, since the points assigned to them have changed.

If R2(xi) ≥ R1(xi), then I2(xi) = k.

Continue until R1(xi) and R2(xi) have been calculated for all points. Clusters that have been changed in the last n optimal-phase steps are part of the live set, where n is the total number of points. Recalculate the cluster centres. If the live set is empty, continue with the quick–transfer phase, otherwise repeat the optimal–transfer phase.
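The sweep described above can be sketched roughly as follows. This is a simplified illustration, not a full implementation: it compares each point against every other cluster instead of restricting the search to the live set, it assumes Euclidean distance, and it assumes a point is never moved out of a single-point cluster. All function names are hypothetical.

```python
def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def cost(x, centre, n):
    # Eq. (2.11) as given in the text.
    return n * sq_dist(x, centre) / (n + 1)


def mean(members):
    return tuple(sum(col) / len(members) for col in zip(*members))


def optimal_transfer_pass(points, i1, i2, centres):
    """One sweep over all points; returns the clusters that changed."""
    live = set()
    clusters = range(len(centres))
    for idx, x in enumerate(points):
        j = i1[idx]
        sizes = [i1.count(c) for c in clusters]
        if sizes[j] == 1:
            continue  # assumption: never empty a cluster
        r1 = cost(x, centres[j], sizes[j])
        # Simplification: search all other clusters, not only the live set.
        k = min((c for c in clusters if c != j),
                key=lambda c: cost(x, centres[c], sizes[c]))
        r2 = cost(x, centres[k], sizes[k])
        if r2 < r1:
            i1[idx], i2[idx] = k, j
            live.update((j, k))
            for c in (j, k):  # both clusters changed, so update their means
                centres[c] = mean([p for p, a in zip(points, i1) if a == c])
        else:
            i2[idx] = k
    return live


points = [(0.0,), (1.0,), (9.0,), (10.0,)]
i1, i2 = [0, 0, 0, 1], [None] * 4
centres = [mean(points[:3]), (10.0,)]
print(optimal_transfer_pass(points, i1, i2, centres), i1, centres)
```

In this small example the point at 9.0 moves from cluster 0 to cluster 1, so both clusters end up in the returned live set.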

Quick–transfer phase:

For each point xi, calculate R1(xi) and R2(xi) using the clusters from I1(xi) and I2(xi) respectively with the new calculated cluster centres. If R2(xi) < R1(xi) then swap the values of R1(xi) and R2(xi) and of I1(xi) and I2(xi) respectively. If the values were swapped, re-assign the point to the new I1(xi) cluster and add both I1(xi) and I2(xi) to the live set. After all points have been checked and re-assigned where needed, recalculate the cluster centres.

Continue with the quick–transfer phase until no points have been re-assigned in the last n steps. When complete, repeat the optimal–transfer phase until the live set is empty.
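A single quick–transfer sweep might look like the sketch below. It follows the description above in updating the cluster centres only after all points have been checked. Euclidean distance, the guard against emptying a cluster, and all function names are assumptions made for illustration.

```python
def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def cost(x, centre, n):
    # Eq. (2.11) as given in the text.
    return n * sq_dist(x, centre) / (n + 1)


def mean(members):
    return tuple(sum(col) / len(members) for col in zip(*members))


def quick_transfer_pass(points, i1, i2, centres):
    """Compare each point only against its I1 and I2 clusters."""
    live = set()
    for idx, x in enumerate(points):
        j, k = i1[idx], i2[idx]
        n_j, n_k = i1.count(j), i1.count(k)
        if n_j == 1:
            continue  # assumption: never empty a cluster
        if cost(x, centres[k], n_k) < cost(x, centres[j], n_j):
            i1[idx], i2[idx] = k, j  # swap closest and second closest
            live.update((j, k))
    # Recalculate the centres once all points have been checked.
    for c in range(len(centres)):
        members = [p for p, a in zip(points, i1) if a == c]
        if members:
            centres[c] = mean(members)
    return live


points = [(0.0,), (1.0,), (9.0,), (10.0,)]
i1, i2 = [0, 0, 0, 1], [1, 1, 1, 0]
centres = [(10.0 / 3.0,), (10.0,)]
print(quick_transfer_pass(points, i1, i2, centres), i1, i2, centres)
```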

2.3 Graph–based methods

In graph–based clustering methods, customer points are clustered using a graph. The graph can be constructed by letting the points represent nodes and the relationship (distance) between points be represented by edges that connect the nodes. Where the nodes are connected, the customer points belong to the same cluster.


At the start all edges are added to the graph and all customer points belong to the same cluster. Next, edges are evaluated and "inconsistent" edges are removed from the graph. This will result in groups of customer points being disconnected from each other, effectively forming a number of clusters.
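Once the inconsistent edges have been removed, reading off the clusters amounts to finding the connected components of the remaining graph. The sketch below uses a union-find structure for this; the function name and the edge-list representation are assumptions made for illustration.

```python
def connected_components(n_points, edges):
    """Label each point with the index of its connected component.

    `edges` holds the (i, j) pairs that remain after inconsistent
    edges have been removed.
    """
    parent = list(range(n_points))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a

    for i, j in edges:
        parent[find(i)] = find(j)

    roots = [find(v) for v in range(n_points)]
    # Number the components in order of first appearance.
    labels = {root: c for c, root in enumerate(dict.fromkeys(roots))}
    return [labels[root] for root in roots]


# Removing the edge (2, 3) from the chain 0-1-2-3-4 leaves two clusters.
print(connected_components(5, [(0, 1), (1, 2), (3, 4)]))  # [0, 0, 0, 1, 1]
```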

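An edge between points i and j is considered inconsistent if either

\[ \frac{l_{ij}}{\bar{l}_i} > l_T, \qquad \frac{l_{ij}}{\bar{l}_j} > l_T \tag{2.12} \]

or

\[ \frac{l_{ij} - \bar{l}_i}{\sigma_i} > \sigma_T, \qquad \frac{l_{ij} - \bar{l}_j}{\sigma_j} > \sigma_T, \tag{2.13} \]

where lij is the length of the edge between points i and j, and l̄i and l̄j are the average lengths of the edges around points i and j respectively for a sub-tree of depth d. σi and σj are the sample standard deviations of the lengths of all the edges at points i and j for a sub-tree of depth d. σT and lT are predefined cut-off values.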

Eq. (2.12) gives the ratio of edge lengths for points i and j respectively, and Eq. (2.13) gives the corresponding z–scores (Jain and Dubes, 1998, p. 122).

Zahn (1971) recommends using σT = 3 and lT = 2 as a first pass to eliminate oversized edges from the statistics. The depth d, which determines which edges are included in the statistics, is also referred to as the local neighbourhood depth. Zahn (1971) seems to favour d = 3.
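The inconsistency test can be sketched as follows, under several assumptions that the text does not settle: the ratio test is taken as lij divided by the average neighbouring edge length, the z–score as (lij minus that average) divided by the sample standard deviation, an edge must exceed the cut-off on both the i side and the j side, and the edge itself is excluded from the neighbouring lengths. The function name and argument layout are hypothetical.

```python
from statistics import mean, stdev


def is_inconsistent(l_ij, lengths_i, lengths_j, l_T=2.0, sigma_T=3.0):
    """Ratio and z-score tests for one edge (assumed forms, see lead-in).

    lengths_i / lengths_j: lengths of the edges within depth d of
    points i and j, assumed here to exclude the edge (i, j) itself.
    """
    def side(lengths):
        avg = mean(lengths)
        ratio_exceeded = l_ij / avg > l_T
        # Sample standard deviation needs at least two values.
        sd = stdev(lengths) if len(lengths) > 1 else 0.0
        z_exceeded = sd > 0 and (l_ij - avg) / sd > sigma_T
        return ratio_exceeded, z_exceeded

    ratio_i, z_i = side(lengths_i)
    ratio_j, z_j = side(lengths_j)
    # Assumption: the cut-off must be exceeded on both sides of the edge.
    return (ratio_i and ratio_j) or (z_i and z_j)


print(is_inconsistent(10.0, [1.0, 1.2, 0.8], [2.0, 2.5, 1.5]))  # True
print(is_inconsistent(1.1, [1.0, 1.2, 0.8], [2.0, 2.5, 1.5]))   # False
```

The default cut-offs are the first-pass values attributed to Zahn (1971) in the text.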

Once the inconsistent edges have been removed from the constructed graph, the connected customer points are considered part of the same cluster. The number of clusters cannot be specified in advance; it depends on the values chosen for lT and σT. Three types of graph can be used for graph–based clustering. These are as follows:

2.3.1 MST graph–based clustering method

Zahn (1971) suggested a graph–based method in which a minimum spanning tree (MST) is constructed. This is also referred to as Zahn’s method. The minimum spanning tree connects all points, without cycles, using the set of edges with the smallest total length. Although the MST graph–based method works well for well-separated clusters, it seems to produce poorer results if clusters have varying inter-cluster distances. Zahn (1971) suggests first identifying and removing the denser clusters before analysing the rest of the data.
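For illustration, a minimum spanning tree over a small set of points can be built with Prim's algorithm, as in the sketch below. Prim's algorithm is one standard way of constructing an MST and is not necessarily the construction used by Zahn (1971); Euclidean edge lengths and the function name are assumptions.

```python
import math


def minimum_spanning_tree(points):
    """Prim's algorithm, O(n^2); returns a list of (i, j, length) edges."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n  # shortest known edge from the tree to each point
    link = [-1] * n        # tree endpoint of that edge
    if n:
        best[0] = 0.0      # start the tree at point 0
    edges = []
    for _ in range(n):
        u = min((v for v in range(n) if not in_tree[v]),
                key=best.__getitem__)
        in_tree[u] = True
        if link[u] >= 0:
            edges.append((link[u], u, best[u]))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], link[v] = d, u
    return edges


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
print(minimum_spanning_tree(points))
# [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 8.0)]
```

In this example the edge of length 8.0 is much longer than its neighbours, which is the kind of edge the inconsistency test is meant to remove.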


2.3.2 RNG graph–based clustering method

Jain and Dubes (1998) suggest another graph–based method, using the relative neighbourhood graph (RNG) instead of the MST. To construct an RNG, points i and j are declared connected if and only if

\[ d(x_i, x_j) \le \max\bigl( d(x_i, x_k),\, d(x_j, x_k) \bigr) \quad \forall k,\ k \ne i \text{ and } k \ne j, \tag{2.14} \]

where k ranges over all points in the graph excluding points i and j, and d(xi, xj) is the Euclidean distance between points i and j. Points xi and xj are therefore only connected if no other point falls in their region of influence. Here the region of influence is the overlapping area of the two circles of radius d(xi, xj) centred at points xi and xj (Jain and Dubes, 1998), illustrated in Figure 2.4.a.

Inconsistent edges are determined using Eq. (2.12) and Eq. (2.13). According to Jain and Dubes (1998) the RNG graph–based clustering is more effective than the MST graph–based method.
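The RNG condition in Eq. (2.14) can be checked directly for every pair of points, as in the brute-force sketch below. It runs in cubic time and is meant only to illustrate the definition; the function name is hypothetical.

```python
import math
from itertools import combinations


def rng_edges(points):
    """Return the (i, j) pairs that satisfy Eq. (2.14)."""
    n = len(points)
    edges = []
    for i, j in combinations(range(n), 2):
        d_ij = math.dist(points[i], points[j])
        # Connected only if no third point k is closer to both i and j
        # than i and j are to each other.
        if all(d_ij <= max(math.dist(points[i], points[k]),
                           math.dist(points[j], points[k]))
               for k in range(n) if k not in (i, j)):
            edges.append((i, j))
    return edges


# Point 2 lies in the region of influence of points 0 and 1,
# so the edge (0, 1) is not part of the RNG.
print(rng_edges([(0.0, 0.0), (2.0, 0.0), (1.0, 0.5)]))  # [(0, 2), (1, 2)]
```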

Figure 2.4: The regions of influence for the RNG and GG (Jain and Dubes, 1998, p. 125).

2.3.3 GG graph–based clustering method

A third method suggested by Jain and Dubes (1998) is to construct a Gabriel graph (GG). For this type of graph points are connected if and only if

\[ d^2(x_i, x_j) < d^2(x_i, x_k) + d^2(x_j, x_k) \quad \forall k,\ k \ne i \text{ and } k \ne j. \tag{2.15} \]

The region of influence for the GG is illustrated in Figure 2.4.b. Points xi and xj are connected if the circle whose diameter is the line segment between xi and xj contains no other points. Jain and Dubes (1998)