Pattern Migration Clustering - Predictive trend mining for social network analysis

The third module in the proposed FTPA process provides for further analysis of the patterns and trends contained in the generated SOMs (one per episode) with respect to the concept of pattern migration. The motivation here was that, at least in the context of the networks used for evaluation purposes in this thesis, discussion with potential end users indicated that they would be interested in how trends associated with particular patterns changed over time. In other words, in the context of the SOM maps, how the trend lines associated with particular patterns migrate (or did not migrate) across a sequence of maps. For this purpose each sequential pair of SOM maps was used to construct a second migration network/map comprising,i×j nodes and potentially (i×j)2 links (including “self links”).

Given two SOM maps Me and Me+1, the from map and the to map respectively,

the nodes in a migration network were labeled with the number of patterns held in the node in map Me (i.e. thefrom map). The links then represented the migration of

patterns from Me to Me+1, and were labeled with the number of migrating patterns

(thus a “traffic” value). The higher the value the stronger the link. The process of visualising such migration networks is discussed in the following sections (Section 4.6). The remainder of this section is organised as follows. Sub-section 4.5.1 describes the process of detecting migrations of frequent patterns between two SOM maps Me

and Me+1. This is followed in Sub-section 4.5.2 by a description of the clustering

of the identified pattern migrations; a hierarchical clustering method is suggested. A worked example of the clustering of the identified pattern migrations, using the Newman method, is shown in Sub-section 4.5.3.

4.5.1 Pattern Migration

The changes in trends associated with patterns can be measured by interpreting the SOM trend line maps in terms of a rectangular (2-D plane) where each point in the plane represents a SOM node. A Manhattan or Euclidean distance function can then be applied to determine the distance “traveled” by the patterns between nodes in successive SOM maps, which in turn can be used to observe the similarities and differences between pattern trends across episodes. The greater the distance a pattern moves the more significant the change. Thus, given a sequence of trend line SOM maps, com- parisons can be made to see how trends associated with individual frequent patterns change by analyzing the nodes in which they appear. The trend cluster analysis pseudo code is described in Algorithm 4.7.

Algorithm 4.7: Trend Cluster Analysis

input : FP = Set of all frequent patterns in episodes{e1, e2. . . , ee}

output: Sequence of (e−1) Migration Matrices Define Table measuring|F P | ×e;

for i←1 toe (step through episodes) do

for j←1 to |F P |(step through the set of FP) do

Table[i][j] = Table[i][j] ∪SOM node ID for ei; 4

end

Define (e−1) Migration Matrices (MMs), each measuring (x×x) wherex is the

number of SOM nodes;

for i←1 to(e−1) do

for k←1 to |F P |do

Increment count at MMi[Table[K][i]][Table[K][i+1]]; 10

end

The algorithm commences by defining a|F P| ×etable. The table is populated with the SOM node IDs, the frequent pattern trend cluster, for each discovered frequent pattern in{e1, e2, . . . , ee}for SOM mapsM ={M1, M2, . . . , Me}(line 4). Then in line

7, the algorithm defines a sequence of e−1 Migration Matrices (MMs) for each pair of SOM Me and Me+1, each measuring x×x. The process continues by comparing

the node numbers of the frequent pattern and counting the pattern migrations for each node ID (trend cluster) between SOMMe and Me+1 (lines 8 and 10).

Subsequently, the module also produces a trend cluster analysis of the pattern migrations between trend clusters. The analysis comprises a comparison of pattern migrations for each pair of SOM Me and Me+1. The number of patterns migrating

from nodei in Me to nodej in Me+1 are recorded. It is also possible to determine

how the sizes of trend clusters in a given pair of SOMs, Me and Me+1, change. This

analysis thus provides for identification of patterns that move, or do not move, between successive SOM nodes, which may be of interest given particular applications.

4.5.2 Pattern Migration Hierarchical Clustering

To aid the further analysis of the identified trend migrations it was also considered desirable to identify “communities” within networks, i.e. clusters of nodes which were “strongly” connected (feature significant migration). This would indicate significant groupings of patterns whose associated trend lines where changing between episodes

ek and ek+1. An agglomerative hierarchical clustering mechanism, founded on the

Newman method [95] for identifying clusters in network data, was therefore adopted. Newman proceeds in the standard iterative manner on which agglomerative hierarchical clustering algorithms are founded. The process starts with a number of clusters equiv-

alent to the number of nodes3. The two clusters (nodes) with the greatest “similarity” are then combined to form a merged cluster. The process continues until a “best” cluster configuration is arrived at or all nodes are merged into a single cluster. The overall process is typically conceptualised in the form of a dendrogram. Best similarity is defined in terms of the Q-value, this is a “modularity” value which is calculated as follows: Qi= i=n X i=1 (cii−a2i) (4.1)

whereQi is the Q-value associated with thecurrent cluster i,n is the total number of

nodes in the network,ciiis the fraction of intra-cluster (within cluster) links in clusteri

over the total number of links in the network, anda2_i is the fraction of links that end in the nodes in clusteriif the edges were attached at random. The value ai is calculated

as follows: ai= j=n X j=1 cji (4.2)

where cij is the fraction of inter-cluster links, between the current cluster i and the

cluster j, over the total number of links in the network.

Thus on each iteration the Q-values for all possible cluster pairings are calculated and the pairing with the highest Q-value selected for merging. The process proceeds until a best cluster configuration is achieved. This is defined as the configuration with the highest overall Q-value. Generally speaking, if the Q-value is above 0.3 then communities can be said to exist within the target network; the value of 0.3 was derived experimentally by Newman and Girvan [96]. Note that if all nodes are placed in one group the Q-value will be 0.0 (i.e. a very poor clustering). A worked example is presented in the following sub-section. The identified clustering (communities) are then displayed as “islands” in the following stage in the FPTA process. This will be described in Section 4.6.

4.5.3 Worked Example of Hierarchical Clustering Using Newman

Considering the simple example network presented in Figure 4.12. The Q value for this network at the start of the process, when each vertex is considered to represent a group, is (using data from Table 4.2):

Q=−0.01−0.01−0.04−0.01 =−0.07

There are six potential joins AB, AC, AD, BC, BD and CD; giving rise to six potential configurations. Calculating the Q-value for each configuration (Table 4.3)

i cii ai a2i Q

A 0.00 0.10 0.01 -0.01 B 0.00 0.10 0.01 -0.01 C 0.00 0.20 0.04 -0.04 D 0.00 0.10 0.01 -0.01 Table 4.2: Start Condition

Groups Internal Links

1 2 3 c11 c22 c33 a1 a2 a3 a21 a22 a23 Q AB C D 0.40 0.00 0.00 0.40 0.40 0.20 0.16 0.16 0.04 0.04 AC B D 0.00 0.00 0.00 0.60 0.20 0.20 0.36 0.04 0.04 -0.44 AD B C 0.00 0.00 0.00 0.40 0.20 0.40 0.16 0.04 0.16 -0.36 BC A D 0.00 0.20 0.00 0.60 0.20 0.20 0.36 0.04 0.04 -0.24 BD A C 0.00 0.00 0.00 0.40 0.20 0.40 0.16 0.04 0.16 -0.36 CD A B 0.00 0.40 0.00 0.60 0.20 0.20 0.36 0.04 0.04 -0.04

Table 4.3: First Iteration

gives a best Q-value of 0.04, this therefore represents the first join and the configuration {AB, C, D} is generated.

For the next join, there are three possible configurations: {ABC, D}, {ABD, C} and {AB, CD}. Calculating the Q-value for each of these configurations (Table 4.3) gives a best Q-value of 0.28, so this is the second join and the configuration{AB, CD} is formed.

For the third iteration the only remaining option is to combine all the vertices, this will give a Q-value of 0.0. The discovered maximal value for Q is then 0.28 and hence the configuration associated with this value,{AB, CD}, is selected as the best grouping (clustering). The dendrogram for the example is given in Figure 4.13.

In document Predictive trend mining for social network analysis (Page 81-84)