4.4 Trend Grouping
4.4.1 Trend Clustering using Self Organizing Maps
To group the trends, one SOM was created per data episode. SOMs [74] may be viewed as a type of feed-forward neural network, trained by unsupervised competitive learning, that comprises an input layer and an output layer (an x×y grid). The cells in the x×y grid are referred to as nodes; each node potentially represents a trend cluster (a grouping of trends that display similar geometry). Recall that, in the work described in this thesis, the input layer comprises the trends (trend lines formed of n support counts associated with each frequent pattern) and the output layer the trend clusters. Each output node (map node) in the output layer is connected to every input node in the input layer, which receives a trend line, and is assigned a weight vector, w. The dimension of the weight vectors is the same as the dimension of the trend lines of interest; in this thesis, for example, trend lines are of length 12 (months). The SOM is then “trained” using a training input dataset. Algorithm 4.6 provides the trend grouping pseudo code for clustering the trend lines generated using TM-TFP. With reference to this pseudo code, the SOM is first initialised (line 1) with a predefined x×y grid (map). A discussion on the optimum size of a grid/map is presented in Sub-section 4.4.2 below.
Algorithm 4.6: Trend Grouping using SOM
input : T = {τ_e1, τ_e2, ..., τ_ee}
output: SOM prototype map and e trend line maps
    // generate a SOM prototype map
 1  Initialise a SOM prototype map with x×y nodes;
 2  Assign weight vectors, w, to the map nodes;
 3  for i ← 1 to |τ_e1| do
 4      Find the “winning” node for trend line t_1i in the prototype map;
 5      Adjust the weight vectors of nearby map nodes accordingly;
 6  end
    // generate the SOM trend line maps
 7  for k ← 1 to e do
 8      Initialise a SOM trend line map, with x×y nodes, for episode k;
 9      for i ← 1 to |τ_ek| do
10          Plot t_ki onto the prototype map for episode k;
11      end
12  end
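The steps of Algorithm 4.6 can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in the thesis: the function names, the random weight initialisation, the linear decay schedule, the parameters `epochs`, `alpha0` and `sigma0`, and the toy trend lines are all assumptions introduced here. In particular, the sketch presents the training episode several times (`epochs`), whereas Algorithm 4.6 shows a single pass.

```python
import math
import random


def init_som(x, y, dim, rng):
    """Lines 1-2: an x-by-y grid of nodes, each with a random weight vector."""
    return {(r, c): [rng.random() for _ in range(dim)]
            for r in range(x) for c in range(y)}


def winning_node(som, trend):
    """Line 4: the node whose weight vector is closest (Euclidean) to the trend."""
    return min(som, key=lambda node: math.dist(som[node], trend))


def train_prototype(som, trends, epochs=20, alpha0=0.5, sigma0=1.5):
    """Lines 3-6: present each trend line and pull nearby nodes towards it."""
    total = epochs * len(trends)
    step = 0
    for _ in range(epochs):
        for trend in trends:
            decay = 1.0 - step / total        # shrinks from 1 towards 0
            alpha = alpha0 * decay            # assumed learning-rate schedule
            sigma = max(sigma0 * decay, 0.1)  # assumed neighbourhood radius
            win = winning_node(som, trend)
            for node, w in som.items():
                d2 = (node[0] - win[0]) ** 2 + (node[1] - win[1]) ** 2
                h = math.exp(-d2 / (2 * sigma ** 2))  # Gaussian neighbourhood
                for j in range(len(w)):
                    w[j] += alpha * h * (trend[j] - w[j])
            step += 1
    return som


def trend_line_maps(som, episodes):
    """Lines 7-12: one map per episode, each trend placed on its winning node."""
    maps = []
    for trends in episodes:
        episode_map = {node: [] for node in som}
        for i, trend in enumerate(trends):
            episode_map[winning_node(som, trend)].append(i)
        maps.append(episode_map)
    return maps


# Toy data (made up): 12-month trend lines with support counts scaled to [0, 1].
rng = random.Random(0)
rising = [i / 11 for i in range(12)]
falling = rising[::-1]
flat = [0.5] * 12
episodes = [[rising, falling, flat], [falling, flat, rising, rising]]

som = train_prototype(init_som(3, 3, 12, rng), episodes[0])
maps = trend_line_maps(som, episodes)
```

Each entry of `maps` holds, per node, the indices of the trend lines assigned to it in that episode. Because every episode is plotted against the same trained `som`, a given node position denotes the same trend type in every map.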
The SOM was thus trained using the trend lines associated with the frequent patterns discovered in the first data episode, e1 (lines 3 to 5). Each record in τ_e1 (defined in Section 4.1) was presented to the SOM in turn. The output nodes then “compete” for each record. Once a record has been assigned to the “winning” map node, the network’s weightings are adjusted to reflect this new position. At first the adjustments are relatively large, but as the training continues they become progressively smaller. A distance function¹ was used to determine similarity, and a neighbourhood function² to determine how far around the winning node the adjustment extends. A feature of the adjustment was that adjacent nodes would come to hold similar records; the greatest dissimilarity would be between nodes at opposite corners of the map. At the end of the SOM training phase, a prototype map was produced that represented the types of trend line that existed within the set of identified trend lines in τ_e1. Copies of the resulting prototype map were then populated with data from all e episodes (τ_e1 to τ_ee) to produce a sequence of e maps, M = {M1, M2, ..., Me} (lines 8 to 12). Using this SOM-based clustering process, the substantial number of trends that are typically identified using TM-TFP could be grouped according to their trend “types” so as to aid analysis. Figure 4.5 illustrates the process. The figure features four episodes, which are used to generate four SOM maps (labelled I, II, III, IV) based on the prototype map.

¹ A Euclidean function was adopted with respect to the work described in this thesis.
² A Gaussian function was used to determine the neighbourhood size of the map.
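The adjustment step described above is commonly written in the following form. The thesis does not state its exact update rule, so this is the standard Kohonen formulation given under assumptions: the symbols $s$ (training step), $\alpha(s)$ (a decreasing learning rate), $\sigma(s)$ (a decreasing neighbourhood width), $\mathbf{r}_j$ (the grid position of node $j$) and $c$ (the winning node) are introduced here for illustration, and $\mathbf{t}$ denotes the trend line being presented.

```latex
\begin{align}
c &= \arg\min_{j} \, \lVert \mathbf{t} - \mathbf{w}_j(s) \rVert_2
  && \text{(Euclidean distance selects the winning node)} \\
h_{cj}(s) &= \exp\!\left( -\frac{\lVert \mathbf{r}_c - \mathbf{r}_j \rVert^2}{2\,\sigma(s)^2} \right)
  && \text{(Gaussian neighbourhood around the winner)} \\
\mathbf{w}_j(s+1) &= \mathbf{w}_j(s) + \alpha(s)\, h_{cj}(s)\, \bigl( \mathbf{t} - \mathbf{w}_j(s) \bigr)
  && \text{(weight adjustment for every node } j\text{)}
\end{align}
```

Because both $\alpha(s)$ and $\sigma(s)$ decrease with $s$, early presentations move many nodes by a large amount, while later ones make small corrections close to the winning node. This is consistent with adjacent nodes coming to hold similar records.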
[Figure 4.5 panels: Frequent patterns and trends from FPTM; Trend Grouping SOM; Prototype Map; Trend line Maps I, II, III, IV]
Figure 4.5: SOM Prototype and Trend lines maps
The author experimented with a number of different mechanisms for training the SOM, including: (i) devising specific trends to be represented by individual nodes, (ii) generating a collection of all the mathematically possible trends and training the SOM using this set, and (iii) using some or all of the trends in the first episode to be considered. The first required prior knowledge of the trend configurations of interest, which, it was conjectured, tended to defeat the objective of the trend mining process. The second, it was discovered, resulted in maps for which the majority of nodes were empty. The third option was therefore adopted: the SOM was trained using the patterns in the first episode. Trends (patterns) in subsequent episodes are grouped using these cluster definitions; thus a pattern can always be assigned to a cluster. SOM generation includes intermediate cluster definitions even if there are no trend lines in the training data (episode) that may be assigned to these clusters. This is why some nodes (clusters) have no trend lines associated with them (see, for example, the maps presented in Figures 4.10 and 4.11). It also means that it is highly likely that the patterns found in subsequent episodes (not used for training) can be assigned to representative clusters. The patterns in the subsequent episodes would have to be very different from the patterns in the training data for them to be assigned to what might be considered non-representative clusters. Given the nature of the datasets considered, this seems unlikely.
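The two observations above, that every pattern is always assigned to some cluster and that intermediate nodes may remain empty for the training episode, can be illustrated with a small Python sketch. The 1×3 prototype map, the three-point trend lines and the function names are hypothetical and chosen only to keep the arithmetic easy to follow; they are not taken from the thesis data.

```python
import math
from collections import Counter

# Hypothetical 1x3 prototype map over 3-point trend lines (not thesis data):
# the middle node is an intermediate cluster between "rising" and "falling".
prototype = {
    (0, 0): [0.0, 0.5, 1.0],   # rising
    (0, 1): [0.5, 0.5, 0.5],   # intermediate (flat)
    (0, 2): [1.0, 0.5, 0.0],   # falling
}


def assign(prototype, trend):
    """Return the node with the smallest Euclidean distance to the trend line."""
    return min(prototype, key=lambda node: math.dist(prototype[node], trend))


def occupancy(prototype, trends):
    """Count trend lines per node; nodes that attract none keep a count of 0."""
    counts = Counter({node: 0 for node in prototype})
    counts.update(assign(prototype, t) for t in trends)
    return dict(counts)


training_episode = [[0.1, 0.5, 0.9], [0.9, 0.4, 0.1], [0.0, 0.6, 1.0]]
later_episode = [[0.4, 0.5, 0.6], [0.2, 0.4, 0.9]]

train_counts = occupancy(prototype, training_episode)
later_counts = occupancy(prototype, later_episode)
empty_nodes = [node for node, n in train_counts.items() if n == 0]
```

In this toy case the intermediate node (0, 1) is empty for the training episode, yet it is the closest match for the nearly flat trend line in the later episode. Since `assign` always returns the nearest node, no trend line is ever left unassigned.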
As mentioned previously, the generated prototype map determines the positions of the trend types (clusters) in the SOM trend line maps, which were subsequently used to group the trends from the rest of the data episodes. Because the trend clusters have fixed positions within the SOM maps, changes in the trends associated with particular frequent patterns (pattern membership) can be identified between episodes.
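Because node positions carry the same meaning in every map, detecting a change in pattern membership reduces to comparing the node assigned to a pattern in consecutive episodes. The following Python sketch shows one way this could be done. The representation of a map as a dictionary from pattern to grid position, the use of `frozenset` for patterns, the function name `pattern_moves` and the example data are all assumptions made for illustration.

```python
def pattern_moves(maps):
    """List (pattern, episode, from_node, to_node) for every change of node.

    `maps` holds one dict per episode, mapping a frequent pattern (here a
    frozenset of items, an assumed representation) to the grid position of
    the cluster it was assigned to. Episodes are numbered from 1.
    """
    moves = []
    for k in range(1, len(maps)):
        previous, current = maps[k - 1], maps[k]
        # Only patterns that are frequent in both episodes can be compared.
        for pattern in previous.keys() & current.keys():
            if previous[pattern] != current[pattern]:
                moves.append((pattern, k + 1, previous[pattern], current[pattern]))
    # Sort for a stable result: by episode, then by the pattern's items.
    return sorted(moves, key=lambda m: (m[1], sorted(m[0])))


# Made-up example: three episodes on a 3x3 map.
episode_maps = [
    {frozenset({"a"}): (0, 0), frozenset({"a", "b"}): (2, 1)},
    {frozenset({"a"}): (0, 0), frozenset({"a", "b"}): (1, 1)},
    {frozenset({"a"}): (0, 1), frozenset({"c"}): (2, 2)},
]
moves = pattern_moves(episode_maps)
```

Here the pattern {a, b} moves from node (2, 1) to node (1, 1) on entering episode 2, and {a} moves from (0, 0) to (0, 1) on entering episode 3. A pattern that is absent from one of two consecutive episodes is not compared across that pair.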