
1.5 High dimensional data visualization and clustering

1.5.2 Modifications of the SOM

The performance of the SOM depends on the initialization of neurons, the choice of the topology and the learning algorithm (Steps 2 and 3 in Algorithm 1). Different versions of the SOM have been proposed to improve its performance (see, for example, [4, 7, 8, 13, 30, 41, 43, 44, 50, 68, 72, 94, 95, 126, 134, 141, 146, 149, 151, 152, 162]). The paper [146] presents an automated detection algorithm based on the SOM, assuming that the training data is an adequate representation of the sample distribution. Therefore, the SOM is trained using a small proportion of the sample data set, and the algorithm defines a region around each prototype by employing a parameter r_j that represents the distance of the farthest sample projected onto neuron j, j = 1, ..., q (where q is the number of neurons). Incoming samples are then presented to the network, and novelties are those samples that do not fit into these regions.
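The region test described above can be sketched as follows. This is a minimal illustration, not the implementation of [146]: it assumes the prototypes have already been trained, uses the Euclidean distance, and the function names (`nearest`, `fit_radii`, `is_novel`) are hypothetical.

```python
import math

def nearest(prototypes, x):
    """Return (index, distance) of the prototype closest to x."""
    dists = [math.dist(p, x) for p in prototypes]
    j = min(range(len(prototypes)), key=dists.__getitem__)
    return j, dists[j]

def fit_radii(prototypes, train):
    """r_j: distance of the farthest training sample mapped to neuron j."""
    radii = [0.0] * len(prototypes)
    for x in train:
        j, d = nearest(prototypes, x)
        radii[j] = max(radii[j], d)
    return radii

def is_novel(prototypes, radii, x):
    """A sample is a novelty if it falls outside the region of its nearest neuron."""
    j, d = nearest(prototypes, x)
    return d > radii[j]

# Assumed toy data: two already-trained prototypes and four training samples.
prototypes = [(0.0, 0.0), (10.0, 10.0)]
train = [(0.5, 0.0), (0.0, 1.0), (10.0, 9.0), (11.5, 10.0)]
radii = fit_radii(prototypes, train)
```

With these toy values, a sample close to a prototype, such as `(0.3, 0.3)`, is accepted, while a sample halfway between the two prototypes, such as `(5.0, 5.0)`, lies outside both regions and is flagged.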

A combinatorial two-stage clustering algorithm based on the SOM is introduced in [44]. Numerical results using the Ant Colony Optimization technique and the k-means demonstrate the superiority of the proposed algorithm in comparison with the SOM and the k-means. In [30], an enhanced Clusot algorithm [28] is applied to the SOM for automatic cluster detection.

In [134] the SOM’s prototypes are clustered hierarchically based on the density instead of the distance dissimilarity. A two-stage algorithm is proposed in [151] that applies the graph cut algorithm (see [128]) to the SOM output. Results demonstrate that this algorithm requires less computational time than direct clustering methods.
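The general two-stage idea, in which the q prototypes are clustered instead of the full data set, can be sketched as below. This is only an illustration under stated assumptions: a simple single-linkage grouping with a distance threshold stands in for the density-based hierarchy of [134] and the graph cut of [151], the prototypes are assumed to be already trained, and all names are hypothetical.

```python
import math

def cluster_prototypes(prototypes, threshold):
    """Stage 2 stand-in: single-linkage grouping of prototypes.

    Prototypes closer than `threshold` end up with the same label.
    """
    parent = list(range(len(prototypes)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(prototypes)):
        for j in range(i + 1, len(prototypes)):
            if math.dist(prototypes[i], prototypes[j]) < threshold:
                parent[find(i)] = find(j)

    roots = [find(i) for i in range(len(prototypes))]
    relabel = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]

def label_samples(samples, prototypes, proto_labels):
    """Each sample inherits the cluster label of its nearest prototype."""
    labels = []
    for x in samples:
        j = min(range(len(prototypes)), key=lambda k: math.dist(prototypes[k], x))
        labels.append(proto_labels[j])
    return labels

# Assumed toy prototypes forming two well-separated groups.
prototypes = [(0.0, 0.0), (1.0, 0.0), (9.0, 9.0), (10.0, 9.0)]
proto_labels = cluster_prototypes(prototypes, threshold=2.0)
sample_labels = label_samples([(0.4, 0.1), (9.6, 9.2)], prototypes, proto_labels)
```

The pairwise step runs over the prototypes only, so its cost depends on q rather than on the number of samples, which is one plausible reason such schemes are cheaper than clustering the data directly.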

A dynamic SOM is a version of the SOM whose structure is not fixed during the learning phase. In [4], a growing self-organizing map (GSOM) is presented, which defines a spread factor to measure and control the growth of the network. Similarly, in [13], a multi-level interior growing SOM is introduced. Unlike the GSOM, which allows growth only from the border of the map, this algorithm also allows neurons to grow from an interior node of the map.
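How a spread factor can control growth is sketched below. The sketch assumes the growth threshold commonly quoted for the GSOM, GT = -D ln(SF), where D is the data dimension and SF lies in (0, 1); [4] should be consulted for the exact definition. The function names are hypothetical.

```python
import math

def growth_threshold(dim, spread_factor):
    """Assumed GSOM-style threshold GT = -D * ln(SF), with 0 < SF < 1."""
    if not 0.0 < spread_factor < 1.0:
        raise ValueError("spread_factor must lie strictly between 0 and 1")
    return -dim * math.log(spread_factor)

def should_grow(accumulated_error, dim, spread_factor):
    """A node is a candidate for growth once its accumulated error exceeds GT."""
    return accumulated_error > growth_threshold(dim, spread_factor)

low_spread = growth_threshold(4, 0.1)   # high threshold, little growth
high_spread = growth_threshold(4, 0.9)  # low threshold, more growth
```

Under this assumption, a spread factor close to 1 gives a small threshold, so nodes grow readily and the map becomes larger, while a spread factor close to 0 keeps the map small.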

The Growing Neural Gas (GNG), introduced in [66], is an improvement to the Neural Gas (NG) algorithm [101]. More specifically, it is an incremental version of the NG algorithm which does not require the network size to be set in advance. The GNG algorithm is able to make the topological relations of the input signals explicit.

In [67], a new self-organizing model, the so-called Growing Grid (GG), is proposed to overcome some drawbacks of existing models. The network automatically chooses a height/width ratio suitable for the data distribution. Moreover, locally accumulated statistical values are used to determine where to insert new units.

A novel artificial neural-network architecture, called the growing hierarchical SOM (GHSOM), is proposed in [117] to resolve two limitations of the SOM: its static architecture and its limited capability to represent hierarchical relations in the data.

A parameter-less self-organizing map algorithm (PLSOM), proposed in [24], eliminates SOM parameters such as the learning rate and neighborhood size and calculates values of these parameters using the local quadratic fitting error of the map. This allows the map to make large adjustments in response to unfamiliar inputs, i.e., inputs that are not well mapped, while not making large changes in response to inputs it is already well adjusted to. Unfortunately, the PLSOM has the property that it overreacts to extreme outliers, even after long periods of training [25].
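The behaviour described above can be illustrated with a simplified scaling variable: the current fitting error divided by the largest error seen so far. This is a sketch only; the exact normalization and the neighborhood function used in [24] may differ, and the class name `ErrorScaler` is hypothetical.

```python
class ErrorScaler:
    """Scale adaptation by the current error relative to the largest error seen."""

    def __init__(self):
        self.max_err = 0.0

    def epsilon(self, err):
        # Update the running maximum, then normalize into [0, 1].
        self.max_err = max(self.max_err, err)
        return err / self.max_err if self.max_err > 0.0 else 0.0

def update_winner(weight, x, eps):
    """Move the winning weight vector toward x by a fraction eps."""
    return [w + eps * (xi - w) for w, xi in zip(weight, x)]

scaler = ErrorScaler()
eps_first = scaler.epsilon(2.0)     # first input sets the scale
eps_familiar = scaler.epsilon(0.5)  # well-mapped input, small adjustment
eps_outlier = scaler.epsilon(4.0)   # outlier triggers a full adjustment
eps_after = scaler.epsilon(0.5)     # same error as before, now scaled down further
```

In this simplified form an extreme outlier always receives the maximal scaling value, however long training has run, and it also shrinks the adjustments for all later inputs. This is consistent with the overreaction to outliers noted in [25].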

Another extension of the SOM algorithm is presented in [75]. This extension automatically calculates the learning parameters during training. The algorithm is based on the Kalman filter estimation technique and the idea of the topographic product. The Fast Learning SOM (FLSOM) algorithm, presented in [62], is based on the application of the simulated annealing (SA) metaheuristic to SOM learning. The SA is used to modify the learning rate factor in an adaptive way. The FLSOM shows better convergence than the original SOM algorithm.

A two-level clustering algorithm is proposed in [104] to improve the clustering output. At the first level of the algorithm, the SOM is trained on the data, and at the second level, an incremental clustering approach is applied to the output of the SOM. The optimal number of clusters is found by applying rough set theory to the output of the SOM. The SA algorithm is adopted to minimize the uncertainty due to the overlapping between clusters, which is detected using rough set theory. Similarly, in [105], the overlapping caused by the cluster structures is removed by using a genetic algorithm instead of SA.

None of the modifications of the SOM described above includes a specific procedure to find the initial weights of neurons. Therefore, most of these algorithms are still sensitive to the initialization of neurons. Furthermore, most of these algorithms, including the SOM, are not efficient on large data sets. In this research, a new version of the SOM is proposed to address these drawbacks. The proposed version includes an algorithm for the initialization of neurons based on a split and merge procedure. This procedure detects high-density areas in the input data space, and neurons are then generated in those areas. Initialization of neurons in such areas accelerates the convergence of the algorithm and makes it applicable to large data sets. A new topology is presented to restrict the adaptation of the neurons to a neighborhood located in the same high-density area. Such an approach leads to a better local minimum of the quantization error than that obtained by the SOM.
