SOM : self-organising map - Multivariate modelling

2.3 Multivariate modelling

2.3.8 SOM : self-organising map

Self-organising maps (SOMs), initially developed by Kohonen, are a class of unsupervised artificial neural networks that map high-dimensional input data into a low-dimensional space in a topology-preserving manner.⁵⁹ After training, the weight vectors associated with each unit (or neuron) of the map should approximate the patterns of the objects that were used to train it, for example molecular spectral profiles. The units are typically hexagons arranged in a rectangular map. Units become more dissimilar as the distance between them on the map increases, leading to a ‘self-organised’ appearance that allows an instant overview of the data; an example is shown in figure 2.14.⁵⁹, ¹⁰⁸–¹¹⁰

Figure 2.14: Results of a self-organising map¹⁰⁹ based on in vivo ¹H hepatic spectral NMR data acquired by members of the Metabolic and Molecular Imaging group at the Hammersmith Hospital Campus, Imperial College London, UK. (A) Raw ¹H spectral data of the 338 volunteers; the two visible peaks are from water (4.7 ppm) and lipids (1.3 ppm). The spectra are coloured by increasing body-mass-index (BMI); generally, higher BMI values correspond to high lipid:water ratios. (B) The weight vectors corresponding to each unit of the self-organising map resemble the spectral data that were used to build the map. Higher fat levels are observed in the lower part of the map. (C) The distance map compares the weight vectors of neighbouring units, with similar units in black and increasing levels of blue for units that differ from their surrounding units. The bottom left units are highly dissimilar from the rest of the map and indicate a different class of spectra; as seen in B, these have higher lipid and lower water peaks. (D) After training, the data are projected onto the map and the best matching unit is found for each spectrum, based on a comparison of the measured spectral profile with the weight vectors shown in B. One unit can then represent a number of objects, each shown as a pie slice in this hit map. The size of each pie corresponds to the number of objects for which that unit was the best matching unit. The colouring of each pie slice is based on the BMI value also used in A (higher BMI values are red), and shows a correspondence to the organisation of the map based on spectral data: higher BMI values tend to

To create a self-organising map, a map of pre-defined size is initialised with a weight vector for every unit, holding an intensity for each variable to be modelled (comparable to a loading in PCA). These weight vectors can be chosen randomly, but are commonly based on the first two principal components of the data set, which reduces computation time and gives a deterministic, and therefore reproducible, SOM. Subsequently, the map is ‘trained’ to resemble the topology of the data, allowing non-linearities to be mapped efficiently. This is done by presenting an item $x$ from the training data set (e.g. a spectrum) to the map and finding the ‘best matching unit’ by evaluating the distance, typically Euclidean, between $x$ and the weight vector of each unit on the map. This competitive learning step identifies the weight vector of the map that most closely resembles the training item; this weight vector $w_j(t)$ is then updated to $w_j(t+1)$ so that it resembles the data more closely, by ‘learning’ from the training item $x$. The update is topology-preserving: neighbouring units are also updated, to a degree that depends on their distance on the map from the best matching unit, see equation 2.21.¹⁰⁸

$$w_j(t+1) = w_j(t) + \eta(t)\,N(t,r)\,[x - w_j(t)] \tag{2.21}$$

The learning is governed by the learning rate $\eta(t)$, which decreases linearly with the time $t$ over the total training time $T$, see equation 2.22; $\eta(t_0)$ is the initial learning rate.

$$\eta(t) = \eta(t_0)\left(1 - \frac{t}{T}\right) \tag{2.22}$$

The neighbourhood function $N(t, r)$, see equation 2.23, determines the degree to which units are updated, and depends on the training time $t$ as well as the distance $r$ on the map grid to the best matching unit. It acts like a smoothing kernel; in this work a Gaussian shape was used, with a neighbourhood radius $\sigma(t)$ that defines the region of influence of the training sample at time $t$.

$$N(t,r) = \exp\left(-\frac{r^2}{2\sigma(t)^2}\right) \tag{2.23}$$
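Equations 2.21 to 2.23 can be combined into a single online update step. The following is a minimal sketch rather than the implementation used in this work: it assumes plain Python lists for the weight vectors, a rectangular grid of unit coordinates (instead of the hexagonal layout described above), and a Gaussian neighbourhood of the form exp(−r²/2σ²). The function names and the toy values are hypothetical, and because the schedule for σ(t) is not specified here, the caller passes the current radius as `sigma`.

```python
import math

def learning_rate(t, T, eta0):
    """Linearly decaying learning rate (eq. 2.22)."""
    return eta0 * (1.0 - t / T)

def neighbourhood(r, sigma):
    """Gaussian neighbourhood (eq. 2.23); r is the grid distance to the BMU."""
    return math.exp(-(r ** 2) / (2.0 * sigma ** 2))

def best_matching_unit(weights, x):
    """Index of the unit whose weight vector is closest (Euclidean) to x."""
    return min(range(len(weights)), key=lambda j: math.dist(weights[j], x))

def update(weights, grid, x, t, T, eta0, sigma):
    """Apply one online update (eq. 2.21) in place; return the BMU index."""
    bmu = best_matching_unit(weights, x)
    eta = learning_rate(t, T, eta0)
    for j, w in enumerate(weights):
        r = math.dist(grid[j], grid[bmu])      # distance on the map grid
        h = eta * neighbourhood(r, sigma)      # combined step size for unit j
        weights[j] = [wi + h * (xi - wi) for wi, xi in zip(w, x)]
    return bmu

# Toy example (assumed values): three units in a row, two variables each.
grid = [(0, 0), (1, 0), (2, 0)]
weights = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
bmu = update(weights, grid, x=[1.0, 2.0], t=0, T=10, eta0=0.5, sigma=1.0)
```

In this toy example the middle unit is the best matching unit and moves halfway towards the training item, while its two neighbours move by a smaller fraction set by the Gaussian weighting.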

The gradually decreasing neighbourhood radius enables a global mapping of the SOM based on the samples, followed by a fine-tuning of the appearance of the map. The unit updates are repeated by presenting each item in the training data set in turn, and can be performed efficiently with a batch algorithm. This process of presenting the training data set to update the SOM is repeated until convergence. As a result of the learning process, the weight vectors come to represent the data, see figure 2.14 A and B.¹⁰⁹ Similarities between neighbouring units, measured as distances, can be visualised using a distance map (based on U-matrix methods) and can be used to identify clusters and outliers, as shown in figure 2.14 C.¹¹¹
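A distance map of this kind can be sketched as follows. This is an illustrative simplification, not the U-matrix variant used for figure 2.14: it assumes a rectangular grid with four neighbours per unit (rather than hexagonal units), stores the weight vectors as `weights[row][col]`, and summarises each unit by the mean Euclidean distance to its neighbours. The function name `distance_map` and the toy weights are hypothetical.

```python
import math

def distance_map(weights, n_rows, n_cols):
    """Mean Euclidean distance from each unit's weight vector to those of
    its grid neighbours (assumes the map has at least two units)."""
    umat = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            # Up, down, left and right neighbours that lie inside the grid.
            neighbours = [
                (i + di, j + dj)
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1))
                if 0 <= i + di < n_rows and 0 <= j + dj < n_cols
            ]
            dists = [math.dist(weights[i][j], weights[a][b]) for a, b in neighbours]
            umat[i][j] = sum(dists) / len(dists)
    return umat

# Toy 1 x 3 map (assumed values): the right-hand unit differs most.
toy_weights = [[[0.0, 0.0], [0.0, 1.0], [0.0, 5.0]]]
umat = distance_map(toy_weights, n_rows=1, n_cols=3)
```

Units with large values in `umat` differ strongly from their surroundings, which is what marks cluster boundaries and outlying regions in a distance map.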

Data, either training or newly acquired (test) data, can be presented to the converged SOM, and the best matching unit for each item can be calculated and visualised,¹¹⁰ see figure 2.14 D. Mapping data onto a SOM highlights its visualisation capabilities: SOMs enable the detection of clusters and classes, show the relative (dis)similarities of samples and present the responsible weight vectors, see figure 2.14. SOMs allow non-linear mapping of data in low dimensions and are relatively insensitive to outliers, as these will occupy their own region of the map.
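Projecting data onto a converged map amounts to a nearest-weight-vector search followed by counting. The sketch below assumes the trained weight vectors are available as a flat list of lists and uses the Euclidean distance; `project`, `hit_counts` and the toy values are hypothetical names and data chosen for illustration only.

```python
import math
from collections import Counter

def project(weights, data):
    """Return the index of the best matching unit for each item in data."""
    return [
        min(range(len(weights)), key=lambda j: math.dist(weights[j], x))
        for x in data
    ]

def hit_counts(bmus):
    """Count how many items each unit is the best matching unit for."""
    return Counter(bmus)

# Toy example (assumed values): three trained units, four items to project.
trained = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
items = [[0.1, 0.0], [0.9, 1.2], [1.1, 0.8], [5.0, 5.0]]
bmus = project(trained, items)
hits = hit_counts(bmus)
```

The counts in `hits` correspond to the pie sizes in a hit map; colouring each hit by an external variable such as BMI would require carrying that label alongside each item.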

The use of SOMs in metabonomic studies was nicely illustrated by a study of ¹H NMR plasma lipoprotein data, in which the unsupervised SOM approach was able to characterise the lipoprotein subclass profiles in a clinically relevant way.¹¹², ¹¹³

SOMs have also been used for assessing variable importance and for variable selection.¹¹⁴

Chapter 3

The use of 1D projections of J-resolved NMR spectra in metabonomics

3.1 Aims and objectives

Spectroscopic profiling of biological samples is an integral part of metabolically-driven top-down systems biology and can be used for identifying biomarkers of toxicity and disease. However, optimal biomarker information recovery and resonance assignment still pose significant challenges in NMR-based complex mixture analysis. This chapter describes a method for reducing peak overlap by projecting two-dimensional J-resolved (JRES) NMR spectra, and is based on the published work included in appendix B.¹¹⁵ This is done by means of:

1. Evaluating different processing steps to obtain high quality full-resolution JRES spectral projections;

2. Demonstrating the necessity of peak alignment for modelling of full-resolution JRES projections;

3. Investigating the application of statistical total correlation spectroscopy to JRES data;

4. Developing and assessing the feasibility of a newly proposed method: statistical total regression spectroscopy;

5. Comparing the recoverable information content in full-resolution ¹H JRES projections with conventional one-dimensional spectroscopy.

In document Development and Application of Chemometric Methods for Modelling Metabolic Spectral Profiles (Page 46-51)