
PHOSIDA – Phosphorylation Site Database

4.5 Data Mining in the Compiled Database

4.5.2 Clustering

‘Clustering is the process of grouping the data into classes or clusters so that objects within a cluster have high similarity in comparison to one another, but are very dissimilar to objects in other clusters’ (Witten, 2005).

The objects’ attribute values are usually mapped into a multidimensional feature space in order to calculate distance measures reflecting their dissimilarity. Figure 4.17 illustrates a two-dimensional clustering that results in three different groups (clusters). Each axis reflects a certain attribute value of a given object. The three clusters are obvious by visual inspection; this visual grouping is intuitive because of the human brain’s highly evolved capacity for image and pattern recognition. Clustering analysis has been widely used in applications ranging from market analysis to microarray gene expression data analysis. The application of clustering to large-scale datasets containing objects described by multiple features has led to the design of a large number of different clustering approaches, including hierarchical, grid-based, density-based, and partitioning methods. Each approach has its advantages and disadvantages depending on the data set.

We applied the Fuzzy C-Means (FCM) algorithm (Futschik and Carlisle, 2005), a partitioning method, in order to group the quantitative data reflecting phosphorylation changes upon treatment with certain stimuli. The main idea of k-means clustering is to group a given set of objects into k clusters such that the similarity within each cluster, measured with respect to the mean value of the objects in that cluster, is maximized. It proceeds as follows: First, it randomly selects k objects, each representing a cluster’s center. Each remaining object is assigned to the most similar of the k centers according to the calculated feature distance. The algorithm then recomputes the mean of each cluster and reassigns the objects, repeating both steps until the cluster assignments no longer change. FCM is a variant of the k-means approach that allows data elements to be members of multiple clusters, each to a certain degree. Thus, FCM offers clustering that is tolerant to noise: the fuzzification parameter m can be varied to limit the contribution of ill-behaved profiles to the clustering process.
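The iteration described above can be sketched in a few lines of Python. This is a minimal illustration of the textbook fuzzy c-means update rules, not the implementation used for PHOSIDA; the function names, the Euclidean distance, the convergence tolerance, and the toy data in the usage note are assumptions made for the example.

```python
import math
import random


def euclidean(p, q):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def memberships_for(points, centers, m):
    """Degree of membership of every point in every cluster (each row sums to 1)."""
    exponent = 2.0 / (m - 1.0)
    rows = []
    for p in points:
        dists = [euclidean(p, ctr) for ctr in centers]
        if min(dists) == 0.0:
            # The point lies exactly on one or more centers: share membership among them.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            rows.append([h / sum(hits) for h in hits])
        else:
            rows.append([1.0 / sum((dj / dk) ** exponent for dk in dists)
                         for dj in dists])
    return rows


def fuzzy_c_means(points, n_clusters, m=2.0, max_iter=100, tol=1e-6, seed=0):
    """Return (centers, memberships) for a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # As in k-means, start from randomly selected objects as cluster centers.
    centers = [list(p) for p in rng.sample(points, n_clusters)]
    dim = len(points[0])
    for _ in range(max_iter):
        u = memberships_for(points, centers, m)
        new_centers = []
        for j in range(n_clusters):
            weights = [row[j] ** m for row in u]  # fuzzified weights u_ij^m
            total = sum(weights)
            if total == 0.0:
                new_centers.append(centers[j])  # no support: keep the old center
                continue
            new_centers.append([sum(w * p[d] for w, p in zip(weights, points)) / total
                                for d in range(dim)])
        shift = max(euclidean(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:  # centers have stopped moving
            break
    return centers, memberships_for(points, centers, m)
```

Calling `fuzzy_c_means(points, n_clusters=2)` on two well-separated groups of points returns one membership row per point; taking the index of the largest value in a row gives the hard assignment that k-means would produce. Larger values of m flatten the memberships, so profiles far from every center contribute less to any single cluster.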

We applied the FCM approach to group profiles reflecting the phosphorylation dynamics upon EGF stimulation (Chapter 4.6.1.1.1). Consequently, each phosphorylated peptide could be assigned to a cluster representing upregulation or downregulation at a certain time point. We found optimal partitioning with six clusters and a fuzzification parameter of two. The resulting cluster of each identified phosphopeptide is also shown in the PHOSIDA online database (Figure 4.17, right panel).

Figure 4.17: Clustering in PHOSIDA

Illustration of three clusters in a two-dimensional feature space (left panel) and integration of clusters reflecting phosphorylation dynamics on the basis of quantitative data (right panel) in PHOSIDA.

4.5.3 Classification

Data classification is a two-step process. First, a model is built on the basis of a set of objects. Each object has certain attribute values, which are transformed into a feature vector space. The objects’ attributes are essential to determine dissimilarities between different samples by appropriate distance measures. Because the category of each sample is known, the creation of a model describing the differences between classes is called ‘supervised learning’. The training samples of known classes are used to build a model described by mathematical formulas or decision trees, for instance. To evaluate the accuracy of the learning approach, one usually withholds a subset of the labeled samples from the training set. The known classes of these test samples are then compared with the classification decisions of the learned classifier to measure its performance.

One usually takes 90% of the labeled samples for training and 10% for testing. To avoid skewing the evaluation of the classification performance by a single random selection, one applies this performance test iteratively (n-fold cross-validation), where each step uses a different split into training and test samples. If the performance of the classification approach is acceptable, one can use the trained model to classify uncategorized future samples.
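The splitting scheme can be illustrated with a short Python sketch. It assumes the common partition-based form of n-fold cross-validation, in which the shuffled samples are divided into n folds and each fold serves as the test set exactly once; with n = 10 this yields the 90%/10% ratio mentioned above. The function name and the seed parameter are hypothetical.

```python
import random


def n_fold_splits(samples, n=10, seed=0):
    """Yield (training, test) lists; every sample is used for testing exactly once."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)        # random order, reproducible via seed
    folds = [shuffled[i::n] for i in range(n)]   # n folds of (nearly) equal size
    for i, test in enumerate(folds):
        training = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield training, test
```

A classifier would be trained on `training` and scored on `test` in each of the n rounds, and the n scores averaged to estimate its performance.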

Hence, classification is very similar to prediction in its purpose. However, classification is used to predict discrete or nominal values. The species assignment of given organisms is a typical classification problem, with answers such as “dog” or “cat”. In contrast, prediction can be viewed as the construction of a model to assess the (continuous) value or value range that an attribute of a given sample is likely to take on.

Both prediction and classification have numerous applications including selective marketing, medical diagnosis, and protein docking prediction.

We applied a classification approach in order to predict whether a given residue of a protein is likely to be phosphorylated. As consensus sequences are the basis for kinase-specific phosphorylation, the sequence surrounding a given residue is decisive for predicting its likelihood of being phosphorylated. With the phosphorylation sites determined in our large-scale phosphoproteomics studies, we trained a support vector machine to classify unlabeled samples (residues) as phosphorylated or unphosphorylated amino acids. The main principle of support vector machines is described in Chapter 7.
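To make the notion of a residue's surrounding sequence as a feature vector concrete, the following Python sketch extracts a sequence window around a candidate site and encodes it numerically. It is an illustration only: the window half-width `flank`, the padding character, and the one-hot encoding are assumptions for the example and are not taken from the encoding described in Chapter 7.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def window(sequence, position, flank=6, pad="-"):
    """Return the residues around `position` (0-based), padded at the termini."""
    padded = pad * flank + sequence + pad * flank
    # After padding, the residue of interest sits at index position + flank.
    return padded[position:position + 2 * flank + 1]


def one_hot(window_seq):
    """Encode each residue as 20 binary features; padding yields all zeros."""
    vector = []
    for residue in window_seq:
        vector.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return vector
```

For example, `window("MKRSPEL", 3, flank=2)` returns `"KRSPE"`, with the candidate serine in the center. Vectors of this kind, labeled as phosphorylated or unphosphorylated, could then serve as training input for a classifier such as a support vector machine.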

We also tried to find additional features besides the raw sequence that enhance the accuracy of classification. For example, phosphorylation requires that target residues are accessible to kinases; thus, solvent accessibility is a potential parameter to consider.

As this machine learning approach was applied to various datasets, resulting in multiple trained models that enable the prediction of phosphorylation sites in various species, its implementation and application are discussed in detail in Chapter 7.

In document Gnad, Florian (2008): Bioinformatics of Phosphoproteomics. Dissertation, LMU München: Fakultät für Chemie und Pharmazie (Page 58-61)