2.3 Bioinformatics
2.3.3 Functional Prediction by Biostatistical Methods
With recent developments in biostatistics and advanced statistical methods such as machine learning algorithms, it is possible to design an efficient predictive model that extracts patterns from a given dataset [2,62,67–69]. One strategy for predicting function is to reduce the problem to a classification exercise, using data whose ontology allows each item to be assigned to a specific category, for example its top-level EC class.
Chapter 2 2.3. Bioinformatics
In statistics, classification is the problem of identifying patterns in a dataset, grouping together the items that are alike and separating those that are unlike. The idea is that each new item should be placed into a group based on one or more criteria. Classification analysis is applied in many research fields, including bioinformatics, cheminformatics [68], drug discovery and toxicogenomics. Classification methods can be divided into two approaches: supervised and unsupervised algorithms.
Supervised algorithm. A supervised classification algorithm learns patterns from examples with known labels and then uses what it has learned to classify new data points. Many machine learning methods have been shown to improve the prediction of function substantially when cheminformatics descriptors are designed using sequence or structure information [23,70,71]. Machine learning methods are widely applicable in bioinformatics and systems biology [72].
Machine learning focuses on prediction, based on properties learned from the training data. In other words, a machine learning algorithm constructs a model in order to predict the outcome of an experiment. Machine learning algorithms are also useful for investigating the structure of the data and for handling very large datasets. Their disadvantage, however, is that they depend heavily on the nature, source and quality of the data.
We preferred supervised classification methods because we wanted to see if our descriptors could show patterns which could be used further to annotate enzyme functions. This is a machine learning task, to infer function from given examples. Although algorithms of machine learning methods are often very complex, they nonetheless work on the very basic philosophy of learning from examples.
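The idea of learning from examples can be illustrated with a minimal sketch of a k-nearest-neighbour classifier, one of the simplest supervised methods. It is given only as an illustration and is not the method used in this work: the function names, the two-dimensional descriptors and the EC labels below are hypothetical toy values.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k=3):
    """Predict a label for `query` by majority vote among the k nearest
    labelled examples. `train` is a list of (feature_vector, label) pairs."""
    neighbours = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: made-up descriptors labelled with a top-level EC class.
training_set = [
    ((0.9, 0.1), "EC 1"),
    ((0.8, 0.2), "EC 1"),
    ((1.0, 0.3), "EC 1"),
    ((0.1, 0.9), "EC 3"),
    ((0.2, 0.8), "EC 3"),
    ((0.3, 1.0), "EC 3"),
]

predicted = knn_predict(training_set, (0.85, 0.15), k=3)
print(predicted)
```

The labelled examples play the role of the training data, and the prediction for an unseen item depends entirely on them, which is why the quality of the data matters so much.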
Machine learning algorithms are applicable to problems such as predicting the structure of a protein [72]. Protein structure prediction has been a challenge for decades, and advances in technology have now brought some success. Machine learning methods are well suited to mapping an input amino acid sequence to the features of the output sequence. Where too little information is available for two proteins sharing the same function annotation, machine learning algorithms can be very useful by extracting more information from multiple proteins [2,23,73,74].
Unsupervised Clustering Analysis. As its name suggests, this type of classification is not supervised by any examples. In this method, the data points are grouped according to some criterion such that the members of a group can be distinguished from non-members. A distinct cluster can be defined in two ways: by finding greater similarity among the members of the group, or by finding clear separation between the clusters. Sometimes clustering is used only to summarise the data, and the summary is then used for further analysis. In biology, bioinformatics, pattern recognition and social science, cluster analysis is commonly used to understand the data or to annotate function.
Indeed, humans are skilled at grouping objects based on certain criteria; for example, a child can label the contents of a photograph as a building, a vehicle, people and so on. Biologists have used clustering analysis to create a taxonomy of living things: kingdom, phylum, class, order, family, genus and species. Among the many unsupervised classification algorithms, hierarchical clustering is the simplest and is very widely used by biologists. An example of hierarchical classification in biology is the Gene Ontology (GO), which classifies genes into hierarchies of biological processes and molecular functions. Moreover, three structural classification databases which define sequence-structure-function relationships are SCOP, CATH and DALI. The EC nomenclature and classical taxonomy are both hierarchical schemes: the former classifies enzymes into biochemical classes, and the latter classifies organisms by organism-level morphological features.
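A minimal sketch of agglomerative hierarchical clustering is given below for illustration. It assumes single linkage (the distance between two clusters is the smallest distance between any pair of their members) and Euclidean distance, neither of which is prescribed by the text; the function names and the toy points are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_linkage(points, n_clusters):
    """Repeatedly merge the two closest clusters until `n_clusters` remain.
    Returns a list of clusters, each a sorted list of indices into `points`."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[i] for i in range(len(points))]  # start with one cluster per point
    while len(clusters) > n_clusters:
        best = None  # (distance, index of first cluster, index of second cluster)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: closest pair of members across the two clusters.
                d = min(euclidean(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


# Hypothetical toy data: three points near the origin and two far away.
toy_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
groups = single_linkage(toy_points, n_clusters=2)
print(groups)
```

Recording the order of the merges, rather than stopping at a fixed number of clusters, would give the tree (dendrogram) that makes the method hierarchical. No labelled examples are used at any point.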
Another database, which uses microarray data to study a large variety of biological mechanisms, including association with diseases, is the Database for Annotation, Visualization and Integrated Discovery (DAVID) [75,76]. It is widely used to understand the biological meaning of gene lists by means of various sophisticated statistical methods.
As indicated schematically in Figure 2.2, biophysical information, combined with bioinformatics analyses of an entire set of related or unrelated proteins, can be used to identify novel function by one of two strategies: supervised or unsupervised classification. With these machine learning methods, a new protein can be classified as either a member or a non-member of a class depending on its feature vector.
Figure 2.2: Schematic representation of an interactive approach to function annotation and prediction. It starts with extracting useful information from sequence or structure, such as catalytic residues and reaction entities, to build a predictive model. Once the data have been pre-processed, they can be used for annotation with either a supervised or an unsupervised method. Evaluating the resulting model is strongly recommended in order to reduce the number of false positives.
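The evaluation step can be sketched as follows, assuming a two-class setting in which each protein is predicted to be a member or a non-member. Precision and recall are used here as example measures; they are not named in the text, and the function names and the label lists are hypothetical.

```python
def confusion_counts(true_labels, predicted_labels, positive):
    """Count true positives, false positives, false negatives and true
    negatives, treating `positive` as the class of interest."""
    tp = fp = fn = tn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if guess == positive and truth == positive:
            tp += 1
        elif guess == positive:
            fp += 1  # predicted member, actually non-member
        elif truth == positive:
            fn += 1  # predicted non-member, actually member
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall(tp, fp, fn):
    """Precision falls as false positives rise; recall falls as false negatives rise."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical known annotations and model predictions for six proteins.
truth = ["member", "member", "member", "non-member", "non-member", "non-member"]
guess = ["member", "member", "non-member", "member", "non-member", "non-member"]

tp, fp, fn, tn = confusion_counts(truth, guess, positive="member")
precision, recall = precision_recall(tp, fp, fn)
print(tp, fp, fn, tn)
print(precision, recall)
```

Computing these counts on examples held out from training shows how often the model wrongly assigns a protein to a class, which is the quantity the evaluation step is meant to keep low.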