The aim of classification, or supervised learning, is to determine whether an object belongs to a certain class. Classification of patients into existing disease classes using gene expression information is a typical application. In microarray analysis, classification is used to predict sample phenotypes based on gene expression patterns. Classifiers based on gene expression normally predict that a certain percentage of individuals who have a given expression profile will also have the phenotype of interest [100]. When working with complex data variables (features), such as those found in large, noisy and incomplete microarray data sets, supervised methods are more efficient than unsupervised ones.
The term classification in its broadest sense covers any context in which some decision or forecast is made according to currently available information [92]. Classification procedures include formal methods for making judgements in new situations. More strictly, classification means constructing a procedure that will be applied to a continuing sequence of cases, where the aim is to assign each new case to one of a set of pre-defined classes on the basis of observed
Classification algorithm for gene expression data sets 4.1. Introduction
attributes or features. The construction of a classification procedure using a set of data for which the true classes are known is called pattern recognition, or supervised learning. An example of this is assigning a credit status to an individual on the basis of financial and personal information.
Three main approaches have historically been applied in this area: statistical approaches, machine learning and neural networks.
Statistical approaches are generally characterised by the inclusion of a probability model. This model provides not only the classification but also the probability of belonging to each class. Since the techniques are applied by people, some human intervention in variable selection or in structuring the problem is expected.
Classification within the statistical community has developed in two main phases. The first, classical phase focuses on derivatives of Fisher’s early work on linear discrimination. The second, modern phase draws on a wider range of model classes, which try to estimate the joint distribution of the features within each class; this estimate can in turn be used to develop a classification rule.
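As an illustration of a probability model that yields both a class and a class membership probability, the following minimal sketch fits one Gaussian distribution per class to a single feature and applies Bayes' rule. The Gaussian assumption, the function names and the expression values are all hypothetical choices made for this example; they are not taken from the methods cited in this chapter.

```python
import math
from statistics import mean, pstdev

def fit_gaussian_classes(values, labels):
    """Estimate a prior, mean and standard deviation of the feature for each class."""
    model = {}
    for cls in set(labels):
        xs = [v for v, l in zip(values, labels) if l == cls]
        model[cls] = {
            "prior": len(xs) / len(values),
            "mean": mean(xs),
            "std": pstdev(xs),  # assumes each class has some spread (std > 0)
        }
    return model

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def posterior(model, x):
    """Bayes' rule: P(class | x) is proportional to prior * p(x | class)."""
    joint = {
        cls: p["prior"] * gaussian_pdf(x, p["mean"], p["std"])
        for cls, p in model.items()
    }
    total = sum(joint.values())
    return {cls: j / total for cls, j in joint.items()}

def classify(model, x):
    """Classification rule: choose the class with the highest posterior probability."""
    probs = posterior(model, x)
    return max(probs, key=probs.get)

# Hypothetical expression levels of one gene in eight samples.
expression = [1.0, 1.2, 0.8, 1.1, 3.0, 3.3, 2.9, 3.1]
phenotype = ["healthy"] * 4 + ["disease"] * 4

model = fit_gaussian_classes(expression, phenotype)
print(classify(model, 3.2), posterior(model, 3.2))
```

The per-class densities play the role of the estimated distribution of the feature within each class, and the classification rule follows from them directly. A value lying very far from every class mean would make all densities underflow to zero, so a practical implementation would work with log-probabilities instead.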
Machine learning includes computing procedures that are based on logical or binary operations and that learn a task from a series of examples. Machine learning methods try to produce classifying expressions simple enough to be understood by a human, and they try to mimic human reasoning in order to provide insight into the decision process. Like statistical approaches, machine learning uses background knowledge; however, it operates without human intervention. Machine learning focuses on decision-tree approaches, in which classification is the result of a sequence of logical steps.
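To make the idea of learning a readable logical rule from examples concrete, the sketch below learns a one-level decision tree (a decision stump) by exhaustive search. It is a simplified stand-in for a full decision-tree algorithm; the function names, the two-class restriction and the data are assumptions made for this example.

```python
def learn_stump(samples, labels):
    """Pick the (feature, threshold) rule that misclassifies the fewest examples.

    Assumes exactly two distinct class labels.
    """
    classes = sorted(set(labels))
    best = None
    for f in range(len(samples[0])):
        for threshold in sorted({s[f] for s in samples}):
            # Try both ways of assigning the two classes to the two branches.
            for low, high in ((classes[0], classes[1]), (classes[1], classes[0])):
                errors = sum(
                    1
                    for s, y in zip(samples, labels)
                    if (low if s[f] <= threshold else high) != y
                )
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, low, high)
    _, f, threshold, low, high = best
    return {"feature": f, "threshold": threshold, "low": low, "high": high}

def predict(stump, sample):
    """Apply the learned logical test to a new sample."""
    if sample[stump["feature"]] <= stump["threshold"]:
        return stump["low"]
    return stump["high"]

def describe(stump):
    """Render the rule in a form that a person can read."""
    return "IF gene[{feature}] <= {threshold} THEN {low} ELSE {high}".format(**stump)

# Hypothetical expression values of two genes in six samples.
samples = [[0.2, 5.1], [0.4, 1.0], [0.3, 4.8], [2.1, 0.9], [2.5, 5.0], [1.9, 1.2]]
labels = ["normal", "normal", "normal", "tumour", "tumour", "tumour"]

stump = learn_stump(samples, labels)
print(describe(stump))
```

A full decision tree would apply the same search recursively to the samples falling in each branch, producing a sequence of such tests.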
Neural networks have different applications ranging from understanding and imitating the human brain, to practical scientific, commercial and engineering disciplines of pattern recognition, modelling, and prediction.
Neural networks cover a range of techniques; however, they all consist of layers of interconnected nodes, where each node produces a non-linear function of its input. The input to a node may come from the input data or from other nodes. A complete network represents a complex set of interdependencies that may incorporate any degree of nonlinearity, which allows general functions to be modelled. It has been argued that, to a certain extent, neural networks mirror the behaviour of networks of neurons in the brain.
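The layered structure described above can be sketched as a forward pass through a small network. This is only an illustration: the choice of tanh as the non-linear function, the network size and the hand-picked weights are assumptions, and no training procedure is shown.

```python
import math

def layer(inputs, weights, biases):
    """Each node applies a non-linear function (here tanh) to a weighted sum of its inputs."""
    return [
        math.tanh(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]

def forward(x, network):
    """Feed the input through the layers; each layer's output is the next layer's input."""
    activation = x
    for weights, biases in network:
        activation = layer(activation, weights, biases)
    return activation

# Hypothetical network: 2 inputs -> 2 hidden nodes -> 1 output node.
network = [
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),  # hidden layer: one weight row per node
    ([[1.0, -1.0]], [0.0]),                    # output layer
]

print(forward([1.0, 0.0], network))
print(forward([0.0, 1.0], network))
```

The first layer takes its input from the data and the second from the outputs of other nodes, which corresponds to the two sources of node input mentioned above.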
classes by means of certain, not necessarily linear, functions. Classification algorithms based on linear separability have been developed by Bennett and her colleagues [24, 25].
Over the last decade, different approaches have been proposed for finding piecewise linear functions that separate two sets. Bennett et al. [27] develop the concept of bilinear separability, in which two hyperplanes are used to separate the sets. Astorino et al. [7] introduce the concept of polyhedral separability. In the latter case, one of the sets is approximated by a polyhedral set and the rest of the space is used to approximate the second set. The number of hyperplanes is not restricted; however, the piecewise linear function is polyhedral, that is, convex. In many real situations, sets can be separated neither by a few hyperplanes nor by convex piecewise linear functions.
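The difference between a convex (polyhedral) piecewise linear function and a more general one can be sketched as follows. The polyhedral function is written as a maximum of affine functions, and the non-convex function as a maximum of minima of affine functions. The sign convention (negative values for the first set, positive for the second), the function names and the example hyperplanes are assumptions made for this illustration and are not taken from the formulations in [7] or [27].

```python
def affine(x, w, gamma):
    """Value of the affine function <w, x> - gamma at the point x."""
    return sum(wi * xi for wi, xi in zip(w, x)) - gamma

def polyhedral(x, hyperplanes):
    """Maximum of affine functions: a convex piecewise linear function.

    It is negative exactly inside the polyhedron defined by the hyperplanes.
    """
    return max(affine(x, w, g) for w, g in hyperplanes)

def max_min(x, groups):
    """Maximum over groups of the minimum within each group: not necessarily convex."""
    return max(min(affine(x, w, g) for w, g in group) for group in groups)

# Polyhedral set: the box |x1| < 1, |x2| < 1, described by four hyperplanes.
box = [([1.0, 0.0], 1.0), ([-1.0, 0.0], 1.0), ([0.0, 1.0], 1.0), ([0.0, -1.0], 1.0)]
print(polyhedral([0.0, 0.0], box))  # inside the box
print(polyhedral([2.0, 0.0], box))  # outside the box

# Non-convex region: x1 < -1 or x1 > 1, which no single polyhedron can describe
# without also containing the points in between.
two_sided = [[([1.0, 0.0], -1.0), ([-1.0, 0.0], -1.0)]]
for point in ([-2.0, 0.0], [0.0, 0.0], [2.0, 0.0]):
    print(point, max_min(point, two_sided))
```

In the second example the first set consists of two disconnected pieces, which is the kind of situation in which a convex piecewise linear function cannot separate the sets.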
Support Vector Machine algorithms have been developed by Burges [35], Vapnik [129] and Thorsten [126].
An algorithm based on polyhedral separability has been introduced by Astorino and colleagues [7], and another based on max-min separability has been developed by Bagirov [18].
Among these algorithms, only Support Vector Machine algorithms have been applied to gene expression data sets.
Supervised microarray data analysis (like any supervised data analysis process) includes four stages:
1. Construction of a classifier or model: We need a set of genes (training set), functional classes to which these genes belong (dependent variables), and independent variables that describe characteristics of the genes.
2. A learning phase: Training data are analysed by a classification algorithm.
3. A testing phase: The test data are used to assess the accuracy of the classifier.
4. An application phase: The classifier predicts the class labels of unknown gene expression values.
Other methods for analysing microarray data include linear discriminant analysis, decision trees, nearest neighbours and support vector machines.
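The four stages listed above can be sketched with a very simple classifier. The example below uses a one-nearest-neighbour rule purely for illustration; the expression profiles, the functional class names and the function names are hypothetical.

```python
import math

def train(samples, labels):
    """Learning phase: a nearest-neighbour classifier simply stores the labelled data."""
    return list(zip(samples, labels))

def predict(model, sample):
    """Return the label of the closest stored training sample (Euclidean distance)."""
    _, label = min(model, key=lambda pair: math.dist(pair[0], sample))
    return label

def accuracy(model, samples, labels):
    """Testing phase: fraction of held-out samples whose class is predicted correctly."""
    correct = sum(predict(model, s) == y for s, y in zip(samples, labels))
    return correct / len(samples)

# Stage 1: genes with known functional classes, each described by its
# expression level under two conditions (hypothetical values).
train_profiles = [[0.1, 0.2], [0.3, 0.1], [2.0, 2.2], [2.3, 1.9]]
train_classes = ["ribosomal", "ribosomal", "metabolic", "metabolic"]
test_profiles = [[0.2, 0.3], [2.1, 2.0]]
test_classes = ["ribosomal", "metabolic"]

# Stage 2: learning phase.
model = train(train_profiles, train_classes)

# Stage 3: testing phase, using data not seen during learning.
test_accuracy = accuracy(model, test_profiles, test_classes)
print("test accuracy:", test_accuracy)

# Stage 4: application phase, for a gene whose class is unknown.
print("predicted class:", predict(model, [1.9, 2.4]))
```

Keeping the test profiles separate from the training profiles is what makes the accuracy estimate in stage 3 meaningful for new genes.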
Validation will require the use of data other than those used to develop the classifier. Validation issues arise including questions regarding the applicability of the new algorithms