

Chapter 2 Biological background and computational methods

2.4 Machine learning

The term machine learning was introduced in 1959 by Arthur Samuel in an article published in the IBM Journal of Research and Development [87]. Machine learning refers to the ability of computer systems to solve problems without being explicitly programmed [88]. Researchers in the field study and construct algorithms that build a model from input data; the resulting model can then be used to make predictions on new, unseen data [89]. Broadly speaking, there are two main approaches to machine learning: supervised and unsupervised learning. Supervised learning starts with the goal of predicting a known output or target [90]. In unsupervised learning, by contrast, there are no outputs to predict; instead, the learning algorithm tries to find naturally occurring patterns or groupings within the data [90]. Examples of supervised learning algorithms include linear regression, the naive Bayes classifier, and support vector machines. Unsupervised learning algorithms include diverse clustering methods such as hierarchical clustering and k-means clustering.

2.4.1 Support Vector Machines

A support vector machine (SVM) [91] is a supervised learning model used for data classification and regression analysis. As with other supervised methods, the model must first be trained on a suitable training set of "positive" and "negative" data points. SVM training constructs a hyperplane that separates the training data belonging to these two classes (Figure 2.9).


Figure 2.9 An example SVM in two-dimensional space. Image adapted from [91].

Let the n points in the training data be

\[ (\vec{x}_1, y_1), \ldots, (\vec{x}_n, y_n) \]

where $y_i$ indicates the class to which the point $\vec{x}_i$ belongs. Values of $y_i$ are either -1 or 1. Each $\vec{x}_i$ is a p-dimensional vector. Our goal here is to find the "optimal hyperplane" that divides the group of points $\vec{x}_i$ for which $y_i = 1$ from the group of points for which $y_i = -1$, so that the distance between the hyperplane and the nearest point $\vec{x}_i$ from either group is maximized (optimal margin).

If the training data is linearly separable, the classification function f is a linear function:

\[ f(\vec{x}) = w^T \vec{x} + b \]

where w and b are the parameters of the classifier. The class of $\vec{x}$ is the sign of the function $f(\vec{x})$. The hyperplane can be written as the set of points $\vec{x}$ satisfying

\[ f(\vec{x}) = w^T \vec{x} + b = 0 \]

and the two margins as follows:

\[ w^T \vec{x} + b = 1 \quad \text{and} \quad w^T \vec{x} + b = -1 \]


For every data point, we have $y_i (w^T \vec{x}_i + b) \geq 1$.
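The decision rule and the margin constraint can be checked numerically. The following is a minimal sketch in plain Python; the values of w and b and the four toy points are made-up illustrations, not a trained model, and the helper names are hypothetical.

```python
# Illustrative sketch only: w, b, and the toy points are invented values.

def decision_function(w, x, b):
    """Return f(x) = w^T x + b for plain Python sequences w and x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, x, b):
    """The class is the sign of f(x); points exactly on the hyperplane map to +1 here."""
    return 1 if decision_function(w, x, b) >= 0 else -1


def satisfies_margin(w, b, points, labels):
    """Check y_i * (w^T x_i + b) >= 1 for every training point."""
    return all(y * decision_function(w, x, b) >= 1 for x, y in zip(points, labels))


w = [1.0, 1.0]          # assumed weight vector
b = -3.0                # assumed offset
points = [(1.0, 1.0), (0.0, 1.0), (2.0, 2.0), (3.0, 2.0)]
labels = [-1, -1, 1, 1]

print([predict(w, x, b) for x in points])
print(satisfies_margin(w, b, points, labels))
```

In this toy setting the points (1, 1) and (2, 2) lie exactly on the two margins, so they play the role of support vectors.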

If the data is not linearly separable, we may allow misclassification. By adding a cost $\varepsilon_i \geq 0$ for each data point, the optimization constraints become

\[ y_i (w^T \vec{x}_i + b) \geq 1 - \varepsilon_i \]
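For a fixed hyperplane, the smallest cost that satisfies this constraint for a given point can be computed directly. The sketch below assumes a hypothetical hyperplane (w, b) and three invented points; the function name `slack` is illustrative.

```python
def slack(w, b, x, y):
    """Smallest eps >= 0 such that y * (w^T x + b) >= 1 - eps holds."""
    f = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * f)


w, b = [1.0, 1.0], -3.0  # assumed hyperplane, not taken from the text

# Three points that all carry the label +1 (invented for illustration).
samples = [((3.0, 2.0), 1), ((2.0, 1.5), 1), ((1.0, 1.0), 1)]

for x, y in samples:
    print(x, y, slack(w, b, x, y))
```

The first point is outside the margin and needs no cost, the second needs a cost between 0 and 1, and the third needs a cost greater than 1.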

If $0 < \varepsilon_i < 1$, the point $\vec{x}_i$ is correctly classified but lies within the margin. If $\varepsilon_i = 1$, the point lies on the hyperplane, and if $\varepsilon_i > 1$, it is on the wrong side of it. We want to maximize the margin while minimizing the total cost. Another approach is to use a non-linear classifier by transforming the data into a higher-dimensional space. This transformation is achieved using kernel functions. Examples of kernel functions include polynomial, hyperbolic tangent, and Gaussian radial basis functions.
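The three kernel families named above can be written as short functions. This is a sketch using common textbook forms; the parameter names and default values (`degree`, `coef0`, `scale`, `gamma`) are assumptions, since the text does not fix a parameterization.

```python
import math


def dot(u, v):
    """Ordinary inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def polynomial_kernel(u, v, degree=2, coef0=1.0):
    """K(u, v) = (u . v + coef0) ** degree."""
    return (dot(u, v) + coef0) ** degree


def tanh_kernel(u, v, scale=1.0, coef0=0.0):
    """K(u, v) = tanh(scale * u . v + coef0)."""
    return math.tanh(scale * dot(u, v) + coef0)


def rbf_kernel(u, v, gamma=0.5):
    """Gaussian radial basis function: K(u, v) = exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


u, v = (1.0, 2.0), (3.0, 4.0)
print(polynomial_kernel(u, v), tanh_kernel(u, v), rbf_kernel(u, v))
```

Each function returns a similarity between two points that corresponds to an inner product in a transformed space, so the transformation itself never has to be computed explicitly.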

So far, our SVM model only works with two classes (binary classifier). One approach to classifying more than two classes is to reduce the single multiclass problem to multiple binary classification problems [92]. Common methods for such a reduction include one-against-all [93], one-against-one [94], and directed acyclic graph SVM [95].
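As an illustration of the one-against-all idea, the sketch below assumes one linear binary classifier per class ("this class versus the rest") and predicts the class whose decision value is largest. The combination rule, the class names, and all parameter values are assumptions made for illustration; the cited methods may differ in detail.

```python
def one_against_all_predict(models, x):
    """Pick the class whose binary classifier gives the largest f(x).

    models maps a class label to the (w, b) pair of a hypothetical
    binary classifier trained to separate that class from all others.
    """
    def score(w, b):
        return sum(wi * xi for wi, xi in zip(w, x)) + b

    return max(models, key=lambda label: score(*models[label]))


# Invented parameters for three classes in two dimensions.
models = {
    "A": ([1.0, 0.0], 0.0),
    "B": ([0.0, 1.0], 0.0),
    "C": ([-1.0, -1.0], 0.0),
}

for x in [(2.0, 1.0), (0.0, 3.0), (-1.0, -2.0)]:
    print(x, one_against_all_predict(models, x))
```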

2.4.2 Model validation and evaluation

It is often useful to measure the performance of the model so that we can choose an appropriate method for a specific problem or tune the parameters of the model to improve the results. There are many metrics that can be used to measure the performance of a classifier. Performance measures are usually based on:

• Success: the class label of a data point is predicted correctly
• Error: the class label of a data point is predicted incorrectly

Examples of performance metrics include:

• Error rate: proportion of incorrectly classified instances over the whole set of instances
• Accuracy: proportion of correctly classified instances over the whole set of instances

In the field of machine learning, the performance of an algorithm is usually visualized with a specific table called a confusion matrix (Table 2.4).


Table 2.4 Confusion matrix

                          Predicted condition positive   Predicted condition negative
True condition positive   True positive (TP)             False negative (FN)
True condition negative   False positive (FP)            True negative (TN)
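The four cells of Table 2.4 can be counted from a list of true labels and a list of predicted labels. The sketch below assumes labels of 1 (positive) and -1 (negative); the example label lists are invented.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FN, FP, and TN for a binary classification result."""
    tp = fn = fp = tn = 0
    for truth, predicted in zip(y_true, y_pred):
        if truth == positive and predicted == positive:
            tp += 1          # positive, predicted positive
        elif truth == positive:
            fn += 1          # positive, predicted negative
        elif predicted == positive:
            fp += 1          # negative, predicted positive
        else:
            tn += 1          # negative, predicted negative
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}


y_true = [1, 1, 1, -1, -1, -1, -1, 1]
y_pred = [1, -1, 1, -1, 1, -1, -1, 1]

print(confusion_counts(y_true, y_pred))
```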

The following metrics can be derived from Table 2.4:

• Accuracy (ACC) = (TP + TN) / (TP + TN + FP + FN)
• Prevalence = (TP + FN) / (TP + TN + FP + FN)
• Positive predictive value (PPV), Precision = TP / (TP + FP)
• False discovery rate (FDR) = FP / (TP + FP)
• False omission rate (FOR) = FN / (TN + FN)
• Negative predictive value (NPV) = TN / (TN + FN)
• True positive rate (TPR), Recall, Sensitivity, probability of detection = TP / (TP + FN)
• False positive rate (FPR), Fall-out, probability of false alarm = FP / (FP + TN)
• Specificity (SPC), Selectivity, True negative rate (TNR) = TN / (FP + TN)
• False negative rate (FNR), Miss rate = FN / (TP + FN)
• Positive likelihood ratio (LR+) = TPR / FPR
• Negative likelihood ratio (LR−) = FNR / TNR
• Diagnostic odds ratio (DOR) = LR+ / LR−
• F1 score = 2TP / (2TP + FP + FN)
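The metrics in the list above translate directly into code. The sketch below computes them from the four confusion-matrix counts; the counts in the example call are invented, and the function assumes that no denominator is zero.

```python
def derived_metrics(tp, fn, fp, tn):
    """Compute the listed metrics from confusion-matrix counts.

    Assumes every denominator is non-zero; a real implementation would
    need to decide how to handle empty classes or empty predictions.
    """
    total = tp + tn + fp + fn
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    tnr = tn / (fp + tn)
    fnr = fn / (tp + fn)
    lr_pos = tpr / fpr
    lr_neg = fnr / tnr
    return {
        "ACC": (tp + tn) / total,
        "Prevalence": (tp + fn) / total,
        "PPV": tp / (tp + fp),
        "FDR": fp / (tp + fp),
        "FOR": fn / (tn + fn),
        "NPV": tn / (tn + fn),
        "TPR": tpr,
        "FPR": fpr,
        "TNR": tnr,
        "FNR": fnr,
        "LR+": lr_pos,
        "LR-": lr_neg,
        "DOR": lr_pos / lr_neg,
        "F1": 2 * tp / (2 * tp + fp + fn),
    }


m = derived_metrics(tp=40, fn=10, fp=5, tn=45)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Several of these quantities are complements of each other (for example PPV and FDR, or TPR and FNR), which gives a quick sanity check on any implementation.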

In the following, three methods for estimating classifier performance are explained. The first is the holdout method, which separates the data into two sets, one for training (training set) and the other for testing (test set). One disadvantage of this method is that fewer labeled examples are available for training (because the test set holds some of them). Consequently, the resulting model may not be as good as one trained on all the labeled examples [96]. The second method is cross-validation. In this method, the data is segmented into k equally sized partitions. Each iteration uses one of the partitions for testing and the remaining partitions for training. This procedure is repeated k times so that each partition is used for testing exactly once. A special case of cross-validation occurs when k is equal to the size of the data set, so that each test set contains only one record. This case is called leave-one-out cross-validation. The third method is the bootstrap. Unlike holdout and cross-validation, in which training records are sampled without replacement, in the bootstrap a record already chosen for training is put back into the original pool of records, so it can be drawn again.
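The three sampling schemes can be sketched as index generators over a data set of n records. The function names, the default test fraction, and the fixed seeds are assumptions. Two further choices are not stated in the text: partitions differ in size by at most one record when n is not divisible by k, and the bootstrap test set is taken to be the records never drawn for training.

```python
import random


def holdout_split(n, test_fraction=0.25, seed=0):
    """Shuffle the indices 0..n-1 once and cut them into train and test."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n * test_fraction))
    return idx[n_test:], idx[:n_test]


def k_fold_splits(n, k, seed=0):
    """Yield (train, test) index lists; each partition is the test set once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal partitions
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def bootstrap_sample(n, seed=0):
    """Draw n training indices with replacement; untouched records form the test set."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]
    chosen = set(train)
    test = [i for i in range(n) if i not in chosen]
    return train, test


print(holdout_split(8))
for train, test in k_fold_splits(6, 3):
    print(train, test)
print(bootstrap_sample(8))
```

Calling `k_fold_splits(n, n)` gives leave-one-out cross-validation, since every test set then contains exactly one record.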
