
SUPPORT VECTOR MACHINES

In document Pharmaceutical Data Mining (Page 150-153)


4.5 SUPPORT VECTOR MACHINES

In recent years, applications of support vector machines have become very popular in chemoinformatics. Support vector machines are a supervised binary classification approach [84,85]. The basic underlying idea is to linearly separate two classes of data in a suitable high-dimensional space representation such that (1) the classification error is minimized and (2) the margin separating the two classes is maximized. Accordingly, the popularity and success of this method can be attributed to the fact that, instead of only trying to minimize the classification error, support vector machines employ structural risk minimization methods to avoid overfitting effects. The structural risk minimization principle implies that the quality of a model depends not only on minimizing the number of classification errors but also on the inherent complexity of the model. That is, models with increasingly complex structures involve more risk, which means that they do not generalize well but tend to overfit the training data. Thus, following the basic ideas of support vector machines, finding a maximal margin separating hyperplane corresponds to minimizing the structural risk.

Overfitting is generally known to be a serious problem in machine learning and is typically a consequence of using only small training sets but many variables. For classification machines, this would mean using sparse training data but permitting many degrees of freedom to fit a data-separating boundary. Generally, this situation is referred to as the curse of dimensionality, which means that as the (feature) dimensionality increases, the size of the training data set needed to sample feature space with constant resolution grows exponentially. In principle, a support vector machine implements a linear classifier; however, using the so-called kernel trick, i.e., the mapping of data into a high-dimensional space via a kernel function, it is also capable of deriving nonlinear classifiers.
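The exponential growth behind the curse of dimensionality can be made concrete with a small sketch. It assumes, purely for illustration, that each feature axis is divided into a fixed number of bins and that one training sample per grid cell is wanted; the function name `samples_for_resolution` is hypothetical.

```python
def samples_for_resolution(bins_per_axis: int, n_dims: int) -> int:
    """Grid cells needed to cover n_dims features at a fixed per-axis resolution.

    Illustrative assumption: one training sample is wanted per grid cell.
    """
    return bins_per_axis ** n_dims


if __name__ == "__main__":
    # With 10 bins per axis, the required sample count grows by a
    # factor of 10 for every descriptor that is added.
    for n_dims in (1, 2, 5, 10):
        print(n_dims, samples_for_resolution(10, n_dims))
```

Even at this coarse resolution, a modest number of descriptors already demands far more compounds than a typical training set contains, which is why the complexity of the separating boundary has to be controlled.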

Let us consider a training set of overall size $m$ split into two classes, $A$ and $B$, of, for instance, active and inactive compounds. Each compound is described by an $n$-dimensional vector $x_i$ of numerical features such as descriptor values.

Compounds of class $A$ are assigned the value $y_i = +1$, $i \in A$, and those of class $B$ the value $y_i = -1$, $i \in B$. If linear separation is possible, the support vector machine is defined by a hyperplane that maximizes the margin, i.e., the closest distance from any point to the separating hyperplane. A hyperplane, $H$, is defined by a normal vector, $w$, and a scalar, $b$, so that

$$H: \langle x, w \rangle + b = 0, \tag{4.16}$$

where $\langle \cdot , \cdot \rangle$ defines a scalar product.

For the hyperplane $H$ to separate classes $A$ and $B$, it is required that all points $x_i$, $i \in A$, lie on one side of the hyperplane and all points $x_i$, $i \in B$, on the other. In algebraic terms, this is expressed as

$$\langle x_i, w \rangle + b \geq +1 \quad \text{for } i \in A, \text{ i.e., } y_i = +1 \tag{4.17}$$

$$\langle x_i, w \rangle + b \leq -1 \quad \text{for } i \in B, \text{ i.e., } y_i = -1. \tag{4.18}$$

Combining these inequalities yields

$$y_i(\langle x_i, w \rangle + b) - 1 \geq 0 \quad \text{for } i = 1, \ldots, m. \tag{4.19}$$

Points that meet the equality condition and are closest to the separating hyperplane define two hyperplanes,

$$H_{+1}: \langle x_i, w \rangle + b = +1 \tag{4.20}$$

and

$$H_{-1}: \langle x_i, w \rangle + b = -1, \tag{4.21}$$

parallel to the separating hyperplane $H$, which determine the margin. Their separating distance is $2/\|w\|$. So, minimizing $\|w\|$ with respect to the inequality constraints yields the maximum margin hyperplane, where the inequalities ensure correct classification and the minimization produces the minimal risk, i.e., the best generalization performance. Those points that lie on the margin are called the support vectors because they define the hyperplane $H$, as can be seen from Figure 4.4. These are the points for which equality holds in Equations 4.17 and 4.18.
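As a minimal sketch of the margin conditions, the following Python snippet evaluates the left-hand side of Equation 4.19 and the margin width $2/\|w\|$ for a toy data set. The four points, the vector `w`, and the scalar `b` are made-up illustrative values, not data from the text, and the helper names are hypothetical.

```python
import math


def dot(u, v):
    """Scalar product <u, v> of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def constraint_values(X, y, w, b):
    """Left-hand side of Eq. 4.19, y_i(<x_i, w> + b) - 1, for every point."""
    return [yi * (dot(xi, w) + b) - 1.0 for xi, yi in zip(X, y)]


def margin_width(w):
    """Distance 2 / ||w|| between the hyperplanes H_{+1} and H_{-1}."""
    return 2.0 / math.sqrt(dot(w, w))


# Toy data (assumed): two "active" (+1) and two "inactive" (-1) compounds.
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
y = [+1, +1, -1, -1]
w = (0.5, 0.5)   # assumed normal vector
b = -1.0         # assumed offset

values = constraint_values(X, y, w, b)
# All values >= 0 means every point is on the correct side of its margin.
separates = all(v >= 0.0 for v in values)
# Points with value 0 lie exactly on H_{+1} or H_{-1}: the support vectors.
support = [i for i, v in enumerate(values) if abs(v) < 1e-12]

if __name__ == "__main__":
    print(values, separates, support, margin_width(w))
```

In this toy case the two support vectors are the closest pair of points from opposite classes, so the margin width equals the distance between them and cannot be increased further.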

Figure 4.4 Maximal margin hyperplane. The maximal margin hyperplane $H$ is defined by the vector $w$ and the distance $|b|/\|w\|$ from the origin. The support vectors are indicated by solid circles. The classification errors are indicated by the dotted circles. The lines from the margins to the dotted circles indicate the magnitude of the slack variables.

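The basic technique for solving optimization problems under constraints is to introduce Lagrange multipliers $\alpha_i \geq 0$. The Lagrangian of the equivalent problem of minimizing $\frac{1}{2}\|w\|^2$ under the constraints of Equation 4.19,

$$L_P = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i y_i (\langle x_i, w \rangle + b) + \sum_{i=1}^{m} \alpha_i, \tag{4.22}$$

must be minimized with respect to $w$ and $b$ and maximized with respect to the $\alpha_i$. Setting the derivatives with respect to $w$ and $b$ to zero yields the conditions

$$w = \sum_{i=1}^{m} \alpha_i y_i x_i \tag{4.23}$$

$$\sum_{i=1}^{m} \alpha_i y_i = 0. \tag{4.24}$$

Substituting these conditions into Equation 4.22 yields the dual form

$$L_D = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle, \tag{4.25}$$

which is maximized with respect to the $\alpha_i$ subject to $\alpha_i \geq 0$ and Equation 4.24. This corresponds to a convex quadratic optimization problem that can be solved using iterative methods to yield a global maximum. Once the problem is solved, $w$ is obtained from Equation 4.23 and $b$ can be obtained from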

$$y_i(\langle x_i, w \rangle + b) - 1 = 0 \tag{4.26}$$

for any vector $x_i$ with $\alpha_i \neq 0$. The vectors $x_i$ with $\alpha_i \neq 0$ are exactly the support vectors, as the Lagrange multipliers will be 0 when equality does not hold in Equations 4.17 and 4.18. Once the hyperplane has been determined, compounds can be classified using the decision function

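$$f(x) = \operatorname{sgn}(\langle x, w \rangle + b) = \operatorname{sgn}\left(\sum_{i=1}^{m} \alpha_i y_i \langle x_i, x \rangle + b\right). \tag{4.27}$$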
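The step from the Lagrange multipliers to a working classifier can be sketched in a few lines of Python. The snippet below assumes a toy data set and multipliers `alpha` that were worked out by hand for it (they are illustrative values, not the output of a solver), recovers $w$ as the weighted sum of support vectors and $b$ from the equality condition at a support vector, and classifies new points using only scalar products with the training vectors. All function names are hypothetical.

```python
def dot(u, v):
    """Scalar product <u, v> of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def weight_vector(X, y, alpha):
    """w = sum_i alpha_i * y_i * x_i (stationarity condition in w)."""
    n = len(X[0])
    return [sum(a * yi * xi[k] for a, yi, xi in zip(alpha, y, X))
            for k in range(n)]


def bias(X, y, alpha, w):
    """Solve y_i(<x_i, w> + b) = 1 for b at the first support vector.

    Because y_i is +1 or -1, this gives b = y_i - <x_i, w>.
    """
    for xi, yi, a in zip(X, y, alpha):
        if a > 0.0:
            return yi - dot(xi, w)
    raise ValueError("no support vector (all alpha_i are zero)")


def dual_objective(X, y, alpha):
    """sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j <x_i, x_j>."""
    pairs = sum(ai * aj * yi * yj * dot(xi, xj)
                for ai, yi, xi in zip(alpha, y, X)
                for aj, yj, xj in zip(alpha, y, X))
    return sum(alpha) - 0.5 * pairs


def decide(x, X, y, alpha, b):
    """Decision function: sign of sum_i alpha_i y_i <x_i, x> + b."""
    s = sum(a * yi * dot(xi, x) for a, yi, xi in zip(alpha, y, X)) + b
    return 1 if s >= 0.0 else -1


# Toy data (assumed) and hand-derived multipliers for it.
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
y = [+1, +1, -1, -1]
alpha = [0.25, 0.0, 0.25, 0.0]   # nonzero only for the two support vectors

w = weight_vector(X, y, alpha)
b = bias(X, y, alpha, w)

if __name__ == "__main__":
    print(w, b, decide((3.0, 1.0), X, y, alpha, b))
```

For these values the dual objective equals $\frac{1}{2}\|w\|^2$, which is the expected behavior at the optimum of a convex problem, and only the two points with nonzero multipliers contribute to the decision.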

Usually, the condition of linear separability is too restrictive and, therefore, slack variables $\xi_i$ are introduced into the conditions of Equations 4.17 and 4.18, thereby relaxing them to permit limited classification errors:

$$\langle x_i, w \rangle + b \geq +1 - \xi_i \quad \text{for } i \in A \tag{4.28}$$

$$\langle x_i, w \rangle + b \leq -1 + \xi_i \quad \text{for } i \in B \tag{4.29}$$

$$\xi_i \geq 0 \quad \text{for } i = 1, \ldots, m. \tag{4.30}$$

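Figure 4.4 illustrates the introduction of slack variables. The dotted lines from the margins represent slack variables with positive values allowing for classification errors of the hyperplane. The objective function to be minimized under those constraints becomes, in its standard form,

$$\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i,$$

where the constant $C > 0$ weights the penalty for classification errors against the width of the margin.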
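A short sketch may help to show how slack variables enter the objective. It assumes the commonly used soft-margin form, $\frac{1}{2}\|w\|^2 + C\sum_i \xi_i$, with a penalty constant `C` chosen arbitrarily here; the data, `w`, and `b` are illustrative values, and the helper names are hypothetical. Each slack is taken as the smallest non-negative value that satisfies the relaxed constraint for its point.

```python
def dot(u, v):
    """Scalar product <u, v> of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def slack_values(X, y, w, b):
    """Smallest xi_i >= 0 satisfying y_i(<x_i, w> + b) >= 1 - xi_i."""
    return [max(0.0, 1.0 - yi * (dot(xi, w) + b)) for xi, yi in zip(X, y)]


def soft_margin_objective(X, y, w, b, C):
    """Assumed standard form: 1/2 ||w||^2 + C * sum_i xi_i."""
    return 0.5 * dot(w, w) + C * sum(slack_values(X, y, w, b))


# Toy data (assumed): the last point is labeled +1 but lies on the
# negative side of the hyperplane, so it needs a positive slack.
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0), (0.5, 0.5)]
y = [+1, +1, -1, -1, +1]
w = (0.5, 0.5)
b = -1.0
C = 1.0   # assumed penalty constant

slacks = slack_values(X, y, w, b)
objective = soft_margin_objective(X, y, w, b, C)

if __name__ == "__main__":
    print(slacks, objective)
```

A slack larger than 1 marks a point that is actually misclassified, while a slack between 0 and 1 marks a point that is on the correct side but inside the margin.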

As stated above, support vector machines are not limited to linear boundaries. Nonlinear boundaries can also be achieved by introducing kernel functions. Equation 4.25 only requires the calculation of the scalar product between two vectors and does not require an explicit representation of the vectors.

Conceptually, kernel functions correspond to a mapping of the original vectors into a high-dimensional space and calculating the scalar product there. Popular kernel functions include, for example, the Gaussian kernel function, polynomial functions, and sigmoid functions.

The flexibility of the kernel approach also makes it possible to define kernel functions on a wide variety of molecular representations that need not be numerical in nature. Azencott et al. [22] provide examples of a variety of kernel functions. For 1-D SMILES and 2-D graph representations, a spectral approach is used by building feature vectors that record either the presence or absence, or the number, of substrings or substructures. The constructed vectors are essentially fingerprints, and the kernel function is subsequently defined as a similarity measure on the basis of those fingerprints. Using 3-D structures, kernel functions can also be constructed for surface area representations and pharmacophores, or by considering pairwise distances between atoms recorded in histograms. Thus, different types of kernel functions make it possible to tackle diverse classification problems and ensure the general flexibility of the support vector machine approach.
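The kernel families named above can be written down compactly. The following sketch uses common textbook parameterizations; the parameter names (`sigma`, `degree`, `offset`, `kappa`) and their defaults are assumptions, not values from the text. The Tanimoto function is included only as one plausible example of a fingerprint-based similarity measure; the text does not state which measure Azencott et al. use.

```python
import math


def dot(u, v):
    """Scalar product <u, v> of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def gaussian_kernel(x, z, sigma=1.0):
    """exp(-||x - z||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def polynomial_kernel(x, z, degree=2, offset=1.0):
    """(<x, z> + offset)^degree."""
    return (dot(x, z) + offset) ** degree


def sigmoid_kernel(x, z, kappa=1.0, offset=0.0):
    """tanh(kappa * <x, z> + offset)."""
    return math.tanh(kappa * dot(x, z) + offset)


def tanimoto_kernel(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of 'on' bits.

    Assumed example of a fingerprint similarity; two empty fingerprints
    are given similarity 0.0 here by convention.
    """
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


if __name__ == "__main__":
    x, z = (1.0, 2.0), (3.0, 4.0)
    print(gaussian_kernel(x, z), polynomial_kernel(x, z), sigmoid_kernel(x, z))
    print(tanimoto_kernel({1, 2, 3}, {2, 3, 4}))
```

Any of these functions could replace the scalar product $\langle x_i, x_j \rangle$ in the dual problem and in the decision function, which is what allows the same linear machinery to produce nonlinear boundaries or to operate on non-numerical molecular representations.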
