
Kernel methods are algorithms that operate on a data representation known as a kernel matrix. Kernel matrices provide a general framework for representing data and satisfy certain mathematical properties. A kernel matrix is defined not in terms of individual variables but in terms of the pairwise similarity among all variables. So, instead of using a mapping $\phi : X \to F$ to represent each object $x \in X$ by $\phi(x) \in F$, a real-valued similarity function $k : X \times X \to \mathbb{R}$ is used, and a dataset with $n$ variables is represented by an $n \times n$ matrix of pairwise similarities $k_{ij} = k(x_i, x_j)$. The most significant fact about these methods is that once we have a kernel matrix representation of the data, the original data is no longer required and the methods can work on these matrices alone. This is where the real beauty of these methods arises: different data types do not necessitate changes in the underlying algorithm. Kernel methods require that a kernel matrix is symmetric and positive semi-definite. This means that if $k$ is an $n \times n$ matrix of pairwise similarities, then $k_{ij} = k_{ji}$ for $1 \le i, j \le n$, and $c^T k c \ge 0$ for any $c \in \mathbb{R}^n$. This also implies that the matrix has non-negative real eigenvalues.
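The two requirements above (symmetry and positive semi-definiteness) can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function name `is_valid_kernel_matrix`, the tolerance `tol`, and the toy data `X` are illustrative choices and not part of the method described here.

```python
import numpy as np

def is_valid_kernel_matrix(K, tol=1e-8):
    """Return True if K is square, symmetric and positive semi-definite (up to tol)."""
    K = np.asarray(K, dtype=float)
    if K.ndim != 2 or K.shape[0] != K.shape[1]:
        return False
    if not np.allclose(K, K.T, atol=tol):
        return False
    # eigvalsh assumes a symmetric matrix and returns real eigenvalues.
    eigenvalues = np.linalg.eigvalsh(K)
    # Allow tiny negative values caused by floating-point round-off.
    return bool(eigenvalues.min() >= -tol)

# Toy data (illustrative only): 4 objects described by 3 real-valued features.
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [2.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])

# Pairwise dot products, k_ij = x_i^T x_j, always give a valid kernel matrix.
K = X @ X.T
print(is_valid_kernel_matrix(K))
```

The tolerance is needed because a rank-deficient kernel matrix has eigenvalues that are zero in exact arithmetic but may come out slightly negative in floating point.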

Each similarity value ($k_{ij}$) in a kernel matrix is calculated using a so-called kernel function $k(x, y)$ that acts as a suitable similarity measure between the variables. Hence, a real-valued kernel matrix can be obtained for diverse data types (such as strings and graphs) as long as a similarity function can be defined over a pair. This property leads to a complete separation of the definition of the similarity function from the algorithms that operate on these matrices. This is especially useful in bioinformatics because of the diverse types of datasets (as pointed out in the previous chapter) where a real-valued representation of individual variables is non-intuitive while a similarity score makes sense, e.g. genomic sequences. We will see different types of kernels in Section 5.2.1.

5.2.1 Various kernel or similarity functions

We provide a short description of various possible kernels for different data types (vectors, strings and graphs) and their properties.

Vector Data

• The Linear or Dot kernel is the simplest one.

$$k_L(x, x') = x^T x' \tag{5.2}$$

• The Polynomial kernel is a more general case of the linear kernel

$$k_{Poly}(x, x') = (x^T x' + c)^d \tag{5.3}$$

where $d$ is the degree of the polynomial and $c$ is a constant. For example, with $d = 2$ and a two-dimensional input $x = (x_1, x_2)$, a non-zero $c$ gives a kernel that corresponds to a feature space spanned by all products of at most 2 variables, i.e., $\{1, x_1, x_2, x_1^2, x_1 x_2, x_2^2\}$. When $c$ is zero, this space is restricted to only the products of exactly 2 variables, i.e., $\{x_1^2, x_1 x_2, x_2^2\}$.

• The most widely used kernel function for real-valued data is the Gaussian or Radial Basis Function (RBF) kernel

$$k_G(x, x') = \exp\left(-\frac{\|x - x'\|_2^2}{2\sigma^2}\right) \tag{5.4}$$

where $\sigma$ controls the width of the Gaussian. This affinity function naturally encodes the local neighbourhood property: its value falls rapidly as the pairwise dissimilarity increases.

• Another commonly used kernel is the Sigmoid kernel

$$k_S(x, x') = \tanh(\kappa\, x^T x' + \theta) \tag{5.5}$$

where $\kappa > 0$ and $\theta < 0$ are the gain and threshold.
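The four vector kernels listed above can be sketched in a few lines. This is an illustrative sketch rather than a reference implementation: it assumes NumPy, the common $2\sigma^2$ convention in the denominator of the Gaussian kernel, and the usual $\tanh$ form of the sigmoid kernel. The function names, default parameter values, and the helper `kernel_matrix` are hypothetical.

```python
import numpy as np

def linear_kernel(x, y):
    """Dot product of two vectors (Equation 5.2)."""
    return float(np.dot(x, y))

def polynomial_kernel(x, y, c=1.0, d=2):
    """Polynomial kernel of degree d with constant c (Equation 5.3)."""
    return float((np.dot(x, y) + c) ** d)

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel; assumes a 2*sigma^2 denominator (Equation 5.4)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))

def sigmoid_kernel(x, y, kappa=1.0, theta=-1.0):
    """Sigmoid kernel with gain kappa and threshold theta; assumes tanh (Equation 5.5)."""
    return float(np.tanh(kappa * np.dot(x, y) + theta))

def kernel_matrix(X, kernel, **params):
    """Build the n x n matrix of pairwise similarities k_ij = kernel(x_i, x_j)."""
    n = len(X)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j], **params)
    return K

# Toy data (illustrative only): three points in the plane.
X = np.array([[1.0, 2.0], [3.0, 4.0], [0.0, 1.0]])
K_rbf = kernel_matrix(X, gaussian_kernel, sigma=2.0)
```

Because `kernel_matrix` only needs a function of two objects, the same helper would work for any other similarity function defined over a pair.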

Graph Data

A graph is informally defined as a set of nodes connected by edges. In bioinformatics, typical examples of a graph are the interactions between the proteins of an organism or the interaction network representing a metabolic pathway. Other common examples of such graphs are social networks and hyperlinked web pages. While a graph represents local similarity, i.e., a node's direct interactions in its neighbourhood, we need a similarity function that represents global similarity, i.e., a node's interaction with every other node in the graph. The simplest measure of similarity on a graph is the shortest-path distance, but it does not yield a positive semi-definite matrix, which is our requirement. It is also very sensitive to insertions and deletions of edges. A more robust similarity measure is required, one which could perhaps average over many paths. The physical process of diffusion suggests a natural way of propagating such local information and has led to the most popular type of similarity on graphs, known as the diffusion kernel (Kondor and Lafferty, 2002).

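The Laplacian $L$ of an undirected, unweighted graph is defined as

$$L_{ij} = \begin{cases} d_i & \text{if } i = j \\ -1 & \text{if } i \neq j \text{ and nodes } i \text{ and } j \text{ are connected by an edge} \\ 0 & \text{otherwise} \end{cases} \tag{5.6}$$

where $d_i$ is the number of edges originating from the $i$th node. The kernel function on the graph can be defined using the negative of this Laplacian ($H = -L$) as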

$$K_\beta = e^{\beta H} = \lim_{m \to \infty} \left(I + \frac{\beta H}{m}\right)^m \tag{5.7}$$

where $\beta$ is a positive constant and $I$ is the identity matrix. $K_\beta$ represents an exponential family of similarity functions with generator $H$ and bandwidth parameter $\beta$.

Using the power series expansion, this can be written as

$$K_\beta = I + \beta H + \frac{\beta^2 H^2}{2!} + \frac{\beta^3 H^3}{3!} + \ldots \tag{5.8}$$

Note that $e^{\beta H}$ is a matrix exponential, which is not the same as the component-wise exponentiation $e^{\beta H_{ij}}$. If a matrix $D$ is diagonal, then its exponential can be obtained by simply exponentiating every entry on the diagonal, i.e., $e^D = \mathrm{diag}(e^{d_{11}}, e^{d_{22}}, \ldots, e^{d_{nn}})$. This important property can be used for computing the exponential: if we diagonalise $H$, i.e., if $H = UDU^{-1}$ and $D$ is diagonal, then $e^H = Ue^DU^{-1}$. Based on this, we have used the technique discussed in Moler and Van Loan (2003) to compute our matrix exponentials. It involves computing the eigenvalues $\lambda_i$ and normalized eigenvectors $v_i$ of $H$:

$$H = \sum_{i=1}^{n} v_i \lambda_i v_i^T \tag{5.9}$$

which, when substituted into Equation 5.8, gives

$$K_\beta = \sum_{i=1}^{n} v_i e^{\beta \lambda_i} v_i^T \tag{5.10}$$
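The eigendecomposition route to the diffusion kernel can be sketched as follows. This is a minimal illustration, assuming NumPy, a symmetric 0/1 adjacency matrix as input, and the Laplacian defined as the degree matrix minus the adjacency matrix; the function names and the toy path graph are hypothetical. A truncated power series is included only as a cross-check of the eigendecomposition result.

```python
import numpy as np

def graph_laplacian(adjacency):
    """Laplacian of an undirected, unweighted graph: degree matrix minus adjacency."""
    A = np.asarray(adjacency, dtype=float)
    degrees = A.sum(axis=1)          # number of edges at each node
    return np.diag(degrees) - A

def diffusion_kernel(adjacency, beta=1.0):
    """K_beta = sum_i v_i exp(beta * lambda_i) v_i^T, with H = -L."""
    H = -graph_laplacian(adjacency)
    # H is symmetric, so eigh returns real eigenvalues and orthonormal eigenvectors.
    eigenvalues, eigenvectors = np.linalg.eigh(H)
    # Scale column i of the eigenvector matrix by exp(beta * lambda_i).
    return (eigenvectors * np.exp(beta * eigenvalues)) @ eigenvectors.T

def diffusion_kernel_series(adjacency, beta=1.0, terms=30):
    """Truncated power series I + beta*H + (beta*H)^2/2! + ... (cross-check only)."""
    H = -graph_laplacian(adjacency)
    n = H.shape[0]
    K = np.eye(n)
    term = np.eye(n)
    for m in range(1, terms):
        term = term @ (beta * H) / m   # term now equals (beta*H)^m / m!
        K = K + term
    return K

# Toy graph (illustrative only): a path 0 - 1 - 2 - 3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
K = diffusion_kernel(A, beta=0.5)
```

Since every eigenvalue of $H$ is mapped to a strictly positive value by the exponential, the resulting matrix is symmetric and positive definite, which is the property required of a kernel matrix.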

This similarity function is also known as the diffusion function because its differential equation form resembles the diffusion equation of heat through continuous media in classical physics (Kondor and Lafferty, 2002). The role of $\beta$ is to control the extent of diffusion, similar to $\sigma$ in the Gaussian kernel. In fact, as shown in Schölkopf et al. (2004), there is a straightforward correspondence between the diffusion kernel and the Gaussian kernel: the former can be considered a discretized version of the latter. In the next section we discuss our actual technique of similarity matrix integration.

5.2.2 From similarities to a valid kernel

Sometimes we have a well-defined measure of similarity between a pair of objects, but the resulting matrix is not a valid kernel matrix according to the strict definition of positive semi-definiteness. For such cases, two methods have been proposed in the literature for converting the similarity matrix into a valid kernel. Tsuda (1999) has proposed a principled technique called the empirical kernel map. Roth et al. (2002) have proposed an ad hoc technique based on an eigen-decomposition of the similarity matrix followed by removal of the negative eigenvalues. They have also shown that this preserves the cluster structure of the data. When we are not sure whether a similarity matrix is a kernel matrix, one of these techniques can be used to make it one.
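Both ideas can be illustrated with a short sketch. It assumes NumPy and makes two simplifying assumptions that should be checked against the cited papers: negative eigenvalues are "removed" by setting them to zero, and the empirical kernel map is taken to represent each object by its vector of similarities to all objects, so that the kernel matrix becomes $SS^T$. The function names and the toy similarity matrix are hypothetical, and this is not claimed to reproduce the exact procedures of Tsuda (1999) or Roth et al. (2002).

```python
import numpy as np

def clip_negative_eigenvalues(S):
    """Make a symmetric similarity matrix PSD by zeroing its negative eigenvalues."""
    S = np.asarray(S, dtype=float)
    S = (S + S.T) / 2.0                        # enforce exact symmetry
    eigenvalues, eigenvectors = np.linalg.eigh(S)
    clipped = np.clip(eigenvalues, 0.0, None)  # negative eigenvalues become 0
    return (eigenvectors * clipped) @ eigenvectors.T

def empirical_kernel_map(S):
    """Assumed form: rows of S are used as feature vectors, giving K = S S^T."""
    S = np.asarray(S, dtype=float)
    return S @ S.T

# Toy similarity matrix (illustrative only): symmetric but not PSD.
S = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.9],
              [0.1, 0.9, 1.0]])

K_clip = clip_negative_eigenvalues(S)
K_emp = empirical_kernel_map(S)
```

A matrix that is already positive semi-definite passes through `clip_negative_eigenvalues` unchanged (up to round-off), so the function is safe to apply when the status of a similarity matrix is unknown.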

5.2.3 Kernel normalization

In order to add kernels, we need to normalize them so that they are on the same scale. Given an unnormalized kernel matrix $K$, the normalized version is

$$\hat{K}_{ij} = \frac{K_{ij}}{\sqrt{K_{ii} K_{jj}}} \tag{5.11}$$

This can be easily computed if we define the column vector $A = (1/\sqrt{K_{11}}, \ldots, 1/\sqrt{K_{nn}})^T$. Then $\hat{K} = K * (AA^T)$, where $*$ denotes the element-wise product.
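The element-wise formulation translates directly into code. The sketch below assumes NumPy; the function name `normalize_kernel`, the check for a strictly positive diagonal, and the toy matrix are illustrative additions.

```python
import numpy as np

def normalize_kernel(K):
    """Return K_hat with K_hat[i, j] = K[i, j] / sqrt(K[i, i] * K[j, j])."""
    K = np.asarray(K, dtype=float)
    diag = np.diag(K)
    if np.any(diag <= 0):
        # The formula divides by sqrt(K_ii), so the diagonal must be positive.
        raise ValueError("normalization requires strictly positive diagonal entries")
    a = 1.0 / np.sqrt(diag)          # the vector A from the text
    return K * np.outer(a, a)        # element-wise product with A A^T

# Toy kernel matrix (illustrative only).
K = np.array([[4.0, 2.0],
              [2.0, 9.0]])
K_hat = normalize_kernel(K)
```

After normalization every diagonal entry equals 1, so kernels computed from different data sources can be added without one of them dominating purely because of its scale.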