
2.5 Large Scale Approximations

2.5.2 Low-Rank Approximations

In this section, we propose an alternative to the approach presented in Section 2.5.1 for deriving knowledge-based kernel principal components in large scale problems. The main idea is to substitute the full kernel matrix with an approximate low-rank factorization and adapt the techniques presented in Section 2.4 to account for the low-rank approximation. More specifically, we propose to use a matrix

$$\tilde{K} = \Psi^\top \Psi \quad \text{with} \quad \Psi \in \mathbb{R}^{l \times n} \quad \text{such that for all } \varepsilon > 0 \text{ there exists } l \le n \text{ so that } \left\| K - \Psi^\top \Psi \right\|_p < \varepsilon,$$

where $\|\cdot\|_p$ denotes the Schatten $p$-norm of a symmetric and positive definite matrix (Weidmann, 1980). Typically, the rank of the approximation satisfies $l \ll n$, which enables us to find the approximate knowledge-based kernel principal components using the closed form solvers in time $O(l^3)$. This is a significant speed-up compared to the runtime cost of $O(n^3)$ for the optimization problem in Eq. (2.11) defined with the full kernel matrix.
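The approximation criterion above can be checked numerically. The following is a minimal sketch, not part of the original derivation: it assumes a Gaussian kernel on hypothetical toy data and builds the factor $\Psi$ from the top-$l$ eigenpairs of the kernel matrix (the helper names `schatten_norm` and `truncated_factor` are illustrative), then measures the Schatten $p$-norm of the residual for increasing rank.

```python
import numpy as np

def schatten_norm(A, p):
    """Schatten p-norm: the l_p norm of the vector of singular values of A."""
    s = np.linalg.svd(A, compute_uv=False)
    return float(np.sum(s ** p) ** (1.0 / p))

def truncated_factor(K, l):
    """Rank-l factor Psi (l x n) from the top-l eigenpairs of a PSD matrix K."""
    evals, evecs = np.linalg.eigh(K)             # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:l]            # indices of the l largest
    lam = np.clip(evals[top], 0.0, None)         # guard against round-off negatives
    return (evecs[:, top] * np.sqrt(lam)).T      # Psi = Lambda^{1/2} V_l^T

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                    # hypothetical toy data, rows are instances
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sq)                            # Gaussian kernel matrix (assumed kernel)

errors = []
for l in (5, 20, 80):
    Psi = truncated_factor(K, l)
    errors.append(schatten_norm(K - Psi.T @ Psi, p=2))
```

With this construction the residual norm shrinks as $l$ grows and vanishes (up to round-off) at $l = n$, which is the behaviour the definition of $\Psi$ asks for.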

For the moment, suppose that the kernel matrix $K$ can be approximated with a low-rank factorization $\Psi^\top \Psi$. Then, the optimization problem from Eq. (2.12) can be written as

$$\min_{z \in \mathbb{R}^n} \; z^\top \Psi^\top \Psi \left( \left( \Psi^\top \Psi \right)^{-1} + E \right) \Psi^\top \Psi z - 2 b^\top \Psi^\top \Psi z \quad \text{s.t.} \quad z^\top \left( \Psi^\top \Psi \right)^2 z = R^2, \tag{2.25}$$

where $E = S - K^{-1}$. The fact that the matrix $\Psi$ is of rank $l \ll n$ implies that the inverse matrix $(\Psi^\top \Psi)^{-1} \in \mathbb{R}^{n \times n}$ (understood as a pseudo-inverse whenever $l < n$) is also of rank $l$. To see this, let us perform a singular value decomposition of the matrix $\Psi = U \Pi V^\top$, where $U \in \mathbb{R}^{l \times l}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal matrices, and $\Pi \in \mathbb{R}^{l \times n}$ is a diagonal matrix with at most $l$ positive singular values. From the decomposition it follows that $\Psi^\top \Psi = V \Pi^2 V^\top$. If we denote with $\Pi_l \in \mathbb{R}^{l \times l}$ the diagonal matrix with the $l$ non-zero singular values, then the inverse matrix is $(\Psi^\top \Psi)^{-1} = V_l \Pi_l^{-2} V_l^\top$, where $V_l$ denotes the right singular vectors corresponding to the non-zero singular values in $\Pi_l$. The fact that the singular value matrix $\Pi$ is of rank $l$ also implies that in $(\Psi^\top \Psi)^2 = V_l \Pi_l^4 V_l^\top$ there is no dependence on the right singular vectors corresponding to zero singular values. Hence, substituting $\Pi_l^2 V_l^\top z \in \mathbb{R}^l$ for the optimization variable in Eq. (2.25), and with a slight abuse of notation again denoting the new variable by $z$, we obtain the optimization problem for the low-rank approximation of knowledge-based kernel principal components,

$$\min_{z \in \mathbb{R}^l} \; z^\top \left( \Pi_l^{-2} + V_l^\top E V_l \right) z - 2 \left( V_l^\top b \right)^\top z \quad \text{s.t.} \quad z^\top z = R^2. \tag{2.26}$$

In the latter problem, $\Pi_l^{-2} + V_l^\top E V_l \in \mathbb{R}^{l \times l}$ is a symmetric matrix that can be computed in time $O(l^3 + l^2 n)$, where $O(l^3)$ stems from the singular value decomposition of the matrix $\Psi$ and $O(l^2 n)$ from the matrix-matrix multiplications in $V_l^\top E V_l$. The latter computational cost is not cubic because the matrices comprising $E$ are either diagonal or very sparse (e.g., see Eq. 2.12). Hence, a closed form solution for the problem in Eq. (2.26) can be computed in time $O(l^3 + l^2 n)$ using the approaches from Section 2.4. As $l \ll n$, the approach can scale knowledge-based kernel principal component analysis to millions of instances.
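The change of variables that turns the $n$-dimensional problem into the $l$-dimensional one can be verified numerically. The sketch below is only an illustration under stated assumptions: $\Psi$, $b$, and a diagonal stand-in for $E$ (stored as the vector `e_diag`) are random placeholders rather than quantities from Eq. (2.12), and the rank-deficient inverse is taken to be $V_l \Pi_l^{-2} V_l^\top$ as in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, l = 50, 5
Psi = rng.normal(size=(l, n))          # hypothetical rank-l factor
e_diag = rng.normal(size=n)            # diagonal stand-in for E (assumption)
b = rng.normal(size=n)

# Thin SVD: the rows of Vt are the right singular vectors with non-zero singular values.
_, sing, Vt = np.linalg.svd(Psi, full_matrices=False)
V_l = Vt.T                             # n x l
Pi2 = sing ** 2                        # diagonal of Pi_l^2

# Data of the reduced problem; the diagonal E keeps this product at O(l^2 n).
A_red = np.diag(1.0 / Pi2) + (V_l.T * e_diag) @ V_l
b_red = V_l.T @ b

def objective_full(z):
    """Objective of the n-dimensional problem with K replaced by Psi^T Psi."""
    Kz = Psi.T @ (Psi @ z)
    K_pinv_Kz = V_l @ ((V_l.T @ Kz) / Pi2)   # V_l Pi_l^{-2} V_l^T applied to Kz
    return Kz @ (K_pinv_Kz + e_diag * Kz) - 2.0 * b @ Kz

def objective_reduced(u):
    """Objective of the l-dimensional problem."""
    return u @ A_red @ u - 2.0 * b_red @ u

z = rng.normal(size=n)
u = Pi2 * (V_l.T @ z)                  # change of variables u = Pi_l^2 V_l^T z
```

For any `z`, `objective_full(z)` and `objective_reduced(u)` agree up to round-off, and the constraint value $z^\top (\Psi^\top \Psi)^2 z$ equals $u^\top u$, so the two problems share their feasible sets and objective values under this map.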

Having described the optimization problem for the computation of low-rank approximations to knowledge-based kernel principal components, we now review two standard approaches for obtaining a good low-rank factorization of the kernel matrix. While the approach reviewed in Section 2.5.2.1 is suitable for any kernel function, the one reviewed in Section 2.5.2.2 works only for the class of stationary kernels (e.g., see Chapter 3).

2.5.2.1 Nyström Method

This section provides a brief review of the Nyström method (Nyström, 1930; Williams and Seeger, 2001) for low-rank approximation of kernel matrices. The method will be investigated in more detail in Chapter 4, where an approximation bound will also be given. The presentation in this section closely follows that of Williams and Seeger (2001), where the approach was first introduced for the purpose of low-rank approximation of kernel matrices.

The Nyström method computes a low-rank approximation $\tilde{K}$ of a kernel matrix $K$ by first sampling (without replacement) $l$ instances from $X$. The literature often refers to these selected instances as landmarks. If we denote with $K_{l,l}$ the block in the kernel matrix corresponding to kernel function values between the landmarks and with $K_{n,l}$ the block with kernel values between all available instances and the landmarks, then the Nyström approximation is given by

$$\tilde{K} = K_{n,l} K_{l,l}^{-1} K_{l,n}.$$

Now, from the eigendecomposition of the symmetric and positive definite matrix $K_{l,l} = V_{l,l} \Sigma_{l,l}^2 V_{l,l}^\top$, we obtain that the low-rank approximation can be written as

$$\tilde{K} = \left( K_{n,l} V_{l,l} \Sigma_{l,l}^{-1} \right) \left( K_{n,l} V_{l,l} \Sigma_{l,l}^{-1} \right)^\top = \Psi^\top \Psi,$$

where $\Psi = \left( K_{n,l} V_{l,l} \Sigma_{l,l}^{-1} \right)^\top$. In order to express a particular instance $x_i \in X$ in this feature representation, one first needs to compute the column vector $K_i$ with kernel values between that instance and the landmarks. Then, the instance $x_i$ can be represented as $K_i^\top V_{l,l} \Sigma_{l,l}^{-1}$.
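The construction of the Nyström factor can be sketched in a few lines. This is a minimal illustration rather than the implementation used in this work: the Gaussian kernel, the toy data, and the helper names `gaussian_kernel`, `nystroem_factor`, and `embed` are assumptions made for the example, and landmarks are drawn uniformly without replacement.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=0.5):
    """Gaussian kernel values between the rows of A and the rows of B."""
    sq = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystroem_factor(X, l, kernel, rng):
    """Return Psi (l x n) with Psi^T Psi = K_nl K_ll^{-1} K_ln, plus the map data."""
    idx = rng.choice(X.shape[0], size=l, replace=False)   # landmarks, no replacement
    landmarks = X[idx]
    K_nl = kernel(X, landmarks)
    K_ll = kernel(landmarks, landmarks)
    evals, V = np.linalg.eigh(K_ll)        # K_ll = V diag(evals) V^T, evals = Sigma^2
    sigma_inv = 1.0 / np.sqrt(evals)       # assumes K_ll is well conditioned
    Psi = ((K_nl @ V) * sigma_inv).T
    return Psi, landmarks, V, sigma_inv

def embed(x, landmarks, V, sigma_inv, kernel):
    """Feature representation K_i^T V Sigma^{-1} of a single instance x."""
    k_i = kernel(x[None, :], landmarks)[0]
    return (k_i @ V) * sigma_inv

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))              # hypothetical toy data, rows are instances
Psi, landmarks, V, sigma_inv = nystroem_factor(X, 20, gaussian_kernel, rng)
```

A practical implementation would typically guard against very small eigenvalues of $K_{l,l}$ (for instance by thresholding them), which this sketch deliberately omits.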

The computational complexity of the Nyström method is $O(l^3 + l^2 n)$, and the fact that $l \ll n$ implies that the method is capable of alleviating the cubic complexity of our approach. If we denote with $V_l$ and $\Sigma_l$ the matrices with the top $l$ eigenvectors and eigenvalues of the kernel matrix $K = V \Sigma V^\top$, then the optimal approximation of the kernel matrix (measured

2.5.2.2 Random Fourier Features

In this section, we provide a brief overview of the random Fourier features method for the approximation of stationary kernel functions. A more detailed review of this approach is provided in Chapter 3. Before we give a low-rank approximation of the kernel matrix using random Fourier features, we define the class of shift-invariant/stationary kernel functions.

Definition 2.1. Let $D \subset \mathbb{R}^d$ be an open set. A positive definite kernel $k : D \times D \to \mathbb{R}$ is called stationary or shift-invariant if there exists a function $s : D \to \mathbb{R}$ such that $k(x, y) = s(x - y)$, for all $x, y \in D$. The function $s$ is said to be a function of positive type.

Having defined the class of stationary kernels, let us now review the key theoretical result for the approximation of kernel functions using random Fourier features.

Theorem 2.5. (Bochner, 1932) The Fourier transform of a bounded positive measure on $\mathbb{R}^d$ is a continuous function of positive type. Conversely, any function of positive type is the Fourier transform of a bounded positive measure.

From this theorem it follows that for a stationary kernel k it holds

$$k(x, y) = s(x - y) = \int_{\mathbb{R}^d} \exp\left( -i \langle w, x - y \rangle \right) d\mu(w),$$

where $\mu$ is a positive and bounded measure. As $k(x, y)$ is a real function in both arguments, the imaginary part of the integral on the right-hand side is equal to zero, and we have

$$k(x, y) = 2 \int \cos\left( w^\top x + b \right) \cos\left( w^\top y + b \right) d\hat{\mu}(w, b),$$

where $\hat{\mu}(w, b) = \mu(w) / 2\pi$ for all $w \in \mathbb{R}^d$ and $b \in [-\pi, \pi]$. The identity follows from $2 \cos \alpha \cos \beta = \cos(\alpha - \beta) + \cos(\alpha + \beta)$, as the second term integrates to zero over $b \in [-\pi, \pi]$. Hence, it is possible to sample $(w, b)$ proportional to $\hat{\mu}(w, b)$ and approximate the kernel value at $(x, y)$ by the Monte-Carlo estimate of the integral defining the inner product between two instances. The first kernel approximation algorithm based on this idea was proposed by Rahimi and Recht (2008a). That work gives an approximation of a stationary kernel using $l$ random Fourier features by

$$\tilde{k}(x, y) = \frac{2}{l} \sum_{i=1}^{l} \cos\left( w_i^\top x + b_i \right) \cos\left( w_i^\top y + b_i \right),$$

where $\{(w_i, b_i)\}_{i=1}^{l}$ are independent samples from the probability distribution that is proportional to the measure $\hat{\mu}(w, b)$. The convergence of the approximation to the actual value of the kernel function at a given pair of instances follows from Hoeffding's concentration inequality (e.g., see Chapter 3 for more details). Hence, if we denote with

$$\psi_l(x) = \mathrm{vec}\left( \sqrt{2/l} \cos\left( w_1^\top x + b_1 \right), \ldots, \sqrt{2/l} \cos\left( w_l^\top x + b_l \right) \right),$$

the approximation of the kernel function at $(x, y)$ can be written as

$$\tilde{k}(x, y) = \psi_l(x)^\top \psi_l(y).$$
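The feature map $\psi_l$ can be illustrated with a short sketch. The example below is an assumption-laden illustration, not the construction analysed in Chapter 3: it uses the Gaussian kernel $\exp(-\|x - y\|^2 / (2\sigma^2))$, whose normalized spectral measure is the Gaussian $N(0, \sigma^{-2} I)$, draws $b$ uniformly from $[-\pi, \pi]$, and stores instances as columns of a hypothetical data matrix `X`; the function name `rff_features` is illustrative.

```python
import numpy as np

def rff_features(X, W, b):
    """Random Fourier feature matrix; column i is psi_l(x_i).

    X is d x n (instances as columns), W is l x d, b has length l.
    """
    l = W.shape[0]
    return np.sqrt(2.0 / l) * np.cos(W @ X + b[:, None])

rng = np.random.default_rng(0)
d, n, l, sigma = 3, 40, 20000, 1.5
X = rng.normal(size=(d, n))                       # hypothetical toy data
W = rng.normal(scale=1.0 / sigma, size=(l, d))    # w_i ~ N(0, I / sigma^2)
b = rng.uniform(-np.pi, np.pi, size=l)            # b_i ~ Uniform[-pi, pi]

Psi = rff_features(X, W, b)                       # l x n
K_approx = Psi.T @ Psi                            # entries psi_l(x_i)^T psi_l(x_j)

# Exact Gaussian kernel matrix for comparison.
sq = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)
K_exact = np.exp(-sq / (2.0 * sigma ** 2))
max_err = float(np.max(np.abs(K_approx - K_exact)))
```

Because each entry of `K_approx` is an average of $l$ bounded independent terms, the entrywise error `max_err` is expected to decay at the Monte-Carlo rate of roughly $1 / \sqrt{l}$.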

From here it then follows that the approximation of the kernel matrix can be written as

$$\tilde{K} = \psi_l(X)^\top \psi_l(X),$$

where $x_i$ denotes the $i$-th column in the data matrix $X$ ($1 \le i \le n$) and $\psi_l(X) \in \mathbb{R}^{l \times n}$ is the