Regression With Gaussian Measures


Michael J. Meyer, Copyright © April 11, 2004


PREFACE

We treat the basics of Gaussian processes, Gaussian measures, reproducing kernel Hilbert spaces and related topics. All mathematical details are included and every effort is made to keep this as self-contained as possible. Only elementary Hilbert space theory and integration theory as well as basic results from probability theory are assumed.

This is a work in progress and has been written up in haste. Undoubtedly there are mistakes. Please email me at [email protected] if you find mistakes or have suggestions.

Michael J. Meyer April 11, 2004


Introduction

We will freely use terminology which will be defined later. Let $F$ be a nonempty set and $f : F \to \mathbb{R}$ a real valued function on $F$. Consider the following problem: we have observed the value of $f$ at some points $x_1, \dots, x_n \in F$ as

$$y_j = f(x_j), \quad j = 1, \dots, n, \tag{1.1}$$

and from this we want to estimate $f$ itself. We will follow a Bayesian approach. It is assumed that the function $f$ belongs to a real vector space $H$ of functions on $F$. A prior probability $P$ is placed on $H$ and the regressor $\hat f$ (the estimate of $f$ in light of the data) is computed as the mean of $P$ conditioned on the data (1.1).

The probability $P$ is defined on the $\sigma$-field $\mathcal{E}$ generated by the continuous linear functionals on $H$. If $I : (H, \mathcal{E}, P) \to H$ denotes the $H$-valued random variable defined as $I(f) = f$ (the identity on $H$), then the mean of the distribution $P$ on $H$ is the expectation $E_P[I]$ of $I$ under $P$, that is, the $H$-valued integral

$$E_P[I] = \int_H I \, dP = \int_H f \, P(df). \tag{1.2}$$

Do not worry if this sounds needlessly abstract since it is not how things are handled in practice. It merely serves to motivate the procedures below. The vector valued integral (1.2) commutes with all continuous linear functionals $\Lambda$ on $H$, that is,

$$\Lambda(E_P[I]) = E_P(\Lambda \circ I) = \int_H \Lambda(f) \, P(df),$$

and the same holds true if the ordinary expectation is replaced with a conditional expectation. The regressor $\hat f$ is the conditional expectation

$$\hat f = E_P[I \mid \text{data}] \tag{1.3}$$

and so we have

$$\Lambda(\hat f) = E_P[\Lambda \mid \text{data}] \tag{1.4}$$

for each continuous linear functional $\Lambda$ on $H$ (note that $\Lambda \circ I = \Lambda$). Thus rather than computing the regressor $\hat f$ globally as in (1.3) we compute $\Lambda(\hat f)$ for enough continuous linear functionals $\Lambda$ on $H$ to obtain a good view of $\hat f$. For each $x \in F$ let

$$E_x : f \in H \mapsto f(x) \in \mathbb{R}$$

denote the evaluation functional at the point $x$. If $\Lambda = E_x$ then $\Lambda(\hat f) = \hat f(x)$ is our prediction for the value of $f$ at the point $x$ in light of the data (1.1). Note that the data themselves can be written in terms of the evaluation functionals as

$$E_j(f) = y_j, \quad 1 \le j \le n, \tag{1.5}$$

where $E_j = E_{x_j}$ is the evaluation functional at the point $x_j$. With this the regressor $\hat f$ becomes the conditional expectation

$$\hat f = E_P[I \mid E_j = y_j, \ j \le n]$$

and

$$\Lambda(\hat f) = E_P[\Lambda \mid E_j = y_j, \ j \le n], \tag{1.6}$$

for each continuous linear functional $\Lambda$ on $H$. To make this feasible we have to assume that

1. The evaluation functionals $E_x$, $x \in F$, are continuous on $H$.

The computation of (1.6) involves only the finite dimensional distribution of the random vector

$$W = (E_1, \dots, E_n, \Lambda)$$

on $\mathbb{R}^{n+1}$ under the probability $P$. Note that each continuous linear functional on $H$ is a random variable on the probability space $(H, \mathcal{E}, P)$.

The measure $P$ is called a Gaussian measure on $H$ if every continuous linear functional $\Lambda$ on $H$ is a normal random variable under $P$. In this case the distribution of the vector $W$ is automatically Gaussian (multinormal) on $\mathbb{R}^{n+1}$ and the computation of the conditional expectation (1.6) involves merely routine computations with the multinormal density.
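The routine computation mentioned above can be sketched in a few lines. The sketch assumes that $W$ has mean zero and that the covariance matrix $C = (\mathrm{Cov}(E_i, E_j))$ of the data functionals is invertible; under these assumptions the conditional mean in (1.6) is $c^T C^{-1} y$ with $c_j = \mathrm{Cov}(E_j, \Lambda)$. The function names `solve` and `conditional_mean` are illustrative and do not come from the text.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def conditional_mean(cov_data, cov_cross, y):
    """E[Lambda | E_1 = y_1, ..., E_n = y_n] for a zero-mean Gaussian vector W.

    cov_data[i][j] = Cov(E_i, E_j) (assumed invertible),
    cov_cross[j]   = Cov(E_j, Lambda).
    """
    weights = solve(cov_data, cov_cross)  # C^{-1} c, using the symmetry of C
    return sum(w * yj for w, yj in zip(weights, y))
```

If $\Lambda$ coincides with one of the data functionals, say $\Lambda = E_1$, then $c$ is the first column of $C$ and the prediction reproduces the observed value $y_1$ up to rounding, as one would expect.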

We have chosen the particular form (1.1) for the data because this is the standard in regression problems. Note however that our approach applies to all forms of data and predictions which can be articulated in terms of events involving finitely many continuous linear functionals on $H$.

Regression with Gaussian processes assumes that $f$ is the trajectory of a Gaussian process $Z = Z(x)$ on $F$. The mean of the process is assumed to be zero and thus the process $Z$ is completely determined by its covariance function $K(x, y)$, which is a symmetric positive semidefinite kernel on $F$.

The kernel $K : F \times F \to \mathbb{R}$ is a parameter of the regression procedure. The space $H$ is the product space $H = \mathbb{R}^F$ of all functions $f : F \to \mathbb{R}$ and the probability $P$ is the distribution of $Z$ on $H$. Kolmogorov's existence theorem for product measures guarantees the existence of the probability $P$ on $H$ for every symmetric, positive semidefinite kernel $K$ on $F$.

The space $H = \mathbb{R}^F$ is a topological vector space with only one redeeming quality: the evaluation functionals are the coordinate functionals and hence continuous in the product topology on $H$.

Unfortunately there are essentially no other continuous linear functionals on $H$. Every continuous linear functional on $H$ is a finite linear combination of coordinate functionals.

Consequently this setup limits us to data presented in the form (1.1) and consequent predictions of values $f(x)$ at other points $x \in F$ in a point by point fashion.

There are other disadvantages. For example it requires a substantial effort to extract properties of the admissible functions $f$, that is, the trajectories of the Gaussian process $Z$, from properties of the covariance kernel $K$, and the resulting properties are often weaker than desired.

Consequently we take a slightly different approach. We assume instead that $f$ is an element of a separable Hilbert space $H$ of functions on $F$. $P$ is a Gaussian measure on $H$ defined in terms of an orthonormal basis $\{\psi_j\}$ of $H$ and a sequence $(\sigma_j)$ of positive numbers (which diagonalize the covariance operator $Q$ of $P$ below).

We can then proceed as above provided that the evaluation functionals are continuous on $H$. But we also have other options. The data and predictions can be articulated in any fashion which uses only finitely many continuous linear functionals $\Lambda$ on $H$. Point estimates are one possibility. Another possibility is the coefficients

$$\Lambda(f) = (f, \psi_k)$$


Here we had to assume that the evaluation functionals are continuous on $H$. A Hilbert space of functions on $F$ with this property is called a reproducing kernel Hilbert space on $F$. Such a Hilbert space $H$ defines a unique symmetric, positive semidefinite kernel $K : F \times F \to \mathbb{R}$. Conversely every symmetric, positive semidefinite kernel $K : F \times F \to \mathbb{R}$ determines a unique reproducing kernel Hilbert space. There is an interesting interplay between orthonormal bases of $H$ and the kernel $K$.

A basic question is how to find an orthonormal basis for $H$. If $F \subseteq \mathbb{R}^d$ is compact and $K$ is continuous, then we have additional structure in the form of the Euclidean topology and Lebesgue measure on $F$. Associated with the kernel $K$ we have the integral operator $T : L^2(F) \to L^2(F)$ defined by

$$(T f)(x) = \int_F K(x, y) f(y) \, dy, \quad f \in L^2(F), \ x \in F,$$

where $dy$ denotes Lebesgue measure on $F$. It turns out that $T$ is a Hilbert-Schmidt operator. Consequently the orthogonal complement of the null space of $T$ has an orthonormal basis $\{\varphi_j\}$ consisting of eigenvectors of $T$. Let $\lambda_j$ denote the corresponding eigenvalues. Then the functions

$$\psi_j = \sqrt{\lambda_j} \, \varphi_j$$

are an orthonormal basis for the reproducing kernel Hilbert space $H$ with kernel $K$. This establishes the connection to the spectral theory of compact, selfadjoint operators on a Hilbert space.

There is another connection. For $f \in H$ let $\Lambda_f$ be the bounded linear functional $\Lambda_f(h) = (h, f)$ on $H$. The Gaussian measure $P$ on $H$ defines a unique bounded linear operator $Q : H \to H$ such that the covariances of the random variables $\Lambda_f$, $\Lambda_g$ are given as

$$\mathrm{Cov}_P(\Lambda_f, \Lambda_g) = (Q f, g)_H, \quad f, g \in H. \tag{1.7}$$

The operator $Q$ is a positive trace class operator. Conversely for every positive trace class operator $Q : H \to H$ there exists a unique Gaussian measure $P$ on $H$ such that (1.7) holds.

Thus the material presents an interesting interaction of functional analysis and probability theory. If you are only interested in the regression problem you need only read Chapter 2, Chapter 3, sections 1-4, 7, 8 and Chapter 4, sections 1, 2, 4.


Chapter 2

Operators on Hilbert Space

In this chapter we develop the spectral theory of compact operators between Hilbert spaces. Our scalars are the reals, that is, we consider only real Hilbert spaces.

2.1 Hilbert space basics

We review the basics of Hilbert space theory. Let $H$ be a (real) Hilbert space with inner product $(\cdot, \cdot)$. Let

$$H_1 = \{ x \in H : \|x\| \le 1 \}$$

denote the closed unit ball in $H$ and

$$S_1(H) = \{ x \in H : \|x\| = 1 \}$$

the unit sphere in $H$. For vectors $x, y \in H$ we write $x \perp y$ (orthogonal) if $(x, y) = 0$. For subsets $A$, $B$ of $H$ we write $A \perp B$ if $a \perp b$ for all $a \in A$ and $b \in B$. We let

$$A^\perp := \{ x \in H \mid x \perp a \text{ for all } a \in A \}.$$

Then $A^\perp$ is a closed subspace of $H$. If $V$ is a closed subspace of $H$, then

$$H = V + V^\perp,$$

in particular every closed subspace of $H$ is complemented in $H$. This is the first fundamental fact about Hilbert spaces. Each element $x \in H$ has a unique decomposition $x = v + v^\perp$ with $v \in V$ and $v^\perp \in V^\perp$. We have $\|x\|^2 = \|v\|^2 + \|v^\perp\|^2$ (the Law of Pythagoras). The map $x \mapsto v$ is called the perpendicular projection onto the subspace $V$ and is denoted $\pi_V$. If $(\varphi_j)$ is an ON-basis of $V$, then

$$\pi_V(x) = \sum_j (x, \varphi_j) \varphi_j, \quad x \in H. \tag{2.1}$$
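As a finite dimensional illustration of (2.1) and of the Law of Pythagoras, the following sketch projects a vector in $\mathbb{R}^3$ onto a plane $V$ spanned by two orthonormal vectors. The helper names `inner` and `project` and the concrete vectors are chosen only for this example.

```python
import math


def inner(x, y):
    """Euclidean inner product (x, y)."""
    return sum(a * b for a, b in zip(x, y))


def project(x, on_basis):
    """Formula (2.1): pi_V(x) = sum_j (x, phi_j) phi_j for an orthonormal family."""
    v = [0.0] * len(x)
    for phi in on_basis:
        c = inner(x, phi)
        v = [vi + c * p for vi, p in zip(v, phi)]
    return v


s = 1.0 / math.sqrt(2.0)
basis = [[s, s, 0.0], [0.0, 0.0, 1.0]]  # orthonormal basis of a plane V in R^3
x = [3.0, 1.0, 2.0]
v = project(x, basis)                    # component of x in V
v_perp = [a - b for a, b in zip(x, v)]   # component of x in the orthogonal complement
```

Here `v` is approximately $(2, 2, 2)$ and `v_perp` is approximately $(1, -1, 0)$, so that $\|x\|^2 = 14 = 12 + 2$.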

The second fundamental property of a Hilbert space $H$ is the fact that the continuous linear functionals on $H$ can be identified with the elements of $H$: if $a \in H$, then

$$\Lambda_a : x \in H \mapsto (x, a) \in \mathbb{R}$$

defines a continuous linear functional on $H$. The converse is also true: every continuous linear functional on $H$ has this form (Riesz Representation Theorem).

Bilinear forms. Let $X$ and $Y$ be Hilbert spaces. A function $\psi = \psi(x, y) : X \times Y \to \mathbb{R}$ is called a bilinear form if it is linear in both variables $x$ and $y$. The bilinear form $\psi$ is called continuous if

$$\|\psi\| = \sup\{ |\psi(x, y)| : x \in X_1, \ y \in Y_1 \} < \infty. \tag{2.2}$$

In this case $|\psi(x, y)| \le \|\psi\| \, \|x\| \, \|y\|$ for all $x \in X$ and $y \in Y$. Note that the closed unit balls $X_1$, $Y_1$ can be replaced with the unit spheres $S_1(X)$, $S_1(Y)$ with no effect on the definition of the norm of $\psi$.

If $A : X \to Y$ is a bounded linear operator, then $\psi(x, y) = (Ax, y)$ defines a continuous bilinear form on $X \times Y$ with $\|\psi\| = \|A\|$. Conversely:

Theorem 2.1.1 (Lax-Milgram). Let $\psi = \psi(x, y)$ be a continuous bilinear form on $X \times Y$. Then there exists a bounded linear operator $A : X \to Y$ such that $\psi(x, y) = (Ax, y)_Y$ for all $x \in X$ and $y \in Y$.

Proof. Fix $x \in X$. Then $\Lambda_x(y) = \psi(x, y)$ is a continuous linear functional on $Y$. By the Riesz Representation Theorem there exists an element $a \in Y$ with $\Lambda_x(y) = (a, y)_Y$ for all $y \in Y$. Clearly $a$ is uniquely determined by $x$. Write $a = Ax$. This defines a map $A : X \to Y$ which satisfies $\psi(x, y) = (Ax, y)$. The uniqueness of $a$ and the linearity of $\psi$ in the first argument imply that the map $A$ is linear. The continuity of $\psi$ implies that $A$ is continuous.

If $X = Y = H$, then a bilinear form $\psi = \psi(x, y)$ on $X \times Y$ is called a bilinear form on $H$. Such a bilinear form is called symmetric if it satisfies $\psi(x, y) = \psi(y, x)$ for all $x, y \in H$. In this case:

Proposition 2.1.1. Let $\psi = \psi(x, y)$ be a symmetric bilinear form on $H$. Then

$$\|\psi\| = \sup\{ |\psi(x, x)| : \|x\| \le 1 \}. \tag{2.3}$$

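Proof. Let $C$ denote the right hand side of (2.3). Obviously $C \le \|\psi\|$ and we have to show only the reverse inequality. Write $\varphi(x) = \psi(x, x)$. Then $|\varphi(x)| \le C$ if $\|x\| \le 1$ and hence, by homogeneity, $|\varphi(x)| \le C \|x\|^2$ for all $x \in H$. Using the symmetry of $\psi$ we can write

$$\psi(x, y) = \varphi\left( \frac{x+y}{2} \right) - \varphi\left( \frac{x-y}{2} \right).$$

Recall that $H_1$ denotes the closed unit ball in $H$. If $x, y \in H_1$, then the parallelogram law yields

$$|\psi(x, y)| \le C \left( \left\| \frac{x+y}{2} \right\|^2 + \left\| \frac{x-y}{2} \right\|^2 \right) = \frac{C}{2} \left( \|x\|^2 + \|y\|^2 \right) \le \frac{1}{2}(C + C) = C.$$

Taking the sup over all $x, y \in H_1$ now yields $\|\psi\| \le C$.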

2.2 Adjoint operator

The Lax-Milgram theorem can be used to show the existence of the adjoint operator. Let $X$, $Y$ be Hilbert spaces and $T : X \to Y$ a bounded linear operator. Then $\psi(y, x) = (y, Tx)_Y$ is a continuous bilinear form on $Y \times X$. Consequently there exists a bounded linear operator $T^* : Y \to X$ such that $\psi(y, x) = (T^* y, x)_X$ for all $y \in Y$ and $x \in X$. It is easy to see that the operator $T^*$ is uniquely determined by its defining property

$$(Tx, y) = (x, T^* y), \quad x \in X, \ y \in Y.$$

Obviously $T^{**} = T$. We note the following

Proposition 2.2.1. We have

(i) $N(T^* T) = N(T)$.

(ii) $N(T^*) = R(T)^\perp$.

(iii) $N(T) = R(T^*)^\perp$.

Proof. (i) If $Tx = 0$, then $T^* T x = 0$. Conversely, if $T^* T x = 0$, then $\|Tx\|^2 = (Tx, Tx) = (T^* T x, x) = 0$, thus $x \in N(T)$.

(ii) Let $w \in N(T^*)$ and $y = Tx$ for some $x \in X$. Then $(y, w) = (x, T^* w) = 0$. Thus $w \in R(T)^\perp$.

Conversely, if $w \in R(T)^\perp$, then $(T^* w, x) = (w, Tx) = 0$ for all $x \in X$. This implies $T^* w = 0$ (let $x = T^* w$) and so $w \in N(T^*)$. Now (iii) follows from this: replace $T$ with $T^*$ and note that $T^{**} = T$.

Remark. By taking orthogonal complements in (ii) and (iii) we obtain $R(T) \subseteq N(T^*)^\perp$ and $R(T^*) \subseteq N(T)^\perp$ but we will not have equality in general.

For any subset $A \subseteq X$ we have $A^\perp = (\overline{A})^\perp$. Thus (ii) can be written as $N(T^*) = [\overline{R(T)}]^\perp$. Note that this implies that $T^*$ is one to one on the closure $\overline{R(T)}$ of the range of $T$.

2.3 Selfadjoint and positive operators

A bounded linear operator $T$ on $H$ is called selfadjoint if it satisfies

$$(Tx, y) = (x, Ty) \tag{2.4}$$

for all $x, y \in H$. In this case the nullspace $N(T) = \{ x \in H \mid Tx = 0 \}$ satisfies

$$N(T) = R(T)^\perp.$$

The converse $R(T) = N(T)^\perp$ is not true in general simply because the range $R(T)$ will not in general be closed.

The number $\lambda$ is called an eigenvalue of $T$ if there is a nonzero vector $x \in H$ with $Tx = \lambda x$, that is, $x \in N(T - \lambda I)$, where $I$ is the identity operator on $H$. We let

$$E_\lambda(T) := N(T - \lambda I) = \{ x \in H \mid Tx = \lambda x \}$$

denote the eigenspace associated with the eigenvalue $\lambda$. Obviously this space is defined whether or not $\lambda$ is an eigenvalue of $T$. It is an eigenvalue if and only if $E_\lambda(T) \ne \{0\}$. The nonzero elements of $E_\lambda(T)$ are called the eigenvectors associated with the eigenvalue $\lambda$.

Proposition 2.3.1. Let $T$ be a selfadjoint operator on $H$. Then $\lambda \ne \mu$ implies $E_\lambda(T) \perp E_\mu(T)$, in other words, eigenvectors with respect to different eigenvalues are perpendicular to each other.

Proof. Assume that $Tx = \lambda x$ and $Ty = \mu y$. Then $\lambda (x, y) = (Tx, y) = (x, Ty) = \mu (x, y)$. Since $\lambda \ne \mu$ this implies that $(x, y) = 0$.

If $\lambda = 0$ then the eigenspace $E_\lambda(T)$ is simply the nullspace $N(T)$ and $\lambda = 0$ is an eigenvalue of $T$ if and only if $T$ has a nontrivial nullspace. If $T$ is selfadjoint this eigenspace is perpendicular to the range $R(T)$ and so no eigenvector associated with the eigenvalue zero is in the range of $T$.

By contrast, if $\lambda \ne 0$, then $E_\lambda(T) \subseteq R(T)$ since every eigenvector associated with $\lambda$ satisfies $x = \lambda^{-1} T x$.

A subspace $V \subseteq H$ is called $T$-invariant if it satisfies $T(V) \subseteq V$.

Proposition 2.3.2. Let $T$ be a selfadjoint operator on $H$ and $V \subseteq H$ a $T$-invariant subspace. Then the orthogonal complement $V^\perp$ is also $T$-invariant.

Proof. Let $x \in V^\perp$. Then for all $y \in V$ we have $(Tx, y) = (x, Ty) = 0$, since $Ty \in V$. Thus $Tx \in V^\perp$.

Assume that $V$ is a closed $T$-invariant subspace, write $H = V + V^\perp$ and let $T_1$, $T_2$ denote the restrictions of $T$ to $V$ respectively $V^\perp$. Then

$$T = T_1 \circ \pi_V + T_2 \circ \pi_{V^\perp},$$

where $\pi_V$, $\pi_{V^\perp}$ are the orthogonal projections onto the subspaces $V$, $V^\perp$. Thus the restrictions $T_1$, $T_2$ completely determine the operator $T$.

Every eigenspace $E_\lambda(T)$ of $T$ and in particular the null space $N(T)$ is $T$-invariant. Write

$$H = N(T) + W,$$

where $W = N(T)^\perp$. Then the restriction of $T$ to $W$ is a linear operator on $W$ and obviously this restriction completely determines the operator $T$ (since the restriction of $T$ to its null space is simply zero).

Thus we will often be able to disregard the eigenvectors associated with the eigenvalue zero, that is, the eigenvectors in the nullspace of $T$.

Proposition 2.3.3. If the operator $T$ on $H$ is selfadjoint, then

$$\|T\| = \sup\{ |(Tx, x)| : \|x\| = 1 \}. \tag{2.5}$$

Proof. Clearly it will suffice to show (2.5) with "$\|x\| = 1$" replaced with "$\|x\| \le 1$". Set $\psi(x, y) = (x, Ty)$. Then $\psi$ is a bilinear form with $\|\psi\| = \|T\|$. Since $T$ is selfadjoint, $\psi$ is symmetric. Now apply (2.3).

Positive operators. A bounded linear operator $A$ on $H$ is called positive if it satisfies

$$(Ax, x) \ge 0 \quad \text{for all } x \in H.$$

If strict inequality holds for all nonzero $x$, then $A$ is called strictly positive.

For example, if $X$ and $Y$ are Hilbert spaces and $T : X \to Y$ a bounded linear operator, then the operator $A = T^* T$ on $X$ is positive: $(Ax, x) = (T^* T x, x) = (Tx, Tx) = \|Tx\|^2 \ge 0$.

Proposition 2.3.4. If the operator $A$ on $H$ is positive, then every eigenvalue $\lambda$ of $A$ satisfies $\lambda \ge 0$.

Proof. Let $x$ be an eigenvector with eigenvalue $\lambda$. Then

$$\lambda \|x\|^2 = \lambda (x, x) = (Ax, x) \ge 0.$$

Proposition 2.3.5. If the operator $A$ on $H$ is positive, then the operator $\alpha I + A$ has a bounded inverse on all of $H$, for each $\alpha > 0$.

Proof. Let $\alpha > 0$ and set $T = \alpha I + A$. Then, for each $x \in H$ we have

$$\|Tx\|^2 = \alpha^2 \|x\|^2 + 2\alpha (Ax, x) + \|Ax\|^2 \ge \alpha^2 \|x\|^2.$$

It follows that $T$ is one to one and has closed range. Moreover $T$ is selfadjoint. Thus $R(T)^\perp = N(T) = \{0\}$. Thus $T$ has dense range. It follows that $R(T) = H$ and $T$ has an inverse $T^{-1} : H \to H$ as a linear map. The inverse is bounded since $\|Tx\| \ge \alpha \|x\|$ implies that $\|T^{-1} y\| \le \alpha^{-1} \|y\|$.

We will also need the following result

Proposition 2.3.6. If the operator $A$ on $H$ is positive, then there exists a unique positive operator $S$ on $H$ such that $A = S^2$.

The operator $S$ is called the (positive) square root of $A$ and denoted $S = \sqrt{A}$. The existence of $S$ is a special case of the so called continuous functional calculus which is a consequence of the representation theory of commutative $C^*$-algebras. This theory is quite easy and provides the most natural proof. The reader is referred to the literature.

2.4 Compact operators between Banach spaces

Let us recall without proof some facts about compact sets in a complete normed space $X$. A subset $A \subseteq X$ is called relatively compact if the closure of $A$ is compact. The set $A$ is called totally bounded if for each $\epsilon > 0$ there are finitely many balls $B(x_i, \epsilon)$, $x_i \in X$, of radius $\epsilon$ which cover $A$. With this

Theorem 2.4.1. For a subset $A \subseteq X$ the following are equivalent:

(i) $A$ is relatively compact.

(ii) $A$ is totally bounded.

(iii) Each sequence $(a_n) \subseteq A$ has a subsequence which converges in $X$.

The proof is given in every class on metric spaces. The limit of the subsequence in (iii) will be in the closure of $A$ but need not be in $A$ itself.

Let $X$, $Y$ be complete normed spaces. A linear operator $T : X \to Y$ is called compact if the image $T(B) \subseteq Y$ of the unit ball $B \subseteq X$ is relatively compact in $Y$. $T$ is called a finite rank operator if the range $R(T) := T(X) \subseteq Y$ is finite dimensional. In this case $T$ has the form

$$T(x) = \sum_{j < n} \Lambda_j(x) \varphi_j, \quad x \in X, \tag{2.6}$$

where $n = \dim(R(T))$, $\varphi_j \in Y$ and the $\Lambda_j$ are continuous linear functionals on $X$. Simply let $\{\varphi_0, \dots, \varphi_{n-1}\}$ be a basis for $R(T)$ and $\Lambda_j = \psi_j \circ T$, where $\psi_j$ is the coordinate functional associated with the basis vector $\varphi_j$, that is,

$$y = \sum_{j < n} \psi_j(y) \varphi_j, \quad y \in R(T).$$

Now set $y = Tx$. Conversely every operator of this form is a finite rank operator with $R(T) = \mathrm{span}(\{\varphi_j\})$. Since a bounded set in a finite dimensional space is relatively compact (Bolzano-Weierstrass Theorem) every finite rank operator is compact.

Theorem 2.4.2. Let $X$, $Y$ be complete normed spaces and $T : X \to Y$ a linear operator.

(i) If $T$ is a finite rank operator then $T$ is compact.

(ii) If $T$ is the limit in operator norm of compact operators, then $T$ is compact.

Proof. Statement (i) was shown above. For (ii) assume that $T_n : X \to Y$ is compact, for each $n \ge 1$, and $T_n \to T$ in operator norm. Let $B \subseteq X$ be the unit ball and $\epsilon > 0$. Choose $n$ such that $\|T_n - T\| < \epsilon/2$. There exist finitely many balls $B(y_i, \epsilon/2) \subseteq Y$ which cover $T_n(B)$. Then the corresponding balls $B(y_i, \epsilon)$ cover $T(B)$. This shows that $T(B)$ is totally bounded.

Let us introduce the following notation: with $B(X, Y)$ we denote the space of all bounded linear operators $T : X \to Y$. Likewise $F(X, Y)$ and $K(X, Y)$ denote the set of finite rank respectively compact operators in $B(X, Y)$. If $X = Y$, we write $B(X)$, $F(X)$ and $K(X)$ for $B(X, X)$, $F(X, X)$ and $K(X, X)$.

It is easily verified that $F(X, Y)$ and $K(X, Y)$ are in fact subspaces of $B(X, Y)$. Then from (i)

$$F(X, Y) \subseteq K(X, Y) \subseteq B(X, Y).$$

The converse of (ii) is not true in general but it is true if $X$ and $Y$ are Hilbert spaces as we shall see below. In other words, $\overline{F(X, Y)} \ne K(X, Y)$ in general.

For an operator $T \in F(X, Y)$ we set $\mathrm{rank}(T) = \dim(R(T))$. If $T$ has the form (2.6), then $\mathrm{rank}(T) = n$ if $\varphi_0, \dots, \varphi_{n-1}$ are linearly independent.

Let $T \in K(X, Y)$. Then the image $T(D) \subseteq Y$ of each bounded subset $D \subseteq X$ is relatively compact. Using (2.4.1) we see

Proposition 2.4.1. Let $T \in B(X, Y)$. Then $T$ is compact if and only if the sequence $(T x_n) \subseteq Y$ has a convergent subsequence for each bounded sequence $(x_n) \subseteq X$.

Let $A$ be any set and $\tau$, $\sigma$ topologies on $A$ with $\tau \subseteq \sigma$. If $\tau$ is a Hausdorff topology and $A$ is compact in the topology $\sigma$, then $\tau = \sigma$.

It will suffice to show that each $\sigma$-closed set $F \subseteq A$ is $\tau$-closed. Indeed, $F$ is $\sigma$-compact and hence $\tau$-compact (every cover with $\tau$-open sets is a cover with $\sigma$-open sets). Since $\tau$ is Hausdorff it follows that $F$ is $\tau$-closed.

Let $X$ be a normed space and $X^*$ the space of all continuous linear functionals on $X$. Recall that the weak topology on $X$ is the weakest topology in which all functionals $F \in X^*$ are continuous. Clearly this topology is weaker than the norm topology on $X$. It is a Hausdorff topology (the continuous linear functionals on a normed space $X$ separate points on $X$).

The observation above shows that the weak topology agrees with the norm topology on every norm compact subset of $X$. Recall that a sequence $(x_n) \subseteq X$ satisfies $x_n \to x$ weakly (in the weak topology) if and only if $F(x_n) \to F(x)$ for each continuous linear functional $F \in X^*$.

Proposition 2.4.2. Let $T \in B(X, Y)$ be compact and $(x_n) \subseteq X$ bounded. If $x_n \to x \in X$ weakly, then $T x_n \to T x$ in norm.

Proof. Since $T$ is bounded the weak convergence $x_n \to x \in X$ implies the weak convergence $T x_n \to T x$. Choose a bounded subset $B \subseteq X$ with $(x_n) \subseteq B$ and $x \in B$. Then the closure $K = \overline{T(B)} \subseteq Y$ is compact. Consequently the weak topology agrees with the norm topology on $K$. Since $T x_n, T x \in K$ and $T x_n \to T x$ weakly it follows that $T x_n \to T x$ in norm.

Remark. A weakly convergent sequence $(x_n)$ is automatically bounded, that is, the assumption of boundedness above is superfluous, but we do not need this result. If $(x_n)$ is weakly convergent then it is weakly bounded, i.e.

$$\sup_n |F(x_n)| < \infty$$

for each continuous linear functional $F \in X^*$. The Uniform Boundedness Principle now implies that $(x_n)$ is bounded in norm.

Exercise. Let $X$, $Y$, $Z$ be complete normed spaces and $T : X \to Y$, $S : Y \to Z$ bounded linear operators. If one of $S$, $T$ is compact then so is the product $ST$.

Hint: regardless of compactness $T$ maps bounded sets to bounded sets and $S$ maps relatively compact sets to relatively compact sets.

We conclude this section with a characterization of compact operators on Hilbert space.

Theorem 2.4.3. Let $X$ and $Y$ be Hilbert spaces and $T \in B(X, Y)$ a bounded linear operator. Then $T$ is compact if and only if $\|T e_n\| \to 0$ for each orthonormal sequence $(e_n) \subseteq X$.

Proof. ($\Rightarrow$) Assume that $T$ is compact and let $(e_n) \subseteq X$ be an orthonormal sequence. Then

$$\sum_n |(x, e_n)|^2 \le \|x\|^2 < \infty$$

and so $(x, e_n) \to 0$ as $n \uparrow \infty$, for each $x \in X$. By the Riesz Representation Theorem this means $F(e_n) \to 0$ for each continuous linear functional $F \in X^*$, that is, $e_n \to 0$ weakly in $X$. According to 2.4.2 the compactness of $T$ now implies $T e_n \to 0$ in norm.

($\Leftarrow$) Recall that $N_1$ denotes the closed unit ball of a normed space $N$. Assume that $T$ is not compact and hence $T(X_1) \subseteq Y$ is not totally bounded. Let $\epsilon > 0$ be such that $T(X_1)$ cannot be covered with finitely many balls of radius $2\epsilon$. We construct an orthonormal sequence $(e_n) \subseteq X$ such that $\|T e_n\| \ge \epsilon$ for all $n \ge 1$.

(A) We claim that for every finite dimensional subspace $N \subseteq X$ there exists $e \in N^\perp$ with $\|e\| = 1$ and $\|T e\| \ge \epsilon$.

If this were not true let $N \subseteq X$ be a finite dimensional subspace such that $\|T e\| \le \epsilon$ for all $e \in V := N^\perp$ with $\|e\| \le 1$, that is, $T(V_1) \subseteq \epsilon Y_1$.

Note that $T(N_1) \subseteq Y$ is compact and hence can be covered by finitely many balls $B(y_j, \epsilon)$ of radius $\epsilon$. Since $X_1 \subseteq N_1 + V_1$ we have $T(X_1) \subseteq T(N_1) + T(V_1)$. It follows that $T(X_1)$ is covered by the balls $B(y_j, 2\epsilon)$ in contradiction to the choice of $\epsilon$. This shows (A).

(B) Now we can construct the sequence $(e_n)$ by induction. Using (A) with $N = \{0\}$ find $e_0$ with $\|e_0\| = 1$ and $\|T e_0\| \ge \epsilon$. Given that orthonormal $e_0, \dots, e_n$ with $\|T e_j\| \ge \epsilon$ have already been constructed set $N = \mathrm{span}(\{e_0, \dots, e_n\})$ and choose $e_{n+1} \in N^\perp$ with $\|e_{n+1}\| = 1$ such that $\|T e_{n+1}\| \ge \epsilon$. Then the sequence $\{e_0, \dots, e_{n+1}\}$ is orthonormal and the construction continues.

2.5 Compact selfadjoint operators

Let $T$ be a compact, selfadjoint operator on a Hilbert space $H$. Then $T$ can be diagonalized in the sense that there is an orthonormal basis for $H$ consisting of eigenvectors of $T$. This result makes it very easy to work with such operators. For the proof we need the following

Lemma 2.5.1. Let $T$ be a compact, selfadjoint operator on $H$. Then at least one of $\lambda = \|T\|$ or $\lambda = -\|T\|$ is an eigenvalue of $T$.

Proof. We may assume that $T \ne 0$. From (2.5) we get a sequence of vectors $x_n \in H$ with $\|x_n\| = 1$ and a number $\lambda$ such that $|\lambda| = \|T\|$ and $(T x_n, x_n) \to \lambda$ as $n \uparrow \infty$ (pass to a subsequence if necessary). Then, for each $n \ge 0$ we have

$$0 \le \|T x_n - \lambda x_n\|^2 = \|T x_n\|^2 - 2\lambda (T x_n, x_n) + \lambda^2 \|x_n\|^2 \tag{2.7}$$

$$\le \|T\|^2 - 2\lambda (T x_n, x_n) + \lambda^2. \tag{2.8}$$

As $n \uparrow \infty$, the rightmost quantity converges to $2\lambda^2 - 2\lambda^2 = 0$. Thus we also have $T x_n - \lambda x_n \to 0$. Set $y_n = T x_n$. By compactness of $T$ the sequence $y_n$ has a convergent subsequence.

Passing to this subsequence we may assume that the sequence $y_n$ is itself convergent. But then the sequence $x_n = \lambda^{-1}(y_n - (y_n - \lambda x_n))$ converges also. Since $T x_n - \lambda x_n \to 0$ the limit $x = \lim_n x_n$ must satisfy $T x = \lambda x$. Since $\|x_n\| = 1$ for all $n$, we have $\|x\| = 1$.
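In finite dimensions the eigenvalue of largest absolute value promised by the lemma can be approximated numerically. The sketch below uses power iteration, which is a different technique from the compactness argument in the proof; it assumes a symmetric matrix whose eigenvalue of largest absolute value is unique up to multiplicity (so that $\|T\|$ and $-\|T\|$ are not both eigenvalues) and a start vector that is not orthogonal to the corresponding eigenspace. The name `dominant_eigenpair` is illustrative.

```python
import math


def inner(x, y):
    """Euclidean inner product (x, y)."""
    return sum(a * b for a, b in zip(x, y))


def matvec(t, x):
    """Matrix-vector product T x for a matrix given as a list of rows."""
    return [inner(row, x) for row in t]


def dominant_eigenpair(t, x0, steps=200):
    """Power iteration: returns (lam, x) with |lam| close to ||T|| and T x close to lam x.

    Assumes t is symmetric, T x never vanishes along the iteration, and the
    eigenvalue of largest absolute value is unique up to multiplicity.
    """
    x = x0
    for _ in range(steps):
        y = matvec(t, x)
        norm = math.sqrt(inner(y, y))
        x = [yi / norm for yi in y]  # unit vector, as x_n in the proof
    lam = inner(matvec(t, x), x)     # quadratic form (T x, x), compare (2.5)
    return lam, x
```

For the matrix with rows $(2, 1)$ and $(1, 2)$ the eigenvalues are $3$ and $1$, so the iteration approximates $\lambda = \|T\| = 3$; for the negative of this matrix it approximates $\lambda = -\|T\| = -3$.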

With this we can now prove the main result about compact selfadjoint operators:

Theorem 2.5.1. Let $T$ be a compact, selfadjoint operator on $H$. Then there exists an orthonormal basis for $H$ consisting of eigenvectors of $T$. More precisely $N(T)^\perp$ has a countable orthonormal basis $(\varphi_j)$ consisting of eigenvectors of $T$ and if $\lambda_j$ are the associated eigenvalues, then

$$T x = \sum_j \lambda_j (x, \varphi_j) \varphi_j, \quad x \in H,$$

where the series converges in the norm of $H$. If the sequence $(\varphi_j)$ is infinite, then $\lambda_j \to 0$ as $j \uparrow \infty$.

Proof. By induction we construct a (possibly finite) sequence of numbers $\lambda_j \ne 0$ and orthonormal vectors $\varphi_j$ such that

(i) $T \varphi_j = \lambda_j \varphi_j$,

(ii) the restriction $T_j$ of $T$ to $\{\varphi_0, \dots, \varphi_{j-1}\}^\perp$ satisfies $\|T_j\| = |\lambda_j|$, and

(iii) $T = 0$ on $\{\varphi_0, \varphi_1, \dots\}^\perp$.

Since the $\lambda_j$ are nonzero, each $\varphi_j$ is in $N(T)^\perp$ and from (iii) it follows that the $\varphi_j$ span all of $N(T)^\perp$ (recall that $(A^\perp)^\perp$ is the closed linear span of $A$).

The quantities $\lambda_0$ and $\varphi_0$ exist by Lemma 2.5.1. Assume that $\lambda_0, \dots, \lambda_j$ and $\varphi_0, \dots, \varphi_j$ have already been constructed. Set

$$X_j = \{\varphi_0, \dots, \varphi_j\}^\perp.$$

If $T = 0$ on $X_j$, then we are finished. Otherwise note that $X_j$ is a closed $T$-invariant subspace (since $\mathrm{span}(\{\varphi_0, \dots, \varphi_j\})$ is $T$-invariant). The restriction $T_j$ of $T$ to $X_j$ is a compact selfadjoint operator on $X_j$. Applying Lemma 2.5.1 to $T_j$ we see that there is a unit vector $\varphi_{j+1} \in X_j$ and a number $\lambda_{j+1}$ such that

(a) $|\lambda_{j+1}| = \|T_j\|$ and

(b) $T \varphi_{j+1} = T_j \varphi_{j+1} = \lambda_{j+1} \varphi_{j+1}$.

Obviously $\varphi_{j+1} \perp \varphi_0, \dots, \varphi_j$ and so the resulting sequence $(\varphi_j)$ is orthonormal. If $T_j = 0$ at any time, then (iii) is already satisfied and we are finished.

Assume now that $T_j \ne 0$ for all $j \ge 0$, set $X = \{\varphi_0, \varphi_1, \dots\}^\perp$ and let $S$ be the restriction of $T$ to $X$. We must show that $S = 0$.

From (ii) it follows that $|\lambda_0| \ge |\lambda_1| \ge \dots \ge |\lambda_j| \ge \|S\|$ for all $j \ge 0$, and so it will suffice to show that $\lambda_j \to 0$ as $j \uparrow \infty$.

If $\lambda_j \not\to 0$, we have $|\lambda_j| \ge \rho$ for some number $\rho > 0$. Then the sequence $(\varphi_j / \lambda_j) \subseteq H$ is bounded and by compactness of $T$ the sequence

$$y_j = T(\varphi_j / \lambda_j) = \varphi_j$$

has a convergent subsequence. However this contradicts the fact that the sequence $\varphi_j$ is orthonormal and hence $\|\varphi_j - \varphi_k\| = \sqrt{2}$ for all $j \ne k$. Consequently we must have $\lambda_j \to 0$.

Remark 2.5.1 (Spectrum). We claim that the sequence $(\lambda_j)$ contains all the nonzero eigenvalues of $T$. If $\lambda \ne \lambda_j, 0$ were another eigenvalue, the associated eigenspace would be contained in $N(T)^\perp$ and perpendicular to all the $\varphi_j$, which contradicts the fact that the $\varphi_j$ span $N(T)^\perp$. It follows that the $\lambda_j$ contain all the nonzero eigenvalues of $T$. Note also that the convergence $\lambda_j \to 0$ implies that the eigenspaces corresponding to nonzero eigenvalues are all finite dimensional.

The sequence $(\lambda_j)$ contains all nonzero eigenvalues of $T$ but what about the spectrum of $T$, that is, the set

$$\sigma(T) = \{ \lambda \in \mathbb{R} : T - \lambda I \text{ is not invertible} \}?$$

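Let us assume that $H$ is not finite dimensional. Then the unit ball $H_1$ is not compact. It follows that $T$ is not invertible, that is, $0 \in \sigma(T)$ (regardless of whether $0$ is an eigenvalue or not). However, if $\lambda \ne \lambda_j, 0$ for all $j \ge 0$, then it can be shown that the operator $T - \lambda I$ is invertible on $H$. To compute $(T - \lambda I)^{-1}$ we must solve

$$(T - \lambda I) x = y \tag{2.9}$$

for $x$ in terms of $y$. Write $V = N(T)$ and $x = \pi_V(x) + \pi_{V^\perp}(x)$ as well as $y = \pi_V(y) + \pi_{V^\perp}(y)$. With this (2.9) becomes

$$-\lambda \pi_V(x) + (T - \lambda I) \pi_{V^\perp}(x) = \pi_V(y) + \pi_{V^\perp}(y)$$

and since $V^\perp$ is $T$-invariant and hence $(T - \lambda I)$-invariant, this is equivalent with

$$-\lambda \pi_V(x) = \pi_V(y) \quad \text{and} \quad (T - \lambda I) \pi_{V^\perp}(x) = \pi_{V^\perp}(y). \tag{2.10}$$

Since the $\varphi_j$ are an ON-basis for $V^\perp$ we have $\pi_{V^\perp}(y) = \sum_j (y, \varphi_j) \varphi_j$ and $\pi_{V^\perp}(x) = \sum_j \alpha_j \varphi_j$ with $\alpha_j$ to be determined. Note that $(T - \lambda I) \varphi_j = (\lambda_j - \lambda) \varphi_j$. With this the second equation in (2.10) becomes

$$\sum_j \alpha_j (\lambda_j - \lambda) \varphi_j = \sum_j (y, \varphi_j) \varphi_j,$$

which solves for $\alpha_j = (y, \varphi_j) / (\lambda_j - \lambda)$, resulting in

$$x = \pi_V(x) + \pi_{V^\perp}(x) = -\frac{1}{\lambda} \pi_V(y) + \sum_j \frac{(y, \varphi_j)}{\lambda_j - \lambda} \varphi_j.$$

Since $\lambda_j \to 0$ and $\lambda \ne 0, \lambda_j$, the numbers $|\lambda_j - \lambda|$ are bounded away from zero, so the series converges. The solution $x$ exists for each $y$ and is a continuous linear function of $y$, in other words

$$(T - \lambda I)^{-1} y = -\frac{1}{\lambda} \pi_V(y) + \sum_j \frac{(y, \varphi_j)}{\lambda_j - \lambda} \varphi_j$$

exists as a continuous linear operator on $H$. Consequently the point $\lambda$ is not in the spectrum of $T$ and we have shown that

$$\sigma(T) = \{\lambda_j\} \cup \{0\}.$$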
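The formula for $(T - \lambda I)^{-1} y$ can be checked on a small matrix. The sketch below is a finite dimensional illustration only: the $3 \times 3$ symmetric matrix `T`, its eigenpairs and the function name `resolvent_apply` are chosen for this example. The matrix has nonzero eigenvalues $3$ and $1$ and a one dimensional null space $V$.

```python
import math


def inner(x, y):
    """Euclidean inner product (x, y)."""
    return sum(a * b for a, b in zip(x, y))


def matvec(t, x):
    """Matrix-vector product T x for a matrix given as a list of rows."""
    return [inner(row, x) for row in t]


def resolvent_apply(lam, eigenpairs, y):
    """x = -(1/lam) pi_V(y) + sum_j (y, phi_j) / (lam_j - lam) phi_j.

    eigenpairs lists (lam_j, phi_j) for the nonzero eigenvalues, with the phi_j
    an orthonormal basis of the orthogonal complement of V = N(T).
    Requires lam != 0 and lam != lam_j for all j.
    """
    proj_perp = [0.0] * len(y)  # projection of y onto the complement of V
    x = [0.0] * len(y)
    for lam_j, phi in eigenpairs:
        c = inner(y, phi)
        proj_perp = [p + c * f for p, f in zip(proj_perp, phi)]
        x = [xi + c / (lam_j - lam) * f for xi, f in zip(x, phi)]
    proj_null = [yi - p for yi, p in zip(y, proj_perp)]  # pi_V(y)
    return [xi - pn / lam for xi, pn in zip(x, proj_null)]


s = 1.0 / math.sqrt(2.0)
T = [[2.0, 1.0, 0.0], [1.0, 2.0, 0.0], [0.0, 0.0, 0.0]]
eigenpairs = [(3.0, [s, s, 0.0]), (1.0, [s, -s, 0.0])]
y = [1.0, 2.0, 3.0]
x = resolvent_apply(2.0, eigenpairs, y)  # solves (T - 2 I) x = y
```

With $\lambda = 2$ the result is approximately $x = (2, 1, -1.5)$, and applying $T - 2I$ to it recovers $y$.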

Remark 2.5.2 (Range). The series expansion of Theorem 2.5.1 also allows us to determine the range $R(T)$ quite easily. Let $y \in H$ and consider the equation

$$T x = y. \tag{2.11}$$

If this equation has a solution $x$, then $y \in N(T)^\perp$. Assume now that $y \in N(T)^\perp$. In looking for a solution $x$ with $T x = y$ we can restrict ourselves to $x \in N(T)^\perp$. Such $x$ will then have an expansion

$$x = \sum_j \alpha_j \varphi_j \tag{2.12}$$

with $\alpha_j$ to be determined. In terms of these series expansions (2.11) becomes

$$\sum_j \alpha_j \lambda_j \varphi_j = T x = y = \sum_j (y, \varphi_j) \varphi_j,$$

which implies that we must have $\alpha_j = \lambda_j^{-1} (y, \varphi_j)$. However for these $\alpha_j$ the series (2.12) converges exactly if $\sum_j \lambda_j^{-2} |(y, \varphi_j)|^2 < \infty$. It follows that

$$R(T) = \Big\{ y \in N(T)^\perp : \sum_j \lambda_j^{-2} |(y, \varphi_j)|^2 < \infty \Big\}.$$

2.6 Compact operators between Hilbert spaces

The case of a general compact operator $T : X \to Y$ between Hilbert spaces $X$ and $Y$ can be reduced to the selfadjoint case by observing that the product $T^* T$ is a compact, selfadjoint operator on $X$. The results of the last section then carry over with minimal changes.

Let $X$ and $Y$ be Hilbert spaces, $T \in B(X, Y)$. A singular system for $T$ is a sequence $(\mu_j, \varphi_j, \xi_j)_j$ where

(i) $\mu_0 \ge \mu_1 \ge \dots \ge \mu_n \ge \dots > 0$,

(ii) $\{\varphi_j\}$ is an ON-basis for $N(T)^\perp$,

(iii) $\{\xi_j\}$ is an ON-basis for $N(T^*)^\perp$, and

(iv) $T \varphi_j = \mu_j \xi_j$ and $T^* \xi_j = \mu_j \varphi_j$ for all $j \ge 0$.

Assume that $(\mu_j, \varphi_j, \xi_j)_j$ is such a system, set $V = N(T)^\perp$ and let $x \in X$. Then the orthogonal projection $\pi_V(x)$ of $x$ on $V$ has an expansion

$$\pi_V(x) = \sum_j (x, \varphi_j) \varphi_j$$

and applying $T$ to this expansion it follows that

$$T x = T \pi_V(x) = \sum_j \mu_j (x, \varphi_j) \xi_j \tag{2.13}$$

with convergence pointwise on $X$. For $\varphi \in X$ and $\xi \in Y$ define the rank one operator $S = \varphi \otimes \xi$ as

$$S x = (x, \varphi) \, \xi, \quad x \in X.$$

Then the above expansion for $T$ can be rewritten as

$$T = \sum_j \mu_j (\varphi_j \otimes \xi_j) \tag{2.14}$$

where the series converges pointwise on $X$. Set

$$T_n = \sum_{j < n} \mu_j (\varphi_j \otimes \xi_j) \tag{2.15}$$

and let $x \in X$. Using (i) and the orthonormality of the $\xi_n$ we have

$$\|(T - T_n) x\|^2 = \Big\| \sum_{j \ge n} \mu_j (x, \varphi_j) \xi_j \Big\|^2 = \sum_{j \ge n} \mu_j^2 |(x, \varphi_j)|^2 \le \mu_n^2 \sum_{j \ge n} |(x, \varphi_j)|^2 \le \mu_n^2 \|x\|^2.$$

This shows that

$$\|T - T_n\| \le \mu_n \tag{2.16}$$

in operator norm. Letting $x = \varphi_n$ above we see that we actually have equality. Consequently, if $\mu_n \to 0$, then the series (2.14) converges in operator norm and hence $T$ is compact.

Not every operator $T \in B(X, Y)$ has a singular system. However, if $X = Y$ and $T \in B(X)$ is selfadjoint, let $\{\phi_j\}$ be the eigenvectors associated with the nonzero eigenvalues $\lambda_j$ of $T$ arranged in decreasing order. Then $(\mu_j, \phi_j, \xi_j)_j$ with $\mu_j = \lambda_j$ and $\xi_j = \phi_j$ is a singular system for $T$. This is exactly the content of Theorem 2.5.1. Now we generalize this fact to all compact operators $T \in K(X, Y)$:

Theorem 2.6.1. Let $T : X \to Y$ be a compact operator, set $A = T^*T$, note that $A$ is compact and selfadjoint on $X$ and let $\{\phi_j\}$ be the eigenvectors associated with the nonzero eigenvalues $\lambda_j$ of $A$ arranged in decreasing order. Then

$$\mu_j = \sqrt{\lambda_j} \quad\text{and}\quad \xi_j = \mu_j^{-1} T\phi_j$$

defines a singular system $(\mu_j, \phi_j, \xi_j)_j$ for $T$. We have $\mu_n \to 0$ and hence the series (2.14) converges in operator norm. In particular $T$ is the limit of finite rank operators.

Proof. Note first that $N(A) = N(T)$ according to (2.2.1). Thus the $\phi_j$ are an ON-basis for $N(T)^\perp$. By definition of $(\mu_j, \phi_j, \xi_j)$ we have $T\phi_j = \mu_j \xi_j$ and $T^*T\phi_j = \mu_j^2 \phi_j$ and this implies that $T^*\xi_j = \mu_j \phi_j$. We claim that $\{\xi_j\}$ is an ON-basis for $N(T^*)^\perp$. Indeed, for $j, k \ge 0$ we have

$$(\xi_j, \xi_k) = (\mu_j^{-1} T\phi_j, \mu_k^{-1} T\phi_k) = (\mu_j \mu_k)^{-1} (T^*T\phi_j, \phi_k) = \delta_{jk}. \tag{2.17}$$


Thus $\{\xi_j\} \subseteq R(T) \subseteq N(T^*)^\perp =: W$ is an orthonormal system. We claim that this system spans all of $W$. Let $w \in W$ and assume that $w \perp \xi_j$, for all $j \ge 0$. Then $T^*w \in R(T^*) \subseteq N(T)^\perp$ and

$$(T^*w, \phi_j) = (w, T\phi_j) = \mu_j (w, \xi_j) = 0,$$

for all $j \ge 0$. Since the $\{\phi_j\}$ are an ON-basis for $N(T)^\perp$ it follows that $T^*w = 0$, that is, $w \in N(T^*) = W^\perp$. Thus $w \in W \cap W^\perp = \{0\}$. This shows that the orthonormal system $\{\xi_j\}$ in $N(T^*)^\perp$ is complete.
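To make the construction of Theorem 2.6.1 concrete, the following is a minimal sketch in finite dimensions, where every operator is compact. It assumes a hypothetical invertible real $2 \times 2$ matrix $T$ for which $T^*T$ is not diagonal, and it uses the closed-form eigenvalues of a symmetric $2 \times 2$ matrix. The helper names `singular_system_2x2` and `dot` are ours and are for illustration only.

```python
import math

def singular_system_2x2(T):
    """Singular system (mu_j, phi_j, xi_j) of an invertible real 2x2 matrix T.

    Follows the recipe of Theorem 2.6.1: eigenpairs of A = T^T T, then
    mu_j = sqrt(lambda_j) and xi_j = T phi_j / mu_j.
    Assumption of this sketch: the off-diagonal entry of A is nonzero.
    """
    (a, b), (c, d) = T
    # A = T^T T = [[p, q], [q, r]] is symmetric.
    p, q, r = a * a + c * c, a * b + c * d, b * b + d * d
    if q == 0 or a * d - b * c == 0:
        raise ValueError("sketch assumes T invertible and T^T T not diagonal")
    mean, rad = (p + r) / 2, math.hypot((p - r) / 2, q)
    system = []
    for lam in (mean + rad, mean - rad):  # eigenvalues in decreasing order
        # (q, lam - p) solves (A - lam I) v = 0 when q != 0; normalize it.
        n = math.hypot(q, lam - p)
        phi = (q / n, (lam - p) / n)
        mu = math.sqrt(lam)
        # xi = T phi / mu
        xi = ((a * phi[0] + b * phi[1]) / mu, (c * phi[0] + d * phi[1]) / mu)
        system.append((mu, phi, xi))
    return system

def dot(u, v):
    """Euclidean inner product on R^2."""
    return u[0] * v[0] + u[1] * v[1]

T = ((3.0, 0.0), (4.0, 5.0))  # hypothetical example matrix
(mu0, phi0, xi0), (mu1, phi1, xi1) = singular_system_2x2(T)
print(mu0, mu1)        # square roots of the eigenvalues 45 and 5 of T^T T
print(dot(xi0, xi1))   # the xi_j are orthogonal (up to rounding)
```

For this matrix $T^*T$ has the eigenvalues $45$ and $5$, so the singular values are $\sqrt{45}$ and $\sqrt{5}$, and the vectors $\xi_j = \mu_j^{-1} T\phi_j$ come out orthonormal as in (2.17).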

Remark. If $T : X \to Y$ is any bounded linear operator and $(\phi_j)$ an ON-basis for $V = N(T)^\perp$, then the expansion (2.1) is valid and applying $T$ to this expansion yields

$$Tx = \sum_j (x, \phi_j) T\phi_j.$$

What makes the expansion (2.13) interesting is the additional information contained in the singular system for $T$.

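Remark 2.6.1 (Adjoint). Recall that $T^{**} = T$. If $(\mu_j, \phi_j, \xi_j)_j$ is a singular system for $T$ then $(\mu_j, \xi_j, \phi_j)_j$ is a singular system for $T^*$ and so we have the expansion

$$T^* = \sum_j \mu_j (\xi_j \otimes \phi_j), \quad\text{that is,}\quad T^*y = \sum_j \mu_j (y, \xi_j)\phi_j, \quad y \in Y.$$

Thus if $T$ is compact then so is the adjoint $T^*$.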

Remark 2.6.2 (Range). The expansion (2.13) allows us to work with ON-bases just as in the case of a compact selfadjoint operator. As an example we determine the range R(T), that is, we study the equation

$$Tx = y. \tag{2.18}$$

Fix $y \in Y$. If a solution exists, then $y \in R(T) \subseteq N(T^*)^\perp$. Now assume that $y \in N(T^*)^\perp$. Then we have an expansion $y = \sum_j (y, \xi_j)\xi_j$. If there exists any solution $x$ of (2.18) in $X$, then there exists a solution in $V = N(T)^\perp$ (in fact $\pi_V(x)$ is one). Thus we may assume that $x \in V$ and have an expansion

$$x = \sum_j \alpha_j \phi_j. \tag{2.19}$$

Applying $T$ to this yields

$$\sum_j \alpha_j \mu_j \xi_j = Tx = y = \sum_j (y, \xi_j)\xi_j.$$

It follows that we must have $\alpha_j = \mu_j^{-1}(y, \xi_j)$. With this the series for $x$ converges exactly if $\sum_j \mu_j^{-2} |(y, \xi_j)|^2 < \infty$. Consequently

$$R(T) = \Big\{\, y \in N(T^*)^\perp : \sum_j \mu_j^{-2} |(y, \xi_j)|^2 < \infty \,\Big\} \tag{2.20}$$

exactly as in the selfadjoint case.
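The summability condition in the description of $R(T)$ can be explored numerically. The sketch below assumes a hypothetical diagonal operator on $\ell^2$ with singular values $\mu_j = 1/(j+1)$ and two hypothetical right-hand sides given through their coefficients $(y, \xi_j)$. The function name `picard_partial_sum` is ours. Partial sums cannot prove convergence or divergence; they only illustrate the behaviour.

```python
def picard_partial_sum(mu, coeff, n):
    """Partial sum over j < n of mu_j^{-2} |(y, xi_j)|^2."""
    return sum((coeff(j) / mu(j)) ** 2 for j in range(n))

def mu(j):
    """Singular values mu_j = 1/(j+1) of the assumed diagonal operator."""
    return 1.0 / (j + 1)

def y_in_range(j):
    """Coefficients 1/(j+1)^2: the terms of the series are 1/(j+1)^2."""
    return 1.0 / (j + 1) ** 2

def y_not_in_range(j):
    """Coefficients 1/(j+1): square summable, but every term of the series is 1."""
    return 1.0 / (j + 1)

for n in (10, 100, 1000):
    print(n, picard_partial_sum(mu, y_in_range, n),
          picard_partial_sum(mu, y_not_in_range, n))
```

The first column of sums stays bounded, so the corresponding $y$ lies in $R(T)$. The second grows like $n$, so that $y$ has square summable coefficients but is not in $R(T)$.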

2.7 Hilbert-Schmidt and trace class operators

Let $X$, $Y$ be Hilbert spaces, $T \in K(X, Y)$ compact and $(\mu_j, \phi_j, \xi_j)_j$ a singular system for $T$. We know from Theorem 2.6.1 that $T$ is the limit in operator norm of finite rank operators. Now we quantify the speed of convergence.

Approximation numbers. Set

$$T_n = \sum_{j<n} \mu_j (\phi_j \otimes \xi_j).$$

We have seen that then

$$\|T - T_n\| \le \mu_n. \tag{2.21}$$

On the other hand we show now that

$$\|T - S\| \ge \mu_n, \tag{2.22}$$

for each finite rank operator $S \in F(X, Y)$ with $\mathrm{rank}(S) \le n$. Set $X_n = \mathrm{span}\{\phi_0, \dots, \phi_n\}$ and note that

$$\|Tx\| \ge \mu_n \|x\|, \quad\text{for all } x \in X_n. \tag{2.23}$$

Let $x \in X_n$. Then $x = \sum_{j \le n} (x, \phi_j)\phi_j$ and so $Tx = \sum_{j \le n} \mu_j (x, \phi_j)\xi_j$. It follows that

$$\|Tx\|^2 = \sum_{j \le n} \mu_j^2 |(x, \phi_j)|^2 \ge \mu_n^2 \sum_{j \le n} |(x, \phi_j)|^2 = \mu_n^2 \|x\|^2.$$

Now let $S \in F(X, Y)$ with $\dim(R(S)) \le n$. Since $\dim(X_n) = n + 1$, $S$ is not one to one on $X_n$ and so there exists a unit vector $u \in X_n$ with $Su = 0$. Using (2.23) we have

$$\|(T - S)u\| = \|Tu\| \ge \mu_n.$$

Thus $\|T - S\| \ge \mu_n$. The quantities

$$a_n(T) = \inf\{\, \|T - S\| : S \in F(X, Y),\ \mathrm{rank}(S) \le n \,\} \tag{2.24}$$

are called the approximation numbers of $T$. Here $a_0(T) = \|T\|$. The estimates (2.21) and (2.22) show that

$$\mu_n = a_n(T) \tag{2.25}$$

and that the operator $S = T_n$ provides the best approximation of $T$ in the operator norm among all operators of rank at most $n$. In particular this shows that the numbers $\mu_n$ in a singular system for $T$ are uniquely determined by $T$ and do not depend on the singular system.

The $\mu_n$ are called the singular values of $T$. Obviously the vectors $\phi_n$ and $\xi_n$ in a singular system for $T$ are not uniquely determined. Consider the selfadjoint case and note that there are many ways to extract an orthonormal basis from each eigenspace of $T$.
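The best approximation property can be checked by hand in a small case. The sketch below assumes the hypothetical matrix $T$ with rows $(3, 0)$ and $(4, 5)$, whose singular values are $\sqrt{45}$ and $\sqrt{5}$. Its rank one truncation $T_1 = \mu_0(\phi_0 \otimes \xi_0)$ was worked out by hand and is entered directly. The competitor `S` is an arbitrary rank one matrix chosen for comparison, and the helper names `op_norm` and `sub` are ours.

```python
import math

def op_norm(M):
    """Operator norm of a real 2x2 matrix: square root of the largest eigenvalue of M^T M."""
    (a, b), (c, d) = M
    p, q, r = a * a + c * c, a * b + c * d, b * b + d * d  # entries of M^T M
    return math.sqrt((p + r) / 2 + math.hypot((p - r) / 2, q))

def sub(M, N):
    """Entrywise difference M - N."""
    return tuple(tuple(x - y for x, y in zip(rm, rn)) for rm, rn in zip(M, N))

T = ((3.0, 0.0), (4.0, 5.0))    # singular values sqrt(45) and sqrt(5)
T1 = ((1.5, 1.5), (4.5, 4.5))   # rank one truncation mu_0 * (phi_0 tensor xi_0), computed by hand
S = ((3.0, 0.0), (4.0, 0.0))    # competitor of rank one: keep only the first column of T

mu1 = math.sqrt(5.0)
dist_truncation = op_norm(sub(T, T1))
dist_competitor = op_norm(sub(T, S))
print(dist_truncation, mu1)     # the two numbers agree up to rounding
print(dist_competitor)          # not smaller than mu1
```

Here $\|T - T_1\| = \mu_1 = \sqrt{5}$, while the competitor has distance $5$ from $T$, in agreement with (2.21) and (2.22).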

The approximation numbers $a_n(T)$ are defined for each bounded linear operator $T \in B(X, Y)$. The operator $T$ is compact if and only if $a_n(T) \to 0$, as $n \uparrow \infty$, and this is the only case of interest. In this case we have $a_n(T) = \mu_n$, where the $\mu_j$ are the singular values of $T$ (square roots of the eigenvalues of $T^*T$).

For each bounded linear operator $T \in B(X, Y)$ let

$$\|T\|_p = \Big( \sum_n a_n(T)^p \Big)^{1/p}$$

and let

$$\mathcal{S}_p(X, Y) = \{\, T \in B(X, Y) : \|T\|_p < \infty \,\}.$$

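Clearly each $T \in \mathcal{S}_p(X, Y)$ is compact. One can show that $\mathcal{S}_p(X, Y) \subseteq B(X, Y)$ is a linear subspace which is complete in the norm $\|\cdot\|_p$ (in infinite dimensions it is not closed in the operator norm), but we won't need this result. We are only interested in the cases $p = 1, 2$.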
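In finite dimensions the norms $\|T\|_p$ are computed directly from the singular values. A minimal sketch follows, assuming the singular values $\sqrt{45}$ and $\sqrt{5}$ of the hypothetical matrix with rows $(3, 0)$ and $(4, 5)$. The function name `schatten_norm` is ours.

```python
import math

def schatten_norm(singular_values, p):
    """(sum_n mu_n^p)^(1/p) for a finite list of singular values."""
    return sum(mu ** p for mu in singular_values) ** (1.0 / p)

mus = [math.sqrt(45.0), math.sqrt(5.0)]  # assumed singular values, largest first
trace_norm = schatten_norm(mus, 1)       # mu_0 + mu_1 = 4 * sqrt(5)
hs_norm = schatten_norm(mus, 2)          # sqrt(45 + 5) = sqrt(50)
print(trace_norm, hs_norm)
```

The cases $p = 1$ and $p = 2$ are the two norms used in the remainder of this section.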

We now assume that $T \in K(X, Y)$ is compact and $(\mu_j, \phi_j, \xi_j)_j$ a singular system for $T$.

Hilbert-Schmidt operators. The operator $T$ is called a Hilbert-Schmidt operator, if $T \in \mathcal{S}_2(X, Y)$, that is,

$$\|T\|_2^2 := \sum_n a_n(T)^2 = \sum_n \mu_n^2 < \infty.$$

Proposition 2.7.1. If $T \in K(X, Y)$ is compact and $\{e_\alpha\}$ is any ON-basis for $X$, then

$$\|T\|_2^2 = \sum_\alpha \|Te_\alpha\|^2.$$

Remark. It follows that $T$ is a Hilbert-Schmidt operator if and only if $\sum_\alpha \|Te_\alpha\|^2 < \infty$, for some ON-basis $\{e_\alpha\}$ of $X$, and in this case the sum is independent of the choice of the ON-basis.

We do not assume that $X$ is separable, that is, that the basis $\{e_\alpha\}$ is countable. However, since $T$ and hence $T^*$ are compact, the entire action is essentially separable: both $N(T)^\perp$ and $\overline{R(T)} = N(T^*)^\perp$ have countable ON-bases.

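Proof. Let $\{e_\alpha\}$ be any ON-basis for $X$. Since $\{\xi_k\}$ is an ON-basis for $N(T^*)^\perp = \overline{R(T)}$, we have

$$\|Te_\alpha\|^2 = \sum_k |(Te_\alpha, \xi_k)|^2 = \sum_k |(e_\alpha, T^*\xi_k)|^2 = \sum_k \mu_k^2 |(e_\alpha, \phi_k)|^2,$$

for each $\alpha$. It follows that

$$\sum_\alpha \|Te_\alpha\|^2 = \sum_\alpha \sum_k \mu_k^2 |(e_\alpha, \phi_k)|^2 = \sum_k \mu_k^2 \sum_\alpha |(e_\alpha, \phi_k)|^2 = \sum_k \mu_k^2 \|\phi_k\|^2 = \sum_k \mu_k^2 = \|T\|_2^2.$$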
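The independence of $\sum_\alpha \|Te_\alpha\|^2$ from the ON-basis is easy to observe in $\mathbb{R}^2$. The sketch below assumes the hypothetical matrix with rows $(3, 0)$ and $(4, 5)$, for which $\mu_0^2 + \mu_1^2 = 45 + 5 = 50$, and evaluates the sum over rotated copies of the standard basis. The helper names `hs_norm_sq` and `rotated_basis` are ours.

```python
import math

def hs_norm_sq(T, basis):
    """Sum of ||T e||^2 over the vectors e of an ON-basis of R^2."""
    (a, b), (c, d) = T
    total = 0.0
    for (x, y) in basis:
        tx, ty = a * x + b * y, c * x + d * y  # the vector T e
        total += tx * tx + ty * ty
    return total

def rotated_basis(theta):
    """Standard basis of R^2 rotated by the angle theta (still orthonormal)."""
    return [(math.cos(theta), math.sin(theta)),
            (-math.sin(theta), math.cos(theta))]

T = ((3.0, 0.0), (4.0, 5.0))  # hypothetical example; T^T T has eigenvalues 45 and 5
for theta in (0.0, 0.3, 1.0):
    print(theta, hs_norm_sq(T, rotated_basis(theta)))  # each value is 50 up to rounding
```

Every choice of the angle gives the same value $50 = \|T\|_2^2$, as Proposition 2.7.1 asserts.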

Hilbert-Schmidt operators on the space $X = L^2(\nu)$ of square integrable functions with respect to a finite measure $\nu$ will be characterized in terms of integration kernels below.

Trace class operators. We now assume that $X$ and $Y$ have the same orthogonal dimension, that is, ON-bases $\{e_\alpha\}$ of $X$ and $\{f_\alpha\}$ of $Y$ can be indexed with the same indices $\alpha$. Because of the compactness of $T$ we can even assume both spaces to be separable. The operator $T$ is called a trace class operator, if $T \in \mathcal{S}_1(X, Y)$, that is,

$$\|T\|_1 = \sum_n a_n(T) < \infty.$$

Recall that $(\mu_j, \phi_j, \xi_j)_j$ denotes a singular system for $T$. It follows that $\|T\|_1 = \sum_n \mu_n$.

Proposition 2.7.2. Let $T \in K(X, Y)$. Then

$$\|T\|_1 = \max \sum_\alpha |(Te_\alpha, f_\alpha)|, \tag{2.26}$$

where the maximum is taken over all ON-bases $\{e_\alpha\}$ of $X$ and $\{f_\alpha\}$ of $Y$.

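Proof. Let $\{e_\alpha\}$ and $\{f_\alpha\}$ be ON-bases of $X$ and $Y$ and write $Te_\alpha = \sum_j \mu_j (e_\alpha, \phi_j)\xi_j$. It follows that

$$\begin{aligned}
\sum_\alpha |(Te_\alpha, f_\alpha)| &\le \sum_\alpha \sum_j \mu_j |(e_\alpha, \phi_j)|\,|(\xi_j, f_\alpha)| = \sum_j \mu_j \sum_\alpha |(e_\alpha, \phi_j)|\,|(\xi_j, f_\alpha)| \\
&\le \sum_j \mu_j \Big( \sum_\alpha |(e_\alpha, \phi_j)|^2 \Big)^{1/2} \Big( \sum_\alpha |(f_\alpha, \xi_j)|^2 \Big)^{1/2} \\
&\le \sum_j \mu_j \|\phi_j\|\,\|\xi_j\| = \sum_k \mu_k = \|T\|_1.
\end{aligned}$$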
