
Chapter 2 Kernel Methods, Stochastic Processes and Bayesian Non-

2.2 Stochastic Processes

We have concluded our introduction to RKHSs and now move on to discuss stochastic processes. As we will see, the theory of stochastic processes, and especially GPs, is closely intertwined with that of RKHSs. Understanding this relation will be important in the theoretical developments of further chapters.

2.2.1 Introduction to Stochastic Processes

Stochastic processes are one of the major tools used throughout probability theory and statistics, and providing a complete overview of this topic is out of the scope of this thesis. In this chapter, we will mostly focus on the notions which will be useful in the following chapters, and highlight connections with the theory of RKHSs. Further details can be found in the books of Doob [1953]; Gikhman and Skorokhod [1969]; Karlin and Taylor [1975]; Grimmett and Stirzaker [2001]; Koralov and Sinai [2007]; Pavliotis [2014]. See also the paper by Meyer [2009] for a historical overview.

We begin with the definition of a stochastic process (also called a random process).

Definition 2 (Stochastic process). Let (X, B(X)) be a measurable space consisting of an index set X and its corresponding Borel σ-algebra B(X). Let (Ω, F, P) be a probability space, and (Y, G) a measurable space. A stochastic process is a collection {g(x,·) : x ∈ X} such that for each fixed x ∈ X, g(x,·) : Ω → Y is a random variable.

Stochastic processes are informally viewed as random functions. For a fixed x ∈ X, a stochastic process is a Y-valued random variable, whereas for a fixed ω ∈ Ω, it consists of a (deterministic) function g(·, ω) : X → Y.

The set X is known as the index set, whereas Y is the state space of the stochastic process. In the literature, the index set is often denoted T, reflecting the historical focus on random functions of time. However, it is now common for X to be a multidimensional index (e.g. time and space). In particular, when X ⊆ ℝ², the stochastic process is often called a random field. Note that X can be either a finite or an infinite index set.

The stochastic processes that we will look at in later chapters will have X and Y being Euclidean spaces. For this reason, we will limit ourselves to this level of generality for the remainder of the chapter.

A first example of a stochastic process that we have already encountered in this thesis is the discrete-time Markov chain used in MCMC methods, for example the random-walk Metropolis algorithm with Gaussian proposal (see Chapter 1). In this case X is discrete and the process is real-valued. A second example is the Langevin diffusion, which was used to construct the Metropolis-adjusted Langevin algorithm. In this case the index set is one-dimensional and continuous: X = ℝ₊. Furthermore, the discretisation of the diffusion is itself also a stochastic process, but defined on a discrete index set X.
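The two views of a stochastic process (a random variable for each fixed index, a deterministic function for each fixed ω) can be illustrated with a discretised Langevin diffusion. The following is a minimal sketch only: the standard Gaussian target, the Euler-Maruyama scheme, the step size and the function name `euler_maruyama_langevin` are all assumptions made for illustration, not choices taken from the text.

```python
import numpy as np

def euler_maruyama_langevin(n_paths, n_steps, step, rng):
    """Simulate g(x_k, omega_i) for a Langevin diffusion targeting N(0, 1).

    Returns an array of shape (n_paths, n_steps + 1): row i is the sample
    path g(., omega_i), column k is the random variable g(x_k, .).
    """
    paths = np.zeros((n_paths, n_steps + 1))  # every path starts at 0
    for k in range(n_steps):
        drift = -0.5 * paths[:, k]  # (1/2) grad log pi(g) for pi = N(0, 1)
        noise = rng.standard_normal(n_paths)
        paths[:, k + 1] = paths[:, k] + step * drift + np.sqrt(step) * noise
    return paths

rng = np.random.default_rng(0)
paths = euler_maruyama_langevin(n_paths=2000, n_steps=500, step=0.05, rng=rng)
one_path = paths[0]          # fixed omega: a deterministic function of the index
one_marginal = paths[:, -1]  # fixed index: a real-valued random variable
```

Here the discrete index set is {0, 1, ..., 500}. After enough steps, the empirical distribution of `one_marginal` should be close to the assumed N(0, 1) target, up to discretisation bias.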

2.2.2 Characterisations of Stochastic Processes

Now that we have introduced stochastic processes, we can ask how to characterise and classify them further. There are two main ways to characterise stochastic processes: through their finite-dimensional distributions, and through their Karhunen-Loève expansion.

Characterisation via Finite-Dimensional Distributions

The finite-dimensional distributions of a stochastic process are the family of distributions of the $Y^n$-valued random variables $(g(x_1,\cdot),\dots,g(x_n,\cdot))$ for all $n \in \mathbb{N}$ and $\{x_i\}_{i=1}^n \subset X$.

There are several important properties of stochastic processes which are usually specified using the finite-dimensional distributions of the process. For example, we say that a stochastic process is stationary if and only if the finite-dimensional distributions are invariant with respect to shifts in the index set. In other words, the process is stationary if the distribution of $(g(x_1,\cdot),\dots,g(x_n,\cdot))$ is the same as that of $(g(x_1+x_0,\cdot),\dots,g(x_n+x_0,\cdot))$ for all $n \in \mathbb{N}$, $\{x_i\}_{i=1}^n \subset X$ and $x_0 \in X$ such that $x_i + x_0 \in X$ for all $i = 1,\dots,n$.

Clearly it is important to understand whether the relation between stochastic processes and their finite-dimensional distributions is one-to-one. The answer is yes under certain regularity conditions, which are provided by the theorem below. The theorem is stated for real-valued stochastic processes, but it can be significantly generalised; see Dudley [2002].

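Theorem 5 (Kolmogorov Consistency Theorem, Koralov and Sinai [2007], Theorem 12.8). Let $\{P_{\{x_i\}_{i=1}^n} \mid \{x_i\}_{i=1}^n \subset X,\ n \in \mathbb{N}\}$ be a family of distributions, each associated to the product σ-algebra $B(\mathbb{R}^n)$. Suppose these satisfy:

• For every $n \in \mathbb{N}$, points $\{x_i\}_{i=1}^n \subset X$, Borel sets $A_1,\dots,A_n \in B(\mathbb{R})$ and permutation $\pi$ of $\{1,\dots,n\}$, writing $x'_i = x_{\pi(i)}$ and $A'_i = A_{\pi(i)}$:

\[ P_{\{x_i\}_{i=1}^n}\big[(g(x_1,\cdot),\dots,g(x_n,\cdot)) \in A_1 \times \dots \times A_n\big] = P_{\{x'_i\}_{i=1}^n}\big[(g(x'_1,\cdot),\dots,g(x'_n,\cdot)) \in A'_1 \times \dots \times A'_n\big]. \]

• For every $n \in \mathbb{N}$, points $\{x_i\}_{i=1}^{n+1} \subset X$ and Borel sets $A_1,\dots,A_n \in B(\mathbb{R})$:

\[ P_{\{x_i\}_{i=1}^n}\big[(g(x_1,\cdot),\dots,g(x_n,\cdot)) \in A_1 \times \dots \times A_n\big] = P_{\{x_i\}_{i=1}^{n+1}}\big[(g(x_1,\cdot),\dots,g(x_n,\cdot),g(x_{n+1},\cdot)) \in A_1 \times \dots \times A_n \times \mathbb{R}\big]. \]

Then there is a unique stochastic process whose finite-dimensional distributions coincide with this collection.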

As a first example, we return to the Markov chains introduced in the previous chapter (the random-walk Metropolis algorithm and Metropolis-adjusted Langevin algorithm). In both cases, the finite-dimensional distributions are given by

\[ P_{\{x_i\}_{i=1}^n}\big[(g(x_1,\cdot),\dots,g(x_n,\cdot)) \in A_1 \times \dots \times A_n\big] = \int_{A_1} \cdots \int_{A_n} T(\mathrm{d}x_1,\mathrm{d}x_0) \times \dots \times T(\mathrm{d}x_n,\mathrm{d}x_{n-1}) \]

for any Borel sets $A_1,\dots,A_n \in B(\mathbb{R})$, where T denotes the transition kernel of the chain.

This theorem also allows us to introduce our first characterisation of GPs. A real-valued GP is a stochastic process $g : X \times \Omega \to \mathbb{R}$ such that all the finite-dimensional distributions are Gaussian, i.e. $(g(x_1,\cdot),\dots,g(x_n,\cdot))$ is an $N(m_n, c_n)$ random variable for some n-dimensional vector $m_n$ and some $n \times n$ symmetric non-negative definite matrix $c_n$, for all $n \in \mathbb{N}$ and $\{x_i\}_{i=1}^n \subset X$. GPs will be the basis of most of the work in later chapters. Extended introductions can be found in Adler [1990]; Stein [1999]; Rasmussen and Williams [2006]. An important property is that two GPs defined on the same measurable space are either equivalent or mutually singular [Feldman, 1958].
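To make the finite-dimensional characterisation of a GP concrete, the sketch below draws from the Gaussian distribution N(m_n, c_n) at a handful of index points. The zero mean, the squared-exponential covariance, the lengthscale and the helper names are assumptions chosen purely for illustration.

```python
import numpy as np

def squared_exponential(x, y, lengthscale=0.5):
    """Squared-exponential covariance c(x, y) on the real line (assumed choice)."""
    return np.exp(-0.5 * (x[:, None] - y[None, :]) ** 2 / lengthscale**2)

def sample_gp_marginal(points, n_samples, rng, jitter=1e-10):
    """Draw from N(m_n, c_n), the finite-dimensional distribution at `points`."""
    n = len(points)
    m_n = np.zeros(n)                           # zero mean vector (assumed)
    c_n = squared_exponential(points, points)   # n x n covariance matrix
    # A small jitter keeps the Cholesky factorisation numerically stable.
    chol = np.linalg.cholesky(c_n + jitter * np.eye(n))
    xi = rng.standard_normal((n, n_samples))
    return m_n[:, None] + chol @ xi, c_n

rng = np.random.default_rng(1)
points = np.linspace(0.0, 1.0, 5)
samples, c_n = sample_gp_marginal(points, n_samples=50_000, rng=rng)
empirical_cov = np.cov(samples)  # rows are the variables g(x_i, .)
```

The marginalisation condition of Theorem 5 is visible in this construction: dropping an index point simply removes the corresponding row and column of `c_n`.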

Another example of a stochastic process is the Dirichlet process [Ferguson, 1973]. We say a stochastic process is a Dirichlet process with base measure G and concentration parameter α > 0 if and only if its finite-dimensional distributions are Dirichlet distributions; i.e. given any finite measurable partition $(X_1,\dots,X_n)$ of X, the vector $(g(X_1,\cdot),\dots,g(X_n,\cdot))$ is $\mathrm{Dir}(\alpha G(X_1),\dots,\alpha G(X_n))$ distributed. Here, Dir denotes a Dirichlet distribution. Note that this case requires a more general version of the Kolmogorov consistency theorem (also known as the Kolmogorov extension theorem) than that presented in this thesis (see for example Dudley [2002]).
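The finite-dimensional distributions of a Dirichlet process can be sampled directly for any fixed partition. The sketch below assumes X = [0, 1], a uniform base measure G, a concentration parameter of 2 and a three-cell partition; none of these choices come from the text.

```python
import numpy as np

alpha = 2.0                              # concentration parameter (assumed value)
edges = np.array([0.0, 0.2, 0.5, 1.0])   # partition of X = [0, 1] into 3 intervals
base_mass = np.diff(edges)               # G(X_i) for G = Uniform[0, 1] (assumed)

rng = np.random.default_rng(2)
# Each row is one draw of the vector (g(X_1, .), g(X_2, .), g(X_3, .)).
draws = rng.dirichlet(alpha * base_mass, size=20_000)

# Merging the cells X_2 and X_3 gives the coarser partition (X_1, X_2 u X_3).
merged = np.column_stack([draws[:, 0], draws[:, 1] + draws[:, 2]])
```

By the aggregation property of the Dirichlet distribution, `merged` is distributed as Dir(αG(X_1), αG(X_2 ∪ X_3)). This is the kind of consistency across partitions that an extension theorem requires.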

Characterisation via the Karhunen-Loève Expansion

A second characterisation of stochastic processes is as an infinite series of basis functions with random coefficients, called a Karhunen-Loève expansion [Loève, 1978]. This expansion depends on the first two moments of the stochastic process, which are the mean function m : X → Y and the covariance function c : X × X → Y. Denote by $\mathbb{E}_P[U]$ the expectation of some random variable U under P. The mean and covariance functions are defined as:

\[ m(x) := \mathbb{E}_P[g(x,\omega)], \]

\[ c(x,y) := \mathbb{E}_P\big[(g(x,\omega) - m(x))\,(g(y,\omega) - m(y))\big]. \]

Suppose that $g : X \times \Omega \to \mathbb{R}$ is a stochastic process such that for all $x, y \in X$: (i) $g(x,\cdot) \in L^2(\Omega; P)$, (ii) $m(x) = 0$ and (iii) the covariance function $c(x,y)$ is a continuous function of both x and y. Then:

\[ g(x,\omega) = \sum_{j=1}^{\infty} Z_j(\omega)\,\psi_j(x), \]

where $\{\psi_j\}_{j=1}^{\infty}$ are orthonormal eigenfunctions of the Hilbert-Schmidt operator $C : L^2(X) \to L^2(X)$ defined as $C[f](x) := \int_X c(x,y) f(y)\,\mathrm{d}y$, and the eigenvalues $\{\lambda_j\}_{j=1}^{\infty}$ are non-negative (assumed without loss of generality to be ordered $\lambda_1 \geq \lambda_2 \geq \dots$). The convergence of the series is in $L^2(\Omega; P)$ and uniform over compact subsets of X, with:

\[ Z_j(\omega) = \int_X g(x,\omega)\,\psi_j(x)\,\mathrm{d}x. \]

Furthermore, the random variables $Z_j$ are centred, uncorrelated, and have variance $\lambda_j$: $\mathbb{E}_P[Z_j] = 0$ and $\mathbb{E}_P[Z_j Z_k] = \lambda_j \delta_{jk}$.
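The second-moment property of the coefficients follows from a short calculation, sketched here under the assumption that expectation and integration can be exchanged (Fubini), which is where the integrability conditions on g enter:

```latex
\begin{align*}
\mathbb{E}_P[Z_j Z_k]
  &= \mathbb{E}_P\left[\int_X g(x,\omega)\,\psi_j(x)\,\mathrm{d}x
       \int_X g(y,\omega)\,\psi_k(y)\,\mathrm{d}y\right] \\
  &= \int_X \psi_j(x) \left(\int_X c(x,y)\,\psi_k(y)\,\mathrm{d}y\right) \mathrm{d}x
     && \text{(Fubini, using } m = 0\text{)} \\
  &= \int_X \psi_j(x)\, C[\psi_k](x)\,\mathrm{d}x
   = \lambda_k \int_X \psi_j(x)\,\psi_k(x)\,\mathrm{d}x
   = \lambda_j \delta_{jk}.
\end{align*}
```

The last two equalities use that each eigenfunction satisfies C[ψ_k] = λ_k ψ_k and that the eigenfunctions are orthonormal in L²(X).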

This characterisation can be particularly useful for approximating the stochastic process. First, it separates the stochastic part of the process (the coefficients $Z_j$) from its deterministic part (the eigenfunctions $\psi_j$). Furthermore, since we have assumed that the eigenvalues are in decreasing order, a truncation $\sum_{j=1}^{L} Z_j(\omega)\,\psi_j(x)$ of this series, for $L > 0$, is the best L-dimensional approximation of the stochastic process in an $L^2(\Omega; P)$ sense. Such a truncation is therefore the analogue of principal component analysis for stochastic processes. The truncation can also be useful for approximate sampling of a stochastic process. Indeed, all that is required is to sample the random variables $\{Z_j\}_{j=1}^{L}$. See Huang et al. [2001] for a detailed study.
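When the eigenpairs of the covariance operator are not available in closed form, one option is to approximate them numerically, which makes the analogy with principal component analysis explicit. The sketch below is an assumption-laden illustration rather than a method from the text: it takes X = [0, 1], replaces the integral operator by a midpoint quadrature rule, and uses the Brownian motion covariance min(x, y) only because its eigenvalues are known and can serve as a check.

```python
import numpy as np

def discretised_eigenpairs(cov, n_grid, n_terms):
    """Approximate the leading eigenpairs of C[f](x) = int_0^1 c(x, y) f(y) dy.

    The integral is replaced by a midpoint rule on n_grid points, so the
    operator becomes the matrix cov(x_i, x_j) / n_grid.
    """
    grid = (np.arange(n_grid) + 0.5) / n_grid
    matrix = cov(grid[:, None], grid[None, :]) / n_grid
    values, vectors = np.linalg.eigh(matrix)     # eigenvalues in ascending order
    order = np.argsort(values)[::-1][:n_terms]   # indices of the largest ones
    # Rescale so each eigenfunction has unit discrete L2([0, 1]) norm.
    return grid, values[order], vectors[:, order] * np.sqrt(n_grid)

def brownian_cov(x, y):
    return np.minimum(x, y)

grid, eigenvalues, eigenfunctions = discretised_eigenpairs(
    brownian_cov, n_grid=400, n_terms=3
)
# Known eigenvalues of the operator with kernel min(x, y) on [0, 1].
exact = 1.0 / ((np.arange(1, 4) - 0.5) ** 2 * np.pi**2)
# The trace of C is int_0^1 min(x, x) dx = 1/2, i.e. the total variance.
explained = eigenvalues.sum() / 0.5
```

In this example, `explained` reports the share of the total variance captured by the first three terms, much as one would report explained variance in principal component analysis.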

The Karhunen-Loève characterisation therefore provides us with a second definition of a GP as the series $g(x,\omega) := \sum_{j=1}^{\infty} \sqrt{\lambda_j}\,\xi_j(\omega)\,\psi_j(x)$, where $\{\xi_j\}_{j=1}^{\infty}$ are IID $N(0,1)$ random variables and $\{\lambda_j, \psi_j\}_{j=1}^{\infty}$ are the eigenvalues and eigenfunctions of the Hilbert-Schmidt operator C.
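A truncated series of this kind gives a simple approximate sampler. The sketch below assumes coefficients of the form √λ_j ξ_j with ξ_j IID N(0, 1), which is consistent with the variance λ_j stated for the Z_j, and uses Brownian motion on [0, 1] as the example because its eigenpairs are known in closed form. The function names and the truncation level are illustrative choices.

```python
import numpy as np

def brownian_kl_basis(x, n_terms):
    """Eigenpairs of the covariance operator with c(x, y) = min(x, y) on [0, 1]."""
    j = np.arange(1, n_terms + 1)
    freq = (j - 0.5) * np.pi
    eigenvalues = 1.0 / freq**2                                 # lambda_j, decreasing
    eigenfunctions = np.sqrt(2.0) * np.sin(np.outer(x, freq))   # psi_j(x_i)
    return eigenvalues, eigenfunctions

def sample_truncated_kl(x, n_terms, n_samples, rng):
    """Sample sum_{j <= L} sqrt(lambda_j) xi_j psi_j(x) with xi_j IID N(0, 1)."""
    eigenvalues, eigenfunctions = brownian_kl_basis(x, n_terms)
    xi = rng.standard_normal((n_terms, n_samples))
    return eigenfunctions @ (np.sqrt(eigenvalues)[:, None] * xi)

x = np.linspace(0.0, 1.0, 11)
eigenvalues, eigenfunctions = brownian_kl_basis(x, n_terms=200)
# Covariance implied by the truncation: sum_j lambda_j psi_j(x) psi_j(y).
truncated_cov = (eigenfunctions * eigenvalues) @ eigenfunctions.T
exact_cov = np.minimum.outer(x, x)

rng = np.random.default_rng(3)
paths = sample_truncated_kl(x, n_terms=200, n_samples=1000, rng=rng)
```

Each column of `paths` is one approximate sample path evaluated on the grid `x`. Comparing `truncated_cov` with `exact_cov` shows how quickly the truncated covariance approaches min(x, y) as the number of terms grows.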

2.2.3 Connection Between Kernels and Covariance Functions

As hinted at previously, there is a close relationship between reproducing kernels and covariance functions. Consider without loss of generality a stochastic process with m = 0 and covariance function c. We say that a stochastic process is a second-order stochastic process if $\mathbb{E}_P[|g(x,\omega)|^2] < \infty$ for all x ∈ X (i.e. the process has finite second moment). It turns out that reproducing kernels correspond to covariance functions of second-order stochastic processes:

Theorem 7 (Loève's Theorem, Loève [1978], p. 132). A function c : X × X → ℝ is the covariance function of a second-order stochastic process if and only if it is positive definite.
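Positive definiteness can be probed numerically by inspecting Gram matrices on finite point sets. The sketch below is illustrative only: the two candidate functions and the random points are assumptions. A finite check of this kind can show that a function is not positive definite, but it can never prove that it is.

```python
import numpy as np

def gram_matrix(c, points):
    """Matrix with entries c(x_i, x_j) for the given points."""
    return c(points[:, None], points[None, :])

def smallest_gram_eigenvalue(c, points):
    """Smallest eigenvalue of the symmetric Gram matrix of c at `points`."""
    return np.linalg.eigvalsh(gram_matrix(c, points)).min()

def brownian_cov(x, y):
    return np.minimum(x, y)    # covariance function of Brownian motion

def negative_distance(x, y):
    return -np.abs(x - y)      # symmetric, but not positive definite

rng = np.random.default_rng(4)
points = rng.uniform(0.1, 1.0, size=20)
min_eig_brownian = smallest_gram_eigenvalue(brownian_cov, points)
min_eig_negative = smallest_gram_eigenvalue(negative_distance, points)
```

The Gram matrix of `negative_distance` has zero trace and is not the zero matrix, so it must have a negative eigenvalue. By the theorem, this function cannot be the covariance function of any second-order stochastic process.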

Focusing on the special case of Gaussian processes, we have that for any pair of mean function m and reproducing kernel k, there exists a GP with mean m and covariance k, and vice versa; see Theorem 12.1.3 in Dudley [2002].

An important point, however, is that a realisation of a Gaussian process (or in fact of any second-order stochastic process) will usually not lie in the RKHS associated with its kernel/covariance function. Several conditions for these functions to lie in the RKHS are provided in [Driscoll, 1973; Lukić and Beder, 2001; Pillai et al., 2007]. See also the extended discussion in Kanagawa et al. [2018].
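A heuristic calculation suggests why this happens for GPs. It is a sketch under two assumptions not stated above: that the GP admits a series representation with coefficients √λ_j ξ_j, where the ξ_j are IID N(0, 1), and that the RKHS norm has the Mercer-type form ‖Σ_j a_j ψ_j‖² = Σ_j a_j² / λ_j with infinitely many λ_j > 0:

```latex
\begin{align*}
\|g(\cdot,\omega)\|_{\mathcal{H}_k}^2
  = \sum_{j=1}^{\infty} \frac{\big(\sqrt{\lambda_j}\,\xi_j(\omega)\big)^2}{\lambda_j}
  = \sum_{j=1}^{\infty} \xi_j(\omega)^2
  = \infty \quad \text{almost surely.}
\end{align*}
```

The divergence follows from the strong law of large numbers, since the partial sums of the ξ_j² grow linearly in the number of terms. Under these assumptions, the sample paths therefore have infinite RKHS norm with probability one.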

In document Statistical computation with kernels (Page 46-51)