Markov Chain Monte Carlo (MCMC)

MCMC is a very general algorithm for sampling from almost any distribution, including ones that cannot be sampled directly. For example, there is no simple method for sampling models w from the posterior distribution except in specialized cases (e.g., when the posterior is Gaussian).

MCMC is an iterative algorithm that, given a sample x_t ∼ p(x), modifies that sample to produce a new sample x_{t+1} ∼ p(x). This modification is done using a proposal distribution q(x′|x) that, given an x, randomly selects a “mutation” to x. This proposal distribution may be almost anything, and it is up to the user of the algorithm to choose it; a common choice is simply a Gaussian centered at x: q(x′|x) = N(x′|x, σ²I).

The entire algorithm is:

    select initial point x_1
    t ← 1
    loop
        Sample x′ ∼ q(x′|x_t)
        α ← [P∗(x′) q(x_t|x′)] / [P∗(x_t) q(x′|x_t)]
        Sample u ∼ Uniform[0, 1]
        if u ≤ α then
            x_{t+1} ← x′
        else
            x_{t+1} ← x_t
        end if
        t ← t + 1
    end loop
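The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the 1D target p_star (a standard Gaussian without its normalizing constant), the function name metropolis_hastings, and the values of sigma, num_steps and seed are all assumptions chosen for the example. Because the Gaussian proposal is symmetric, the two q terms in α cancel, which the code relies on.

```python
import math
import random

def p_star(x):
    """Assumed example target: an unnormalized standard Gaussian density."""
    return math.exp(-0.5 * x * x)

def metropolis_hastings(p_star, x1, num_steps, sigma=1.0, seed=0):
    """Run the sampling loop with a Gaussian proposal centered at the current sample."""
    rng = random.Random(seed)
    x = x1
    samples = [x]
    for _ in range(num_steps):
        # Sample x' ~ q(x'|x_t) = N(x'|x_t, sigma^2)
        x_new = rng.gauss(x, sigma)
        # Acceptance ratio; q(x_t|x') / q(x'|x_t) = 1 for this symmetric proposal
        alpha = p_star(x_new) / p_star(x)
        # Accept if u <= alpha, otherwise keep the current sample
        if rng.random() <= alpha:
            x = x_new
        samples.append(x)
    return samples

samples = metropolis_hastings(p_star, x1=0.0, num_steps=20000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"sample mean: {mean:.3f}, sample variance: {var:.3f}")
```

For this target the sample mean and variance should come out close to 0 and 1. Note that only ratios of p_star are used, so the normalizing constant of the target is never needed.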

Amazingly, it can be shown that, if x_1 is a sample from p(x), then every subsequent x_t is also a sample from p(x), if they are considered in isolation. The samples are correlated with each other via the Markov chain, but the marginal distribution of any individual sample is p(x).
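One standard way to see this is the detailed balance argument, sketched below. The symbols T (transition density to a different point x′ ≠ x) and r (probability of staying put after a rejection) are notation introduced only for this sketch, and the test u ≤ α is read as accepting with probability min(1, α). Since P∗ differs from p only by a constant factor, the ratio α is the same with either.

```latex
% Transition density for an accepted move from x to x' (x' \neq x):
%   T(x'|x) = q(x'|x) \min(1, \alpha)
\begin{align*}
p(x)\, T(x'|x)
  &= p(x)\, q(x'|x)\, \min\!\left(1,\ \frac{p(x')\, q(x|x')}{p(x)\, q(x'|x)}\right) \\
  &= \min\big(p(x)\, q(x'|x),\ p(x')\, q(x|x')\big) \\
  &= p(x')\, T(x|x')
\end{align*}
% The last line follows because the middle expression is symmetric in x and x'.
% With r(x) = 1 - \int T(x'|x)\, dx' the probability of rejecting at x,
% the marginal density of the next sample is
\begin{align*}
\int p(x)\, T(x'|x)\, dx + p(x')\, r(x')
  &= p(x') \int T(x|x')\, dx + p(x') \left(1 - \int T(x|x')\, dx\right) \\
  &= p(x')
\end{align*}
```

So if x_t is distributed according to p, then x_{t+1} is as well.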

So far we have assumed that x_1 is a sample from the target distribution, but, of course, obtaining this first sample is itself difficult. Instead, we must perform a process called burn-in: we initialize with any x_1, and then discard the first T samples obtained by the algorithm; if we pick a large enough value of T, the remaining samples can be treated as valid samples from the target distribution. However, there is no exact method for determining a sufficient T, and so heuristics and/or experimentation must be used.
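The effect of burn-in can be illustrated with a small experiment. The sketch below is only an illustration under assumed settings: the 1D unnormalized Gaussian target p_star, the deliberately poor starting point x1 = 20, and the burn-in length T = 500 are all arbitrary choices for the example, not recommended values.

```python
import math
import random

def p_star(x):
    """Assumed example target: an unnormalized standard Gaussian density."""
    return math.exp(-0.5 * x * x)

def run_chain(x1, num_steps, sigma=1.0, seed=1):
    """Sampling loop with a symmetric Gaussian proposal (the q terms cancel in alpha)."""
    rng = random.Random(seed)
    x = x1
    chain = [x]
    for _ in range(num_steps):
        x_new = rng.gauss(x, sigma)
        alpha = p_star(x_new) / p_star(x)
        if rng.random() <= alpha:
            x = x_new
        chain.append(x)
    return chain

chain = run_chain(x1=20.0, num_steps=20000)  # start far from the bulk of the target
T = 500                                      # hypothetical burn-in length
kept = chain[T:]                             # discard the first T samples

early_mean = sum(chain[:20]) / 20
kept_mean = sum(kept) / len(kept)
print(f"mean of first 20 samples: {early_mean:.2f}")
print(f"mean after discarding {T} samples: {kept_mean:.3f}")
```

The earliest samples still reflect the arbitrary starting point, while the mean of the retained samples should be close to the target mean of 0. In practice T is usually chosen by inspecting plots of the chain or similar heuristics.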


Figure 19: MCMC applied to a 2D elliptical Gaussian with a proposal distribution consisting of a circular Gaussian centered on the previous sample. Green lines indicate accepted proposals while red lines indicate rejected ones. (Figure from Pattern Recognition and Machine Learning by Chris Bishop.)

13 Principal Components Analysis

We now discuss an unsupervised learning algorithm, called Principal Components Analysis, or PCA. The method is unsupervised because we are learning a mapping without any examples of what the mapping looks like; all we see are the outputs, and we want to estimate both the mapping and the inputs.

PCA is primarily a tool for dealing with high-dimensional data. If our measurements are 17-dimensional, or 30-dimensional, or 10,000-dimensional, manipulating the data can be extremely difficult. Quite often, the actual data can be described by a much lower-dimensional representation that captures nearly all of the structure of the data. PCA is perhaps the simplest approach for finding such a representation, and yet it is also very fast and effective, which is why it is so widely used.

There are several ways in which PCA can help:

• Visualization: PCA provides a way to visualize the data, by projecting the data down to two or three dimensions that you can plot, in order to get a better sense of the data. Furthermore, the principal component vectors sometimes provide insight as to the nature of the data as well.

• Preprocessing: Learning complex models of high-dimensional data is often very slow, and also prone to overfitting — the number of parameters in a model is usually exponential in the number of dimensions, meaning that very large data sets are required for higher-dimensional models. This problem is generally called the curse of dimensionality. PCA can be used to first map the data to a low-dimensional representation before applying a more sophisticated algorithm to it. With PCA one can also whiten the representation, which rebalances the weights of the data to give better performance in some cases.

• Modeling: PCA learns a representation that is sometimes used as an entire model, e.g., a prior distribution for new data.

• Compression: PCA can be used to compress data, by replacing data with its low-dimensional representation.

13.1 The model and learning

In PCA, we assume we are given N data vectors {y_i}, where each vector is D-dimensional: y_i ∈ R^D. Our goal is to replace these vectors with lower-dimensional vectors {x_i} with dimensionality C, where C < D. We assume that they are related by a linear transformation:

$$ y = Wx + b = \sum_{j=1}^{C} w_j x_j + b \tag{240} $$

The matrix W can be viewed as containing a set of C basis vectors: W = [w_1, ..., w_C]. If we also assume Gaussian noise in the measurements, this model is the same as the linear regression model studied earlier, but now the x’s are unknown in addition to the linear parameters.

To learn the model, we solve the following constrained least-squares problem:

$$ \arg\min_{W,\, b,\, \{x_i\}} \sum_i \| y_i - (W x_i + b) \|^2 \tag{241} $$

$$ \text{subject to } W^T W = I \tag{242} $$

The constraint W^T W = I requires that we obtain an orthonormal mapping W; it is equivalent to saying that

$$ w_i^T w_j = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases} \tag{243} $$

This constraint is required to resolve an ambiguity in the mapping: if we did not require W to be orthonormal, then the objective function is underconstrained (why?). Note that an ambiguity remains in the learning even with this constraint (which one?), but this ambiguity is not very important.

The x coordinates are often called latent coordinates.

The algorithm for minimizing this objective function is as follows:

1. Let b = (1/N) Σ_i y_i

2. Let K = (1/N) Σ_i (y_i − b)(y_i − b)^T

3. Let V Λ V^T = K be the eigenvector decomposition of K. Λ is a diagonal matrix of eigenvalues (Λ = diag(λ_1, ..., λ_D)). The matrix V contains the eigenvectors, V = [V_1, ..., V_D], and is orthonormal: V^T V = I.

4. Assume that the eigenvalues are sorted from largest to smallest (λ_i ≥ λ_{i+1}). If this is not the case, sort them (and their corresponding eigenvectors).

5. Let W be a matrix of the first C eigenvectors: W = [V_1, ..., V_C].

6. Let x_i = W^T (y_i − b), for all i.
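The six steps above can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions: the function name pca, the convention of storing the data vectors y_i as rows of an array Y, and the synthetic test data are all choices made for this example. np.linalg.eigh is used because K is symmetric; it returns eigenvalues in ascending order, so step 4 reorders them.

```python
import numpy as np

def pca(Y, C):
    """PCA on an (N, D) array Y whose rows are the data vectors y_i."""
    N = Y.shape[0]
    b = Y.mean(axis=0)                  # step 1: b = (1/N) sum_i y_i
    Yc = Y - b                          # centered data, rows are (y_i - b)
    K = Yc.T @ Yc / N                   # step 2: K = (1/N) sum_i (y_i - b)(y_i - b)^T
    eigvals, V = np.linalg.eigh(K)      # step 3: K = V diag(eigvals) V^T
    order = np.argsort(eigvals)[::-1]   # step 4: sort from largest to smallest
    eigvals, V = eigvals[order], V[:, order]
    W = V[:, :C]                        # step 5: first C eigenvectors as columns
    X = Yc @ W                          # step 6: row i is x_i = W^T (y_i - b)
    return W, b, X, eigvals

# Synthetic example (assumed for illustration): 3-D points lying close to a 2-D plane.
rng = np.random.default_rng(0)
X_true = rng.normal(size=(500, 2)) * np.array([3.0, 1.0])
W_true, _ = np.linalg.qr(rng.normal(size=(3, 2)))   # orthonormal basis of a random plane
Y = X_true @ W_true.T + np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=(500, 3))

W, b, X, eigvals = pca(Y, C=2)
Y_hat = X @ W.T + b                                  # reconstruction y = W x + b
recon_error = np.mean(np.sum((Y - Y_hat) ** 2, axis=1))

print("W^T W = I:", np.allclose(W.T @ W, np.eye(2)))
print("eigenvalues:", eigvals)
print("mean squared reconstruction error:", recon_error)
```

Because the synthetic data are nearly planar, the third eigenvalue should be tiny compared with the first two, and reconstructing from the C = 2 latent coordinates should lose very little.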
