
2.3 Machine Learning Background

2.3.8 Inference

Inference of the parameters is an essential part of a learning mechanism. The parameters of a model or, if needed, the latent variables in the data are inferred using various approaches, such as MAP or ML (discussed in Section 2.3.2 and Section 2.3.1), which give a point estimate for the parameters, as mentioned above. However, it is sometimes necessary to characterise the parameters more fully by estimating their posterior probabilities. A thorough Bayesian inference requires estimating the distribution over the possible values of the parameters instead of a point estimate.

One common way to estimate the posterior distributions of the parameters is to draw random samples from them. Drawing random samples from a distribution is called sampling. Markov Chain Monte Carlo (MCMC) methods make up a large share of the sampling algorithms used in machine learning and are presented briefly in the following section.
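The contrast between a point estimate and a sampled posterior can be sketched with a small example. The coin-flip model, the Beta prior, the observation counts and the function name `posterior_samples` are all assumptions made for illustration; they do not come from the text. In this conjugate case the posterior is a Beta distribution that can be sampled directly, so no MCMC is needed yet.

```python
import random

def posterior_samples(heads, tails, a=1.0, b=1.0, n_samples=10_000, seed=0):
    """Draw samples of theta from the Beta(a + heads, b + tails) posterior."""
    rng = random.Random(seed)
    return [rng.betavariate(a + heads, b + tails) for _ in range(n_samples)]

heads, tails = 7, 3   # hypothetical observations
a, b = 2.0, 2.0       # hypothetical Beta prior parameters

# Point estimates: a single value for theta.
ml_estimate = heads / (heads + tails)
map_estimate = (heads + a - 1) / (heads + tails + a + b - 2)

# Sampling: a whole distribution over theta, summarised here by its mean
# and a central 95% interval read off the sorted samples.
samples = sorted(posterior_samples(heads, tails, a, b))
posterior_mean = sum(samples) / len(samples)
lower = samples[int(0.025 * len(samples))]
upper = samples[int(0.975 * len(samples))]

print(f"ML estimate:    {ml_estimate:.3f}")
print(f"MAP estimate:   {map_estimate:.3f}")
print(f"Posterior mean: {posterior_mean:.3f}")
print(f"95% interval:   [{lower:.3f}, {upper:.3f}]")
```

The interval is the information that a point estimate cannot provide: it indicates how much uncertainty about the parameter remains after seeing the data.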

2.3.8.1 Markov Chain Monte Carlo (MCMC)

MCMC algorithms are designed to explore complex probability distributions. They are usually used in Bayesian statistics, where the posterior distribution that needs to be modelled is unknown. In an MCMC algorithm, samples are drawn from a sequence of probability distributions, and the samples form a Markov chain. A Markov chain is made up of a sequence of states. Let the sequence of states be $X = \{X_1, X_2, \ldots, X_n\}$. With the Markov property, each state depends only on the previous state:

$$
p(X_{n+1} = x \mid X_1 = x_1, \ldots, X_n = x_n) = p(X_{n+1} = x \mid X_n = x_n) \qquad (2.30)
$$

With random sampling from the distribution being estimated, the Markov chain should converge to a distribution over states, which is called the equilibrium distribution. Gibbs sampling and the Metropolis-Hastings algorithm are two prominent examples of MCMC algorithms.
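Convergence to an equilibrium can be illustrated with a minimal simulation. The two-state transition table below is a hypothetical example, not taken from the text; for these particular probabilities the equilibrium distribution works out to (5/6, 1/6), and the visit frequencies of a long chain should approach it.

```python
import random

# Hypothetical two-state chain: TRANSITION[i][j] = p(X_{n+1} = j | X_n = i).
TRANSITION = [[0.9, 0.1],
              [0.5, 0.5]]

def simulate_chain(n_steps, start=0, seed=0):
    """Run the chain and return the fraction of steps spent in each state."""
    rng = random.Random(seed)
    state = start
    counts = [0, 0]
    for _ in range(n_steps):
        # The next state depends only on the current state (Markov property).
        state = 0 if rng.random() < TRANSITION[state][0] else 1
        counts[state] += 1
    return [c / n_steps for c in counts]

freqs = simulate_chain(100_000)
print("visit frequencies:", freqs)
print("equilibrium:      ", [5 / 6, 1 / 6])
```

Starting the chain in state 1 instead of state 0 should give nearly the same frequencies, since the equilibrium does not depend on the starting state.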

2.3.8.1.1 Metropolis-Hastings Algorithm

The Metropolis-Hastings algorithm was first proposed by Metropolis et al. (1953) for the Boltzmann distribution and was extended to other types of distributions by Hastings (1970). The algorithm is based on random draws from a series of distributions, where, after a number of iterations, the distribution from which samples are drawn becomes the target distribution. Each drawn sample is subjected to an acceptance-rejection rule: the sample is either added to the Markov chain or rejected, in which case another sample is drawn. Let a Markov chain consist of states $X = \{\ldots, X^{(t-2)}, X^{(t-1)}, X^{(t)}\}$ at various time intervals. To determine the following state $X_1^{(t+1)}$ in the next time interval $(t+1)$, a new state is generated that depends only on the current state $X^{(t)}$. The transition is based on a proposal distribution $q(X \mid X_1^{(t)})$. The new state $X_1^{(t+1)}$ is accepted if:

$$
\alpha < \frac{p(X_1^{(t+1)})}{p(X_1^{(t)})} \, \frac{q(X_1^{(t)} \mid X_1^{(t+1)})}{q(X_1^{(t+1)} \mid X_1^{(t)})} \qquad (2.31)
$$

where $\alpha$ is a random value drawn from $\alpha \sim \mathrm{Uniform}(0, 1)$. Otherwise, the system stays in the same state $X_1^{(t)}$ and a new sample is drawn to be added to the Markov chain.

Occasionally accepting a new state even though its probability is lower than that of the previous state helps the sampler mix well. A new state that does not satisfy the acceptance condition is rejected. New states are proposed iteratively until the distribution from which the new values are sampled converges to the target distribution.
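The acceptance rule can be sketched in a few lines. This is a minimal illustration under stated assumptions: the target (a standard normal density with its constant dropped), the Gaussian random-walk proposal, and the names `unnormalised_target`, `proposal_density` and `metropolis_hastings` are all chosen here for the example and are not from the text. On rejection the sketch records the current state again, which is the usual convention.

```python
import math
import random

def unnormalised_target(x):
    """Hypothetical target p: a standard normal density without its constant."""
    return math.exp(-0.5 * x * x)

def proposal_density(x_to, x_from, step):
    """q(x_to | x_from): Gaussian random walk centred on the current state."""
    z = (x_to - x_from) / step
    return math.exp(-0.5 * z * z) / (step * math.sqrt(2.0 * math.pi))

def metropolis_hastings(n_samples, step=1.0, start=0.0, seed=0):
    rng = random.Random(seed)
    current = start
    chain = []
    for _ in range(n_samples):
        candidate = rng.gauss(current, step)  # draw from q(X | current)
        # Target ratio times reversed-over-forward proposal ratio.
        ratio = (unnormalised_target(candidate) / unnormalised_target(current)) * (
            proposal_density(current, candidate, step)
            / proposal_density(candidate, current, step)
        )
        alpha = rng.random()                  # alpha ~ Uniform(0, 1)
        if alpha < ratio:
            current = candidate               # accept the new state
        chain.append(current)                 # on rejection, repeat the old state
    return chain

chain = metropolis_hastings(20_000)
mean = sum(chain) / len(chain)
variance = sum((x - mean) ** 2 for x in chain) / len(chain)
print(f"sample mean {mean:.3f}, sample variance {variance:.3f}")
```

Because this proposal is symmetric, the two proposal terms cancel and the ratio reduces to the ratio of target values; they are kept in the code only to mirror the general form of the rule.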

One advantage of the Metropolis-Hastings algorithm is that any integral appearing in a normalisation constant cancels out, because only the ratio of the probabilities of the two states is used. The algorithm is therefore convenient in problems where computing a normalisation constant through integration is computationally expensive.
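The cancellation can be written out explicitly. Suppose the target is known only up to a constant, $p(X) = \tilde{p}(X) / Z$ with $Z = \int \tilde{p}(X)\, dX$; the symbols $\tilde{p}$ and $Z$ are introduced here for illustration and do not appear in the text. The ratio of target probabilities in (2.31) then becomes:

$$
\frac{p(X_1^{(t+1)})}{p(X_1^{(t)})} = \frac{\tilde{p}(X_1^{(t+1)}) / Z}{\tilde{p}(X_1^{(t)}) / Z} = \frac{\tilde{p}(X_1^{(t+1)})}{\tilde{p}(X_1^{(t)})}
$$

so the acceptance test only ever evaluates $\tilde{p}$, and $Z$ never has to be computed.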

The proposal distribution should be chosen to ensure that the Metropolis-Hastings algorithm produces an ergodic Markov chain. The ergodicity of a Markov chain ensures that it converges to a stationary distribution after a number of iterations. An ergodic chain must be aperiodic and irreducible. A state in a Markov chain is called aperiodic if the greatest common divisor of the return times to the state is 1. If all the states in a Markov chain are aperiodic, the chain is called aperiodic. Irreducibility of a chain means that, from any state, it must be possible to reach any other state in a finite number of moves.
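For a small finite chain, both conditions can be checked directly from the transition table. The sketch below is illustrative only: the function names and the three example chains are hypothetical, and `period` inspects return times only up to `max_steps`, which is an approximation that is adequate for tiny chains like these.

```python
from math import gcd

def reachable(transition, start):
    """Set of states that can be reached from `start` in one or more moves."""
    seen, frontier = set(), [start]
    while frontier:
        i = frontier.pop()
        for j, p in enumerate(transition[i]):
            if p > 0 and j not in seen:
                seen.add(j)
                frontier.append(j)
    return seen

def is_irreducible(transition):
    """True if every state can reach every state."""
    everything = set(range(len(transition)))
    return all(reachable(transition, i) == everything for i in everything)

def period(transition, state, max_steps=50):
    """gcd of the return times to `state`, looking at most max_steps ahead."""
    n = len(transition)
    current = {state}  # states reachable in exactly k moves
    g = 0
    for k in range(1, max_steps + 1):
        current = {j for i in current for j in range(n) if transition[i][j] > 0}
        if state in current:
            g = gcd(g, k)
    return g  # 0 means the state was never revisited

flip = [[0.0, 1.0], [1.0, 0.0]]    # always switches state: period 2
lazy = [[0.5, 0.5], [0.5, 0.5]]    # may stay put: period 1
stuck = [[1.0, 0.0], [0.5, 0.5]]   # state 0 can never reach state 1

for name, chain in [("flip", flip), ("lazy", lazy), ("stuck", stuck)]:
    print(name, "irreducible:", is_irreducible(chain),
          "period of state 0:", period(chain, 0))
```

Of the three examples, only `lazy` is both irreducible and aperiodic, so it is the only one that satisfies both conditions named above.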

2.3.8.1.2 Gibbs Sampling

Gibbs sampling is a special case of the Metropolis-Hastings algorithm. In contrast to the general Metropolis-Hastings algorithm, Gibbs sampling accepts every new state on the way to an equilibrium state. Every new sample in Gibbs sampling is drawn from the distribution of the sampled variable conditioned on the rest of the parameters or random variables of interest. Let $X = \{X_1^{(t-1)}, X_2^{(t-1)}, X_3^{(t-1)}, \ldots, X_n^{(t-1)}\}$ be a set of parameters that needs to be estimated through Gibbs sampling. The new value of each parameter is drawn from its conditional distribution given the rest of the parameters, such that:

$$
\begin{aligned}
X_1^{(t)} &\sim p(X_1^{(t)} \mid X_2^{(t-1)}, X_3^{(t-1)}, X_4^{(t-1)}, \ldots, X_n^{(t-1)}) \\
X_2^{(t)} &\sim p(X_2^{(t)} \mid X_1^{(t)}, X_3^{(t-1)}, X_4^{(t-1)}, \ldots, X_n^{(t-1)}) \\
X_3^{(t)} &\sim p(X_3^{(t)} \mid X_1^{(t)}, X_2^{(t)}, X_4^{(t-1)}, \ldots, X_n^{(t-1)}) \\
&\;\;\vdots \\
X_n^{(t)} &\sim p(X_n^{(t)} \mid X_1^{(t)}, X_2^{(t)}, X_3^{(t)}, \ldots, X_{n-1}^{(t)})
\end{aligned}
\qquad (2.32)
$$

Gibbs sampling continues to sample new values for the parameters until the joint distribution of the parameters $p(X_1, X_2, \ldots, X_n)$ converges to an equilibrium distribution. The reader should note that in Gibbs sampling only one change can be applied at a time. Therefore, in the example given above, only one parameter's value can be updated at a time. There are other types of sampling algorithms, such as block sampling, where a set of parameters can be sampled together.
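The update scheme can be sketched for the simplest case of two parameters. The target used here, a bivariate normal with zero means, unit variances and correlation `rho`, is an assumption chosen because its conditionals are known in closed form (each is a normal with mean `rho` times the other variable and variance `1 - rho**2`); it is not an example from the text, and the function names are hypothetical.

```python
import math
import random

def gibbs_bivariate_normal(n_samples, rho=0.8, seed=0):
    """Alternate draws from p(X1 | X2) and p(X2 | X1) for a bivariate normal."""
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho * rho)
    x1, x2 = 0.0, 0.0
    samples = []
    for _ in range(n_samples):
        # One parameter is updated at a time, always using the newest values.
        x1 = rng.gauss(rho * x2, cond_sd)  # X1(t) ~ p(X1 | X2(t-1))
        x2 = rng.gauss(rho * x1, cond_sd)  # X2(t) ~ p(X2 | X1(t))
        samples.append((x1, x2))           # every draw is kept, none rejected
    return samples

def correlation(pairs):
    n = len(pairs)
    m1 = sum(a for a, _ in pairs) / n
    m2 = sum(b for _, b in pairs) / n
    cov = sum((a - m1) * (b - m2) for a, b in pairs) / n
    v1 = sum((a - m1) ** 2 for a, _ in pairs) / n
    v2 = sum((b - m2) ** 2 for _, b in pairs) / n
    return cov / math.sqrt(v1 * v2)

samples = gibbs_bivariate_normal(20_000)
estimated_rho = correlation(samples)
print(f"estimated correlation: {estimated_rho:.3f} (target 0.8)")
```

Although each draw looks at only one conditional distribution, the collected pairs should recover the correlation of the joint distribution.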

A difference between the Metropolis-Hastings algorithm and Gibbs sampling is the need for normalisation. As mentioned above, in the Metropolis-Hastings algorithm normalisation constants can be ignored because they cancel in the division. Gibbs sampling, however, draws directly from a conditional distribution, which must therefore be normalised beforehand.
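For a variable that takes one of a finite set of values, this normalisation is a sum over the candidate values. The sketch below shows one such draw; the function name `sample_from_weights` and the example weights are hypothetical, and in a real Gibbs step the weights would be the unnormalised conditional probabilities of each candidate value given the other variables.

```python
import random

def sample_from_weights(weights, rng):
    """Normalise non-negative weights and draw one index from the result."""
    total = sum(weights)                  # the normalisation constant
    probs = [w / total for w in weights]  # a proper distribution, sums to 1
    u = rng.random()
    cumulative = 0.0
    for index, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return index, probs
    return len(probs) - 1, probs          # guard against rounding error

rng = random.Random(0)
weights = [2.0, 1.0, 1.0]                 # hypothetical unnormalised values
index, probs = sample_from_weights(weights, rng)
print("normalised probabilities:", probs)
print("sampled value index:", index)
```

When the variable has many possible values, computing this sum at every step is the cost that the Metropolis-Hastings ratio avoids.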

In document Statistical models for unsupervised learning of morphology and POS tagging (Page 61-64)