
3.2 Estimation principles

3.2.4 Forward-backward algorithm and latent parameters estimation

Before considering the possible versions of the M-step of the EM algorithm, we need to detail how the likelihood is computed during the E-step. Starting from the initial values of the visible-level parameters (and again after each re-estimation), one needs to compute the transition probabilities of the latent states given the entire observed sequence and the specification of the model. This is done using the forward-backward algorithm described by Rabiner [117].

Once the visible-level parameters $\phi_{g,t}$ and $\theta_{g,t}$ are initialised or re-estimated, one needs to estimate the latent parameters. For this purpose, the forward-backward algorithm (FB) is employed. The main objective is to estimate the latent state probabilities (transition probabilities $A$ and initial probabilities $\pi$) when the visible-level parameters are considered as known. It is a commonly used tool in HMMs that computes the posterior marginal probability of the latent variables given the observed sequence (and the current model $M$ with its visible-part parameters), $P_M(S_t \mid X_{1:T})$, at every time $t \in \{1, \dots, T\}$. The algorithm consists of two dynamic programming passes: a forward pass and a backward one. The computation is carried out a first time forward, starting from $t = 1$, and then a second time backward, starting from $t = T$. Both sets of probabilities are then combined by "smoothing" the information obtained from the forward pass and the one obtained from the backward pass.

The forward part estimates the probability of being in a given latent state at a given time, knowing the observations up to this time: $P_M(S_t \mid X_{1:t})$. In order to do this, we first need to estimate the joint probabilities

$$\alpha_t(j) = P_M(X_0, \dots, X_t, S_t = j).$$

Using the transition probabilities $a_{i,j}$ of $A$, we obtain the following equation, which is solved consecutively for each $t$ until $t = T$:

$$\alpha_t(j) = P_M(X_0, \dots, X_t, S_t = j) = P_M(X_t \mid S_t = j) \sum_{i=1}^{k} a_{i,j}\, \alpha_{t-1}(i)$$

The probabilities of the latent states at each time $t$, given the observations up to this time, are:

$$P_M(S_t = j \mid X_0, \dots, X_t) = \frac{P_M(X_0, \dots, X_t, S_t = j)}{P_M(X_0, \dots, X_t)} = \frac{\alpha_t(j)}{P_M(X_0, \dots, X_t)}$$
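The forward recursion and its normalisation into filtered probabilities can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the estimation code of the HMTD model: the names `forward_pass`, `pi`, `A` and `B` are hypothetical, the densities `B[t][j]` are assumed to be already evaluated from the current visible-level parameters, and the toy numbers are made up.

```python
import math


def forward_pass(pi, A, B):
    """Scaled forward pass for a first-order latent chain.

    pi[j]   : initial probability of latent state j (hypothetical name)
    A[i][j] : transition probability from state i to state j
    B[t][j] : density of observation t under state j, assumed precomputed
    Returns (filtered, loglik) with filtered[t][j] = P(S_t = j | X_0..X_t).
    """
    k = len(pi)
    filtered, loglik = [], 0.0
    for t, b_t in enumerate(B):
        if t == 0:
            joint = [pi[j] * b_t[j] for j in range(k)]
        else:
            prev = filtered[-1]
            joint = [b_t[j] * sum(prev[i] * A[i][j] for i in range(k))
                     for j in range(k)]
        scale = sum(joint)          # P(X_t | X_0..X_{t-1})
        loglik += math.log(scale)   # accumulates log P(X_0..X_t)
        filtered.append([x / scale for x in joint])
    return filtered, loglik


# Toy example: two latent states, three observations (made-up numbers).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.1], [0.4, 0.3], [0.1, 0.6]]
filtered, loglik = forward_pass(pi, A, B)
```

Dividing by `scale` at every step keeps the quantities away from numerical underflow on long sequences, and the sum of the logged scaling constants gives the log-likelihood of the sequence as a by-product.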

After calculating the forward probabilities $\alpha_t$, we proceed in exactly the same manner for the backward pass, but starting from the end of the sequence, $t = T$, and moving back to $t = 1$. This provides us with the backward probabilities $\beta_t$. We start from a given latent state and look for the probability of observing all the future observations from this state onwards. We consider the starting state (at $t = T$) as known, and therefore each $\beta_T(i) = 1$.

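Continuing backwards we obtain:

$$\beta_t(i) = P_M(X_{t+1}, \dots, X_T \mid S_t = i) = \sum_{j=1}^{k} a_{i,j}\, P_M(X_{t+1} \mid S_{t+1} = j)\, \beta_{t+1}(j)$$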

A normalization is applied in the computations of $\alpha_t(j)$ and $\beta_t(i)$ to correct the numerical problems that occur if one state has an excessively small probability.

After running the algorithm in both directions, we can compute the marginal probabilities $\gamma_t$ of the latent states at any time, knowing the entire sequence of observations:

$$\gamma_t(i) = P_M(S_t = i \mid X_{1:T}) = P_M(S_t = i \mid X_{1:t}, X_{(t+1):T})$$
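A minimal Python sketch of this smoothing step is given here, under the assumption of a first-order latent chain. The function name `forward_backward` and the inputs `pi`, `A` and `B` are hypothetical; `B[t][j]` stands for the density of observation t under state j, assumed to be already evaluated, and the numbers are made up.

```python
def forward_backward(pi, A, B):
    """Return gamma[t][i] = P(S_t = i | whole sequence) from two scaled passes."""
    k, T = len(pi), len(B)

    # Forward pass, normalised at every step.
    alpha, scale = [], []
    for t in range(T):
        if t == 0:
            joint = [pi[j] * B[0][j] for j in range(k)]
        else:
            joint = [B[t][j] * sum(alpha[t - 1][i] * A[i][j] for i in range(k))
                     for j in range(k)]
        c = sum(joint)
        scale.append(c)
        alpha.append([x / c for x in joint])

    # Backward pass: starts at 1 and is rescaled with the same constants.
    beta = [[1.0] * k for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[t + 1][j] * beta[t + 1][j] for j in range(k))
                   / scale[t + 1] for i in range(k)]

    # Smoothing: element-wise product of the two passes.
    return [[alpha[t][i] * beta[t][i] for i in range(k)] for t in range(T)]


# Toy example: two latent states, three observations (made-up numbers).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.1], [0.4, 0.3], [0.1, 0.6]]
gamma = forward_backward(pi, A, B)
```

Because both passes are divided by the same scaling constants, their product is already normalised, so each row of `gamma` sums to one without a further division.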

Combining the forward and backward probabilities results in a "smoothing" probability computation of $\gamma_t$. The latter gives, for each time $t$ of the observed sequence, the probability of every latent state, and hence the most probable state at that time. However, this does not result in the most probable sequence of hidden states. The reason is that even though the latent-level transition probabilities are used in the calculation of $\alpha_t$ and $\beta_t$, they are not respected when combining both to obtain $\gamma_t$. In other words, we have the most probable states independently, but we do not know how likely they are to occur successively in this exact sequence, i.e. $P(S_t = i)\,P(S_{t+1} = j) \neq P(S_t = i, S_{t+1} = j)$.

Fortunately, there exists another tool called the Viterbi algorithm, which can provide us with this optimal latent sequence and which we will describe later.

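After computing $\gamma_t$ over the entire sequence, there is one more set of probabilities that we need in order to estimate the latent transition matrix $A$. The joint probability of two successive states ($i$ and $j$) given the entire sequence of observations is called $\epsilon_t(i, j)$ and forms a three-dimensional array of size $[k^{\ell} \times k \times (n - 1)]$, where $\ell$ is the order of dependence of the hidden Markov chain. It is computed as:

$$\epsilon_t(i, j) = P_M(S_t = i, S_{t+1} = j \mid X_0, \dots, X_T) = \frac{\alpha_t(i)\, a_{i,j}\, P_M(X_{t+1} \mid S_{t+1} = j)\, \beta_{t+1}(j)}{P_M(X_0, \dots, X_T)}$$

With $\epsilon_t$ and $\gamma_t$ available, it is easy to re-estimate the latent-part transition probability array $A$. Its estimation is provided by the ratio of the sums over all periods of all $\epsilon_t$-s and $\gamma_t$-s:

$$a_{i,j} = \frac{\sum_{t=1}^{T-1} \epsilon_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$$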

As for the vector of initial probabilities $\pi_i$ of each latent state at time $t = 0$, they are computed from the sums of all $\gamma_t$:

$$\pi_i = \frac{\sum_{t=1}^{T-1} \gamma_t(i)}{T - 1}$$
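The two re-estimation formulas for the latent part can be sketched in Python for a single sequence and a first-order chain. This is an illustrative sketch only: `reestimate_latent`, `pi`, `A` and `B` are hypothetical names, `B[t][j]` is the density of observation t under state j (assumed precomputed), and the 0-based positions 0..T-2 in the code are assumed to correspond to t = 1..T-1 in the formulas.

```python
def reestimate_latent(pi, A, B):
    """One latent-level update: returns (new_A, new_pi) for a single sequence."""
    k, T = len(pi), len(B)

    # Scaled forward pass.
    alpha, scale = [], []
    for t in range(T):
        if t == 0:
            joint = [pi[j] * B[0][j] for j in range(k)]
        else:
            joint = [B[t][j] * sum(alpha[t - 1][i] * A[i][j] for i in range(k))
                     for j in range(k)]
        c = sum(joint)
        scale.append(c)
        alpha.append([x / c for x in joint])

    # Scaled backward pass.
    beta = [[1.0] * k for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[t + 1][j] * beta[t + 1][j] for j in range(k))
                   / scale[t + 1] for i in range(k)]

    # gamma[t][i]: smoothed state probabilities.
    gamma = [[alpha[t][i] * beta[t][i] for i in range(k)] for t in range(T)]

    # eps[t][i][j]: joint probability of two successive states.
    eps = [[[alpha[t][i] * A[i][j] * B[t + 1][j] * beta[t + 1][j] / scale[t + 1]
             for j in range(k)] for i in range(k)] for t in range(T - 1)]

    # Ratio of sums over time for A; time average of gamma for pi.
    occupancy = [sum(gamma[t][i] for t in range(T - 1)) for i in range(k)]
    new_A = [[sum(eps[t][i][j] for t in range(T - 1)) / occupancy[i]
              for j in range(k)] for i in range(k)]
    new_pi = [occupancy[i] / (T - 1) for i in range(k)]
    return new_A, new_pi


# Toy example: two latent states, three observations (made-up numbers).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.1], [0.4, 0.3], [0.1, 0.6]]
new_A, new_pi = reestimate_latent(pi, A, B)
```

Since the entries of `eps[t][i]` sum over `j` to `gamma[t][i]`, every row of `new_A` sums to one by construction.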

It is important to note that, since longitudinal data are often composed of multiple data sequences, the latent-level parameters $A$ and $\pi_i$ are estimated separately on each sequence and then aggregated at the end:

$$A^{tot} = \sum_{i=1}^{n} w_i A^{(i)}$$

where $w_i$ is the weight of sequence $i$ and $A^{(i)}$ denotes the corresponding estimate of $A$. The weights are either provided with the data (from the design of the survey) or, otherwise, proportional to the length of each sequence.
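The aggregation can be illustrated with a short Python sketch. The function name `aggregate_transitions` and the two per-sequence matrices are made up for the example, and rescaling the weights so that they sum to one is an assumption added here so that the rows of the aggregated matrix remain probability distributions.

```python
def aggregate_transitions(estimates, weights):
    """Weighted sum of per-sequence transition matrices.

    estimates : list of k x k matrices, one per sequence
    weights   : one non-negative weight per sequence (for instance its length)
    The weights are rescaled to sum to one (an assumption of this sketch).
    """
    total = sum(weights)
    w = [x / total for x in weights]
    k = len(estimates[0])
    return [[sum(w[s] * estimates[s][i][j] for s in range(len(estimates)))
             for j in range(k)] for i in range(k)]


# Two hypothetical sequences of lengths 30 and 10, weighted by their length.
A_1 = [[0.9, 0.1], [0.2, 0.8]]
A_2 = [[0.5, 0.5], [0.6, 0.4]]
A_tot = aggregate_transitions([A_1, A_2], [30, 10])
```

With these numbers the weights become 0.75 and 0.25, so the first row of `A_tot` is approximately (0.8, 0.2) and the second approximately (0.3, 0.7).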

The above formulas consider the estimation in the most common latent specification, where the order of the hidden Markov chain is $\ell = 1$. In general, however, $\epsilon_t$ denotes the probability of $\ell + 1$ successive states of the latent chain. Therefore, for a second-order chain for instance, we have an array of size $(k^2 \times k \times (n - 1))$ for $\epsilon_t(i, j, h) = P_M(S_t = i, S_{t+1} = j, S_{t+2} = h \mid X_0, \dots, X_T)$, from which one can estimate the matrix $A$ of size $(k^2 \times k^2)$ giving the transition probabilities conditionally on the two previous states.

After re-estimation of all latent parameters, the log-likelihood equation is:

$L(X_0, \dots, X_T) =$

To provide a more concrete example with a specified model, if all components have two lags for both the mean and the standard deviation ($p_k = 2$ and $q_k = 2$) and two visible-level covariates $c_1$ and $c_2$, then the above equation becomes:

$L(X_0, \dots, X_T) =$

As seen before, solving the likelihood derivative equations is complex for the standard deviation parameters, because of the lack of unique solutions. An additional complexity is that every component may use its own number of lags for the mean and the standard deviation. Depending on the data and the objectives, it is possible to choose a component with constant mean and variance, together with another one with a two-period memory for the mean and a one-period memory for the standard deviation (for instance we may have $\mu_{g=1} = \phi_{1,0}$ and $\sigma_{g=1} = \theta_{1,0}$ for the first component, and $\mu_{g=2} = \phi_{2,0} + \phi_{2,1} X_{t-1} + \phi_{2,2} X_{t-2}$ and $\sigma_{2,t} = \sqrt{\theta_{2,0} + \theta_{2,1} x_{t-1}^2}$ for the second). Thus, in order to allow the HMTD model to be as flexible as possible (that is, to allow heterogeneous modelling), we attempted to implement an estimation procedure that is not the fastest for some given specifications, but that is as generalisable as possible over the variety of model specifications and uses. This is why we explored the use of heuristic methods (which do not use the derivatives of the likelihood function) within the M-step of a GEM algorithm, in order to optimize the log-likelihood function.
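To give an idea of what a derivative-free update can look like, here is a Python sketch of a simple random-search hill climb. The text does not say which heuristic was used, so this technique is swapped in purely for illustration. The names `weighted_loglik` and `hill_climb`, the data, the weights (standing in for posterior state probabilities) and the parameterisation of the variance by `theta0` are all assumptions of the sketch.

```python
import math
import random


def weighted_loglik(params, xs, weights):
    """Weighted Gaussian log-likelihood of one component.

    params = (phi0, theta0): constant mean and constant variance (assumed form).
    """
    phi0, theta0 = params
    if theta0 <= 0:
        return float("-inf")   # invalid variance: never accepted
    return sum(w * (-0.5 * math.log(2 * math.pi * theta0)
                    - (x - phi0) ** 2 / (2 * theta0))
               for x, w in zip(xs, weights))


def hill_climb(objective, start, steps=2000, step_size=0.1, seed=0):
    """Random-search hill climb: a candidate is kept only if it improves."""
    rng = random.Random(seed)
    best, best_val = list(start), objective(start)
    for _ in range(steps):
        cand = [p + rng.gauss(0.0, step_size) for p in best]
        val = objective(cand)
        if val > best_val:     # the objective never decreases
            best, best_val = cand, val
    return best, best_val


# Made-up observations and weights for one component.
xs = [1.0, 1.2, 0.8, 3.1, 2.9]
weights = [0.9, 0.8, 0.9, 0.1, 0.2]


def objective(p):
    return weighted_loglik(p, xs, weights)


best, best_val = hill_climb(objective, [0.0, 1.0])
```

Accepting a candidate only when the objective improves is what makes such a step compatible with a GEM scheme, which requires the update to increase the objective rather than to maximise it exactly. No derivative is evaluated at any point, so the same loop applies unchanged to components with different numbers of lags.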

3.2.5 Viterbi algorithm

In HMMs, one is often interested in the most probable sequence of latent states that could have led to the observed sequence. The most popular solution to this problem is the algorithm proposed by Andrew Viterbi (Bishop [23], Viterbi [165]) and, simultaneously, by other authors.

Suppose we have a sequence of length $T$ and $k$ latent states. This leads to a set of $k^T$ possible paths, a number that grows exponentially with the number of time periods. Even though we could compute the probability of a path using the initial probabilities $\pi_i$, the transition probability matrix $A$ and the probability distribution of each state, it would be impractical to do this for all the paths. The Viterbi algorithm makes the task computationally easier by following only $k$ paths at each time. Suppose that we need to find the optimal path up to time $t$ for state $S_t = i$. Even though many paths lead to this point, only one of them is the most probable. Therefore, at $t$ we need to consider only $k$ optimal paths. When we move to $t + 1$, this number becomes $k^2$, but again only one of them is the most likely for each state, and therefore we keep only $k$ of them. At time $T$, only one state will be the most likely, and only one optimal path will lead to it. If we call $V_{i,t}$ the probability of the most likely path ending in state $i$ at time $t$, we can calculate these probabilities iteratively, starting with

$$V_{i,1} = P(X_1 \mid S_1 = i) \times \pi_i$$

and maximizing

$$V_{j,t} = \max_{i \in \{1, \dots, k\}} P(X_t \mid S_t = j) \times a_{i,j} \times V_{i,t-1} \quad \text{for each } t \in \{2, \dots, T\}$$

By tracking all the optimal paths, we can then find the sequence corresponding to $V^{*}_{k,T}$.
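The recursion and the tracking of the optimal paths can be sketched in Python as follows. This is a minimal illustration for a first-order chain, computed in log space to avoid underflow. The names `viterbi`, `pi`, `A` and `B` are hypothetical, `B[t][j]` is the density of observation t under state j (assumed precomputed), states are numbered from 0, and the toy numbers are made up.

```python
import math


def viterbi(pi, A, B):
    """Most probable latent path; returns (path, log_probability)."""
    k, T = len(pi), len(B)

    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # V[t][i]: log-probability of the best path ending in state i at time t.
    V = [[log(pi[i]) + log(B[0][i]) for i in range(k)]]
    back = []   # back[t-1][j]: best predecessor of state j at time t
    for t in range(1, T):
        row, ptr = [], []
        for j in range(k):
            best_i = max(range(k), key=lambda i: V[-1][i] + log(A[i][j]))
            ptr.append(best_i)
            row.append(V[-1][best_i] + log(A[best_i][j]) + log(B[t][j]))
        V.append(row)
        back.append(ptr)

    # Backtrack from the most likely final state.
    last = max(range(k), key=lambda i: V[-1][i])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, V[-1][last]


# Toy example: two latent states, three observations (made-up numbers).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.1], [0.4, 0.3], [0.1, 0.6]]
path, log_prob = viterbi(pi, A, B)   # path is [0, 0, 1]
```

Only `k` candidate paths are stored at each time step, so the cost grows as T × k² instead of k^T. On this toy example the returned path has joint probability 0.01512, the largest among the eight possible paths.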


In document Latent Markovian Modelling and Clustering for Continuous Data Sequences (Page 58-64)