2.4 Background on Machine Learning
2.4.12 Hidden Markov Models
A Hidden Markov Model (HMM) is a statistical method for determining a model representation of a nonlinear system. HMMs are a variation on the basic Markov model, or Markov chain. A Markov chain uses the state-space representation of a system and characterizes the system by the probabilities of transition between its states. In an HMM the states are not directly observed, so the state transitions cannot be measured directly and are instead inferred from the sequence of observed outputs and their probabilities. HMM theory was originally proposed by Baum et al. in the 1960s as a method for recovering a state matrix from a set of observations [85]. HMMs have been applied extensively to speech recognition problems and have also been implemented in certain surgical skill evaluation methods [44, 86]. Additionally, some groups have utilized the simpler Markov chain for surgical skill evaluation [87].
The core definition of the HMM involves an underlying stochastic process in which the states are unobservable and only the output is observable. The elements of an HMM include the number of states in the model N, the number of distinct observation symbols per state M, and the state transition probability distribution A = {a_ij}, which gives the probability of a transition from any state S_i to any state S_j. Additionally, an HMM representation requires a probability distribution for the observation symbols B = {b_i(k)} and an initial state distribution estimate π.
Given suitable values for the HMM parameters, the model can generate an observation sequence O = [O_1, O_2, ..., O_T]. This observation sequence is generated via the following steps [88]:
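1. Choose an initial state q_1 = S_i according to the initial state distribution π, and set t = 1.
2. Choose O_t = v_k according to the observation symbol probability distribution of the current state, b_i(k).
3. Transition to a new state q_{t+1} = S_j according to the state transition probability distribution of the current state, a_ij.
4. Move to the next time step t = t + 1 and repeat from step 2 until T observations have been generated.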
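The generation steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure of [88] verbatim: the two-state, two-symbol parameter values (TOY_PI, TOY_A, TOY_B) and the function name sample_hmm are hypothetical, and states and symbols are represented by integer indices.

```python
import random

# Hypothetical toy model lambda = (A, B, pi): 2 states, 2 observation symbols.
TOY_PI = [0.6, 0.4]                 # initial state distribution
TOY_A = [[0.7, 0.3], [0.4, 0.6]]    # TOY_A[i][j]: probability of moving from state i to j
TOY_B = [[0.9, 0.1], [0.2, 0.8]]    # TOY_B[i][k]: probability of symbol k in state i


def sample_hmm(pi, A, B, T, seed=0):
    """Generate T hidden states and T observations from the model."""
    rng = random.Random(seed)
    # Step 1: draw the initial state from pi.
    state = rng.choices(range(len(pi)), weights=pi)[0]
    states, observations = [], []
    for _ in range(T):
        # Step 2: emit a symbol according to the current state's row of B.
        symbol = rng.choices(range(len(B[state])), weights=B[state])[0]
        states.append(state)
        observations.append(symbol)
        # Step 3: move to the next state according to the current row of A.
        state = rng.choices(range(len(A[state])), weights=A[state])[0]
        # Step 4: the loop itself advances the time index.
    return states, observations


hidden, observed = sample_hmm(TOY_PI, TOY_A, TOY_B, T=10)
```

Only the list observed would be available in practice; the list hidden is returned here purely so that the two sequences can be compared.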
Working with the model requires two main quantities: the probability of the observation sequence, P(O|λ), and an optimal estimate of the state sequence Q = [q_1, q_2, ..., q_T] given the observation sequence. Several procedures exist for computing the observation sequence probability. The simplest computes P(O|λ) directly by summing the joint probability of O and Q over all possible state sequences:
$$P(O \mid \lambda) = \sum_{q_1 \ldots q_T} \pi_{q_1} b_{q_1}(O_1)\, a_{q_1 q_2} b_{q_2}(O_2) \cdots a_{q_{T-1} q_T} b_{q_T}(O_T) \tag{2.66}$$
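Eq. 2.66 can be evaluated literally by enumerating every state sequence, as in the sketch below. The toy parameter values and the function name brute_force_likelihood are hypothetical and serve only as an illustration. The number of terms in the sum grows exponentially with the sequence length, so this form is only practical for very short sequences.

```python
from itertools import product

# Hypothetical toy model lambda = (A, B, pi): 2 states, 2 observation symbols.
TOY_PI = [0.6, 0.4]
TOY_A = [[0.7, 0.3], [0.4, 0.6]]
TOY_B = [[0.9, 0.1], [0.2, 0.8]]


def brute_force_likelihood(pi, A, B, obs):
    """Sum the joint probability of obs and every possible state sequence."""
    n_states = len(pi)
    total = 0.0
    for q in product(range(n_states), repeat=len(obs)):
        # First factor: pi_{q1} * b_{q1}(O_1).
        p = pi[q[0]] * B[q[0]][obs[0]]
        # Remaining factors: a_{q(t-1) q(t)} * b_{q(t)}(O_t).
        for t in range(1, len(obs)):
            p *= A[q[t - 1]][q[t]] * B[q[t]][obs[t]]
        total += p
    return total


likelihood = brute_force_likelihood(TOY_PI, TOY_A, TOY_B, [0, 1, 1, 0])
```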
While Eq. 2.66 is not particularly computationally efficient, it is the most straightforward solution. More efficient procedures also exist, such as the forward-backward procedure. Given the observation sequence probability, the state sequence estimate can be expressed through the probability of being in state S_i at time t given the observation sequence and the model parameters λ = (A, B, π):
$$P(q_t = S_i \mid O, \lambda) = \frac{\alpha_t(i)\, \beta_t(i)}{P(O \mid \lambda)} \tag{2.67}$$
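where α_t(i) and β_t(i) account for the portions of the observation sequence up to and after time t, respectively. The forward variable α_t(i) is the probability of the partial sequence O_1, ..., O_t together with state S_i at time t, and the backward variable β_t(i) is the probability of the remaining sequence O_{t+1}, ..., O_T given state S_i at time t. From Eq. 2.67, the individually most likely state at time t, q_t, is then: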
$$q_t = \arg\max_{i}\, P(q_t = S_i \mid O, \lambda), \quad 1 \le t \le T \tag{2.68}$$
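Eqs. 2.67 and 2.68 can be illustrated with the standard forward and backward recursions. The sketch below is a minimal, unscaled version: the toy parameter values and function names are hypothetical, and a practical implementation would need scaling or log-space arithmetic to avoid numerical underflow on long sequences.

```python
# Hypothetical toy model lambda = (A, B, pi): 2 states, 2 observation symbols.
TOY_PI = [0.6, 0.4]
TOY_A = [[0.7, 0.3], [0.4, 0.6]]
TOY_B = [[0.9, 0.1], [0.2, 0.8]]


def forward(pi, A, B, obs):
    """alpha[t][i]: probability of obs[0..t] and state i at time t."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                      for j in range(n)])
    return alpha


def backward(A, B, obs):
    """beta[t][i]: probability of obs[t+1..] given state i at time t."""
    n = len(A)
    beta = [[1.0] * n for _ in obs]
    for t in range(len(obs) - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(n))
    return beta


def posterior_decode(pi, A, B, obs):
    """Return P(O|lambda), the state posteriors (Eq. 2.67), and their argmax (Eq. 2.68)."""
    alpha = forward(pi, A, B, obs)
    beta = backward(A, B, obs)
    likelihood = sum(alpha[-1])
    gamma = [[a * b / likelihood for a, b in zip(alpha[t], beta[t])]
             for t in range(len(obs))]
    path = [max(range(len(pi)), key=lambda i: g[i]) for g in gamma]
    return likelihood, gamma, path


likelihood, gamma, path = posterior_decode(TOY_PI, TOY_A, TOY_B, [0, 1, 1, 0])
```

Because Eq. 2.68 selects each state independently, the resulting sequence is the most likely state at each time step and is not necessarily the single most likely state sequence as a whole.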
The final consideration in designing an HMM is the selection of optimal parameters λ = (A, B, π) such that P(O|λ) is maximized. Several methods exist for determining these parameters, including Baum-Welch [89], gradient techniques, and expectation-maximization (EM) [88]. The majority of these algorithms have no analytical solution and instead rely on iterative numerical approaches and maximum likelihood estimates.
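To make the iterative nature of the parameter estimation concrete, the sketch below performs one re-estimation pass in the style of Baum-Welch for a single discrete observation sequence. It uses the standard textbook update formulas in unscaled form and is not taken from [89]. The toy starting values, the observation sequence, and the function names are hypothetical.

```python
# Hypothetical starting guess for lambda = (A, B, pi): 2 states, 2 symbols.
TOY_PI = [0.6, 0.4]
TOY_A = [[0.7, 0.3], [0.4, 0.6]]
TOY_B = [[0.9, 0.1], [0.2, 0.8]]


def forward(pi, A, B, obs):
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                      for j in range(n)])
    return alpha


def backward(A, B, obs):
    n = len(A)
    beta = [[1.0] * n for _ in obs]
    for t in range(len(obs) - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(n))
    return beta


def baum_welch_step(pi, A, B, obs):
    """One re-estimation pass; returns updated (pi, A, B) and the likelihood
    of obs under the parameters that were passed in."""
    n, m, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(pi, A, B, obs), backward(A, B, obs)
    likelihood = sum(alpha[-1])
    # gamma[t][i]: posterior probability of state i at time t.
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(n)]
             for t in range(T)]
    # xi[t][i][j]: posterior probability of the transition i -> j at time t.
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / likelihood
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    new_pi = list(gamma[0])
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /
              sum(gamma[t][i] for t in range(T))
              for k in range(m)] for i in range(n)]
    return new_pi, new_A, new_B, likelihood


obs = [0, 0, 1, 0, 1, 1, 1, 0, 0, 1]   # hypothetical training sequence
pi, A, B = TOY_PI, TOY_A, TOY_B
history = []
for _ in range(5):
    pi, A, B, lik = baum_welch_step(pi, A, B, obs)
    history.append(lik)                 # likelihood should not decrease
```

Each pass can only keep or increase P(O|λ), but the procedure converges to a local maximum that depends on the starting guess.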
The use of HMMs in discriminant analysis and system identification has been considered in several research fields. HMMs excel in systems with a known, finite number of states and where system identification (a generative model) is the goal. However, HMM methods are ill-suited for systems with an indeterminate number of states. Conventional HMM methods are also not specifically designed for use as a discriminant model. Because the parameter values are obtained from maximum likelihood estimates, HMM algorithms concentrate on general similarities in the data, whereas inherently discriminant algorithms focus on regions of maximum separability between classes.
Recent research has focused on shifting the use of HMMs to discriminant analysis settings. To achieve maximum discriminative ability, the HMM must use a training procedure that focuses on maximizing the separability of classes and selects the parameters accordingly. Bourlard et al. proposed an initial discriminant-based HMM which consisted of a hybrid approach with the use of Artificial Neural Networks (ANN) [90]. The goal of the HMM-ANN hybrid is to improve discrimination by training each model parameter set with consideration given to all other models, thus identifying maximum separability. In the work of Bourlard, the training method is performed such that the HMM can estimate the probability of the observed data vector given a hypothesized HMM state. Using a modified version of this probability, namely the a posteriori probability of an HMM state given a data vector, P(q_k|x_n), the HMM parameters can be estimated by minimizing the Mean Square Error (MSE).
The formulation of the HMM with neural networks provides an initial extension of HMMs to the discriminant problem. Other groups have explored similar avenues for the use of HMMs in discriminant settings. Quan et al. utilized a similar HMM-neural network approach in order to improve separability in signature verification applications [91].
A different extension of the HMM is the discriminative model HMM, which involves the use of Maximum Mutual Information (MMI) when designing the model parameters [92]. The joint MMI-HMM approach involves training the HMM model parameters while considering all other observations and models, in order to maximize the discriminative ability of each model using the Bayesian discriminant function [93]. The Bayes discriminant function is the probability of a correct classification minus the probability of an incorrect classification (Eq. 2.69).
$$g_0(x) = p(C_1 \mid x) - p(C_2 \mid x) \tag{2.69}$$
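As a small numerical illustration of Eq. 2.69, the sketch below forms the two class posteriors from class-conditional likelihoods and priors using Bayes' rule and returns their difference. The function name, its arguments, and the example values are hypothetical. In an HMM setting the likelihoods could be the values of P(O|λ) under two competing models, which is an assumption made here for illustration only.

```python
def bayes_discriminant(likelihood_c1, likelihood_c2, prior_c1=0.5):
    """Return g0(x) = p(C1|x) - p(C2|x) for a two-class problem."""
    prior_c2 = 1.0 - prior_c1
    # Total probability of x under both classes (the normalizing term).
    evidence = likelihood_c1 * prior_c1 + likelihood_c2 * prior_c2
    posterior_c1 = likelihood_c1 * prior_c1 / evidence
    posterior_c2 = likelihood_c2 * prior_c2 / evidence
    return posterior_c1 - posterior_c2


# Hypothetical likelihoods of the same observation under two class models.
g0 = bayes_discriminant(0.3, 0.1)
predicted_class = "C1" if g0 > 0 else "C2"
```

A positive value assigns x to class C1, a negative value assigns it to C2, and the magnitude indicates how decisively the two classes are separated.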
The mutual information comes into focus when choosing parameters that optimally distinguish between observations generated by the appropriate model and observations generated by incorrect models; i.e. the parameters are chosen in order to maximize the mutual information I between the set of observation sequences O = O1O2...OT and the set of all models λ = λ1λ2...λv. This can be expressed as summing over all observations given all possible model parameters (Eq. 2.70).
$$I = \max_{\lambda} \sum_{v=1}^{V} \left[ \log P(O_v \mid \lambda_v) - \log \sum_{w=1}^{V} P(O_v \mid \lambda_w) \right] \tag{2.70}$$
The hybrid HMM-MMI approach allows for an improved method of training an HMM to focus on the discrimination of particular classes. This method cannot be implemented analytically, however, and must be approached with numerical methods. Nevertheless, the concept of maximizing discrimination against incorrect classifications is appealing.
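For a fixed set of models, the bracketed sum in Eq. 2.70 can be evaluated directly from a table of log-likelihoods, as sketched below. The function names and the example values are hypothetical, and the maximization over the model parameters, which requires a numerical optimizer, is not shown.

```python
import math


def log_sum_exp(values):
    """Numerically stable log of a sum of exponentials."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def mmi_objective(log_lik):
    """Evaluate the sum in Eq. 2.70 for fixed models.

    log_lik[v][w] is assumed to hold log P(O_v | lambda_w), where O_v is
    the observation sequence associated with model v.
    """
    return sum(row[v] - log_sum_exp(row) for v, row in enumerate(log_lik))


# Hypothetical likelihoods for two sequences scored under two models.
table = [[math.log(0.8), math.log(0.2)],
         [math.log(0.1), math.log(0.9)]]
objective = mmi_objective(table)
```

Each term is at most zero and approaches zero as the correct model accounts for an increasing share of the total likelihood, so parameter updates that raise the objective make the correct model stand out from its competitors.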
While HMMs do present certain benefits, including time series model estimation and no required knowledge of the internal states, they also impose undesirable drawbacks. These include the need for feature extraction in order to convert a time series to a finite number of states, which can result in the loss of key information. Other drawbacks arise in systems with an indeterminate number of states. Additional drawbacks can include classification difficulties when dealing with systems that contain extremely similar signals and only moderate amounts of differentiating time series information.