CHAPTER 5: Sequential Learning and Probability Estimation
5.2 Using the Marginal Distribution
Given that we believe knowledge or estimates of the marginal class probabilities can improve our classifiers’ performance, what is the appropriate way to use them? The answer depends on which pieces we know. In the context of the preceding simulations, there are four “things” that we could know. First, we might know the Markov model, which also implies knowledge of the marginal class probabilities. Second, we might know only the true emission distribution $P(X_t \mid Y_t)$. Third, we could know the marginal class probabilities $P(Y)$ and nothing else. Finally, we could know both the emission distribution and the marginal class probabilities, but not the process that generated $(Y_t)$, which also implies knowledge of the true conditional class probabilities. Given the data’s generating process, and assuming we can know only pieces of it, this last case is the only situation in which we have access to the true conditional class probabilities. Notice that this is an expanded view of what could be known compared with the original view from the preceding simulations. It also clarifies the distinction between knowing the emission distribution (a piece of the model) and truly knowing the conditional class probabilities. With this new view, we now detail each of the above scenarios, describing exactly what we know and potential methods for incorporating the additional information in the context of our simulations. Following this, we present an extended set of simulations to assess whether this view and methodology resolve the previously unexpected results.
5.2.1. When Only Marginal Class Probabilities are Known
If we know only the true $p_i = P(Y_t = i)$, then there are two places this information could be used. First, even though we used logistic regression, which is a purely discriminative classifier, we will see that knowledge of the marginals can nevertheless affect output on a test set. Further, we have reason to believe this holds for most classifiers. The general method used in the next set of simulations rescales by the ratio of the true to the sample marginal distribution and is discussed in depth in section 5.4.
Second, we would want to change how we handle the estimation of the Markov process on $(Y_t)$. Ideally, we would like to fit the best model constrained to have a given stationary distribution. However, we have not seen any work on this in the literature for even simple Markov chains, let alone more complicated models like GMMs, and such problems lie beyond the scope of the current work. Therefore, we do not discuss this case further or address it directly in the following simulations. However, there is a relatively simple way we could consider incorporating the marginals. When smoothing nonsequential probabilities with
the discriminative Forward-Backward algorithm, recall that we use the term $\hat{P}(Y_t = i \mid X_t) / \hat{p}_i$ in each update step. Since the denominator is the marginal probability of observing class $i$, we replace it with $p_i$, the true marginal probability of class $i$.
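As a rough illustration of where the marginal enters the smoothing step, the following is a minimal sketch of a discriminative Forward-Backward pass in which the emission proxy is the nonsequential CCP divided by a marginal. The function names (`normalize`, `smooth_ccps`), the choice to start the chain from the supplied marginals, and the per-step rescaling are assumptions made for this example and are not taken from the simulations. Passing the true $p_i$ rather than the sample estimate as `marginals` corresponds to the replacement described above.

```python
def normalize(v):
    """Rescale a list of nonnegative numbers to sum to one."""
    s = sum(v)
    return [x / s for x in v]


def smooth_ccps(ccps, trans, marginals):
    """Forward-Backward smoothing with CCP / marginal as the emission proxy.

    ccps[t][i]   : nonsequential estimate of P(Y_t = i | X_t)
    trans[i][j]  : P(Y_{t+1} = j | Y_t = i)
    marginals[i] : value used in the denominator (the true p_i if known,
                   otherwise a sample estimate)
    """
    n, k = len(ccps), len(marginals)
    # By Bayes' rule the proxy is proportional to P(X_t | Y_t = i).
    proxy = [[row[i] / marginals[i] for i in range(k)] for row in ccps]

    # Forward pass. Assumption: the chain starts from `marginals`.
    alpha = [normalize([marginals[i] * proxy[0][i] for i in range(k)])]
    for t in range(1, n):
        pred = [sum(alpha[-1][i] * trans[i][j] for i in range(k))
                for j in range(k)]
        alpha.append(normalize([pred[j] * proxy[t][j] for j in range(k)]))

    # Backward pass, rescaled at each step for numerical stability.
    beta = [[1.0] * k for _ in range(n)]
    for t in range(n - 2, -1, -1):
        beta[t] = normalize([
            sum(trans[i][j] * proxy[t + 1][j] * beta[t + 1][j]
                for j in range(k))
            for i in range(k)
        ])

    # Smoothed class probabilities P(Y_t = i | X_1, ..., X_n).
    return [normalize([a * b for a, b in zip(alpha[t], beta[t])])
            for t in range(n)]
```

A sanity check on the design: if every row of `trans` equals `marginals`, so that the labels are independent over time, the smoothed output reduces to the input CCPs, since there is no sequential information to add.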
5.2.2. When The Emission Distribution Is Known
In this condition there are also two sub-scenarios: we could either estimate or know the marginal class distribution. In the first, we would estimate the nonsequential CCPs as proportional to $P(X_t \mid Y_t)\,\hat{P}(Y_t)$. This may be worth looking into, but we do not discuss it further at present. In the second, we actually know the true CCPs and could set our initial, nonsequential estimates to the true $P(Y_t \mid X_t)$. Were there no sequential model and hence no smoothing to be done, there would be nothing more to do, since this would be the truth. When we know the true nonsequential conditional class probabilities, however, we need to be careful at the smoothing step. If we just used the emission distribution in the smoothing step, the nonsequential CCPs wouldn’t even be relevant. But since sequential learning uses discriminative models, the marginal class probabilities show up in this step. Treating the nonsequential CCPs as known but the marginal probabilities as unknown, and using an estimate in the denominator, is thus an inconsistent use of available information.
Formally, we use $P(X_t \mid Y_t = i)$ or an estimate thereof in the generative version of Forward-Backward, but recall that in the discriminative version we use the fact that $P(X_t \mid Y_t = i) \propto P(Y_t = i \mid X_t) / P(Y_t = i)$. If we are given only the true conditional class probabilities but not the true marginal distribution, we end up with an incorrect emission distribution proxy of $P(Y_t = i \mid X_t) / \hat{P}(Y_t = i)$, because the correct marginals were used in creating the CCPs. The general principle seems to be that we want to pair truth with truth and estimates with estimates.
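The mismatch can be seen in a small numeric sketch. All numbers below are hypothetical and chosen only for illustration; they are not drawn from the simulations, and the helper names are assumptions of this example. For a single observation, the true CCPs are built from known emission densities and true marginals, and the emission proxy is then recovered by dividing by either the true or an estimated marginal.

```python
def ccp_from_model(dens, priors):
    """Bayes' rule: P(Y = i | x) from densities f(x | Y = i) and marginals."""
    joint = [f * p for f, p in zip(dens, priors)]
    total = sum(joint)
    return [j / total for j in joint]


def emission_proxy(ccp, marginals):
    """CCP divided by a marginal, rescaled to sum to one for comparison."""
    raw = [c / m for c, m in zip(ccp, marginals)]
    total = sum(raw)
    return [r / total for r in raw]


# Hypothetical values for one observation x.
dens = [0.2, 0.6]      # f(x | Y = i), the true emission densities at x
true_p = [0.7, 0.3]    # true marginals p_i
est_p = [0.5, 0.5]     # a sample estimate of the marginals

true_ccp = ccp_from_model(dens, true_p)
target = [d / sum(dens) for d in dens]       # emission shape, normalized
matched = emission_proxy(true_ccp, true_p)   # truth over truth
mixed = emission_proxy(true_ccp, est_p)      # truth over estimate

print("target :", target)
print("matched:", matched)
print("mixed  :", mixed)
```

With truth over truth the proxy recovers the normalized emission densities, whereas truth over estimate leaves a residual factor of $p_i / \hat{p}_i$ in each class.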
5.2.3. When the State Transitions Are Known
The final and most interesting condition is when we know $A$ and the holding time parameters, in other words the complete Markov model. With this information, the marginal class distribution is included for free in the form of the process’s stationary distribution. However, the way in which we use this information is less direct. A discriminative classifier that directly estimates $\hat{P}(Y_t = i \mid X_t)$, as was used in the preceding simulations, makes no explicit mention of $p_i$. However, differences in the marginal distribution from sample to sample may still affect this estimate, so we would like a way to incorporate this extra knowledge into any estimates, since it may improve out-of-sample performance.
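To make "included for free" concrete, the following sketch recovers a stationary distribution from a transition matrix by power iteration. It assumes a plain first-order, ergodic Markov chain with a row-stochastic matrix; a model with explicit holding time parameters would also need those parameters to obtain the marginals, which this sketch does not attempt. The function name and the example matrix are hypothetical.

```python
def stationary_distribution(trans, tol=1e-12, max_iter=100_000):
    """Power iteration for p satisfying p = p A.

    trans[i][j] = P(Y_{t+1} = j | Y_t = i); each row must sum to one.
    Assumes the chain is ergodic so that the iteration converges.
    """
    k = len(trans)
    p = [1.0 / k] * k  # start from the uniform distribution
    for _ in range(max_iter):
        nxt = [sum(p[i] * trans[i][j] for i in range(k)) for j in range(k)]
        if max(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    raise RuntimeError("power iteration did not converge")


# Hypothetical two-state chain; solving p = p A by hand gives [0.75, 0.25].
A = [[0.9, 0.1],
     [0.3, 0.7]]
p = stationary_distribution(A)
print(p)
```

Power iteration is used here only because it needs no external libraries; solving the corresponding linear system or an eigenvector problem would serve equally well.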
The method we use, while less straightforward, is still relatively simple. First, recall that $\hat{P}(Y_t = i \mid X_t) \propto \hat{f}(X_t \mid Y_t = i)\,\hat{p}_i$. However, we don’t explicitly have the emission distribution, which would let us just replace $\hat{p}_i$ by $p_i$ and $\hat{f}$ by $f$ to exactly reconstruct the conditional class probabilities. Besides, this would mean we know the whole model anyway. Instead, starting from any estimate $\hat{P}(Y_t = i \mid X_t)$ of the CCPs, we let $\hat{P}^*(Y_t = i \mid X_t) = \hat{P}(Y_t = i \mid X_t)\, p_i / \hat{p}_i$. Notice that when $\hat{P}^*$ is used to smooth probabilities, since we know the true $p_i$ and divide by it, this correction yields $\hat{P}^*(Y_t = i \mid X_t) / p_i = \hat{P}(Y_t = i \mid X_t) / \hat{p}_i$. This is in contrast to the preceding case, where we ended up with truth over truth.
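The correction is short enough to sketch directly. The function name `rescale_ccps` and the optional renormalization step are assumptions of this example: the text defines the correction without renormalizing, and the renormalized variant is included only because it returns values that sum to one. The numbers are hypothetical.

```python
def rescale_ccps(ccp, est_p, true_p, renormalize=False):
    """Multiply each estimated CCP by p_i / p_hat_i.

    ccp[i]    : estimate of P(Y_t = i | X_t)
    est_p[i]  : sample marginal p_hat_i
    true_p[i] : true marginal p_i
    renormalize : assumption of this sketch, not part of the stated
                  correction; rescales the result to sum to one.
    """
    raw = [c * p / q for c, p, q in zip(ccp, true_p, est_p)]
    if not renormalize:
        return raw
    total = sum(raw)
    return [r / total for r in raw]


# Hypothetical values for one observation.
ccp_hat = [0.6, 0.4]
p_hat = [0.5, 0.5]
p_true = [0.7, 0.3]

star = rescale_ccps(ccp_hat, p_hat, p_true)

# Dividing the corrected CCP by the true marginal gives back
# the estimate over the estimated marginal.
lhs = [s / p for s, p in zip(star, p_true)]
rhs = [c / q for c, q in zip(ccp_hat, p_hat)]
print(lhs, rhs)
```

The renormalized variant changes the proxy only by a constant factor across classes, which a Forward-Backward pass that normalizes at each step would absorb.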
Finally, we point out that this type of correction to class probability estimates requires further investigation, which we discuss in much greater depth in section 5.4.