4.2 Projection Methods for Discrete Data
4.2.2 Latent Dirichlet Allocation (LDA)
It is argued that pLSI is not a well-defined model, since it treats each document as an index and thus does not generalize to new documents. Another problem of pLSI is that longer documents get higher weights in the model, which also indicates that the documents are not independently sampled. To solve these problems, Blei et al. [10] introduced the latent Dirichlet allocation (LDA) model, which shows better performance than pLSI.
The plate model for LDA is shown in Figure 4.4. Given the Dirichlet prior $\alpha$, document $i$ is sampled independently through a topic mixture $\theta_i$ which defines the mixing weights for this document. Then all the $N_i$ words within this document are sampled independently by first choosing a topic $z$ given the topic mixture $\theta_i$, and then sampling the word based on the projection matrix $\beta$. Since the contents rather than the document indices are modeled directly, the model is generalizable to new documents.
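The generative process just described can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `sample_lda_corpus`, the toy values of `alpha` and `beta`, and the document lengths are hypothetical and not taken from [10]; `beta` is assumed to hold one word distribution per topic in each row.

```python
import numpy as np

def sample_lda_corpus(alpha, beta, doc_lengths, seed=0):
    """Draw documents from the LDA generative process (illustrative sketch).

    alpha       : (K,) Dirichlet prior over topic mixtures
    beta        : (K, V) matrix, row k is the word distribution of topic k
    doc_lengths : list with the number of words N_i of each document
    """
    rng = np.random.default_rng(seed)
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    n_topics, vocab_size = beta.shape
    docs, mixtures = [], []
    for n_words in doc_lengths:
        theta = rng.dirichlet(alpha)                      # topic mixture theta_i
        z = rng.choice(n_topics, size=n_words, p=theta)   # one topic per word
        words = np.array(
            [rng.choice(vocab_size, p=beta[k]) for k in z], dtype=int
        )                                                 # word drawn from its topic
        docs.append(words)
        mixtures.append(theta)
    return docs, mixtures

# Toy setting (assumed values): 3 topics over a 6-word vocabulary.
alpha = [0.5, 0.5, 0.5]
beta = np.array([
    [0.4, 0.4, 0.1, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.4, 0.4, 0.1, 0.0],
    [0.0, 0.0, 0.1, 0.1, 0.4, 0.4],
])
docs, mixtures = sample_lda_corpus(alpha, beta, doc_lengths=[8, 5, 12])
```

Note that no document index enters the sampler: a new document is generated by drawing a fresh $\theta_i$ from the same prior, which is the property that pLSI lacks.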
Since the latent variables $\theta_i$ and $z_{i,n}$ are coupled, exact learning and inference are intractable for the LDA model. Blei et al. [10] adopt a variational EM learning algorithm which iteratively updates an approximation to the posterior of $\theta_i$ and $z_{i,n}$ through a variational distribution.
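For orientation, the variational E-step of [10] can be written out as follows. The symbols $\gamma_i$, $\phi_{i,n}$ and $\Psi$ (the digamma function) are not defined in this section; they are borrowed from [10], and the indexing is adapted to the notation used here, so this should be read as a sketch rather than a restatement of the original derivation. $w_{i,n}$ denotes the $n$-th word of document $i$, and $k$ indexes topics.

```latex
% Factorized variational distribution for document i
q(\theta_i, z_i \mid \gamma_i, \phi_i)
  = \mathrm{Dir}(\theta_i \mid \gamma_i)
    \prod_{n=1}^{N_i} \mathrm{Mult}(z_{i,n} \mid \phi_{i,n})

% Coordinate updates, iterated until convergence
\phi_{i,n,k} \propto \beta_{k, w_{i,n}}
  \exp\Big( \Psi(\gamma_{i,k}) - \Psi\big(\textstyle\sum_{j} \gamma_{i,j}\big) \Big)

\gamma_{i,k} = \alpha_k + \sum_{n=1}^{N_i} \phi_{i,n,k}
```

The factorization removes the coupling between $\theta_i$ and $z_{i,n}$, which is what makes each update available in closed form.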
Remark 4.2.1. The LDA-type models are “correct” for document modeling since they have the correct independence assumptions compared to the pLSI model. This model is called discrete PCA in [12] because it can be viewed as using a multinomial distribution instead of a Gaussian in a PCA-type representation, as clarified at the beginning of this section.
Figure 4.4: Plate model for LDA
4.3 Organization of the Following Chapters
In Chapter 5 we review a probabilistic explanation of PCA which is known as PPCA, and then extend it to kernel PCA, which can handle non-linear projections. An EM learning algorithm is derived for this model which is faster and has the potential to scale to large data sets. An incremental kernel PCA can also be straightforwardly derived.
Then in Chapter 6 we go beyond unsupervised projection and introduce the MORP algorithm for supervised projection. Here we assume there are not only features associated with each data point, but also some output labels. MORP is motivated by a latent variable model and can handle both linear and non-linear projection. Experiments show that the model outperforms other supervised projection methods on various data sets.
Finally, in Chapter 7 we consider a probabilistic version of MORP and introduce the SPPCA and S2PPCA models for supervised and semi-supervised projections, respectively. An efficient EM algorithm can be derived which can handle large data sets. The semi-supervised nature of the S2PPCA model makes it suitable for applications such as face recognition and text classification.
Projection models for discrete data will be considered in Part III where we jointly perform clustering and projection for discrete data.
Chapter 5
Probabilistic Kernel Principal Component Analysis
Various projection methods exist for continuous data, among which principal component analysis (PCA) is a linear projection method that has become very popular over the last several decades. To relax the linearity of PCA, kernel PCA was introduced to generalize linear PCA to non-linear mappings via non-linear kernel functions. Both methods reduce to an eigenvalue problem and have clear mathematical formulations, but the following questions arise in some real-world problems:
• How to apply PCA if the input data have missing entries?
• How to apply these methods if the dimensionality M or the number of data points N is too large?
• Is it possible to perform PCA locally instead of globally?
For deterministic methods like PCA and kernel PCA, it is difficult to deal with these questions directly. A probabilistic model, on the other hand, can handle them easily: the missing entries can be integrated out for learning; the EM algorithm can be used to solve the problem iteratively and possibly incrementally; localized PCA can be done via a mixture model.
At the end of the last century, several authors proposed a probabilistic version of PCA, which in this thesis we call probabilistic PCA (PPCA) [70, 59]. They show that their models recover canonical linear PCA in the asymptotic case, and that PPCA provides the benefits listed above, which canonical PCA lacks. However, this probabilistic interpretation holds only for linear PCA, and there is no similar model which gives the same benefits for kernel PCA. Providing such a model is exactly the goal of this chapter. In Section 5.1 we first review the PPCA framework and discuss the EM algorithm for learning the PPCA model. This algorithm is extended in Section 5.2 to non-linear cases and is proved to solve a kernel PCA problem. Some discussions are given thereafter, along with the benefits of the proposed framework. To show one of these benefits, we discuss an incremental version of the algorithm in Section 5.3 and illustrate its usage on some toy data.
Figure 5.1: Illustration of the PPCA model. $\mathbf{X}$ denotes the input matrix, where each row is one data point. $f_{x_1}, \ldots, f_{x_M}$ are the $M$ input features. On the top, $f_{z_1}, \ldots, f_{z_K}$ are the $K$ latent variables. They are all drawn in circles because they are variables in the probabilistic model. The arrows denote probabilistic dependency.
5.1 Probabilistic Principal Component Analysis (PPCA)
While PCA originates from the analysis of data variances, the PPCA model emerges from the statistics community and acts as a probabilistic explanation for PCA [70, 59]. PPCA is a latent variable model and defines a generative process for each data point $\mathbf{x}$ as (see Figure 5.1 for an illustration)
$$\mathbf{x} = \mathbf{W}\mathbf{z} + \boldsymbol{\mu} + \boldsymbol{\epsilon},$$
where $\mathbf{z} \in \mathbb{R}^K$ are called the latent variables, and $\mathbf{W}$ is an $M \times K$ matrix called the factor loadings. In this probabilistic model, the latent variables $\mathbf{z}$ are conventionally assumed to follow a Gaussian distribution with zero mean and unit variance, i.e., $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, and $\boldsymbol{\epsilon}$ defines a noise process which also takes an isotropic Gaussian form, $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$, with $\sigma^2$ the noise level. Additionally, we have parameters $\boldsymbol{\mu} \in \mathbb{R}^M$ which allow non-zero means for the data.
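The PPCA model indicates that, given the latent variable $\mathbf{z}$, $\mathbf{x}$ is Gaussian distributed:
$$\mathbf{x} \mid \mathbf{z} \sim \mathcal{N}(\mathbf{W}\mathbf{z} + \boldsymbol{\mu}, \sigma^2\mathbf{I}).$$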
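With $\mathbf{z}$ integrated out, it turns out that the observation $\mathbf{x}$ is also Gaussian distributed:
$$\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{W}\mathbf{W}^\top + \sigma^2\mathbf{I}).$$
Based on Bayes' rule, the a posteriori distribution of $\mathbf{z}$ given observation $\mathbf{x}$ is also a Gaussian:
$$\mathbf{z} \mid \mathbf{x} \sim \mathcal{N}\Big( (\mathbf{W}^\top\mathbf{W} + \sigma^2\mathbf{I})^{-1}\mathbf{W}^\top(\mathbf{x} - \boldsymbol{\mu}),\; \sigma^2(\mathbf{W}^\top\mathbf{W} + \sigma^2\mathbf{I})^{-1} \Big). \tag{5.2}$$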
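The generative process and the posterior (5.2) can be checked numerically. The sketch below samples from a PPCA model with hand-picked parameters, compares the empirical covariance of the samples with $\mathbf{W}\mathbf{W}^\top + \sigma^2\mathbf{I}$ (the covariance of the PPCA marginal, assumed here in its standard form), and evaluates the posterior for one observation. The function names `ppca_sample` and `ppca_posterior` and all parameter values are illustrative assumptions.

```python
import numpy as np

def ppca_sample(W, mu, sigma2, n_samples, rng):
    """Draw x = W z + mu + eps with z ~ N(0, I) and eps ~ N(0, sigma2 * I)."""
    M, K = W.shape
    Z = rng.standard_normal((n_samples, K))
    noise = np.sqrt(sigma2) * rng.standard_normal((n_samples, M))
    return Z @ W.T + mu + noise, Z

def ppca_posterior(W, mu, sigma2, x):
    """Mean and covariance of z | x, following (5.2)."""
    K = W.shape[1]
    A = W.T @ W + sigma2 * np.eye(K)          # K x K, cheap to invert when K << M
    mean = np.linalg.solve(A, W.T @ (x - mu))
    cov = sigma2 * np.linalg.inv(A)
    return mean, cov

# Hand-picked toy parameters (assumed values): M = 5 features, K = 2 latent dims.
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0], [0.5, 0.5]])
mu = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
sigma2 = 0.1

rng = np.random.default_rng(0)
X, Z = ppca_sample(W, mu, sigma2, 200_000, rng)

C_model = W @ W.T + sigma2 * np.eye(W.shape[0])   # covariance of the marginal of x
C_emp = np.cov(X, rowvar=False)                   # empirical covariance of the samples
max_cov_error = np.max(np.abs(C_emp - C_model))   # shrinks as n_samples grows

post_mean, post_cov = ppca_posterior(W, mu, sigma2, X[0])
```

The posterior only requires solving a $K \times K$ system, so it stays inexpensive even when the input dimensionality $M$ is large.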
Remark 5.1.1. The generative model for PPCA is similar to factor analysis [5]. The only difference is the noise process: in factor analysis the noise levels for different dimensions can differ, leading to a noise process $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma})$ with $\boldsymbol{\Sigma} = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_M^2)$. Both models assume that in the noise process every two dimensions are independent. For a detailed comparison of these two models please refer to [70].
It is shown that PPCA has strong connections to PCA. We summarize the related results in the following proposition without proof, since this is a corollary of Theorem 7.3.1 in Chapter 7. A detailed proof can also be found in the Appendix of [70].
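Proposition 5.1.1. Let
$$\mathbf{S} = \frac{1}{N}\sum_{i=1}^{N} (\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^\top$$
be the sample covariance matrix for data $\{\mathbf{x}_i\}_{i=1}^{N}$, and let $\lambda_1 \geq \ldots \geq \lambda_M$ be its eigenvalues with eigenvectors $\mathbf{u}_1, \ldots, \mathbf{u}_M$. Then, if the latent space in the PPCA model is $K$-dimensional,
(i) The maximum likelihood estimates of the mean $\boldsymbol{\mu}$ and the noise level $\sigma^2$ are respectively
$$\boldsymbol{\mu} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i, \qquad \sigma^2 = \frac{1}{M-K}\sum_{j=K+1}^{M} \lambda_j. \tag{5.3}$$
(ii) The maximum likelihood estimate of $\mathbf{W}$ is given as
$$\mathbf{W} = \mathbf{U}_K(\boldsymbol{\Lambda}_K - \sigma^2\mathbf{I})^{1/2}\mathbf{R}, \tag{5.4}$$
where $\boldsymbol{\Lambda}_K = \mathrm{diag}(\lambda_1, \ldots, \lambda_K)$, $\mathbf{U}_K = [\mathbf{u}_1, \ldots, \mathbf{u}_K]$, and $\mathbf{R}$ is an arbitrary $K \times K$ orthogonal matrix.
(iii) The mean projection $\mathbf{z}^*$ for a new input $\mathbf{x}^*$ is given as
$$\mathbf{z}^* = \mathbf{R}^\top(\boldsymbol{\Lambda}_K - \sigma^2\mathbf{I})^{1/2}\boldsymbol{\Lambda}_K^{-1}\mathbf{U}_K^\top(\mathbf{x}^* - \boldsymbol{\mu}).$$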
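The closed-form estimates of the proposition translate directly into an eigen-decomposition of the sample covariance. The following is a minimal sketch, assuming $K < M$ so that the noise estimate in (5.3) is defined; the function names `ppca_ml` and `ppca_project`, the toy data, and the rotation angle are hypothetical choices made for illustration.

```python
import numpy as np

def ppca_ml(X, K, R=None):
    """Closed-form maximum likelihood PPCA fit, following (5.3) and (5.4)."""
    N, M = X.shape
    mu = X.mean(axis=0)                           # sample mean, (5.3)
    Xc = X - mu
    S = Xc.T @ Xc / N                             # sample covariance with 1/N factor
    lam, U = np.linalg.eigh(S)                    # eigh returns ascending eigenvalues
    lam, U = lam[::-1], U[:, ::-1]                # reorder to lambda_1 >= ... >= lambda_M
    sigma2 = lam[K:].mean()                       # average of the M-K smallest, (5.3)
    if R is None:
        R = np.eye(K)                             # any orthogonal R is equally valid
    scale = np.sqrt(np.maximum(lam[:K] - sigma2, 0.0))
    W = (U[:, :K] * scale) @ R                    # U_K (Lambda_K - sigma2 I)^{1/2} R, (5.4)
    return mu, sigma2, W, U[:, :K], lam[:K]

def ppca_project(x_new, mu, sigma2, U_K, lam_K, R):
    """Mean projection z* of a new input, item (iii) of the proposition."""
    coeff = np.sqrt(np.maximum(lam_K - sigma2, 0.0)) / lam_K
    return R.T @ (coeff * (U_K.T @ (x_new - mu)))

# Toy data (assumed setup): N = 500 points, M = 4 features, K = 2 latent dims.
rng = np.random.default_rng(0)
N, M, K = 500, 4, 2
W_true = rng.standard_normal((M, K))
X = rng.standard_normal((N, K)) @ W_true.T + 0.1 * rng.standard_normal((N, M))

angle = 0.3                                       # arbitrary rotation of the latent space
R = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])

mu_hat, sigma2_hat, W_hat, U_K, lam_K = ppca_ml(X, K, R)
x_new = X[0]
z_star = ppca_project(x_new, mu_hat, sigma2_hat, U_K, lam_K, R)
```

As a consistency check, `z_star` should agree with the posterior mean of (5.2) evaluated at the fitted parameters, since item (iii) is that mean with (5.4) substituted for $\mathbf{W}$.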
It is seen that the mean vector $\boldsymbol{\mu}$ is simply the sample mean, and the noise level $\sigma^2$ is the average of the $M-K$ smallest eigenvalues. The loading matrix $\mathbf{W}$ has an arbitrary factor $\mathbf{R}$ which is an orthogonal matrix. This indicates that the latent space is invariant under an arbitrary rotation. One can perform an SVD of $\mathbf{W}^\top\mathbf{W}$ to recover $\mathbf{R}$ if necessary.
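The role of $\mathbf{R}$ can be made explicit by substituting (5.4) into the two products of $\mathbf{W}$ with its transpose. The short derivation below uses only $\mathbf{U}_K^\top\mathbf{U}_K = \mathbf{I}$ and $\mathbf{R}\mathbf{R}^\top = \mathbf{I}$; it is offered as a worked check rather than as part of the original proof.

```latex
% The model covariance does not depend on R:
\mathbf{W}\mathbf{W}^\top
  = \mathbf{U}_K (\boldsymbol{\Lambda}_K - \sigma^2\mathbf{I})^{1/2}
    \mathbf{R}\mathbf{R}^\top
    (\boldsymbol{\Lambda}_K - \sigma^2\mathbf{I})^{1/2} \mathbf{U}_K^\top
  = \mathbf{U}_K (\boldsymbol{\Lambda}_K - \sigma^2\mathbf{I}) \mathbf{U}_K^\top

% The Gram matrix of the loadings exposes R:
\mathbf{W}^\top\mathbf{W}
  = \mathbf{R}^\top (\boldsymbol{\Lambda}_K - \sigma^2\mathbf{I})^{1/2}
    \mathbf{U}_K^\top \mathbf{U}_K
    (\boldsymbol{\Lambda}_K - \sigma^2\mathbf{I})^{1/2} \mathbf{R}
  = \mathbf{R}^\top (\boldsymbol{\Lambda}_K - \sigma^2\mathbf{I}) \mathbf{R}
```

The first identity shows why the likelihood cannot distinguish between rotations: the data covariance $\mathbf{W}\mathbf{W}^\top + \sigma^2\mathbf{I}$ is the same for every $\mathbf{R}$. The second shows that the columns of $\mathbf{R}^\top$ are eigenvectors of $\mathbf{W}^\top\mathbf{W}$ with eigenvalues $\lambda_j - \sigma^2$, so $\mathbf{R}$ is recovered up to the sign of each eigenvector, provided these eigenvalues are distinct.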