
3.4 Empirical Studies

3.4.2 Image Data

The images are fitted with Gaussian mixtures with K pre-defined to be 1021, the number of images in the whole data set. Model parameters are set as β_0 = 0.1, ν_0 = 5, with m_0 and W_0 defined as above. Using the VBDMA method, we obtain 18 and 56 mixture components with α = 1 and α = 500, respectively, for which the soft weights are shown in descending order in Figure 3.7. For α = 1, only 5 of the 18 components have weights larger than 0.01; for α = 500 this number is 13. To show the characteristics of each component, we illustrate the 10 images with the highest weights for each component, one row per component, in Figure 3.8 and Figure 3.9, for α = 1 and α = 500 respectively. It can be seen that images with different color histograms are grouped reasonably, especially for α = 1. Setting α = 500 leads to more components, among which, for instance, castles are split into two components (1 and 3) due to different colors. In general, a larger α leads to more but still meaningful components in the mixture modeling. There is noise in this mixture modeling, since fitting such a flexible Gaussian mixture model to image data with a reasonable number of dimensions is not an easy task.

Figure 3.7: Component weights after learning with VBDMA on the Corel image data set. Panels: α = 1, K = 1021 and α = 500, K = 1021.
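The summary statistics used above (sorted soft weights, the number of components above 0.01, and the highest-weight images per component) can be computed from a matrix of soft assignments. The following is a minimal sketch, assuming that a component's soft weight is the average of its responsibilities over all images; the function name `summarize_components` and the matrix `resp` are hypothetical toy examples, not the Corel results.

```python
import numpy as np

def summarize_components(resp, threshold=0.01, top_n=10):
    """Summarize a soft clustering given an (N, K) responsibility matrix."""
    resp = np.asarray(resp, dtype=float)
    weights = resp.mean(axis=0)                 # assumed soft weight per component
    order = np.argsort(weights)[::-1]           # components by descending weight
    n_large = int(np.sum(weights > threshold))  # components above the threshold
    # For each retained component, the indices of its top_n members.
    top_members = {
        int(k): np.argsort(resp[:, k])[::-1][:top_n].tolist()
        for k in order[:n_large]
    }
    return weights[order], n_large, top_members

# Hypothetical toy data: 4 items, 3 components (each row sums to 1).
resp = np.array([
    [0.90, 0.10, 0.00],
    [0.80, 0.20, 0.00],
    [0.10, 0.90, 0.00],
    [0.70, 0.30, 0.00],
])
sorted_weights, n_large, top_members = summarize_components(resp, top_n=2)
```

Here the third component receives no mass and is dropped by the threshold, which mirrors how empty components are discounted when reading Figure 3.7.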

The same parameter settings are used for the VBTDP method, and the results are shown in Figure 3.10. For the two different α values, VBTDP obtains 16 and 20 components, respectively, of which 10 and 11 have weights larger than 0.01. We also show the images of each component in Figure 3.10, and it can be seen that...

3.5 Summary

This chapter answers the important question of how to identify the number of mixture components, and builds connections to infinite mixture modeling with Dirichlet processes. It turns out that the symmetric Dirichlet prior proposed in the last chapter for the mixing weights is the key to approximating a DP in the limiting case of K → ∞. While this prior has been used in several previous works [30, 16], its connection to the non-parametric Bayesian framework with Dirichlet processes has never been uncovered. Several researchers have observed the phenomenon that their models (which use this prior) can automatically detect the number of mixture components, but they either provide no explanation for it [16] or give an incorrect one [30].

50 CHAPTER 3. INFINITE MIXTURE MODELS

Figure 3.8: The 5 clusters of largest weights using VBDMA with α = 1, one per row with 10 images of largest membership weights.

We emphasize that this connection to the Dirichlet process is important, in that:

• This explains why parametric Bayesian modeling of finite mixtures obtains results similar to those of non-parametric modeling. Automatically detecting the number of mixture components is one important result among others.

• This makes it possible to borrow known results about the Dirichlet process from the statistics community and apply them to finite parametric modeling. For example, the concentration parameter α in the DP is known to control the flexibility of generating new mixture components. This is also the case in finite modeling, as seen in this chapter.

• Variational methods originally developed for finite mixture models can (in general) be applied for model inference and learning with a DP prior. This makes it possible to propose complicated models with a possibly infinite number of mixture components, and then use the known variational methods to solve them with a finite-mixture approximation. This can be applied, for instance, to infinite mixtures of probabilistic PCA [69] and infinite mixtures of latent Dirichlet allocation (LDA) [10]. These are, however, left as future work.
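The role of α noted in the list above can be illustrated by sampling mixing weights directly from a finite symmetric Dirichlet prior. The sketch below is not VBDMA and involves no data; it assumes the symmetric prior takes the standard form Dir(α/K, ..., α/K), which this section does not restate, and it uses a hypothetical helper `components_for_mass` that counts how many of the largest weights are needed to cover 90% of the probability mass.

```python
import numpy as np

def components_for_mass(weights, mass=0.9):
    """Smallest number of components whose weights sum to at least `mass`."""
    sorted_w = np.sort(weights)[::-1]
    return int(np.searchsorted(np.cumsum(sorted_w), mass) + 1)

rng = np.random.default_rng(0)
K = 1021  # truncation level used in the image experiment

counts = {}
for alpha in (1.0, 500.0):
    # Assumed form of the symmetric prior: Dir(alpha/K, ..., alpha/K).
    w = rng.dirichlet(np.full(K, alpha / K))
    counts[alpha] = components_for_mass(w)

# A small alpha concentrates the prior mass on far fewer components.
print(counts)
```

Under this assumption, a draw with α = 1 places most of its mass on a handful of components, while a draw with α = 500 spreads it over many, which is the qualitative behavior the concentration parameter is said to control.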

Another variational Bayesian method for infinite mixture models is VBTDP, which is based on the TDP as a finite approximation to the DP. Some empirical studies are performed in this chapter to compare these two variational solutions. VBDMA is shown to be superior to VBTDP, and has the nice property of controlling the sparsity of mixture modeling with a single hyperparameter α.

For future work one can apply the proposed model to more complex exponential family distributions for various applications, e.g., infinite mixtures of probabilistic PCA models and infinite mixtures of latent Dirichlet allocation models. The usage of Dirichlet processes in relational learning is also worth investigating [78].

Figure 3.9: The 13 clusters of largest weights using VBDMA with α = 500, one per row with 10 images of largest membership weights.


Panels: α = 1, K = 1021 and α = 500, K = 1021.

Part II

Probabilistic Projection Models

Chapter 4

Overview of Projection Models

Dimensionality reduction aims to find a low-dimensional representation of the data. This is important for many machine learning and data mining problems, either because the original dimensionality is too high to apply certain algorithms, or because there are noisy dimensions which we want to remove. Other motivations include improving the efficiency of learning and accelerating certain algorithms. One typical example of high-dimensional data is the text document corpus for information retrieval. In the vector space model (VSM), each document is represented as a vector of terms, and each term defines one dimension in the space. Hence the dimensionality of the space is the number of distinct terms in the corpus, which can easily reach tens of thousands for a medium-sized corpus. Another example is face recognition, where each face image is cropped to 32 × 32 pixels and has at least one feature for each pixel. This also generates a very high-dimensional space for the face images.

One way of reducing the number of features is to select a subset of features which are informative enough or sufficient for the learning algorithms. This is in general called feature selection and has been extensively studied in the data mining community. For information retrieval, removing the stop-words (common words like "the" and "of") is one popular feature selection technique. Feature selection is in general fast and easy to implement, but in many applications it is not the individual features that are important, but a (weighted) combination of them. Extracting these features is not only important for the learning algorithms, but also useful for understanding the data. This so-called feature transformation or projection is the main topic of this overview.
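The difference between selecting features and transforming them can be made concrete with a small matrix example. The sketch below is purely illustrative: the vocabulary, the counts in `X`, and the projection matrix `W` are hypothetical, and `W` is hand-written rather than learned by any of the methods discussed in this part.

```python
import numpy as np

# Hypothetical document-term counts: 3 documents over the vocabulary below.
vocab = ["the", "of", "castle", "tower", "beach"]
X = np.array([
    [5, 3, 2, 1, 0],
    [4, 2, 0, 0, 3],
    [6, 1, 1, 2, 0],
], dtype=float)

# Feature selection: keep a subset of the original columns (drop stop-words).
stop_words = {"the", "of"}
keep = [j for j, term in enumerate(vocab) if term not in stop_words]
X_selected = X[:, keep]   # shape (3, 3); each column is still a single term

# Feature transformation: each new feature is a weighted combination of terms.
# W is an arbitrary illustrative projection matrix, not a learned one.
W = np.array([
    [0.0, 0.0],
    [0.0, 0.0],
    [0.5, 0.0],   # "castle" and "tower" are averaged into one feature
    [0.5, 0.0],
    [0.0, 1.0],   # "beach" forms the second feature
])
X_projected = X @ W       # shape (3, 2)
```

After projection, the first and third documents coincide on the first new feature even though their raw "castle" and "tower" counts differ, which selection alone cannot express.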

Most of the research on projection methods works on continuous data, i.e., each feature dimension can take a real value in some domain. The face image data belongs to this category. In this chapter, however, we consider both continuous data and discrete data, where for discrete data the domain of each feature dimension is the set of (non-negative) integers. One typical example is the bag-of-words representation of the document-word matrix, where each entry denotes the number of occurrences of the corresponding word in the document. In the rest of this chapter we summarize different projection methods for these two types of data (in Section 4.1 and Section 4.2, respectively), and in Section 4.3 we outline the road-map of our contributions in the following chapters.
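As an illustration of the discrete case, the bag-of-words representation can be built with the standard library alone. This is a minimal sketch, assuming lower-casing and whitespace tokenization; the function name `bag_of_words` and the two example documents are hypothetical.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, count matrix) for a list of texts."""
    tokenized = [doc.lower().split() for doc in documents]
    # One dimension per distinct term in the corpus.
    vocab = sorted({term for tokens in tokenized for term in tokens})
    index = {term: j for j, term in enumerate(vocab)}
    matrix = []
    for tokens in tokenized:
        row = [0] * len(vocab)
        for term, count in Counter(tokens).items():
            row[index[term]] = count   # occurrences of the term in the document
        matrix.append(row)
    return vocab, matrix

docs = ["the castle on the hill", "the beach"]
vocab, counts = bag_of_words(docs)
```

Every entry of `counts` is a non-negative integer, and the number of columns equals the number of distinct terms, which is why the dimensionality grows quickly with corpus size.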

We summarize our notation here for clarity. In what follows we consider a set of N data items (e.g., documents or images), where each item i is described by an M-dimensional feature vector x_i ∈ X. For continuous data X ⊂ ℝ^M, and for discrete data X ⊂ ℤ_+^M. For projection we aim to derive a mapping Ψ : X ↦ Z which maps the input features into a K-dimensional space Z.
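As one concrete instance of such a mapping Ψ for continuous data, the sketch below uses plain principal component analysis computed via the SVD. PCA is chosen here only for illustration and is not prescribed by the notation above; the function name `pca_projection` and the synthetic data are assumptions.

```python
import numpy as np

def pca_projection(X, K):
    """Return a function psi mapping M-dimensional inputs to K dimensions."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    components = Vt[:K]                      # shape (K, M)

    def psi(x):
        return (np.asarray(x) - mean) @ components.T

    return psi

rng = np.random.default_rng(0)
N, M, K = 100, 5, 2
X = rng.normal(size=(N, M))                  # synthetic continuous data in R^M
psi = pca_projection(X, K)
Z = psi(X)                                   # shape (N, K)
```

The returned `psi` accepts either a single feature vector or a whole data matrix, so the same mapping can be applied to items not used to fit it.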

4.1 Projection Models for Continuous Data

In this section we give an overview of projection models for continuous data, mostly from the perspective of global projection. Some local projection methods are summarized at the end of this section.