
2.5 Deep Autoencoding Density Model


$$\hat{\Sigma}_k = \frac{\sum_{i=1}^{N} \hat{\gamma}_{ik} (z_i - \hat{\mu}_k)(z_i - \hat{\mu}_k)^{T}}{\sum_{i=1}^{N} \hat{\gamma}_{ik}}, \tag{2.59}$$

where $p$ is the output of the multi-layer perceptron $\mathrm{MLP}(\cdot)$, $\mathrm{softmax}(\cdot)$ denotes the soft-max function, and $\hat{\phi}_k$, $\hat{\mu}_k$, $\hat{\Sigma}_k$ are the mixing proportion, mean, and covariance of the $k$-th component in the GMM.

The loss function of the estimation network is given by the negative log-likelihood under these estimated parameters:

$$E(z) = -\log\left\{ \sum_{k=1}^{K} \hat{\phi}_k \, \mathcal{N}(z; \hat{\mu}_k, \hat{\Sigma}_k) \right\}. \tag{2.60}$$

In addition, the regularisation

$$P(\hat{\Sigma}_k) = \sum_{k=1}^{K} \sum_{j=1}^{d} (\hat{\Sigma}_{kjj})^{-1}$$

alleviates the singularity problem by penalising small values on the diagonal entries, where $d$ is the dimensionality of the low-dimensional representations. Given the above, the objective function can be constructed as follows:

$$J(\theta_c, \theta_d, \theta_p) = \frac{1}{N} \sum_{i=1}^{N} L(x_i, x_i') + \frac{\lambda_1}{N} \sum_{i=1}^{N} E(z_i) + \frac{\lambda_2}{N} P(\hat{\Sigma}_k), \tag{2.61}$$

where $\lambda_1$ and $\lambda_2$ are meta-parameters, usually set to $\lambda_1 = 0.1$ and $\lambda_2 = 0.005$.

Since the model is mainly used for unsupervised anomaly detection, the reconstruction error features are added as input to the GMM [86]. In essence, the model exploits the fact that anomalies deviate from the clusters in the low-dimensional space and are difficult to reconstruct. A sample is therefore predicted to be an anomaly when its $E(z)$ is higher than a pre-chosen threshold. Compared with other anomaly detection models, such as the Deep Clustering Network (DCN) and DSEBM-r [87, 88], DAGMM is an end-to-end model that optimises the parameters of the deep AE and the GMM simultaneously.
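The covariance estimate in Eq. (2.59), the negative log-likelihood in Eq. (2.60), and the diagonal penalty can be sketched numerically as follows. This is a minimal NumPy illustration, not the original implementation: the function names, the small jitter `eps`, and the use of the standard weighted estimates for the mixing proportions and means are assumptions made here for the sake of a self-contained example.

```python
import numpy as np

def gmm_params(z, gamma):
    """Estimate GMM parameters from soft memberships.

    z:     (N, d) low-dimensional representations
    gamma: (N, K) soft membership predictions, each row summing to 1
    """
    n_k = gamma.sum(axis=0)                       # effective count per component
    phi = n_k / z.shape[0]                        # mixing proportions
    mu = (gamma.T @ z) / n_k[:, None]             # (K, d) weighted means
    diff = z[:, None, :] - mu[None, :, :]         # (N, K, d)
    # Weighted sum of outer products, normalised as in Eq. (2.59)
    sigma = np.einsum("nk,nki,nkj->kij", gamma, diff, diff) / n_k[:, None, None]
    return phi, mu, sigma

def sample_energy(z, phi, mu, sigma, eps=1e-6):
    """E(z) = -log sum_k phi_k N(z; mu_k, Sigma_k), as in Eq. (2.60)."""
    d = z.shape[1]
    sigma = sigma + eps * np.eye(d)               # jitter: an assumption for numerical stability
    diff = z[:, None, :] - mu[None, :, :]         # (N, K, d)
    inv = np.linalg.inv(sigma)                    # (K, d, d)
    maha = np.einsum("nki,kij,nkj->nk", diff, inv, diff)
    _, logdet = np.linalg.slogdet(sigma)          # (K,)
    log_prob = np.log(phi) - 0.5 * (maha + logdet + d * np.log(2.0 * np.pi))
    # Log-sum-exp over components, computed stably
    top = log_prob.max(axis=1, keepdims=True)
    return -(top[:, 0] + np.log(np.exp(log_prob - top).sum(axis=1)))

def covariance_penalty(sigma):
    """Sum of inverse diagonal entries over all components."""
    return np.sum(1.0 / np.diagonal(sigma, axis1=1, axis2=2))

# Toy usage: two well-separated clusters with hard memberships
rng = np.random.default_rng(0)
z = np.vstack([rng.normal(0.0, 0.5, (100, 2)), rng.normal(5.0, 0.5, (100, 2))])
gamma = np.zeros((200, 2))
gamma[:100, 0] = 1.0
gamma[100:, 1] = 1.0
phi, mu, sigma = gmm_params(z, gamma)
inlier_energy = sample_energy(z, phi, mu, sigma)
outlier_energy = sample_energy(np.array([[20.0, -20.0]]), phi, mu, sigma)
```

In this toy setting the point far from both clusters receives a much larger $E(z)$ than any training point, which is the behaviour the thresholding rule relies on.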

2.5.3 Rejection

In real-world applications, some samples may not belong to any known class. It might therefore be necessary to refuse to make decisions on these samples, which can further reduce the error rate. The rejected samples can be discarded or held back until more information is available. This process is called rejection recognition.

DAGMM is mainly applied to one-class detection, using the value of the log-likelihood as the detection criterion. In this case, most of the data are considered normal and modelled in an unsupervised way, and a sample is detected as abnormal if its log-likelihood is smaller than a pre-defined threshold. Another case is so-called out-of-distribution detection, which applies to multi-class detection. It rejects test samples drawn from a distribution different from that of the training data by learning a prediction confidence. Recent work has demonstrated that common multi-class classifiers (neural networks) tend to make highly confident predictions on all test samples, even if they are completely unrecognisable or irrelevant inputs [89–92]. In recent years, several approaches have been proposed to improve the classifier so that such uncertainty can be taken into account. One seemingly straightforward approach is to enlarge the training set, but the number of possible out-of-distribution examples is unbounded. It remains a challenge [86, 93] to detect out-of-distribution examples without further re-training the networks.
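The log-likelihood criterion for one-class detection amounts to a simple reject option, which can be sketched as below. This is only an illustration under stated assumptions: the `REJECT` marker, the function names, and the quantile rule for choosing the threshold are hypothetical choices made here and are not taken from the cited works.

```python
import numpy as np

REJECT = -1  # hypothetical label used to mark rejected samples

def choose_threshold(train_log_lik, reject_rate=0.05):
    """Pick a threshold so that roughly `reject_rate` of the training
    log-likelihoods fall below it (an assumed heuristic)."""
    return np.quantile(np.asarray(train_log_lik, dtype=float), reject_rate)

def predict_with_rejection(log_lik, predicted_labels, threshold):
    """Keep a prediction only if the sample's log-likelihood reaches the threshold."""
    log_lik = np.asarray(log_lik, dtype=float)
    predicted_labels = np.asarray(predicted_labels)
    return np.where(log_lik >= threshold, predicted_labels, REJECT)

# Toy usage: the second sample has a very low log-likelihood and is rejected
decisions = predict_with_rejection([-1.0, -50.0, -2.0], [0, 1, 2], threshold=-10.0)
```

The rejected entries can then be discarded or set aside for later inspection, as described at the start of this section.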

Chapter 3

Density Model with Finite Mixture for Unsupervised and Supervised Learning

A density model with finite mixture usually utilises latent variables to represent the presence of sub-populations (i.e., the components) within an overall population (i.e., the mixture model). Typically, finite mixture models provide a convenient and formal setting for model-based unsupervised learning, e.g., the Gaussian Mixture Model [30] and the Mixtures of Factor Analysers [50]. These methods can also be used in model-based supervised learning. For example, when the sub-populations cannot be approximated by a simple or known distribution, a finite mixture model can offer a better fit for each sub-population.

In this chapter, two density models with finite mixture are introduced, for unsupervised and supervised learning respectively. More specifically, we first discuss how to establish a joint learning method that performs dimensionality reduction and the subsequent learning task simultaneously. We then verify the effectiveness of this model for unsupervised learning, i.e., clustering. Next, we discuss how to reduce the number of free parameters, particularly for high-dimensional, complex data. To this end, we propose a latent variable model that uses a hierarchical structure while assuming a common dimensionality reduction matrix for all components. This model is verified in the setting of supervised learning on various data.

The rest of this chapter is organised as follows. In Section 3.1, a joint learning model is introduced by embedding a common loading matrix in a finite mixture model, highlighting that the learned low-dimensional representations can be calibrated for subsequent learning tasks. Experiments with the joint learning models on several real-world datasets are reported in Section 3.2. In Section 3.3, we develop a mixture discrimination model for high-dimensional data with small sample sizes. Finally, Section 3.4 concludes the chapter and discusses limitations and future work.

3.1 Unsupervised Dimensionality Reduction for Gaussian Mixture Model

Dimensionality Reduction (DR) has long been an important and active research area in information theory, pattern recognition, and machine learning. Representative methods include Principal Component Analysis (PCA), Independent Component Analysis (ICA), Fisher Discriminant Analysis (FDA), Latent Dirichlet Allocation (LDA), Maxi-Min Discriminant Analysis (MMDA) [94], and the 1-norm based feature selection approach. DR is especially useful for high-dimensional data, since such data usually contain much redundant information. DR can map these high-dimensional data into a low-dimensional space where meaningful or semantic features may be available. Such latent features, which better reflect the relationships within the data, can be input to any learning model, e.g., the Gaussian mixture model (GMM) [95], and may lead to improved performance. There has been a great deal of work in this field [94, 96, 97]. In the context of classification or regression [98], DR can be conducted in a supervised manner by utilising supervised information (e.g., class labels) so as to find a subspace in which different classes of data are separated as far as possible. These methods include the above-mentioned FDA and MMDA. On the other hand, when class information is not available, DR is performed in an unsupervised way. This family of approaches includes the well-known PCA and ICA [44].

In practice, DR is usually performed independently before the low-dimensional features are fed to a learning model. For example, when a GMM is applied to high-dimensional data, PCA could be conducted beforehand, and the reduced features are then input to the GMM so as to obtain the best parameters. The purpose is both to reduce the computational time for high-dimensional data and to find a suitable subspace in which better clustering or classification performance can be achieved thanks to the removal of possibly noisy features. In this setting, the optimal subspace and the subsequent optimal parameters of the GMM are searched for independently. Consequently, the optimal subspace obtained by the independent DR may not be appropriate for the subsequent GMM. This is particularly the case in unsupervised learning, e.g., clustering. In supervised learning, class labels can be used to derive a good subspace, whilst in unsupervised learning, the principle used for DR (e.g., maximisation of the variance in PCA) may not be appropriate for the GMM. Figure 3.1(a) in Section 3.1.3 illustrates the best 2-dimensional subspace obtained by PCA on a synthetic dataset. The original clustering structure of the data is clearly less apparent after PCA. A detailed discussion is given in the experimental section.
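The mismatch between variance maximisation and cluster structure can be reproduced on a toy example. The sketch below is not the synthetic dataset of Figure 3.1; it is a hypothetical two-dimensional case, constructed here, in which the clusters are separated along a low-variance direction, so that the leading principal direction carries almost no cluster information. All names and parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
labels = np.repeat([0, 1], n)
# First axis: large variance shared by both clusters (uninformative).
# Second axis: small variance, but the cluster means differ by 3.
x = np.column_stack([
    rng.normal(0.0, 10.0, size=2 * n),
    rng.normal(0.0, 0.3, size=2 * n) + 3.0 * labels,
])

def pca_directions(data):
    """Return principal directions as rows, ordered by decreasing variance."""
    centred = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return vt

def separation(projection, labels):
    """Gap between the two cluster means in units of the pooled within-cluster std."""
    a, b = projection[labels == 0], projection[labels == 1]
    pooled = np.sqrt(0.5 * (a.var() + b.var()))
    return abs(a.mean() - b.mean()) / pooled

vt = pca_directions(x)
centred = x - x.mean(axis=0)
sep_first = separation(centred @ vt[0], labels)   # direction PCA would keep
sep_second = separation(centred @ vt[1], labels)  # direction PCA would discard
```

Here a one-dimensional PCA projection keeps the high-variance axis, along which the two clusters overlap almost completely, and discards the axis along which they are well separated. A GMM fitted after such a projection has little chance of recovering the clusters.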

ducted independently and separately), we propose to learn both the optimal subspace and the parameters of the GMM jointly. Specifically, we employ the Mixtures of Factor Analysers (MFA) [99] in which a common factor loading is assumed to exist for all latent factors. Importantly, when this special MFA, called MCFA, is optimised via the modified EM algorithm, the common factor loading can be regarded as the dimensionality reduction matrix, while the mixture of latent factors can be regarded as a GMM. When the GMM is used for unsupervised clustering, learning it jointly with the DR subspace keeps the clustering structure clearly preserved and can even make it more pronounced. To illustrate the advantage, Figure 3.1(b) of Section 3.1.3 shows the subspace obtained by the joint learning method. It clearly leads to much better clustering performance, especially compared with PCA. We discuss this comparison further in the experimental section. Despite its good properties, the EM algorithm is widely known to be a local optimiser, with no guarantee of reaching the global optimum. Hence, the algorithm used here may also converge to a locally optimal solution. Nonetheless, the experimental results show that EM can still generate satisfactory results.
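To make the role of the common loading concrete, the sketch below samples data from a generative model of the kind described above: latent factors drawn from a low-dimensional GMM are mapped to the observation space by a single loading matrix shared by all components, plus diagonal noise. This is an assumed, commonly used parameterisation and not necessarily the exact one defined in Section 3.1.2; the function name, argument names, and toy values are all hypothetical.

```python
import numpy as np

def sample_mcfa(n, loading, weights, means, covs, noise_var, rng):
    """Draw n samples y = A z + e, with z from a K-component GMM in q dimensions.

    loading:   (p, q) common loading matrix A shared by all components
    weights:   (K,) mixing proportions
    means:     (K, q) latent component means
    covs:      (K, q, q) latent component covariances
    noise_var: (p,) diagonal of the noise covariance
    """
    p, q = loading.shape
    comp = rng.choice(len(weights), size=n, p=weights)   # component of each sample
    z = np.empty((n, q))
    for k in range(len(weights)):
        idx = comp == k
        z[idx] = rng.multivariate_normal(means[k], covs[k], size=int(idx.sum()))
    noise = rng.normal(size=(n, p)) * np.sqrt(noise_var)
    y = z @ loading.T + noise
    return y, z, comp

# Toy usage: 10-dimensional observations with a 2-dimensional latent GMM
rng = np.random.default_rng(0)
p, q = 10, 2
A, _ = np.linalg.qr(rng.normal(size=(p, q)))             # orthonormal columns (an assumption)
y, z, comp = sample_mcfa(
    500, A,
    weights=np.array([0.5, 0.5]),
    means=np.array([[-3.0, 0.0], [3.0, 0.0]]),
    covs=np.stack([0.2 * np.eye(q)] * 2),
    noise_var=np.full(p, 0.01),
    rng=rng,
)
z_projected = y @ A   # equals z plus projected noise when A has orthonormal columns
```

With orthonormal columns, projecting the observations through the loading recovers the latent factors up to noise, so the loading acts as the dimensionality reduction matrix and the projected points follow the latent GMM. This is the sense in which the subspace and the mixture are tied together in one model rather than fitted in two separate stages.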

It should be noted that although MFA has been discussed earlier in the literature, e.g., [100], it was presented from the viewpoint of data analysis rather than dimensionality reduction. More importantly, the idea of using common loadings, or joint learning, could also be applied to other mixture models [54]. This represents one important contribution of this section. The rest of this section is organised as follows. First, we present the preliminaries used in this section and briefly review the finite mixture model. In Section 3.1.2, we then introduce a novel MFA model with a common factor loading, describing the model definition and the optimisation method in turn. In Section 3.1.3, we compare the proposed joint learning model against two other competitive methods on five datasets. A short version of this work can be found in [16, 18].