• No results found

A Generative Model for Semi-Supervised Learning

N/A
N/A
Protected

Academic year: 2021

Share "A Generative Model for Semi-Supervised Learning"

Copied!
30
0
0

Loading.... (view fulltext now)

Full text

(1)

Fall 2019

A Generative Model for Semi-Supervised Learning

A Generative Model for Semi-Supervised Learning

Shang Da

[email protected]

Follow this and additional works at: https://lib.dr.iastate.edu/creativecomponents

Part of the Artificial Intelligence and Robotics Commons

Recommended Citation Recommended Citation

Da, Shang, "A Generative Model for Semi-Supervised Learning" (2019). Creative Components. 382.

https://lib.dr.iastate.edu/creativecomponents/382

This Creative Component is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Creative Components by an authorized administrator of Iowa State University Digital Repository. For more information, please contact [email protected].

(2)

Shang Da

A Creative Component submitted to the graduate faculty in partial fulfillment of the requirements for the degree of

Master of Science in Computer Science

Program of Study Committee:

Dr. Jin Tian, Major Professor Dr. Jia Liu, Committee Member

Dr. Wensheng Zhang, Committee Member

The student author, whose presentation of the scholarship herein was approved by the program of study committee, is solely responsible for the content of this dissertation. The Graduate College will ensure this dissertation is globally accessible and will not permit alterations after a degree is conferred.

Iowa State University Ames, Iowa

2019

(3)

TABLE OF CONTENTS

Page

LIST OF FIGURES ... iii

LIST OF TABLES ... iv ACKNOWLEDGMENTS ...v ABSTRACT ... vi CHAPTER 1. INTRODUCTION ...1 CHAPTER 2. BACKGROUND ...3 2.1 Variational Inference ... 3

2.2 The evidence lower bound (ELBO) ... 3

2.3 Variational auto-encoders ... 4

CHAPTER 3. RELATED WORK ...6

3.1 Berkhahn et al. ... 6

3.2 Kingmaโ€™s M2 model ... 7

CHAPTER 4. PROPOSED MODEL ...9

4.1 Probabilistic model ... 9

4.2 Loss function ... 11

4.3 Model architecture ... 13

CHAPTER 5. EXPERIMENTS ...15

5.1 Experiment Setup ... 15

5.2 Semi-supervised learning performance ... 16

5.2.1 Benchmarking ... 16

5.2.2 The effect of unlabeled data ... 18

5.3 Conditional data generation ... 19

5.3.1 Style and content separation ... 19

5.3.2 Conditional data generation performance ... 20

CHAPTER 6. CONCLUSION...22

(4)

LIST OF FIGURES

Page

Figure 1. Architecture of Berkhahnโ€™s model ... 6

Figure 2. Architecture of Kingma M2 model ... 8

Figure 3. Probabilistic model for data generation ... 9

Figure 4. Probabilistic model for data generation in Berkhahnโ€™s and Kingmaโ€™s work ... 10

Figure 5. Probabilistic model for inference ... 11

(5)

LIST OF TABLES

Page Table 1. ๐‘†๐‘†๐ฟ๐บ๐‘€๐‘“๐‘ model structure ... 15

Table 2. ๐‘†๐‘†๐ฟ๐บ๐‘€๐‘๐‘›๐‘› model structure ... 16

(6)

ACKNOWLEDGMENTS

I would like to thank my committee chair, Dr. Jin Tian, who has provided me with valuable guidance in every stage of the writing of this thesis. His keen and vigorous academic observation enlightens me not only in this thesis but also in my future study. I would also like to express my gratitude towards my committee members, Dr. Jia Liu, and Dr. Wensheng Zhang, for their valuable support.

In addition, I would also like to thank my friends, colleagues, the department faculty and staff for making my time at Iowa State University a wonderful experience. I want to also offer my appreciation to my parents for supporting me in my study.

(7)

ABSTRACT

Semi-Supervised learning is of great interest in a wide variety of research areas, including natural language processing, speech synthesizing, image classification, genomics etc. Semi-Supervised Generative Model is one Semi-Semi-Supervised learning approach that learns labeled data and unlabeled data simultaneously. A drawback of current Semi-Supervised Generative Models is that latent encoding learnt by generative models is concatenated directly with predicted label, which may result in degradation in representation learning. In this paper we present a new Semi-Supervised Generative Models that removes the direct dependency of data generation on label, hence overcomes this drawback. We show experiments that verifies this approach, together with comparison with existing works.

(8)

CHAPTER 1. INTRODUCTION

Nowadays massive raw data is generated everyday thanks to the development of data gathering and storage techniques. However manual labeling of the large dataset is very time- and labor-consuming. In practice the number of unlabeled data is often far greater than that of labeled data. Hence, Semi-Supervised learning, which considers the problems of utilizing unlabeled data to assist supervised learning tasks, is of great interest in a wide variety of research areas,

including natural language processing [6], speech synthesizing [7], image classification [8], genomics [9] etc. Existing semi-supervised learning models can be categorized into three main categories: unsupervised feature learning approach, graph-based regularization approach and multi-manifold learning approach.

Unsupervised feature learning approach achieves semi-supervised learning in two separate stages: feature representation learning stage and classification stage. In the first stage a set of latent representations are learnt from both labeled and unlabeled data with unsupervised generative models. In the second stage unlabeled data is classified based on learnt latent representations. For example, Kingmaโ€™s M1 model [1] first learn latent representations with auto-encoders then use an SVM to classify the results, and Johnson Rie et al. [10] use Local Region Convolution block to learn Two-View Embedding feature and then use a Convolution neural network for classification.

Study of Berkhahn et al [2] shows that classification results can serve as a regularizer to generative model, while generative model can provide extra information to the classifier. They achieve better performance with regard to both classification accuracy and feature representation learning by allowing mutual influence between the two models and training them

(9)

called Semi-Supervised Generative Model. It is first proposed by Kingma [1]. One drawback of most existing Semi-Supervised Generative Models is that they assume data generation is directly influenced by label. As a consequence, such models concatenate classification result with the latent encodings directly. This may result in degradation of representation learning [3].

Therefore, in this paper we present a new flavor of Semi-Supervised Generative Model which overcomes this problem. The proposed model is also able to handle both labeled and unlabeled data. Similar to Kingma and Berkhahnโ€™s work, it utilizes the power of variational auto-encoder (VAE) for representations learning. But unlike existing works, the direct dependency of data generation on label is removed thanks to a new probabilistic modeling of data generation process.

In the future sections, we start with background knowledge, including Variational Inference and Variational auto-encoders in chapter 2. In chapter 3 we introduce some related approach prior to this work. In chapter 4 we propose the new model and in chapter 5 we present some experiments validating the new model.

(10)

CHAPTER 2. BACKGROUND

This section covers the necessary background knowledge to the proposed work, including a brief introduction to Variational Inference, the evidence lower bound and Variational auto-encoders.

2.1 Variational Inference

The goal of unsupervised learning is to learn a set of latent variables ๐’› = ๐‘ง1:๐‘›to represent observed data ๐’™ = ๐‘ฅ1:๐‘›. This requires learning the posterior distribution ๐‘(๐’›|๐’™). Directly

computing ๐‘(๐’›|๐’™) is often intractable since it often requires computing integration โˆซ ๐‘(๐’›, ๐’™) ๐‘‘๐’›

where ๐’› is often a high dimensional variable.

Variational Inference [4] is one approach to estimate the intractable posterior distribution

๐‘(๐’›|๐’™). The core idea of variational inference is:

1. Introduce a tractable hypothesis probability distribution ๐‘ž(๐’›; ๐œ†) parameterized by ๐œ†. 2. Find a ๐œ† that makes ๐‘ž(๐’›; ๐œ†) approximate ๐‘(๐’›|๐’™), namely

๐œ†โˆ—= ๐‘Ž๐‘Ÿ๐‘”๐‘š๐‘–๐‘›๐œ† ๐ท(๐‘(๐’›|๐’™), ๐‘ž(๐’›; ๐œ†)), (1)

where ๐ท(โ‹…) is a measurement of distance between two probability distributions. In practice one of the most wildly used measurement in Variational Inference is KL divergence.

๐œ†โˆ— = ๐‘Ž๐‘Ÿ๐‘”๐‘š๐‘–๐‘›๐œ† ๐พ๐ฟ(๐‘ž(๐’›; ๐œ†)||๐‘(๐’›|๐’™)) (2)

2.2 The evidence lower bound (ELBO)

Directly minimizing ๐พ๐ฟ(๐‘ž(๐’›)||๐‘(๐’›|๐’™)) in (2) is intractable because of the dependency on the evidence log ๐‘(x):

(11)

๐พ๐ฟ(๐‘ž(๐’›)||๐‘(๐’›|๐’™)) = ๐ธ๐’›~๐‘ž(๐’›)log ๐‘ž(๐’›) โˆ’ ๐ธ๐’›~๐‘ž(๐’›)log ๐‘(๐’›|๐’™)

= ๐ธ๐’›~๐‘ž(๐’›)log ๐‘ž(๐’›) โˆ’ ๐ธ๐’›~๐‘ž(๐’›)log ๐‘(๐’›, ๐’™) + log ๐‘(๐‘ฅ) (3)

Therefore, the evidence lower bound (ELBO) is introduced. It consists of the negative KL term plus evidence log ๐‘(x), which is a constant with respect to ๐‘ž(๐’›). Maximizing the ELBO is equivalent to minimizing the KL term, which is the goal of variation inference:

log๐‘(๐‘ฅ) โˆ’ ๐พ๐ฟ(๐‘ž(๐’›)||๐‘(๐’›|๐’™)) = ๐ธ๐’›~๐‘ž(๐’›)log ๐‘(๐’›, ๐’™) โˆ’ ๐ธ๐’›~๐‘ž(๐’›)log ๐‘ž(๐’›) โˆถ= ๐ธ๐ฟ๐ต๐‘‚ (4)

2.3 Variational auto-encoders

Variational auto-encoder (VAE) is proposed by Kingma & Welling [5]. They first start with the ELBO in (4).

๐ธ๐ฟ๐ต๐‘‚ = ๐ธ๐’›~๐‘ž(๐’›)log ๐‘(๐’›, ๐’™) โˆ’ ๐ธ๐’›~๐‘ž(๐’›)log ๐‘ž(๐’›)

= ๐ธ๐’›~๐‘ž(๐’›)log ๐‘(๐’™|๐’›) + ๐ธ๐’›~๐‘ž(๐’›)log ๐‘(๐’›) โˆ’ ๐ธ๐’›~๐‘ž(๐’›)log ๐‘ž(๐’›)

= ๐ธ๐’›~๐‘ž(๐’›)log ๐‘(๐’™|๐’›) โˆ’ ๐ธ๐’›~๐‘ž(๐’›)log๐‘ž(๐’›)

๐‘(๐’›)

= ๐ธ๐’›~๐‘ž(๐’›)log ๐‘(๐’™|๐’›) โˆ’ ๐พ๐ฟ(๐‘ž(๐’›)||๐‘(๐’›)) (5)

To optimize the ELBO in deep learning framework. The author used an autoencoder-decoder structure. A generative model (autoencoder-decoder) ๐‘๐œƒ(๐’™|๐’›) is chosen to learnt ๐‘(๐’™|๐’›) while simultaneously an inference model (encoder) ๐‘ž๐œ™(๐’›|๐’™) is chosen for the arbitrary probability distribution ๐‘ž(๐’›) in variational inference setting. The objective becomes:

๐ธ๐ฟ๐ต๐‘‚ = ๐ธ๐’›~๐‘ž๐œ™(๐’›|๐’™)log ๐‘๐œƒ(๐’™|๐’›) โˆ’ ๐พ๐ฟ(๐‘ž๐œ™(๐’›|๐’™)||๐‘๐œƒ(๐’›)) (6)

Then maximizing the ELBO becomes minimizing reconstruction error and a KL term.

๐‘ž๐œ™(๐’›|๐’™) and ๐‘๐œƒ(๐’›) are often chosen to be multivariate gaussian distribution with diagonal variance matrix so that the KL term is easily computable.

(12)

In this setting ๐’› becomes a probability distribution. Rather than output latent encoding ๐’›

directly, the encoder estimates parameters (๐, ๐ˆ) of a gaussian distribution. Latent encoding is first sampled from the distribution and then fed to decoder. However, directly sampling from gaussian is not differentiable. In order to train the model with stochastic gradient descent. The author introduced a method called reparameterization trick: Instead of directly sample

๐’› ~ ๐‘(๐, ๐ˆ), we compute ๐‘ง = ๐ + ๐ˆ2โ‹… ๐, where ๐ ~ ๐‘(๐ŸŽ, ๐‘ฐ). This is equivalent to sampling from ๐‘(๐, ๐ˆ), but itโ€™s differentiable hence trainable.

(13)

CHAPTER 3. RELATED WORKS

In this chapter we introduce two Semi-Supervised Generative Model given by Berkhahn et al. [2] and Kingma [1]. These two pieces of work serve as inspiration for this paper. A further comparison to the proposed model is given in chapter 5.

3.1 Berkhahn et al.

Berkhahn et al [2] present a Semi-Supervised Generative Model makes minimal changes to vanilla variational-autoencoder structure: the only addition is a classification layer ๐œ‹ that is attached to the topmost encoder layer. The input of decoder becomes concatenation of ๐œ‹(๐‘ฅ) and

๐’›:

๐’™๐‘Ÿ๐‘’๐‘๐‘œ๐‘›~๐‘๐œƒ(๐œ‹, ๐’›) = ๐‘๐œƒ(๐œ‹ โŠ• ๐’›)

Figure 1. Architecture of Berkhahnโ€™s model

The loss function of this model is traditional ELBO term plus classification error:

๐ฟ = ๐ฟ๐ธ๐ฟ๐ต๐‘‚+ ๐ฟ๐‘๐‘™ (7)

๐ฟ๐‘๐‘™ = โˆ’ ๐›ผ(๐‘ฆ)

#๐‘™๐‘Ž๐‘๐‘’๐‘™๐‘’๐‘‘โˆ‘ ๐‘ฆ๐‘– โ‹… log(๐œ‹๐‘–)

๐‘–

(14)

3.2 Kingmaโ€™s M2 model

Kingma M2 model is proposed in [1]. It is the first Semi-Supervised Generative Model. It assumes that data is generated by a latent class variable y in addition to a continuous latent variable ๐’›. Posterior is modeled by a decoder network taking ๐’š and ๐’› as input:

๐‘๐œƒ(๐’™|๐’š, ๐’›) = ๐‘“(๐’™; ๐’š, ๐’›, ๐œฝ) (9)

The approximate posterior ๐‘ž๐œ™(๐’š, ๐’›|๐’™) has a factorized form:

๐‘ž๐œ™(๐’š, ๐’›|๐’™) = ๐‘ž๐œ™(๐’›|๐’™)๐‘ž๐œ™(๐’š|๐’™) ๐‘ž๐œ™(๐’š|๐’™) = Cat (๐’š|๐œ‹๐œ™(๐’™))

๐‘ž๐œ™(๐’›|๐’™) = ๐‘ (๐’›|๐๐œ™(๐‘ฅ), ๐‘‘๐‘–๐‘Ž๐‘” (๐ˆ๐œ™2(๐‘ฅ))) (10)

where ๐‘ž๐œ™(๐’›|๐’™) is modeled by an encoder network and ๐๐œ™, ๐ˆ๐œ™2, ๐œ‹๐œ™ are modeled by neural networks.

The model uses two different loss functions to handle both labeled and unlabeled data in semi-supervised learning setting.

For labeled data:

log pฮธ(x, y) โ‰ฅ Eqฯ†(z|x, y) [log pฮธ(x|y, z) + log pฮธ(y) + log p(z) โˆ’ log qฯ†(z|x, y)] (11)

= โˆ’L(x, y)

For unlabeled data:

log pฮธ(x) โ‰ฅ Eqฯ•(๐’š, ๐’›|๐’™) [log pฮธ(๐ฑ|๐ฒ, ๐ณ) + log pฮธ(๐ฒ) + log p(๐ณ) โˆ’ log qฯ†(๐ฒ, ๐ณ|๐ฑ)] (12)

(15)

Figure 2. Architecture of Kingma M2 model

Figure 2. shows the architecture of Kingma M2 model. Addition to vanilla VAE model, a classifier is applied to handle labeled data. Latent encoding ๐’› and label๐’š or ๐’šโ€ฒare concatenated before fed to decoder.

(16)

CHAPTER 4. PROPOSED MODEL

This chapter describes the proposed model in detail. We first introduce a new

probabilistic model for data generation and inference. Then we show the corresponding loss functions for labeled and unlabeled data. Finally, we propose the model architecture.

4.1 Probabilistic model Data generation model

We assume the following data generation process: first a label ๐’šis chosen, aset of latent encoding ๐’› is generated conditioned on ๐’š. Then data ๐’™ is generated given purely latent encoding

๐’›. Figure 2. shows the probabilistic model of the above generation process.

Figure 3. Probabilistic model for data generation Hence, we have:

๐‘(๐’™, ๐’š, ๐’›) = ๐‘(๐’™|๐’›)๐‘(๐’›|๐’š)๐‘(๐’š)

(17)

Similar to VAE. We choose a deep generative network ๐‘๐œƒ(๐’™|๐’›; ๐œฝ) to learn the true probability distribution ๐‘(๐’™|๐’›).

Compared with Berkhahnโ€™s model and Kingmaโ€™s M2 model, the proposed generative model removes the direct dependency of data ๐’™ on label ๐’š. Therefore, latent encodings ๐’›have all the information about reconstructing ๐’™. This is a more desirable property for representation learning [3].

Figure 4. Probabilistic model for data generation in Berkhahnโ€™s and Kingmaโ€™s work

Inference model

In variation inference setting we use an inference model to approximate posterior distribution ๐‘(๐’›|๐’™, ๐’š) and ๐‘(๐’›, ๐’š|๐’™). Figure 4. describes the inference model. We assume ๐’šcan be inferenced by latent encoding๐’›while๐’›can be inferenced directly from ๐’™. Namely, we have:

๐‘ž๐œ™(๐’™, ๐’š, ๐’›) = ๐‘ž๐œ™(๐’›|๐’™)๐‘ž๐œ™(๐’š|๐’›)๐‘ž๐œ™(๐’™)

(18)

Figure 5. Probabilistic model for inference

This is a reasonable assumption because all information about data ๐’™, including

information about label, should be encoded by ๐’›. Therefore, we can infer the label directly from

๐’›. Thissetting also allows us to utilize the nice latent encoding weโ€™ve learnt from generative model for the downstream classification task. We can benefit from the potential disentanglement provided by VAE model.

4.2 Objective function

Similar to Kingmaโ€™s work. We also use two different loss function to handle labeled and unlabeled data.

Labeled Data

For labeled data. ๐’šis treated as observed variable. We minimize the following KL term:

๐พ๐ฟ(๐‘ž๐œ™(๐’›|๐’™, ๐’š)||๐‘๐œƒ(๐’›|๐’™, ๐’š)) = ๐ธ๐’›~๐‘ž๐œ™(๐’›|๐’™, ๐’š)log๐‘ž๐œ™(๐’›|๐’™, ๐’š) ๐‘๐œƒ(๐’›|๐’™, ๐’š) = ๐ธ๐’›~๐‘ž๐œ™(๐’›|๐’™)log

๐‘ž๐œ™(๐’›|๐’™)๐‘๐œƒ(๐’™, ๐’š) ๐‘๐œƒ(๐’™, ๐’š, ๐’›)

(19)

= ๐ธ๐’›~๐‘ž๐œ™(๐’›|๐’™)log ๐‘ž๐œ™(๐’›|๐’™)๐‘๐œƒ(๐’™, ๐’š) ๐‘๐œƒ(๐’™|๐’›)๐‘๐œƒ(๐’›|๐’š)๐‘๐œƒ(๐’š) = log ๐‘๐œƒ(๐’™, ๐’š) โˆ’ ๐ธ๐’›~๐‘ž๐œ™(๐’›|๐’™)log ๐‘๐œƒ(๐’™|๐’›) โˆ’ log ๐‘๐œƒ(๐‘ฆ) +๐พ๐ฟ(๐‘ž๐œ™(๐’›|๐’™)||๐‘๐œƒ(๐’›|๐’š)) (15)

Hence, we have the evidence lower bound for labeled data:

ELBOlabeled = log ๐‘๐œƒ(๐‘ฅ, ๐‘ฆ) โˆ’ ๐พ๐ฟ(๐‘ž๐œ™(๐’›|๐’™, ๐’š)||๐‘๐œƒ(๐’›|๐’™, ๐’š))

= ๐ธ๐’›~๐‘ž๐œ™(๐’›|๐’™)log ๐‘๐œƒ(๐’™|๐’›) โˆ’ ๐พ๐ฟ(๐‘ž๐œ™(๐’›|๐’™)||๐‘๐œƒ(๐’›|๐’š)) (16)

A classification error ๐ฟ๐‘๐‘™ = ๐ถ๐‘Ÿ๐‘œ๐‘ ๐‘ ๐ธ๐‘›๐‘ก๐‘Ÿ๐‘œ๐‘๐‘ฆ(๐’›, ๐’š) is also added in order to learn conditional probability ๐‘(๐’š|๐’›).

๐ฟ(๐’™, ๐’š) = ๐ฟ๐‘๐‘™โˆ’ ๐ธ๐ฟ๐ต๐‘‚๐‘™๐‘Ž๐‘๐‘’๐‘™๐‘’๐‘‘ (17)

Unlabeled Data

For unlabeled data. ๐’šis treatedas unobserved latent variable. We minimize:

๐พ๐ฟ(๐‘ž๐œ™(๐’›, ๐’š|๐’™)||๐‘๐œƒ(๐’›, ๐’š|๐’™)) = ๐ธ๐’›~๐‘ž๐œ™(๐’›|๐’™) ๐’š~๐‘ž๐œ™(๐’š|๐’™) log๐‘ž๐œ™(๐’›, ๐’š|๐’™) ๐‘๐œƒ(๐’›, ๐’š|๐’™) = ๐ธ๐’›~๐‘ž๐œ™(๐’›|๐’™) ๐’š~๐‘ž๐œ™(๐’š|๐’™) log๐‘ž๐œ™(๐’›|๐’™)๐‘ž๐œ™(๐’š|๐’›)๐‘๐œƒ(๐’™) ๐‘๐œƒ(๐’™, ๐’š, ๐’›) = ๐ธ๐’›~๐‘ž๐œ™(๐’›|๐’™) ๐’š~๐‘ž๐œ™(๐’š|๐’™) log๐‘ž๐œ™(๐’›|๐’™)๐‘ž๐œ™(๐’š|๐’›)๐‘๐œƒ(๐’™) ๐‘๐œƒ(๐’™|๐’›)๐‘๐œƒ(๐’›|๐’š)๐‘๐œƒ(๐’š) = log ๐‘๐œƒ(๐’™) โˆ’ ๐ธ๐’›~๐‘ž๐œ™(๐’›|๐’™)log ๐‘๐œƒ(๐’™|๐’›) โˆ’ ๐ธ๐’š~๐‘ž๐œ™(๐’š|๐’™)log ๐‘๐œƒ(๐‘ฆ) +๐ธ๐’š,๐’›~๐‘ž ๐œ™(๐’š, ๐’›|๐’™)log ๐‘ž๐œ™(๐’š|๐’›) +๐ธ๐’š~๐‘ž๐œ™(๐’š|๐’›)๐พ๐ฟ(๐‘ž๐œ™(๐’›|๐’™)||๐‘๐œƒ(๐’›|๐’š)) (18)

Hence, we have the evidence lower bound for unlabeled data:

(20)

โˆ’๐ธ๐’š,๐’›~๐‘ž๐œ™(๐’š, ๐’›|๐’™)log ๐‘ž๐œ™(๐’š|๐’›)

โˆ’๐ธ๐’š~๐‘ž๐œ™(๐’š|๐’›)๐พ๐ฟ(๐‘ž๐œ™(๐’›|๐’™)||๐‘๐œƒ(๐’›|๐’š)) = โˆ’๐‘ˆ(๐‘ฅ) (19)

Total Loss function

Finally, the total loss function for the entire dataset is now:

๐ฟ๐‘œ๐‘ ๐‘  = โˆ‘ ๐ฟ(๐‘ฅ, ๐‘ฆ) (๐‘ฅ,๐‘ฆ) โˆˆ ๐‘™๐‘Ž๐‘๐‘’๐‘™๐‘’๐‘‘ + โˆ‘ ๐‘ˆ(๐‘ฅ) ๐‘ฅ โˆˆ ๐‘ข๐‘›๐‘™๐‘Ž๐‘๐‘’๐‘™๐‘’๐‘‘ (20) 4.3 Model architecture

As is shown in Figure 5. Our model is a modified version of vanilla

variational-autoencoder. In order to optimize (16) and (19), some addition structures are added, including a classify-layer that outputs conditional probability ๐‘ž๐œ™(๐’š|๐’›) and predicts label ๐’šโ€ฒ, and a

discriminator taking ๐’š or ๐’šโ€ฒ as input and outputs ๐๐’›|๐’š and ๐ˆ๐’›|๐’š.

When a labeled input ๐’™๐’is fed to the model. The classify-layer tries to learn ๐‘(๐’š|๐’›) by minimizing cross entropy loss. The true label ๐’šis fed to discriminator therefore

๐พ๐ฟ(๐‘ž๐œ™(๐’›|๐’™)||๐‘๐œƒ(๐’›|๐’š)) is then computed.

When an unlabeled input ๐’™๐’–is fed to the model. The classify-layer outputs probability

๐‘ž๐œ™(๐’š|๐’›). Then predicted label ๐’šโ€ฒis fed to the discriminator so that

๐ธ๐’š~๐‘ž๐œ™(๐’š|๐’›)๐พ๐ฟ(๐‘ž๐œ™(๐’›|๐’™)||๐‘๐œƒ(๐’›|๐’š)) is computed.

For both labeled and unlabeled data. Reconstruction error is computed, which is equivalent to term ๐ธ๐’›~๐‘ž๐œ™(๐’›|๐’™)log ๐‘๐œƒ(๐’™|๐’›).

(21)
(22)

CHAPTER 5. EXPERIMENTS

In this chapter, we show an empirical evaluation of the proposed model, referred to as SSLGM (Semi-Supervised Learning with Generative Model) in this report. We show that the proposed model effective in Semi-Supervised learning, as well as competitive comparing with other related works. Open source code, together with most important figures and results, is available at: https://github.com/Fuu3214/SSLGM.git.

5.1 Experiment Setup Dataset

We use the well-known MNIST hand written digit dataset for experiment. The data set for semi-supervised learning is created by randomly splitting 60000 training data into labeled and unlabeled set. The size of labeled data is chosen to be 100, 600 and 1000 separately.

Model Setup

We experiment with two SSLGM models. The first model consists of shallow MLP encoder and decoder, with 10-dimensional latent variable ๐’›. The secondmodel uses more complex Convolutional Neural networks as encoder and decoder, with 50-dimensional latent variable. Table 1 shows the detail for two models.

Table 1. ๐‘†๐‘†๐ฟ๐บ๐‘€๐‘“๐‘ model structure

๐‘บ๐‘บ๐‘ณ๐‘ฎ๐‘ด๐’‡๐’„ Layer Activation function Output shape

Encoder dense relu 512

dense relu 256

dense None (10, 10)

Decoder dense relu 256

dense relu 512

dense sigmoid 784

(23)

Classifier (๐‘ž(๐’š|๐’›)) dense relu 256

dense None 10

Discriminator (๐‘(๐‘ง|๐‘ฆ)) dense relu 512

dense relu 512

dense None (10, 10)

Table 2. ๐‘†๐‘†๐ฟ๐บ๐‘€๐‘๐‘›๐‘› model structure

๐‘บ๐‘บ๐‘ณ๐‘ฎ๐‘ด๐’„๐’๐’ Layer Activation

function

Number of filters

Encoder 3 ร— 3 convolutional relu 32 max_pooling2d 3 ร— 3 convolutional relu 64 max_pooling2d 3 ร— 3 convolutional relu 32 max_pooling2d dense None (30, 30)

Decoder dense relu 256

dense relu 512

dense sigmoid 784

Classifier (๐‘ž(๐’š|๐’›)) dense relu 256

dense relu 256

dense None 30

Discriminator (๐‘(๐‘ง|๐‘ฆ)) dense relu 512

dense relu 512

dense None (30, 30)

5.2 Semi-supervised learning performance 5.2.1 Benchmarking

For benchmarking, we compare our model with supervised models such as fully

connected neural work and convolutional neural network trained on labeled data only, as well as related semi-supervised learning models such as Kingmaโ€™s M1 and M2 model and Berkhahn et

(24)

Result

Table 3 shows the benchmarking result. Given fewer labeled examples, both of our models perform better than other models, including purely supervised models as well as semi-supervised models, demonstrating the effectiveness of latent representation learnt by VAE on the downstream classification tasks. However, when number of labels becomes larger, the power of CNN on image classification tasks becomes more significant. It performs better compared with all semi-supervised learning models.

Comparing two of our models ๐‘บ๐‘บ๐‘ณ๐‘ฎ๐‘ด๐’‡๐’„and๐‘บ๐‘บ๐‘ณ๐‘ฎ๐‘ด๐’„๐’๐’. ๐‘บ๐‘บ๐‘ณ๐‘ฎ๐‘ด๐’„๐’๐’performs slightly better than ๐‘บ๐‘บ๐‘ณ๐‘ฎ๐‘ด๐’‡๐’„ thanks to the power of CNN on extracting image features as well as preventing from overfitting.

Table 3. Semi-supervised MNIST classification results

N 50 100 600 1000 NN 36.32 26.42 12.44 10.64 CNN 24.45 11.25 4.41 2.84 Berkhahn et al โˆ’ 18.9 (ยฑ0.6) โˆ’ 5.5 (ยฑ0.1) Kingmaโ€™s M1 โˆ’ 11.82 (ยฑ 0.25) 5.72 (ยฑ 0.049) 4.24 (ยฑ 0.07) Kingmaโ€™s M2 23.00 (ยฑ3.50) 11.97 (ยฑ 1.71) 4.94 (ยฑ 0.13) 3.60 (ยฑ 0.56) ๐‘บ๐‘บ๐‘ณ๐‘ฎ๐‘ด๐’‡๐’„ 18.49 (ยฑ1.13) ๐Ÿ–. ๐Ÿ‘๐Ÿ“ (ยฑ ๐ŸŽ. ๐Ÿ–๐Ÿ“) 6.79(ยฑ0.35) 6.15 (ยฑ0.25) ๐‘บ๐‘บ๐‘ณ๐‘ฎ๐‘ด๐’„๐’๐’ ๐Ÿ๐Ÿ•. ๐Ÿ–๐Ÿ– (ยฑ๐Ÿ. ๐Ÿ–๐Ÿ—) 9.52 (ยฑ 1.05) 6.81 (ยฑ 0.69) 5.32 (ยฑ0.31)

(25)

5.2.2 The effect of unlabeled data

We show that unlabeled data improves classification performance in our model. We compare our model with control groups CG๐‘“๐‘ and CG๐‘๐‘›๐‘›. The control group models are obtained by first removing the unsupervised components and then train the model only on labeled data. We choose exactly the same hyper-parameters, including number of hidden units, dimension of latent variables, number of layers of MLP and number of filters for CNN. We use classification error rate on testing dataset as our evaluation metric.

Result

The result in Table 4 shows that, in our model setting, unlabeled data is extremely helpful in improving semi-supervised classification performance, especially when we are given less labeled data. We also observe that, without unlabeled data, both of the control group models tend to overfit after a small number of training epochs. However, ๐‘บ๐‘บ๐‘ณ๐‘ฎ๐‘ด๐’‡๐’„and๐‘บ๐‘บ๐‘ณ๐‘ฎ๐‘ด๐’„๐’๐’hardly overfit even after a few hundred epochs. This shows that unlabeled data serves as a regularizer to the model.

Table 4. MNIST classification results compared with control groups

N 50 100 600 1000

CG๐‘“๐‘ 37.41 26.46 12.10 10.25

CG๐‘๐‘›๐‘› 26.26 20.24 7.57 6.22

๐‘บ๐‘บ๐‘ณ๐‘ฎ๐‘ด๐’‡๐’„ 18.49 (ยฑ1.13) ๐Ÿ–. ๐Ÿ‘๐Ÿ“ (ยฑ ๐ŸŽ. ๐Ÿ–๐Ÿ“) ๐Ÿ”. ๐Ÿ•๐Ÿ—(ยฑ๐ŸŽ. ๐Ÿ‘๐Ÿ“) 6.15 (ยฑ0.25) ๐‘บ๐‘บ๐‘ณ๐‘ฎ๐‘ด๐’„๐’๐’ ๐Ÿ๐Ÿ•. ๐Ÿ–๐Ÿ– (ยฑ๐Ÿ. ๐Ÿ–๐Ÿ—) 9.52 (ยฑ 1.05) 6.81 (ยฑ 0.69) ๐Ÿ“. ๐Ÿ‘๐Ÿ (ยฑ๐ŸŽ. ๐Ÿ‘๐Ÿ)

(26)

5.3 Conditional data generation 5.3.1 Style and content separation

To demonstrate that model is capable of learning useful latent representation with a small number of labeled data. We use the decoder of ๐‘บ๐‘บ๐‘ณ๐‘ฎ๐‘ด๐’‡๐’„model trained on 100 labeled data. In the first experiment we examine the style and content separation. First, we first randomly pick a class label and feed it to discriminator to acquire (๐๐’›|๐’š, ๐ˆ๐’›|๐’š). Then we randomly select two dimensions [๐‘ง๐‘Ž, ๐‘ง๐‘] of the latent encoding ๐’›and we range them as follow:

๐‘ง๐‘Ž = ๐‘ƒ๐‘ƒ๐น๐‘(๐œ‡ ๐‘ง|๐‘ฆ ๐‘Ž ,๐œŽ ๐‘ง|๐‘ฆ๐‘Ž )(๐›ผ) ๐‘ง๐‘ = ๐‘ƒ๐‘ƒ๐น๐‘(๐œ‡ ๐‘ง|๐‘ฆ ๐‘ ,๐œŽ ๐‘ง|๐‘ฆ๐‘ )(๐›ฝ) (20)

Where ๐‘ƒ๐‘ƒ๐น is the Percent Point Function and ๐›ผ and ๐›ฝ are evenly spaced numbers over

[0.05, 0.95].

For other dimensions we simply sample each of them from gaussian distribution

๐‘ง๐‘–~๐‘(๐œ‡๐‘ง|๐‘ฆ๐‘– , ๐œŽ๐‘ง|๐‘ฆ๐‘– ) once and then fix them. This guarantees that only two dimensions changes. To

generate MNIST examples we simply feed ๐’› to the decoder.

Result

In Figure 7 we choose dimension (๐‘ง0, ๐‘ง1) and generate 3 sets of digits from label 2,3,7. We see that as ๐‘ง0 becomes larger, the width of digit tends to becomes smaller, while as ๐‘ง1

becomes larger the digit tends to becomes more rounded. In Figure 8 we choose dimension

(๐‘ง0, ๐‘ง5) and generate digits from label 0, 5. We observe that as ๐‘ง0 becomes larger, the width of digit also becomes smaller, while as ๐‘ง5 becomes larger, the angle of digits change.

(27)

Figure 7. Conditional generation by fixing other dimensions while varying ๐‘ง0, ๐‘ง1

Figure 8. Conditional generation by fixing other dimensions while varying ๐‘ง0, ๐‘ง5

5.3.2 Conditional data generation performance

In the second experiment we use the decoder of ๐‘บ๐‘บ๐‘ณ๐‘ฎ๐‘ด๐’‡๐’„model trained on 100 labeled data to evaluate the conditional generation performance. To generate data, we simply sample ๐’›

(28)

Result

Figure 9 demonstrate an example of conditional data generation performance of our

๐‘บ๐‘บ๐‘ณ๐‘ฎ๐‘ด๐’‡๐’„model trained on 100 labeled data. Our model creates more accurate and more diverse data conditional on the label compared with Kingmaโ€™s M2 model shown in Figure 10.

Figure 9. Conditional data generation performance of ๐‘บ๐‘บ๐‘ณ๐‘ฎ๐‘ด๐’‡๐’„. Data is generated conditioned

on label 0, 5, 6 respectively.

Figure 10. Conditional data generation performance of Kingmaโ€™s M2 model. Data is generated conditioned on label 0, 5, 6 respectively.

(29)

CHAPTER 6. CONCLUSION

We propose a new semi-supervised generative model to overcome a drawback of existing approaches. We remove the direct dependency of data generation on label. Compared with related works that directly concatenate latent encoding with label, our new approach improves quality of latent representations. Moreover, unlike existing works, our approach is capable of utilizing the nice latent representations obtained by variational autoencoder in the downstream classification tasks. Experiments also demonstrate the efficiency of our approach with respect to semi-supervised learning performance and conditional data generation given a relatively small number of labeled data.

One drawback of our model is overfitting. In some scenario such as few-shot or one-shot learning, the number of labeled data is extremely small. In these situations, the performance of our model will drop sharply. This is because the supervised component is overfitting while unsupervised component is still underfitting. Given unlabeled data, classifier gives wrong predictions frequently and influence conditional generation, hence impacts the quality of latent representation. We observe the same problem in other existing semi-supervised generative models as well. By adopting methods such as adding dropout layer and batch normalization, our model partially overcome this problem. But further investigation is still needed to improve performance on fewer labeled data. This can be a direction for our future work. Another potential direction for future work is applying our model in other domains. For example, text image

transformation, where image is generated conditioned on text information obtained by some NLP models.

(30)

REFERENCES

[1] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling, โ€œSemi-supervised learning with

deep generative models,โ€ in Advances in neural information processing systems, 2014, pp. 3581โ€“3589.

[2] F. Berkhahn, R. Keys, W. Ouertani, N. Shetty, and D. GeiรŸler, โ€œOne model to rule them all,โ€

arXiv preprint arXiv:1908.03015, 2019.

[3] Y. Bengio, A. Courville, and P. Vincent, โ€œRepresentation learning: A review and new

perspectives,โ€ IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 8, pp.

1798โ€“1828, 2013.

[4] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, โ€œVariational inference: A review for statisticians,โ€ Journal of the American Statistical Association, vol. 112, no. 518, pp. 859โ€“877, 2017.

[5] D. P. Kingma and M. Welling, โ€œAuto-encoding variational bayes,โ€ arXiv preprint

arXiv:1312.6114, 2013.

[6] Y. Cheng, โ€œSemi-supervised learning for neural machine translation,โ€ in Joint Training for Neural Machine Translation. Springer, 2019, pp. 25โ€“40.

[7] Y.-A. Chung, Y. Wang, W.-N. Hsu, Y. Zhang, and R. Skerry-Ryan, โ€œSemi-supervised training for improving data efficiency in end-to-end speech synthesis,โ€ in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6940โ€“6944.

[8] R. Fergus, Y. Weiss, and A. Torralba, โ€œSemi-supervised learning in gigantic image

collections,โ€ in Advances in neural information processing systems, 2009, pp. 522โ€“530.

[9] M. Shi and B. Zhang, โ€œSemi-supervised learning improves gene expression-based prediction

of cancer recurrence,โ€ Bioinformatics, vol. 27, no. 21, pp. 3017โ€“3023, 2011.

[10] R. Johnson and T. Zhang, โ€œSemi-supervised convolutional neural networks for text

categorization via region embedding,โ€ in Advances in neural information processing systems,

Figure

Figure 1. Architecture of Berkhahnโ€™s model
Figure 2. Architecture of Kingma M2 model
Figure 3. Probabilistic model for data generation  Hence, we have:
Figure 4. Probabilistic model for data generation in Berkhahnโ€™s and Kingmaโ€™s work  Inference model
+7

References

Related documents