Multinomial Adversarial Networks (MAN) - Learning Deep Representations for Low-Resource Cross

In this chapter, we tackle the text classification problem in the real-world setting in which texts come from a variety of domains, each with a varying amount of labeled data. Specifically, assume we have a total of $N$ domains: $N_1$ labeled domains (denoted as $\Delta_L$) for which there is some labeled data, and $N_2$ unlabeled domains ($\Delta_U$) for which no annotated training instances are available. Denote $\Delta = \Delta_L \cup \Delta_U$ as the collection of all domains, with $N = N_1 + N_2$. The goal of this work, and of multi-domain text classification (MDTC) in general, is to improve the overall classification performance across all $N$ domains, measured in this paper as the average³ classification accuracy across the $N$ domains in $\Delta$.

³ In this work, we use macro-average over domains, but MAN can be readily adapted for micro-average or other (weighted) averaging schemes.
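To make the distinction in the footnote concrete, the following minimal sketch compares the two averaging schemes. The domain names and the (correct, total) counts are made up for illustration and do not come from the experiments in this chapter.

```python
def macro_average(per_domain):
    """Mean of per-domain accuracies; every domain counts equally."""
    accs = [correct / total for correct, total in per_domain.values()]
    return sum(accs) / len(accs)


def micro_average(per_domain):
    """Pooled accuracy; domains with more test samples weigh more."""
    correct = sum(c for c, _ in per_domain.values())
    total = sum(t for _, t in per_domain.values())
    return correct / total


# Hypothetical (correct, total) counts for three domains.
results = {"books": (90, 100), "dvd": (160, 200), "kitchen": (350, 500)}
macro = macro_average(results)  # (0.9 + 0.8 + 0.7) / 3
micro = micro_average(results)  # 600 / 800
```

With these made-up counts the macro-average is higher than the micro-average, because the largest domain has the lowest accuracy and dominates the pooled score.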

[Figure 4.1 (diagram not reproduced). Input: a mini-batch of documents from domain $d_i \in \Delta$. Components: Shared Feature Extractor $F_s$, Domain Feature Extractor $F_{d_i}$, Text Classifier $C$, Domain Discriminator $D$. Outputs: Class Label with loss $J_C$ (if $d_i \in \Delta_L$), and Domain Label with losses $J_{F_s}^D$ and $J_D$. Legend: green arrows are forward and backward passes when updating the parameters of $F_s$, $F_d$ and $C$; red arrows are forward and backward passes when updating the parameters of $D$.]

Figure 4.1: MAN for MDTC. The figure demonstrates the training on a mini-batch of data from one domain. One training iteration consists of one such mini-batch training from each domain. The parameters of $F_s$, $F_d$, $C$ are updated together, and the training flows are illustrated by the green arrows. The parameters of $D$ are updated separately, shown in red arrows. Solid lines indicate forward passes while dotted lines are backward passes. $J_{F_s}^D$ is the domain loss for $F_s$, which is anticorrelated with $J_D$ (e.g. $J_{F_s}^D = -J_D$). (See Section 4.2 and Section 4.3)

4.2.1 Model Architecture

As shown in Figure 4.1, the Multinomial Adversarial Network (MAN) adopts the Shared-Private paradigm of Bousmalis et al. (2016) and consists of four components: a shared feature extractor $F_s$, a domain feature extractor $F_{d_i}$ for each labeled domain $d_i \in \Delta_L$, a text classifier $C$, and a domain discriminator $D$. The main idea of MAN is to explicitly model the domain-invariant features that are beneficial to the main classification task across all domains (the shared features, extracted by $F_s$), as well as the domain-specific features that mainly contribute to the classification in its own domain (the domain features, extracted by $F_d$). Here, the adversarial domain discriminator $D$ has a multinomial output: it takes a shared feature vector and predicts the likelihood of that sample coming from each domain. As seen in Figure 4.1, during the training of $F_s$ (green arrows denote the training flow), $F_s$ aims to confuse $D$ by minimizing $J_{F_s}^D$, which is anticorrelated to $J_D$ (detailed in Section 4.2.2), so that $D$ cannot predict the domain of a sample given its shared features. The intuition is that if even a strong discriminator $D$ cannot tell the domain of a sample from the extracted features, the features learned by $F_s$ are essentially domain invariant. Because $F_s$ is forced to learn domain-invariant features and all components are trained jointly via backpropagation, each domain feature extractor $F_d$ will learn domain-specific features beneficial within its own domain.

The architecture of each component is relatively flexible and can be chosen by practitioners to suit their particular classification tasks. For instance, the feature extractors can take the form of Convolutional Neural Nets (CNN), Recurrent Neural Nets (RNN), or a Multi-Layer Perceptron (MLP), depending on the input data (see Section 4.4). The input of MAN will also depend on the choice of feature extractor. The output of a (shared/domain) feature extractor is a fixed-length vector, which is considered the (shared/domain) hidden features of some given input text. On the other hand, the outputs of $C$ and $D$ are label probabilities for class and domain prediction, respectively. For example, both $C$ and $D$ can be MLPs with a softmax layer on top. In Section 4.3, we provide alternative architectures for $D$ and their mathematical implications. We now present a detailed description of the MAN training in Section 4.2.2.
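The data flow between the four components can be sketched as follows. This is only an illustration under stated assumptions: each feature extractor is a single tanh layer over a fixed-length input vector, $C$ and $D$ are single softmax layers, and all dimensions, the helper names (`linear`, `softmax`, `init`, `extract`, `forward`) and the random weights are hypothetical. It is not the architecture used in the experiments.

```python
import math
import random

random.seed(0)


def linear(x, weights, bias):
    """Affine map: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def init(n_out, n_in):
    """Random weights and zero bias for a toy layer."""
    weights = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


INPUT_DIM, SHARED_DIM, DOMAIN_DIM = 6, 4, 3  # assumed sizes
NUM_CLASSES, NUM_DOMAINS = 2, 3              # assumed sizes
LABELED_DOMAINS = [0, 1]                     # domain 2 is unlabeled

F_s = init(SHARED_DIM, INPUT_DIM)                                # shared extractor
F_d = {d: init(DOMAIN_DIM, INPUT_DIM) for d in LABELED_DOMAINS}  # one per labeled domain
C = init(NUM_CLASSES, SHARED_DIM + DOMAIN_DIM)                   # text classifier
D = init(NUM_DOMAINS, SHARED_DIM)                                # domain discriminator


def extract(params, x):
    """Toy feature extractor: a single tanh layer."""
    return [math.tanh(v) for v in linear(x, *params)]


def forward(x, domain):
    """Return (class probabilities, domain probabilities) for one input."""
    f_s = extract(F_s, x)
    f_d = extract(F_d[domain], x)
    class_probs = softmax(linear(f_s + f_d, *C))  # C sees the concatenation [f_s; f_d]
    domain_probs = softmax(linear(f_s, *D))       # D sees the shared features only
    return class_probs, domain_probs


x = [random.uniform(-1, 1) for _ in range(INPUT_DIM)]
class_probs, domain_probs = forward(x, domain=0)
```

The point of the sketch is the wiring: $C$ receives both feature vectors, while $D$ has one output per domain and never sees the domain features.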

Require: labeled corpus X; unlabeled corpus U; hyperparameters λ > 0, k ∈ ℕ

repeat
    ▷ D iterations
    for diter = 1 to k do
        l_D = 0
        for all d ∈ ∆ do                          ▷ For all N domains
            Sample a mini-batch x ∼ U_d
            f_s = F_s(x)                          ▷ Shared feature vector
            l_D += J_D(D(f_s); d)                 ▷ Accumulate D loss
        Update D parameters using ∇l_D
    ▷ Main iteration
    loss = 0
    for all d ∈ ∆_L do                            ▷ For all labeled domains
        Sample a mini-batch (x, y) ∼ X_d
        f_s = F_s(x)
        f_d = F_d(x)                              ▷ Domain feature vector
        loss += J_C(C(f_s, f_d); y)               ▷ Compute C loss
    for all d ∈ ∆ do                              ▷ For all N domains
        Sample a mini-batch x ∼ U_d
        f_s = F_s(x)
        loss += λ · J^D_{F_s}(D(f_s); d)          ▷ Domain loss of F_s
    Update F_s, F_d, C parameters using ∇loss
until convergence

Algorithm 4.1: MAN Training
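The control flow of one pass through the outer loop of Algorithm 4.1 can be sketched in Python as below. This is a schedule-only sketch, not a full implementation: the function name `man_iteration` and all five callables are hypothetical hooks that a caller would back with real models, mini-batch sampling and optimizers. The stub values in the usage part are made up purely to trace the order of updates.

```python
def man_iteration(all_domains, labeled_domains, k, lam,
                  d_loss, c_loss, fs_domain_loss, update_d, update_main):
    """One pass of the outer `repeat` loop in Algorithm 4.1.

    All callables are hypothetical hooks supplied by the caller:
      d_loss(d)          -> J_D on an unlabeled mini-batch from domain d
      c_loss(d)          -> J_C on a labeled mini-batch from domain d
      fs_domain_loss(d)  -> J^D_{F_s} on an unlabeled mini-batch from domain d
      update_d(loss)     -> optimizer step on D only
      update_main(loss)  -> optimizer step on F_s, F_d and C
    """
    # D iterations: k discriminator updates, each summing J_D over all domains.
    for _ in range(k):
        l_d = 0.0
        for d in all_domains:
            l_d += d_loss(d)
        update_d(l_d)

    # Main iteration: classification loss on the labeled domains ...
    loss = 0.0
    for d in labeled_domains:
        loss += c_loss(d)
    # ... plus the weighted domain loss of F_s on all domains.
    for d in all_domains:
        loss += lam * fs_domain_loss(d)
    update_main(loss)
    return loss


# Usage with stub losses (made-up constants) that only record the schedule.
log = []
final = man_iteration(
    all_domains=["a", "b", "c"], labeled_domains=["a", "b"], k=2, lam=0.5,
    d_loss=lambda d: 1.0,
    c_loss=lambda d: 2.0,
    fs_domain_loss=lambda d: -1.0,
    update_d=lambda loss: log.append(("D", loss)),
    update_main=lambda loss: log.append(("main", loss)),
)
```

Passing the two update steps as separate hooks mirrors the two optimizers in the algorithm: the discriminator is stepped k times before the feature extractors and classifier are stepped once.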

4.2.2 MAN Training

Denote the annotated corpus in a labeled domain $d_i \in \Delta_L$ as $X_i$, and let $(x, y) \sim X_i$ be a sample drawn from the labeled data in domain $d_i$, where $x$ is the input and $y$ is the task label. On the other hand, for any domain $d_{i'} \in \Delta$, denote the unlabeled corpus as $U_{i'}$. Note that for the unlabeled data of a labeled domain, one can use a separate unlabeled corpus, simply reuse the labeled data, or use both.

In Figure 4.1, the arrows illustrate the training flows of various components. Due to the adversarial nature of the domain discriminator $D$, it is trained with a separate optimizer (red arrows), while the rest of the networks are updated with the main optimizer (green arrows). $C$ is only trained on the annotated data from labeled domains, and it takes as input the concatenation of the shared and domain feature vectors. At test time, for data from unlabeled domains with no $F_d$, the domain features are set to the $\mathbf{0}$ vector for $C$'s input. In contrast, $D$ only takes the shared features as input, for both labeled and unlabeled domains. The MAN training procedure is described in Algorithm 4.1.
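The handling of the classifier input for unlabeled domains can be illustrated with a small helper. The function name `classifier_input`, its signature and the feature values are assumptions made for this sketch; only the behavior (concatenation, with a zero vector standing in for the missing domain features) follows the description above.

```python
def classifier_input(shared_features, domain_features, domain_dim):
    """Concatenate shared and domain features for C.

    `domain_features` is None for an unlabeled domain, which has no F_d;
    in that case a zero vector of length `domain_dim` is used instead.
    """
    if domain_features is None:
        domain_features = [0.0] * domain_dim
    if len(domain_features) != domain_dim:
        raise ValueError("domain feature vector has the wrong length")
    return list(shared_features) + list(domain_features)


# Made-up feature values: 2 shared dimensions, 3 domain dimensions.
labeled = classifier_input([0.2, -0.1], [0.7, 0.3, -0.5], domain_dim=3)
unlabeled = classifier_input([0.2, -0.1], None, domain_dim=3)
```

Keeping the input length fixed means the same classifier $C$ can be applied to labeled and unlabeled domains without any change to its parameters.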

In Algorithm 4.1, $\mathcal{L}_C$ and $\mathcal{L}_D$ are the loss functions of the text classifier $C$ and the domain discriminator $D$, respectively. As mentioned in Section 4.2.1, $C$ has a softmax layer on top for classification. We hence adopt the canonical negative log-likelihood (NLL) loss:

$$\mathcal{L}_C(\hat{y}, y) = -\log P(\hat{y} = y) \tag{4.1}$$

where $y$ is the true label and $\hat{y}$ is the softmax prediction. For $D$, we consider two variants of MAN. The first uses the same NLL loss as $C$, which suits the classification task; the second uses the Least-Square (L2) loss, which was shown to alleviate the gradient vanishing problem that arises when using the NLL loss in the adversarial setting (Mao et al., 2017):

$$\mathcal{L}_D^{NLL}(\hat{d}, d) = -\log P(\hat{d} = d) \tag{4.2}$$

$$\mathcal{L}_D^{L2}(\hat{d}, d) = \sum_{i=1}^{N} \left(\hat{d}_i - \mathbb{1}_{\{d = i\}}\right)^2 \tag{4.3}$$

where $d$ is the domain index of some sample and $\hat{d}$ is the prediction. Without loss of generality, we normalize $\hat{d}$ so that $\sum_{i=1}^{N} \hat{d}_i = 1$ and $\forall i: \hat{d}_i \ge 0$. Therefore, the objectives of $C$ and $D$ that we are minimizing are:

$$J_C = \sum_{i=1}^{N} \mathbb{E}_{(x,y) \sim X_i} \left[\mathcal{L}_C(C(F_s(x), F_d(x)); y)\right] \tag{4.4}$$

$$J_D = \sum_{i=1}^{N} \mathbb{E}_{x \sim U_i} \left[\mathcal{L}_D(D(F_s(x)); d)\right] \tag{4.5}$$
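As a per-sample illustration of Equations (4.1) to (4.3), the two loss variants can be written in a few lines. This is a minimal sketch: the function names and the example prediction are made up, and indices are 0-based here whereas the equations count domains from 1.

```python
import math


def nll_loss(probs, target):
    """Eq. (4.1)/(4.2): negative log-probability of the true index."""
    return -math.log(probs[target])


def l2_loss(probs, target):
    """Eq. (4.3): squared distance between probs and the one-hot target."""
    return sum((p - (1.0 if i == target else 0.0)) ** 2
               for i, p in enumerate(probs))


d_hat = [0.7, 0.2, 0.1]   # made-up discriminator output over N = 3 domains
nll = nll_loss(d_hat, 0)  # -log(0.7)
l2 = l2_loss(d_hat, 0)    # 0.3**2 + 0.2**2 + 0.1**2
```

The objectives $J_C$ and $J_D$ in Equations (4.4) and (4.5) are then sums over domains of the expected value of these per-sample losses, which in practice would be estimated on mini-batches.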

For the feature extractors, the training of domain feature extractors is straightforward, as their sole objective is to help $C$ perform better within their own domain. Hence, $J_{F_d} = J_C$ for any domain $d$. Finally, the shared feature extractor $F_s$ has two objectives: to help $C$ achieve higher accuracy, and to make the feature distribution invariant across all domains. It thus leads to the following bipartite loss:

$$J_{F_s} = J_{F_s}^C + \lambda \cdot J_{F_s}^D \tag{4.6}$$

where $\lambda$ is a hyperparameter balancing the two parts. $J_{F_s}^D$ is the domain loss of $F_s$, anticorrelated to $J_D$:

$$\text{(NLL)} \quad J_{F_s}^D = -J_D \tag{4.7}$$

$$\text{(L2)} \quad J_{F_s}^D = \sum_{i=1}^{N} \mathbb{E}_{x \sim U_i} \left[\sum_{j=1}^{N} \left(D_j(F_s(x)) - \frac{1}{N}\right)^2\right] \tag{4.8}$$

If $D$ adopts the NLL loss (4.7), the domain loss is simply $-J_D$. For the L2 loss (4.8), $J_{F_s}^D$ intuitively translates to pushing $D$ to make random predictions. See Section 4.3 for theoretical justifications.
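The per-sample form of the two domain losses of $F_s$ in Equations (4.7) and (4.8) can be checked numerically. In this sketch the function names and the two example discriminator outputs are hypothetical, and indices are 0-based.

```python
import math


def fs_domain_loss_nll(probs, domain):
    """Eq. (4.7) for one sample: the negated NLL loss of D."""
    return math.log(probs[domain])  # equals -(-log P(d_hat = d))


def fs_domain_loss_l2(probs):
    """Eq. (4.8) for one sample: squared distance from the uniform distribution."""
    n = len(probs)
    return sum((p - 1.0 / n) ** 2 for p in probs)


confident = [0.9, 0.05, 0.05]     # D identifies domain 0 easily
uniform = [1 / 3, 1 / 3, 1 / 3]   # D is fully confused

l2_confident = fs_domain_loss_l2(confident)
l2_uniform = fs_domain_loss_l2(uniform)
nll_confident = fs_domain_loss_nll(confident, 0)
nll_uniform = fs_domain_loss_nll(uniform, 0)
```

Under both variants the uniform prediction yields a lower loss for $F_s$ than the confident one, and the L2 variant reaches its minimum of zero exactly when $D$ outputs $1/N$ for every domain.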

In document Learning Deep Representations for Low-Resource Cross-Lingual Natural Language Processing (Page 65-70)