In this chapter, we tackle the text classification problem in the real-world setting in which texts come from a variety of domains, each with a varying amount of labeled data. Specifically, assume we have a total of $N$ domains: $N_1$ labeled domains (denoted as $\Delta_L$) for which there is some labeled data, and $N_2$ unlabeled domains ($\Delta_U$) for which no annotated training instances are available. Denote $\Delta = \Delta_L \cup \Delta_U$ as the collection of all domains, with $N = N_1 + N_2$. The goal of this work, and of MDTC in general, is to improve the overall classification performance across all $N$ domains, measured in this paper as the average³ classification accuracy across the $N$ domains in $\Delta$.
³ In this work, we use macro-average over domains, but MAN can be readily adapted for micro-average or other (weighted) averaging schemes.
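To make the difference between the averaging schemes concrete, the following minimal sketch computes both from per-domain counts. The domain names and counts are hypothetical and purely illustrative; they are not results from this work.

```python
def domain_accuracies(results):
    """Per-domain accuracy from {domain: (num_correct, num_total)}."""
    return {d: correct / total for d, (correct, total) in results.items()}


def macro_average(results):
    """Unweighted mean of per-domain accuracies (the measure used in this chapter)."""
    accs = domain_accuracies(results)
    return sum(accs.values()) / len(accs)


def micro_average(results):
    """Accuracy pooled over all test instances, so larger domains weigh more."""
    correct = sum(c for c, _ in results.values())
    total = sum(t for _, t in results.values())
    return correct / total


# Hypothetical counts for three domains (illustrative only).
results = {"books": (90, 100), "dvd": (40, 50), "kitchen": (700, 1000)}
macro = macro_average(results)   # mean of 0.9, 0.8 and 0.7
micro = micro_average(results)   # 830 correct out of 1150 instances
```

With these counts the macro-average treats every domain equally, while the micro-average is pulled toward the largest domain.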
[Figure 4.1: diagram. Labels: mini-batch of documents from domain $d_i \in \Delta$; Shared Feature Extractor $F_s$; Domain Feature Extractor $F_{d_i}$; Text Classifier $C$; Domain Discriminator $D$; Class Label; Domain Label; losses $J_C$ (if $d_i \in \Delta_L$), $J^D_{F_s}$ and $J_D$. Legend: forward and backward passes when updating the parameters of $F_s$, $F_d$ and $C$; forward and backward passes when updating the parameters of $D$.]

Figure 4.1: MAN for MDTC. The figure demonstrates the training on a mini-batch of data from one domain. One training iteration consists of one such mini-batch training from each domain. The parameters of $F_s$, $F_d$, $C$ are updated together, and the training flows are illustrated by the green arrows. The parameters of $D$ are updated separately, shown in red arrows. Solid lines indicate forward passes while dotted lines are backward passes. $J^D_{F_s}$ is the domain loss for $F_s$, which is anticorrelated with $J_D$ (e.g. $J^D_{F_s} = -J_D$). (See Section 4.2 and Section 4.3)
4.2.1 Model Architecture
As shown in Figure 4.1, the Multinomial Adversarial Network (MAN) adopts the Shared-Private paradigm of Bousmalis et al. (2016) and consists of four components: a shared feature extractor $F_s$, a domain feature extractor $F_{d_i}$ for each labeled domain $d_i \in \Delta_L$, a text classifier $C$, and a domain discriminator $D$. The main idea of MAN is to explicitly model the domain-invariant features that are beneficial to the classification task across all domains (the shared features, extracted by $F_s$), as well as the domain-specific features that mainly contribute to the classification in its own domain (the domain features, extracted by $F_d$). Here, the adversarial domain discriminator $D$ has a multinomial output: it takes a shared feature vector and predicts the likelihood of that sample coming from each domain. As seen in Figure 4.1, during the training of $F_s$ (green arrows denote the training flow), $F_s$ aims to confuse $D$ by minimizing $J^D_{F_s}$, which is anticorrelated with $J_D$ (detailed in Section 4.2.2), so that $D$ cannot predict the domain of a sample given its shared features. The intuition is that if even a strong discriminator $D$ cannot tell the domain of a sample from the extracted features, the features learned by $F_s$ are essentially domain invariant. By enforcing domain-invariant features to be learned by $F_s$, when trained jointly via backpropagation, the set of domain feature extractors $F_d$ will each learn domain-specific features beneficial within its own domain.
The architecture of each component is relatively flexible and can be decided by practitioners to suit their particular classification tasks. For instance, the feature extractors can adopt the form of Convolutional Neural Nets (CNN), Recurrent Neural Nets (RNN), or a Multi-Layer Perceptron (MLP), depending on the input data (see Section 4.4). The input of MAN will also depend on the feature extractor choice. The output of a (shared/domain) feature extractor is a fixed-length vector, which is considered the (shared/domain) hidden features of some given input text. On the other hand, the outputs of $C$ and $D$ are label probabilities for class and domain prediction, respectively. For example, both $C$ and $D$ can be MLPs with a softmax layer on top. In Section 4.3, we provide alternative architectures for $D$ and their mathematical implications. We now present a detailed description of the MAN training in Section 4.2.2 as well as the
Require: labeled corpus X; unlabeled corpus U; hyperparameters λ > 0, k ∈ ℕ
repeat
    ▷ D iterations
    for diter = 1 to k do
        l_D = 0
        for all d ∈ ∆ do                      ▷ For all N domains
            Sample a mini-batch x ∼ U_d
            f_s = F_s(x)                      ▷ Shared feature vector
            l_D += J_D(D(f_s); d)             ▷ Accumulate D loss
        Update D parameters using ∇l_D
    ▷ Main iteration
    loss = 0
    for all d ∈ ∆_L do                        ▷ For all labeled domains
        Sample a mini-batch (x, y) ∼ X_d
        f_s = F_s(x)
        f_d = F_d(x)                          ▷ Domain feature vector
        loss += J_C(C(f_s, f_d); y)           ▷ Compute C loss
    for all d ∈ ∆ do                          ▷ For all N domains
        Sample a mini-batch x ∼ U_d
        f_s = F_s(x)
        loss += λ · J^D_{F_s}(D(f_s); d)      ▷ Domain loss of F_s
    Update F_s, F_d, C parameters using ∇loss
until convergence

Algorithm 4.1: MAN Training
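The control flow of Algorithm 4.1 can be sketched in Python as follows. This is only a framework-agnostic illustration of the update schedule, not the implementation used in this work: every argument (`sample_unlabeled`, `sample_labeled`, `update_D`, `update_main`, and the model and loss callables) is a hypothetical callback that a real system would supply, and gradient computation is left entirely to those callbacks.

```python
def man_training_iteration(all_domains, labeled_domains, k, lam,
                           sample_unlabeled, sample_labeled,
                           Fs, Fd, C, D, J_C, J_D, J_D_Fs,
                           update_D, update_main):
    """One pass of the outer `repeat` loop; returns (last D loss, main loss)."""
    l_D = 0.0
    # D iterations: k discriminator updates, each accumulated over all N domains.
    for _ in range(k):
        l_D = 0.0
        for d in all_domains:
            x = sample_unlabeled(d)
            f_s = Fs(x)                    # shared feature vector
            l_D += J_D(D(f_s), d)          # accumulate D loss
        update_D(l_D)                      # callback expected to update only D

    # Main iteration: a single joint update of Fs, Fd and C.
    loss = 0.0
    for d in labeled_domains:
        x, y = sample_labeled(d)
        f_s = Fs(x)
        f_d = Fd[d](x)                     # domain feature vector
        loss += J_C(C(f_s, f_d), y)        # classification loss
    for d in all_domains:
        x = sample_unlabeled(d)
        f_s = Fs(x)
        loss += lam * J_D_Fs(D(f_s), d)    # domain loss of Fs
    update_main(loss)                      # callback expected to update Fs, Fd, C
    return l_D, loss


# Stub components that return constant losses and count updates (illustrative only).
calls = {"D": 0, "main": 0}
l_D, loss = man_training_iteration(
    all_domains=["a", "b", "c"], labeled_domains=["a", "b"], k=2, lam=0.5,
    sample_unlabeled=lambda d: [1.0], sample_labeled=lambda d: ([1.0], 0),
    Fs=lambda x: x, Fd={"a": lambda x: x, "b": lambda x: x},
    C=lambda f_s, f_d: f_s + f_d, D=lambda f_s: f_s,
    J_C=lambda pred, y: 1.0, J_D=lambda pred, d: 1.0,
    J_D_Fs=lambda pred, d: -1.0,
    update_D=lambda l: calls.__setitem__("D", calls["D"] + 1),
    update_main=lambda l: calls.__setitem__("main", calls["main"] + 1),
)
```

With the stubs above, the discriminator is updated `k = 2` times per iteration and the remaining components once, which mirrors the two separate update steps in the algorithm.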
4.2.2 MAN Training
Denote the annotated corpus in a labeled domain $d_i \in \Delta_L$ as $X_i$; $(x, y) \sim X_i$ is a sample drawn from the labeled data in domain $d_i$, where $x$ is the input and $y$ is the task label. On the other hand, for any domain $d_{i'} \in \Delta$, denote the unlabeled corpus as $U_{i'}$. Note that for the unlabeled data of a labeled domain, one can use a separate unlabeled corpus, simply use the labeled data, or use both.
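As a small illustration of these options, the sketch below assembles the unlabeled corpus of one domain from its labeled pairs, a separate unlabeled corpus, or both. The helper name `build_unlabeled_corpus` and the example texts are hypothetical.

```python
def build_unlabeled_corpus(labeled=None, extra_unlabeled=None):
    """Return the inputs U_i for one domain, dropping any task labels."""
    corpus = []
    if labeled:
        corpus.extend(x for x, _y in labeled)   # reuse labeled inputs, ignore y
    if extra_unlabeled:
        corpus.extend(extra_unlabeled)
    return corpus


X_books = [("great read", 1), ("dull plot", 0)]   # hypothetical labeled pairs (x, y)
U_books_extra = ["arrived on time"]               # hypothetical separate unlabeled text
U_books = build_unlabeled_corpus(X_books, U_books_extra)           # labeled domain
U_toys = build_unlabeled_corpus(extra_unlabeled=["fun for kids"])  # unlabeled domain
```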
In Figure 4.1, the arrows illustrate the training flows of various components. Due to the adversarial nature of the domain discriminator $D$, it is trained with a separate optimizer (red arrows), while the rest of the networks are updated with the main optimizer (green arrows). $C$ is only trained on the annotated data from labeled domains, and it takes as input the concatenation of the shared and domain feature vectors. At test time, for data from unlabeled domains with no $F_d$, the domain features are set to the $\mathbf{0}$ vector for $C$'s input. In contrast, $D$ only takes the shared features as input, for both labeled and unlabeled domains. The MAN training procedure is described in Algorithm 4.1.
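The way the inputs of $C$ and $D$ are assembled can be sketched as follows, using plain Python lists as feature vectors. The function names, the `domain_dim` parameter, and the feature values are assumptions made for illustration only.

```python
def classifier_input(shared_features, domain_features=None, domain_dim=None):
    """Concatenate shared and domain features; use zeros when no F_d exists."""
    if domain_features is None:
        if domain_dim is None:
            raise ValueError("domain_dim is required when domain_features is None")
        domain_features = [0.0] * domain_dim    # unlabeled domain: 0 vector
    return list(shared_features) + list(domain_features)


def discriminator_input(shared_features):
    """D only sees the shared features, for labeled and unlabeled domains alike."""
    return list(shared_features)


f_s = [0.2, -0.1, 0.7]     # hypothetical output of F_s
f_d = [0.5, 0.3]           # hypothetical output of F_d for a labeled domain
labeled_in = classifier_input(f_s, f_d)
unlabeled_in = classifier_input(f_s, domain_dim=2)
```

Padding with zeros keeps the input dimension of $C$ the same for every domain, so a single classifier can serve labeled and unlabeled domains.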
In Algorithm 4.1, $L_C$ and $L_D$ are the loss functions of the text classifier $C$ and the domain discriminator $D$, respectively. As mentioned in Section 4.2.1, $C$ has a softmax layer on top for classification. We hence adopt the canonical negative log-likelihood (NLL) loss:

$$L_C(\hat{y}, y) = -\log P(\hat{y} = y) \tag{4.1}$$

where $y$ is the true label and $\hat{y}$ is the softmax predictions. For $D$, we consider two variants of MAN. The first uses the same NLL loss as $C$, which suits the classification task; the other uses the Least-Square (L2) loss, which was shown to alleviate the gradient vanishing problem when using the NLL loss in the adversarial setting (Mao et al., 2017):

$$L_D^{NLL}(\hat{d}, d) = -\log P(\hat{d} = d) \tag{4.2}$$

$$L_D^{L2}(\hat{d}, d) = \sum_{i=1}^{N} \left( \hat{d}_i - \mathbb{1}\{d = i\} \right)^2 \tag{4.3}$$

where $d$ is the domain index of some sample and $\hat{d}$ is the prediction. Without loss of generality, we normalize $\hat{d}$ so that $\sum_{i=1}^{N} \hat{d}_i = 1$ and $\forall i: \hat{d}_i \geq 0$. Therefore, the objectives of $C$ and $D$ that we are minimizing are:

$$J_C = \sum_{i=1}^{N} \mathbb{E}_{(x,y) \sim X_i} \left[ L_C(C(F_s(x), F_d(x)); y) \right] \tag{4.4}$$

$$J_D = \sum_{i=1}^{N} \mathbb{E}_{x \sim U_i} \left[ L_D(D(F_s(x)); d) \right] \tag{4.5}$$
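The per-sample losses in Eqs. (4.1) to (4.3) can be evaluated on a single normalized prediction as in the sketch below. The prediction vector is hypothetical, and domains are indexed from 0 here (the text indexes them from 1).

```python
import math


def nll_loss(probs, target):
    """Negative log-likelihood of the true index, as in Eqs. (4.1) and (4.2)."""
    return -math.log(probs[target])


def l2_domain_loss(probs, target):
    """Least-squares loss against the one-hot domain indicator, as in Eq. (4.3)."""
    return sum((p - (1.0 if i == target else 0.0)) ** 2
               for i, p in enumerate(probs))


d_hat = [0.7, 0.2, 0.1]    # hypothetical normalized prediction over N = 3 domains
true_domain = 0            # 0-based index of the true domain
loss_nll = nll_loss(d_hat, true_domain)
loss_l2 = l2_domain_loss(d_hat, true_domain)
```

Both losses are zero for a perfectly confident correct prediction. The L2 loss stays bounded for a normalized prediction, whereas the NLL loss grows without bound as the probability of the true domain approaches zero.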
For the feature extractors, the training of the domain feature extractors is straightforward, as their sole objective is to help $C$ perform better within their own domain. Hence, $J_{F_d} = J_C$ for any domain $d$. Finally, the shared feature extractor $F_s$ has two objectives: to help $C$ achieve higher accuracy, and to make the feature distribution invariant across all domains. It thus leads to the following bipartite loss:

$$J_{F_s} = J^C_{F_s} + \lambda \cdot J^D_{F_s} \tag{4.6}$$

where $\lambda$ is a hyperparameter balancing the two parts. $J^D_{F_s}$ is the domain loss of $F_s$, anticorrelated with $J_D$:

$$\text{(NLL)} \quad J^D_{F_s} = -J_D \tag{4.7}$$

$$\text{(L2)} \quad J^D_{F_s} = \sum_{i=1}^{N} \mathbb{E}_{x \sim U_i} \left[ \sum_{j=1}^{N} \left( D_j(F_s(x)) - \frac{1}{N} \right)^2 \right] \tag{4.8}$$

If $D$ adopts the NLL loss (4.7), the domain loss is simply $-J_D$. For the L2 loss (4.8), $J^D_{F_s}$ intuitively translates to pushing $D$ to make random predictions. See Section 4.3 for theoretical justifications.
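The intuition behind the two domain losses of $F_s$ can be checked numerically on a single sample. The sketch below evaluates the per-sample terms of Eqs. (4.7) and (4.8) for a uniform and a confident discriminator output; the function names and the probability vectors are hypothetical.

```python
import math


def l2_shared_domain_loss(probs):
    """Per-sample term of Eq. (4.8): squared distance of D's output from uniform."""
    n = len(probs)
    return sum((p - 1.0 / n) ** 2 for p in probs)


def nll_shared_domain_loss(probs, target):
    """Per-sample term of Eq. (4.7): the negated NLL loss of D."""
    return math.log(probs[target])


uniform = [0.25, 0.25, 0.25, 0.25]      # D cannot tell the domain (N = 4)
confident = [0.97, 0.01, 0.01, 0.01]    # D is nearly sure of domain 0

l2_uniform = l2_shared_domain_loss(uniform)
l2_confident = l2_shared_domain_loss(confident)
nll_uniform = nll_shared_domain_loss(uniform, 0)
nll_confident = nll_shared_domain_loss(confident, 0)
```

Under both variants the loss of $F_s$ is lower when $D$ is uncertain than when it confidently identifies the true domain, and the L2 variant reaches its minimum of zero exactly at the uniform prediction.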