CHAPTER 4 : Learning a Discriminative Embedding for Unsupervised Domain Adapta-
4.4 Proposed Framework
4.4.2 Conditional Distribution Alignment
A main shortcoming of Eq. (4.3) is that minimizing the discrepancy between $p_S(\phi(X_S))$ and $p_T(\phi(X_T))$ does not guarantee semantic consistency between the two domains because we are not
using any label information to minimize the discrepancy between the distributions. As a result, we can end up learning an embedding in which the two distributions have low discrepancy but which is not discriminative for both domains. To clarify this point, consider source and target domains that consist of images of printed digits and handwritten digits, respectively. While the feature distributions in the embedding space could have low discrepancy, the classes might not be correctly aligned in this space, e.g., digits from a class in the target domain could be matched to a wrong class of the source domain or, even worse, digits from multiple classes in the target domain could be matched to the cluster of a single digit of the source domain. In such cases, the source classifier will not generalize well on the target domain.

Algorithm 3 DACAD ($\mathcal{L}, \eta, \lambda$)
Input: data $\mathcal{D}_S = (X_S, Y_S)$; $\mathcal{D}_T = (X_T)$
Pre-training:
    $\hat{\theta}_0 = (w_0, v_0) = \arg\min_{\theta} \sum_i \mathcal{L}(f_\theta(x_i^s), y_i^s)$
for $itr = 1, \ldots, ITR$ do
    $\mathcal{D}_{PL} = \{(x_i^t, \hat{y}_i^t) \mid \hat{y}_i^t = f_{\hat{\theta}}(x_i^t),\ p(\hat{y}_i^t \mid x_i^t) > \tau\}$
    for $alt = 1, \ldots, ALT$ do
        Update encoder parameters using pseudo-labels:
            $\hat{v} = \arg\min_v \sum_j D\big(p_S(\phi_v(x^S) \mid C_j),\, p_{SL}(\phi_v(x^T) \mid C_j)\big)$
        Update entire model:
            $\hat{v}, \hat{w} = \arg\min_{w,v} \sum_{i=1}^{N} \mathcal{L}\big(h_w(\phi_{\hat{v}}(x_i^s)), y_i^s\big)$
    end for
end for

In other words, the shared embedding space, $\mathcal{Z}$, might not be a semantically meaningful space for the target domain if we solely minimize SWD between $p_S(\phi(X_S))$ and $p_T(\phi(X_T))$. To solve this challenge, we should learn the encoder function
such that the class-conditioned probabilities of both domains in the embedding space are similar, i.e.,
$p_S(\phi(x^S) \mid C_j) \approx p_T(\phi(x^T) \mid C_j)$, where $C_j$ denotes a particular class. In other words, we need to
consider the labels when we compute the distance between two distributions. Given this, we can mitigate the class matching problem by using an adapted version of Eq. (4.3) as follows:
$$\min_{v,w} \sum_{i=1}^{N} \mathcal{L}\big(h_w(\phi_v(x_i^s)), y_i^s\big) + \lambda \sum_{j=1}^{k} D\big(p_S(\phi_v(x^S) \mid C_j),\, p_T(\phi_v(x^T) \mid C_j)\big), \tag{4.9}$$
where the discrepancy between distributions is minimized conditioned on classes, to enforce semantic alignment in the embedding space by considering the labels. Solving Eq. (4.9), however, is not tractable in the UDA setting as the labels for the target domain are not available and the conditional distribution, $p_T(\phi(x^T) \mid C_j)$, is not known.
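To make the class-conditional term of Eq. (4.9) concrete, the following is a minimal sketch, not the implementation used in this work. It assumes embeddings are already available as lists of coordinates and that some label assignment for the target points is given (in UDA it is not, so `yt` is only a stand-in). The function names, the quantile-based handling of unequal class sizes, and the parameters `n_proj`, `n_q`, and `seed` are illustrative assumptions, and the code uses only the Python standard library rather than a deep learning framework.

```python
import math
import random

def random_direction(dim, rng):
    """Draw a direction uniformly from the unit sphere in R^dim."""
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in g))
        if norm > 0.0:
            return [c / norm for c in g]

def quantiles(values, n_q):
    """Empirical quantiles at n_q evenly spaced mid-point levels (nearest rank)."""
    s = sorted(values)
    return [s[min(len(s) - 1, int((k + 0.5) / n_q * len(s)))] for k in range(n_q)]

def swd(xs, xt, n_proj=50, n_q=32, seed=0):
    """Monte Carlo estimate of the squared sliced 2-Wasserstein distance.

    Each set is projected onto random directions; the 1-D distance is
    approximated by comparing matched empirical quantiles, which lets the
    two sets have different sizes (an assumption of this sketch).
    """
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_proj):
        d = random_direction(dim, rng)
        ps = quantiles([sum(a * b for a, b in zip(x, d)) for x in xs], n_q)
        pt = quantiles([sum(a * b for a, b in zip(x, d)) for x in xt], n_q)
        total += sum((a - b) ** 2 for a, b in zip(ps, pt)) / n_q
    return total / n_proj

def conditional_swd(zs, ys, zt, yt, n_classes, **kw):
    """Sum over classes of the SWD between same-class source and target points.

    Classes with no sample in one of the domains are skipped.
    """
    loss = 0.0
    for j in range(n_classes):
        src = [z for z, y in zip(zs, ys) if y == j]
        tgt = [z for z, y in zip(zt, yt) if y == j]
        if src and tgt:
            loss += swd(src, tgt, **kw)
    return loss

if __name__ == "__main__":
    rng = random.Random(1)
    # Two well-separated source clusters: class 0 near (0, 0), class 1 near (5, 5).
    zs = [[rng.gauss(0, 0.1), rng.gauss(0, 0.1)] for _ in range(40)]
    zs += [[rng.gauss(5, 0.1), rng.gauss(5, 0.1)] for _ in range(40)]
    ys = [0] * 40 + [1] * 40
    zt = [list(z) for z in zs]        # target marginal equals the source marginal
    yt_swapped = [1 - y for y in ys]  # but the two classes are exchanged
    print("marginal SWD:", swd(zs, zt))
    print("conditional SWD, aligned classes:", conditional_swd(zs, ys, zt, ys, 2))
    print("conditional SWD, swapped classes:", conditional_swd(zs, ys, zt, yt_swapped, 2))
```

The toy data mirrors the digit example: the marginal discrepancy cannot distinguish the aligned case from the class-swapped case, whereas the class-conditional sum penalizes the swap.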
To tackle the above issue, we introduce a surrogate of the objective in Eq. (4.9) that can be computed in the UDA setting. Our idea is to approximate $p_T(\phi(x^T) \mid C_j)$ by generating pseudo-labels for the target data points and using these pseudo-labels to align the domains class-conditionally. The
pseudo-labels are obtained from the source classifier prediction, but only for the portion of target data points for which the source classifier provides a confident prediction. Such data points presumably exist because the two domains are related. As a result, a classifier learned merely on the source domain is able to classify some of the target data points correctly. Our idea is to use an iterative loop in which the confident pseudo-labels are used to align the distributions class-conditionally and, as a result of the improved alignment, the number of confident pseudo-labels can increase.

More specifically, we solve Eq. (4.9) in incremental gradient descent iterations. We first initialize the encoder and the classifier networks by training them solely on the source data. We then update the networks using the data from both domains in an iterative scheme, alternating between optimizing the classification loss for the source data and the SWD loss term. At each iteration, we pass the target domain data points into the classifier learned on the source data and analyze the label probability distribution at the softmax layer of the classifier. We choose a threshold $\tau$, e.g., $\tau = 0.99$, and assign pseudo-labels only to those target data points for which the classifier predicts a label with high confidence, i.e., $p(y_i \mid x_i^t) > \tau$. Since the source and the
target domains are related, it is reasonable to expect that the source classifier can classify a subset of target data points correctly with high confidence. We use these data points to approximate $p_T(\phi(x^T) \mid C_j)$ in Eq. (4.9) and update the encoder parameters, $v$, accordingly using only the data points with confident pseudo-labels. We then update $v$ by using the labeled source data points and the pseudo-labeled target data points, i.e., we also use the pseudo-labeled data points to minimize the classification loss. In our experiments, we have observed that as more optimization iterations are performed, the number of data points with confident pseudo-labels increases, suggesting a gradual alignment of the distributions, and our approximation of Eq. (4.9) improves and becomes more stable. This forces the source and the target distributions to align class-conditionally in the embedding space, making the learned embedding discriminative for both domains. Figure 10 visualizes this process using real data (explained in more detail in the Experimental Validations section). Our proposed framework, named Domain Adaptation with Conditional Alignment of Distributions (DACAD), is summarized in Algorithm 3.
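The confidence-thresholded pseudo-labeling step and the alternating loop described above can be sketched as follows. This is an illustrative skeleton under stated assumptions, not the reference implementation: the classifier is represented only by its raw output scores (logits), and `predict_logits`, `align_step`, and `classify_step` are hypothetical hooks standing in for the network forward pass, the class-conditional SWD update of the encoder, and the classification-loss update, respectively.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of class scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def select_pseudo_labels(target_logits, tau=0.99):
    """Keep only target points whose top softmax probability exceeds tau.

    Returns a list of (target_index, pseudo_label) pairs.
    """
    selected = []
    for i, logits in enumerate(target_logits):
        probs = softmax(logits)
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] > tau:
            selected.append((i, label))
    return selected

def dacad_outer_loop(predict_logits, align_step, classify_step,
                     n_iters, n_alt, tau=0.99):
    """Skeleton of the outer and inner loops; all three callables are
    hypothetical hooks supplied by the caller (assumptions of this sketch).
    """
    history = []
    for _ in range(n_iters):
        # Refresh the confident pseudo-labeled subset with the current model.
        pseudo = select_pseudo_labels(predict_logits(), tau)
        history.append(len(pseudo))
        for _ in range(n_alt):
            align_step(pseudo)     # encoder update on the class-conditional SWD term
            classify_step(pseudo)  # model update on the classification loss
    return history

if __name__ == "__main__":
    # Toy scores for three target points and three classes.
    logits = [[10.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.0, 12.0]]
    print(select_pseudo_labels(logits, tau=0.99))
```

Returning the number of selected points per outer iteration makes it easy to monitor whether the confident subset grows as training proceeds, which is the behavior reported above.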