CHAPTER 4 : Learning a Discriminative Embedding for Unsupervised Domain Adapta-
4.3 Problem Statement
We focus on the supervised learning setting, where the goal is to learn a classifier in a domain using a training dataset. In particular, consider a source domain, $\mathcal{D}_S = (\mathbf{X}_S, \mathbf{Y}_S)$, with $N$ labeled samples, where $\mathbf{X}_S = [\mathbf{x}_1^s, \ldots, \mathbf{x}_N^s] \in \mathcal{X} \subset \mathbb{R}^{d \times N}$ denotes the samples and $\mathbf{Y}_S = [\mathbf{y}_1^s, \ldots, \mathbf{y}_N^s] \in \mathcal{Y} \subset \mathbb{R}^{k \times N}$ contains the corresponding labels. Note that the label $\mathbf{y}_n^s$ identifies the membership of $\mathbf{x}_n^s$ to one or more of the $k$ classes (e.g., digits $1, \ldots, 10$ for digit recognition). We assume that the source samples are drawn i.i.d. from the source joint probability distribution, i.e., $(\mathbf{x}_i^s, \mathbf{y}_i^s) \sim p(\mathbf{x}^S, \mathbf{y}^S)$. We denote the source marginal distribution over $\mathbf{x}^S$ by $p_S$. Additionally, we have a related target domain with $M$ unlabeled data points, $\mathbf{X}_T = [\mathbf{x}_1^t, \ldots, \mathbf{x}_M^t] \in \mathbb{R}^{d \times M}$. The target domain shares the same type of labels as the source domain, and we assume that the target samples are drawn from the target marginal distribution, $\mathbf{x}_i^t \sim p_T$. Although the two domains describe similar classification problems, a distribution discrepancy exists between them, i.e., $p_S \neq p_T$, and hence they are distinct domains. The main goal is to learn a classifier for the target domain whose generalization error, i.e., true risk, is minimal. Since the target data points are unlabeled, the strategy is to learn such a classifier through knowledge transfer from the source domain, where the data points are labeled. The main challenge is how to transfer knowledge effectively and selectively from the source domain to the target domain.
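To make the notation concrete, the following is a minimal sketch of how the three data matrices could be laid out as arrays, with samples stored as columns as in the definitions above. The sizes, the random data, and the mean shift used to mimic $p_S \neq p_T$ are illustrative assumptions and are not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, N, M = 5, 3, 8, 6   # illustrative sizes only (assumed)

# Source domain D_S = (X_S, Y_S): samples are columns, labels are one-hot columns.
X_S = rng.normal(size=(d, N))            # shape (d, N)
labels_S = rng.integers(0, k, size=N)    # class index of each source sample
Y_S = np.eye(k)[:, labels_S]             # shape (k, N), column n is one-hot for sample n

# Target domain: M unlabeled samples only. The mean shift is an assumed stand-in
# for the distribution discrepancy p_S != p_T.
X_T = rng.normal(loc=1.5, size=(d, M))   # shape (d, M)

print(X_S.shape, Y_S.shape, X_T.shape)
```

Note that no label matrix exists for the target domain; any adaptation procedure can only use `X_T` together with the labeled pair `(X_S, Y_S)`.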
Training a parametric classifier using the labeled source data points is a straightforward supervised learning problem with known conditions for guaranteed generalization performance. Given a large enough number of source samples, $N$, we can consider a family of parametric functions $f_\theta : \mathbb{R}^d \to \mathcal{Y}$, e.g., a deep neural network with concatenated learnable parameters $\theta$, and solve for an optimal parameter that maps the data points to their corresponding labels using standard supervised learning with minimal true risk, defined as follows:
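$$e_\theta = \mathbb{E}_{(\mathbf{x}, \mathbf{y}) \sim p(\mathbf{x}^S, \mathbf{y}^S)}\big(\mathcal{L}(f_\theta(\mathbf{x}), \mathbf{y})\big), \tag{4.1}$$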
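where $\mathcal{L}(\cdot)$ denotes the loss function. In practice, we solve for the optimal parameter $\hat{\theta}$ for the mapping by minimizing the empirical risk on the training dataset, $\hat{e}_\theta$, as a surrogate for the true risk: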
$$\hat{\theta} = \arg\min_\theta \hat{e}_\theta = \arg\min_\theta \sum_i \mathcal{L}(f_\theta(\mathbf{x}_i^s), \mathbf{y}_i^s). \tag{4.2}$$
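As an illustration of (4.2), the sketch below minimizes the empirical risk of a very small model by gradient descent. The choices here are assumptions made for the example and are not specified in the text: a linear softmax classifier stands in for the deep network $f_\theta$, cross-entropy is used as the loss $\mathcal{L}$, and the data are a synthetic two-class toy problem.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=0, keepdims=True)   # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def empirical_risk(theta, X, Y):
    """Cross-entropy summed over the samples (columns of X), as in (4.2)."""
    P = softmax(theta @ X)
    return -np.sum(Y * np.log(P + 1e-12))

def fit_erm(X, Y, lr=0.1, steps=200):
    """Gradient descent on the empirical risk of a linear softmax classifier."""
    k, d = Y.shape[0], X.shape[0]
    n = X.shape[1]
    theta = np.zeros((k, d))
    for _ in range(steps):
        grad = (softmax(theta @ X) - Y) @ X.T   # gradient of the summed cross-entropy
        theta -= lr * grad / n                  # divide by n to keep the step size stable
    return theta

# Synthetic source data (assumed): two Gaussian classes in R^2, samples as columns.
rng = np.random.default_rng(0)
N = 100
labels = rng.integers(0, 2, size=N)
means = np.array([[-2.0, 2.0], [0.0, 0.0]])    # column c is the mean of class c
X_S = means[:, labels] + 0.5 * rng.normal(size=(2, N))
Y_S = np.eye(2)[:, labels]

theta0 = np.zeros((2, 2))
theta_hat = fit_erm(X_S, Y_S)
print("risk before:", empirical_risk(theta0, X_S, Y_S))
print("risk after: ", empirical_risk(theta_hat, X_S, Y_S))
```

The sum in `empirical_risk` mirrors the unnormalized sum in (4.2); dividing by the number of samples changes only the scale of the objective, not its minimizer.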
The learned classifier $f_{\hat{\theta}}$ generalizes well on testing data points if they are drawn from the same distribution as the training data points. Only then is the empirical risk, $\hat{e}_\theta$, a suitable surrogate for the true risk, $e_\theta$. Given the discrepancy between the source and target distributions, $f_{\hat{\theta}}$ will generally not generalize well to the target domain. Therefore, the training procedure for $f_{\hat{\theta}}$ needs to be adapted to incorporate the unlabeled target data points, so that the knowledge learned from the source domain can be used to classify data points drawn from the target domain distribution by reducing the discrepancy between the source and target domain distributions.
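The effect of the distribution discrepancy on a source-trained classifier can be seen in a small synthetic experiment. The sketch below is only an illustration under assumed conditions: a nearest-centroid rule stands in for $f_{\hat{\theta}}$, the two domains are two-class Gaussian mixtures, and the target domain is a translated copy of the source domain. Target labels are generated solely to score the predictions and are never used for fitting.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_domain(n, shift):
    """Two Gaussian classes in R^2 (samples as columns); `shift` translates the domain along x."""
    labels = rng.integers(0, 2, size=n)
    means = np.array([[-2.0 + shift, 2.0 + shift], [0.0, 0.0]])  # column c = mean of class c
    X = means[:, labels] + 0.5 * rng.normal(size=(2, n))
    return X, labels

def fit_centroids(X, labels):
    """Stand-in for the source-trained classifier: one centroid per class, stored as columns."""
    return np.stack([X[:, labels == c].mean(axis=1) for c in (0, 1)], axis=1)

def predict(centroids, X):
    # distances[c, n] = Euclidean distance between sample n and the centroid of class c
    distances = np.linalg.norm(X[:, None, :] - centroids[:, :, None], axis=0)
    return distances.argmin(axis=0)

X_s, y_s = sample_domain(400, shift=0.0)   # source domain, drawn from p_S
X_t, y_t = sample_domain(400, shift=3.0)   # target domain, drawn from a shifted p_T

centroids = fit_centroids(X_s, y_s)        # fitted on labeled source data only
acc_source = (predict(centroids, X_s) == y_s).mean()
acc_target = (predict(centroids, X_t) == y_t).mean()
print(f"source accuracy: {acc_source:.2f}, target accuracy: {acc_target:.2f}")
```

In this construction the shift moves the first target class across the source decision boundary, so most of its points are assigned to the wrong class even though the classifier fits the source data well.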
The main challenge is to circumvent the discrepancy between the source and target domain distributions. To that end, the mapping $f_\theta(\cdot)$ can be decomposed into a feature extractor $\phi_{\mathbf{v}}(\cdot)$ and a classifier $h_{\mathbf{w}}(\cdot)$, such that $f_\theta = h_{\mathbf{w}} \circ \phi_{\mathbf{v}}$, where $\mathbf{w}$ and $\mathbf{v}$ are the corresponding learnable parameters, i.e., $\theta = (\mathbf{w}, \mathbf{v})$. The feature extractor $\phi_{\mathbf{v}} : \mathcal{X} \to \mathcal{Z}$ maps the data points from both domains to an intermediate embedding space $\mathcal{Z} \subset \mathbb{R}^f$ (i.e., feature space), and the classifier $h_{\mathbf{w}} : \mathcal{Z} \to \mathcal{Y}$ maps the representations of the data points in the embedding space to the label set. Note that, as a deterministic function, the feature extractor $\phi_{\mathbf{v}}$ can change the distribution of its input data.
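The decomposition $f_\theta = h_{\mathbf{w}} \circ \phi_{\mathbf{v}}$ can be sketched in a few lines. The layer choices below (a single linear layer with a ReLU for the feature extractor and a linear layer with a softmax for the classifier), the dimensions, and the random parameters are assumptions for illustration; the text does not fix a particular architecture at this point.

```python
import numpy as np

rng = np.random.default_rng(0)
d, f, k = 4, 3, 2             # input, embedding, and label dimensions (assumed)

v = rng.normal(size=(f, d))   # feature-extractor parameters
w = rng.normal(size=(k, f))   # classifier parameters

def phi(v, X):
    """Feature extractor phi_v: X -> Z, here one linear layer followed by a ReLU."""
    return np.maximum(v @ X, 0.0)

def h(w, Z):
    """Classifier h_w: Z -> Y, here a linear layer followed by a softmax."""
    logits = w @ Z
    logits = logits - logits.max(axis=0, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=0, keepdims=True)

def f_theta(theta, X):
    """The full model f_theta = h_w o phi_v with theta = (w, v)."""
    w, v = theta
    return h(w, phi(v, X))

X_s = rng.normal(size=(d, 5))             # a source batch, samples as columns
X_t = rng.normal(loc=1.0, size=(d, 7))    # a target batch passes through the same phi_v
Z_s, Z_t = phi(v, X_s), phi(v, X_t)       # both domains land in the shared space Z
P_t = f_theta((w, v), X_t)                # class probabilities for the target batch
print(Z_s.shape, Z_t.shape, P_t.shape)
```

Because the same parameters `v` are applied to both batches, any change in `v` reshapes the source and target feature distributions together, which is what makes matching them in the embedding space possible.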
The core idea is to learn the feature extractor, $\phi_{\mathbf{v}}$, for both domains such that the domain-specific distributions of the extracted features are similar to one another. If $\phi_{\mathbf{v}}$ is learned such that the discrepancy between the source and target distributions is minimized in the embedding space, i.e., the discrepancy between $p_S(\phi_{\mathbf{v}}(\mathbf{x}^s))$ and $p_T(\phi_{\mathbf{v}}(\mathbf{x}^t))$, then the embedding becomes domain-agnostic, and the classifier $h_{\mathbf{w}}$ would generalize well on the target domain and could be used to label the target domain data points even though it is trained using only the labeled source data points. This is the core idea behind various prior domain adaptation approaches in the literature [129].
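Minimizing a discrepancy in the embedding space requires a measure that can be estimated from samples. This section does not name a specific measure, so the sketch below uses the sliced Wasserstein distance purely as one possible choice (others, such as maximum mean discrepancy or an adversarial domain classifier, would fit the same role). The function name, the number of projections, the equal-sample-size restriction, and the synthetic embeddings are all assumptions of this example.

```python
import numpy as np

def sliced_wasserstein(Z_s, Z_t, n_projections=50, seed=0):
    """Monte Carlo estimate of the squared sliced 2-Wasserstein distance.

    Z_s, Z_t: arrays of shape (f, n) holding embedded source and target points
    as columns. Equal sample counts are assumed to keep the sketch short.
    """
    rng = np.random.default_rng(seed)
    f = Z_s.shape[0]
    directions = rng.normal(size=(n_projections, f))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)  # unit vectors
    # Project onto each direction and sort: in one dimension, optimal transport
    # between equal-size samples pairs the points in sorted order.
    proj_s = np.sort(directions @ Z_s, axis=1)
    proj_t = np.sort(directions @ Z_t, axis=1)
    return np.mean((proj_s - proj_t) ** 2)

# Synthetic embeddings (assumed): one target set is shifted, the other is aligned.
rng = np.random.default_rng(1)
Z_src = rng.normal(size=(3, 500))
Z_tgt_shifted = rng.normal(loc=2.0, size=(3, 500))
Z_tgt_aligned = rng.normal(size=(3, 500))

d_shifted = sliced_wasserstein(Z_src, Z_tgt_shifted)
d_aligned = sliced_wasserstein(Z_src, Z_tgt_aligned)
print("shifted:", d_shifted, "aligned:", d_aligned)
```

A quantity of this kind, computed on $\phi_{\mathbf{v}}(\mathbf{x}^s)$ and $\phi_{\mathbf{v}}(\mathbf{x}^t)$, could be added to the source classification loss so that $\mathbf{v}$ is updated to shrink it while $h_{\mathbf{w}}$ is fitted on the labeled source data.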
Figure 8: The architecture of the proposed unsupervised domain adaptation framework. Note that the model is completely shared across the two domains. The shared encoder maps the data points from both domains to a shared embedding space, which is modeled as the output space of the encoder. The classifier sub-network is trained using solely the labeled data from the source domain. Since the distributions of the domains are matched in this embedding space, the classifier sub-network generalizes well on the target domain.