
CHAPTER 4 : Learning a Discriminative Embedding for Unsupervised Domain Adaptation

4.3 Problem Statement

We consider a supervised learning setting, where the goal is to learn a classifier in a domain using a training dataset. In particular, consider a source domain, $\mathcal{D}_S = (X_S, Y_S)$, with $N$ labeled samples, where $X_S = [x_1^s, \ldots, x_N^s] \in \mathcal{X} \subset \mathbb{R}^{d \times N}$ denotes the samples and $Y_S = [y_1^s, \ldots, y_N^s] \in \mathcal{Y} \subset \mathbb{R}^{k \times N}$ contains the corresponding labels. Note that the label $y_n^s$ identifies the membership of $x_n^s$ to one or multiple of the $k$ classes (e.g., digits 1, ..., 10 for digit recognition). We assume that the source samples are drawn i.i.d. from the source joint probability distribution, i.e., $(x_i^s, y_i^s) \sim p(x^S, y)$. We denote the source marginal distribution over $x^S$ by $p_S$. Additionally, we have a related target domain with $M$ unlabeled data points $X_T = [x_1^t, \ldots, x_M^t] \in \mathbb{R}^{d \times M}$. The same type of labels as in the source domain holds for the target domain, and we assume that the target samples are drawn from the target marginal distribution, $x_i^t \sim p_T$. Although the two domains describe similar classification problems, a distribution discrepancy exists between them, i.e., $p_S \neq p_T$, and hence they are distinct domains.

The main goal is to learn a classifier for the target domain such that its generalization error, i.e., true risk, is minimal. Since the target data points are unlabeled, the strategy is to learn such a classifier through knowledge transfer from the source domain, where the data points are labeled. The main challenge is how to transfer knowledge effectively and selectively from the source domain to the target domain.
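To make the label matrix concrete, the following is a minimal sketch of how a $k \times N$ label matrix could be built when each sample belongs to exactly one class. The one-hot encoding, the helper name `one_hot_columns`, and the toy class indices are assumptions made for illustration only; the text also allows membership in multiple classes, which this sketch does not cover.

```python
def one_hot_columns(labels, k):
    """Build a k x N matrix (as a list of k rows) whose n-th column
    is the one-hot label vector of sample n."""
    return [[1 if label == c else 0 for label in labels] for c in range(k)]

# Hypothetical class indices for N = 4 source samples and k = 3 classes.
class_indices = [2, 0, 1, 2]
Y_S = one_hot_columns(class_indices, k=3)

# Each column contains a single 1, marking the class of that sample.
column_sums = [sum(row[n] for row in Y_S) for n in range(len(class_indices))]
```

The target domain has no counterpart to `Y_S`: only the sample matrix is available there.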

Training a parametric classifier using the labeled source data points is a straightforward supervised learning problem with known conditions for guaranteed generalization performance. Given a large enough number of source samples, $N$, we can consider a family of parametric functions $f_\theta : \mathbb{R}^d \rightarrow \mathcal{Y}$, e.g., a deep neural network with concatenated learnable parameters $\theta$, and solve for an optimal parameter that maps the data points to their corresponding labels using standard supervised learning with minimal true risk, defined as follows:

$$e_\theta = \mathbb{E}_{(x,y) \sim p(x^S, y^S)}\big(\mathcal{L}(f_\theta(x), y)\big), \tag{4.1}$$

where $\mathcal{L}(\cdot)$ denotes the loss function. We solve for the optimal parameter $\hat{\theta}$ of the mapping by minimizing the empirical risk on the training dataset, $\hat{e}_\theta$, as a surrogate for the true risk:

$$\hat{\theta} = \arg\min_\theta \hat{e}_\theta = \arg\min_\theta \sum_i \mathcal{L}(f_\theta(x_i^s), y_i^s). \tag{4.2}$$
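As an illustration of empirical risk minimization over labeled source data, the sketch below fits a scalar logistic model by plain gradient descent. The model form, the binary cross-entropy loss, the learning rate, and the toy data are all assumptions chosen for brevity; they stand in for the generic $f_\theta$ and $\mathcal{L}$ and are not the network or loss used in this chapter.

```python
import math

def f_theta(theta, x):
    """Hypothetical parametric classifier: logistic model on a scalar input."""
    w, b = theta
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def loss(p, y):
    """Binary cross-entropy, standing in for the generic loss L."""
    eps = 1e-12
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def empirical_risk(theta, xs, ys):
    """Sum of per-sample losses over the labeled source set."""
    return sum(loss(f_theta(theta, x), y) for x, y in zip(xs, ys))

def fit(xs, ys, lr=0.1, steps=500):
    """Approximate the arg-min of the empirical risk by gradient descent.
    The gradient is averaged over samples, which rescales the step size
    but does not change the minimizer of the summed loss."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            err = f_theta((w, b), x) - y  # derivative of the loss w.r.t. the logit
            gw += err * x
            gb += err
        w -= lr * gw / len(xs)
        b -= lr * gb / len(xs)
    return (w, b)

# Toy labeled source data (made up for illustration).
xs_source = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
ys_source = [0, 0, 0, 1, 1, 1]

theta_init = (0.0, 0.0)
theta_hat = fit(xs_source, ys_source)
risk_before = empirical_risk(theta_init, xs_source, ys_source)
risk_after = empirical_risk(theta_hat, xs_source, ys_source)
```

Here `risk_after` is lower than `risk_before`, which is all the sketch is meant to show: the fitted parameter reduces the summed loss on the source samples.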

The learned classifier $f_{\hat{\theta}}$ generalizes well on testing data points if they are drawn from the same distribution as the training data points. Only then is the empirical risk $\hat{e}_\theta$ a suitable surrogate for the true risk $e_\theta$. Given the discrepancy between the source and target distributions, $f_{\hat{\theta}}$ does not generalize well to the target domain. Therefore, the training procedure for $f_{\hat{\theta}}$ needs to be adapted by incorporating the unlabeled target data points, such that the knowledge learned from the source domain can be used to classify data points drawn from the target domain distribution by reducing the discrepancy between the source and the target domain distributions.
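The effect of the distribution discrepancy can be seen in a small, deliberately artificial sketch. A threshold classifier (a stand-in for the trained model, not the model of this chapter) is fitted on one-dimensional source data and then applied to target data that is the same data shifted by a constant. The shift value and all data points are hypothetical, and the target labels are used only to measure accuracy, never for training.

```python
def fit_threshold(xs, ys):
    """Stand-in for training: threshold at the midpoint of the two class means."""
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (mean0 + mean1) / 2.0

def accuracy(threshold, xs, ys):
    """Fraction of samples whose thresholded prediction matches the label."""
    preds = [1 if x > threshold else 0 for x in xs]
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)

# Toy source domain: class 0 around -1, class 1 around +1.
xs_source = [-1.2, -1.0, -0.8, 0.8, 1.0, 1.2]
ys_source = [0, 0, 0, 1, 1, 1]

# Toy target domain: same classes, but every input is shifted by a constant.
shift = 3.0
xs_target = [x + shift for x in xs_source]
ys_target = list(ys_source)  # held out; used for evaluation only

threshold = fit_threshold(xs_source, ys_source)
source_acc = accuracy(threshold, xs_source, ys_source)
target_acc = accuracy(threshold, xs_target, ys_target)
```

With these numbers the classifier is perfect on the source samples but assigns every target sample to class 1, so `target_acc` drops to chance level.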

The main challenge is to circumvent the discrepancy between the source and the target domain distributions. To that end, the mapping $f_\theta(\cdot)$ can be decomposed into a feature extractor $\phi_v(\cdot)$ and a classifier $h_w(\cdot)$, such that $f_\theta = h_w \circ \phi_v$, where $w$ and $v$ are the corresponding learnable parameters, i.e., $\theta = (w, v)$. The feature extractor $\phi_v : \mathcal{X} \rightarrow \mathcal{Z}$ maps the data points from both domains to an intermediate embedding space $\mathcal{Z} \subset \mathbb{R}^f$ (i.e., a feature space), and the classifier $h_w : \mathcal{Z} \rightarrow \mathcal{Y}$ maps the representations of the data points in the embedding space to the label set. Note that, as a deterministic function, the feature extractor $\phi_v$ can change the distribution of its input data. The core idea is to learn the feature extractor $\phi_v$ for both domains such that the domain-specific distributions of the extracted features are similar to one another. If $\phi_v$ is learned such that the discrepancy between the source and target distributions is minimized in the embedding space, i.e., the discrepancy between $p_S(\phi(x^s))$ and $p_T(\phi(x^t))$, then the embedding becomes domain-agnostic, and the classifier $h_w$ would generalize well on the target domain and could be used to label the target domain data points even though it is trained using only the labeled source data points. This is the core idea behind various prior domain adaptation approaches in the literature [129].
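The decomposition into a shared feature extractor and a classifier can be sketched as follows. Everything concrete here is an assumption for illustration: the encoder is a linear projection to one dimension, the classifier is a threshold, the two candidate encoder parameters are fixed by hand rather than learned, and the discrepancy is a crude gap between embedding means rather than a proper distribution metric. The toy data places the class information in the first coordinate and a domain-specific offset in the second.

```python
def phi(v, x):
    """Feature extractor phi_v: linear projection to a 1-D embedding (assumed form)."""
    return v[0] * x[0] + v[1] * x[1]

def h(w, z):
    """Classifier h_w on the embedding: a threshold at w (assumed form)."""
    return 1 if z > w else 0

def f(theta, x):
    """Composite model: classifier applied to the encoder output, theta = (w, v)."""
    w, v = theta
    return h(w, phi(v, x))

def mean(values):
    return sum(values) / len(values)

def mean_discrepancy(v, X_s, X_t):
    """Crude stand-in for a distribution discrepancy: gap between embedding means."""
    return abs(mean([phi(v, x) for x in X_s]) - mean([phi(v, x) for x in X_t]))

def fit_h(v, X_s, y_s):
    """Train h_w on labeled source embeddings only: midpoint of the class means."""
    z0 = [phi(v, x) for x, y in zip(X_s, y_s) if y == 0]
    z1 = [phi(v, x) for x, y in zip(X_s, y_s) if y == 1]
    return (mean(z0) + mean(z1)) / 2.0

def accuracy(theta, X, y):
    return sum(f(theta, x) == t for x, t in zip(X, y)) / len(y)

# Toy data: first coordinate carries the class, second is a domain offset.
X_s = [(-1.0, 0.0), (-0.5, 0.0), (0.5, 0.0), (1.0, 0.0)]
y_s = [0, 0, 1, 1]
X_t = [(x1, 3.0) for x1, _ in X_s]
y_t = list(y_s)  # used for evaluation only, never for training

# Compare an encoder that ignores the domain offset with one that does not.
results = {}
for name, v in {"aligned": (1.0, 0.0), "misaligned": (1.0, 1.0)}.items():
    w = fit_h(v, X_s, y_s)
    results[name] = (mean_discrepancy(v, X_s, X_t), accuracy((w, v), X_t, y_t))
```

In this toy setting the encoder with zero embedding discrepancy lets the source-trained classifier label every target point correctly, while the encoder that preserves the domain offset leaves the classifier at chance level on the target domain. In practice the encoder parameters would be learned by minimizing a discrepancy term jointly with the source risk rather than picked from two candidates.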

Figure 8: The architecture of the proposed unsupervised domain adaptation framework. Note that the model is completely shared across the two domains. The shared encoder maps the data points from both domains to a shared embedding space, which is modeled as the output space of the encoder. The classifier sub-network is trained using solely the labeled data from the source domain. Since the distributions of the domains are matched in this embedding space, the classifier sub-network generalizes well on the target domain.

In document Learning Transferable Knowledge Through Embedding Spaces (Page 86-88)