Discriminative Density Ratio - Adaptive Learning Algorithms for Non-stationary Data

Existing work on reweighting for domain adaptation rests on the simplifying assumption that $p_{ts}(y|x) = p_{tr}(y|x)$ and on estimating the density ratio of the marginal distributions, $\beta(x) = p_{ts}(x)/p_{tr}(x)$. However, realistic domain adaptation problems are more complex than this assumption allows. According to Bayes' rule, the prior, posterior, marginal, likelihood, and joint distributions are tightly related as $p(x, y) = p(y|x)p(x) = p(x|y)p(y)$. Actual learning settings usually involve more than one type of distribution change simultaneously.

Focusing on classification tasks, the objective is to discriminatively separate the instances into different classes. However, the conventional weighting approach to adaptation performs distribution matching over the whole input space. In other words, the existing algorithms focus on matching the training and test distributions without preserving the separation between classes in the reweighted space.

Moreover, the effectiveness of the conventional density-ratio estimation approach is limited by another constraint, the support condition (i.e. $\forall x,\; p_{tr}(x) = 0 \Rightarrow p_{ts}(x) = 0$) [62,91]. The model therefore cannot generalize well to regions where $p_{ts}(x) \neq 0$ but $p_{tr}(x) = 0$.
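The effect of a violated support condition can be seen on a toy discrete example. The sketch below is illustrative only: the three-point input space, the probabilities, and the per-point losses are all made-up assumptions, not values from the text.

```python
# Toy discrete input space; every number here is made up for illustration.
p_tr = {"a": 0.5, "b": 0.5, "c": 0.0}    # training marginal: no mass on "c"
p_ts = {"a": 0.25, "b": 0.25, "c": 0.5}  # test marginal: half its mass on "c"
loss = {"a": 1.0, "b": 2.0, "c": 10.0}   # hypothetical per-point loss of a fixed model


def true_test_risk():
    """Expected loss under the test marginal."""
    return sum(p_ts[x] * loss[x] for x in p_ts)


def reweighted_training_risk():
    """Expected loss under p_tr, reweighted by beta(x) = p_ts(x) / p_tr(x)."""
    total = 0.0
    for x, p in p_tr.items():
        if p == 0.0:
            continue  # a region without training mass contributes nothing
        total += p * (p_ts[x] / p) * loss[x]
    return total


print(true_test_risk())            # 5.75
print(reweighted_training_risk())  # 0.75
```

The reweighted estimate misses the entire contribution of the point "c", because no weight can be attached to a region the training distribution never visits.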

These two problems severely limit the effectiveness of weighting methods in domain adaptation, especially for classification tasks. Several studies have reported this problem [28,53], but none has presented a clear solution.

Motivated by these observations, we propose a Discriminative Density-Ratio (DDR) approach to learn the weights of training data discriminatively by estimating the density ratio of joint distributions in a class-wise manner to preserve the separations between classes. The DDR model aims to achieve the objectives of 1) approximating test domain risk with the reweighted training data according to joint distributions, 2) minimizing the distribution discrepancy between the training and test data in a class-wise manner, and 3) guiding the decision boundary to the sparse regions of the test data.

4.2.1 Learning Objectives

Objective 1: Minimization of approximated test domain risk. First, we discuss the situation of supervised learning where there is no distribution change between the training and test data. The general purpose of supervised learning is to minimize the expected risk

$$R(h, p(x, y), L(x, y, h)) = \iint L(x, y, h)\, p(x, y)\, dx\, dy, \tag{4.1}$$

where $h$ is the model hypothesis, $L(x, y, h)$ is a loss function, and $p(x, y)$ is the joint distribution over $x$ and $y$.

Given the presence of distribution changes between the training and test data, i.e. $p_{ts}(x, y) \neq p_{tr}(x, y)$, we seek to obtain the optimal model in the test domain by approximating the test domain risk using the following reweighting scheme:

$$\begin{aligned}
R_{ts}(h, p_{ts}(x, y), L(x, y, h)) &= \iint L(x, y, h)\, p_{ts}(x, y)\, dx\, dy \\
&= \iint L(x, y, h)\, \frac{p_{ts}(x, y)}{p_{tr}(x, y)}\, p_{tr}(x, y)\, dx\, dy \\
&= R_{tr}\!\left(h, p_{tr}(x, y), \frac{p_{ts}(x, y)}{p_{tr}(x, y)} L(x, y, h)\right).
\end{aligned} \tag{4.2}$$

Defining weights to reflect the joint distribution ratios, $w(x, y) = \frac{p_{ts}(x, y)}{p_{tr}(x, y)}$, we have

$$R_{ts} = R_{tr}(h, p_{tr}(x, y), w(x, y) L(x, y, h)). \tag{4.3}$$
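The identity in Eq. 4.3 can be checked numerically on a small discrete problem. The following is a minimal sketch under stated assumptions: the two joint distributions, the classifier `h`, and the 0-1 loss are hypothetical choices made for illustration.

```python
from itertools import product

xs, ys = (0, 1), (-1, 1)

# Made-up joint distributions p(x, y), both with full support.
p_tr = {(0, -1): 0.4, (0, 1): 0.1, (1, -1): 0.1, (1, 1): 0.4}
p_ts = {(0, -1): 0.2, (0, 1): 0.3, (1, -1): 0.3, (1, 1): 0.2}


def h(x):
    """Hypothetical fixed classifier: predict +1 iff x == 1."""
    return 1 if x == 1 else -1


def zero_one_loss(x, y):
    return 0.0 if h(x) == y else 1.0


def w(x, y):
    """Joint density ratio w(x, y) = p_ts(x, y) / p_tr(x, y)."""
    return p_ts[(x, y)] / p_tr[(x, y)]


def risk(p, weight=lambda x, y: 1.0):
    """Expected (optionally weighted) loss under the joint distribution p."""
    return sum(p[(x, y)] * weight(x, y) * zero_one_loss(x, y)
               for x, y in product(xs, ys))


r_ts = risk(p_ts)              # test domain risk, about 0.6
r_tr_weighted = risk(p_tr, w)  # reweighted training risk, matches r_ts
r_tr_plain = risk(p_tr)        # unweighted training risk, about 0.2
print(r_ts, r_tr_weighted, r_tr_plain)
```

In this toy setup the marginals of $x$ are identical in both domains, so a marginal density ratio would equal 1 everywhere and leave the training risk unchanged; only the joint ratio recovers the test risk.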

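With $n_{tr}$ observed training samples $S_{tr} = \{(x_i, y_i) \mid i = 1, \ldots, n_{tr}\}$ from $p_{tr}(x, y)$ and using a regularized risk scheme, the test domain risk can be approximated as

$$\tilde{R}_{ts} \approx \hat{R}_{tr}(h, S_{tr}, w(x, y) L(x, y, h)) = \frac{1}{n_{tr}} \sum_{(x_i, y_i) \in S_{tr}} w(x_i, y_i) L(x_i, y_i, h) + \lambda \Omega(h), \tag{4.4}$$

where $\hat{R}_{tr}$ is the regularized empirical risk over the training data, $\Omega(h)$ is the regularization term, and $\lambda$ is the trade-off parameter.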
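As a concrete reading of Eq. 4.4, the sketch below evaluates a weighted, regularized empirical risk. The linear model `theta`, the hinge loss, and the squared L2 norm used for $\Omega(h)$ are assumptions chosen for illustration; the text does not fix any of them.

```python
def hinge_loss(x, y, theta):
    """Hinge loss of an assumed linear model h(x) = <theta, x>."""
    score = sum(t * xi for t, xi in zip(theta, x))
    return max(0.0, 1.0 - y * score)


def weighted_empirical_risk(samples, weights, theta, lam):
    """(1 / n_tr) * sum_i w_i * L(x_i, y_i, h) + lam * Omega(h)."""
    n_tr = len(samples)
    data_term = sum(w_i * hinge_loss(x, y, theta)
                    for (x, y), w_i in zip(samples, weights)) / n_tr
    omega = sum(t * t for t in theta)  # assumed regularizer: squared L2 norm
    return data_term + lam * omega


# Made-up training set and importance weights w(x_i, y_i).
samples = [((1.0, 0.0), 1), ((0.0, 1.0), -1), ((1.0, 1.0), 1)]
weights = [2.0, 0.5, 1.0]

print(weighted_empirical_risk(samples, weights, theta=(0.5, -0.5), lam=0.1))
```

Setting every weight to 1 recovers the ordinary regularized empirical risk, which makes the function easy to compare against an unweighted baseline.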

Objective 2: Minimization of class-wise distribution discrepancy. The conventional sample reweighting approach assumes that the posterior distributions are unchanged ($p_{ts}(y|x) = p_{tr}(y|x)$) and simplifies the weights as the density ratio of the marginal distributions of $x$ as

$$w(x, y) = \frac{p_{ts}(x, y)}{p_{tr}(x, y)} \approx \frac{p_{ts}(x)}{p_{tr}(x)}. \tag{4.5}$$

Instead of aggressively reweighting training samples by the density ratios of the marginal distributions, our approach is to preserve the separations between classes by estimating the density ratio of joint distributions in a class-wise manner. We decompose the joint distribution from the perspective of class likelihood and class prior as

$$w(x, y) = \frac{p_{ts}(x, y)}{p_{tr}(x, y)} = \frac{p_{ts}(x|y)\, p_{ts}(y)}{p_{tr}(x|y)\, p_{tr}(y)}. \tag{4.6}$$

Let $\beta(x, y) = \frac{p_{ts}(x|y)}{p_{tr}(x|y)}$ be the density ratio of the class-conditional distributions for the same class, and $\gamma(y) = \frac{p_{ts}(y)}{p_{tr}(y)}$ be the ratio of priors. Then, Eq. 4.6 can be written as

$$w(x, y) = \beta(x, y)\, \gamma(y). \tag{4.7}$$
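The factorization in Eq. 4.7 can be verified on a small discrete example. This is only a sketch: the class-conditional tables and priors below are made-up numbers, and in practice the test-side quantities are unknown and must be estimated.

```python
# Made-up class-conditionals p(x | y) and priors p(y) for both domains.
p_tr_x_given_y = {-1: {0: 0.8, 1: 0.2}, 1: {0: 0.2, 1: 0.8}}
p_ts_x_given_y = {-1: {0: 0.5, 1: 0.5}, 1: {0: 0.4, 1: 0.6}}
p_tr_y = {-1: 0.5, 1: 0.5}
p_ts_y = {-1: 0.25, 1: 0.75}


def beta(x, y):
    """Class-conditional density ratio p_ts(x | y) / p_tr(x | y)."""
    return p_ts_x_given_y[y][x] / p_tr_x_given_y[y][x]


def gamma(y):
    """Ratio of class priors p_ts(y) / p_tr(y)."""
    return p_ts_y[y] / p_tr_y[y]


def w(x, y):
    """Class-wise weight w(x, y) = beta(x, y) * gamma(y)."""
    return beta(x, y) * gamma(y)


def joint_ratio(x, y):
    """Direct joint density ratio p_ts(x, y) / p_tr(x, y), for comparison."""
    return (p_ts_x_given_y[y][x] * p_ts_y[y]) / (p_tr_x_given_y[y][x] * p_tr_y[y])


for y in (-1, 1):
    for x in (0, 1):
        print((x, y), w(x, y), joint_ratio(x, y))
```

Keeping $\beta$ and $\gamma$ separate means a pure change in class proportions is absorbed by $\gamma$ alone, while $\beta$ only has to capture how each class moves within the input space.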

However, because the test data carry no label information, $\beta$ and $\gamma$ cannot be estimated directly. The solution is to use the current model's predictions on the test data $X_{ts}$ to estimate $\beta$ and $\gamma$. The details are given in Section 4.3.2. Here, the objective is to minimize the following class-wise distribution discrepancy:

$$\begin{aligned}
D_{cw}(\beta(x, y), p_{tr}(x, y), p_{ts}(x), h) &= \sum_{c \in C} D\left[\beta(x, y = c)\, p_{tr}(x|y = c),\; p_{ts}(x|y = c)\right] \\
&= \sum_{c \in C} D\left[\beta(x, y = c)\, p_{tr}(x|y = c),\; \frac{p_{ts}(y = c|x)}{p_{ts}(y = c)}\, p_{ts}(x)\right],
\end{aligned} \tag{4.8}$$

where $p_{ts}(y = c|x)$ and $p_{ts}(y = c)$ are the posterior and prior of the test data estimated by the current model $h$, and $D\left[\beta(x, y = c)\, p_{tr}(x|y = c),\; p_{ts}(x|y = c)\right]$ is the distribution discrepancy for class $c$ between the weighted training data $\beta(x, y = c)\, p_{tr}(x|y = c)$ and the test data $p_{ts}(x|y = c) = \frac{p_{ts}(y = c|x)}{p_{ts}(y = c)}\, p_{ts}(x)$.

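With the training and test collections $S_{tr}$ and $X_{ts}$, the empirical class-wise distribution discrepancy can be approximated as

$$D_{cw}(\beta(x, y), S_{tr}, X_{ts}, h) = \sum_{c \in C} D\left[\beta\, X_{tr \mid y_{tr} = c},\; \frac{p_{ts}(y = c \mid x' \in X_{ts})}{\sum_{x' \in X_{ts}} p_{ts}(y = c \mid x')}\, X_{ts}\right]. \tag{4.9}$$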
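The second argument of $D$ in Eq. 4.9 attaches a class-specific weight to every unlabeled test sample. The sketch below computes those weights from model posteriors; the function name `soft_test_weights` and the posterior values are hypothetical, and each class is assumed to receive nonzero total posterior mass.

```python
def soft_test_weights(posteriors):
    """Turn per-sample posteriors p_ts(y=c | x') into class-wise sample weights.

    posteriors: one dict {class: probability} per test sample.
    Returns (priors, weights), where priors[c] estimates p_ts(y=c) as the mean
    posterior and weights[c][j] = p_ts(y=c | x'_j) / sum_j' p_ts(y=c | x'_j').
    """
    classes = list(posteriors[0].keys())
    n_ts = len(posteriors)
    totals = {c: sum(p[c] for p in posteriors) for c in classes}
    priors = {c: totals[c] / n_ts for c in classes}
    weights = {c: [p[c] / totals[c] for p in posteriors] for c in classes}
    return priors, weights


# Made-up posteriors from a hypothetical current model h on four test samples.
posteriors = [
    {-1: 0.9, 1: 0.1},
    {-1: 0.6, 1: 0.4},
    {-1: 0.1, 1: 0.9},
    {-1: 0.4, 1: 0.6},
]
priors, weights = soft_test_weights(posteriors)
print(priors)
print(weights[1])
```

For each class the weights sum to one, so the weighted test set acts as an empirical stand-in for $p_{ts}(x|y = c)$ that can be matched against the reweighted training samples of the same class.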

Using different measures of distribution discrepancy leads to different density-ratio estimation algorithms (see Chapter 3). For example, using the least-squares error (LSE) as the objective function yields a uLSIF-based algorithm for class-wise density-ratio estimation.

Objective 3: Maximization of test data margin. For the shifted but unknown distributions, we also intend to force the classification boundary to lie in sparse regions of the unlabeled test data. This means that it is preferable to maximize the test data margin. Exploiting this characteristic of the test data can alleviate the model's generalization limits in the unsupported regions where $p_{ts}(x) \neq 0$ but $p_{tr}(x) = 0$.

Maximizing the test data margin coincides with minimizing the margin loss over the test data. As a result, the idea of hinge loss from semi-supervised learning [68] can be used to express the margin loss, which is defined as

$$\mathrm{MarginLoss}(h, X_{ts}) = \sum_{x'_j \in X_{ts}} \max\left(0,\; 1 - |h(x'_j)|\right), \tag{4.10}$$

where $h(x'_j)$ is the decision value of the model output for the test sample $x'_j$.
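Eq. 4.10 needs only the decision values on the unlabeled test samples. A minimal sketch follows; the decision values are made up for illustration.

```python
def margin_loss(decision_values):
    """Sum over test samples of max(0, 1 - |h(x'_j)|)."""
    return sum(max(0.0, 1.0 - abs(v)) for v in decision_values)


# Hypothetical decision values h(x'_j) on five unlabeled test samples.
decision_values = [2.0, -1.5, 0.25, -0.5, 0.0]
print(margin_loss(decision_values))  # 2.25
```

Samples with $|h(x'_j)| \geq 1$ contribute nothing, while samples close to the decision boundary are penalized most, which is what pushes the boundary toward sparse regions of the test data.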

4.2.2 DDR Optimization Problem

Combining the aforementioned three objectives, we formulate the Discriminative Density-Ratio (DDR) Optimization Problem as:

$$\{h^*_{ts}, w^*\} = \arg\min_{h, \beta, \gamma, w} \left\{ \hat{R}_{tr}(h, S_{tr}, w L(x, y, h)) + \lambda_1 D_{cw}(\beta(x, y), S_{tr}, X_{ts}, h) + \lambda_2 \mathrm{MarginLoss}(h, X_{ts}) \right\}$$

where $\lambda_1$ and $\lambda_2$ are trade-off parameters to balance the importance of the three terms.

The DDR problem is not trivial to solve since the first two terms are convex while the last term is non-convex. We will present two effective solutions to the DDR problem in the next section.
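One way to see why the margin term breaks convexity of the overall objective is a midpoint check on the per-sample term $\max(0, 1 - |v|)$ as a function of the decision value $v$. The snippet is a small illustrative check, not part of the proposed solution methods.

```python
def margin_term(v):
    """Per-sample margin loss as a function of the decision value v."""
    return max(0.0, 1.0 - abs(v))


a, b = -1.0, 1.0
midpoint_value = margin_term(0.5 * (a + b))          # value at v = 0
chord_value = 0.5 * (margin_term(a) + margin_term(b))  # average of the endpoints

# A convex function must satisfy midpoint_value <= chord_value.
violates_convexity = midpoint_value > chord_value
print(midpoint_value, chord_value, violates_convexity)  # 1.0 0.0 True
```

Because the term peaks at the decision boundary and vanishes at both $v = -1$ and $v = 1$, a plain convex solver cannot be applied to the combined objective directly.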
