Transfer learning - Related work for label efficient learning

2.2 Related work for label efficient learning

2.2.2 Transfer learning

Transfer learning is an emerging field in machine learning research, that aims to improve the sample complexity of learning problems with the help of additional information and knowledge sources. The main idea is to transfer useful knowledge/information learned in one domain or task (source) to another (target) domain or task. This would help to increase the amount of information that the target learner can learn form, which leads to improved predictive performance. This is not a trivial problem because we need to find out what information is useful and beneficial for the target learner, and how to transfer that information.

2.2.2.1 Core concepts For the discussion of transfer learning we need to give the def- initions of the core concepts: domain, task and transfer learning. For illustration let us consider an application example: disease diagnosis, where given patient health records, the goal is to classify them into certain classes.

Domain, denoted byD, consists of two components: the feature spaceX and the marginal distributionP(X).

Task, denoted by T, consists of two components: the label space Y and a predictive

function f that maps the feature space to the label space: f :X _→Y. From probabilistic

point of view, f is the conditional probability P(Y_|X). The function f needs to be learned from the training data, and can be used to predict the label of a future example.

Source domainandsource taskare denoted byDS andTS, whereastarget domain

andtarget taskare denoted byDT andTT, respectively.

Transfer Learning is a learning technique that aims to improve the predictive per-

formance of the function f_T in DT by using the information and knowledge inDS andTS,

whereDS6=DT and/orTS6=TT.

GivenDS,TS,DT andTT, we may have the following cases:

• The case when DS =DT and TS =TT. This is the traditional classification learning,

where we have one domain and one task.

• The case whenDS6=DT. In this case either (1) the feature spaces in source and target

bio characteristics, or (2) the feature space is the same but feature distributions are different: P(X_S)6=P(X_T), e.g. patient data in source and target domains focus on different patient demographics (different countries, different age groups, etc.)

• The case when TS6=TT. In this case we have either (1) the label spaces in source and

target domains are different, e.g. there are two classes in the source domain and five classes in the target domains, or (2) the label space is the same but the label distributions are different: P(YS|XS)6=P(YT|XT), e.g. the source and the target set of patient records

have very different ratios of positive/negative examples.

• The case whenDS6=DTandTS6=TT. This is the combination of the previous two cases.

2.2.2.2 Types of information to be transferred The most important question of transfer learning is "What to transfer ?". According to [Pan and Yang, 2010], there can be four different types of information to be transferred from the source domain/task to the target domain/task:

• Instance transfer. The main idea of this approach is to transfer a set of labeled training instances from the source data to the target data. The predictive performance is expected to increase because the target learner has acquired more labeled data to learn from. The source data usually cannot be used directly for the target task, so typically some re-weighting technique is developed to assign the weights or importance of the transferred data instances in the new domain/task. Some related works in this area are [Dai et al., 2007], [Jiang and Zhai, 2007], [Zadrozny, 2004].

• Feature representation transfer. The main idea of this approach is to utilize information from the source and target domains to learn a good feature representation that reduces the difference between these domains and decrease error rates. The predictive performance is expected to increase with the new (better) feature set. Some related

works are [Argyriou et al., 2007a], [Lee et al., 2007], [Ruckert and Kramer, 2008]. A

representative work is [Argyriou et al., 2007a], where the authors proposed to learn a

low-dimensional feature representation that is shared between source and target tasks. • Parameter transfer. The main idea of this approach is to assume that the source and target tasks have some shared parameters or prior distributions of hyper-parameters

and the approach aims to increase the performance of the target learner by exploiting

these parameters. Some related works for the parameter transfer approach are ( [Evge-

niou and Pontil, 2004], [Gao et al., 2008], [Lawrence and Platt, 2004]). In Section2.2.2.3

we will give more details about [Evgeniou and Pontil, 2004] - a representative work in

this area.

• Relation transfer. The main idea of this approach is to assume that some relations among the data in the source and target domains are similar. Therefore, the learner in the target domain can benefit by transferring these relations from the source domain. Methods for relation transfer typically use some statistical learning technique (e.g. Markov Logic network) to transfer relation in data from the source to the target

domain. Some related works in this area are ( [Mihalkova et al., 2007], [Davis and

Domingos, 2009]). Note that data are usually in relational domains (e.g. entities are predicates and their relations are in first-order logic), and not assumed to be indepen- dent and identically distributed (i.i.d).

2.2.2.3 Multitask learning Multitask learning is a special type of transfer learning. The reason we have a special interest in multitask learning is that it motivated our so-

lution for the multi-annotator learning problem (Section 2.2.4 and Chapter 4). The main

differences between multitask learning and (general) transfer learning are:

• In multitask learning we have many different tasks and the domain of all tasks is the same, whereas in transfer learning the relations between tasks and between domains could be in any combination (Section2.2.2.1).

• While transfer learning focuses on improving the predictive performance of the target learner, multitask learning learns all tasks simultaneously and aims to improve the performance of all involved learners.

The main idea of multitask learning is that it assumes involving tasks are different but related, i.e. they share some related information that can help to improve the learning of all tasks simultaneously. Related works in this area aim to discover this shared information

the information among tasks through the shared hidden layer nodes in Neural Networks. [Yu et al., 2005], [Argyriou et al., 2007b], [Evgeniou and Pontil, 2004] transfer information through shared parameters of task-specific models.

A representative work in this area is [Evgeniou and Pontil, 2004], where the authors

proposed a SVM-based method for multitask learning. The idea is to separate the weight

vector w to be learned by the SVM into two components: a shared weight vector that is

common for all tasks; and a task specific weight vector, one for each task:

wt=w0+vt, t∈{1..T}

where T is the number of tasks, w0 is the shared common weight vector, wt is the weight

vector for tasktandv_t is the difference betweenw₀ andv_t.

w0 andwt are incorporated into a single optimization problem and learned simultane-

ously during the optimization process:

min w0,vT,ξit T X t=1 m X i=1 ξit + λ1 T T X t=1 kv_t_k2₊λ2_kw₀_k2 s.t. ∀i_∈{1, 2, ...,m} and t_∈{1, 2, ...,T}: y_it(w₀₊v_t)·x_it _≥ 1−ξit, ξit ≥ 0

where mis the number of training examples,ξit are slack variables that penalize misclas-

sification errors,λ1andλ2are regularization constants andxitare feature vectors.

Note that we adopt the idea of learning the common and specific tasks for solving the

multi-annotator learning problem (Section2.2.4). Our approach learns a common consensus

model and annotator-specific models simultaneously. However, we go beyond the Evgeniou and Pontil’s multitask learning approach by also incorporating the consistency and bias of different annotators into the optimization process. Details of our multi-annotator learning

In document Efficient Learning with Soft Label Information and Multiple Annotators (Page 35-39)