3.1 Multiple Task Learning

Multiple Task Learning (MTL) [Caruana, 1997] considers several tasks that are related to one another. The motivation is that learning multiple related tasks simultaneously can outperform learning them independently. The benefits include: (1) sharing information among tasks; (2) joint feature learning; and (3) the ability to train when each task has insufficient training data. As an intuitive illustration, suppose there are several schools, and each school has its own set of training samples describing students by properties such as student ID, age, and height. There is one task per school: predicting the score of a student in that school. These tasks are related because they address the same problem. The difficulty is that training samples are scarce for each task, so training a separate model per task on its limited data probably leads to poor generalization. Multi-task learning helps by training all the tasks at the same time so that they share information.

The most important issue in multi-task learning is how to model the relevance among tasks. Modeling this relevance appropriately leads to a performance boost, which is the motivation for learning multiple tasks simultaneously rather than independently. However, if the relationship among tasks is not modeled appropriately, performance will probably decrease.

Assuming there are $m$ tasks and $m$ learners, let these multiple learners be $W = [w_1, w_2, \ldots, w_m]$, where $W \in \mathbb{R}^{d \times m}$ and $d$ is the dimension of the feature space. For the $i$-th task, there are training samples $X_i \in \mathbb{R}^{d \times N_i}$ and labels or groundtruth $y_i \in \mathbb{R}^{N_i}$. In most cases, to learn multiple learners at the same time, one should minimize a cost function composed of two sub-parts. One is a cost term $f(\cdot)$ from the training data, and the other one is a regularization term $g(\cdot)$ which models the relationship among multiple learners. It can be written as

$$L = f(W, X, y) + g(W). \tag{3.1}$$

The cost term takes an ordinary form, usually the least squares loss or the Hamming distance, and measures the difference between the model prediction and the groundtruth of the training data. For instance, the least squares form of the cost term is:

$$f(W, X, y) = \sum_{i=1}^{m} \frac{1}{N_i} \left\| X_i^T w_i - y_i \right\|^2. \tag{3.2}$$
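As a minimal sketch of how the data term of Eq. (3.2) and the objective of Eq. (3.1) could be evaluated, the following NumPy code assumes each task's samples are stored as a `(d, N_i)` array. The function names and the trade-off weight `lam` are illustrative assumptions and do not appear in the formulation above; with `lam=1.0` the objective reduces to Eq. (3.1) as written.

```python
import numpy as np

def least_squares_cost(W, Xs, ys):
    """Data term of Eq. (3.2): sum_i (1/N_i) * ||X_i^T w_i - y_i||^2.

    W  : (d, m) array whose i-th column is the learner w_i of task i.
    Xs : list of m arrays, Xs[i] has shape (d, N_i).
    ys : list of m arrays, ys[i] has shape (N_i,).
    """
    total = 0.0
    for i, (X_i, y_i) in enumerate(zip(Xs, ys)):
        residual = X_i.T @ W[:, i] - y_i   # prediction error of task i
        n_i = X_i.shape[1]                 # number of samples N_i
        total += (residual @ residual) / n_i
    return total

def objective(W, Xs, ys, g, lam=1.0):
    """Eq. (3.1): data term plus regularizer g(W).

    lam is a hypothetical trade-off weight (an assumption of this sketch).
    """
    return least_squares_cost(W, Xs, ys) + lam * g(W)

# Toy problem with d = 2 features and m = 2 tasks of different sizes.
W = np.array([[1.0, 0.0],
              [2.0, 1.0]])
Xs = [np.eye(2), np.array([[1.0, 1.0, 1.0],
                           [0.0, 1.0, 2.0]])]
ys = [np.array([1.0, 2.0]), np.array([0.0, 1.0, 5.0])]
cost = least_squares_cost(W, Xs, ys)
```

Keeping `g` as a plain callable lets the same objective be paired with any regularizer that models task relatedness.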

In recent years, typical ways of associating multiple tasks include:

• Mean regularized MTL [Evgeniou and Pontil, 2004] which assumes that all tasks are related to each other, and all tasks are regularized to not drift away from the mean of all tasks. Intuitively, the regularization term penalizes the deviation of each task from the mean, which would try to make them as close as possible.

• Embedded feature selection [Liu et al., 2009, Obozinski et al., 2010] which aims to learn or select the features that are most expressive across multiple tasks, so it is also called joint feature learning. Usually these features are selected by assuming that all the models share a set of common features. In the formulation, this constraint is modeled as group sparsity of the model vectors W. When the sparsity takes effect, some dimensions of the model vectors in W become zero for all tasks, and the procedure chooses the features corresponding to the non-zero dimensions of W.

• Low-rank subspace learning [Ji and Ye, 2009] which captures the relatedness among multiple tasks. Assuming all the model vectors share a subspace, the regularization term is usually represented by the rank of the model vectors W as Rank(W). However, as rank minimization is NP-hard in general, it is usually relaxed to the trace norm (the sum of the singular values of W), which is theoretically shown to be a good convex approximation of the rank function.

• Clustered MTL [Zhou et al., 2011b] which supposes that tasks have a clustered structure, and tasks in the same cluster are closer to each other compared with the ones in another cluster. Based on this, clustered MTL captures the relevance among multiple tasks in a way similar to K-means clustering.

• Tree regularized MTL [Kim and Xing, 2010] which employs a tree structure to model the relevance among multiple tasks. Within the tree structure, tasks corresponding to nodes with the same parent node are close to each other, and the similarity between nodes/tasks is determined by the common depth that these nodes share in the tree structure.

• Graph regularized MTL [Chen et al., 2010] which utilizes a graph structure to represent the relationship among task models. In the graph structure, each vertex indicates a task model, and the edge connecting two vertices measures the similarity between the two tasks by the weight associated with it. One way to regularize the multiple tasks is to penalize the difference between the two task models joined by each edge.
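Several of the regularization terms listed above have simple closed forms. The sketch below shows commonly used versions for four of them, with W stored as a `(d, m)` array whose columns are the task models. These are illustrative assumptions: the exact formulations in the cited papers may differ (for example in scaling or in how edge weights are defined), and the function names and the `edges` format are hypothetical.

```python
import numpy as np

def mean_regularizer(W):
    """Sum of squared deviations of each task model from the mean model."""
    w_bar = W.mean(axis=1, keepdims=True)   # (d, 1) mean over tasks
    return np.sum((W - w_bar) ** 2)

def group_sparsity(W):
    """l2,1 norm: sum of the Euclidean norms of the rows of W.

    A row that is entirely zero means that feature is unused by every task.
    """
    return np.sum(np.linalg.norm(W, axis=1))

def trace_norm(W):
    """Trace (nuclear) norm: sum of singular values, a convex surrogate of rank."""
    return np.sum(np.linalg.svd(W, compute_uv=False))

def graph_regularizer(W, edges):
    """Weighted squared differences between task models joined by an edge.

    edges: iterable of (i, j, weight) tuples (format assumed for this sketch).
    """
    return sum(a * np.sum((W[:, i] - W[:, j]) ** 2) for i, j, a in edges)

# Toy model matrix: d = 2 features, m = 2 tasks; the second feature is unused.
W_toy = np.array([[3.0, 4.0],
                  [0.0, 0.0]])
penalties = {
    "mean": mean_regularizer(W_toy),
    "l21": group_sparsity(W_toy),
    "trace": trace_norm(W_toy),
    "graph": graph_regularizer(W_toy, [(0, 1, 2.0)]),
}
```

Each function takes the full matrix W and returns a scalar, so any of them can serve as g(W) in Eq. (3.1). The clustered and tree regularized variants need extra structure (cluster assignments or a tree) and are omitted from this sketch.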

In terms of applications in the computer vision community, MTL is combined with the boosting framework in [Torralba et al., 2007] to learn features shared by multiple classes for multi-class detection, which avoids constructing a specialized classifier for each class. MTL is also used for single object tracking in [Zhang et al., 2012], where the representations of multiple particles based on the collected templates are treated as multiple tasks.
