Thesis Outline - Visual Data Association: Tracking, Re-identification and Retrieval

The rest of this thesis is organised as follows:

Chapter 2 contains a full literature review of basic learning methods and previous work relevant to visual data association. First, the basic concepts of machine learning, including classification, regression and ranking, are introduced. These methods form the basis of our solutions for visual data association, each addressing a different aspect of the essential problems. Then, related work on single-camera object tracking, cross-camera person re-identification and cross-modal retrieval is described and analysed in turn.

Chapter 3 presents a Learn++ (LPP) tracker, which is used to dynamically sample competitive classifiers for robust, long-term object tracking. In particular, an efficient descriptor that can be selected by classifiers is first given. Then, an empirical analysis of “concept drift” problems is used to guide the design of the tracker. Next, the Learn++ based tracker is proposed to overcome the challenges of the non-stationary environment in object tracking. Finally, extensive experiments show that the LPP tracker yields state-of-the-art performance under various challenging environmental conditions and, in particular, can overcome several challenges simultaneously.

Chapter 4 describes a winner-take-all (WTA) strategy that selects a winner tracker (considering both accuracy and efficiency) from a set of prevailing methods to tackle the current challenge, according to features extracted from the present environment and an efficiency factor. To this end, a structural regression model that characterises the trackers is first discussed. The chapter then explains how to select the most suitable tracker, how to locate the target and how to update the trackers. The proposed WTA framework is tested on a large benchmark dataset, and extensive experimental results illustrate that WTA can significantly improve both performance and efficiency.

Chapter 5 proposes a method to learn cross-view binary identities (CBI) for fast person re-identification. To achieve this, three aspects are considered simultaneously: minimising the distance in the Hamming space, maximising the cross-covariance and maximising the margin. This chapter also gives a theoretical proof of when it is safe to transfer the problem in Hamming space to a problem in Euclidean space, and of what constraints need to be considered. Extensive experiments on two public datasets show that CBI produces results comparable to state-of-the-art re-identification approaches, while being at least 2200 times faster than these non-hashing methods.

Chapter 6 provides a novel method termed hetero-manifold regularisation (HMR) to supervise the learning of hash functions for efficient cross-modal search. The hetero-manifold integrates multiple sub-manifolds, defined by homogeneous data, with the help of cross-modal supervised information. In this chapter, various definitions of hetero-graphs for different conditions are first discussed in full. Next, a novel cumulative distance inequality, defined on the hetero-manifold, is introduced. Then, cross-modal hashing is transformed into a problem of hetero-manifold regularised support vector learning and solved by a sequential optimisation method. Lastly, comprehensive experiments on four datasets show that the proposed HMR achieves advantageous results over state-of-the-art methods in several challenging cross-modal tasks.

To give a holistic view of the contents and structure of this thesis, an overview of the main developments is given in Fig. 1.7.

[Figure 1.7. Diagram panels: Chapter 1 Introduction; Chapter 2 Literature review; Chapters 3 to 6 as listed below; Chapter 7 Conclusion.]

Chapter 3: An ensemble system for object tracking. A set of Bayesian classifiers in the same function space is considered. For every challenge, an optimal classifier can be approximated in a subspace spanned by the selected competitive classifiers which can address the current problem according to the distribution of the samples and recent performance.

Chapter 4: A winner-take-all strategy for object tracking. To further improve the diversity of a system, a winner-take-all strategy is exploited to select a winner tracker which is most suitable and efficient to tackle the current challenge, according to motion features extracted from the current environment and an efficiency factor.

Chapter 5: Learning cross-view identities for person re-identification. To address the problems in cross-camera person re-identification, a set of hash functions for each view is learned to project all samples captured in different views into a common Hamming space. Then, person re-identification can be solved by efficiently computing and ranking the Hamming distances between the images.

Chapter 6: Hetero-manifold regularisation for cross-modal retrieval. By integrating the supervision information and the local structure of heterogeneous data, a novel method termed hetero-manifold regularisation (HMR) is proposed to learn hash functions for efficient cross-modal search. Thus, the similarity between each pair of heterogeneous data could be naturally measured by three-order random walks on this hetero-manifold.

Chapter 2 Literature Review

This chapter first provides a broad review of basic knowledge and concepts, including classification, regression and ranking, which support our solutions and findings in the following chapters. It then introduces the background of current research in visual data association at three levels: the single-camera setting, the cross-camera setting and the cross-modality setting.

2.1 General Learning Framework

Machine learning is concerned with developing computer programs that solve a broad range of task-specific problems, and with the theoretical study of computational learning in artificial intelligence. Its essential element is giving computers the ability to learn iteratively from data (experience) and to make decisions based on what they have learned, without being explicitly programmed. Suppose a dataset of observations is given:

$$X = \{x_1, x_2, \cdots, x_i, \cdots, x_N\}, \quad x_i \in \mathcal{X}, \tag{2.1}$$

and the corresponding latent variable:

$$Y = \{y_1, y_2, \cdots, y_i, \cdots, y_N\}, \quad y_i \in \mathcal{Y}. \tag{2.2}$$

The purpose of learning is to build a model f ∈ F that bridges the input x and the output y, where the hypothesis space is F = {f | y = f(x)} and N is the number of samples. In general, the samples x can be data captured by any sensor and can take diverse structural forms, including vectors, matrices and tensors. For example, an image captured by a camera is denoted by a matrix. The output y generally refers to the supervised information, such as labels, clusters or other high-level semantic variables. The output takes a variety of forms, ranging from two values {+1, −1}, numbers and real values to more complex structures, including vectors, matrices and tensors, which arise in structured output learning. The general flow of learning, detailing the procedure of training and making decisions, is shown in Fig. 2.1.

Figure 2.1: The general learning flow for classical tasks, including classification, regression and clustering. [Diagram labels: Learning; Sample Dataset (x, y); Hypothesis Space F; Latent Sample Distribution; Making Decision; Task: Classification, Regression, Clustering, Representation, ...]
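As a minimal sketch of this flow, the snippet below picks one model f from a small, finite hypothesis space F and then uses it to make decisions. The choice of one-dimensional threshold functions as hypotheses, the selection rule (lowest error on the given samples) and all names and data values are illustrative assumptions, not something specified in this thesis.

```python
from typing import Callable, List, Sequence

# A hypothesis maps one observation x to an output y in {+1, -1}.
Hypothesis = Callable[[float], int]


def make_threshold(t: float) -> Hypothesis:
    """Hypothetical 1-D hypothesis: f(x) = +1 if x >= t, otherwise -1."""
    return lambda x: 1 if x >= t else -1


def empirical_error(f: Hypothesis, xs: Sequence[float], ys: Sequence[int]) -> float:
    """Fraction of samples on which f(x_i) differs from y_i."""
    return sum(f(x) != y for x, y in zip(xs, ys)) / len(xs)


def learn(hypotheses: List[Hypothesis], xs: Sequence[float], ys: Sequence[int]) -> Hypothesis:
    """Select the hypothesis with the lowest error on the sample dataset."""
    return min(hypotheses, key=lambda f: empirical_error(f, xs, ys))


# Sample dataset (X, Y) with N = 6 observations (made-up values).
xs = [0.1, 0.4, 0.5, 0.9, 1.2, 1.5]
ys = [-1, -1, -1, 1, 1, 1]

# Finite hypothesis space F built from a few candidate thresholds.
F = [make_threshold(t) for t in (0.0, 0.3, 0.7, 1.0, 1.4)]

f_star = learn(F, xs, ys)

# Making decisions on unseen inputs with the selected model.
decisions = [f_star(0.2), f_star(1.1)]
```

Richer hypothesis spaces (linear models, kernels, neural networks) generally replace the exhaustive search used here with an optimisation procedure, but the roles of X, Y, F and f stay the same.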

In the taxonomy of machine learning, most criteria depend on the output variable y. First, according to whether y is present, most machine learning methods can be classified into supervised learning, in which all samples are labelled; semi-supervised learning, in which part of the samples (generally a small portion) are labelled and the rest are unlabelled; and unsupervised learning, in which all samples are unlabelled. Second, according to the form of the output y, most methods can be categorised into classification, in which the latent variable takes values only from {+1, −1}; clustering, in which the latent variable denotes the cluster index; and regression, in which the latent variable is a real value. Finally, according to the functional form of the hypothesis space, most machine learning methods can be grouped into linear, non-linear and kernel-based methods. For example, neural networks that adopt a non-linear activation function are non-linear methods. Other taxonomies also exist, e.g., depending on the size of the hypothesis space or the search strategy.
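The first criterion, the presence of y, can be illustrated with a short sketch. The function name and the convention of marking a missing label with None are assumptions made for this example only.

```python
from typing import Optional, Sequence


def learning_setting(labels: Sequence[Optional[int]]) -> str:
    """Name the learning setting from label presence (None marks a missing label)."""
    if len(labels) == 0:
        raise ValueError("at least one sample is required")
    n_labelled = sum(y is not None for y in labels)
    if n_labelled == len(labels):
        return "supervised"       # every sample has a label
    if n_labelled == 0:
        return "unsupervised"     # no sample has a label
    return "semi-supervised"      # only some samples have labels


settings = [
    learning_setting([+1, -1, +1]),
    learning_setting([+1, None, None, None]),
    learning_setting([None, None]),
]
```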

At first glance, classical machine learning methods may seem unrelated to the task addressed in this thesis: data association. In fact, data association can be achieved if these classical learning tasks are treated as an intermediate step. A straightforward strategy is to associate samples by grouping the latent variable y in a certain way, or by defining a similarity measure on the latent variable. The next subsection bridges the gap between the classical tasks and the task of data association.
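The two strategies can be sketched as follows, under stated assumptions: grouping is taken to mean collecting samples that share the same latent value, and similarity is taken to be the negative squared Euclidean distance between latent vectors. Both choices, and all function names and data values, are hypothetical and serve only to make the idea concrete.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def associate_by_grouping(sample_ids: Sequence[str],
                          latents: Sequence[Hashable]) -> Dict[Hashable, List[str]]:
    """Associate samples that share the same latent value y."""
    groups: Dict[Hashable, List[str]] = defaultdict(list)
    for sid, y in zip(sample_ids, latents):
        groups[y].append(sid)
    return dict(groups)


def associate_by_similarity(query: Sequence[float],
                            gallery: Sequence[Sequence[float]]) -> int:
    """Return the index of the gallery latent vector most similar to the query.

    Similarity is assumed to be the negative squared Euclidean distance.
    """
    def similarity(a: Sequence[float], b: Sequence[float]) -> float:
        return -sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    return max(range(len(gallery)), key=lambda j: similarity(query, gallery[j]))


# Strategy 1: samples "a" and "c" share y = +1, samples "b" and "d" share y = -1.
groups = associate_by_grouping(["a", "b", "c", "d"], [1, -1, 1, -1])

# Strategy 2: the query is matched to the closest latent vector in the gallery.
match = associate_by_similarity([0.9, 0.1], [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]])
```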
