Convergence - Visual Data Association: Tracking, Re-identification and Retrieval

5.3 Optimisation

5.3.3 Convergence

This section gives a theoretical analysis: it proves the convergence of the objective function in Proposition 2.

Proposition 4: $L$ in Proposition 2 monotonically decreases with each optimisation step for $w_a^k$ and $\xi_a^{ki}$, and therefore $L$ converges to a local optimum.

Proof. Denote $J(w_a^k, \xi_a^{ki} \mid i = 1, \cdots, n)$ as the objective function in Proposition 3 and $R$ as the remaining part of the objective in Proposition 2, which is unrelated to $w_a^k$ and $\xi_a^{ki}$. Then the objective function in Proposition 2 can be written as $L = J(w_a^k, \xi_a^{ki} \mid i = 1, \cdots, n) + R$. At the $t$-th step of optimisation, suppose that $w_a^k$ has been chosen (otherwise, the same conclusion can be obtained for $w_b^k$). We then denote by $L_{t-1}$ the objective function before optimising $w_a^k$, and by $L_t$ the objective function after we obtain the optimum $(w_a^k)^*$ of $J(w_a^k, \xi_a^{ki} \mid i = 1, \cdots, n)$. Since $J(w_a^k, \xi_a^{ki} \mid i = 1, \cdots, n)$ is a convex problem, there must be $J(w_a^k, \xi_a^{ki} \mid i = 1, \cdots, n) \geq J((w_a^k)^*, \xi_a^{ki} \mid i = 1, \cdots, n)$. Moreover, because $R$ is fixed, the following inequality can be established.
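The inequality referred to in the last sentence of the proof does not survive in this copy of the text. A reconstruction consistent with the definitions of $J$, $R$, $L_{t-1}$ and $L_t$ given in the proof, offered as an assumption rather than a quotation, would read:

```latex
\begin{aligned}
L_{t-1} &= J(w_a^k, \xi_a^{ki} \mid i = 1, \cdots, n) + R \\
        &\geq J((w_a^k)^*, \xi_a^{ki} \mid i = 1, \cdots, n) + R = L_t .
\end{aligned}
```

Under this reading, no optimisation step can increase $L$, so the sequence $L_t$ is non-increasing. Convergence of the sequence additionally needs $L$ to be bounded below, which this excerpt does not state explicitly.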

5.4 Experiments

We evaluate the proposed CBI for person re-identification on two public datasets: VIPeR [23] and CUHK01 [166]. Some example images from these datasets are shown in Fig. 5.3. To illustrate the performance and efficiency of CBI, 17 recent algorithms, including 13 person re-identification methods and 4 multi-modal hash function learning methods, are used for comparison.

Image representation: In the past two years, various robust features have been proposed for person re-identification. In particular, the Salient Colour Names based Colour Descriptor (SCNCD) [161] and the Local Maximal Occurrence Feature (LOMO) [160] have achieved promising performance. In this chapter, to show that CBI can learn binary codes from different descriptors, three types of image representation are adopted as the basic descriptors: SCNCD, LOMO and ELF (Ensemble of Localised Features), which was proposed in [23].

(1) SCNCD uses 16 colour names and computes a colour distribution over these names for each image part. It divides each image into six horizontal stripes of equal size and fuses the colour-name distributions of all parts into an image-level feature. The authors provide the descriptor only for the VIPeR dataset.

(2) LOMO analyses the horizontal occurrence of local features and maximises the occurrence to obtain a representation that is stable against viewpoint changes. To handle both colour constancy and dynamic range compression, a multi-scale Retinex transform is applied. The original dimension of the LOMO feature is 26960.

(3) The ELF descriptor has been used in several methods, such as [163, 169] and [170]. Each image containing a person is divided into six horizontal stripes. For each stripe, RGB, YCbCr and HSV colour features and two types of texture features, extracted by 13 Schmid and 8 Gabor filters, are computed. Thus, each person image is described by a 2784-dimensional feature vector. More details can be found in the original paper [23].

CBI is not sensitive to the parameters on the two datasets, and we set $\lambda_1 = 2$ and $C = 200$ for all experiments. However, $\lambda_2$ is set to 0.05, 10 and 5 for ELF, SCNCD and LOMO, respectively.
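The stripe-based layout shared by SCNCD and ELF (six horizontal stripes, one distribution per stripe, all fused into a single vector) can be illustrated with a minimal sketch. This is not the SCNCD or ELF descriptor itself: plain per-channel intensity histograms stand in for colour names and filter responses, and the function name, the `bins` parameter and the use of NumPy are assumptions made for illustration.

```python
import numpy as np


def stripe_histogram_descriptor(image, num_stripes=6, bins=8):
    """Concatenate per-channel histograms from horizontal stripes.

    image: H x W x C uint8 array (for example 128 x 48 x 3).
    Hypothetical stand-in for a stripe-based descriptor, not SCNCD or ELF.
    """
    # array_split tolerates heights that are not divisible by num_stripes,
    # so the stripes are equal or nearly equal in height.
    stripes = np.array_split(image, num_stripes, axis=0)
    features = []
    for stripe in stripes:
        for channel in range(image.shape[2]):
            hist, _ = np.histogram(stripe[:, :, channel], bins=bins, range=(0, 256))
            # Normalise each histogram so it is a distribution over the bins.
            features.append(hist / max(hist.sum(), 1))
    # Fuse all stripe-level distributions into one image-level vector.
    return np.concatenate(features)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_image = rng.integers(0, 256, size=(128, 48, 3), dtype=np.uint8)
    descriptor = stripe_histogram_descriptor(fake_image)
    print(descriptor.shape)
```

With 6 stripes, 3 channels and 8 bins the vector has 6 × 3 × 8 = 144 entries. The real descriptors are far larger (2784 dimensions for ELF, 26960 for LOMO), which is part of the motivation for compressing them into binary codes.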

Evaluation protocol: We randomly partition each dataset into two parts with no overlap in person identities, according to a given percentage. The average performance over 10 trials is reported. The parameters of the other hashing algorithms are carefully tuned so that their best results are obtained. The results of the other person re-identification methods either come from the original papers or are obtained by running the released code under exactly the same experimental setting. As in most person re-identification publications, the standard Cumulative Matching Characteristic (CMC) [224] curves and the corresponding Area Under Curve (AUC) are used to illustrate the performance of the different methods.

Datasets: VIPeR contains 632 pedestrian image pairs captured in an outdoor environment. Each pair contains two images of the same individual taken from two different camera views. Changes of viewpoint, illumination and pose are the most significant causes of appearance change. Each image has been scaled to 128 × 48 pixels. The experimental setting is the same as in [170]: half of the dataset, i.e. 316 images for each view, is used for training the algorithms, and the remaining 316 pedestrians are used for testing. CUHK01 contains 971 pedestrians and is also captured with two camera views, in a campus environment, but each pedestrian has two images from each camera view. Camera A captures the frontal or back view of a pedestrian, while camera B captures the side view. All the images are normalized to 160 × 60 for evaluation. Our two settings follow [166] (100 test persons and 871 persons for training) and [225] (486 test persons and the remaining persons as training samples).

Methods   CBI-500   CBI-700   SDALF     KISSME    MLF
Time (s)  1.1e-06   1.4e-06   3.6e+00   9.2e-03   0.98e+01

Methods   PRDC      eSDC      PRSVM     MRank     SCNCD
Time (s)  9.3e-03   1.14e+01  3.2e-03   3.4e-02   4.2e-03

Table 5.1: Time comparison of computing the similarities between one probe sample and all the gallery samples (316) using the compared methods. CBI-500 denotes that only 500 hash codes have been learned.
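The steps described under "Evaluation protocol" (a disjoint identity split, ranking the gallery for each probe, then CMC and AUC) can be sketched as follows. This is a minimal illustration and not the thesis code. It assumes that binary codes are compared by Hamming distance, that the setting is single-shot with probe i matching gallery i, and that ties are broken optimistically. All function names are hypothetical.

```python
import numpy as np


def split_identities(num_ids, test_fraction, rng):
    """Randomly partition identity indices into disjoint train and test sets."""
    perm = rng.permutation(num_ids)
    n_test = int(round(num_ids * test_fraction))
    return perm[n_test:], perm[:n_test]


def hamming_distances(probe_codes, gallery_codes):
    """Pairwise Hamming distances between rows of two {0, 1} code matrices."""
    # Compare every probe row with every gallery row, then count differing bits.
    differs = probe_codes[:, None, :] != gallery_codes[None, :, :]
    return differs.sum(axis=2)


def cmc_curve(dist):
    """CMC curve, assuming probe i and gallery i show the same person."""
    n = dist.shape[0]
    true_dist = np.diag(dist)
    # Rank = number of gallery items strictly closer than the true match.
    # Ties are resolved in favour of the true match (an optimistic assumption).
    ranks = (dist < true_dist[:, None]).sum(axis=1)
    hits_per_rank = np.bincount(ranks, minlength=n)
    return np.cumsum(hits_per_rank) / n


def normalised_auc(cmc):
    """Area under the CMC curve, normalised to lie in [0, 1]."""
    return float(np.mean(cmc))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # VIPeR-style split: 632 identities, half of them held out for testing.
    train_ids, test_ids = split_identities(632, 0.5, rng)
    # Random codes stand in for learned ones; 500 bits mirrors "CBI-500".
    gallery = rng.integers(0, 2, size=(len(test_ids), 500))
    noise = rng.random(gallery.shape) < 0.2  # flip about 20% of the bits
    probe = np.where(noise, 1 - gallery, gallery)
    cmc = cmc_curve(hamming_distances(probe, gallery))
    print("rank-1:", cmc[0], "AUC:", normalised_auc(cmc))
```

Repeating the split with different seeds and averaging the resulting curves would correspond to the 10-trial average described in the protocol.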
