An Improved Kernelized Correlation Filter with the Histogram in Hue Saturation Value Color Space for Object Tracking


2017 3rd International Conference on Computer Science and Mechanical Automation (CSMA 2017) ISBN: 978-1-60595-506-3


Wen-Qing HUANG 1,a,* and Li MEI 1,b

1 School of Information Science and Technology, Zhejiang Sci-Tech University, Hangzhou, China

a [email protected], b [email protected]

*Wen-Qing HUANG

Keywords: Visual tracking, Correlation filter, Drift problem, Histogram.

Abstract. Visual tracking is a significant problem in computer vision that requires both robustness and real-time performance. Correlation filters have achieved state-of-the-art results on this problem: they exploit the circulant structure of the sample matrix and run at a high frame rate. However, the circulant structure is severely impacted by background, deformation and illumination variation, so the position finally determined by the algorithm tends to drift. In this paper, we propose a strategy to tackle the drift problem. We introduce the histogram of the Hue-Saturation-Value (HSV) color space into the kernelized correlation filter in order to obtain a new position for redetection. Finally, we evaluate our algorithm on the online tracking benchmark (OTB) and the visual object tracking benchmark (VOT). On both datasets, the proposed approach improves on the original tracker and achieves promising performance compared with other state-of-the-art trackers.

Introduction

Visual object tracking is a very popular problem in computer vision. Many methods have been proposed and have achieved state-of-the-art performance. However, it remains a challenging problem owing to full occlusion, illumination and scale variation, deformation, rotation and background clutter.

Tracking algorithms can be divided into two primary categories: generative and discriminative. The former category [1,2,3] extracts features of the object to build a model and then searches for it in the next frame. The latter category [4,5,6] treats tracking as a classification problem that separates the object from the background. In addition, deep learning based methods and correlation filters have been applied to object tracking. Deep learning based methods [7,8,9] rely on neural networks, which need a large number of training samples and take a long time to determine the parameters. Therefore, it is hard for deep learning based methods to run in real time, although they outperform all the other tracking algorithms. Correlation filters [10,11,12,13] use the Fast Fourier Transform (FFT) and a few element-wise products, attaining very high speed and promising performance. However, correlation filters depend on the circulant structure of the sample matrix. Unfortunately, factors including background, deformation and illumination variation severely disturb this structure in object tracking. Thus, the finally determined position of the tracked object deviates from the ground truth, i.e., the tracker drifts.

In this paper, we propose a strategy to tackle the location deviation problem. We introduce the histogram of the HSV color space to obtain a new position for redetection by the correlation filter. In addition, a simple scale estimation method is applied. Finally, we evaluate our algorithm on the online tracking benchmark (OTB) and the visual object tracking benchmark (VOT). The proposed approach improves on the original tracker and achieves promising performance compared with other state-of-the-art trackers.

Related Work


performance. Generative tracking algorithms learn an appearance model and then search for it in the next frame using some similarity measure. Ross et al. [14] propose an incremental visual tracker, and Cruz-Mota et al. [1] modify it by giving different weights to image samples when generating the subspace. Weighting the contribution of each sample reduces the impact of unfavorable samples, alleviating the risk of model drift. Tian et al. [2] present a visual tracking framework based on an adaptive color attention tuned local sparse model, which uses color attention to tune the local sparse representation based appearance similarity measure between the object template and the candidates. Zhu et al. [3] propose a weighted part context learning method for visual tracking consisting of an appearance model, an internal relation model and a context relation model. These three models effectively capture spatio-temporal relations to enhance the tracker's performance.

Discriminative tracking algorithms focus on how to distinguish the region of interest from the background. Binh [4] presents an online boosting framework for efficient object tracking, which trains and boosts classifiers online in a supervised way. However, online boosting suffers from drift because errors accumulate during tracking. MIL and SVM based tracking methods have also been proposed for visual object tracking. Li et al. [5] introduce a tracker based on online multiple instance boosting, which employs a Gaussian mixture model and a single Gaussian distribution to model the features of instances in positive and negative bags respectively and handles the drift problem well. Zhang et al. [6] develop a fuzzy least squares support vector machine approach, which formulates tracking as a fuzzy classification problem rather than a binary classification problem.

Since deep learning [15] was proposed by Hinton et al., tracking methods based on it have been put forward as well. Wang et al. [7] train a stacked denoising autoencoder offline to learn generic image features from auxiliary natural images. In [8], CNN features pretrained on ImageNet are used, and their properties are studied in depth on massive image data and the classification task; in addition, a principled feature map selection method is developed to select discriminative features and discard noisy or unrelated ones. Nam et al. [9] use a large set of videos with tracking ground truth to pretrain a CNN and propose a novel tracking algorithm based on it.

In recent years, correlation filters have become very popular in visual object tracking because of their efficiency and robustness. Such a filter generally uses a gray-scale template or HOG [16] features to represent the object of interest and the discrete Fourier transform to improve efficiency. Bolme et al. [10] first present the Minimum Output Sum of Squared Error (MOSSE) filter, which uses only gray-scale samples to train the filter. After MOSSE, many correlation filter based algorithms have been proposed, demonstrating notable improvements by applying a scale estimation strategy, the kernel trick, and multi-channel filters learned on multi-dimensional features such as HOG [16] and Color Names [17]. DSST [11] builds a scale pyramid with 33 scales to accomplish accurate scale estimation. Its scale estimation method is separate and generic, so it can be incorporated into any tracking method that has no inherent scale estimation. KCF [12] takes full advantage of the properties of circulant matrices and the FFT to reduce the computational cost and makes use of the kernel trick to handle nonlinear cases. However, KCF, like other correlation filters, suffers from unwanted boundary effects that limit tracking performance. Therefore, Danelljan et al. [13] propose SRDCF to alleviate the boundary effects. In SRDCF, a spatial regularization component is introduced in the learning to penalize correlation filter coefficients depending on their spatial location. Consequently, the algorithm can learn a filter on larger regions, leading to a more discriminative appearance model and better performance.

Modified Tracker

In this section, we describe our tracker in detail. It is based on the Kernelized Correlation Filter (KCF). The HSV color space histogram of the tracked object is introduced into KCF to improve performance.


detection. This filter exploits the properties of circulant matrices and diagonalizes them with the discrete Fourier transform. Therefore, it needs only a few element-wise products in the Fourier domain rather than a convolution in the spatial domain, which consumes substantial computational resources, and thus runs at a high frame rate.

As noted, KCF depends on the circulant structure of the sample matrices. However, this structure is severely disturbed by a host of factors such as background, deformation and illumination variation. These factors destroy the cyclic structure, so the position directly determined by the filter is inaccurate and drifts. In addition, KCF is unsuitable for tracking fast-moving objects because of its limited padding area. In theory, a large padding area benefits performance; unfortunately, it also contains more background noise, which severely degrades tracking. To alleviate the negative influence of these factors, we introduce the histogram of the HSV color space to improve tracking performance.
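As a minimal sketch of why this is cheap, the snippet below checks on a 1-D toy signal that circular convolution in the spatial domain equals an element-wise product of spectra. The naive `dft` helper and the sample values are illustrative assumptions and not part of KCF; a real tracker would use an optimized 2-D FFT.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (use an FFT library in practice)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[k] * cmath.exp(sign * 2j * cmath.pi * f * k / n) for k in range(n))
           for f in range(n)]
    return [v / n for v in out] if inverse else out

def circular_convolution(x, h):
    """Spatial-domain circular convolution, i.e. a circulant matrix-vector product."""
    n = len(x)
    return [sum(x[m] * h[(k - m) % n] for m in range(n)) for k in range(n)]

def fourier_convolution(x, h):
    """Same result via an element-wise product of the two spectra."""
    X, H = dft(x), dft(h)
    return [v.real for v in dft([a * b for a, b in zip(X, H)], inverse=True)]

x = [1.0, 2.0, 3.0, 4.0]
h = [0.5, 0.0, -1.0, 0.25]
spatial = circular_convolution(x, h)
spectral = fourier_convolution(x, h)
```

Both lists agree up to floating-point error, which is the property the filter relies on.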

Now, we formulate the specific approach as follows. Let P0 = (x0, y0) denote the position initially located by KCF, and let Hist(0) denote the histogram of the tracked object at P0. We produce a series of new positions P = {P1, P2, ..., PN} near P0 by adding offset values to the abscissa and ordinate respectively,

$$\begin{pmatrix} x_i \\ y_i \end{pmatrix} = \begin{pmatrix} x_0 \\ y_0 \end{pmatrix} + \begin{pmatrix} \Delta x \\ \Delta y \end{pmatrix} \tag{1}$$

where ∆x and ∆y are the offset values.
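The candidate set P can be sketched as a grid of offsets around P0. The range of ±12 follows the implementation details reported later; the sampling stride `step` and the function name are assumptions of this sketch, since the paper does not state how densely the offsets are sampled.

```python
from itertools import product

def candidate_positions(p0, max_offset=12, step=4):
    """Return positions P_i = P_0 + (dx, dy) around p0, excluding p0 itself.

    max_offset=12 matches the reported offset range; step is an assumed stride.
    """
    x0, y0 = p0
    offsets = range(-max_offset, max_offset + 1, step)
    return [(x0 + dx, y0 + dy)
            for dx, dy in product(offsets, offsets)
            if (dx, dy) != (0, 0)]

positions = candidate_positions((100, 80))
```

With these defaults the grid has 7 offsets per axis, so 48 candidates remain once the centre is excluded.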

The histogram corresponding to every position in P is then calculated. Let Hist(i) denote the histogram at position Pi. The distance between Hist(0) and Hist(i) is calculated using the Bhattacharyya coefficient,

$$S_i = S(\mathrm{Hist}(0), \mathrm{Hist}(i)) = 1 - \sum \sqrt{\mathrm{Hist}(0)\,\mathrm{Hist}(i)} \tag{2}$$

where the sum runs over the histogram bins.

In our approach, the position Psel with the smallest distance is selected initially,

$$\mathrm{sel} = \arg\min_i \{S_i\}, \quad i = 1, 2, \ldots, N \tag{3}$$
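A minimal sketch of the distance in Eq. (2) and the selection in Eq. (3), assuming normalised histograms stored as plain lists. The function names and toy histograms are illustrative, and the returned index is 0-based, whereas the paper counts candidates from 1.

```python
import math

def bhattacharyya_distance(h0, hi):
    """S = 1 - sum over bins of sqrt(h0[b] * hi[b]) for normalised histograms."""
    return 1.0 - sum(math.sqrt(a * b) for a, b in zip(h0, hi))

def select_candidate(h0, candidate_hists):
    """Return (index, distance) of the candidate histogram closest to h0."""
    distances = [bhattacharyya_distance(h0, h) for h in candidate_hists]
    sel = min(range(len(distances)), key=distances.__getitem__)
    return sel, distances[sel]

h0 = [0.5, 0.3, 0.2]
cands = [[0.2, 0.3, 0.5], [0.5, 0.3, 0.2], [0.1, 0.1, 0.8]]
sel, dist = select_candidate(h0, cands)
```

Here the second candidate is identical to the reference histogram, so it is selected with a distance of zero up to rounding.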

In a finite position set, a smallest distance always exists, even when no candidate resembles the target. For this reason, a threshold ψ is set to decide whether the new position Psel needs to be detected by the correlation filter. If the distance of the selected position Psel is smaller than the threshold ψ, the filter is evaluated at Psel. Of P0 and Psel, the position with the stronger response is selected as the final tracking result. Meanwhile, the histogram of the tracked object is updated: we directly replace it with the histogram at the new position. During tracking, the target undergoes significant changes such as deformation and rotation, and these changes must be learned for robust tracking. In our approach, we simply update the model of the kernelized correlation filter linearly:

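$$M_t = (1 - \gamma) \cdot M_{t-1} + \gamma \cdot \hat{M}_t \tag{4}$$

where Mt and Mt-1 are the models at time step t and time step t-1 respectively, \hat{M}_t is the model newly trained on the current frame, as in the standard KCF update [12], and γ is the learning rate.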
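The decision rule and the linear update can be sketched as follows. This assumes the update blends the previous model with the one learned on the current frame, as in KCF [12]. The callable `response_at` is a hypothetical stand-in for running the filter at a position, and the model is represented as a flat list of coefficients for illustration only.

```python
def choose_position(p0, resp0, p_sel, dist_sel, response_at, psi=0.1):
    """Redetect at p_sel only if its histogram distance is below psi.

    response_at is a hypothetical callable returning the filter's peak
    response at a position; it stands in for the KCF detection step.
    """
    if dist_sel >= psi:
        return p0, resp0  # candidate too dissimilar: keep the KCF result
    resp_sel = response_at(p_sel)
    return (p_sel, resp_sel) if resp_sel > resp0 else (p0, resp0)

def update_model(prev, new, gamma=0.012):
    """Element-wise linear interpolation (1 - gamma) * prev + gamma * new."""
    return [(1.0 - gamma) * a + gamma * b for a, b in zip(prev, new)]

responses = {(104, 80): 0.62}  # toy response lookup for the usage example
pos, resp = choose_position((100, 80), 0.41, (104, 80), 0.05, responses.get)
model = update_model([1.0, 0.0], [0.0, 1.0])
```

In the usage lines the candidate passes the threshold and has the stronger response, so it replaces the initial position. With γ = 0.012 the model moves only slightly towards the new estimate.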

Histogram Calculation. The histogram of the tracked object is calculated in the Hue-Saturation-Value (HSV) color space as in [18]. The authors of [18] populate an HS histogram with NhNs bins using only the pixels whose saturation and value are larger than two thresholds, set to 0.1 and 0.2 respectively. The remaining pixels can nevertheless retain crucial information when the tracked regions are mainly black and white, so they are used to populate Nv additional value-only bins. The resulting complete histogram is thus composed of NhNs + Nv bins. In the experiments, Nh = Ns = Nv = 10.
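A minimal sketch of this binning, assuming RGB pixels with components in [0, 1] and uniform pixel weights. The function name and the bin layout (hue-saturation bins first, value-only bins last) are illustrative choices and not taken from [18].

```python
import colorsys

def hsv_histogram(pixels, nh=10, ns=10, nv=10, s_min=0.1, v_min=0.2):
    """Normalised histogram with nh*ns hue-saturation bins plus nv value-only bins.

    pixels: iterable of (r, g, b) tuples with components in [0, 1].
    """
    hist = [0.0] * (nh * ns + nv)
    count = 0
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r, g, b)
        if s > s_min and v > v_min:
            # Sufficiently colourful and bright: use the joint HS bins.
            hi = min(int(h * nh), nh - 1)
            si = min(int(s * ns), ns - 1)
            hist[hi * ns + si] += 1.0
        else:
            # Near-gray or dark pixel: fall back to the value-only bins.
            vi = min(int(v * nv), nv - 1)
            hist[nh * ns + vi] += 1.0
        count += 1
    return [x / count for x in hist] if count else hist

hist = hsv_histogram([(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (1.0, 1.0, 1.0)])
```

With the default sizes the histogram has 110 bins; pure red lands in an HS bin, while black and white land in the first and last value-only bins.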


object. To unify the calculation, we resize the region of the tracked object to a fixed rectangular template size.

Figure 1. Qualitative results for the proposed tracker (Ours), compared with KCF and SRDCF. The red, green and blue rectangles correspond to KCF, SRDCF and Ours respectively. The number in the top-left corner of each sub-image is the frame number within the sequence.

Scale Estimation. Scale estimation is a crucial problem in tracking because objects vary in scale. Specifically, scale variation makes the learned model unsuitable for the current frame: if the object becomes smaller, more background is introduced into the model, and if the object becomes larger, the model represents only a local area of the object rather than the whole object. In this paper, we apply a simple scale estimation method similar to [19]. We integrate the translation and scale estimation filters by resizing the region of interest to the same template size. In this procedure, we apply only two scale adjustment factors and two corresponding weights. Our method adjusts the region with each scale factor and calculates the corresponding response. The scale factor with the larger weighted response is chosen as the current scale factor.
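The selection step can be sketched as below, using the factors (1.05 and its reciprocal) and weights (0.95) reported in the implementation details. The callable `response_at_scale` is hypothetical, and comparing against the unchanged scale 1.0 with weight 1.0 is an assumption of this sketch rather than something the paper states.

```python
def select_scale(response_at_scale, factors=(1.05, 1 / 1.05), weights=(0.95, 0.95)):
    """Pick the scale factor with the largest weighted response.

    response_at_scale is a hypothetical callable mapping a scale factor to the
    filter's peak response on the region resized by that factor. Including the
    unchanged scale 1.0 with weight 1.0 is an assumption of this sketch.
    """
    best_factor, best_score = 1.0, response_at_scale(1.0)
    for factor, weight in zip(factors, weights):
        score = weight * response_at_scale(factor)
        if score > best_score:
            best_factor, best_score = factor, score
    return best_factor

growing = {1.0: 0.50, 1.05: 0.60}   # toy responses: target got larger
stable = {1.0: 0.50, 1.05: 0.52}    # toy responses: no real scale change
grow_choice = select_scale(lambda s: growing.get(s, 0.40))
keep_choice = select_scale(lambda s: stable.get(s, 0.40))
```

Weights below one bias the tracker towards keeping the current scale, so a scaled candidate wins only when its response is clearly stronger.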

Experiments

Implementation details. Our approach is implemented in C/C++, and all algorithmic parameters are fixed throughout the experiments. Since our method is based on the kernelized correlation filter (KCF), the parameters are similar to those in KCF, and 31-dimensional HOG features are extracted. In our approach, the offset values ∆x and ∆y are set to the same range, from -12 to 12. The threshold ψ is set to 0.1 and the learning rate γ is set to 0.012. In the calculation of the HSV histogram, the pixel weights follow a two-dimensional normal distribution whose two standard deviations are set to one sixth of the width and height of the rectangular template respectively. One side of the rectangular template is set to 100 pixels and the other is computed from the ratio between the width and height of the tracked region; the longer side of the tracked region is fitted to the fixed template edge, so the approach keeps the aspect ratio of the tracked object. In scale estimation, the two scale adjustment factors are set to 1.05 and its reciprocal respectively, and the corresponding weights are both set to 0.95, which means we treat the two directions of scale variation equally.

Experiment Setup. To evaluate the robustness and accuracy of our approach, we carry out experiments on the tracking benchmark datasets OTB50 [20] and VOT2014 [21].
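As a rough illustration of the template and weighting parameters, the sketch below computes the template size and a 2-D normal weight map. Centring the distribution on the template and leaving the weights unnormalised are assumptions of this sketch; the function names are illustrative.

```python
import math

def template_size(width, height, fixed=100):
    """Map the longer side of the tracked region to `fixed` pixels, keeping aspect ratio."""
    if width >= height:
        return fixed, max(1, round(fixed * height / width))
    return max(1, round(fixed * width / height)), fixed

def gaussian_weights(tw, th):
    """2-D normal weights centred on the template, with sigma = (tw / 6, th / 6)."""
    sx, sy = tw / 6.0, th / 6.0
    cx, cy = (tw - 1) / 2.0, (th - 1) / 2.0
    return [[math.exp(-0.5 * (((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2))
             for x in range(tw)]
            for y in range(th)]

tw, th = template_size(60, 120)     # a tall tracked region
weights = gaussian_weights(tw, th)  # th rows of tw weights each
```

Pixels near the template centre receive weights close to one, while pixels at the border, which are more likely to be background, contribute little to the histogram.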


union of tracked bounding box and the ground truth. The success plot shows the ratio of successful frames, i.e., frames whose overlap is larger than a given threshold, as the threshold varies from 0 to 1. Each index is used in three evaluation modes: One-Pass Evaluation (OPE), Temporal Robustness Evaluation (TRE) and Spatial Robustness Evaluation (SRE).

In the experiments on VOT2014, two experiments called baseline and region noise are carried out, and accuracy and robustness are ranked in each. Accuracy measures how well the bounding box predicted by the tracker overlaps with the ground truth bounding box. Robustness measures how many times the tracker loses the target (failures) during tracking. Baseline runs a tracker on all sequences in the VOT2014 dataset, initializing it on the ground truth bounding boxes. Region noise initializes the tracker with a noisy bounding box, randomly perturbed in position and size by ten percent of the ground truth bounding box size. Apart from accuracy and robustness, tracking speed is also an important property that indicates the practical usefulness of trackers in particular applications. VOT2014 introduces a new speed unit called equivalent filter operations (EFO), which reduces the influence of hardware.
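The overlap measure and the success plot can be sketched as follows, assuming boxes given as (x, y, w, h) and the usual intersection-over-union definition. The number of thresholds and the mean-based area approximation are assumptions of this sketch and not settings reported in the paper.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def success_curve(overlaps, steps=21):
    """Fraction of frames whose overlap exceeds each threshold in [0, 1]."""
    thresholds = [i / (steps - 1) for i in range(steps)]
    return [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]

def auc(curve):
    """Area under the success curve, approximated by the mean of its samples."""
    return sum(curve) / len(curve)

overlap = iou((0, 0, 10, 10), (5, 0, 10, 10))
curve = success_curve([1.0, 0.0])
```

Two equal boxes shifted by half their width overlap by one third, and a sequence with one perfect and one lost frame gives a success rate of 0.5 at every threshold below 1.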
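The region noise initialization can be illustrated with a small sketch. Drawing the perturbation uniformly within ±10% is an assumption made here for illustration; the benchmark description above only states the ten percent magnitude, and the function name is hypothetical.

```python
import random

def perturb_box(box, rel=0.10, rng=random):
    """Randomly shift and rescale (x, y, w, h) by up to `rel` of the box size.

    Uniform noise is an assumption of this sketch, not a documented setting.
    """
    x, y, w, h = box
    dx = rng.uniform(-rel, rel) * w
    dy = rng.uniform(-rel, rel) * h
    sw = 1.0 + rng.uniform(-rel, rel)
    sh = 1.0 + rng.uniform(-rel, rel)
    return x + dx, y + dy, w * sw, h * sh

noisy = perturb_box((50.0, 40.0, 100.0, 60.0), rng=random.Random(0))
```

Passing a seeded `random.Random` instance makes the perturbation reproducible across runs.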

Evaluation on OTB50. To demonstrate the performance improvement of our approach with the strategies proposed above, we select five other trackers, i.e., MOSSE, CN, DSST, KCF and SRDCF, together with the 29 trackers already included in the OTB50 framework, for comparison with our method.

Fig. 1 shows the qualitative results of our tracker, KCF and SRDCF. On some sequences, our tracker visibly outperforms the other two trackers. Fig. 2 shows the evaluation results of the top-10 trackers among the 35 trackers in total with the two indexes on OPE.


On the one hand, our tracker, which introduces a histogram of the HSV color space into the kernelized correlation filter, improves on KCF. This indicates that our strategy can reduce the influence of deformation, fast motion and other factors during object tracking, thus alleviating the drift problem. Table 1 gives the quantitative improvement over KCF.

Figure 2. The success plot (OP) and precision plot (DP) of our tracker and the other trackers on the full dataset. For the precision plot, an error threshold of 20 pixels is used for ranking. For the success plot, AUC scores are used for ranking. Only the top-10 trackers are shown in the figure.

Table 1. The quantitative improvement of our tracker compared with the KCF. The results are shown in percentage.

index  OPE    TRE    SRE
DP     6.2%   6.5%   5.2%


Table 2. The frame rate of several trackers on OTB50 tested on a regular PC with Intel Xeon E5-2620 CPU (2.40 GHz) and 32 GB memory.

tracker  frame rate (fps)
SRDCF    5.43
Ours     7.78
KCF      255.51
DSST     40.91
CN       202.08

On the other hand, our tracker achieves promising performance: it takes second place, with a small gap to SRDCF. In addition, our implementation has not been fully optimized. Table 2 shows the frame rates of several trackers. Our tracker runs faster than SRDCF and could run faster still with optimization. Although several trackers run faster than ours, they do not perform better in the evaluations.

Besides the overall results, the results on 11 attributes under OPE are given in Table 3 and Table 4, showing the distance precision and overlap precision respectively. Among the selected trackers, our tracker achieves promising performance, and its scores on these eleven attributes support the effectiveness of our histogram redetection strategy. The redetection strategy is equivalent to generating an offset in advance for detection. Therefore, our tracker searches a larger area and is more capable of tracking fast-moving objects than KCF. On the scale variation (SV) attribute, the simple scale estimation method applied in our tracker performs well: compared with KCF, our tracker gains 7.5 and 12.7 percentage points in distance precision (DP) and overlap precision (OP) respectively (0.694 vs. 0.619 and 0.554 vs. 0.427). Although the DSST tracker uses 33 scales to tackle the scale variation problem, it shows no advantage over our tracker. However, the 28 sequences with the scale variation attribute may also have other attributes that negatively affect accurate scale estimation. Nevertheless, the results suggest that our tracker is more robust. The CN tracker uses color names for object tracking and is robust to motion blur, yet our tracker achieves better performance. The combination of HOG features and the HSV histogram plays a very important role in the performance improvement.

Table 3. The results of 11 attributes on OPE. The data in the table is distance precision (DP) with a center location error less than 20 pixels. The number after each attribute is the number of sequences with that attribute. The top, second and third highest scores are shown in red, blue and green respectively.

tracker FM(17)   BC(21)   MB(12)   DEF(19)  IV(25)   IPR(31)  LR(4)    OCC(19)  OPR(39)  OV(6)    SV(28)
Ours    0.627    0.651    0.588    0.725    0.634    0.718    0.460    0.717    0.718    0.652    0.694
KCF     0.565    0.677    0.606    0.671    0.642    0.652    0.368    0.675    0.657    0.601    0.619
DSST    0.491    0.631    0.527    0.608    0.671    0.691    0.467    0.653    0.665    0.490    0.675
CN      0.488    0.581    0.503    0.558    0.540    0.619    0.377    0.566    0.594    0.424    0.555
Struck  0.552    0.563    0.511    0.492    0.529    0.571    0.479    0.523    0.560    0.492    0.598
TLD     0.517    0.420    0.482    0.469    0.497    0.545    0.339    0.518    0.546    0.553    0.562

Table 4. The results of 11 attributes on OPE. The data in the table is overlap precision (OP) with an AUC score. The number after each attribute is the number of sequences with that attribute. The top, second and third highest scores are shown in red, blue and green respectively.

tracker FM(17)   BC(21)   MB(12)   DEF(19)  IV(25)   IPR(31)  LR(4)    OCC(19)  OPR(39)  OV(6)    SV(28)
Ours    0.514    0.535    0.487    0.575    0.512    0.572    0.386    0.569    0.572    0.579    0.554
KCF     0.459    0.535    0.497    0.534    0.493    0.497    0.312    0.514    0.495    0.550    0.427
DSST    0.435    0.517    0.464    0.510    0.563    0.560    0.409    0.534    0.535    0.459    0.541
CN      0.373    0.453    0.410    0.438    0.417    0.469    0.311    0.428    0.443    0.443    0.384
Struck  0.462    0.458    0.433    0.393    0.428    0.444    0.372    0.413    0.432    0.459    0.425
TLD     0.417    0.345    0.404    0.378    0.399    0.416    0.309    0.402    0.420    0.457    0.421


The results of the 13 trackers are shown in Table 5. Our tracker outperforms the other trackers on every rank except robustness in the baseline experiment. This further indicates that the introduced histogram of the HSV color space improves tracking performance. However, the improvement comes at a cost in speed: calculating the histograms is computationally expensive, and the speed of our tracker drops by approximately 17 EFO units compared with KCF.

Table 5. The VOT2014 results with 13 trackers. The top, second and third highest scores are shown in red, blue and green respectively.

tracker  baseline                region noise            overall                 speed (EFO)
         accuracy    robustness  accuracy    robustness  accuracy    robustness
Ours     1.48        2.20        1.28        1.96        1.38        2.08        6.84
KCF      1.52        1.96        1.32        2.24        1.42        2.10        24.23
MIL      7.48        4.60        8.68        5.04        8.08        4.82        1.94
CMT      5.08        4.76        5.60        4.48        5.34        4.62        2.51
LGT      13.00       3.04        13.00       2.60        13.00       2.82        1.23
FoT      4.48        5.52        4.00        6.24        4.24        5.88        114.64
Struck   4.00        4.48        3.72        3.76        3.86        4.12        5.95
CT       6.44        6.04        6.00        5.56        6.22        5.80        6.29
IVT      5.44        6.20        5.52        5.24        5.48        5.72        2.35
ACT      4.00        3.28        3.28        3.00        3.64        3.14        18.26
OGT      3.92        5.72        3.48        5.32        3.70        5.52        0.39
LT_FLO   4.16        6.16        3.32        5.68        3.74        5.92        1.10
EDFT     3.88        4.20        3.60        4.60        3.74        4.40        4.18

Conclusions

This paper proposes an improved kernelized correlation filter with the histogram in the Hue-Saturation-Value color space. We introduce the histogram to calculate a new position to be redetected by the filter. The position corresponding to the larger response is chosen as the final tracked bounding box. In addition, a simple scale estimation is applied and is effective in handling the scale variation problem. Experiments on OTB50 and VOT2014 demonstrate that our proposed tracker achieves favorable performance compared with other state-of-the-art methods.

References

[1] Cruz-Mota J, Bierlaire M, Thiran J P. Sample and pixel weighting strategies for robust incremental visual tracking. IEEE Trans Circuits Syst Video Technol, 2013, 23(5): 898-911.

[2] Tian C, Gao X, Wei W, et al. Visual Tracking Based on the Adaptive Color Attention Tuned Sparse Generative Object Model. IEEE Trans Image Process, 2015, 24(12): 5236-5248.

[3] Zhu G, Wang J, Zhao C, et al. Weighted part context learning for visual tracking. IEEE Trans Image Process, 2015, 24(12): 5140-5151.

[4] Binh N D. Online Boosting-Based Object Tracking. Proceedings of the 12th International Conference on Advances in Mobile Computing and Multimedia. ACM, 2014, pp. 194-202.

[5] Li N, Zhao X, Li D, et al. Object Tracking with Multiple Instance Learning and Gaussian Mixture Model. J. Inf. Comput. Sci., 2015, 12(11): 4465-4477.

[6] Zhang S, Zhao S, Sui Y, et al. Single object tracking with fuzzy least squares support vector machine. IEEE Trans Image Process, 2015, 24(12): 5723-5738.

[7] Wang N, Yeung D Y. Learning a deep compact image representation for visual tracking. Adv. neural inf. proces. syst., 2013, pp. 809-817.


[9] Nam H, Han B. Learning multi-domain convolutional neural networks for visual tracking. CVPR, 2016, pp. 4293-4302.

[10] Bolme D S, Beveridge J R, Draper B A, et al. Visual object tracking using adaptive correlation filters. CVPR, 2010, pp. 2544-2550.

[11] Danelljan M, Häger G, Khan F, et al. Accurate scale estimation for robust visual tracking. BMVC - Proc. Br. Mach. Vis. Conf. 2014.

[12] Henriques J F, Caseiro R, Martins P, et al. High-speed tracking with kernelized correlation filters. IEEE Trans Pattern Anal Mach Intell, 2015, 37(3): 583-596.

[13] Danelljan M, Häger G, Shahbaz Khan F, et al. Learning spatially regularized correlation filters for visual tracking. ICCV, 2015, pp. 4310-4318.

[14] Ross D A, Lim J, Lin R S, et al. Incremental learning for robust visual tracking. Int J Comput Vision, 2008, 77(1): 125-141.

[15] Hinton G E, Osindero S, Teh Y W. A fast learning algorithm for deep belief nets. Neural computation, 2006, 18(7): 1527-1554.

[16] Felzenszwalb P F, Girshick R B, McAllester D, et al. Object detection with discriminatively trained part-based models. IEEE Trans Pattern Anal Mach Intell, 2010, 32(9): 1627-1645.

[17] Danelljan M, Shahbaz Khan F, Felsberg M, et al. Adaptive color attributes for real-time visual tracking. CVPR, 2014, pp. 1090-1097.

[18] Pérez P, Hue C, Vermaak J, et al. Color-based probabilistic tracking. ECCV, 2002, pp. 661-675.

[19] Li X, Liu Q, He Z, et al. A multi-view model for visual tracking via correlation filters. Knowledge-Based Systems, 2016, 113: 88-99.

[20] Wu Y, Lim J, Yang M H. Online object tracking: A benchmark. CVPR, 2013, pp. 2411-2418.

[21] Kristan M, Pflugfelder R, Leonardis A, et al. The visual object tracking VOT2014 challenge results. ECCV, 2014, pp. 191-217.

[22] Babenko B, Yang M H, Belongie S. Robust object tracking with online multiple instance learning. IEEE Trans Pattern Anal Mach Intell, 2011, 33(8): 1619-1632.

[23] Nebehay G, Pflugfelder R. Consensus-based matching and tracking of keypoints for object tracking. IEEE Winter Conf. Appl. Comput. Vis., WACV, 2014, pp. 862-869.

[24] Cehovin L, Kristan M, Leonardis A. Robust visual tracking using an adaptive coupled-layer visual model. IEEE Trans Pattern Anal Mach Intell, 2013, 35(4): 941-953.

[25] Matas J, Vojir T. Robustifying the flock of trackers. Proceedings of the 16th Computer Vision Winter Workshop, 2011, pp. 91-97.

[26] Hare S, Saffari A, Torr P H S. Struck: Structured output tracking with kernels. ICCV, 2011, pp. 263-270.

[27] Zhang K, Zhang L, Yang M H. Real-time compressive tracking. ECCV, 2012, pp. 864-877.

[28] Nam H, Hong S, Han B. Online graph-based tracking. ECCV, 2014, pp. 112-126.

[29] Lebeda K, Hadfield S, Matas J, et al. Long-term tracking through failure cases. Proceedings of the IEEE International Conference on Computer Vision Workshops. 2013, pp. 153-160.