4.2 Using Multiple Classifiers
4.2.4 Calibrating Classifiers
When using multiple classifiers, an important issue is that the real valued scores output by different detectors are not always directly comparable. For example, two classifiers that output real valued scores for images could have completely different ranges for their output values. Thus, when using multiple classifiers in conjunction, it is necessary to adjust their output scores in a process that is often referred to as calibration. Calibration did not need to be addressed in Section 4.1, as the reflected counterpart of a classifier will have the same output range. Though sophisticated calibration methods exist [106], in this thesis a simpler method is used. For a set of classifiers {H1(I), . . . , HK(I)}, let αkr be
the coefficient of the $r$-th weak classifier of classifier $k$. Classifiers are calibrated by normalising all values of $\alpha_{kr}$ so that $\sum_{r=1}^{R} \alpha_{kr}$ is, for every value of $k$, equal to

$$\max_{k} \left\{ \sum_{r=1}^{R} \alpha_{kr} \right\}.$$

4.2.5 Results
In this section, results are presented for using multiple detectors. Six different sets of multiple detectors are presented. Three were trained on clustered images produced by the default trees classifier, using values of K = 2, K = 3, and K = 5, and another three were trained on images clustered by the cieluv trees, using values of K = 2,
K = 3, and K = 5. All classifiers were trained using the default parameters defined in Section 3.4.1. Tests were carried out using calibration as described in Section 4.2.4, and without calibration. To accelerate the evaluation, as in Section 4.1.1, a constant rejection threshold of θreject= −10 is used.
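The calibration of Section 4.2.4 that is used in these tests can be sketched as follows. This is a minimal illustration rather than the thesis implementation: the function name `calibrate` and the example coefficients are hypothetical, and it assumes that every classifier has a positive coefficient sum, so that each set of coefficients can simply be rescaled to the largest sum.

```python
def calibrate(alphas):
    """Rescale the weak classifier coefficients of each classifier so that
    every classifier has the same coefficient sum, namely the largest of the
    original sums. alphas[k][r] is the coefficient of weak classifier r of
    classifier k. Assumes every sum is positive."""
    sums = [sum(row) for row in alphas]
    target = max(sums)
    return [[a * target / s for a in row] for row, s in zip(alphas, sums)]


# Hypothetical coefficients for K = 2 classifiers with R = 3 weak classifiers.
alphas = [[0.5, 1.0, 0.5],
          [1.0, 2.0, 1.0]]
calibrated = calibrate(alphas)
for k, row in enumerate(calibrated):
    print(k, row, sum(row))
```

If the score of a classifier is a weighted sum of its weak classifier outputs, which is an assumption made here about the boosted classifiers, rescaling the coefficients in this way rescales the output range of each classifier by the same factor.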
The results are shown in Figure 4.5 and, as can be seen, are rather mixed. There is no consistent pattern for the effect of calibration, which seems to cause only small differences in performance. Training more classifiers using more clusters seems to increase the average miss rate, with Figures 4.5f and 4.5e showing that classifiers with K = 5 on sets that were clustered using the default trees and cieluv trees classifiers have average miss rates of 26% and 27% respectively. The best results are obtained by training classifiers on images from clusters created using the default trees classifier for K = 2, where the resulting classifier has an average miss rate of 19% as shown in Figure 4.5b.
The results in Figure 4.5 seem to indicate that overfitting occurs when more clusters are used. To attempt to alleviate this problem, the classifiers trained via clustering were run again, but this time the default classifier from Section 3.4.1 was added to each set of classifiers, so that each set now consists of the classifiers trained on clusters and the default classifier. The results are shown in Figure 4.6, and it can be seen that running the classifiers trained on clusters along with the default classifier improves upon the original results when K = 3 and K = 5. However, the best performance is still obtained with the classifier trained on clusters obtained using the default trees classifier with K = 2.
Figure 4.7 shows some of the results obtained using the set of classifiers trained on images that were clustered with the default trees classifier with K = 5. The same images that were used for Figure 3.15 are used again. The majority of detections shown are the result of just one of the detectors, for which the bounding boxes are shown in dark blue, but several of the detections correspond to the other classifiers, shown in red, yellow, green and cyan.
[Figure 4.5, panels (a) to (f): plots of miss rate against false positives per image. Legend entries:
(a) CIELUV trees, K=2: 24% calibrated, 23% uncalibrated
(b) Normal trees, K=2: 19% calibrated, 20% uncalibrated
(c) CIELUV trees, K=3: 23% calibrated, 22% uncalibrated
(d) Normal trees, K=3: 22% calibrated, 21% uncalibrated
(e) CIELUV trees, K=5: 27% calibrated, 27% uncalibrated
(f) Normal trees, K=5: 25% calibrated, 26% uncalibrated]
Figure 4.5: The results on the INRIA dataset for several classifiers trained on clustered datasets, with and without calibration, and with θreject = −10. The classifiers are sets of classifiers trained on clusters created using the cieluv trees classifier with (a) K = 2, (c) K = 3 and (e) K = 5, and sets of classifiers trained on clusters created using the default trees classifier with (b) K = 2, (d) K = 3 and (f) K = 5.
[Figure 4.6, panels (a) to (f): plots of miss rate against false positives per image. Legend entries (the Baseline is 23% in every panel):
(a) CIELUV trees, K=2: 24% calibrated, 23% calibrated + Baseline
(b) Normal trees, K=2: 19% calibrated, 20% calibrated + Baseline
(c) CIELUV trees, K=3: 23% calibrated, 21% calibrated + Baseline
(d) Normal trees, K=3: 22% calibrated, 20% calibrated + Baseline
(e) CIELUV trees, K=5: 27% calibrated, 22% calibrated + Baseline
(f) Normal trees, K=5: 25% calibrated, 22% calibrated + Baseline]
Figure 4.6: The results on the INRIA dataset for the calibrated classifiers from Figure 4.5 combined with the default classifier from Section 3.4.1. Each figure shows the results for the default classifier (under the moniker “Baseline”), one of the classifiers trained on clustered data, and the combination of the two.
[Figure 4.7, panels (a) to (k): example detection images.]
Figure 4.7: Examples of the results from the set of detectors trained on positive images clustered using the default trees classifier with K = 5. The bounding boxes are colour coded to show which classifier they correspond to, with the correspondence shown in (a), (b), (c), (d) and (e).
Training Set            No threshold   θreject = −10   θreject = −5
default trees, K = 2    8.1117         0.5686          0.4043
cieluv trees, K = 2     8.1241         0.6128          0.4163
default trees, K = 3    12.0094        0.6170          0.4318
cieluv trees, K = 3     12.0534        0.6845          0.4487
default trees, K = 5    19.8175        0.5493          0.4131
cieluv trees, K = 5     20.0119        0.8409          0.5305
Table 4.2: A table showing the time taken in seconds per 640 × 480 pixel image for different sets of multiple classifiers, with different constant rejection thresholds.
4.2.6 Classifier Speed
The speed of the multiple classifiers was measured. As with Sections 3.5.2 and 4.1.2, all results were computed on images from the INRIA test set that were 640×480 pixels, once again using a 32-bit computer with an Intel Pentium Dual-Core 2.6GHz processor and 4GB of RAM. The results are shown in Table 4.2. Once again, when the full classifier is run, as shown in the second column of Table 4.2, the time taken per image scales linearly with K. As a single classifier takes roughly 4 seconds, with K = 2 the set of classifiers takes roughly 8 seconds, with K = 3 the time is around 12 seconds, and with K = 5 the time is around 20 seconds. However, it can be observed again that when a constant rejection threshold of θreject = −10 or θreject = −5 is used, the time taken is always less than one second, and scales much more gracefully as K increases.
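The scaling described above can be checked directly from the values in Table 4.2. The short sketch below is only an arithmetic illustration: the dictionary layout and variable names are hypothetical, while the timings are copied from the table.

```python
# Seconds per 640 x 480 image, copied from Table 4.2:
# (no threshold, theta_reject = -10, theta_reject = -5)
timings = {
    ("default trees", 2): (8.1117, 0.5686, 0.4043),
    ("cieluv trees", 2): (8.1241, 0.6128, 0.4163),
    ("default trees", 3): (12.0094, 0.6170, 0.4318),
    ("cieluv trees", 3): (12.0534, 0.6845, 0.4487),
    ("default trees", 5): (19.8175, 0.5493, 0.4131),
    ("cieluv trees", 5): (20.0119, 0.8409, 0.5305),
}

# Time per classifier when no rejection threshold is used.
per_classifier = {key: full / key[1] for key, (full, _, _) in timings.items()}

for (name, k), (full, t10, t5) in timings.items():
    print(f"{name}, K = {k}: {per_classifier[(name, k)]:.2f} s per classifier, "
          f"{full / t10:.1f}x faster with theta_reject = -10")
```

Without a threshold, every set costs close to 4 seconds per classifier, which is the linear scaling noted in the text, whereas the thresholded timings all stay below one second.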
4.3 Discussion and Summary
It has been shown in this chapter that multiple classifiers can be used to improve the results of person detection. It was first shown how reflected classifiers could be utilised to improve performance without sacrificing too much speed. The speed of these joint classifiers could be further improved by exploiting the fact that any features that are symmetric with respect to the vertical axis only need to be computed once. Another improvement would be to alter the training process so that the orientation of positive training images with respect to the vertical axis is a latent random variable, which can be inferred during training to create a classifier that is biased towards a particular pose. Such an approach has been applied to SVMs to yield better results [44].

It was shown that the outputs of the weak classifiers of a boosting classifier reflect information on the appearance of input images, and can be used to construct vectors for clustering. It was also shown that these vectors can be converted to a compact binary form with little impact on the results of clustering. Future work could explore using more clusters to obtain even more distinct modes of appearance.
Finally, it was shown that multiple classifiers can be trained on sets of positive images obtained via clustering, and that combining multiple classifiers can improve performance if these clustered classifiers are combined with a normally trained classifier. With constant rejection thresholds, multiple classifiers can be run in conjunction with only a small increase in running time. Future work could focus on modelling any redundancy between classifiers to improve the speed further.
Tracking People in Video
The previous chapters have addressed the problem of detecting people in images rapidly. This chapter focuses on the problem of tracking people in video. The first approach that is presented is based on the mean shift algorithm that was described in detail in Section 2.2.1. It is shown that this algorithm can be adapted to the purpose of tracking based on colour characteristics. A novel method for mean shift tracking through scale is developed and tested. Finally, the problem of tracking multiple people is briefly examined based on a different approach involving the classifiers that were developed in Chapter 3 and Kalman filtering.
5.1 Mean Shift Tracking
Mean shift tracking [63] involves tracking a target based on its colour characteristics. The target’s position must be manually initialised in the first frame, and in subsequent frames the best colour match for the target is found.
In this section, the theory behind the mean shift tracking technique from [63] is outlined. The target that is to be tracked is represented by a reference colour histogram in RGB space $\hat{q} = [\hat{q}_1 \cdots \hat{q}_{m_b}]^T$, where $m_b$ is the number of bins for the colour histogram. The aim is to find the area centred on the pixel location $y \in \mathbb{R}^2$ in each frame $I_t$ of a video $\{I_1, \ldots, I_T\}$ which yields a colour histogram $\hat{p}(I_t, y)$ that is most similar to $\hat{q}$. The similarity between two colour histograms is measured using the Bhattacharyya coefficient, which is given by

$$\mathrm{BC}(y) = \mathrm{BC}(\hat{p}(I_t, y), \hat{q}) = \sum_{i=1}^{m_b} \sqrt{\hat{p}_i(I_t, y)\, \hat{q}_i}. \tag{5.1}$$

Since $\sum_{i=1}^{m_b} \hat{p}_i = 1$ and $\sum_{i=1}^{m_b} \hat{q}_i = 1$, the range of the Bhattacharyya coefficient is $0 \le \mathrm{BC} \le 1$. The value of the Bhattacharyya coefficient is high when two histograms are very similar, and low when they are dissimilar. Thus, the tracking problem can be formulated as finding the value of $y$ in each frame that maximises $\mathrm{BC}(y)$. The histograms $\hat{q}$ and $\hat{p}(y)$ are kernel weighted histograms, and the reason for this will be addressed shortly. A kernel weighted histogram modulates the value that a pixel contributes to the histogram based on the position of the pixel relative to a kernel, and so $\hat{p}_i \in \mathbb{R}$ can be expressed as
$$\hat{p}_i(I, y) = c_H \sum_{x \in X} k\left(\|H(y - x)\|^2\right) \delta\left(b_{RGB}(I(x)) - i\right), \tag{5.2}$$
where $c_H$ is a normalisation constant, $k(\cdot)$ is a finite support kernel function which was explained in detail in Section 2.2.1, and $x \in \mathbb{R}^2$ are the pixel co-ordinates within the support $X$ of the kernel function $k(\cdot)$. It should be noted that in this chapter, $I(x) \in \mathbb{R}^3$ will be used to denote a pixel, in contrast to the notation that was used in Chapter 3. Furthermore, $\delta$ is the Kronecker delta function, $b_{RGB} : \mathbb{R}^3 \to \{1, \ldots, m_b\}$ is a function that maps the pixel located at $x$ to the correct histogram bin index based on its colour, and $H$ is the bandwidth matrix given by

$$H = \begin{bmatrix} \frac{1}{h_1} & 0 \\ 0 & \frac{1}{h_2} \end{bmatrix}, \tag{5.3}$$

where $h_1$ and $h_2$ are parameters that determine the scale of the kernel, and also enable the kernel to be elliptical, as opposed to just circular.
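The kernel weighted histogram of Equation 5.2 and the Bhattacharyya coefficient of Equation 5.1 can be illustrated with a short sketch. It is not the implementation used in this thesis and rests on several assumptions: the image is a nested list of 8-bit (R, G, B) tuples, `bin_index` is a hypothetical choice for $b_{RGB}$ that quantises each channel into four levels and returns a 0-based index, and the kernel profile is taken to be $k(s) = 1 - s$ for $s \le 1$, with any constant factor absorbed into $c_H$.

```python
import math


def bin_index(pixel, levels=4):
    """Hypothetical b_RGB: quantise each 8-bit channel into `levels` bins and
    return a 0-based bin index, so that m_b = levels ** 3."""
    r, g, b = (c * levels // 256 for c in pixel)
    return (r * levels + g) * levels + b


def profile(s):
    """Assumed kernel profile: k(s) = 1 - s inside the support, 0 outside."""
    return 1.0 - s if s <= 1.0 else 0.0


def kernel_histogram(image, y, h, levels=4):
    """Kernel weighted histogram of Equation 5.2.
    image[row][col] is an (R, G, B) tuple, y = (col, row), h = (h1, h2)."""
    hist = [0.0] * levels ** 3
    for row, line in enumerate(image):
        for col, pixel in enumerate(line):
            # s = ||H (y - x)||^2 with H = diag(1 / h1, 1 / h2)
            s = ((y[0] - col) / h[0]) ** 2 + ((y[1] - row) / h[1]) ** 2
            hist[bin_index(pixel, levels)] += profile(s)
    total = sum(hist)
    if total == 0.0:
        raise ValueError("no pixel falls inside the kernel support")
    return [v / total for v in hist]  # c_H makes the bins sum to one


def bhattacharyya(p, q):
    """Bhattacharyya coefficient of Equation 5.1."""
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))


# Synthetic 9 x 9 image: a red 3 x 3 target on a blue background.
RED, BLUE = (255, 0, 0), (0, 0, 255)
image = [[RED if 3 <= c <= 5 and 3 <= r <= 5 else BLUE for c in range(9)]
         for r in range(9)]
q = kernel_histogram(image, (4, 4), (3, 3))      # reference histogram
p_far = kernel_histogram(image, (0, 0), (3, 3))  # candidate away from target
print(bhattacharyya(q, q), bhattacharyya(p_far, q))
```

A candidate centred on the target reproduces the reference histogram and gives a coefficient of one, up to rounding, while the candidate in the corner sees only background and scores lower.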
It is shown in [63] that the linear Taylor series expansion of the Bhattacharyya coefficient, subject to the condition that the histograms are kernel weighted as previously mentioned, consists of two terms. The first term is constant with respect to $y$, and the second term takes the form of a kernel density estimate. As the mean shift algorithm can be used to find the local mode of a kernel density estimate, it can be used as the basis of a tracking algorithm that maximises the Bhattacharyya coefficient. This is achieved by computing the location $y$ iteratively using the following equation

$$y_{j+1} = \frac{\sum_{x \in X} x\, w(I, x)\, k'\left(\|H(y_j - x)\|^2\right)}{\sum_{x \in X} w(I, x)\, k'\left(\|H(y_j - x)\|^2\right)}, \tag{5.4}$$

where $y_{j+1}$ is the new location, $y_j$ is the previous location, $k'(\cdot)$ is the derivative of $k(\cdot)$, and the weights $w(I, x)$ are given by

$$w(I, x) = \sum_{i=1}^{m_b} \sqrt{\frac{\hat{q}_i}{\hat{p}_i(I, y_j)}}\, \delta\left(b_{RGB}(I(x)) - i\right). \tag{5.5}$$
It should be noted that for any value of $i$ where $\hat{p}_i = 0$, the result of the division operation in Equation 5.5 is set to 0. Typically, Equation 5.4 will have to be iterated several times to converge to an estimate for the target position in a single frame. Iterations are continued until the mean shift vector $m_g(y_j)$ given by Equation 2.8, which is the difference between two consecutive location estimates, has a magnitude below some specified threshold. In this thesis, this threshold is set to one pixel. It is important to note that the histogram $\hat{p}(I, y_j)$ will change at every mean shift iteration, as $y_j$ is changing.
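The iteration of Equations 5.4 and 5.5 for a single frame can be sketched as below. This is an illustrative sketch under stated assumptions, not the thesis implementation: the image is a nested list of 8-bit (R, G, B) tuples, `bin_index` is a hypothetical $b_{RGB}$ with 0-based indices, the histogram uses the profile $k(s) = 1 - s$, and $k'(\cdot)$ is assumed to be constant inside the kernel support so that it cancels in Equation 5.4. The kernel is also assumed to be large enough that its support always contains pixels.

```python
import math

LEVELS = 4  # hypothetical quantisation, giving m_b = LEVELS ** 3 bins


def bin_index(pixel):
    """Hypothetical b_RGB: 0-based bin index of an 8-bit (R, G, B) pixel."""
    r, g, b = (c * LEVELS // 256 for c in pixel)
    return (r * LEVELS + g) * LEVELS + b


def support(image, y, h):
    """Yield (col, row, s) for pixels with s = ||H (y - x)||^2 <= 1."""
    for row, line in enumerate(image):
        for col in range(len(line)):
            s = ((y[0] - col) / h[0]) ** 2 + ((y[1] - row) / h[1]) ** 2
            if s <= 1.0:
                yield col, row, s


def histogram(image, y, h):
    """Kernel weighted histogram with the assumed profile k(s) = 1 - s."""
    hist = [0.0] * LEVELS ** 3
    for col, row, s in support(image, y, h):
        hist[bin_index(image[row][col])] += 1.0 - s
    total = sum(hist)
    return [v / total for v in hist]


def mean_shift(image, q, y, h, threshold=1.0, max_iter=20):
    """Iterate Equation 5.4 until the shift is below `threshold` pixels."""
    for _ in range(max_iter):
        p = histogram(image, y, h)  # recomputed, since y_j has changed
        sum_x = sum_y = sum_w = 0.0
        for col, row, _ in support(image, y, h):
            i = bin_index(image[row][col])
            # Equation 5.5, with the division set to 0 when p_i = 0
            w = math.sqrt(q[i] / p[i]) if p[i] > 0.0 else 0.0
            sum_x += col * w
            sum_y += row * w
            sum_w += w
        if sum_w == 0.0:
            break
        new_y = (sum_x / sum_w, sum_y / sum_w)
        shift = math.hypot(new_y[0] - y[0], new_y[1] - y[1])
        y = new_y
        if shift < threshold:
            break
    return y


# Synthetic 15 x 15 frame: a red 3 x 3 target centred on (9, 7).
RED, BLUE = (255, 0, 0), (0, 0, 255)
image = [[RED if 8 <= c <= 10 and 6 <= r <= 8 else BLUE for c in range(15)]
         for r in range(15)]
target, start, h = (9, 7), (6, 6), (4, 4)
q = histogram(image, target, h)  # reference histogram of the target
estimate = mean_shift(image, q, start, h)
print(estimate)
```

On such a small synthetic frame the one-pixel threshold can stop the loop after very few iterations, so the sketch only demonstrates that the estimate moves from the starting location towards the target.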
A choice of kernel for $k(\cdot)$ must be made. The most common choice is the Epanechnikov kernel given by Equation 2.10 in Section 2.2.1. By choosing $k(\cdot)$ to be the Epanechnikov kernel, the derivative kernel $k'(\cdot)$ becomes the uniform kernel. This considerably simplifies the calculation of (2.12), as all values of $k'(\cdot)$ (within the support of the kernel) are constant. Thus, Equation 5.4 simplifies to

$$y_{j+1} = \frac{\sum_{x \in X} x\, w(I, x)}{\sum_{x \in X} w(I, x)} \tag{5.6}$$