
4.2 Using Multiple Classifiers

4.2.2 Applying Clustering to Image Windows

The K-means algorithm could be applied directly to the positive image windows used for training classifiers in Chapter 3 to obtain clusters based on different modes of appearance. However, the dimensionality of each image is equal to the number of pixels, which is 8192, and clustering in a space of this size is costly. The complexity of the clustering problem can be lowered by first reducing the dimensionality of the input images.

In this section, the idea of clustering windows by using the output from a boosting classifier is explored. As has been shown, boosting classifiers consist of an ensemble of weak classifiers hr : I → {−1, +1}, and each weak classifier will make a decision regarding the input, which by itself will often be inaccurate. A final decision on whether the target class is present or not is made by taking the consensus of the weak classifiers, as shown in Equation 3.13. Each weak classifier has an associated coefficient αr, and a real valued score for an input is computed by multiplying each αr by the output of the corresponding hr, and summing over r. The sign of this value then determines the final decision f. The values of αr are determined at training time as described in Algorithm 1, with larger values of αr for weak classifiers that are more accurate. It is interesting to note that for a window to be judged as representing a person, only some of the weak classifiers need to return that decision. Thus, for two different images featuring two different people, the set of weak classifiers that return the label +1 may be different, and may reflect the differences in appearance between the two target instances. For an image window I, the output from a boosting classifier can be used to construct a vector α(I) defined as

$$\alpha(I) = \begin{bmatrix} \alpha_1 h_1(I) & \cdots & \alpha_R h_R(I) \end{bmatrix}^{T}, \qquad (4.5)$$

which summarises the response of the weak classifiers. This vector gives a compact summary of multiple feature values. This concept is not dissimilar to the popular “bag of words” paradigm [104] in computer vision, where histograms of feature frequencies are used as a higher level feature. Here, the vector of responses from the individual weak classifiers is used.
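As a minimal sketch of Equation 4.5, the following Python fragment builds the response vector and the boosted decision for one window. The weak classifiers here are toy threshold stumps on a three-element feature list, and the names `response_vector`, `boosted_decision` and `make_stump`, the thresholds, the coefficient values and the tie-break at a score of zero are all assumptions made for illustration, not the classifiers trained in Chapter 3.

```python
from typing import Callable, List, Sequence

Window = Sequence[float]                  # stand-in for the features of an image window
WeakClassifier = Callable[[Window], int]  # returns -1 or +1


def response_vector(window: Window,
                    weak_classifiers: Sequence[WeakClassifier],
                    alphas: Sequence[float]) -> List[float]:
    """Return [alpha_1 * h_1(I), ..., alpha_R * h_R(I)] as in Equation 4.5."""
    return [a * h(window) for a, h in zip(alphas, weak_classifiers)]


def boosted_decision(window: Window,
                     weak_classifiers: Sequence[WeakClassifier],
                     alphas: Sequence[float]) -> int:
    """Sign of the summed responses; a score of exactly 0 is mapped to +1 (assumption)."""
    score = sum(response_vector(window, weak_classifiers, alphas))
    return 1 if score >= 0 else -1


def make_stump(feature: int, threshold: float) -> WeakClassifier:
    """Hypothetical weak classifier that thresholds a single feature."""
    return lambda w: 1 if w[feature] > threshold else -1


# Toy ensemble with R = 3 weak classifiers (all values are illustrative).
weak = [make_stump(0, 0.5), make_stump(1, 0.5), make_stump(2, 0.7)]
alphas = [0.5, 1.5, 0.25]
window = [0.2, 0.9, 0.5]

alpha_vec = response_vector(window, weak, alphas)
decision = boosted_decision(window, weak, alphas)
print(alpha_vec, decision)  # [-0.5, 1.5, -0.25] 1
```

Only the second weak classifier votes +1 here, yet its larger coefficient is enough to make the final decision positive, which illustrates why two positive windows can have quite different response vectors.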

If the vector α can be used to summarise variations in appearance between different instances of people, then it is reasonable to suggest that instances with a similar appearance would have values of α that are close together, as measured by a distance metric, such as the Euclidean norm. Thus, if values of α were computed for a set of images containing people, these vectors could be clustered to reveal modes of appearance among those people. Using the vectors α presents several advantages over clustering the actual images. The most obvious is that the vector α is considerably more compact than an image. Another advantage is that as α is produced by a boosting classifier, the entries of the vector focus on discriminative information. Finally, by training different boosting classifiers with different sets of features, the nature of the clustering can be altered. For example, if a boosting classifier is trained solely on colour features, α will only reflect information from these features.

In this section, positive training images are clustered by extracting the vectors α and applying K-means clustering. Let αH(I) denote a vector α extracted from an image I by applying the boosting classifier H. By changing the boosting classifier H, different values of αH(I) can be obtained. For example, using a boosting classifier trained only on CIELUV features will create a vector α that only reflects information from the CIELUV colour space. A set of positive images {I1, . . . , In} is used with a boosting classifier H to generate a set of vectors {αH(I1), . . . , αH(In)} which are used as the input to the K-means algorithm.
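The clustering step can be sketched with a plain implementation of Lloyd's algorithm for K-means on a list of response vectors. This is an illustrative sketch only: the function names, the random initialisation from the data points, the handling of empty clusters and the toy two-dimensional vectors are assumptions, and the implementation actually used for the experiments is not specified here.

```python
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors: Sequence[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans(vectors: Sequence[Vector], k: int, seed: int = 0,
           max_iter: int = 100) -> Tuple[List[int], List[List[float]], float]:
    """One run of Lloyd's algorithm; returns labels, centroids and the
    within-cluster sum of squares."""
    rng = random.Random(seed)
    # Initialise centroids with k distinct data points (one common choice).
    centroids = [list(v) for v in rng.sample(list(vectors), k)]
    labels = [-1] * len(vectors)
    for _ in range(max_iter):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the run has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_vector(members)
    wcss = sum(squared_distance(v, centroids[label])
               for v, label in zip(vectors, labels))
    return labels, centroids, wcss


# Toy response vectors forming two obvious groups (illustrative values).
toy_alphas = [[1.0, 1.0], [1.0, 0.8], [-1.0, -1.0], [-0.8, -1.0]]
toy_labels, toy_centroids, toy_wcss = kmeans(toy_alphas, k=2)
print(toy_labels, toy_wcss)
```

In practice the input would be the set of vectors obtained by applying one boosting classifier to every positive training image, so each vector has one entry per weak classifier.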

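Four boosting classifiers are used to extract the vectors α, in order to examine whether different classifiers give different results for clustering. To make the discussion in the following section easier to follow, each of these classifiers is assigned a moniker, and is described next: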

• The first classifier is referred to as default trees, and is trained using the default settings described in Section 3.4.1.

• The second classifier is referred to as default stumps, and is the same as default trees, except for the fact that stump classifiers are used rather than depth two decision trees. The motivation behind using stump classifiers is that tree classifiers select features to test using branching, and so two inputs that are given the same label by a decision tree can have different visual characteristics. A stump classifier tests only a single feature, and so any two inputs that are assigned the same label will have responded to a single feature in the same way.

• The third classifier is referred to as cieluv trees, and is similar to default trees, except that it is trained only with CIELUV features. The motivation for using only CIELUV features is to obtain clusters based only on information from this colour space.

• The fourth classifier is referred to as cieluv stumps, and is similar to cieluv trees, but uses stump classifiers rather than depth two decision trees.
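The motivation for the stump variants can be illustrated with a small sketch. The depth two tree and the stump below are hypothetical, with made-up features and thresholds; the point is only that a tree can assign the same label to two inputs by following different branches, whereas a stump's label is determined by a single feature.

```python
from typing import Sequence


def stump(x: Sequence[float]) -> int:
    """Hypothetical stump: tests only feature 0."""
    return 1 if x[0] > 0.5 else -1


def depth_two_tree(x: Sequence[float]) -> int:
    """Hypothetical depth two tree: the root tests feature 0 and each
    branch then tests a different feature."""
    if x[0] > 0.5:
        return 1 if x[1] > 0.5 else -1
    return 1 if x[2] > 0.5 else -1


a = [0.9, 0.8, 0.1]  # reaches +1 through the right branch (features 0 and 1)
b = [0.1, 0.2, 0.9]  # reaches +1 through the left branch (features 0 and 2)

print(depth_two_tree(a), depth_two_tree(b))  # 1 1
print(stump(a), stump(b))                    # 1 -1
```

Both inputs receive +1 from the tree although they differ on every feature it tests, while the stump separates them, so an entry of α produced by a stump says something more specific about the input than an entry produced by a tree.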

Experiments are also carried out by applying the K-means algorithm with the Hamming distance rather than the Euclidean norm. The Hamming distance is a metric that measures the distance between two binary strings by counting the number of bits that differ. To convert α to a binary vector, positive entries become 1 and negative entries become 0. Thus, clustering with the Hamming distance can be used to test if the actual values of αr have an impact on clustering, and whether it is possible to use an even more compact representation.
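The conversion to a binary vector and the resulting distance can be sketched as follows. The function names and the example vectors are illustrative assumptions. How cluster centres are updated when K-means is run with the Hamming distance is not stated in the text, so this sketch covers only the binarisation and the distance itself.

```python
import math
from typing import List, Sequence


def binarise(alpha: Sequence[float]) -> List[int]:
    """Positive entries become 1 and negative entries become 0."""
    return [1 if value > 0 else 0 for value in alpha]


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two binary vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean norm of the difference between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Illustrative response vectors: a and b agree in sign everywhere but differ
# in magnitude; a and c have similar magnitudes but differ in two signs.
alpha_a = [0.1, 2.0, -0.3]
alpha_b = [3.0, 0.2, -0.3]
alpha_c = [-0.1, 2.0, 0.3]

print(hamming_distance(binarise(alpha_a), binarise(alpha_b)))  # 0
print(hamming_distance(binarise(alpha_a), binarise(alpha_c)))  # 2
print(euclidean_distance(alpha_a, alpha_b) > euclidean_distance(alpha_a, alpha_c))  # True
```

In this example the two measures disagree about which vector is closest to `alpha_a`, which is exactly the effect the Hamming experiments are designed to probe: the binary representation discards the magnitudes of the coefficients and keeps only the weak classifier decisions.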

As was mentioned in Section 4.2.1, the K-means algorithm is sensitive to initialisation, and is usually run multiple times, with the result with the lowest within-cluster sum of squares error being taken. In this section, all clustering experiments involve running the K-means algorithm 20 times. To visualise what each cluster might represent, the images within a cluster can be averaged together.
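The restart strategy and the cluster averaging can be sketched as below. The helper names are hypothetical, `run_once` stands for any function that performs a single K-means run for a given seed and returns the labels together with the within-cluster sum of squares, and images are represented as flat lists of pixel values for simplicity.

```python
from typing import Callable, List, Sequence, Tuple

RunResult = Tuple[List[int], float]  # (labels, within-cluster sum of squares)


def best_of_restarts(run_once: Callable[[int], RunResult],
                     n_restarts: int = 20) -> RunResult:
    """Run the clustering n_restarts times and keep the lowest-error result."""
    best = run_once(0)
    for seed in range(1, n_restarts):
        candidate = run_once(seed)
        if candidate[1] < best[1]:
            best = candidate
    return best


def cluster_average(images: Sequence[Sequence[float]],
                    labels: Sequence[int], cluster: int) -> List[float]:
    """Pixel-wise mean of all images assigned to the given cluster."""
    members = [img for img, label in zip(images, labels) if label == cluster]
    if not members:
        raise ValueError("cluster has no members")
    return [sum(pixels) / len(members) for pixels in zip(*members)]


# Stand-in for a real K-means run: fixed results indexed by seed (illustrative).
fake_results = [([0, 1, 1], 5.0), ([0, 1, 0], 3.0), ([1, 1, 0], 4.0)]
best_labels, best_error = best_of_restarts(lambda seed: fake_results[seed],
                                           n_restarts=3)

# Three tiny two-pixel "images" (illustrative values).
images = [[0.0, 0.0], [1.0, 1.0], [0.2, 0.4]]
average_0 = cluster_average(images, best_labels, cluster=0)
print(best_labels, best_error, average_0)  # [0, 1, 0] 3.0 [0.1, 0.2]
```

Note that the clustering operates on the response vectors while the averaging operates on the original images, so the labels are the only link between the two.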


Figure 4.2: The average image for different clusters. (a) The average of all the positive training images used. (b) and (c) show the averages for the clusters obtained with the default trees classifier using the Euclidean distance, and (d) and (e) show the same results when using the Hamming distance. (f) and (g) show the averages for the clusters obtained with the cieluv trees classifier using the Euclidean distance, and (h) and (i) show the same results when using the Hamming distance. (j) and (k) show the cluster averages obtained with the default stumps classifier using Euclidean distance and (l) and (m) show the cluster averages when using the Hamming distance. (n) and (o) show the cluster averages obtained with the cieluv stumps classifier using the Euclidean distance and (p) and (q) show the same results when using the Hamming distance.

Figure 4.2 shows the results of clustering with K = 2 for different classifiers using Euclidean and Hamming distances. The clustering was applied to the 2416 images of the INRIA person training set. Figure 4.2a shows the average of these 2416 images. As can be seen, there is little difference in the clusters obtained using different classifiers or different distance metrics. The two clusters that are obtained correspond to lighter and darker images, with the cluster of darker images containing approximately 60% of all the images on average.

Figure 4.3 shows the results of clustering for K = 3. It can be seen that small differences begin to emerge between the clusters for different classifiers. The clusters produced with classifiers that use CIELUV features have average appearances that place more emphasis on colour, with Figures 4.3n, 4.3q, 4.3t, and 4.3w being noticeably more blue in colour than Figures 4.3b, 4.3e, 4.3h, and 4.3k. Also, Figures 4.3m, 4.3p, 4.3s, and 4.3v are slightly more red in colour than Figures 4.3a, 4.3d, 4.3g and 4.3j. The results are very similar regardless of whether the Euclidean distance or the Hamming distance is used for clustering. Once again, the majority of images belong to the cluster with the darker average appearance.

Figure 4.4 shows the results of clustering for K = 5. It can be seen that with this number of clusters, the results for different classifiers are now markedly different, and the distance metric used also affects the results. Some correspondences can be seen across the results, with all sets of clusters having a “dark” cluster, a “light” cluster, and a cluster representing images with people wearing clothes of a blue hue against a light background. However, many clusters are now unique to certain classifier and distance metric combinations. Figure 4.4a shows the cluster averages for the default trees classifier using the Euclidean distance. It can be seen that the central cluster seems to represent images of people against green backgrounds, such as natural environments with grass. Similar clusters appear when the Hamming distance is used, as shown in Figure 4.4b, although the central cluster is not as well defined. When the default stumps classifier is used, as shown in Figures 4.4c and 4.4d, the clusters are quite similar to those created using the default trees classifier, except that the cluster representing people against natural backgrounds is replaced by a less well defined cluster. When classifiers with CIELUV features are used, the average appearances of the clusters place a stronger emphasis on colour. It can be seen from Figures 4.4e, 4.4f, 4.4g and 4.4h that shades of blue and red are more prominent than in Figures 4.4a, 4.4b, 4.4c and 4.4d. It can also be seen that there is a cluster representing images of people against backgrounds with a brown hue. With CIELUV classifiers, the results are similar regardless of whether the Euclidean distance or the Hamming distance is used for clustering.

In this section, it has been shown that the output from a boosting classifier can be used to construct vectors αH(I) that can be used to cluster images of people into different groups based on the visual characteristics tested for by different features.

In document Detecting and tracking people in real-time (Page 89-93)