
Figure 5.13: Structure of our appearance model. The model has 2 parts and each part consists of a fixed grid. Each position on the grid stores a list of clusters. Each square block in this figure represents a cluster. Clusters (C) can be uniquely indexed by a part (i), position (j) and depth (k). The number of clusters (k) may vary with position (j).

5.6 RIGID COLOR PATCH BASED APPEARANCE MODELS

We introduce modified appearance models well suited to both textured and non-textured scenarios. Instead of MSCR features, the new appearance model is based on color patches on a fixed grid. As in the previous approach, the new generative appearance model (Section 5.6.1) combines information from a sampling of representative examples and is view independent.

The model (illustrated in Figure 5.13) consists of a fixed set of parts, e.g. upper and lower. Each part consists of a fixed grid. At each position on the grid, the model stores a list of clusters of the corresponding patches from the training images. These clusters are represented by their mean color and a weight. Note that we still base our appearance model on color, rather than on texture, as many objects of interest (such as an actor’s clothing) are stable in color, but not in textural details.

The probability of a candidate window containing the object is the product of probabilities of the occurrence of the parts, where the probability of each part is the sum over its positions. For each position, the probability of the corresponding observed image patch is based on the best matched cluster. The appearance model is constructed from the set of representative images, where each training image is first normalized to align with a fixed grid and then partitioned into fixed size patches. The patches at each position are then clustered as illustrated in Figure 5.14, yielding a single model combining all training examples (the model building procedure is detailed in Section 5.6.2). The initial constructed model is then updated based on the same training images to alleviate the background noise (as the training images are annotated using a simple rectangle and no segmentation information is used, significant background noise may appear in the actor models). The update procedure is explained in Section 5.6.3.
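The partitioning step described above (an image aligned to the grid, then cut into fixed size patches, each summarized by its mean color) can be sketched as follows. This is a minimal illustration only: the function name `patch_means`, the nested-list image layout, and the assumption that pixels are already given as Lab triples are ours, not part of the original method description.

```python
def patch_means(lab_image, patch_size):
    """Return a grid of mean colors, one per non-overlapping square patch.

    lab_image: list of rows, each row a list of (L, a, b) tuples
               (assumed to be converted to Lab beforehand).
    patch_size: side length of a patch in pixels (hypothetical parameter).
    Rows and columns that do not fill a whole patch are ignored.
    """
    rows = len(lab_image) // patch_size
    cols = len(lab_image[0]) // patch_size
    grid = []
    for r in range(rows):
        grid_row = []
        for c in range(cols):
            # Collect every pixel that falls inside patch (r, c).
            pixels = [
                lab_image[y][x]
                for y in range(r * patch_size, (r + 1) * patch_size)
                for x in range(c * patch_size, (c + 1) * patch_size)
            ]
            n = len(pixels)
            # Average each of the three channels independently.
            mean = tuple(sum(p[ch] for p in pixels) / n for ch in range(3))
            grid_row.append(mean)
        grid.append(grid_row)
    return grid


# Toy 4 x 6 image: uniform background with one colored 2 x 2 corner.
image = [[(0.0, 0.0, 0.0) for _ in range(6)] for _ in range(4)]
for y in range(2):
    for x in range(2):
        image[y][x] = (50.0, 10.0, -10.0)

grid = patch_means(image, 2)
```

With a patch size of 2, the toy image yields a 2 x 3 grid whose top-left entry is the colored corner and whose remaining entries are the background color.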

Once the model is updated, the detection is performed by sliding a window at multiple scales for each frame (described in Section 5.6.4). For each window, we estimate a probabilistic score of generating the object using the given appearance model. The detection is performed for each frame and each actor independently.

The pixels in a patch can take any of the values of the underlying clusters, and the various combinations of these values give the different appearances that the model can represent. We wish to learn the set of color-weight tuples whose mixture generates all the viewpoints and poses of the object we want to model.

The probability that the observed grid of color patches $B$ (aligned to model configuration and k = 1) was generated from the given model $C$ is a product over parts ($i$):

$$P = \prod_{i} P_{\mathrm{part}_i}, \tag{5.6}$$

where the probability of each part is the sum over all positions (j) in that part:

$$P_{\mathrm{part}_i} = \sum_{j} P_{\mathrm{patch}}(B_{ij}). \tag{5.7}$$

The probability for each patch is defined by the best matched cluster at the given position among the k possible choices.

For each position, only the best matched cluster is considered for scoring. $B_{ij}$ is the mean color of the patch at a given part ($i$) and position ($j$). The index $k$ represents the best matched cluster among the $k$ possibilities at a given part ($i$) and position ($j$). The term $|\mu_{ijk} - B_{ij}|$ is the Euclidean distance in Lab color space. The proposed model, like CBD, is also related to AND/OR graphs [ZCL08], taking products (AND) over mandatory parts and taking sums (OR) over optional positions per part. More specifically, it is an AND/OR graph with one AND node and multiple OR nodes.
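The product-of-sums structure of Equations 5.6 and 5.7 can be illustrated with a short sketch. The exact form of the per-patch probability is not reproduced in this text, so the code below assumes a hypothetical one: the weight of the nearest cluster multiplied by an exponential falloff in Lab distance, with an invented scale parameter `sigma`. Treating "best matched" as "nearest in Lab space" is likewise an assumption.

```python
import math


def patch_probability(clusters, observed, sigma=10.0):
    """Score one observed patch color against the clusters at one position.

    clusters: list of (mu, gamma) pairs, mu an (L, a, b) tuple, gamma a weight.
    ASSUMPTION: best match = nearest cluster; score = gamma * exp(-d / sigma).
    The source does not give this formula; sigma is a hypothetical parameter.
    """
    if not clusters:
        return 0.0
    mu, gamma = min(clusters, key=lambda c: math.dist(c[0], observed))
    return gamma * math.exp(-math.dist(mu, observed) / sigma)


def window_probability(model, observed_grid, sigma=10.0):
    """Equation 5.6: product over parts of Equation 5.7: sum over positions.

    model[i][j] is the cluster list of part i at position j;
    observed_grid[i][j] is the observed mean color B_ij.
    """
    prob = 1.0
    for part, observed_part in zip(model, observed_grid):
        part_prob = sum(
            patch_probability(clusters, b, sigma)
            for clusters, b in zip(part, observed_part)
        )
        prob *= part_prob
    return prob


# Toy model: two parts (upper, lower), two positions each.
upper = [[((50, 0, 0), 0.5)], [((60, 10, 10), 1.0), ((20, 0, 0), 0.25)]]
lower = [[((30, 5, 5), 0.75)], [((30, 5, 5), 0.5)]]
observed = [[(50, 0, 0), (20, 0, 0)], [(30, 5, 5), (30, 5, 5)]]

score = window_probability([upper, lower], observed)
```

In this toy case every observed color coincides with a cluster mean, so the falloff term is 1 and the score reduces to (0.5 + 0.25) × (0.75 + 0.5) = 0.9375. Because each part sums over positions, the result behaves as an unnormalized score rather than a probability bounded by 1.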

5.6.2 Initial Model Construction

The appearance model is built from a set of representative images of the object. We typically choose 8 examples uniformly sampled across all viewpoints, because they broadly cover an entire rotation in front of the camera (front, 3/4 front right, right, 3/4 back right, back, 3/4 back left, left, 3/4 front left). Instead of rectangles, we annotate each training image by drawing a single line. In the case of actors, the line is centered at the neck and extends to the top of the head (Figure 5.14(a)). We have found this form of annotation to be fast and accurate in locating parts.

Figure 5.14: (a) Manual annotation is done by drawing a line centered at the neck and extending to the top of the head. This line is then used to estimate the bounding box. (b) Each training image is normalized in size, segmented into blocks on a fixed size grid, and then merged into a single multi-view appearance model. (c) Our model consists of two different parts, i.e., the upper part and the lower part. Each part has a fixed grid of positions, each with a list of clusters of image blocks. Only the most frequent cluster at each position is shown, except for the position highlighted in green.

The initial construction procedure is illustrated in Figure 5.14(b). Given the set of training images, we normalize each image to the same scale and size. Each training image is individually partitioned into fixed size patches, and each patch is represented by the mean Lab color value $\mu = (L, a, b)$ of the pixels within it. The patches are merged by constrained agglomerative clustering, similar to the one used in CBD (Section 5.3.2); here, however, the clustering is applied at each position independently. The algorithm works in a stepwise manner: all patches in the first training image are initialized as singleton clusters. With each subsequent training image, a patch-to-patch comparison is made at each position. If the feature distance (Euclidean Lab color difference) is less than a threshold ($\theta$), the patches are merged and the weight is incremented; otherwise a new singleton cluster is initialized with a weight of 1. In this way each position on the grid is represented by multiple clusters with varying weights. At the end, the weights are normalized by the number of training images. Each cluster is represented by its mean color value ($\mu$) and its weight ($\gamma$). The resulting model is a multi-layer structure, where the number of layers may differ for each position in the grid, as shown in Figure 5.13.
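The stepwise clustering at a single grid position can be sketched as below. The text does not specify which cluster a patch joins when several lie within the threshold, nor how the cluster mean is updated on a merge; this sketch assumes the nearest cluster and a running mean. The function name `build_position_clusters` is hypothetical.

```python
import math


def build_position_clusters(patch_colors, theta):
    """Cluster the patches seen at ONE grid position across training images.

    patch_colors: one mean Lab color (L, a, b) per training image, in order.
    theta: merge threshold on Euclidean Lab distance.
    Returns a list of (mu, gamma) pairs with gamma normalized by image count.
    ASSUMPTIONS: merge into the nearest cluster; update mu as a running mean.
    """
    clusters = []  # each entry: {"mu": [L, a, b], "count": int}
    for color in patch_colors:
        best, best_dist = None, theta
        for cluster in clusters:
            dist = math.dist(cluster["mu"], color)
            if dist < best_dist:
                best, best_dist = cluster, dist
        if best is None:
            # No cluster within theta: start a new singleton with weight 1.
            clusters.append({"mu": list(color), "count": 1})
        else:
            # Merge: update the running mean and increment the weight.
            n = best["count"]
            best["mu"] = [(m * n + x) / (n + 1) for m, x in zip(best["mu"], color)]
            best["count"] = n + 1
    n_images = len(patch_colors)
    return [(tuple(c["mu"]), c["count"] / n_images) for c in clusters]


# Four training images: three similar patches and one outlier.
colors = [(50, 0, 0), (52, 0, 0), (20, 30, 30), (51, 0, 0)]
position_model = build_position_clusters(colors, theta=5.0)
```

Here the three similar patches collapse into one cluster with mean L = 51 and weight 0.75, while the outlier forms its own cluster with weight 0.25. Running the function once per grid position gives the multi-layer structure described above, with a possibly different number of layers at each position.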

5.6.3 Update

(h) Biff (i) Happy (j) Miss Forsyth (k) Girl

Figure 5.15: Patch based appearance models of 6 actors from two different theatre sequences. Two of the actors (Biff and Happy) appear in both sequences; their appearance models were rebuilt for the new appearances. Only the highest weighted patch of each position on the grid is shown for visualization purposes.

The initial appearance model has a list of clusters for each model position and their corresponding weights. Our method tunes these cluster weights by performing detection over the same training images. This step reinforces the clusters in under-represented viewpoints and potentially reduces the background clutter in the model. For each detection, only a subset of the clusters from the appearance model is considered for scoring. Let $C_m$ be the matched set of clusters at the MAP location (maximum a posteriori, i.e. the highest scoring location) in the given training image and $C_g$ be the matched set of clusters at the ground truth location. If the ground truth location is the same as the MAP location, no update is performed. But if the two locations are different, we update the cluster weights as follows:

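$$\gamma_{ijk} = \gamma_{ijk} + \begin{cases} -\lambda_0 & \text{if } [i, j, k] \in C_m \text{ and } [i, j, k] \notin C_g, \\ +\lambda_0 & \text{if } [i, j, k] \in C_g \text{ and } [i, j, k] \notin C_m. \end{cases}$$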

The parameter $\lambda_0$ is a positive constant. The cluster weights are kept positive: a weight is reduced only if its current value is greater than $\lambda_0$. The basic idea is that, when the MAP location and the ground truth location differ, we penalize the clusters that were matched at the MAP location but not at the ground truth, and reinforce the clusters that were matched at the ground truth but not at the MAP location. This update is performed iteratively for each training image until convergence, i.e., until the MAP location in each training image is the same as the ground truth location. The background clutter is significant in the patch based approach because we do not use any segmentation information while annotating the training images (this was not the case in CBD, where the background blobs were easily filtered), and the update considerably improves the detection performance. Updated appearance models of six actors from two different sequences are illustrated in Figure 5.15.
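A single application of the weight update can be sketched as follows. The dictionary and set representation and the function name `update_weights` are illustrative choices, not taken from the original implementation; the rule itself follows the description above, including the guard that keeps weights positive.

```python
def update_weights(weights, matched_map, matched_gt, lam):
    """Apply one weight update for a single training image.

    weights: dict mapping (i, j, k) -> gamma (hypothetical storage layout).
    matched_map: set of (i, j, k) clusters matched at the MAP location (C_m).
    matched_gt: set of (i, j, k) clusters matched at the ground truth (C_g).
    lam: the positive constant lambda_0.
    Returns a new dict; the input is left unchanged.
    """
    updated = dict(weights)
    # Penalize clusters matched only at the (wrong) MAP location,
    # but only when the weight stays positive after the reduction.
    for key in matched_map - matched_gt:
        if updated[key] > lam:
            updated[key] -= lam
    # Reinforce clusters matched only at the ground truth location.
    for key in matched_gt - matched_map:
        updated[key] += lam
    return updated


weights = {(0, 0, 0): 0.5, (0, 0, 1): 0.25, (0, 1, 0): 0.05, (1, 0, 0): 0.5}
c_m = {(0, 0, 0), (0, 1, 0), (1, 0, 0)}  # matched at the MAP location
c_g = {(0, 0, 1), (1, 0, 0)}             # matched at the ground truth
new_weights = update_weights(weights, c_m, c_g, lam=0.1)
```

In this example cluster (0, 0, 0) is penalized, cluster (0, 0, 1) is reinforced, cluster (0, 1, 0) is left alone because its weight does not exceed the constant, and cluster (1, 0, 0) is unchanged because it appears in both sets. When the MAP and ground truth locations coincide, the two sets are identical and the function changes nothing, which matches the no-update case.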

5.6.4 Detection

After the model is updated, the detection is performed for each object/actor independently in each individual frame. This process is illustrated in Figure 5.16.

Given an input image as in Figure 5.16(a), it is first partitioned into patches as shown in Figure 5.16(b). The appearance model is normalized to the corresponding patch size and a sliding window operation is performed. Detection is performed at multiple scales by partitioning the input image into patches of different sizes. For each given detection window with center (x, y) and scale (w), we calculate the detection score using Equation 5.6. For each patch in the