6.2 Auto-annotation via Semantic Propagation in Sub-space
7.1.2 The Algorithm
7.1.3.2 Results on Region Based Image Annotation
In this section, we compare the effectiveness of the image based feature space (IBFS) approach in region based image annotation with other approaches. For a fair comparison, we used the same data-set as that used in the experiments of Duygulu et al. (2002) and Yang et al. (2005) on region based image annotation. The data-set contains 5000 images from 50 Corel Stock Photo CDs, and has been divided into a training set of 4500 images and a test set of 500 images. Each image has been manually annotated with 1-5 keywords.
Chapter 7 The Image Based Feature Space Model 93
(a) Segmentation samples of images containing “Sidewalk”
(b) Segmentation samples of images containing “Boat”
(c) Segmentation samples of images containing “Car”
Figure 7.5: Segmentation samples of images from the Washington data-set (University of Washington, 2004).
Track Stand Football Field Track 5 30 7 7 36 1 Stand 7 30 5 9 34 1 Football Field 1 36 7 1 34 9
Figure 7.6: The number of times words “Track”, “Stand” and “Football Field” occur together and separately.
Figure 7.7: Samples of colour images containing “Flower” and their counterpart gray images from the Washington data-set (University of Washington, 2004). The top row shows RGB colour images, and the bottom row shows the corresponding gray images.
Images are segmented by Normalised Cut (Shi and Malik, 2000), which generates 5-10 regions per image. Each region is then represented by a 30-dimensional feature vector, including region average colour, size, location, average orientation energy and so on, as described in Duygulu et al. (2002). Feature vectors are normalised to Z-Scores before distances are measured in the image based space. Specifically, suppose the whole set of training feature vectors is $V_1, V_2, \ldots, V_n$, where $n$ is the total number of training image segments and $V_i = \{V_{i1}, V_{i2}, \ldots, V_{i30}\}$. We calculate the Z-Score for the $j$th dimension of the $i$th vector as follows:

$$Z_{ij} = \frac{V_{ij} - \mathrm{mean}(V_{1j}, V_{2j}, \ldots, V_{nj})}{\mathrm{standard\ deviation}(V_{1j}, V_{2j}, \ldots, V_{nj})} \qquad (7.8)$$
Feature vectors of the test set are also normalised, using the mean and standard deviation of the training vectors.
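The normalisation in Equation (7.8) can be illustrated with a minimal sketch in plain Python. The function names `fit_zscore` and `apply_zscore` are hypothetical. The use of the population standard deviation and the handling of constant dimensions are assumptions, since the text does not specify them.

```python
from statistics import mean, pstdev


def fit_zscore(train_vectors):
    """Return the per-dimension means and standard deviations of the training vectors."""
    columns = list(zip(*train_vectors))  # one tuple per feature dimension j
    means = [mean(col) for col in columns]
    # Assumption: population standard deviation; the source does not say which form is used.
    stds = [pstdev(col) for col in columns]
    return means, stds


def apply_zscore(vectors, means, stds):
    """Apply Equation (7.8) to each vector using previously fitted statistics."""
    normalised = []
    for vec in vectors:
        # Assumption: a dimension with zero spread is mapped to 0.0 to avoid division by zero.
        normalised.append(
            [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(vec, means, stds)]
        )
    return normalised


# Toy data with two dimensions instead of the 30 used in the experiments.
train = [[1.0, 10.0], [3.0, 10.0]]
test = [[5.0, 12.0]]
means, stds = fit_zscore(train)
train_z = apply_zscore(train, means, stds)
test_z = apply_zscore(test, means, stds)  # test vectors reuse the training statistics
```

Fitting the statistics on the training vectors only, and reusing them for the test vectors, keeps both sets on the same scale without letting test data influence the feature space.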
In order to find the optimal value for the threshold t in Equation (7.2), 500 randomly selected images are held out of the training set for evaluation, and the model is trained on the remaining 4000 images. Based on the average per-word precision and recall that are calculated on the evaluation
(a) Water (b) Bush
(c) Football Field (d) Building
(e) Clear Sky (f) Leafless Tree
Figure 7.8: Some good results of representative regions found by the image based feature space approach for the corresponding words.
Keywords               Our Method   Random
Tree                   21           3
Building               22           0
People                 21           1
Bush                   25           3
Grass                  6            1
Clear Sky              19           0
Water                  25           2
Sky                    19           3
Cloudy Sky             21           3
Flower                 8            2
Sidewalk               2            0
Rock                   6            2
Ground                 6            3
Overcast Sky           0            3
Partially Cloudy Sky   20           4
Hill                   0            2
Mountain               5            1
Stadium                22           0
Football Field         20           1
Car                    7            1
Track                  2            0
Pole                   1            0
Stand                  0            1
Leafless Tree          24           1
Boat                   0            0
Table 7.1: The number of correct segments out of the top 25 for our method and random choice.
set, the best performance is found at t = 0.88 (searched with a step of 0.01). Thereafter, the 4500 training images are used to build the feature space, into which the test image segments are mapped. For each test image segment, the word with the highest probability is chosen as the region label. Figure 7.10 shows some good examples, while Figure 7.11 shows some bad examples. Note that for some regions, the coordinates on all dimensions of the space are zero (i.e. m(It) = [0, 0, ..., 0]) after applying the quantisation in Equation (7.2). This implies that the IBFS technique regards such regions as not similar to any of the available training image regions. In such cases, we assign the word “null” to these regions, indicating that the approach cannot decide which word to attach.
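The threshold search and the “null” fallback described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the experiments: `score_fn` is a hypothetical callback that trains on the 4000 images, evaluates on the 500 held-out images and returns a single score, and the search range of 0 to 1 is assumed, since the text gives only the step size.

```python
def select_threshold(score_fn, start=0.0, stop=1.0, step=0.01):
    """Return the candidate threshold t with the highest held-out score.

    score_fn is a hypothetical callback: it takes t and returns one number
    summarising precision and recall on the evaluation set. How the two
    measures are combined is not specified in the text.
    """
    n_steps = round((stop - start) / step)
    # Rounding keeps the candidates on a clean 0.01 grid despite float error.
    candidates = [round(start + i * step, 2) for i in range(n_steps + 1)]
    return max(candidates, key=score_fn)


def label_region(mapped, word_probs):
    """Pick the most probable word, or "null" when the mapped vector is all zeros."""
    if not any(mapped):
        # The region is not similar to any training region after quantisation.
        return "null"
    return max(word_probs, key=word_probs.get)


# Toy usage: a made-up score that happens to peak at 0.88.
best_t = select_threshold(lambda t: -abs(t - 0.88))
label_a = label_region([0, 0, 0], {"sky": 0.5})
label_b = label_region([0, 1, 0], {"sky": 0.2, "water": 0.7})
```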
In order to compare with other image annotation techniques, we predict image based labels by choosing words from the region labels. For each test image, the five region labels with the highest probabilities are chosen. We compare our approach with two region based image annotation approaches mentioned in Section 7.1.1, namely the Machine Translation Model (Duygulu et al., 2002) and the Point-wise Diverse Density Model (Yang et al., 2005). For a direct comparison, average per-word precision and recall are calculated, as described in Section 4.3, for the whole set of words and the best
(a) Car (b) Boat
(c) Sidewalk (d) Flower
Figure 7.9: Some bad results of representative regions found by the image based feature space approach for the corresponding words.
Figure 7.10: Some good examples of image region annotation through the image based feature space approach.
Figure 7.11: Some bad examples of image region annotation through the image based feature space approach.
             All Words             Best 49 Words
Approach     Avg. pr.   Avg. re.   Avg. pr.   Avg. re.
MT           0.04       0.06       0.20       0.34
PWDD         0.07       0.09       0.31       0.46
IBFS         0.10       0.11       0.39       0.51
Table 7.2: Performance comparison of Machine Translation Model (MT), Point-wise Diverse Density Model (PWDD) and Image Based Feature Space Model (IBFS).
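49 words (for comparison with the work by Duygulu et al. (2002) and Yang et al. (2005)). As shown in Table 7.2, the image based feature space (IBFS) method achieves higher average per-word precision and recall than both the MT and PWDD models on the same data-set, for the whole set of words as well as for the best 49 words.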
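The image-level prediction and the evaluation measures can be illustrated with a short sketch. The per-word precision and recall follow the definition commonly used with this data-set, which is assumed here to match Section 4.3. Dropping “null” labels and merging duplicate words are also assumptions, and all function names are hypothetical.

```python
from statistics import mean


def image_labels(region_predictions, k=5):
    """Choose image-level words from the k most probable region labels.

    region_predictions holds one (word, probability) pair per region.
    """
    ranked = sorted(region_predictions, key=lambda wp: wp[1], reverse=True)[:k]
    labels = []
    for word, _ in ranked:
        # Assumption: "null" labels are ignored and repeated words are kept once.
        if word != "null" and word not in labels:
            labels.append(word)
    return labels


def per_word_precision_recall(predicted, truth, vocabulary):
    """Average per-word precision and recall over the given vocabulary.

    predicted and truth are lists of word sets, one set per test image.
    """
    precisions, recalls = [], []
    for w in vocabulary:
        n_pred = sum(w in p for p in predicted)   # images predicted with w
        n_true = sum(w in t for t in truth)       # images truly annotated with w
        n_correct = sum(w in p and w in t for p, t in zip(predicted, truth))
        precisions.append(n_correct / n_pred if n_pred else 0.0)
        recalls.append(n_correct / n_true if n_true else 0.0)
    return mean(precisions), mean(recalls)


# Toy usage with two images and a two-word vocabulary.
labels = image_labels([("sky", 0.9), ("null", 0.8), ("sky", 0.7), ("water", 0.6)])
avg_pr, avg_re = per_word_precision_recall(
    predicted=[{"sky", "water"}, {"sky"}],
    truth=[{"sky"}, {"water"}],
    vocabulary=["sky", "water"],
)
```

Averaging per word rather than per image gives rare words the same weight as frequent ones, so a method cannot score well by predicting only the most common keywords.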
7.1.4 Summary
A novel feature space based on training images has been proposed, together with a procedure for mapping both image segments and textual labels into it for region based image annotation. Segments associated with the same object are expected to cluster together, and also to lie close to the label that represents the object in question. As a result, the relationships between image regions and words can be discovered by comparing their distances in the feature space. Annotating new image segments and images is also straightforward. Regional labels for test images are predicted by choosing the words that are closest in the space. Furthermore, the region labels that are most likely to be correct are adopted as the global labels for the whole image.