2.4 EXPERIMENTS
2.4.3 CORRELATIONS TO CHANGE-BLINDNESS
2.4.3.2 CORRELATE ALGORITHM OUTPUT WITH REACTION TIME
As the consequence of a complex cognitive process, the reaction time of a subject in a change-blindness experiment is influenced by many factors. Here we correlate such reaction times with various measures derived from the original image and its modified version.
First, reaction times are compared with the saliency of the modified objects. For a good saliency algorithm, we expect the saliency value of an object to be inversely correlated with the reaction time: the more salient an object is, the more easily a subject can spot it and thus detect its removal. The saliency value of a removed object is computed as the mean (or sum) of the pixel intensities over the object region in the saliency map of the original image.
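The object-level saliency score described above can be sketched as follows. This is a minimal illustration, not the original implementation: the function name `object_saliency`, the boolean-mask representation of the object region, and the toy arrays are all assumptions made for the example.

```python
import numpy as np

def object_saliency(saliency_map, object_mask, reduce="mean"):
    """Score a removed object by the saliency-map values inside its region.

    saliency_map: 2-D array, saliency map of the ORIGINAL image.
    object_mask:  2-D boolean array, True on the removed object's pixels
                  (hypothetical representation of the object region).
    reduce:       "mean" or "sum" of the saliency values in the region.
    """
    saliency_map = np.asarray(saliency_map, dtype=float)
    object_mask = np.asarray(object_mask, dtype=bool)
    if saliency_map.shape != object_mask.shape:
        raise ValueError("saliency map and mask must have the same shape")
    values = saliency_map[object_mask]
    if values.size == 0:
        raise ValueError("object mask is empty")
    if reduce == "mean":
        return float(values.mean())
    if reduce == "sum":
        return float(values.sum())
    raise ValueError("reduce must be 'mean' or 'sum'")

# Toy example: a 3x3 saliency map with a two-pixel object.
saliency = np.array([[0.1, 0.2, 0.0],
                     [0.8, 0.6, 0.0],
                     [0.0, 0.0, 0.0]])
mask = np.array([[0, 0, 0],
                 [1, 1, 0],
                 [0, 0, 0]], dtype=bool)
mean_score = object_saliency(saliency, mask, "mean")
sum_score = object_saliency(saliency, mask, "sum")
```

The mean is insensitive to object size, whereas the sum grows with the area of the object, so the two variants can rank objects differently.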
Second, reaction times are compared to the Hamming distance (Eq. 2.4) between the image signature descriptor of the original image and that of the modified image. As described in Sec. 2.2.2, this distance is a sensitive one when images share a background, as they do in the case of
a change-blindness pair. The distance between the descriptors should be related to the extent of difference in their salient, or foreground, regions.
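A signature distance of this kind can be sketched as below. The sketch assumes that the image signature is the element-wise sign of the 2-D DCT of a grayscale image and that the Hamming distance counts the coefficients whose signs differ; both definitions belong to earlier sections that are not reproduced here, so treat them as assumptions. The helper names and the random toy image are hypothetical.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n)[:, None]   # frequency index (rows)
    i = np.arange(n)[None, :]   # sample index (columns)
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    mat[0, :] = np.sqrt(1.0 / n)
    return mat

def image_signature(gray):
    """Sign of the 2-D DCT coefficients (assumed signature definition)."""
    gray = np.asarray(gray, dtype=float)
    rows, cols = gray.shape
    coeffs = dct_matrix(rows) @ gray @ dct_matrix(cols).T
    return np.sign(coeffs)

def hamming_distance(sig_a, sig_b):
    """Number of positions at which two signatures disagree."""
    return int(np.count_nonzero(sig_a != sig_b))

# Toy change-blindness pair: same background, one patch altered.
rng = np.random.default_rng(0)
original = rng.random((32, 32))
modified = original.copy()
modified[8:16, 8:16] = 0.0   # hypothetical "removed object"

d = hamming_distance(image_signature(original), image_signature(modified))
```

In practice the images would be resized to a small fixed resolution before the transform, so that distances are comparable across pairs.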
Third, the widely used GIST descriptor [9] is used to describe each image in a change-blindness pair, and reaction times are compared to the GIST distance. It was shown in [39] that perceptually similar images are usually close together in GIST descriptor space. GIST uses 8 orientations and 4 scales for each 4×4 grid of an RGB color channel, mapping an image to an 8×4×16×3 = 1536-dimensional real-valued descriptor.
Lastly, we use the pixel-wise distances between the images in the change-blindness pair and compare these with reaction times. We use two pixel-wise measures: the $\ell_0$ and $\ell_2$ distances between the original and modified images. The $\ell_0$ distance is exactly equal to the size of the modified area. Let $h_i$ be the log reaction times of the $i$th subject (a vector with a component for each image in the dataset), and let $v$ be the image pair distances according to one of the methods described above; then the normalized correlation $c$ is given by correlating $v$ with each $-h_i$, normalizing by the mean inter-subject correlation, and averaging over 9 subjects:
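$$c = \frac{1}{9} \sum_{i=1}^{9} \frac{\operatorname{corr}(-h_i, v)}{\mathbb{E}_{j \neq i}\left[\operatorname{corr}(h_j, h_i)\right]} \qquad (2.20)$$

[Bar chart. Vertical axis: normalized correlation (0 to 0.7). Methods on the horizontal axis: AWS, AIM, SIG, DVA, GBVS, SUN, SIG-Hamming, ITTI, GIST, Pixel-L2, Pixel-L0. Bar labels in extraction order, not matched to methods: 0.563, 0.368, 0.225, 0.104, 0.510, 0.404, 0.343, 0.389, 0.387, 0.286.]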
Figure 2.15: The average normalized correlation between reaction time and algorithm outputs. For the first 9 saliency algorithms, the left bar is the result using the mean pixel value of the object region, whereas the right bar is the result using the sum of pixel saliency values (object size is variable). The score above each pair is the larger of the two correlation values.
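The normalized correlation of Eq. 2.20 can be sketched as follows. The sketch assumes that corr denotes the Pearson correlation coefficient and that the reaction times are stored as one row per subject; the function name and the toy data are hypothetical.

```python
import numpy as np

def normalized_correlation(log_rt, distances):
    """Average over subjects of corr(-h_i, v) / mean_{j != i} corr(h_j, h_i).

    log_rt:    array of shape (n_subjects, n_images), log reaction times h_i.
    distances: array of shape (n_images,), image-pair distances v.
    Pearson correlation is assumed for corr.
    """
    log_rt = np.asarray(log_rt, dtype=float)
    v = np.asarray(distances, dtype=float)
    n_subjects = log_rt.shape[0]
    subject_corr = np.corrcoef(log_rt)   # pairwise correlations between subjects
    ratios = []
    for i in range(n_subjects):
        numerator = np.corrcoef(-log_rt[i], v)[0, 1]
        others = np.delete(subject_corr[i], i)   # corr(h_j, h_i) for j != i
        ratios.append(numerator / others.mean())
    return float(np.mean(ratios))

# Toy data: three perfectly consistent subjects, and a distance that
# decreases exactly as reaction time increases.
base = np.array([0.2, 0.9, 0.4, 1.5, 1.1])
log_rt = np.stack([base, 2.0 * base + 1.0, 0.5 * base - 0.3])
c_best = normalized_correlation(log_rt, -2.0 * base + 3.0)
c_worst = normalized_correlation(log_rt, base)
```

Dividing by the mean inter-subject correlation puts the score on a scale where 1 roughly corresponds to predicting a subject as well as the other subjects do.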
The results are summarized in Fig. 2.15. Among all 10 algorithms, the Hamming distance between Image Signature descriptors correlates best with reaction times. That is, among the methods tried here, the perceptual distance between change-blindness pairs is best explained by the image signature descriptor. Given our understanding of the connection between foreground information and the signature, a difficult change-blindness trial is likely one in which the removed object is perceived as part of the background, because in such a trial, we expect a small signature distance.
2.4.4 IMAGE SIGNATURE AND FACE ORIENTATION
To further illustrate the Image Signature as a compact descriptor of the image, we use the FERET face database [11] as the corpus for analysis. This database contains 1400 images of 200 individuals. For each individual, 7 different images were taken, of which 5 involve a change in head orientation (−20°, −10°, 0°, 10°, and 20°, respectively), and the 2 other images, taken in the 0° pose, contain facial expression and illumination changes. In our experiments, these two images are considered front-face (0°).
We split the dataset into 700 training images, whose labels are available to the algorithm, and 700 testing images, whose labels are estimated using the K-NN algorithm. The core idea of this experiment is to illustrate the neighborhood structure of the Image Signature defined by Eq. 2.4. By choosing K = 20 and using majority voting to determine the head orientation of a testing image, this simple algorithm achieved 98.86% accuracy. That is, only around 8 of the 700 images in the test set were misclassified. To the best of our knowledge, the best available result on the FERET database is that of [40], with an accuracy of 97%, which means over 20 misclassifications.
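The classification step can be sketched as a K-NN majority vote under the Hamming distance. This is an illustrative sketch rather than the code used in the experiment: it assumes each signature has been flattened to a vector of ±1 values, and the function name and toy data are hypothetical.

```python
import numpy as np
from collections import Counter

def knn_predict(train_sigs, train_labels, query_sig, k=20):
    """Majority vote over the k training signatures nearest in Hamming distance.

    train_sigs:   array of shape (n_train, d), flattened signatures.
    train_labels: sequence of n_train labels (e.g. head orientation in degrees).
    query_sig:    array of shape (d,), flattened signature of the test image.
    """
    train_sigs = np.asarray(train_sigs)
    query_sig = np.asarray(query_sig)
    dists = np.count_nonzero(train_sigs != query_sig, axis=1)
    nearest = np.argsort(dists, kind="stable")[:k]
    votes = Counter(train_labels[j] for j in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two prototype signatures, each with five one-bit perturbations.
d = 16
proto_front = np.ones(d)    # stands in for a 0-degree pose
proto_side = -np.ones(d)    # stands in for a 20-degree pose
train, labels = [], []
for j in range(5):
    a = proto_front.copy(); a[j] *= -1
    b = proto_side.copy(); b[j] *= -1
    train += [a, b]
    labels += [0, 20]

query = proto_front.copy()
query[[0, 1]] *= -1          # noisy front-pose signature
predicted = knn_predict(np.array(train), labels, query, k=3)
```

Ties in the vote are broken in favor of the label encountered first among the nearest neighbors, which is one possible convention among several.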
By comparing the distance metric of the Image Signature against that of another well-known descriptor, GIST, we can obtain even more interesting results. As shown in Fig. 2.16, the difference between the Signature and GIST is prominent: on the one hand, the Signature neighborhood is much more consistent in head perspective than the GIST neighborhood; on the other hand, GIST is much more successful in extracting identity information. This result suggests that GIST captures identity information, whereas the Signature captures perspective information.
[Figure 2.16 panels: example nearest-neighbor rows labeled SIG and GIST; two plots, "K-NN for identity classification" and "K-NN for perspective classification", each showing the wrong label number (vertical axis) against neighborhood size (horizontal axis, 0 to 20) for Signature and GIST.]
Figure 2.16: The neighbors of faces under different metric functions. In each row, the blue image is the query image, followed by its 10 nearest neighbors. Red squares signal mismatches, which depend on the task (e.g., if the task is perspective classification, faces with a different identity but the same orientation are considered correct).
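A neighborhood comparison of this kind can be sketched by counting, for each neighborhood size, how many neighbors carry a label different from the query's. The exact definition of the "wrong label number" in the plots is not stated, so the count below (summed over all queries, with each query excluded from its own neighborhood) is an assumption, as are the function name and the toy data.

```python
import numpy as np

def wrong_neighbor_counts(dist, labels, max_k):
    """Total number of mismatched neighbors for k = 1..max_k.

    dist:   (n, n) pairwise distance matrix under some descriptor metric.
    labels: length-n labels for the task (identity or perspective).
    max_k:  largest neighborhood size, at most n - 1.
    """
    dist = np.asarray(dist, dtype=float).copy()
    labels = np.asarray(labels)
    np.fill_diagonal(dist, np.inf)   # a query is not its own neighbor
    order = np.argsort(dist, axis=1, kind="stable")[:, :max_k]
    mismatch = labels[order] != labels[:, None]   # shape (n, max_k)
    # Cumulative mismatches per query, then summed over all queries.
    return mismatch.cumsum(axis=1).sum(axis=0)

# Toy example: four points on a line forming two well-separated groups.
positions = np.array([0.0, 1.0, 10.0, 11.0])
dist = np.abs(positions[:, None] - positions[None, :])
group = np.array(["a", "a", "b", "b"])
counts = wrong_neighbor_counts(dist, group, max_k=2)
```

Passing the same distance matrix with identity labels and then with perspective labels gives the two curves for one descriptor; repeating this with a second descriptor's distance matrix gives the comparison.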