Average G and Average M Measures - Effect of the Resolution

Chapter 6 Retrieval-Based Evaluation Experiment

6.4 Effect of the Resolution

6.5.1 Average G and Average M Measures

Figure 6.2 presents the average G (a) and average M (b) measures derived using the 51 feature sets when the multi-resolution scheme is considered. The highest average G and M measures, 0.41 and 0.27, were obtained using MRSAR [Mao and Jain, 1992] when 60 textures are retrieved. Specifically, under the same conditions, the average G measure is greater than 0.41/0.43 when over 14/15 relevant textures are in the same order, and exceeds 0.40/0.42 when 15/16 textures are in the opposite order. The average G of 0.41 suggests that, for some query textures, fewer than 16 textures are relevant regardless of their order. On the other hand, the average M measure is over 0.27 when one or more relevant textures are in the same order or at least 22 relevant textures are in the opposite order. In addition, if we ignore the difference between the computational and perceptual ranked lists of relevant textures, the average proportion of relevant textures among the 60 retrieved, i.e. the average Precision (see Equation (2.2)), is 0.48. This means that only about 29 (≈ 0.48 × 60) of the retrieved textures are relevant when 60 textures are retrieved. To summarise, even when considering only the best performance, (1) no more than one half of the retrieved textures are relevant, and (2) the orders of the relevant textures differ between the computational and perceptual rankings. Therefore, even the best performance obtained here is disappointing.
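The Precision arithmetic above can be checked with a short sketch. It assumes the usual reading of Precision as the fraction of retrieved items that are relevant, which matches the description in the text; Equation (2.2) itself is not reproduced here, and the function name is purely illustrative.

```python
def expected_relevant(precision: float, n_retrieved: int) -> float:
    """Number of relevant items implied by an average Precision value.

    Assumes Precision = (relevant retrieved) / (retrieved).
    """
    return precision * n_retrieved


# Values quoted in the text: average Precision 0.48 at N = 60.
count = expected_relevant(0.48, 60)
nearest_whole = round(count)           # about 29 textures
below_half = count < 0.5 * 60          # fewer than half of the 60 retrieved
print(nearest_whole, below_half)
```

Under this assumption, roughly 29 of the 60 retrieved textures are relevant, which is consistent with observation (1) in the summary.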


Figure 6.2: Stacked bar charts of the average G (a) and M (b) measures obtained using the 51 feature sets when only the multi-resolution scheme is considered, compared with 8D-ISO. Each bar shows four colour-coded results, one for each retrieval set size N ∈ {10, 20, 40, 60}.

6.5.2 “Failed” and “Relatively-Successful” Textures

The textures associated with two extreme cases, which we term “failed” and “relatively-successful”, can provide more insight. We consider “failed” textures to be those that cannot be accurately retrieved using the majority of the 51 feature sets. The “relatively-successful” textures are defined as those that are retrieved better than the other textures using the majority of the 51 feature sets. Since the multi-resolution scheme was regarded as the optimal one, it is used in this investigation. In addition, only N = 10 retrievals are considered.

It follows from Equations (2.10) and (2.12) that both G and M equal 0 when no relevant texture is retrieved for a query texture. In this case, the query texture is termed a “failed” texture. In addition, if the G or M measure is small, the query texture can also be regarded as “failed”. The worst average G and average M measures (0.02 and 0.01) obtained using multi-resolution when the top 10 textures are retrieved are used to threshold all G and M measures. For each feature set, if the G/M measure obtained on a query texture is greater than 0.02/0.01, this texture is left out. The occurrence frequencies of all remaining query textures obtained using the 51 feature sets are then accumulated over the 334 textures. In essence, the occurrence frequency of a texture is the number of feature sets for which it remains. A threshold T_n = 30 is then applied to the occurrence frequencies: the textures whose occurrence frequencies are over T_n are taken as “failed” textures. Figures 6.3 (a) and (b) present the top 15 “failed” textures, each failing for no less than 43 feature sets, selected using the G and M measures.
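The selection procedure described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiment: the nested-dictionary layout, the function name `select_by_frequency`, the feature-set and texture labels, and all score values are hypothetical toy choices. The same routine covers the opposite case (keeping high scores) through the `keep_low` flag.

```python
from collections import Counter


def select_by_frequency(scores, threshold, t_n, keep_low=True):
    """Select textures by how many feature sets keep them after thresholding.

    scores   : {feature_set: {texture_id: measure}}  (hypothetical layout)
    threshold: per-query cut-off on the G or M measure
    t_n      : a texture is selected if its count is strictly over t_n
    keep_low : True keeps measures <= threshold ("failed" case);
               False keeps measures >= threshold (opposite case)
    """
    counts = Counter()
    for per_texture in scores.values():
        for texture, value in per_texture.items():
            kept = value <= threshold if keep_low else value >= threshold
            if kept:
                counts[texture] += 1  # one more feature set keeps this texture
    return sorted(t for t, c in counts.items() if c > t_n)


# Toy scores for three made-up feature sets; NOT results from the experiment.
toy_scores = {
    "fs_a": {"t1": 0.00, "t2": 0.01, "t3": 0.30},
    "fs_b": {"t1": 0.02, "t2": 0.05, "t3": 0.25},
    "fs_c": {"t1": 0.01, "t2": 0.00, "t3": 0.10},
}

failed = select_by_frequency(toy_scores, threshold=0.02, t_n=1)
successful = select_by_frequency(toy_scores, threshold=0.23, t_n=1, keep_low=False)
print(failed, successful)
```

With the values from the text, the call would use 51 feature sets, 334 textures, a threshold of 0.02 (G) or 0.01 (M), and T_n = 30.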

The “relatively-successful” textures were also selected by thresholding the G/M values. The best average G and average M measures (0.23 and 0.20) obtained when the top 10 textures are retrieved using the multi-resolution approach are used to threshold all G and M measures obtained using the 51 feature sets under the same conditions. For each feature set, if the G/M measure obtained on a query texture is less than 0.23/0.20, this texture is left out. We then accumulate, over the 334 textures, the occurrence frequencies of the remaining textures obtained using the 51 feature sets. Again, T_n = 30 is used to threshold the occurrence frequencies. The textures whose occurrence frequencies are over T_n are considered “relatively-successful” textures. Figures 6.4 (a) and (b) display the top 15 “relatively-successful” textures, each selected for at least 37 feature sets, using the G and M measures.

It can be observed from Figure 6.3 that the majority of the 51 sets of computational features are unable to accurately capture perceptual rankings between aperiodic textures (e.g. “040”, “312”, “148”, “131” and “034”), although some of these textures are also well-ordered (e.g. “148” and “034”). However, these feature sets are better able to encode perceptual rankings between periodic (regular) or nearly-periodic textures (see Figure 6.4, e.g. “168”, “171”, “172”, “121”, “061” and “308”). In fact, whether a texture is periodic or nearly-periodic, it shows strong periodicity, which is normally associated with power spectra. In contrast, aperiodic (structural or stochastic) textures are believed to be encoded by the phase information [Oppenheim and Lim, 1991]. This suggests that the 51 feature sets exploit power spectra more than phase spectra.
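The distinction between power and phase spectra can be illustrated with a one-dimensional toy example. This sketch is an assumption-laden illustration only: it uses a naive discrete Fourier transform on short synthetic sequences rather than the two-dimensional texture images or any of the 51 feature sets, and all function and variable names are hypothetical.

```python
import cmath
import math


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real 1-D sequence."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def power_and_phase(signal):
    """Return the power spectrum |F|^2 and the phase spectrum arg(F)."""
    spectrum = dft(signal)
    power = [abs(c) ** 2 for c in spectrum]
    phase = [cmath.phase(c) for c in spectrum]
    return power, phase


n = 32

# A periodic signal: its power concentrates in two frequency bins.
periodic = [math.cos(2 * math.pi * 4 * t / n) for t in range(n)]
power, _ = power_and_phase(periodic)
peak_share = (power[4] + power[n - 4]) / sum(power)


def impulse(position):
    """A single localised (aperiodic) feature at the given position."""
    return [1.0 if t == position else 0.0 for t in range(n)]


# Moving the feature leaves the power spectrum unchanged; only phase differs.
power_a, phase_a = power_and_phase(impulse(3))
power_b, phase_b = power_and_phase(impulse(10))
same_power = all(abs(a - b) < 1e-9 for a, b in zip(power_a, power_b))
phase_differs = any(abs(a - b) > 1e-6 for a, b in zip(phase_a, phase_b))
print(peak_share, same_power, phase_differs)
```

In this toy setting, a feature built only on the power spectrum would describe the periodic signal well but could not tell the two impulse positions apart, which is the kind of limitation the paragraph above attributes to the feature sets.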

Figures 6.5 and 6.6 list the top 10 retrieved textures, (a) ranked by the human observers in the free-grouping experiments and (b) retrieved using GLH [Mirmehdi et al., 2009] and GDIRSOBEL [Ojala et al., 1996], for two “failed” textures, “003” and “131” (see Figure 6.3), selected using both the G and M measures. Furthermore, Figures 6.7 and 6.8 display the top 10 retrieved textures, (a) ranked by the human observers in free-grouping and (b) retrieved using GMAGGDIRCANNY [Ojala et al., 1996] and LBPBASIC [Ahonen and Pietikäinen, 2009], for two “relatively-successful” textures, “047” and “121” (see Figure 6.4), selected using the M measure.

From Figures 6.5 (a) to 6.8 (a), it can be seen that humans are able to rank both aperiodic (e.g. structural or stochastic) and periodic (regular) textures. However, none of the retrieval results obtained using computational features for the “failed” or “relatively-successful” textures are highly consistent with those perceptual rankings. Even the “optimal” retrieval results for the two “relatively-successful” textures are not completely satisfactory.

In document Perceptual texture similarity estimation (Page 143-146)