3.3 Evaluation in Image Information Mining systems
3.3.1 Data-driven evaluation approach
Smeulders et al. (2000) argued that, as CBIR systems grow in complexity and come to comprise several modules, it is necessary to evaluate the system parts individually as well as their mutual dependencies. Objective evaluation can therefore follow two approaches: 1) component by component evaluation, and 2) general evaluation of the performance.
3.3.1.1 Component by component evaluation
Figure 3.1 shows the generic IIM system architecture, in which the components of the off-line part are the image database, feature extraction, and index generation. The evaluation of each component therefore consists of evaluating feature extraction methods and clustering methods.
Several works have evaluated feature extraction methods using optical and SAR images and different features such as color, shape, and texture. For instance, works using optical images with texture as the feature were presented in (Sharma and Singh, 2001) and (Razniewski and Strzelecki, 2005). Sharma and Singh (2001) compared five feature extraction methods (autocorrelation, edge frequency, primitive-length, Laws' method, and co-occurrence matrices) for image analysis using artificial and natural textures. Razniewski and Strzelecki (2005) described a study on feature selection methods for classification purposes. The authors compared texture features obtained using mutual information with texture features obtained using the Fisher coefficient in terms of classification results. The experiments were done using texture images from the Brodatz album.
Kachouri et al. (2008) presented a hierarchical feature extraction model and the relevance evaluation of several features for a heterogeneous optical image database. The authors used multiple primitive features (color, shape, texture) to describe an image. The
evaluation consisted of classifications using an adjusted version of SVM that supports multiple classes. The authors compared different features employed separately, different combinations of the same kind of features, aggregated features, and the proposed hierarchical feature model.
A study to exploit morphological features from multispectral Ikonos imagery was presented in (Huang et al., 2009). The extracted features were compared using the object-based analysis and the Gray-Level Co-occurrence Matrix (GLCM).
Li and Shawe-Taylor (2005) experimented with texture classification using multiresolution features extracted from the dyadic wavelet, wavelet frame, Gabor wavelet, and steerable pyramid. The classifications were made using SVM as the classifier. The experimental results showed that the steerable pyramid and Gabor wavelet classified the texture images with the highest accuracy. However, experimental results on fused features demonstrated that the combination of two feature sets always outperformed each method individually.
Karkanis et al. (2001) presented an evaluation of textural feature extraction using medical images. The authors compared four texture extraction methods (GLCM, run length encoding, fractal dimension, and discrete wavelet transform descriptor) by means of classifications.
A selection of feature extraction methods for SAR imagery was published in (Solberg et al., 1997). The authors evaluated the performance of texture features derived from 1) the GLCM, 2) local image statistics, 3) fractal features, and 4) lognormal field models. To test the performance of the texture features, the authors made urban area classifications using a K-Nearest Neighbor classifier.
In (Clausi and Jernigan, 1998) and (Clausi and Yue, 2004), the performance of GLCM, Markov Random Field (MRF), and Gabor features in classifying sea-ice imagery was compared. The authors found that the GLCM produced the best overall results in terms of classification accuracy, followed by the Gabor features. However, the GLCM features were found to be more sensitive to texture boundaries than the MRF features.
Kandaswamy et al. (2005) proposed the use of approximate textural features for fast image texture analysis. Rather than using the entire image, approximate features are derived from a carefully selected subset of the original image, based on the notion of patch reoccurrence. The proposed approximate features can then be extracted for two texture analysis methods: 1) the GLCM, and 2) Gabor wavelets. The results are expressed in terms of classification results.
In conclusion, the component by component evaluation is predominantly based on classification results, which reflect the accuracy of the feature extraction methods.
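The evaluation pattern shared by the works above (extract texture features, classify, report accuracy) can be sketched as follows. This is a minimal illustration, not the procedure of any cited paper: the synthetic stripe textures, the choice of two GLCM statistics (contrast and energy), the single pixel offset, and the leave-one-out 1-nearest-neighbour classifier are all assumptions made for the example, and the function names are hypothetical.

```python
from collections import Counter

def glcm_features(image, dx=1, dy=0):
    """Return (contrast, energy) of the normalized GLCM for a non-negative offset (dx, dy)."""
    rows, cols = len(image), len(image[0])
    counts = Counter()
    for r in range(rows - dy):
        for c in range(cols - dx):
            # Count co-occurring gray-level pairs at the given offset.
            counts[(image[r][c], image[r + dy][c + dx])] += 1
    total = sum(counts.values())
    contrast = sum(n / total * (i - j) ** 2 for (i, j), n in counts.items())
    energy = sum((n / total) ** 2 for n in counts.values())
    return (contrast, energy)

def loo_accuracy(features, labels):
    """Leave-one-out 1-nearest-neighbour accuracy with squared Euclidean distance."""
    correct = 0
    for q, (fq, lq) in enumerate(zip(features, labels)):
        others = [(sum((a - b) ** 2 for a, b in zip(fq, f)), l)
                  for k, (f, l) in enumerate(zip(features, labels)) if k != q]
        correct += min(others)[1] == lq
    return correct / len(labels)

def stripes(size, vertical, phase):
    """Synthetic binary texture (assumed test data): vertical or horizontal stripes."""
    return [[((c if vertical else r) + phase) % 2 for c in range(size)]
            for r in range(size)]

# Build a small labelled collection of synthetic textures.
images, labels = [], []
for size in (6, 8):
    for phase in (0, 1):
        for vertical, name in ((True, "vertical"), (False, "horizontal")):
            images.append(stripes(size, vertical, phase))
            labels.append(name)

features = [glcm_features(img) for img in images]
accuracy = loo_accuracy(features, labels)
print(f"leave-one-out accuracy: {accuracy:.2f}")
```

Comparing two feature extraction methods in this scheme amounts to swapping `glcm_features` for another extractor and comparing the resulting accuracies on the same labelled collection.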
3.3.1.2 General evaluation of the performance
The most commonly used measures in image retrieval evaluation are Precision and Recall (PR), which are also widely used in the evaluation of text document retrieval (Harman, 1993). However, these measures are not appropriate for image retrieval (Dyson and Box, 1997) and are of only limited use for image collections (Müller et al., 2001). In fact, PR measures have two drawbacks: 1) Selecting a relevant set in an image database is more complicated than in a text database, since the meaning of an image admits a large number of interpretations. 2) In an image database, a query returns a ranked list of results instead of an undifferentiated set of relevant images (Smeulders et al., 2000). However, in spite of these shortcomings, PR measures are useful in special
circumstances or with special considerations, such as the addition of strong semantics to the image database provided by labeling or by textual description (Smith and Li, 1998).
Müller et al. (2001) presented an overview and proposals for performance evaluation in CBIR. The authors identified basic problems in CBIR performance evaluation, for example defining common image collections, obtaining relevance judgments, and making comparisons with textual information retrieval. In addition, the authors summarized the most commonly used evaluation methods, such as PR graphs, the rank of the first retrieved relevant image (Rank1), and the average normalized rank (Rank). The authors recommended the use of the normalized average rank and highlighted the need for standard performance measures, a standard image database, and the integration of the user in the evaluation process.
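The measures named above can be illustrated on a single ranked result list. The sketch below assumes that the ranking covers the whole collection and that relevance is given as one boolean per ranked image. The normalized average rank uses one common formulation with 1-based ranks, in which a perfect ranking scores 0 and a random one about 0.5; the exact normalization in Müller et al. (2001) should be checked against the original paper. All function names are hypothetical.

```python
def precision_recall_at(ranked_relevance, k):
    """Precision and recall after the first k retrieved images."""
    hits = sum(ranked_relevance[:k])
    n_relevant = sum(ranked_relevance)
    return hits / k, hits / n_relevant

def rank1(ranked_relevance):
    """1-based rank of the first relevant image (Rank1)."""
    return ranked_relevance.index(True) + 1

def normalized_average_rank(ranked_relevance):
    """Assumed formulation: (sum of relevant ranks - N_R (N_R + 1) / 2) / (N * N_R)."""
    n = len(ranked_relevance)
    ranks = [pos for pos, rel in enumerate(ranked_relevance, start=1) if rel]
    n_rel = len(ranks)
    return (sum(ranks) - n_rel * (n_rel + 1) / 2) / (n * n_rel)

# Example: a collection of 10 images, 2 of them relevant, retrieved at ranks 1 and 3.
ranking = [True, False, True] + [False] * 7
precision, recall = precision_recall_at(ranking, k=5)
print(precision, recall, rank1(ranking), normalized_average_rank(ranking))
```

Unlike a single precision or recall value, the two rank-based measures summarize the position of relevant images in the ranked list, which addresses the second drawback of PR noted above.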
The work of Aksoy (2001) posed the retrieval problem in a probabilistic framework where the aim is to minimize the error in a setting of two classes: the relevance and the irrelevance classes of the query. The author proposed new methods for different components of the CBIR system, such as feature extraction, image matching, feature combination, and relevance feedback, and validated those methods through comparisons between state-of-the-art methods and the proposed ones. The performance evaluation was done using extensive experiments on three different manually ground-truthed databases, including aerial satellite, texture, and stock images. The databases used were the ISL Database (MIT, 2001), which contains EO images from Texas; VisTex, composed of texture images; and the COREL Photo Stock Library. The metrics used in the evaluation were PR curves, the number of retrievals that have a specific target image among the set of retrieved images, and classifications.
Deselaers et al. (2004) analyzed the different evaluation metrics proposed by Müller et al. (2001) and complemented this work by proposing the classification error rate (ER) as a performance evaluation metric, assuming a connection between CBIR and image classification.
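One way to read the connection between retrieval and classification is to treat each query as classified by its most similar retrieved image. Under that assumption (the text does not give the exact definition used by Deselaers et al. (2004)), the error rate is the fraction of queries whose top-ranked image is not relevant, that is, one minus the mean precision at rank 1. The function name and the example rankings are hypothetical.

```python
def error_rate(rankings):
    """Fraction of queries whose top-ranked image is not relevant (assumed ER definition)."""
    misses = sum(1 for ranked_relevance in rankings if not ranked_relevance[0])
    return misses / len(rankings)

# Three example queries; each list marks the relevance of the ranked results.
rankings = [
    [True, False, True],   # nearest image is relevant: counted as correct
    [False, True, True],   # nearest image is irrelevant: counted as an error
    [True, True, False],   # nearest image is relevant: counted as correct
]
er = error_rate(rankings)
print(f"ER = {er:.3f}")
```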
The work of Huijsmans and Sebe (2005) presented the shortcomings of PR graphs, noting that they give the user incomplete information about how well the IR system will perform for various relevant class sizes and various irrelevant class sizes. The authors introduced the term "generality" to describe the influence of the fraction of relevant items in the database and proposed a 3D graph that shows the PR values as a function of the generality. They also highlighted the importance of normalizing the performance measures with respect to the class size. The authors proposed a well-normalized description of the ranking performance compared to the performance of an ideal retrieval system defined by ground truth for a large number of predefined queries.
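The role of generality can be illustrated with a small sketch, assuming generality is computed as the number of relevant items divided by the database size. Since the expected precision of a random ranking equals this fraction, the same precision value means something different in databases of different generality. The function names and example rankings are hypothetical, and the code produces only the (precision, recall, generality) points that a 3D graph would plot.

```python
def generality(ranked_relevance):
    """Assumed definition: relevant items divided by database size."""
    return sum(ranked_relevance) / len(ranked_relevance)

def prg_points(ranked_relevance):
    """(precision, recall, generality) after each retrieved item."""
    n_rel = sum(ranked_relevance)
    g = generality(ranked_relevance)
    points, hits = [], 0
    for k, rel in enumerate(ranked_relevance, start=1):
        hits += rel
        points.append((hits / k, hits / n_rel, g))
    return points

# The same two relevant images at ranks 1 and 3, in databases of 10 and 100 images.
small = [True, False, True] + [False] * 7
large = [True, False, True] + [False] * 97

for name, ranking in (("small", small), ("large", large)):
    p, r, g = prg_points(ranking)[2]  # values after three retrieved images
    print(f"{name}: precision={p:.3f} recall={r:.3f} random baseline={g:.3f}")
```

Both rankings give the same precision and recall after three retrieved images, but the chance level differs by a factor of ten, which is the information a plain PR graph leaves out.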