Coordinated multiple views to support image retrieval

(1)

Universidade de São Paulo

2014-07

Coordinated multiple views to support image

retrieval

International Conference on Information Visualisation, 18th, 2014, Paris.

http://www.producao.usp.br/handle/BDPI/48563

Downloaded from: Biblioteca Digital da Produção Intelectual - BDPI, Universidade de São Paulo

Biblioteca Digital da Produção Intelectual - BDPI

(2)

Coordinated Multiple Views to Support Image Retrieval

Danilo Medeiros Eler, Jorge Marques Prates, Rog´erio Eduardo Garcia Faculdade de Ciˆencias e Tecnologia, UNESP – Univ Estadual Paulista,

Presidente Prudente, Departamento de Matemática e Computação Presidente Prudente/SP, Brazil

[email protected], [email protected], [email protected]

Rosane Minghim

ICMC, USP – University of S˜ao Paulo S˜ao Carlos/SP, Brazil [email protected]

Abstract—The number of images available has grown over

the years, as well as the number of techniques to aid to organizing and retrieving from image collections. Techniques and systems have been proposed to recover images based on query, in which an image (or words) is used as input parameter and a list of similar images (or images with related text content) is recovered. However, understanding how the retrieved images are related to each other remains as a problem. This paper proposes an approach based on multidimensional visualization and coordination techniques to show the relationship from retrieved images. In this approach, coordination techniques are employed to perform image retrieval methods and highlight the results in visual representations, showing how retrieved images are relate. To evaluate our proposal image collections with and without textual annotations related to each image were used, and also image retrieval mechanisms based on distance, topic and semantic to retrieve images from distinct and multimodal datasets.

Keywords-Image Retrieval; Coordinated Multiple Views;

Linked Views; Coordination Techniques; Multimodal Dataset.

I. INTRODUCTION

Recently, digital images have been popularized and em-ployed in distinct areas, such as medicine, nanotechnology, remote sensing, and others. Consequently, there has been a signiﬁcant increase in the amount of existent images, which have impaired some essential image collection tasks, such as organization and accessing. To aid in these essential tasks, several image retrieval techniques and systems have been proposed in literature [1], [2], [3]. Most of image retrieval systems show results based on a query, using as search parameter an image, key words, annotations, etc; from which a list of related images is retrieved. However, most image retrieval systems neither show how those retrieved images are related to each other nor how they are related to other images from the collection. Furthermore, those systems should exploit the retrieval process to demonstrate why the retrieved images are relevant [4].

Information Visualization (IV) approaches have been pro-posed to deal with image collections [5], [6], [7], [8]. Usually, these approaches employ visual representations computed from images similarities, displaying how they are organized in the collection. However, these IV approaches do not support exploration of multiple datasets and do

not employ the visualization to support the image retrieval process.

This paper proposes a joint exploration approach between visualization and image retrieval mechanisms. In this ap-proach, retrieval mechanisms are used by coordination tech-niques, that is, the user selects images in a source view, the information retrieval mechanisms compute a list of related images, and the coordination technique highlight retrieved images in target views. After that, the visual representation supports the analysis of retrieved results.

The main contribution of this paper is the coordinated multiple views approach to support image retrieval. Most image retrieval systems only show a list of images, but in our approach we show how this list is related to the entire collection, enabling a dataset overview. Furthermore, the visual representations aid in analysis and identiﬁcation of image similarity, providing a better understanding of feature space and images relationship. Thus, users can understand how retrieved images are related both to each other and to other images from the collection. Additionally, the proposed approach employs coordination techniques to perform re-trieval mechanisms from distinct and multimodal datasets.

This paper presents applications of the proposed ap-proach employing multidimensional projection and point based techniques to compute the visual representations. It also employs three image retrieval mechanisms based on image content, key words (i.e., terms, topics) from textual annotations and concepts from ontology.

This paper is organized as follows: in Section 2 we present some related works; in Section 3 we describe the proposed approach and the coordination techniques employed; in Section 4 we show some applications of our approach; and, ﬁnally, in Section 5 we present some conclusions and further works.

II. RELATEDWORKS

A great research challenge is how to organize large image collections aiming at the effectiveness on image retrieval. New approaches use visualization and data mining techniques to aid exploration process. For instance, Chen et al. [5] used visualization techniques to compare images based on three feature extraction methods: color, shape and

(3)

texture. In this approach, the images are classiﬁed by a tech-nique called Pathﬁnder Network Scaling (PFnets) [9]. PFnets are used to identify strong similarities among data – the nodes are connected so that the proximity and the similarity are reproduced between pair of images, only remaining the most important connections. Finally, the images are shown in graph based visualization.

In addition, multimodal datasets have grown, for instance, image collections with textual information, in which each image has a short description about its content. For those collections, Fan et al. [6] classified images based on textual annotations from its content. They build ontology with the objects of interest extracted from images using the textual information. Feature vectors are extracted from those objects in order to classify and associate them in a specific domain of the ontology. New images added to the collection are clas-sified according to the ontology. The visual representation, computed from Principal Component Analysis (PCA) [10], groups the images with similar content based on ontology. The user interacts with a hyperbolic tree also computed from the ontology. Then, the queries are performed based on the ontology concepts computed from the image annotations.

Another way to make the visual representation from image collections is proposed by Yang [11]. He shows a classiﬁcation scheme that relies on projecting the data on a lower dimensional space. He describes experiments with image data sets containing the same underlying image with variations in shape, brightness and orientation. He does not enable queries for image retrieval, but the user can look at the image similarities based on the visualization.

Recently, Kalaramas et al. [8] proposed a retrieval frame-work for multimodal datasets, that is, datasets with distinct data type (e.g., text, sound, image) for each instance. They employed spanning tree and self-organizing maps to visually show the similarity relationships. In this case, the retrieval process is based on the visual analysis of images neigh-borhood. The authors also employed only one data type to compute the visual representation and the multimodality is exploited during user interaction. When the user selects a set of instances and changes the data type, the visual representations is updated based on the similarities computed from the chosen data type.

Frequently, the visual representations are computed from dimensionality reduction techniques, which are applied to images as a whole [12], [13] or to feature space computed from the image collection [14], [15]. However, most works do not aid in traditional image retrieval, but only show a way to organize the image collection for providing a better comprehension of image similarities [8]. Additionally, the traditional image retrieval systems do not make use of visual representations in order to provide better under-standing of the relationship of retrieved results. Therefore, this paper mixes the image retrieval process, represented by coordination techniques, with the visual representations

Figure 1. The Proposed Approach: from the user selection performed in source view a coordination technique performs some image retrieval mechanism and highlights retrieved images in target view.

of image similarities to improve the user understanding of image retrieval results. This approach is described in the next section.

III. THEPROPOSEDAPPROACH

The proposed approach consists in providing a way to display how retrieved images are related to each other and to other images from the collection. For that, it employs multidimensional visualization techniques to preserve the distance relationships established by the similarity measure-ments from the image collection, considering the computed feature space. Thus, when the retrieved images are high-lighted in the visual representation, the user can explore their neighborhood to discover more related images and understand how they are related to the retrieval result.

In our approach, the image retrieval mechanisms are performed by coordination techniques, as shown in Figure 1. The query is given from the user selection in a source view. Next, the retrieval mechanism is performed to compute a list of images. Finally, the coordination process highlights the retrieved images in target view, from which the user can explore similar images based on the visual representation.

This approach used coordination techniques as interaction mechanisms to select images. The selected images are used as input to image retrieval process, from which a list of related images is computed. Hence, any kind of image retrieval mechanism can be employed by a coordination technique. In contrast with other visualization systems, this feature provides a way to enable the exploration of distinct datasets in a same exploration process.

As beneﬁt, our approach allows employing multimodal datasets. In this case, the visual representations and retrieval results can be computed from any data type present in the collection (e.g., text, image, sound). For instance, we are able to use textual information to computed the visual representation and perform the image retrieval process based on image content. Thus, the images would be highlighted based on image content and the exploration on visual repre-sentations would be guided by distance measures computed

(4)

(a) Least Squares Projection (LSP). (b) Neighbor-Joining (NJ) tree. Figure 2. Coordination between LSP (a) and NJ tree (b) techniques. The labeled selections in (a) correspond to the labeled highlights in (b).

from textual data.

Our approach also supports image retrieval from a single dataset. In this case, the user can perform queries in a unique view or employ distinct views with distinct visual representations and coordinate them all. Thus, the user perspective can change according to the employed visual representation.

IV. APPLICATIONS

This section presents applications to illustrate how our approach can improve the user understanding about image retrieval results. Least Squares Projection (LSP) [16] and Neighbor-Joining (NJ) tree [17] are employed to compute visual representations. Both techniques are capable of pre-serving the similarities relationship from image collection. As shown in Figure 2, these techniques represent similarity relations among images in distinct way. In LSP technique, similar images are placed close in 2D space ((Figure 2(a)), while in the NJ tree similar images are placed close in a same branch (Figure 2(b)).

The applications employed image retrieval mechanisms based on image content, key words and concepts from an ontology. In this way, the following coordination techniques are used:

• Distance Coordination [18]: this technique highlights in target views the k-nearest instance from that selected in source view;

• Topic Coordination [18]: this technique highlights in target views the instances that have a speciﬁc topic (i.e., key words, terms) in its content. This topic is extracted from the instances selected in source view;

• Semantic Coordination [19]: this technique highlights in target views the instances that have an speciﬁc set of words selected by the user from an ontology. The words retrieved by the ontology reasoning is based on a topic extracted from that instances selected in source view.

Distance Coordination was originally proposed to deal with vector space models computed from textual content.

This work applies it to deal with feature spaces computed from image content. In this case, it is necessary to compute a common feature space with similar feature descriptors, for instance, if one image collection has a feature space com-puted from Gabor ﬁlters and wavelet, and a second image collection have a feature space computed from wavelet, then, Distance Coordination needs to compute a common feature space only with the features descriptors related to wavelet, so that the similarity computation (distance computations) can be done.

Topic Coordination and Semantic Coordination are em-ployed in this work to retrieve images based on textual content. Nowadays, several image collections have anno-tations (textual descriptions) for each image, for instance, medical images. Additionally, we can compute the visual representations based on feature space computed from image content as well as vector space model computed from textual annotations. Thus, the perception of similarity from retrieved images can chance according to the data source selected to compute the visual representation.

The datasets used in the applications are:

• ImageCLEF-2006: it is an image collection from X-ray divided in 116 unbalanced classes1_{. This collection}

is divided in two groups:

– ImageCLEF-Train: this collection contains 9000 X-ray images;

– ImageCLEF-Test: this collection contains 1000 X-ray images.

• FIGURES: it is an image collection from 996 outdoor scenes, in which each image is also related to a text describing its content2_.

• MEDICAL: sample of 200 X-ray images3_{to compose}

this image collection for this paper. We also created two distinct collections for demonstrate our approach. Additionally, each image has a medical description.

A. Distance Based Retrieval

In this ﬁrst application, shown in Figure 3, we use Distance Based Coordination to explore an unknown image collection based on a known one. The collection shown in Figure 3(a) is ImageCLEF-Test dataset and the collection shown in Figure 3(b) is ImageCLEF-Train dataset. In Figure 3(a) was employed Neighbor Joining [17] and in Figure 3(b) was employed Least Squares Projection [16] to create the visual representation. In both ﬁgures we used wavelet features, that is, the similarities were computed based on image content. In this example, once an image is selected, the5 nearest neighbors are highlighted in other view. Note that several images are highlighted in Figure 3(b)

1_{ImageCLEF image collection is availabe in http://www.imageclef.org/.} 2_{FIGURES image collection is available in http://www.cs.washington.}

edu/research/imagedatabase/.

3_{Images and annotations from MEDICAL image collection was obtained}

(5)

with bold yellow dots. We display one image from selected and highlighted groups to show the relation from source and target view.

(a) 1000 X ray images from ImageCLEF dataset.

(b) 9000 X ray images from ImageCLEF training dataset. Figure 3. Exploration of an image collection by means of Distance Coor-dination Technique [18]. (a) Samples selected in the NJ tree computed from ImageCLEF test dataset; (b) Samples highlighted by Distance Coordination, in this case,5 nearest neighbours from that samples selected in (a). The visual representation shown in (b) was computed by LSP technique from ImageCLEF training dataset.

In this example we show how visual representations can be useful to support comprehension of similarities from retrieved images. After the retrieval process, the user can ex-plore the neighborhood of highlighted images and discover more similar images and also note how they are related to other classes of images. Additionally, the visualization also reveals whether retrieved images are related each other, by looking at its positioning on the visualization.

Another sort of exploration between distinct datasets is shown in the next section, in which the coordination technique is employed in exploration of image collections related to text (annotations) collections.

B. Text Based Retrieval

In Figure 4 is shown an application with FIGURES dataset, in which each image has a textual information related to. So, image and text collections are used in exploration process. The Neighbour Joining (NJ) tree was computed from image collections, from which we acquired Fourier descriptors from image histogram, energy computed from Fourier descriptors from 2D image, mean and standard deviation from pixels intensity. We can note that similar images are placed in a same branch or in near branches (for instance, images from the blue sky). In this example we search for the word ‘stadium’, and the images with such word were highlighted in blue. Figure 4(a) presents this highlighting around the image, but we can note some overlapping. Thus, we can also use circles as visual marks to avoid the image overlapping, as shown in Figure 4(b); in this case, the user needs to interact with each instance (circle) to see the corresponding image.

(a)

(b)

Figure 4. NJ-tree computed from photos. (a) Highlighted in blue photos whose description (text content) has the word ‘stadium’; (b) Tha same NJ-tree from (a), but we used circles to avoid image overlapping.

Alternatively, as shown in Figure 5(a), we can compute the NJ-tree based on the textual information. In this case, similar images are placed together based on the similarity computed

(6)

from textual annotations. We used a distance measure called Normalized Compressed Distance (NCDs) [20], because the

textual information for each image is too short to compute a good vector space model. In this example we search for descriptions with the words ‘stadium’ and ‘temple’, that are the circles highlighted in blue. The Figure 5(b) shows the same NJ-tree, but we used the images as visual marks. We selected a region on NJ-tree where images were highlighted with the word ‘temple’. This selection is presented in Figure 5(c). It is possible to note the relationship among highlighted images. We could also start another exploration process from this sampling.

(a) (b)

(c)

Figure 5. NJ-tree computed from image descriptions (text content). (a) Highlighted in blue description with words ‘stadium’ and ‘temple’; (b) Tha same NJ-tree from (a), but we used images as visual mark and make a selection for zoom in; (c) Zoom in images highlighted with the word ‘temple’.

Using an image collection with textual descriptions for each image, we can employ the Topic Coordination in the exploration process. In this case, when a group of images is selected, a topic is extracted (e.g., based on co-occurrence or most frequent terms) and the coordination technique highlights in the other views instances that have the extracted topic. In the application illustrated in Figure 6 we used the MEDICAL dataset. In Figure 6(a) we selected three images and extracted the term ‘humerus’. The Topic Coordination highlights images whose content has the term ‘humerus’, as shown in Figure 6(b). The color information indicates the occurrence of this term in the textual content.

Alternatively, if an ontology is available for the dataset domain, we can employ the Semantic Coordination to

(a)

(b)

(c)

Figure 6. NJ-tree computed from X-ray image descriptions(text con-tent). (a) Selection of three images from which the term ‘humerus’ was selected; (b) Topic Coordination highlighted the occurrence of the term ‘humerus’; (c) Semantic Coordination highlighted the occurrence of the terms ‘humerus’ and ‘shoulder’.

improve the exploration process and increase the number of highlighted images. We used the same selection from previous example, in which the user selected the term ‘humerus’. After that, we employed the vertebrate organs homologous groups ontology (vHOG) 4 _{to show a list of}

concepts related to ‘humerus’, from which the user selected the term ‘shoulder’. Thus, as shown in Figure 6(c), the Semantic Coordination highlighted the descriptions whose content has the terms ‘humerus’ and ‘shoulder’.

4_{VoHG is a multispecies anatomical ontology for the vertebrate lineage,}

based on the Common Anatomy Reference Ontology (CARO) available at http://www.obofoundry.org/cgi-bin/detail.cgi?id=caro

(7)

V. CONCLUSIONS ANDFURTHERWORKS Image retrieval systems are valuable tools to aid users in exploration of large image collections. These systems also aid in constructing subsets to perform a focused analysis. In this work we proposed a coordinated multiple views approach to improve the image retrieval process, enabling users to better understand the retrieved results. For that, we highlight how the retrieved images are related to each other and to other images. Thus, the user can analyse the retrieved images without losing the overall context.

In our applications we employed multidimensional projec-tion and point placement techniques to compute the visual representation, but any kind of visualization technique could be use to represent the images similarities. Also, we used im-age retrieval mechanisms based on imim-age content, key words and concepts, but other mechanism could be employed as coordination technique. Our approach allows exploring distinct image collections and multimodal datasets.

The proposed approach relies on visual representations, which are usually displayed in a computer screen. The visual scalability decreases as the image collection size increases. In further works, we plan to use multi-projection systems, as used in virtual reality, to increase the visual space as presented in [21].

ACKNOWLEDGEMENTS

The authors acknowledge the financial support of the Brazilian financial agency São Paulo Research Foundation (FAPESP).

REFERENCES

[1] Y. Rui and T. S. Huang, “Image retrieval: Current techniques, promising directions and open issues,” Journal of Visual

Comm. and Image Representation, vol. 10, pp. 39–62, 1999.

[2] Y. Liu, D. Zhang, G. Lu, and W.-Y. Ma, “A survey of content-based image retrieval with high-level semantics,” Pattern

Recogn., vol. 40, no. 1, pp. 262–282, Jan. 2007.

[3] S. K. Shandilya and N. Singhai, “A survey on: Content based image retrieval systems,” International Journal of Computer

Applications, vol. 4, no. 2, pp. 22–26, July 2010, published

By Foundation of Computer Science.

[4] A. Kumar, J. Kim, W. Cai, M. J. Fulham, and D. Feng. [5] C. Chen, G. Gagaudakis, and P. Rosin, “Similarity-based

image browsing,” in XVI IFIP World Computer Congress,

In-ternational Conference on Intelligent Information Processing,

Beijing, China, 2000, pp. 206–213.

[6] J. Fan, Y. Gao, and H.Luo, “Hierarchical classiﬁcation for au-tomatic image annotation,” in Proceedings XXX ACM Intern.

Conf. on Research and Development in Information Retrieval.

New York, USA: ACM Press, 2007, pp. 111–118.

[7] D. Eler, M. Nakazaki, F. Paulovich, D. Santos, G. Andery, M. Oliveira, J. E. S. Batista, and R. Minghim, “Visual analysis of image collections,” The Visual Computer, vol. 25, no. 10, pp. 923–937, 2009.

[8] I. Kalamaras, A. Mademlis, S. Malassiotis, and D. Tzovaras, “A novel framework for retrieval and interactive visualization of multimodal data,” Electronic Letters on Computer Vision

and Image Analysis, vol. 12, no. 2, 2013.

[9] R. W. Schvaneveldt, Ed., Pathﬁnder associative networks:

studies in knowledge organization. Norwood, NJ, USA: Ablex Publishing Corp., 1990.

[10] I. T. Jolliffe, Principal Component Analysis (second edition). New York, NY, USA: Springer-Verlag, October 2002. [11] L. Yang, “Distance-preserving projection of high-dimensional

data for nonlinear dimensionality reduction,” IEEE

Transac-tions on Pattern Analysis and Machine Intelligence, vol. 26,

no. 9, pp. 1243–1246, 2004.

[12] S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, no. 5500, pp. 2323–2326, December 2000.

[13] J. B. Tenenbaum, V. de Silva, and J. C. Langford, “A global geometric framework for nonlinear dimensionality reduction,”

Science, vol. 290, no. 5500, pp. 2319–2323, December 2000.

[14] B. Moghaddam, Q. Tian, N. Lesh, C. Shen, and T. S. Huang, “Visualization and user-modeling for browsing personal photo libraries,” International Journal of Computer Vision, vol. 56, no. 1-2, pp. 109–130, 2004.

[15] G. P. Nguyen and M. Worring, “Interactive access to large image collections using similarity-based visualization,” J. Vis.

Lang. Comput., vol. 19, no. 2, pp. 203–224, Apr. 2008.

[16] F. V. Paulovich, L. G. Nonato, R. Minghim, and H. Lev-kowitz, “Least Square Projection: a fast high precision mul-tidimensional projection technique and its application to document mapping,” IEEE Transactions on Visualization and

Computer Graphics, vol. 14, no. 3, pp. 564–575, 2008.

[17] A. M. Cuadros, F. V. Paulovich, R. Minghim, and G. P. Telles, “Point placement by phylogenetic trees and its application for visual analysis of document collections,” in IEEE

Sym-posium on Visual Analytics Science and Technology 2007,

Sacramento, CA, USA, 2007, pp. 99–106.

[18] D. M. Eler, F. V. Paulovich, M. C. F. d. Oliveira, and R. Minghim, “Coordinated and multiple views for visualizing text collections,” in Proceedings of the 12th International

Conference Information Visualisation (IV ’08). Washington,

DC, USA: IEEE Computer Society, 2008, pp. 246–251. [19] J. M. Prates, L. P. Scatalon, R. E. Garcia, and D. M.

Eler, “Coordinating multiple views using an ontology-based semantic mapping,” in Proceedings of the 17th International

Conference Information Visualisation (IV ’13). Washington,

DC, USA: IEEE Computer Society, 2013, pp. 192 – 197. [20] G. Telles, R. Minghim, and F. Paulovich, “Visual analytics:

Normalized compression distance for visual analysis of doc-ument collections,” Computers & Graphics, Special Issue on

Visual Analytics, vol. 31, no. 3, pp. 327–337, 2007.

[21] A. C. Moraes, D. M. Eler, and J. R. F. Brega, “Collaborative information visualization using a multi-projection system and mobile devices,” in Proceedings of the 18th International

Conf. Information Visualisation (IV ’14), 2014 (To Appear).