An Interactive 3D Visualization for Content-Based Image Retrieval

(1)

An Interactive 3D Visualization for Content-Based Image Retrieval

Munehiro Nakazato and Thomas S. Huang

Beckman Institute for Advanced Science and Technology

University of Illinois at Urbana-Champaign,

Urbana, IL 61801, USA

E-mail: {nakazato, huang}@ifp.uiuc.edu

Abstract

3D visualization for Content-Based Images Retrieval (CBIR) eases integration of searching and browsing images in a large image database. While many researchers proposed visualization system for CBIR, most approaches display images based on a fixed set of image features. However, not all features are always equally important for users. Moreover, in most systems, all images in the data-base are displayed regardless of the user’s interest. Dis-playing too many images consume the resources and may confuse the users.

In this paper, we propose an interactive 3D visualiza-tion system for Content-based image retrieval named 3D MARS. In 3D MARS, only relevant images are displayed on projection-based immersive Virtual Reality system or desktop VR. Based on the users’ feedback, the system dynamically reorganize its visualization scheme. 3D MARS eases tedious task of searching images from a large set of images.

In addition, the sphere display mode effectively visual-ize clusters in the image database. This will be a powerful analyzing tool for CBIR researchers.

1. Introduction

Digital imaging became ubiquitous. People can take digital pictures by inexpensive digital cameras. We no longer have to worry about the price of films. Moreover, beautiful pictures can be easily downloaded from various web sites for free. In return, however, we are suffering from tedious tasks of searching and organizing a huge number of images.

In the last century, various techniques of Content-based image retrieval (CBIR) were proposed [5][6][7][8]. While CBIR systems provided us with a smart way of searching images, they had a significant limitation. First, in traditional CBIR systems, the query results are ordered

and displayed in a line (i.e. 1D) based on the weighted sum of the distance measures. Meanwhile, the image features consist of high-dimensional vectors of different image properties such as color, texture and structure. As a result, much information can be lost for visualization. This cause problems especially when the number of query examples is small. The system cannot tell which feature is the most important for a user. Consequently, the most important image may not appear in the early stage of query operations. One solution to this problem is to allow the user to adjust the query parameter as often used in other image retrieval systems [6]. In this approach, the user has to specify the weights of each feature. This process, however, is very tedious and difficult for novice users.

Second, because the query result images are tiled in a monitor, only limited number of images can be displayed at the same time. It is painful for user to go back and forth in the browser by clicking “Next” and “Previous” buttons.

In this paper, we propose a new visualization system for Content-based image retrieval named 3D MARS. In 3D MARS, images are displayed on a projection-based immersive Virtual Reality or non-immersive desktop VR. The three dimensional space can display more images than traditional CBIR systems at the same time. By giving different meaning to each axis, the user can simultaneously browse the retrieved images with respect to three different criteria.

In addition, from the users’ feedback, the system incrementally improves image query and dynamically adapts visualization scheme by relevance feedback techniques [5][7][8]. Moreover, with Sphere Display Mode, the system provides a powerful analyzing tool for CBIR researchers.

The rest of paper is organized as follows. In the next section, we describe the difference between text database visualization and image database visualization. In Section 3. a brief overview of previous approaches is presented. Then, the proposed system is described in the following

(2)

sections. Finally, the future work and conclusion are presented in Section 10.

2. Text Visualization vs. Image Visualization

Many researchers have already proposed 3D information visualization for Text document database [1][2][3]. Why do we need another visualization scheme for image database? This is because there are significant differences between Text documents and Image documents with regard to visualization.

First, in most Text document visualization systems, only the title and minimal information can be displayed at once. Otherwise, the display would be cluttered with texts. Meanwhile, it is difficult for user to judge the relevance only from the title of a document. In order to see more information such as the abstract or the contents of the documents, the user has to select one of the document and open up another display window [focus+context.] On the other hand, in image retrieval, the user need only image itself for his relevance judgement. This user judgement is instant and does not require additional display window. Hence, the system need to show only images themselves (and titles if necessary.) Therefore, image visualization is more suitable for fully immersive Virtual Reality systems such as CAVE.

Second, in both text and image retrieval systems, documents are indexed in a high dimensional space. Thus, in order to display the documents in a 3D space, the dimensionality has to be reduced. Because the index of text retrieval is made of the occurrence and frequency of keywords, it is difficult to automatically group these components in a meaningful manner. Such a organization is usually domain specific and requires human operation. On the other hand, the feature vector of image retrieval systems can be grouped easily, for example, into color, texture and structure. Therefore, the feature space can be easily organized in a hierarchical manner for 3D visualization.

In content-based image databases, however, there are significant semantic gaps [14] between indexed image features and the user’s concept. Most image databases index the images into numerical features such as color moments and wavelet coefficient as described in Section 7. These features are not directly related to the user’s concept. Even if two images are close each other in the high dimensional feature space, they do not necessarily look similar for the users. Therefore, in order to express the user’s semantic concept with these low-level features, the weights of these feature components should be adjusted automatically. Relevance Feedback technique for image retrieval was introduced by Rui et al. [5] for this purpose. Meanwhile, in text databases, it is more likely

that related documents have the same keywords and are located close to each other in the feature space.

3. Related Work

Many researcher proposed 2D or 3D visualization systems for Content-based Image Retrieval [10][11][15] [16][17].

Virgilio [11] is a non-immersive VR environment for image retrieval. Their system is implemented on VRML. Thus, the location of the images are computed off-line and interactive query is not possible. Only system administrators can send a query to the system and the other users can only browse the resulting visualization.

Hiroike et al. [10] also developed VR system for image retrieval. In their system, hundreds of images in the database are displayed in 3D space. According to the user feedback, these images are re-organized and form a cluster around the example images. In their system, all the images in the database are always displayed at the same time.

Chen et al. [17] applied the Pathfinder Network Scaling

technique [18] on image database. Pathfinder network creates links among the images so that each path represents a shortest path between images. In their system, mutually linked images are displayed in 3D VR space. Depending on pre-selected features, the network forms very different shapes. The number of images are fixed to 279.

Several researchers applied Multidimensional Scaling

(MDS) [13] to image visualization. Rubner et al. [15] used MDS for 2D visualization of images. Tian and Taylor [16] applied MDS to visualize 80 color texture images in 3D. They compared visualization result with different set of image features. However, because MDS is computationally expensive ( time,) it is not suitable for visualization of a large number of images.

4. 3D MARS: the Interactive Visualization

for CBIR

In most approaches described above, a set of image features has to be selected in advance. The problem is, however, that not all features are equally important for the users. For example, assume a user is looking for images of “balls of any color.” In this case, “Color” features are not very useful and should not be used for visualization. Inappropriate visualization can be misleading. Furthermore, the important sets of features are context dependent. Thus, the user has to change the feature set according to his current interest. This is very difficult task for novice users.

(3)

Furthermore, in most systems, all images in the database are displayed regardless of the users’ interest. Displaying too many images exhausts resources while it may be annoying for the users.

To address these problems, we propose a new visualization system for Image database named 3D MARS. In 3D MARS, the system interactively adapts visualization scheme according to users’ request. The user of 3D MARS tells the system his interest by specifying example images (Query-by-Example.) Repeating this feedback loop, the system can incrementally optimize the display space.

5. User Navigation

In 3D MARS, images are displayed in a projection-based immersive VR or non-immersive desktop VR. In the immersive case, we use NCSA CAVE. The image space is projected on four walls (front, left, right, and floor) surrounding the user. With shutter glasses, the user can see a stereoscopic view of the world. The user interacts with the space by a wand. The user can freely walk around in CAVE. In the non-immersive case, the VR space is displayed on a CRT monitor. The user interacts with the system with a keyboard and a mouse. The user can ware shutter glasses for better VR experience.

When the system starts, it displays a number of images aligned in front of the user like a gallery. As the user moves, the images rotate to face the user. These images are randomly chosen by the system. When the user touches one of images by the wand, the image is highlighted and the filename is displayed below it. By moving the wand (or mouse,) the image can be move to any position. The user can select an image as relevant (i.e. a query example) by pressing a wand/mouse button. More than one images can be selected. The selected images are displayed with red frames. In order to de-select an image, press the button again. The user can also specify an image as a negative example. The negative examples are displayed with blue frames. Moreover, the users can fly-through in the space by joystick. In order to prevent the user from getting lost, a virtual compass is provided on the floor. Three arrows of the compass always show X-axis, Y-X-axis, and Z-axis respectively (Figure 3.)

When the user presses the QUERY button, the system retrieves and displays the most similar images from the image database (Figure 2.) The locations of the images are determined by the feature distance from the query images. The X-axis, Y-axis and Z-axis represent color, texture and structure of images respectively. The more similar an image is, the closer to the origin of the space it is located. If the user finds another relevant (or irreverent) image in the result set, he can select it as an additional relevant (or irrelevant) example and press the QUERY button again.

By repeatedly picking up images, the query is improved incrementally and more images of the user’s interest are clustered near the origin (Figure 3.)

Figure 2. shows the result after a user select one “red flower” image as a positive example. Because only one example is specified, the system assume every feature is equally important. As a result, many different types of images are displayed. From this result, the user can give another feedback by selecting more “red flower” images.

Figure 3. shows the resulting visualization after the user selected two “red flower” images as query examples. More pictures of flower are gathered around the origin. Here, the green arrow means the “Color (X)” axis, the blue arrow means the “Texture (Y)” axis, and the red arrow means “Edge structure (Z)” axis. Pictures of “Red flower” have very similar color features to the query examples but have different texture and structure. Therefore, they are displayed on Y-Z plane. Meanwhile, an image of “white flower” has different color features, but has the similar shape to the examples. Thus it is displayed on X-Y plane.

For researchers of image retrieval systems, visualizing how query vector is formed and how images are clustered in the feature space are useful information to evaluate their algorithms. For this purpose, we have implemented Sphere Mode in our system (Figure 4.) In this mode, all the images are represented by spheres. Therefore, it is easier for the user to examine clusters in the VR space. The positive examples are displayed as red spheres, and the negative examples are displayed as blue spheres. By flying through the space in this mode, the researcher can examine how images are clustered from different view angles. For example, by looking down the floor from a higher position, the user can see how images are clustered with respect to color and structure (see Figure 5.)

6. System Overview

3D MARS is implemented as a client-server system. The system consists of Query Server and Visualization Engine (client) as shown below (Figure 1.) They are communicating via Hyper-Text Transfer Protocol (HTTP.) More than one clients can connect to the server simultaneously. Image features are extracted in advance and stored into the Meta-data database.

(4)

Figure 1. The system architecture

7. Image Features

Image features consists of thirty-four (34) dimensional numerical values from three groups of features: color,

texture, and edge structure. These features are indexed and stored in the Meta-data database in advance.

Color. For color features, HSV color space is used. We extract the first two moments (mean, and standard deviation) from each of HSV channels [4]. Therefore, the total number of color features is .

Texture. For texture, each image is applied into wavelet filter bank where the images are decomposed into 10 de-correlated subbands. For each sub-band, the standard deviation of the wavelet coefficients is extracted. Therefore, the total number of this feature is 10.

Edge Structure. We used Water-Fill edge detector [9] to extract image structures. We first pass the original images through the edge detector to generate their corresponding edge maps. From the edge map, eighteen (18) elements are extracted from the edge maps.

8. Query Server

The Query Server is implemented as an extension of MARS (Multimedia Analysis and Retrieval System) [5][7][8] for Information Visualization. The server maintains image files and their meta-data. When the server receives a request from a client, it computes the user-selected images with images in the database. Then, the server sends back IDs of the k most similar images and their locations in 3D.

8.1. Total Ranking vs. Feature Ranking

In the normal MARS system [5], the ranking of the similar images are based on the combination of the all three features. The weights of each feature is computed

from query examples. In early stage of user interaction loop, however, the user may specify only one example. In this case, the query server cannot tell which feature is important. Therefore, the system assumes every feature is equally important. As a result, an image is considered to be relevant only when every its feature is close to the query. This causes the searching to fall into a local minimum.

To remedy this problem, we use two ranking strategy:

Feature Ranking and Total Ranking. The Feature Ranking is a ranking with respect to only one group of the features. First, for each feature group , the system computes a query vector based on the positive examples specified by the user. Next, it compute feature distance of each image n in the database as follows,

(1) where is k th component of i-th feature, is

k-th component of . The weight is the inverse of the standard deviation of ( ),

(2) Then, the feature ranking is computed by comparing ( ). In addition, the value is also used for the location along the corresponding axis in the

fixed axes described later.

After the Feature Ranking is computed, the system combines each feature distance into the total distance

. The total distance of image n is a weighted sum of each ,

(3)

where . I is the total number of feature groups. In our case, I is 3. The optimal solution of

is solved by Rui et al.[7] as follows,

(4)

where , and N is the number of positive examples. This gives higher weight to that feature whose total distance is smaller. Which means that if the query examples are similar with respect to a feature, this feature gets higher weight. The complete discussion of the original MARS system is found in [5] and [7].

Finally, the Total Ranking is computed based on the total distances. The server sends back to the client IDs of the top images in the feature ranking and the top

images in the total ranking. Query Server

CAVE

HTTP

SGI Onyx

Server (Sun Enterprise)

Immersive VR Client Meta-data Database Visualization Engine Visualization Engine SGI O2 Desktop VR Image File Database Feature Extractor 3×2 = 6

i = {Color, Texture, Structure} qi d_ni d_ni w_ik(x_nik–q_ik) k

∑

= x_nik q_ik q_i w_ik x_nik n = 1 … N w_ik 1 σik ---= d_ni n = 1 2, … N d_ni d_ni Dn d_ni D_n = uTdn dn = [dn1, , … dnI] u = [u₁,_…uI] u_i fj f_i ---j=1 I

∑

= f_i d_ni n=1 N

∑

= K_feature K_total

(5)

By the use of both Feature Ranking and Total Ranking, the system can return images even if only one of their feature is close to the query. These images are usually located at some distance in the 3D space. The feature ranking is important especially in the early stage of query process where the user does not have enough number of query example. They could be ignored in the traditional CBIR systems.

8.2. Implementation

The server is implemented as a Java Servlet with Apache Web Server. It is written in C++ and Java. The server can simultaneously communicate with different types of client such as Java applet client [20]. It is running on a Sun Enterprise Server. Currently, 17,000 images and their feature vectors are stored.

9. Visualization Engine

The Visualization Engine takes a request from the user, sends the request to the server and receives the result from the server. Then it visualizes the resulting images in VR space. In the immersive case, it displays a set of images on the four walls of CAVE, which is a projection-based Virtual Reality system.

When the user pushes the QUERY button, it sends IDs of the selected (positive or negative) images to the server. The requests are sent as a “GET” command of HTTP. When the reply is returned, the client receives a list of IDs of the k most similar images and their locations. Next, it downloads all the relevant image files (such as JPEG files) from the image database. Finally, these images are displayed on the virtual space. The system can display arbitrary number of images on the VR space depending on the available resources (texture memory.) In our environment, 50 to 200 images are displayed. In Sphere mode, more data can be displayed simultaneously because the image textures do not have to be stored in the memory.

This component is written in C++ with OpenGL and CAVE library. The immersive version of the visualization engine is running on a twelve-processor Silicon Graphics Onyx 2. Each wall of the CAVE is drawn by a dedicated processor. Loaded image data are shared on a share memory and accessed from these processors. For the desktop VR version, the system is running on SGI O2.

9.1. Projection Strategies

In order to project the high dimensional feature space into 3D space, we take two different approaches: Static Axes and Dynamic Axes.

Static Axes. In the static axes approach, the meanings of X, Y, and Z-axis are fixed to some extent. In our implementation, X-axis, Y-axis and Z-axis always mean the distance with respect to Color, Texture and Structure, respectively. The location of each image is determined by the weighted sum of corresponding feature distance computed in the Query Server as described in Eq. 1. Therefore, for each axis, the system automatically chooses appropriate combination of features from the corresponding feature group.

Because the meanings of axes do not change for each interaction, the user can use the axes to obtain a context of image searching. This makes navigation in the VR space easier. The problem of static axes approach is that some axes (a group of features) may not give any useful information to the user. For example, if none of texture features are meaningful, Y-axis does not have any meaning.

Dynamic Axes. In the dynamic axes, the meanings of the axes change for every interaction. The location of images are determined by projecting the full 34 dimensional feature vector into three dimensional space. Many techniques have been proposed for this purpose. Because our goal is to provide a fully interactive visualization, computationally expensive method such as MDS is not suitable. In stead, we use faster FastMap method developed by Faloutsos et al. [12] FastMap takes a distance matrix of points and recursively maps points into lower dimensional hyperplanes. FastMap requires only computation, where N is the number of images and k is the desired dimension.

First, we feed the raw feature vectors of the retrieved images (including query vector) and the feature weights (Eq. 2) into FastMap. Here, there is no distinction among color, texture and structure feature groups. They are combined into one 34 dimensional vectors. After FastMap projects image features into 3D, we translate the entire VR space so that the location of the query vector matches the origin of the space. This guarantees that the distance between an image and the origin always represents the degree of similarity. The advantage of this approach is that the system use only necessary information to discriminate the images. The disadvantage is that because the meanings of the directions are always changing, the user may be confused.

10. Conclusion and Future Work

In this paper, we proposed a new interactive visualization system for Content-Based Image Retrieval, named 3D MARS. Compared with the traditional CBIR systems, more images can be displayed in 3D space. By

(6)

giving different meaning to each axis, the user can simultaneously browse the retrieved images with respect to three different criteria. In addition, by the use of the feature ranking, the system can display images that could be ignored in traditional CBIR systems. Furthermore, unlike other 3D image visualization systems, where mapping to 3D space is fixed, 3D MARS can interactively optimize the visualization scheme in response to the users’ feedback.

With the Sphere display mode, the 3D MARS not only provides efficient image searching, but also gives CBIR researcher a powerful analyzing tool. By flying through the space, the user can analyze image clusters from different viewpoints.

One limitation of our system is that the user has to find a query example for “Query-by-Examples.” from a random selection. The user has to repeat the random query until any interesting image is found. Chen et al.[19] proposed a technique to automatically generate a tree structure of image database. By following this hierarchy, the user can effectively browse images.

Pecenovic et al. [21] integrated image browsing and query-by-example. In their system, images are organized into hierarchical structure by recursively clustering images by k-means algorithm. For every level, each node is represented by image that is the closest to the centroid of the cluster. From the browsing, the user can switch to query by example.

If some images in the database have text annotation, these information can be used as a starting point of a query. We plan to integrate several browsing strategies into 3D MARS.

11. Acknowledgement

This work was supported in part by National Science Foundation Grant CDA 96-24396.

12. References

[1] Card, S. K., Macinlay, J. D. and Shneiderman, B., “Readings in Information Visualization - Using Vision to Think,” Morgan Kaufmann, 1999.

[2] Wise, J.A. et al., “Visualizing the non-visual: Spatial analysis and interaction with information from text documents,” In Pro-ceedings of the Information Visualization Symposium 95, pp. 51-58. IEEE Computer Society Press, 1995.

[3] Hearst, M. A. and Karadi, C., “Cat-a-cone: An interactive interface for specifying searches and viewing retrieval results using a large category hierarchy. In Proceeding of 20th Annual International ACM SIGIR Conference, Philadelphia, PA, 1997. [4] Sticker, M. and Orengo, M., “Similarity of Color Images,” In

Proceedings of SPIE, Vol. 2420 (Storage and Retrieval of Image and Video Databases III), SPIE Press, Feb. 1995.

[5] Rui, Y., Huang, T. S., Ortega, M. and Mehrotra, M., “Rele-vance Feedback: A Power Tool for Interactive Content-Based Image Retrieval,” In IEEE Transaction on Circuits and Video Technology, Vol. 8, No. 5, Sept. 1998

[6] Flickner, M. et al., “Query by image and video content: The QBIC system,” IEEE Computers, 1995.

[7] Rui, Y. and Huang, T. S., “Optimizing Learning in Image Retrieval,” In Proceedings of IEEE CVPR, 2000.

[8] Zhou, X. and Huang, T. S., “A Generalized Relevance Feed-back Scheme for Image Retrieval,” In Proceedings of SPIE Vol. 4210: Internet Multimedia Management Systems, 6-7 November 2000, Boston, MA, USA.

[9] Zhou, X. S. and Huang, T. S., “Edge-based structural feature for content-base image retrieval,” Pattern Recognition Letters, Special issue on Image and Video Indexing, 2000.

[10] Hiroike, A. and Musha, Y., “Visualization for Similarity-Based Image Retrieval Systems,” IEEE Symposium on Visual Languages, 1999.

[11] Massari, et al., “Virgilio: a Non-Immersive VR System to Browse Multimedia Databases,” In Proceedings of IEEE ICMCS 97, 1997.

[12] Faloutsos, C. and Ling, K., “FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Mul-timedia Datasets,” In Proceedings of ACM SIGMOD95, pp. 163-174, May 1995.

[13] Kruskal J.B., and Wish, M., “Multidimensional Scaling,” SAGE publications, Beverly Hills, 1978.

[14] Santini, S. and Jain, R., “Integrated Browsing and Querying for Image Database,” IEEE Multimedia, Vol. 7, No. 3, 2000, pp. 26-39.

[15] Rubner, Y., Guibas, L. and Tomasi, C., “The earth mover’s distance, multi-dimensional scaling, and color-based image retrieval,” In Proceedings of the ARPA Image Understanding Workshop, May 1997.

[16] Tian, G.Y. and Taylor, D., “Colour Image Retrieval Using Virtual Reality,” In Proceedings of IEEE International Confer-ence on Information Visualization (IV’00), 2000.

[17] Chen, C., Gagaudakis, G. and Rosin, P., “Content-Based Image Visualization,” In Proceedings of IEEE International Con-ference on Information Visualization (IV’00), 2000.

[18] Schvaneveldt, R.W., Durso, F.T. and Dearholt, D.W., “Net-work structures in proximity data,” In The Psychology of Learn-ing and Motivation, 24, G. Bower, Ed., Academic Press, 1989, pp. 249-284.

[19] Chen, J-Y., Bouman, C.A., and Dalton, J.C., “Heretical Browsing and Search of Large Image Database,” In IEEE Trans-action on Image Processing, Vol. 9, No. 3, pp. 442-455, March 2000.

[20] Nakazato, M. et al., UIUC Image Retrieval System for JAVA, available at http://chopin.ifp.uiuc.edu:8080.

[21] Pecenovic, Z., Do, M-N., Vetterli, M. and Pu, P., “Integrated Browsing and Searching of Large Image Collections,” In Pro-ceedings of Fourth International Conference on Visual Informa-tion Systems, November, 2000.

(7)

Figure 2. The result after the user selected one “red flower” picture (in Fixed axes mode.) The query example is displayed near the origin. Arrows are the Virtual compass, which specifies the directions of each axis. Some similar flower pictures as well as non similar images are displayed at a distance. The number of images is 50.

Figure 3. The result after the user selected another “flower” images. Many “flower” images are gathered around the origin. Red flowers of different texture are aligned along the red arrow. A “while flower” is also displayed in a differ-ent position.

Figure 4. The Sphere Mode. The number of images is 100.

Figure 5. The sphere mode from a different view angle (from the zenith of the space.) Relationship between color and structure is visualized.