The above survey of current approaches shows that most of these techniques overlook an important issue, stated by Santini and Jain (1999) as follows: “If our systems have to respond in an intuitive and intelligent manner, they must use a similarity model resembling the humans.” Our belief in the importance of this statement motivates us to propose a novel technique for measuring the similarity of video data. The approach introduces a model that attempts to emulate the way humans perceive video data similarity (Farag & Abdel-Wahab, 2003).
The retrieval system accepts queries in the form of an image, a single video shot, or a multishot video clip. The latter is the general case in video retrieval systems. To lay the foundation of the proposed similarity matching model, a number of assumptions are listed first:
• The similarity of video data (clip-to-clip) is based on the similarity of their constituent shots.
• Two shots are not relevant if the query signature (relative distance between selected key frames) is longer than the signature of the database shot.
• A database clip is relevant if one query shot is relevant to any of its shots.
• The query clip is usually shorter than the average database clip.
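The assumptions above can be sketched as a small relevance check. This is only an illustration under stated assumptions: the `Shot` container, the `shot_similarity` callback, and the `threshold` cut-off are hypothetical and do not come from the original system.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Shot:
    # Hypothetical container: the source does not define a shot data structure.
    name: str
    signature_length: float  # relative distance between the selected key frames


def shot_pair_allowed(query: Shot, candidate: Shot) -> bool:
    """Second assumption: a query signature longer than the database
    shot's signature rules the pair out."""
    return query.signature_length <= candidate.signature_length


def clip_is_relevant(
    query_shots: Sequence[Shot],
    db_shots: Sequence[Shot],
    shot_similarity: Callable[[Shot, Shot], float],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the source
) -> bool:
    """First and third assumptions: clip relevance is decided from shot
    relevance, and one relevant shot pair is enough."""
    return any(
        shot_pair_allowed(q, d) and shot_similarity(q, d) >= threshold
        for q in query_shots
        for d in db_shots
    )


if __name__ == "__main__":
    query = [Shot("q1", 2.0)]
    clip = [Shot("d1", 1.0), Shot("d2", 3.0)]
    # A constant similarity stands in for the real shot similarity measure.
    print(clip_is_relevant(query, clip, lambda q, d: 0.9))
```

Passing the similarity measure in as a callback keeps the relevance rule separate from how shot similarity is actually computed.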
The results of submitting a video clip as a search example are divided into two levels. The first is the overall query similarity level, which lists similar database clips. In the second level, the system displays, for each shot of the input query, a list of similar database shots. This gives the user much more detailed results based on the similarity of individual shots and helps undecided users make their choices.
A shot is a sequence of frames, so we need to formulate frame similarity first. In the proposed model, the similarity between two video frames is defined based on their visual content, where color and texture are used as the representative visual features. Color similarity is measured using the normalized histogram intersection, while texture similarity is calculated using a Gabor wavelet transform. Equation (1) measures the overall similarity between two frames f1 and f2, where Sc (color similarity) is defined in equation (2). The query frame histogram (Hfi) is scaled before applying equation (2) to filter out variations in the dimensions of video clips. St (texture similarity) is calculated based on the mean and the standard deviation of each component of the Gabor filter (scale and orientation) (Manjunath & Ma, 1996).
$$\mathrm{Sim}(f_1, f_2) = 0.5 \cdot S_c + 0.5 \cdot S_t \qquad (1)$$

$$S_c = \frac{\sum_{i=1}^{64} \mathrm{Min}\big(H_{f_1}(i), H_{f_2}(i)\big)}{\sum_{i=1}^{64} H_{f_1}(i)} \qquad (2)$$

Suppose we have two shots S1 and S2 with n1 and n2 frames, respectively. We measure the similarity between these shots by measuring the similarity between every frame in S1 and every frame in S2, forming what is called the similarity matrix, which has dimension n1 × n2. In the ith row of the similarity matrix, the largest element identifies the frame in shot S2 that is most similar to the ith frame in shot S1; likewise, the largest element in the jth column identifies the frame in S1 that is most similar to the jth frame in S2. After forming the matrix, equation (3) is used to measure shot similarity. Equation (3) is applied to the selected key frames only, to improve efficiency and avoid redundant operations.
$$\mathrm{Sim}(S_1, S_2) = \frac{\sum_{i=1}^{n_1} MR_{(i)}(S_{i,j}) + \sum_{j=1}^{n_2} MC_{(j)}(S_{i,j})}{n_1 + n_2} \qquad (3)$$

where MR(i)(Si,j) and MC(j)(Si,j) are the elements with the maximum value in the ith row and the jth column, respectively, and n1 and n2 are the numbers of rows and columns in the similarity matrix.
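Equations (1) to (3) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the texture similarity St is taken as a precomputed number because the Gabor-based computation is not reproduced here, and the histogram scaling step mentioned in the text is omitted.

```python
from typing import Sequence


def color_similarity(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Equation (2): histogram intersection divided by the total mass of
    the first histogram. The source uses 64 bins; any equal length works."""
    if len(h1) != len(h2):
        raise ValueError("histograms must have the same number of bins")
    total = sum(h1)
    if total == 0:
        return 0.0  # assumption: an empty histogram matches nothing
    return sum(min(a, b) for a, b in zip(h1, h2)) / total


def frame_similarity(s_c: float, s_t: float) -> float:
    """Equation (1): equal-weight mix of color and texture similarity."""
    return 0.5 * s_c + 0.5 * s_t


def shot_similarity(matrix: Sequence[Sequence[float]]) -> float:
    """Equation (3): sum of row maxima plus sum of column maxima of the
    n1 x n2 frame similarity matrix, divided by (n1 + n2)."""
    n1 = len(matrix)
    n2 = len(matrix[0])
    row_part = sum(max(row) for row in matrix)
    col_part = sum(max(matrix[i][j] for i in range(n1)) for j in range(n2))
    return (row_part + col_part) / (n1 + n2)


if __name__ == "__main__":
    s_c = color_similarity([2, 1, 1], [1, 1, 2])
    print(frame_similarity(s_c, 0.25))
    # Rows are key frames of S1, columns are key frames of S2.
    print(shot_similarity([[1.0, 0.5], [0.5, 0.5], [0.0, 1.0]]))
```

In the toy histograms the bin-wise minima sum to 3 and the first histogram sums to 4, so the color similarity is 0.75.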
The proposed similarity model attempts to emulate the way humans perceive the similarity of video material. This is achieved by integrating into the similarity measuring formula (4) a number of factors that humans most probably use to perceive video similarity. These factors are:
• The visual similarity: Normally, humans determine the similarity of video data based on visual characteristics such as color, texture, shape, and so forth. For instance, two images with the same colors are usually judged as being similar.
• The rate of playing the video: Humans also tend to be affected by the rate at which frames are displayed, and they use this factor in determining video similarity.
• The time period of the shot: The more the durations of video shots coincide, the more similar the shots appear to human perception.
• The order of the shots in a video clip: Humans often give higher similarity scores to video clips that have the same ordering of corresponding shots.
$$\mathrm{Sim}(S_1, S_2) = W_1 \cdot S_V + W_2 \cdot D_R + W_3 \cdot F_R \qquad (4)$$

$$D_R = 1 - \left[\frac{S_1(d) - S_2(d)}{\mathrm{Max}\big(S_1(d), S_2(d)\big)}\right] \qquad (5)$$

$$F_R = 1 - \left[\frac{S_1(r) - S_2(r)}{\mathrm{Max}\big(S_1(r), S_2(r)\big)}\right] \qquad (6)$$

where SV is the visual similarity, DR is the shot duration ratio, FR is the video frame rate ratio, Si(d) is the time duration of the ith shot, Si(r) is the frame rate of the ith shot, and W1, W2, and W3 are relative weights.
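A minimal sketch of equations (4) to (6) follows. Two points are assumptions rather than statements from the source: the difference in equations (5) and (6) is taken as an absolute difference so that both ratios stay between 0 and 1, and the weight values in the example are arbitrary, since the model leaves them to the user. The function names are hypothetical.

```python
def ratio_similarity(a: float, b: float) -> float:
    """Common form of equations (5) and (6): 1 - |a - b| / max(a, b).
    The absolute value is an assumption made to keep the result in [0, 1]."""
    largest = max(a, b)
    if largest <= 0:
        raise ValueError("durations and frame rates must be positive")
    return 1.0 - abs(a - b) / largest


def weighted_shot_similarity(
    s_v: float,  # visual similarity S_V, e.g. from equation (3)
    d1: float,   # duration of the first shot
    d2: float,   # duration of the second shot
    r1: float,   # frame rate of the first shot
    r2: float,   # frame rate of the second shot
    w1: float,
    w2: float,
    w3: float,
) -> float:
    """Equation (4): weighted sum of visual similarity, duration ratio
    D_R, and frame rate ratio F_R."""
    d_r = ratio_similarity(d1, d2)
    f_r = ratio_similarity(r1, r2)
    return w1 * s_v + w2 * d_r + w3 * f_r


if __name__ == "__main__":
    # Arbitrary example weights that stress visual similarity.
    print(weighted_shot_similarity(0.9, 4.0, 5.0, 25.0, 25.0, 0.5, 0.25, 0.25))
```

Choosing weights that sum to one keeps the combined score in the same range as its three components, although the source does not state that this is required.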
Equation (4) has three weights, W1, W2, and W3, that indicate how important each factor is relative to the others. For example, the importance of the visual similarity factor is stressed by increasing the value of its associated weight (W1). These parameters can be adjusted by the user, so that users can express their real needs. To reflect the effect of the order factor, the overall similarity level checks whether the shots in the database clip have the same temporal order as the shots in the query clip. Although this may restrict the candidates in the overall similarity set to clips that have the same temporal order of shots as the query clip, the user still has a finer level of similarity, based on individual query shots, which captures other aspects of similarity as discussed before.
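One possible way to express the temporal order check is shown below. The source does not say how query shots are paired with database shots, so the sketch assumes a hypothetical list that gives, for each query shot in query order, the position of its matching shot within the database clip; requiring strictly increasing positions is also an assumption.

```python
from typing import Sequence


def preserves_temporal_order(matched_positions: Sequence[int]) -> bool:
    """Return True when the matched database shots occur in the same
    order as the query shots, i.e. the positions are strictly increasing."""
    return all(
        earlier < later
        for earlier, later in zip(matched_positions, matched_positions[1:])
    )


if __name__ == "__main__":
    print(preserves_temporal_order([1, 3, 4]))  # same order as the query
    print(preserves_temporal_order([3, 1, 4]))  # first two shots swapped
```

A clip that fails this check would be dropped from the overall similarity list but could still appear in the per-shot result lists.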
To evaluate the proposed similarity model, it was implemented in the retrieval stage of the VCR system (a video content-based retrieval system) (Farag & Abdel-Wahab, 2003). Model performance was quantified by measuring recall and precision, defined in equations (7) and (8). To measure the recall and precision of the system, five shots were submitted as queries while the number of returned shots was varied from 5 to 20. Both recall and precision depend on the number of returned shots: to increase recall, more shots have to be retrieved, which in general decreases precision. The ground truth set was determined manually by a human observer before each query was submitted to the system. The average recall and precision were calculated for these experiments and plotted in Figure 1, which indicates very good performance. With a small number of returned shots, recall was low while precision was very good. Increasing the number of returned shots increased recall until it reached one, while precision degraded only slightly, with the curve leveling off at a precision of about 0.92. The system thus provides a very good trade-off between recall and precision. Similar results were obtained using the same procedure for unseen queries. For more discussion of the results, the reader is referred to Farag and Abdel-Wahab (2003).
R = A / (A + C) (7)
P = A / (A + B) (8)
where A is the number of correctly retrieved shots, B is the number of incorrectly retrieved shots, and C is the number of missed shots.

Figure 1. Recall vs. precision for five seen shots (axes: recall values and precision values)
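Equations (7) and (8) can be computed directly from the set of returned shots and the manually determined ground truth set. The sketch below is illustrative only: the function name and the shot identifiers are made up, and returning 0.0 when a denominator is zero is an assumption.

```python
from typing import Iterable, Tuple


def recall_precision(
    retrieved: Iterable[str], relevant: Iterable[str]
) -> Tuple[float, float]:
    """Return (recall, precision) as defined in equations (7) and (8)."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    a = len(retrieved_set & relevant_set)  # A: correctly retrieved
    b = len(retrieved_set - relevant_set)  # B: incorrectly retrieved
    c = len(relevant_set - retrieved_set)  # C: missed
    recall = a / (a + c) if (a + c) else 0.0
    precision = a / (a + b) if (a + b) else 0.0
    return recall, precision


if __name__ == "__main__":
    returned = ["s1", "s2", "s3", "s4", "s5"]   # hypothetical system output
    ground_truth = ["s1", "s2", "s3", "s9"]     # hypothetical manual labels
    print(recall_precision(returned, ground_truth))
```

Here A = 3, B = 2, and C = 1, so recall is 3/4 and precision is 3/5. Repeating the call while growing the returned list from 5 to 20 items gives the points of a recall versus precision curve like the one in Figure 1.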
Future Trends
The proposed model is one step toward solving the problem of modeling human perception in measuring video data similarity. Many open research topics and outstanding problems still exist, and a brief review follows. Since the Euclidean measure may not effectively emulate human perception, the potential for improving it can be explored via clustering and neural network techniques. There is also a need for techniques that measure attentive similarity, which some researchers believe is what humans actually use when judging multimedia data similarity. Moreover, nonlinear methods for combining more than one similarity measure require more exploration. Methodologies for evaluating the performance of multimedia retrieval systems, and the introduction of benchmarks such as the TRECVID effort, are two other areas that need more research. In addition, semantic-based retrieval, and how to correlate semantic objects with low-level features to narrow the semantic gap, is another open topic. Real-time interactive mobile technologies are evolving and introducing new challenges to multimedia research that need to be addressed. Incorporating user intelligence through human-computer interface techniques and information visualization strategies also requires further investigation. Finally, the introduction of new psychological similarity models that better capture the human notion of multimedia similarity is an area that needs more research.
Conclusion
This article gave a brief introduction to the issue of measuring digital video data similarity in the context of designing effective content-based video retrieval systems. It emphasized the significance of the similarity matching model in determining the applicability and effectiveness of the retrieval system. The article then reviewed some of the techniques proposed by the research community to implement the retrieval stage in general and to assess the similarity of multimedia data in particular. Finally, the proposed similarity matching model was introduced. This novel model attempts to measure the similarity of video data based on a number of factors that are likely to reflect the way humans judge video similarity. The proposed model is a step on the road toward appropriately modeling the human notion of multimedia data similarity. There are still many research topics and open areas that need further investigation in order to arrive at better and more effective similarity-matching techniques.
References
Berretti, S., Bimbo, A., & Pala, P. (2000). Retrieval by shape similarity with perceptual distance and effective indexing. IEEE Transactions on Multimedia, 2(4), 225–239.
Brunelli, R., Mich, O., & Modena, C. (1999). A survey on the automatic indexing of video data. Journal of Visual Communication and Image Representation, 10(2), 78–112.
Cheung, S., & Zakhor, A. (2003). Efficient video similarity measurement with video signature. IEEE Transactions on Circuits and Systems for Video Technology, 13(1), 59–74.
Deb, S. (2005). Video data management and information retrieval. Idea Group Publishing.
Farag, W., & Abdel-Wahab, H. (2001). A new paradigm for detecting scene changes on MPEG compressed videos. In Proceedings of the IEEE International Symposium on Signal Processing and Information Technology (pp. 153–158).
Farag, W., & Abdel-Wahab, H. (2002). Adaptive key frames selection algorithms for summarizing video data. In Proceedings of the 6th Joint Conference on Information Sciences (pp. 1017–1020).
Farag, W., & Abdel-Wahab, H. (2002). A new paradigm for analysis of MPEG compressed videos. Journal of Network and Computer Applications, 25(2), 109–127.
Farag, W., & Abdel-Wahab, H. (2003). A human-based technique for measuring video data similarity. In Proceedings of the 8th IEEE International Symposium on Computers and Communications (ISCC’2003) (pp. 769–774).
Guan, J., & Qui, G. (2007). Image retrieval and multimedia modeling: Learning user intention in relevance feedback using optimization. In Proceedings of ACM International Workshop on Multimedia Information Retrieval (pp. 41–50).
Hori, T., & Aizawa, K. (2003). Context-based video retrieval system for the life-log applications. In Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval (pp. 31–38).
Hörster, E., Lienhart, R., & Slaney, M. (2007). Image retrieval on large-scale image databases. In Proceedings of the 6th ACM International Conference on Image and Video Retrieval (pp. 17–24).
Kosugi, N., et al. (2001). Content-based retrieval applications on a common database management system. In Proceedings of the ACM International Conference on Multimedia (pp. 599–600).
Lew, M. (Ed.). (2001). Principles of visual information retrieval. Springer-Verlag.
Li, C., Zheng, S., & Prabhakaran, B. (2007). Segmentation and recognition of motion streams by similarity search. ACM Transactions on Multimedia Computing, Communications, and Applications, 3(3), article 16, 1–24.
Lian, N., Tan, Y., & Chan, K. (2003). Efficient video retrieval using shot clustering and alignment. In Proceedings of the 4th IEEE International Conference on Information Communications and Signal Processing (pp. 1801–1805).
Liu, X., Zhuang, Y., & Pan, Y. (1999). A new approach to retrieve video by example video clip. In Proceedings of the ACM International Conference on Multimedia (pp. 41–44).
Manjunath, B., & Ma, W. (1996). Texture features for browsing and retrieval of image data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8), 837–842.
Marchionini, G. (2006). Exploratory search: From finding to understanding. Communications of the ACM, 49(4), 41–46.
Oerlemans, O., Rijsdam, J., & Lew, M. (2007). Real-time object tracking with relevance feedback. In Proceedings of the 6th ACM International Conference on Image and Video Retrieval (pp. 101–104).
Oria, V., Ozsu, M., Lin, S., & Iglinski, P. (2001). Similarity queries in DISIMA DBMS. In Proceedings of the ACM International Conference on Multimedia (pp. 475–478).
Santini, S., & Jain, R. (1999). Similarity measures. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(9), 871–883.
Truong, B., & Venkatesh, S. (2007). Video abstraction: A systematic review and classification. ACM Transactions on Multimedia Computing, Communications, and Applications, 3(1), article 3, 1–37.
Wang, Z., Hoffman, M., Cook, P., & Li, K. (2006). VFerret: Content-based similarity search tool for continuous archived video. In Proceedings of the 3rd ACM Workshop on Continuous Archival and Retrieval of Personal Experiences (pp. 19–26).
Zhou, X., & Huang, T. (2003). Relevance feedback in image retrieval: A comprehensive review. Journal of Multimedia Systems, 8(6), 536–544.
Key Terms
Color Histogram: A method of representing the color feature of an image by counting how many values of each color occur in the image and forming a representative histogram.
Content-based Access: A technique that enables searching multimedia databases based on the content of the medium itself rather than on keyword descriptions.
Context-based Access: A technique that tries to improve retrieval performance by using associated contextual information other than that derived from the media content.
Multimedia Databases: Unconventional databases that store various media such as images, audio, and video streams.
Query By Example: A technique for querying multimedia databases in which the user submits a sample query, such as an image or a video clip, and asks the system to retrieve similar items.
Relevance Feedback: A technique in which the user associates a score with each of the returned hits; these scores are then used to direct the following search phase and improve its results.
Retrieval Stage: The last stage in a content-based retrieval system; it accepts and processes user queries, then returns the results ranked according to their similarity to the query.
Similarity-Matching: A process of comparing features extracted from the query with those stored in the metadata, returning a list of hits ordered according to the measuring criteria.