AN ENHANCED CONTENT-BASED VIDEO RETRIEVAL SYSTEM BASED ON QUERY CLIP
T. N. SHANMUGAM and PRIYA RAJENDRAN, Department of Mathematics, Anna University, Chennai, India.
ABSTRACT
Content-based search and retrieval of video data has become a challenging and important issue. Video contains several types of audio and visual information which are difficult to extract, combine or trade off in common video information retrieval. This research work is an enhanced version of our previous research, extended with texture feature extraction. In this paper, we address the specific aspect of content-based video retrieval from a collection of videos. Specifically, we present a video data model that supports the integrated utilization of various approaches. To begin with, the system splits the video into a sequence of elementary shots, extracts a small number of representative frames from each shot, and subsequently calculates frame descriptors based on the motion, edge, color and texture features. The video shots are segmented using the 2-D correlation coefficient technique. The motion, edge histogram, color histogram and texture features of the elementary video shots are extracted by employing the Fast Fourier Transform with the L2-norm distance function, a statistical approach, HSV color space conversion, and Gabor wavelets using the Fast Fourier Transform, respectively. The features of the elementary video shots, extracted using the above approaches, are stored in a feature library. Videos are retrieved in our system on the basis of a query clip. The color, edge, texture and motion features are extracted for the query video clip and evaluated against the features in the feature library. The comparison is carried out with the Kullback-Leibler distance similarity measure. Finally, similar videos are retrieved from the collection of videos on the basis of the calculated Kullback-Leibler distance.
KEYWORDS: Video Retrieval; Content-based Video Retrieval (CBVR); Video sequence; Shot segmentation; Edge Histogram; Color Histogram; Texture feature; Motion Estimation; Query Clip; Similarity Measure; Kullback-Leibler distance.
1. INTRODUCTION
Nowadays, the capture, storage, upload and delivery of videos have become effortless due to the rapid advancements in digital devices, Internet infrastructures, and Web technologies. The search for video content over the Web remains extremely challenging even with the accomplishments of web search engines. In general, web search engines index only the metadata of videos and search them using textual information. Traditional search engines possess limited video retrieval capabilities because they lack the capacity to understand media contents. There is ample room to enhance conventional search engines in the field of video retrieval by exploiting the rich media contents. This has made content-based video retrieval (CBVR) a promising direction for creating future video search engines [1]. The ongoing growth in the number of large, publicly accessible video libraries has led to the requirement for novel methodologies capable of manipulating video data according to its content [2]. On the other hand, conventional database management systems working on relational or object-oriented data models do not yet offer adequate facilities for administering and retrieving video contents. In accordance with [3], three principal reasons are stated for this: (i) lack of facilities for the management of spatiotemporal relations, (ii) lack of knowledge-based methods for interpreting raw data into semantic contents, and (iii) lack of query representations.
Due to the growing profusion of digital video contents, competent techniques for content-based analysis, indexing, and retrieval of videos have gained more significance. Existing works deal with video content at various levels: raw data, low-level visual content and semantic content [4]. Raw video data encompasses elementary video units in addition to the common video formats including color, shapes, textures and more. On the other hand, semantic content comprises high-level concepts such as objects and events. Semantic content can be represented with the aid of diverse visual presentations. The primary distinction between the two types of content is the differing demands of extracting each of them. The extraction of semantic content is comparatively tedious, as it demands domain knowledge or user interaction, whereas the extraction of visual features is usually domain independent [4].
The retrieval of video and image data on the basis of their visual content like color distribution, texture and shape [5] has been the prime focus of many researchers and these approaches work in accordance with similarity measurement. VisualSEEk [6], Photobook [7], Blobworld [8], along with Virage video engine [9], CueVideo [10] and VideoQ [11] constitute a few of the eminent examples for image and video retrieval systems. The image retrieval systems assist the users to formulate queries on the basis of the visual image content – properties like color percentages, color layout and textures used in the images generally with the support of instances of prior matches (query by example). A small number of these systems make use of spatial information and permit the user to build queries either by drawing the layout of color regions, or by providing the URL of a seed image. Novel approaches of video retrieval only appended the functionality for segmentation and key frame extraction to the available image retrieval systems. Similarity measurement based on low level features was applied following the extraction of the key frame. Due to the fact that video is temporal data the aforesaid procedure is not appropriate and thus sequencing of individual frames, which produced new semantics that may not exist in any of the individual shots, was carried out.
Recently, the necessity for intelligent processing and analysis of multimedia information has been rising steadily. Researchers have built a number of technologies for intelligent video management, including shot transition detection, key frame extraction, video summarization and video retrieval. Among these, content-based retrieval is considered the most difficult and the most significant in practical value. It assists users in efficiently retrieving desired video segments from a vast video database, based on the video contents and with the aid of user interactions [12]. In general, a video retrieval system can be divided into two principal constituents: a module that extracts representative characteristics from video segments, and a module that defines a fitting similarity model to locate similar video clips in the video database. A large number of approaches have employed a wide variety of features to represent a video sequence, of which color histogram [13], shape information [14], motion activity [15], and text analysis [16] are a renowned few. A small number of approaches have combined the aforementioned features to improve the retrieval performance [17].
In this paper, we have presented an effective approach for content based video retrieval system which is an enhanced version of our previous work [18].
The proposed system retrieves similar video clips for a query video clip from a collective set of videos.
The pre-annotation of video shots is not employed in the proposed system. Traditionally, the initial step in a majority of available content-based video analysis techniques is to segment a video into elementary shots, each of them constituting a sequence of consecutive frames recording a video event or scene continuous in time and space. These elementary shots are organized to form a video sequence during video sorting or editing with either cut transitions or gradual transitions of visual effects such as fades, dissolves, and wipes.
We have used 2-D Correlation Coefficient technique for video shot segmentation. Moreover, we carry out discrete cosine transform, mean and standard deviation over the video sequence to segment the video shots.
As the second step, our system extracts four different kinds of video features, namely motion, edge, texture and color, for each video shot. To minimize the dimensionality of the data, we employ feature extraction, which yields compact and discriminative features of the data. Motion is the key feature representing the temporal information of videos. Motion estimation is performed with the help of the Fast Fourier Transform (FFT) and the L2-norm distance. For content-based video retrieval, efficient motion feature extraction is an important step. The spatial distribution of edges is captured by the Edge Histogram (EH) with the help of Sobel operators. The color histogram, computed here in the HSV color space, is the most extensively used method because of its robustness to changes due to scaling, orientation, perspective, and occlusion of images.
Texture analysis has a long history, and texture analysis algorithms range from the usage of random field models to multiresolution filtering techniques such as the wavelet transform. A multiresolution representation based on Gabor filters has been employed here. Several factors motivate the utilization of Gabor filters for extracting textured image features. In particular, the Gabor representation has been shown to be optimal in the sense of minimizing the joint two-dimensional uncertainty in space and frequency.
The feature library stores the extracted features. In the proposed system, similar videos are retrieved on the basis of a specified query clip. Therefore, the abovementioned four distinct features are extracted for the query video clip and are compared with the features in the feature library. A similarity measure is employed to compare the query features with the features in the feature library; the proposed system uses the Kullback-Leibler distance for this purpose. On the basis of the calculated Kullback-Leibler distance, similar videos are retrieved from the collection of videos.
The rest of the paper is organized as follows. The proposed content-based video retrieval system is detailed in Section 2. Experimental results are presented in Section 3. The conclusions are summed up in Section 4.
2. ENHANCED AND EFFECTIVE CONTENT-BASED VIDEO RETRIEVAL SYSTEM
The proposed content-based video retrieval system is described in this section. The video database, a collection of video sequences, is processed completely offline in our system. The individual videos are split into separate shots, followed by the tracking of the video objects across frames within every shot. The first step in our system is to partition a long video sequence into several video shots, i.e. shot segmentation, where each shot is the basic unit for video retrieval. In the next step, our system extracts four different kinds of video features, namely motion, edge, texture and color, for every video shot, and the extracted video features are stored in the feature library. Then, the same features are extracted for a single query clip and are compared with the features in the feature library. The comparison is carried out with the Kullback-Leibler distance similarity measure. Finally, the videos are retrieved from the video collection on the basis of the Kullback-Leibler distance.
2.1. Shot Segmentation
The video has to be split into “chunks” or video shots prior to conducting any video object analysis.
Scene change detection, either abrupt scene changes or transitional (e.g. dissolve, fade in/out, wipe) is employed to achieve the video shot separation. Meng et al. [18] proposed an efficient scene change detection algorithm that operates on compressed MPEG streams. They calculated the statistical measures with the assistance of motion vectors and DCT coefficients from the MPEG stream. Afterwards, the heuristic models of abrupt or transitional scene changes are confirmed through these measurements [19].
Of late, a majority of video retrieval or scene change detection systems employ shots as the fundamental element in the video database [20].
A shot can be defined as a sequence of frames taken by a single camera without any significant change in the color content of consecutive images. A number of researchers have utilized robust techniques based on color histogram comparison to accomplish this function. We employ the 2-D correlation coefficient technique for video shot segmentation in our system. For example, consider a sample video sequence comprising a set of color frames. The first color frame is selected from the sequence and transformed into grey scale format, followed by the application of the 2-D discrete cosine transform (DCT). The DCT is a separable linear transformation; that is, the two-dimensional transform is equivalent to a one-dimensional DCT performed along one dimension followed by a one-dimensional DCT along the other dimension [21]. The two-dimensional DCT for an input image A and output image B can be given as:
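$$B_{pq} = \alpha_p \alpha_q \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} A_{mn} \cos\frac{\pi(2m+1)p}{2M} \cos\frac{\pi(2n+1)q}{2N}, \quad 0 \le p \le M-1,\; 0 \le q \le N-1 \qquad (1)$$

where

$$\alpha_p = \begin{cases} 1/\sqrt{M}, & p = 0 \\ \sqrt{2/M}, & 1 \le p \le M-1 \end{cases} \qquad \alpha_q = \begin{cases} 1/\sqrt{N}, & q = 0 \\ \sqrt{2/N}, & 1 \le q \le N-1 \end{cases}$$

and M is the row size and N is the column size of A [22]. A similar process is carried out on the 2nd frame of the video sequence, followed by the computation of the correlation coefficient between frames 1 and 2 by means of the following formula [23].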
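$$r = \frac{\sum_m \sum_n (A_{mn} - \bar{A})(B_{mn} - \bar{B})}{\sqrt{\left(\sum_m \sum_n (A_{mn} - \bar{A})^2\right)\left(\sum_m \sum_n (B_{mn} - \bar{B})^2\right)}} \qquad (2)$$

where $\bar{A} = \mathrm{mean2}(A)$ and $\bar{B} = \mathrm{mean2}(B)$.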
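As an illustration, Eqs. (1) and (2) can be sketched in plain Python as below. This is a minimal sketch rather than our Matlab implementation: the function names `dct2` and `corr2` and the tiny 2 x 2 frames are assumptions chosen for the example, and the DCT is evaluated directly from Eq. (1) without any fast algorithm.

```python
import math

def dct2(A):
    """Naive 2-D DCT evaluated term by term from Eq. (1)."""
    M, N = len(A), len(A[0])

    def alpha(k, size):
        # Normalisation factors alpha_p / alpha_q of Eq. (1).
        return math.sqrt(1.0 / size) if k == 0 else math.sqrt(2.0 / size)

    B = [[0.0] * N for _ in range(M)]
    for p in range(M):
        for q in range(N):
            total = 0.0
            for m in range(M):
                for n in range(N):
                    total += (A[m][n]
                              * math.cos(math.pi * (2 * m + 1) * p / (2 * M))
                              * math.cos(math.pi * (2 * n + 1) * q / (2 * N)))
            B[p][q] = alpha(p, M) * alpha(q, N) * total
    return B

def corr2(A, B):
    """2-D correlation coefficient of Eq. (2); undefined for constant inputs."""
    M, N = len(A), len(A[0])
    mean_a = sum(map(sum, A)) / (M * N)
    mean_b = sum(map(sum, B)) / (M * N)
    num = den_a = den_b = 0.0
    for m in range(M):
        for n in range(N):
            da, db = A[m][n] - mean_a, B[m][n] - mean_b
            num += da * db
            den_a += da * da
            den_b += db * db
    return num / math.sqrt(den_a * den_b)

# Hypothetical grey-scale frames: frame2 is frame1 slightly brightened,
# frame3 has a very different layout.
frame1 = [[10, 20], [30, 40]]
frame2 = [[12, 22], [32, 42]]
frame3 = [[40, 30], [20, 10]]
r_same_shot = corr2(dct2(frame1), dct2(frame2))
r_other_shot = corr2(dct2(frame1), dct2(frame3))
```

In this toy example the correlation between the DCT coefficients of `frame1` and `frame2` is higher than that between `frame1` and `frame3`, which is the behaviour the shot segmentation relies on.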
Once the correlation features of the frames are determined, the 1st frame of the video sequence is replaced with the 2nd frame. This process is carried out for every frame in the video sequence. The feature library stores the correlation features of all the frames. The shot changes in the video sequence are then located with the aid of the correlation features. Let us consider the first 20 correlation features so as to determine the changing shot. An $M$-dimensional feature vector $a_i$ is computed for each frame $f_i$, $i = 1, 2, \ldots, N$. The matrix is obtained with $a_i$ as a column:

$$A = [a_1 \; \ldots \; a_N] \qquad (3)$$
We construct the $M \times N$ feature matrix $A$ with such feature vectors as its columns. Every feature is associated with a row vector of $A$ of dimensions $1 \times N$, while a column vector of $A$ of dimensions $M \times 1$ depicts each frame. The following formulas aid in the calculation of the mean and standard deviation of the correlation features.

$$\bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij} \qquad (4)$$

$$\sigma_j = \sqrt{\frac{1}{M-1} \sum_{i=1}^{M} (u_{ij} - \mu_j)^2}, \quad 1 \le j \le N \qquad (5)$$
where $\mu_j$ denotes the mean of the $j$-th column. Finally, we locate the shot/cut changes that took place in the video sequence on the basis of the values of the mean and standard deviation.
2.2. Feature Extraction
This subsection describes the extraction of features from the segmented shots. A shot of a person walking (the person is the “object” here) can be regarded as an instance that is segmented into a collection of adjacent regions differing in criteria such as shape, color, edge and texture, though all the regions may be consistent in their motion attribute. A feature is considered good only if similar objects are close to each other in the feature space and dissimilar objects are far from each other [24]. Motion estimation, edge histogram, color histogram and texture are the features extracted in our system. These features are extracted for all the video shots and stored in the feature library.
2.2.1. Motion Estimation
Beyond conventional image features such as color, texture and shape, motion is the most significant feature in video, representing the two-dimensional temporal change of video content [25]. Motion is what distinguishes video from still images. Numerous applications, including motion-based segmentation and structure from motion, utilize motion information, and motion estimation is also employed in a number of significant applications in computer vision and video processing. Motion-compensated video coding is one application that uses motion estimation directly [26]. The past two decades have witnessed many efforts on motion estimation, which is still one of the most active research areas in video analysis. Motion can help to find interesting objects in the video [27]. This sub-section describes the approach to motion estimation adopted in our system.
Let us consider a sample video sequence comprising a set of frames. First, the color frames are transformed into grey scale, followed by the selection of the 1st and 2nd frames from the sample video sequence.
Non-overlapping blocks of size 8x8 are extracted from both frames. The blocks in the first frame are then evaluated against the blocks in the second frame through the FFT and the $L_2$-norm distance. At first, the FFT is applied to the blocks. Subsequently, the difference between the two blocks is determined on the basis of the $L_2$-norm distance. The FFT for vectors of length $N$ is given as follows:

$$X(k) = \sum_{j=1}^{N} x(j)\, \omega_N^{(j-1)(k-1)} \qquad (6)$$
where $\omega_N = e^{(-2\pi i)/N}$ is an $N$th root of unity.
The $L_2$-norm [29] is also referred to as the Euclidean norm. For a function $\phi(x)$, the $L_2$-norm is defined by

$$\|\phi\|^2 = \int_a^b |\phi(x)|^2 \, dx \qquad (7)$$
The aforesaid procedure is repeated for every block in the frame, and the motion vector of the first frame is indexed. The same process is then applied to the 2nd frame and the frame adjoining it, and so on for all the video frames in the sample video sequence. A threshold value is then set for the video sequence. All the measured distances of the frames are compared against this threshold, and the blocks of the sample video sequence are classified as static or moving to form the motion vector.
Assumptions:
Block size = 8
Thres → predefined threshold value
F_F → first frame in a shot; F_N → next frame in a shot
BF_F → block of first frame; BN_F → block of next frame
MV → motion vector

for each frame in shots
    for each (row, column) in frame
        D = Blockmatch(BF_F, BN_F)
        If (D > Thres)
            MV[F_F] = Vector
        end if
    end
end
The formula used in the matching of two blocks is as follows:

$$D = \left\| \mathrm{FFT}(BF_F) - \mathrm{FFT}(BN_F) \right\|_2$$
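The block matching step can be sketched as follows. This is a minimal illustration under stated assumptions, not our Matlab code: `dft2`, `block_distance`, `get_block` and `motion_map` are hypothetical helper names, the transform is computed as a direct 2-D DFT (which gives the same values as an FFT, only more slowly), and the two 16 x 16 frames are synthetic. The threshold of 60 is the value used in Section 3.

```python
import cmath
import math

BLOCK = 8  # block size stated in the assumptions above

def dft2(block):
    """Direct 2-D DFT of a square block, using omega_N = exp(-2*pi*i/N)."""
    n = len(block)
    w = [[cmath.exp(-2j * math.pi * j * k / n) for k in range(n)] for j in range(n)]
    # The 2-D transform is separable: transform along rows, then along columns.
    rows = [[sum(block[r][j] * w[j][k] for j in range(n)) for k in range(n)]
            for r in range(n)]
    return [[sum(rows[j][c] * w[j][k] for j in range(n)) for c in range(n)]
            for k in range(n)]

def block_distance(b1, b2):
    """L2-norm distance between the transforms of two blocks."""
    f1, f2 = dft2(b1), dft2(b2)
    return math.sqrt(sum(abs(x - y) ** 2
                         for r1, r2 in zip(f1, f2) for x, y in zip(r1, r2)))

def get_block(frame, br, bc):
    """Extract the non-overlapping block at block-row br, block-column bc."""
    return [row[bc * BLOCK:(bc + 1) * BLOCK]
            for row in frame[br * BLOCK:(br + 1) * BLOCK]]

def motion_map(first, nxt, thres=60.0):
    """True where the block distance exceeds the threshold (motion block)."""
    n_rows, n_cols = len(first) // BLOCK, len(first[0]) // BLOCK
    return [[block_distance(get_block(first, br, bc), get_block(nxt, br, bc)) > thres
             for bc in range(n_cols)] for br in range(n_rows)]

# Synthetic grey-scale frames: only the bottom-right block changes.
first = [[0.0] * 16 for _ in range(16)]
nxt = [row[:] for row in first]
for r in range(8, 16):
    for c in range(8, 16):
        nxt[r][c] = 20.0
moving = motion_map(first, nxt)
```

For the unnormalised transform of Eq. (6), Parseval's relation implies that this distance equals 8 times the L2 distance between the two 8x8 pixel blocks themselves, which may be convenient when choosing the threshold.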
2.2.2. Edge Histogram
The global feature composition of an image is most frequently represented by characteristics of its histogram. Edges are an important feature for representing the content of an image, since the human eye is sensitive to edge features when perceiving images. Object recognition can be performed effectively using edge histograms. The Edge Histogram (EH) uses the Sobel operator to capture the spatial distribution of edges [29].
In our system, initially the sample video sequence in RGB color space is converted into YCbCr color space. The YCbCr color space is extensively employed by the video and digital photography systems [28].
The luma component, the blue-difference chroma component and the red-difference chroma component are represented by Y', Cb and Cr respectively. Luma is distinguished from luminance, which means light intensity, in that it is non-linearly encoded using gamma; this is denoted by a prime (') on Y. Chroma Cb corresponds to the U color component of a general YUV color space, and chroma Cr corresponds to the V component of a general YUV color space. The following equations express the conversion of RGB into YCbCr color space:

$$Y = 0.2989\,R + 0.5866\,G + 0.1145\,B \qquad (8)$$

$$C_b = -0.1688\,R - 0.3312\,G + 0.5000\,B \qquad (9)$$

$$C_r = 0.5000\,R - 0.4184\,G - 0.0816\,B \qquad (10)$$

Further, the luma component (Y') of the video sequence is chosen and the edge histogram is computed over that component. We have employed the Sobel operator for computing the edge histogram. The steps involved in the computation of the local edge orientation histogram with the aid of the Sobel operator for each shot are as follows:
2.2.2.1. Convolution with Sobel filters:
As shown in Figure 1, Sobel filters are applied to each shot in five directions. The gradient direction of a pixel is taken to be that of the filter that produces the strongest response [29].
Figure 1: Oriented Sobel filters
Here a 4-bin edge histogram is used to represent the strength of edges in the $0$, $\pi/4$, $\pi/2$ and $3\pi/4$ directions. Image gradients $G_x$ and $G_y$ are computed using the Sobel operators [28]:

$$G_x = \begin{bmatrix} +1 & 0 & -1 \\ +2 & 0 & -2 \\ +1 & 0 & -1 \end{bmatrix} * A \quad \text{and} \quad G_y = \begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix} * A \qquad (11)$$

where $*$ denotes the 2-dimensional convolution operation and $A$ is the source image. The Sobel operator is used to compute the edge map, and the cut-off percentage is determined using a root mean square (RMS) estimate of the image noise.
2.2.2.2. Edge pixels detection:
The edge pixels are the pixels whose gradient magnitude is larger than a threshold (in our experiment, $0.30\,G_{max}$, where $G_{max}$ is the maximum gradient magnitude of the edge pixels). The edge direction is computed as

$$\Theta = \arctan\left(\frac{G_y}{G_x}\right) \qquad (12)$$

The edge direction is then uniformly quantized to 4 bins (0, +/-45, 90 degrees) using the decision values +/-22.5 and +/-67.5 degrees.
2.2.2.3. Edge Histogram computation:
The edge histogram has eight bins corresponding to the Sobel filters to count the number of edge pixels in five directions. Edge histogram is then normalized with respect to the image size, i.e. each bin value represents the percentage of a certain edge direction in an image.
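The three steps can be sketched together as below. This is a minimal illustration under stated assumptions: it follows the 4-bin quantization with decision values of +/-22.5 and +/-67.5 degrees described in Section 2.2.2.2, it skips the one-pixel image border, and the names `convolve3` and `edge_histogram` as well as the 6 x 6 test images are hypothetical.

```python
import math

# Sobel kernels of Eq. (11).
SOBEL_X = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
SOBEL_Y = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]

def convolve3(img, kernel, r, c):
    """3x3 convolution at pixel (r, c); convolution flips the kernel."""
    return sum(kernel[i][j] * img[r + 1 - i][c + 1 - j]
               for i in range(3) for j in range(3))

def edge_histogram(luma, ratio=0.30):
    """4-bin edge direction histogram (0, +45, 90, -45 degrees),
    normalised by the image size."""
    rows, cols = len(luma), len(luma[0])
    grads = []
    for r in range(1, rows - 1):          # border pixels are skipped
        for c in range(1, cols - 1):
            gx = convolve3(luma, SOBEL_X, r, c)
            gy = convolve3(luma, SOBEL_Y, r, c)
            grads.append((math.hypot(gx, gy), gx, gy))
    g_max = max(mag for mag, _, _ in grads)
    hist = [0, 0, 0, 0]
    for mag, gx, gy in grads:
        if g_max == 0 or mag <= ratio * g_max:
            continue                      # not an edge pixel
        # Eq. (12); a zero horizontal gradient is treated as 90 degrees.
        theta = 90.0 if gx == 0 else math.degrees(math.atan(gy / gx))
        if abs(theta) < 22.5:
            hist[0] += 1                  # 0 degrees
        elif abs(theta) >= 67.5:
            hist[2] += 1                  # 90 degrees
        elif theta > 0:
            hist[1] += 1                  # +45 degrees
        else:
            hist[3] += 1                  # -45 degrees
    return [count / (rows * cols) for count in hist]

# Hypothetical luma images with a single vertical / horizontal step edge.
vertical_step = [[0, 0, 0, 100, 100, 100] for _ in range(6)]
horizontal_step = [[0] * 6 for _ in range(3)] + [[100] * 6 for _ in range(3)]
hist_v = edge_histogram(vertical_step)
hist_h = edge_histogram(horizontal_step)
```

With these inputs all edge pixels of the vertical step fall into the 0 degree bin and all edge pixels of the horizontal step fall into the 90 degree bin.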
2.2.3. Color Histogram
The color histogram is the most widely used method owing to its robustness to scaling, orientation, perspective, and occlusion of images [30]. The histogram represents the joint distribution of the three color channels. Human perception of color is a combination of three stimuli, R (red), G (green), and B (blue), which form a color space. Further color spaces can be produced by separating chromatic information from luminance information. A color space must be chosen in order to extract color information. Since there are numerous colors in existence, it is essential to quantize the color space in order to decrease the complexity of the histogram computation. The proposed method utilizes color categorization for the quantization of the color space. Every pixel in the region is converted from RGB (Red, Green and Blue) color space to HSV (Hue, Saturation and Value) color space prior to color analysis [31].
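The transformation from RGB color space to HSV color space is concisely given as follows. Let max represent the greatest of r, g and b, and let min represent the least.

$$h = \begin{cases} 0, & \text{if } \max = \min \\ \left(60^\circ \times \dfrac{g - b}{\max - \min} + 360^\circ\right) \bmod 360^\circ, & \text{if } \max = r \\ 60^\circ \times \dfrac{b - r}{\max - \min} + 120^\circ, & \text{if } \max = g \\ 60^\circ \times \dfrac{r - g}{\max - \min} + 240^\circ, & \text{if } \max = b \end{cases} \qquad (13)$$

$$s = \begin{cases} 0, & \text{if } \max = 0 \\ \dfrac{\max - \min}{\max} = 1 - \dfrac{\min}{\max}, & \text{otherwise} \end{cases} \qquad (14)$$

$$v = \max \qquad (15)$$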
Here the hue value of "grays" (where R=G=B) is set to 0 instead of being left undefined, for convenience of computation. Then, each color component is uniformly quantized: H -- 16 bins; S -- 4 bins; V -- 4 bins. Finally, this 16x4x4 histogram is concatenated to give a 256-dimensional vector. The color histogram is computed after converting the frame into HSV color space. The hue and saturation components of the region that produce the highest histogram count will eventually be utilized as the feature of that particular region. Noticeably, only the hue and saturation components are considered for the color histogram. Two reasons argue against the inclusion of value. First, a three-dimensional array, which requires more memory, would be necessary if all three components were included. Secondly, the hue and saturation components carry adequate information, making the inclusion of the value component unnecessary. To be precise, the extracted color feature will be the most frequently occurring one [32].
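The conversion of Eqs. (13) to (15) and the 16x4x4 quantization can be sketched as follows. This is a minimal illustration with several assumptions of our own: r, g and b are taken to lie in [0, 1], the bin index is laid out as h_bin*16 + s_bin*4 + v_bin, the histogram is normalised by the number of pixels, and the names `rgb_to_hsv`, `quantize` and `hsv_histogram` are hypothetical.

```python
def rgb_to_hsv(r, g, b):
    """Eqs. (13)-(15): h in [0, 360), s and v in [0, 1], for r, g, b in [0, 1]."""
    mx, mn = max(r, g, b), min(r, g, b)
    if mx == mn:
        h = 0.0                                   # grays: hue set to 0
    elif mx == r:
        h = (60.0 * (g - b) / (mx - mn) + 360.0) % 360.0
    elif mx == g:
        h = 60.0 * (b - r) / (mx - mn) + 120.0
    else:
        h = 60.0 * (r - g) / (mx - mn) + 240.0
    s = 0.0 if mx == 0 else 1.0 - mn / mx
    return h, s, mx

def quantize(value, upper, bins):
    """Uniform quantization of [0, upper] into `bins` bins."""
    return min(int(value / upper * bins), bins - 1)

def hsv_histogram(pixels):
    """256-dimensional histogram (16 H x 4 S x 4 V bins), normalised to sum to 1."""
    hist = [0.0] * 256
    for r, g, b in pixels:
        h, s, v = rgb_to_hsv(r, g, b)
        index = (quantize(h, 360.0, 16) * 4 + quantize(s, 1.0, 4)) * 4 \
            + quantize(v, 1.0, 4)
        hist[index] += 1
    return [count / len(pixels) for count in hist]

# Hypothetical region: two red pixels, one green pixel, one mid-gray pixel.
region = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.5, 0.5, 0.5)]
color_feature = hsv_histogram(region)
```

Normalising the histogram is a convenience for the later comparison with the Kullback-Leibler distance, which operates on probability distributions.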
2.2.4. Texture Feature Extraction
The elementary concerns met in many low-level image analysis and computer vision tasks are color, shape and texture, the last being a vital visual property of materials. Within image science, and amongst the diverse research directions taken up in the field, the study of texture alone is considered a complicated subject. Texture is an area property, exemplified by features such as roughness, variability, repeatability and directionality defined over a certain spatial extent, in contrast to color, which is a point property. Numerous techniques for texture retrieval and classification employ the energy distribution in the frequency domain in order to recognize texture [45], [46], [47]. In our system, a simple texture feature representation based on Gabor wavelet features has been employed [33].
Gabor Functions and Wavelets:
A two-dimensional Gabor function $g(x,y)$ and its Fourier transform $G(u,v)$ can be written as follows [44]:

$$g(x,y) = \left(\frac{1}{2\pi\sigma_x\sigma_y}\right) \exp\left[-\frac{1}{2}\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2}\right) + 2\pi j W x\right] \qquad (16)$$

$$G(u,v) = \exp\left\{-\frac{1}{2}\left[\frac{(u - W)^2}{\sigma_u^2} + \frac{v^2}{\sigma_v^2}\right]\right\} \qquad (17)$$

where $\sigma_u = 1/(2\pi\sigma_x)$ and $\sigma_v = 1/(2\pi\sigma_y)$. The Gabor functions form a complete but non-orthogonal basis set, and expanding a signal using this basis offers a localized frequency description. In the subsequent discussion we consider a class of self-similar functions known as Gabor wavelets [34]. With $g(x,y)$ as the mother Gabor wavelet, this self-similar filter dictionary is obtained by appropriate dilations and rotations of $g(x,y)$ through the generating function:
$$g_{mn}(x,y) = a^{-m} g(x', y'), \quad a > 1,\; m, n = \text{integer};$$
$$x' = a^{-m}(x\cos\theta + y\sin\theta), \quad y' = a^{-m}(-x\sin\theta + y\cos\theta) \qquad (18)$$

where $\theta = n\pi/K$ and $K$ is the total number of orientations. The scale factor $a^{-m}$ in eqn. (18) is meant to ensure that the energy is independent of $m$ [35].
Feature Representation:
Given an image $I(x,y)$, its Gabor wavelet transform is then defined to be

$$W_{mn}(x,y) = \int I(x_1, y_1)\, g_{mn}^{*}(x - x_1, y - y_1)\, dx_1\, dy_1 \qquad (19)$$

where $*$ indicates the complex conjugate. The local texture regions are assumed to be spatially homogeneous, and the mean $\mu_{mn}$ and the standard deviation $\sigma_{mn}$ of the magnitude of the transform coefficients are used to represent the region for classification and retrieval purposes [42]:

$$\mu_{mn} = \iint |W_{mn}(x,y)|\, dx\, dy \quad \text{and} \quad \sigma_{mn} = \sqrt{\iint \left(|W_{mn}(x,y)| - \mu_{mn}\right)^2 dx\, dy} \qquad (20)$$
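A discrete version of Eqs. (16) and (18) to (20) can be sketched as below. This is only an illustrative sketch, not our Matlab implementation: the parameter values (a = 2, sigma_x = sigma_y = 1.5, W = 0.4, a 7 x 7 kernel), the function names and the 8 x 8 test image are all assumptions, the integrals are replaced by sums over the image, and the mean and standard deviation are divided by the number of pixels.

```python
import cmath
import math

def gabor_kernel(m, n, K=6, a=2.0, sigma_x=1.5, sigma_y=1.5, W=0.4, half=3):
    """Sampled Gabor wavelet g_mn of Eq. (18) on a (2*half+1)^2 grid."""
    theta = n * math.pi / K
    scale = a ** (-m)
    kernel = {}
    for y in range(-half, half + 1):
        for x in range(-half, half + 1):
            xr = scale * (x * math.cos(theta) + y * math.sin(theta))
            yr = scale * (-x * math.sin(theta) + y * math.cos(theta))
            # Mother wavelet g(x', y') of Eq. (16).
            envelope = math.exp(-0.5 * (xr ** 2 / sigma_x ** 2 + yr ** 2 / sigma_y ** 2))
            envelope /= 2 * math.pi * sigma_x * sigma_y
            kernel[(x, y)] = scale * envelope * cmath.exp(2j * math.pi * W * xr)
    return kernel

def gabor_features(image, S=4, K=6):
    """Feature vector [mu_00, sigma_00, mu_01, ..., mu_(S-1)(K-1), sigma_(S-1)(K-1)]."""
    rows, cols = len(image), len(image[0])
    features = []
    for m in range(S):
        for n in range(K):
            kernel = gabor_kernel(m, n, K)
            magnitudes = []
            for y in range(rows):
                for x in range(cols):
                    acc = 0j
                    # Discrete form of Eq. (19); pixels outside the image count as 0.
                    for (dx, dy), g in kernel.items():
                        x1, y1 = x - dx, y - dy
                        if 0 <= x1 < cols and 0 <= y1 < rows:
                            acc += image[y1][x1] * g.conjugate()
                    magnitudes.append(abs(acc))
            mu = sum(magnitudes) / len(magnitudes)
            sigma = math.sqrt(sum((v - mu) ** 2 for v in magnitudes) / len(magnitudes))
            features += [mu, sigma]
    return features

# Hypothetical 8 x 8 image with vertical stripes.
stripes = [[255 if (c // 2) % 2 == 0 else 0 for c in range(8)] for _ in range(8)]
texture_feature = gabor_features(stripes)
```

With four scales and six orientations the sketch yields 48 feature components, two per filter.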
Now, $\mu_{mn}$ and $\sigma_{mn}$ are employed as feature components in the construction of the feature vector. In the experiments, four scales ($S = 4$) and six orientations ($K = 6$) are employed to obtain the following feature vector [43]:

$$f = [\mu_{00}\; \sigma_{00}\; \mu_{01}\; \ldots\; \mu_{35}\; \sigma_{35}] \qquad (21)$$

2.3. Retrieving Similar Videos Based On Query Clip
By means of the approaches described in the previous subsections, the motion vector, edge histogram, color histogram and texture features are extracted for all the video shots in the database and stored in the feature library. All the extracted features are stored in a MAT-file, since we have implemented our approach in Matlab. Based on the input query clip, our video retrieval system retrieves the related videos from the video database. The above mentioned features are computed for the query clip and compared against the features in the feature library. The comparison of the features is achieved with the help of a similarity measure; in our system we have employed the Kullback-Leibler distance. Finally, the video sequences are sorted in ascending order of distance and the similar videos are retrieved.
2.4. Similarity Measure
When the feature set is fixed for a particular retrieval system, what the researchers can still enhance is the similarity measure. Clearly, the similarity measure plays as major a role as the original feature space in deciding what is “close” or “far”. Euclidean distance and other Minkowski-type distances are a notable few among the extensively used similarity measures. We have used the Kullback-Leibler distance method for similarity measure computation in our system.
2.4.1. The Kullback-Leibler Distance Method:
In 1951, Kullback and Leibler studied, from a statistical perspective, a measure of information involving two probability distributions associated with the same experiment [19]. The Kullback-Leibler divergence measure is used to determine the difference between two distinct probability distributions over the same event space. The following equation describes the KL divergence of the probability distributions $P$, $Q$ on a finite set $X$:

$$D_{KL}(P \,\|\, Q) = \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)} \qquad (22)$$
Owing to the fact that the KL divergence is a non-symmetric information-theoretical measure of the distance of P from Q, it is not strictly a distance metric. To generalize this measure, a variety of measures have been introduced in the past literature. Therefore, symmetric Kullback-Leibler divergences, i.e., Kullback-Leibler Distances (KLD), have been employed for our experiments [36]. Various applications, including language models [37], query expansion [38], and categorization [39], have utilized KL and KLD. In addition, they have also been employed in natural language and speech processing applications on the basis of statistical language modeling [40], and in information retrieval, for topic identification [41].
Steps for Similarity Measure:
Let us consider
P → query clip feature vector
Q → feature library 1st feature vector
n → element of vector
N → normalized factor of Q
F = Normalization(P)
Then find the indices where ((Q ≠ 0) & (F ≠ 0)) and store them in $I_F$.
Then the similarity measure is carried out using

$$D_{KL} = \sum F(I_F) * \log\left(\frac{F(I_F)}{Q(I_F)/N}\right) \qquad (23)$$
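The steps above, together with the ascending sort of Section 2.3, can be sketched as follows. This is a minimal sketch under assumptions of our own: the normalisation factor is taken to be the sum of the vector, the symmetric distance is formed as the sum of the two directed divergences (one common choice, which may differ from the variant used in our experiments), and the function names and the small feature vectors are hypothetical.

```python
import math

def kl_divergence(p, q):
    """Directed KL divergence of Eq. (22) after normalising both vectors;
    only indices where both entries are non-zero contribute."""
    total_p, total_q = sum(p), sum(q)
    f = [value / total_p for value in p]
    qn = [value / total_q for value in q]
    return sum(fi * math.log(fi / qi) for fi, qi in zip(f, qn)
               if fi != 0 and qi != 0)

def kl_distance(p, q):
    """Symmetric variant: sum of the two directed divergences (assumption)."""
    return kl_divergence(p, q) + kl_divergence(q, p)

def rank_library(query, library):
    """Shot names sorted in ascending order of distance to the query."""
    scores = {name: kl_distance(query, feature) for name, feature in library.items()}
    return sorted(scores, key=scores.get)

# Hypothetical feature library with three shots.
query_feature = [4, 3, 2, 1]
feature_library = {
    "shot_a": [4, 3, 2, 1],
    "shot_b": [1, 2, 3, 4],
    "shot_c": [4, 3, 2, 2],
}
ranking = rank_library(query_feature, feature_library)
```

The shots at the head of `ranking` are those returned to the user as the most similar to the query clip.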
3. EXPERIMENTAL RESULTS AND DISCUSSION
Our proposed approach has been validated by experiments with several kinds of video sequences. We have implemented our proposed system in Matlab (Matlab 7.4). We report here some results obtained on a part of a video sequence utilized for retrieval. The results of shot segmentation, Motion estimation, RGB to YCbCr color space conversion, RGB to HSV color space conversion are presented along with Query clip and retrieved shots.
Figure 1 shows some sample frames of the segmented shots obtained from a single video clip. Figures 2, 3 and 4 illustrate, respectively, the extracted objects that exhibit motion in the shots after the application of the block matching algorithm with FFT, the RGB to HSV color space conversion output used for the color histogram, and the RGB to YCbCr color space conversion output used for the edge histogram.
The motion distance between two blocks is illustrated using Fig. 5(a) and 5(b). The image has 240 rows and 320 columns, indexed along the Y and X directions respectively, and therefore contains 30 x 40 blocks of size 8 x 8. Two methods, DCT and FFT, have been implemented and the results are compared.
This gives block indices from 1 to 40 in the X direction and from 1 to 30 in the Y direction. With the DCT and L2 method, the distances between the two frames are reported for blocks 13-23 in the X direction and 15-23 in the Y direction, as shown in Tables 1 and 2. Tables 3 and 4 report the same block range, 13-23 in the X direction and 15-23 in the Y direction, for the FFT and L2 method.
For comparison, we set the threshold value to 60. If the distance is greater than the threshold, we take it that motion exists between the blocks. The blocks that satisfy this condition are highlighted in Tables 1-4 below.
Observing the video results for the sample sequence in Fig. 5(a) and 5(b), FFT is found to perform better than the DCT method, as FFT captures even the shadow motion of the persons.
Figure 6(a) and 6(b) details some input query clips and the corresponding video clips retrieved by the proposed content based-video retrieval approach. For the given query clips, on the basis of videos retrieved, we have determined the precision and recall for the proposed system. The precision and recall measurements are the more frequently using measurements to analyze the performance of an image retrieval system which can be defined as
\[
\text{precision} = \frac{\text{Number of relevant images selected}}{\text{Total number of images retrieved}}
\]
\[
\text{recall} = \frac{\text{Number of relevant images selected}}{\text{Total number of similar images in the dataset}}
\]
Some of the query clips utilized for the precision and recall calculation are given in Figure 7.
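The precision and recall definitions above translate directly into code. The following is a minimal sketch, assuming that the retrieved clips and the relevant (similar) clips for a query are available as collections of identifiers; the function name `precision_recall` and the clip identifiers in the usage note are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query.

    retrieved: iterable of identifiers returned by the system
    relevant:  iterable of identifiers that are truly similar to the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant items that were actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For instance, if the system returns four clips `["v1", "v2", "v3", "v4"]` and the clips that are actually similar to the query are `["v2", "v4", "v7"]`, two of the four retrieved clips are relevant, which gives a precision of 0.5 and a recall of 2/3.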
For each of the given query clips Q1, Q2, Q3, Q4 and Q5, shown in that order in Figure 7, precision and recall have been determined, and the values are tabulated in Table I.
Table I: Precision and Recall for a given set of query clips
Sl. No.   Query clip   Precision   Recall
1         Q1           0.10        0.90
2         Q2           0.26        0.88
3         Q3           0.31        0.86
4         Q4           0.70        0.32
5         Q5           0.89        0.27
Table I gives a sample of five query clips with the corresponding precision and recall values, and Figure 8 gives the precision-recall plot for these query clips.
Figure 8: precision-recall plot for the input query clips
The precision-recall graph has been plotted for the five input query clips Q1, Q2, Q3, Q4 and Q5.
From the graph, it can be concluded that the proposed system exhibits acceptable precision and recall, which indicates that the system is effective.
4. CONCLUSION
Content-based retrieval of visual information is an emerging area of research that has recently attracted considerable attention from researchers. In this paper, we have presented an enhanced content-based video retrieval system that performs efficiently. The proposed scheme segments a long video sequence into its elementary shots. Subsequently, the features of the video sequence, namely the motion vector, edge histogram, color histogram and texture features, are extracted and stored in the feature library. The Kullback-Leibler distance similarity measure is employed to compare the features in the feature library with the features of the query clip, which are extracted in the same manner. The computed Kullback-Leibler distance serves as the basis for retrieving similar videos from the video database.
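The retrieval step summarized above can be illustrated with a short sketch. It assumes that each feature is stored as a non-negative histogram, that the feature library is a dictionary mapping a video name to its histogram, and that a small smoothing constant `eps` is added to avoid empty bins; these choices, and the names `kl_distance` and `rank_by_kl`, are assumptions made for illustration and are not taken from the system itself.

```python
import numpy as np


def kl_distance(p, q, eps=1e-10):
    """Kullback-Leibler distance D(p || q) between two histograms.

    Both inputs are normalised to sum to 1; eps (an assumed smoothing
    constant) avoids division by zero and log(0) for empty bins.
    """
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))


def rank_by_kl(query_feature, feature_library):
    """Return (name, distance) pairs sorted from most to least similar."""
    scored = [(name, kl_distance(query_feature, feat))
              for name, feat in feature_library.items()]
    return sorted(scored, key=lambda item: item[1])
```

A smaller distance indicates a closer match, so the first entries of the ranked list are the candidates for retrieval. The Kullback-Leibler distance is not symmetric, so the order of the two arguments matters.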
Figure 1: Shots obtained after shot segmentation from a video clip: (a) frames belonging to one shot; (b) frames belonging to another shot
Figure 2: Motion Estimation output of a video sequence
Figure 3: RGB to HSV Color Space Conversion output to extract the Color Histogram Feature
Figure 4: RGB to YCbCr Color Space Conversion output to extract the Edge Histogram Feature
Figure 5(a): Motion Estimation using DCT
Figure 5(b): Motion Estimation using FFT
Table 1: DCT Value for Shadow of the First Person
Y \ X        34        35        36        37        38        39        40
15      78.5994   14.9865    7.1776    9.7014   10.3577    6.3028    9.4546
16      44.2589    7.4465   16.9947   17.9732   10.1542    5.6517    7.2035
17      15.5864   17.4779   15.2831   13.1081   13.8262   11.5193   10.3198
18      14.9036    8.9529   12.1174   15.5645    8.2705    9.8875   14.2527
19      10.8021   17.2128   11.8821    4.6453   11.0029    8.3179    7.8187
20      17.2527   13.4261    8.9699   14.9130   16.6430   19.6191   18.9085
21      22.0507   22.9473   12.1469    6.1963   23.9582   26.0864   21.5636
22     106.3294   44.0055   35.6003   23.1434   25.5265   12.7690   20.0477
23      23.1194    9.9618   11.5705   11.4814    7.9070    4.2397   16.8217
Table 2: DCT Value for Shadow of the Second Person
Y \ X           13           14           15           16           17           18           19           20           21           22           23
15     1.6001e+003      75.7140      38.5960      74.8657      17.0879      15.9639      32.4867      44.7472      48.7471      39.7434      14.3361
16     1.2913e+003      63.4659      48.7455      72.0889      55.3402      32.0464      36.9593      64.1067      66.0421      47.1481      48.7812
17     1.2375e+003      33.2885      83.6817      86.0024      46.2166      42.7296      51.3599      77.2790      65.1121      88.3973      56.4531
18        751.7537      56.5108      96.4640      63.3686      14.7473      41.1326      28.9607      97.6672      69.6282      70.4427      54.0087
19        298.2532      18.6870      62.1999      49.9537      41.9995      44.7993      60.8466      71.3687      75.4350      67.1483      36.7951
20         70.2640      69.8669      46.8707      33.1521      38.4958      29.8845      65.8235     102.0415     123.1070      77.6372      88.5503
21         45.0522      49.7085      61.0295      32.9531      74.6073      87.1447      71.8696     111.8958      75.5229      33.5156      58.5222
22        223.0965     182.5694     125.1874     120.4876     109.8652      97.1044      62.8094      77.8679      37.9132      35.9672      41.0737
23     1.2263e+003     474.2302     276.8547     192.1393     125.8812      98.4278      65.1058      58.2420      37.4326      25.3984      27.6213
24         54.7021       0.8532       7.5403      85.9384      34.6321      46.8646      27.5912      21.0197      38.8109      28.3203      16.9679
Table 3: FFT Value for Shadow of the First Person
Y \ X        13        14        15        16        17        18        19        20        21        22        23
15      65.7333   26.7689   13.6458   26.4690    6.0415    5.6441   11.4858   15.8205   17.2347   14.0514    5.0686
16     810.1052   22.4386   17.2341   25.4873   19.5657   11.3301   13.0671   22.6651   23.3494   16.6694   17.2468
17     437.5159   11.7692   29.5860   30.4064   16.3400   15.1072   18.1585   27.3222   23.0206   31.2532   19.9592
18     265.7851   19.9796   34.1052   22.4042    5.2140   14.5426   10.2391   34.5306   24.6173   24.9053   19.0949
19     105.4484    6.6068   21.9910   17.6613   14.8491   15.8389   21.5125   25.2326   26.6703   23.7405   13.0090
20      24.8421   24.7017   16.5713   11.7211   13.6103   10.5658   23.2721   36.0771   43.5249   27.4489   31.3072
21      15.9284   17.5746   21.5772   11.6507   26.3777   30.8103   25.4097   39.5611   26.7014   11.8495   20.6907
22      78.8765   64.5480   44.2604   42.5988   38.8432   34.3316   22.2065   27.5305   13.4044   12.7163   14.5217
23     433.5471  167.6657   97.8829   67.9315   44.5057   34.7995   23.0184   20.5917   13.2344    8.9797    9.7656
Table 4: FFT Value for Shadow of the Second Person
Y \ X        34        35        36        37        38        39        40
15     222.3126   42.3882   20.3013   27.4398   29.2960   17.8270   26.7417
16     125.1830         0   48.0682   50.8358   28.7204    15.985   20.3745
17      44.0851   49.4350   43.2273   37.0754   39.1064   32.5815   29.1888
18      42.1538   25.3226   34.2731   44.0231   23.3926   27.9661   40.3127
19      30.5530   48.6852   33.6077   13.1388   31.1209   23.5266   22.1148
20     257.9192   48.7979   37.9749   42.1804   47.0736   55.4911   53.4813
21      62.3688   64.9047   34.3566   17.5257   67.7640   73.7835   60.9909
22     300.7448  124.4664  100.6929   65.4594   72.1998   36.1161   56.7034
23      65.3916   28.1762   32.7263   32.4742   22.3644   11.9918   47.5789
Figure 6(a): Results of the proposed CBVR system: (i) input query clip and (ii) the retrieved video clips
Figure 6(b): Results of the proposed CBVR system: (i) input query clip and (ii) the retrieved video clips
Figure 7: The input query clips, Q1, Q2, Q3, Q4 and Q5 given to the proposed system for precision and recall calculation
REFERENCES:
[1] Steven C.H. Hoi and Michael R. Lyu, "A Multimodal And Multilevel Ranking Framework For Content- Based Video Retrieval", 2007 International conference on Acoustics, speech, and Signal processing, Hawaii, USA, 15-20 April 2007.
[2] M. Petkovic, W. Jonker, "Content-Based Video Retrieval by Integrating Spatio-Temporal and Stochastic Recognition of Events", Proceedings of IEEE Workshop on Detection and Recognition of Events in Video, pp. 75-82, 2001.
[3] A. Yoshitaka, T. Ichikawa, “A Survey on Content-Based Retrieval for Multimedia Databases”, IEEE Transactions on Knowledge and Data Engineering, vol. 11, no. 1, pp. 81-93, 1999.
[4] M. Petkovic, "Content-based Video Retrieval", VII Conference on Extending Database Technology (EDBT), Ph.D. Workshop, Konstanz, Germany, March 2000.
252
[5] P. Aigrain, H. Zhang, D. Petkovic, Content based Representation and Retrieval of Visual Media: A State-of-the-Art Review, Multimedia Tools and Applications, Kluwer Academic Publishers, vol. 3(3), 1996, 179-202.
[6] J. R. Smith, S.F. Chang, “Visual SEEk: A Fully Automated Content-Based Image Query System”, ACM Multimedia Conference, Boston, MA, November 1996.
[7] A. Pentland, R. W. Picard, S. Sclaroff, “Photobook: Content-Based Manipulation of Image Databases”, Int. J. Computer Vision, vol. 18, no. 3, pp. 233- 254.
[8] C. Carson, M. Thomas, S. Belongie, J. M. Hellerstein, J. Malik, “Blobworld: A System for Region- Based Image Indexing and Retrieval”, Third Int. Conf. On Visual Information and Information Systems, Amsterdam, 1999, pp. 509-516.
[9] A. Hampapur, A. Gupta, B. Horowitz, C.F. Shu, C. Fuller, J. Bach, M. Gorkani, R. Jain, “Virage Video Engine”, SPIE Vol. 3022, 1997.
[10] D. Ponceleon, S. Srinivasan, A. Amir, D. Petkovic, D. Diklic, “Key to Effective Video Retrieval:
Effective Cataloging and Browsing”, ACM Multimedia, ‟98, pp. 99-107.
[11] S-F. Chang, W. Chen, H. Meng, H. Sundaram, D. Zhong, “A Fully Automated Content Based Video Search Engine Supporting Spatio-Temporal Queries”, IEEE Transaction on Circuits and Systems for Video Technology, Vol. 8, No. 5, Sept., 1998.
[12] Chi-Jiunn Wu Hui-Chi Zeng Szu-Hao Huang Shang-Hong Lai Wen-Hao Wang, "Learning-Based Interactive Video Retrieval System", Proceedings of IEEE International Conference on Multimedia and Expo., pp: 1785-1788, 9-12 July 2006,
[13] A. M. Ferman, A. M. Tekalp, and R. Mehrotra, “Robust color histogram descriptors for video segment retrieval and identification,” IEEE Transactions on Image Processing, Vol. 11, No. 5, pp 497-508, 2002.
[14] B. Erol, and F. Kossentini, “Shape-based retrieval of video objects,” IEEE, Trans. on Multimedia, Vol.
7, No. 1, pp 179-182, 2005.
[15] C.W. Ngo, T.C. Pong, H.J. Zhang, “Motion-based video representation for scene change detection,”
Int. Journal Computer Vision, pp 127-142, 2002.
[16] L. Chen and T.S. Chua, “A match and tiling approach to content-based video retrieval,” Proc. ICME, pp. 301-304, 2001.
[17] M.Y. Chen and A. Hauptmann, “Searching for a specified person in broadcast news video,” Proc.
ICASSP, Vol. 3, pp 1036-1039, 2004.
[18] T.N.Shanmugam, Priya Rajendran, “Effective Content-Based Video Retrieval System Based On Query Clip”, Proceeding of the 2nd International Conference On Advanced Computer Theory and Engineering, vol.2 no.5, pp.1095-1102, September 2009.
[19] J. Meng, Y. Juan, S.F. Chang, Scene Change Detection in a MPEG Compressed Video Sequence, SPIE Symposium on Electronic Imaging: Science and Technology - Digital Video Compression:
Algorithms and Technologies, SPIE Vol. 2419, San Jose, Feb. 1995.
[20] Shih-Fu Chang, William Chen, Horace J. Meng, Hari Sundaram, Di Zhong, "VideoQ: An Automated Content Based Video Search System Using Visual Cues", Proceedings of the fifth ACM international conference on Multimedia, Seattle, Washington, United States, pp. 313-324, 1997.
[21] Yang Liu, Weiqiang Wang, Wen Gao, Wei Zeng, "A novel compressed domain shot segmentation algorithm on H.264/AVC", in Proc. of International Conference on Image Processing, ICIP 2004, 24- 27 October 2004.
[22] Jain, Anil K., Fundamentals of Digital Image Processing, Englewood Cliffs, NJ, Prentice Hall, 1989, pp. 150-153.
[23] Pennebaker, William B., and Joan L. Mitchell, JPEG: Still Image Data Compression Standard, Van Nostrand Reinhold, 1993.
[24] Paul Bourke, "Correlation", from http://local.wasp.uwa.edu.au/~pbourke/miscellaneous/ correlate/, 1996.
[25] Tsuhan Chen, "From Low-Level Features to High-Level Semantics: Are We Bridging the Gap?”
Proceedings of the Seventh IEEE International Symposium on Multimedia, pp. 179, 2005.
[26] Y. F. Ma and H. J. Zhang, “Detecting Motion Object by Spatio-Temporal Entropy”, IEEE International Conference on Multimedia and Expo, Tokyo, Japan, August 22-25, 2001.
[27] K. Otsuka, T. Horikoshi, S. Suzuki and M. Fujii, “Feature Extraction of Temporal Texture Based on Spatio-Temporal Motion Trajectory”, Proceedings of the 14th International Conference on Pattern Recognition, ICPR‟98, pp.1047-1051, Aug. 1998.
[28] P. Bouthemy, R. Fablet, “Motion Characterization from Temporal Co-occurrences of Local Motion- Based Measures for Video Indexing”, Int. Conf on Pattern Recognition, ICPR'98, pp.905-908, Vol. 1, Australia, Aug. 1998.
[29] Society of Motion Picture and Television Engineers, "Television - Signal Parameters - 1125-Line High-Definition Production", SMPTE 240M-1999.
[30] I. Sobel, G. Feldman, A 3x3 Isotropic Gradient Operator for Image Processing, presented at a talk at the Stanford Artificial Project, unpublished but often cited, 1968
[31] W.Y. Ma and H. Zhang, “Content-Based Image Indexing and Retrieval,” Handbook of Multimedia Computing, CRC Press, 1999.
253
[32] Xiuqi Li, Shu-Ching Chen, Mei-Ling Shyu, Borko Furht, "An Effective Content-based Visual Image Retrieval System", Proceedings of the 26th Annual International Computer Software and Applications Conference, COMPSAC 2002, pp. 914-919, 2002.
[33] Jeff E. Tandianus, Andrias Chandra, Jesse S. Jin, "Video Cataloguing and Browsing", Proceedings of the Pan-Sydney area workshop on Visual information processing, vol. 11, pp. 39 - 45, 2001.
[34] B.S. Manjunathi and W.Y. Ma, "Texture Features for Browsing and Retrieval of Image Data", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 18, No. 8, August 1996.
[35] J.G. Daugman, "Complete Discrete 2D Gabor Transforms by Neural Networks for Image Analysis and compression," IEEE Trans.ASSP, vol. 36, pp. 1,169-1,179, July 1988.
[36] G.M. Haley and B.S. Manjunath, "Rotation Invariant Texture Classification Using the Modified Gabor filters," Proc. IEEE ICIP '95, vol. I, pp. 262-265, Washington D.C., Oct. 1995.
[37] C.H. Bennett, P. Gacs, M. Li, P. Vitanyi, and W. Zurek, “Information Distance”, IEEE Transactions on Information Theory, vol. 44, no. 4, pp. 1407–1423, 1998.
[38] B. Bigi, Y. Huang, R. d. Mori, “Vocabulary and Language Model Adaptation using Information Retrieval”, In Proceedings of the ECIR-2003, volume 2633 of Lecture Notes in Computer Science, pp.
305-319, Springer-Verlag, 2003.
[39] B. Bigi, “Using Kullback-Leibler Distance for Text Categorization”, In Proceedings of the ECIR- 2003, volume 2633 of Lecture Notes in Computer Science, pp. 305-319, Springer-Verlag, 2003 [40] C. Carpineto, R. d. Mori, G. Romano, B. Bigi, “An information-theoretic approach to automatic query
expansion”, ACM Transactions on Information Systems, vol. 19, no. 1, pp. 1-27, 2001.
[40] I. Dagan, L. Lee, F. Pereira, “Similarity-based models of word co-occurrence probabilities”, Machine Learning, vol. 34, no. 1–3, pp. 43-69, 1999.
[41] B. Bigi, R. d. Mori, M. El-Beze, T. Spriet, “A fuzzy decision strategy for topic identification and dynamic selection of language models”, Special Issue on Fuzzy Logic in Signal Processing, Signal Processing Journal, vol. 80, no. 6, pp. 1085–1097, 2000.
[42] Loris Nanni, Alessandra Lumini, "On selecting Gabor features for biometric authentication", International Journal of Computer Applications in Technology, vol.35 no.1, pp. 23-28, April 2009.
[43] Ville Kyrki, Joni-Kristian Kamarainen , Heikki Kälviäinen, “Simple Gabor feature space for invariant object recognition”, Pattern Recognition Letters, vol. 25 no. 3, pp.311-318, February 2004
[44] LinLin Shen , Li Bai , Michael Fairhurst, “Gabor wavelets and General Discriminant Analysis for face identification and verification”, Image and Vision Computing, vol. 25 no. 5, pp. 553-563, May, 2007 [45] J. Zhang, M. Marszałek, S. Lazebnik, C. Schmid, “Local Features and Kernels for Classification of
Texture and Object Categories: A Comprehensive Study”, International Journal of Computer Vision, vol.73 no.2, pp.213-238, June 2007
[46] Engin Avci, “An expert system based on Wavelet Neural Network-Adaptive Norm Entropy for scale invariant texture classification”, Expert Systems with Applications: An International Journal, vol.32 no.3, pp.919-926, April, 2007
[47] S. Arivazhagan, L. Ganesan, “Texture segmentation using wavelet transform”, Pattern Recognition Letters, vol.24 no.16, pp.3197-3203, December 2003.