
2017 2nd International Conference on Communications, Information Management and Network Security (CIMNS 2017) ISBN: 978-1-60595-498-1

A Video Segmentation and Key Frame Extraction Method for Compressed Video Stream

Feng-ling WANG

Hezhou University, Hezhou, Guangxi 542899, China

Keywords: Compressed video stream, DC coefficient, Motion vector, Key frame extraction.

Abstract. By detecting shot boundaries directly in the compressed domain, the decompression step can be omitted, features can be extracted directly from the original video data stream, and detection is accelerated. This paper analyzes the characteristics of video data, video segmentation, and key frame selection. Building on an analysis of the key technologies of video retrieval, a video segmentation and key frame extraction method for compressed video streams is introduced, and a method for extracting key frames from MPEG compressed video based on DC coefficients and motion vectors is proposed. Experiments show that the proposed method reduces the computational burden and better represents the video content.

Introduction

Video data is unstructured. Its volume, complexity, and lack of explicit structure make video storage and retrieval very difficult. To manage and utilize video information effectively, the video must be analyzed and an effective organizational structure built, so that the characteristics of the video can be extracted and synthesized and the video data stored and retrieved efficiently.

Video Data Characteristics

Video data differs from traditional character and numerical data. As a multimedia message, video data is non-character digital data, and compared with traditional character and numerical data it carries much richer content.

Video Data Has a High Information Resolution

So-called information resolution refers to how much detail a medium provides. Video data has high information resolution: as the depth of observation increases, new details gradually emerge.

The Diversity of Video Data Content

As a medium for representing information, video data can be divided into two types of content: the information content, i.e. the semantic content the video carries; and the audio-visual content, i.e. the external visual and auditory representation carried by the video and audio themselves.

The Diversity and Ambiguity of Video Data Interpretation

Video data is continuously reproduced image information, and the information contained in each image frame is very rich. Different people may interpret the same picture or video differently: unlike numerical data, video data has no complete and objective interpretation, and subjective factors often intervene. In a video database, queries are therefore usually similarity-based, i.e. video data is retrieved by approximate rather than exact matching [1].

Video Segmentation

The organization of video data must be a multi-level tree structure. For example, a feature film can be organized, from low to high, into shots, scenes, and plots. In general, the bottom level of the tree structure is the shot, and any video stream consists of many shots. Therefore, when a new video stream enters a video database, the data model should be based on multi-level segmentation of the stream, with shot segmentation at the bottom level. In the video segmentation structure, the shot is thus the most important form of video clip, and shot boundary detection technology is relatively mature [2].

Since video is composed of many shots, the key is to detect shot boundaries, i.e. to identify the shot transitions in the video stream. A shot transition is the change from one continuous image sequence to another, and includes abrupt transitions (cuts) and gradual transitions (fade in/out, dissolve, wipe, and so on). Transitions are created during video post-production, where shots are organized through editing: movies and news programs, for example, are assembled through editing, while purely recorded material such as sports video is not composed of edited shots. One of the most basic tasks of shot transition recognition is to decompose edited video into individual shots [3].

Because of the diversity of shot transitions, no general method can reliably detect every kind of transition. Cuts are relatively simple, so recognition methods usually achieve recognition rates above 90%; for gradual transitions, current methods reach only about 80% accuracy. The main challenge in transition recognition is therefore to find methods that adapt well to the various transition types and identify them more accurately, improving automatic shot segmentation. Most shot transition methods measure interframe differences in brightness, color, and motion characteristics [4].

Typical Shot Segmentation Methods

Template Matching Method

The template matching method takes the sum of the absolute values of the corresponding pixel differences of two frames as the interframe difference, as given by equation (1):

d(I_i, I_j) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} | I_i(x, y) - I_j(x, y) |    (1)

where d(I_i, I_j) is the interframe difference between frames I_i and I_j, I_i(x, y) is the pixel value at position (x, y) of the i-th frame, and M and N are the frame width and height. The method compares the changes between corresponding pixels of consecutive frames; if the change exceeds a threshold t, a shot cut is assumed.

The disadvantage of template matching is that it is very sensitive to noise and to camera or object motion, since it is strictly tied to pixel positions. Noise and object motion increase the interframe difference and cause false transition detections. An improved method divides each frame into small blocks of 8 × 8 pixels, averages each block, and then compares the corresponding block averages of consecutive frames. The averaging removes some noise and compensates for small object motion and camera movement [5].
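As a concrete illustration (not part of the original paper), the following Python sketch implements equation (1) and the block-averaged variant; OpenCV and NumPy are assumed, and all function names and the threshold are illustrative:

```python
import cv2
import numpy as np

def frame_difference(frame_a, frame_b):
    """Interframe difference of equation (1): sum of absolute
    pixel differences between two grayscale frames."""
    return np.abs(frame_a.astype(np.int32) - frame_b.astype(np.int32)).sum()

def block_difference(frame_a, frame_b, block=8):
    """Improved variant: average each 8x8 block first, then compare
    the block means, which suppresses noise and small motion."""
    h, w = frame_a.shape
    h, w = h - h % block, w - w % block  # crop to a multiple of the block size
    a = frame_a[:h, :w].reshape(h // block, block, w // block, block).mean(axis=(1, 3))
    b = frame_b[:h, :w].reshape(h // block, block, w // block, block).mean(axis=(1, 3))
    return np.abs(a - b).sum()

def detect_cuts(video_path, threshold):
    """Report frame indices where the block-based interframe difference
    exceeds the threshold t, i.e. assumed shot cuts."""
    cap = cv2.VideoCapture(video_path)
    cuts, prev, k = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev is not None and block_difference(prev, gray) > threshold:
            cuts.append(k)
        prev, k = gray, k + 1
    cap.release()
    return cuts
```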

Histogram Method

The histogram method is the most commonly used method for computing interframe differences. It ignores pixel positions and uses only pixel brightness and color statistics, so its noise resistance is better than that of template matching. The basic principle is to divide the color space into discrete bins and count the number of pixels of each image that fall into each bin. If the color space is divided into n bins and H_{ik} is the number of pixels of frame I_i that fall into the k-th bin, the interframe difference is given by equation (2):

d(I_i, I_j) = \sum_{k=1}^{n} | H_{ik} - H_{jk} |    (2)

The disadvantage of the color histogram method is that scene transitions are sometimes missed, because two images may have completely different structures yet very similar color histograms. Another interframe difference measure, similar to the color histogram method, is the χ² (chi-square) histogram method [6]; for shot transitions, its detection performance is superior to both methods above. The difference between two images is given by equation (3):

d(I_i, I_j) = \sum_{k=1}^{n} \frac{(H_{ik} - H_{jk})^2}{H_{jk}}    (3)
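The two histogram measures can be sketched as follows, assuming grayscale frames quantized into n bins; NumPy only, with illustrative names and a small epsilon added to avoid division by zero in equation (3):

```python
import numpy as np

def histogram(frame, n_bins=64):
    """Count how many pixels of a grayscale frame fall into each bin."""
    hist, _ = np.histogram(frame, bins=n_bins, range=(0, 256))
    return hist.astype(np.float64)

def hist_difference(h_i, h_j):
    """Equation (2): sum of absolute bin differences."""
    return np.abs(h_i - h_j).sum()

def chi_square_difference(h_i, h_j, eps=1e-9):
    """Equation (3): chi-square histogram difference; eps guards
    against empty bins in the denominator."""
    return ((h_i - h_j) ** 2 / (h_j + eps)).sum()
```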

Edge-based Approach

This method detects shot boundaries from edge features. Its basic idea is that when a shot transition occurs, new edges appear far from the positions of the old edges, and old edges disappear far from the positions of the new edges.

First, the edge images E_i and E_j of the two video frames I_i and I_j are extracted. The difference between the two frames is defined as diff = max(d_in, d_out), where d_in is the proportion of entering edge pixels (newly appearing edge pixels far from the existing edges) and d_out is the proportion of exiting edge pixels (disappearing edge pixels far from the new edges). Specifically, d_in = p_1 / p_m, where p_1 is the number of edge pixels in E_j whose distance to the nearest edge pixel of E_i exceeds r, and p_m is the total number of edge pixels in E_j; d_out = p_2 / p_n, where p_2 is the number of edge pixels in E_i whose distance to the nearest edge pixel of E_j exceeds r, and p_n is the total number of edge pixels in E_i. If diff exceeds a threshold t, a shot cut is assumed [7].
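A possible sketch of this edge change ratio, where OpenCV's Canny detector stands in for the unspecified edge extractor and dilation by r approximates the "within distance r" test; all parameter values are assumptions:

```python
import cv2
import numpy as np

def edge_change_ratio(frame_i, frame_j, r=5, t1=100, t2=200):
    """diff = max(d_in, d_out): fraction of new edge pixels far from
    old edges (d_in) and of old edge pixels far from new edges (d_out).
    Dilating an edge map by r approximates 'within distance r'."""
    e_i = cv2.Canny(frame_i, t1, t2) > 0
    e_j = cv2.Canny(frame_j, t1, t2) > 0
    kernel = np.ones((2 * r + 1, 2 * r + 1), np.uint8)
    near_i = cv2.dilate(e_i.astype(np.uint8), kernel) > 0
    near_j = cv2.dilate(e_j.astype(np.uint8), kernel) > 0
    p_m = max(e_j.sum(), 1)              # total edge pixels in E_j
    p_n = max(e_i.sum(), 1)              # total edge pixels in E_i
    d_in = (e_j & ~near_i).sum() / p_m   # entering edge pixels
    d_out = (e_i & ~near_j).sum() / p_n  # exiting edge pixels
    return max(d_in, d_out)
```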

Model-based Approach

The methods above use interframe differences to detect shot boundaries from the bottom up, which works well for cut detection but is difficult for gradual transitions, because it largely ignores the temporal correlation within a gradual transition. The model-based approach instead uses a priori knowledge of video editing to build mathematical models of the various transition types and detects transitions from the top down, so it often performs well on gradual-transition detection [8].

Through a study of the video production process, Hampapur found that shot boundaries can be detected using a video editing model. A typical gradual-transition model is given by equation (4):

f(x, y, t) = \alpha(t) g_1(x, y, t) + \beta(t) g_2(x, y, t)    (4)

where g_1(x, y, t) is the outgoing shot and g_2(x, y, t) is the incoming shot. If the motion within the shots is small, they can be approximated as g_1(x, y, t) ≈ g_1(x, y) and g_2(x, y, t) ≈ g_2(x, y). α(t) and β(t) are linear functions of time. Assuming the transition lasts from 0 to T, a slow transition can be modeled by equation (5):

\alpha(t) = \begin{cases} 1, & t \le 0 \\ 1 - t/T, & 0 < t \le T \\ 0, & t > T \end{cases}    (5)

\beta(t) = 1 - \alpha(t)

For a fade-out, g_2 ≡ 0; for a fade-in, g_1 ≡ 0. During the transition, each pixel changes linearly according to this law. The following constant image CI can then be defined by equation (6):

CI(x, y, t) = \frac{\partial f(x, y, t) / \partial t}{f(x, y, t)}    (6)

Assuming the shot fades out linearly without motion, i.e. α(t) = 1 − t/T, β(t) = 0, and g_1(x, y, t) ≈ g_1(x, y), then

CI(x, y, t) = \frac{\alpha'(t) g_1(x, y)}{\alpha(t) g_1(x, y)} = \frac{\alpha'(t)}{\alpha(t)} = -\frac{1}{T - t}    (7)

Therefore, at any time t during the fade, every pixel yields the same constant CI, and detecting the gradual transition reduces to detecting this model constant: for a given model, once a constant image is detected, a gradual transition is present. If the model is built accurately, the model-based approach achieves good results for gradual-transition detection, but a model must be built for each transition type, and the modeling process is complex [9].

Key Frame Selection Method

Key frames are the image frames that describe a shot. They reflect the shot's main content, and people often use key frames to identify scenes, stories, and other high-level semantic units, providing access points for video browsing and download. On the one hand, the chosen key frames must reflect the main events of the shot, so the description should be as accurate as possible; on the other hand, the amount of data should be as small as possible and the computation should not be too complex [10].

There are many ways to choose key frames. The classical methods are frame averaging and histogram averaging. Frame averaging takes the average of the pixel values at each position over all frames of the shot and selects as the key frame the frame whose pixel values are closest to that average; histogram averaging computes the average histogram of all frames of the shot and selects as the key frame the frame whose histogram is closest to the average. These methods have the advantage of relatively simple computation, and the selected frame is representative in an average sense. Their disadvantage is that a key frame selected this way cannot describe the motion of multiple objects within the shot. In general, selecting a fixed number of key frames per shot is not a good idea: for shots with little change it selects too many key frames, while for shots with more motion one or two key frames cannot describe the content fully.
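A minimal sketch of the histogram-averaging method just described (illustrative names; grayscale frames assumed):

```python
import numpy as np

def key_frame_by_average_histogram(frames, n_bins=64):
    """Average the histograms of all frames in a shot, then return the
    index of the frame whose histogram is closest to that average."""
    hists = [np.histogram(f, bins=n_bins, range=(0, 256))[0].astype(float)
             for f in frames]
    mean_hist = np.mean(hists, axis=0)
    distances = [np.abs(h - mean_hist).sum() for h in hists]
    return int(np.argmin(distances))
```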

Wolf et al. compute the amount of motion in a shot by optical flow analysis and select key frames at the local minima of motion, which correspond to moments of rest in the video and often indicate points of emphasis. The method first uses the Horn-Schunck method to compute the optical flow and sums the moduli of the optical flow components of every pixel to obtain the motion amount M(k) of the k-th frame, i.e.

M(k) = \sum_{i} \sum_{j} \left( | O_x(i, j, k) | + | O_y(i, j, k) | \right)    (8)

where O_x(i, j, k) is the x component and O_y(i, j, k) the y component of the optical flow of pixel (i, j) in frame k.

Then the local minima of M(k) are located. Scanning the M(k) curve, a local minimum lying between two local maxima M(k_1) and M(k_2) is selected as a key frame if it differs from both maxima by at least P%; k_2 then becomes the current k_1, and the search continues for the next k_2. Wolf's motion-based approach selects a number of key frames appropriate to the structure of the shot. If the moving object stands out from the background, the optical flow at the object's location allows good results to be achieved.
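A sketch of this motion-based selection; OpenCV's Farneback optical flow is substituted here for the Horn-Schunck method, and the simplified local-minimum test omits the P% refinement described above:

```python
import cv2
import numpy as np

def motion_curve(gray_frames):
    """M(k) of equation (8): sum of |Ox| + |Oy| over all pixels,
    computed between consecutive grayscale frames."""
    motion = []
    for prev, curr in zip(gray_frames, gray_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        motion.append(np.abs(flow[..., 0]).sum() + np.abs(flow[..., 1]).sum())
    return motion

def key_frames_at_local_minima(motion):
    """Frames where the motion curve has a local minimum, i.e. where
    the shot momentarily comes to rest."""
    return [k for k in range(1, len(motion) - 1)
            if motion[k] < motion[k - 1] and motion[k] < motion[k + 1]]
```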

Summary

Based on the characteristics of video data, video segmentation and key frame extraction performed directly on the compressed video stream are adopted. Video search over massive video collections, allowing users to quickly browse and play content according to the features they supply, clearly has very broad application prospects. To put such a system into practical use, however, future research needs to address the following issues: compression techniques for video streams must improve so that users can view clearer video, and the feature space must be enriched to describe video content better. These are important directions for future research.

Acknowledgment

This research was supported by the Professors' Scientific Research Foundation of Hezhou University under Grant No. HZUJS201615.

References

[1] Hu Shengwu, Li Kunpeng. Research on the key technology of 3D GIS [J]. Geospatial Information, 2008, 6(3): 9-12.

[2] Zhu Yingying, Zhou Dongru. A method of extracting key frames from compressed video streams [J]. Computer Engineering and Applications, 2003, (18).

[3] Wang Di, Huang Chunyi. Content-based video retrieval [J]. Modern Library and Information Technology, 2000, (86).

[4] Lu Yan, Chen Fusheng. Content-based video retrieval technology [J]. Computer Application Research, 2003, (11).

[5] Zhu Ai-hong, Li Lian. Study on key technology of content-based video retrieval [J]. Information Retrieval, 2004, (01).

[6] Peng Yuxin, Ngo Chong-wah, Guo Zongming, Xiao Jianguo. Key technology of content-based video retrieval [J]. Computer Engineering, 2004, (01).

[7] Meng Qian. Research on video database data model based on content retrieval [J]. Journal of Xuzhou Normal University (Natural Science Edition), 2003, 21(4).

[8] Yasuyuki Nakajima. Video browsing using fast scene cut detection for an efficient networked video database access [J]. IEICE Transactions on Information & Systems, 1994, E77-D(12): 1335-1364.

[9] Yeo B L, Liu B. A unified approach to temporal segmentation of motion JPEG and MPEG compressed video [C]. In: Proc. IEEE Int. Conf. on Multimedia Computing and Systems, Washington, DC, 1995: 81-90.
