Video Summarization and Restoration

Soniya Sunit Shukla

Department of Computer Engineering

Student of Fr. Conceicao Rodrigues College of Engineering, Bandra(W),

Mumbai, India [email protected]

Shirley Stephen Dsouza

Department of Computer Engineering

Swati Chandrakant Bharmal

Department of Computer Engineering

Abstract—In video processing, numerous frames containing similar information are usually processed. This leads to unnecessary time consumption, slow processing and increased complexity. Video summarization using key frames can speed up video processing. This paper considers three key frame extraction techniques. The first uses DWT wavelet statistics: two consecutive frames are DWT transformed and the differences between their detail components are estimated. If the difference value of a consecutive pair is greater than a threshold, the last frame of the pair is considered a key frame. The second technique, Block based Histogram Difference, extracts key frames using the block based histogram difference and the edge matching rate. The third technique uses Human Motion Detection, in which Image Acquisition, Image Subtraction, Motion Detection, a Recognition Engine and Area Mapping are used to obtain key frames. The extracted key frames then undergo restoration. Video restoration is an important field in the image and video processing community; it aims to estimate a high-quality version of a typically noisy, low-resolution input. The final output of the proposed system is a summarized, restored video.

Keywords—Motion Detection; Background Subtraction; Stationary; Tracking; Surveillance; Restoration; Histogram Difference; DWT Wavelet.

I. INTRODUCTION

In recent years, faster digital devices with more storage space have been developed rapidly. Hence, instead of being represented as text, still images or audio, information is increasingly recorded and stored as video. Video summarization using key frames can speed up video processing. It is also important for video to be presented in high resolution, so video restoration forms an important part of video processing. A video is composed of continuous still images called frames. When the complete information of a video must be shown quickly, only the key frames need to be shown instead of all frames. Each key frame represents the necessary information of one shot of the video. This compact display technique is usually called video summarization. Many techniques for key frame detection have been reported so far.

A. Description

shakes due to internal vibration, depth of field, dirty lens and focus. Hence, restoration is performed on the key frames, and the final video is recorded using these restored key frames.

B. Purpose and Objective

As proposed earlier, this project is to be linked with another project to form the final system, called the Human Motion Detection System. This project focuses on the Video Motion Detection module, for which we research the techniques and methodology for detecting motion and develop a module for the technique we prefer. This module records motion and passes it to the next module, object classification, which classifies objects as human or non-human. Thus, the aim of this project is a solution that detects motion effectively and records it together with the one or more moving objects that cause it.

The purpose of this project is to help new researchers learn about and further research their topic of interest, which in this case is the human motion detection system. The question addressed in this module is: given a sequence of images, how do we detect motion or track a moving object? The project mainly answers this question by providing a prototype that demonstrates the available algorithms and techniques for motion detection on an input sequence of image frames. The aims of the system are as follows:

• To study DWT Wavelet Statistics to extract key frames from offline video.

• To study key frame extraction based on Block based Histogram Difference and Edge Detection from offline video.

• To study the Human Motion Detection technique to extract key frames from real-time as well as offline video.

• To compare the DWT Wavelet Statistics, Histogram Difference and Human Motion Detection techniques.

• To study restoration techniques to restore key frames.

C. Resources and Requirements

The project uses a method described in David Moore's final thesis, "A real-world system for Human Motion Detection and Tracking" [3], from the California Institute of Technology. This module mainly requires functions and algorithms from Intel's OpenCV library. On the hardware side, we used a webcam for testing, with specifications of up to 30 frames per second and support for screen sizes up to 600x800. However, the project implemented only a resolution of 300x200. The reasons are speed performance issues and a limitation of the rectangle drawing functions in the Intel OpenCV library, which did not draw well at larger resolutions. This, however, may be a problem only in the version used here, which is beta 3.1.

II. SYSTEM DESIGN AND ARCHITECTURE

In this chapter, we look at the design of the methods and techniques implemented in the final prototype system. Diagrams of the architecture of the motion detection algorithms are also presented in this chapter.

A. System Overview

Fig. 1 Overview of a basic motion detection application system

As shown in Fig. 1, a basic human motion detection system [1] would have an alarm system integrated. To give a clearer picture of the system developed, Fig. 2 is shown below:

Fig. 2 Overview of the prototype human motion detection application system

III. METHODOLOGY

In this section, some related work on key frame extraction and restoration will be discussed.

A. Concept of Video Frame and Scene Change

contents in the frames of each scene are also different. The visual contents can be represented by two approaches, spatial (pixel) and frequency (color). In the spatial approach, the visual contents of a frame or image are composed of pixels with different gray level values. Hence, if some of the corresponding pixel values of two frames are different, the visual contents of the two frames are also different. Similarly, in the frequency approach, if the frequency components of two images are different, the images are different. In this way, scene changes can be detected by finding differences that exceed a threshold.

Fig. 3 Illustration of (a) video frames of a scene and (b) the selected frame that represents the information of the scene.

B. Key Frame Extraction

1) DWT Wavelet and Frequency Components

Each key frame represents its related scene and contains all the important information of that scene. After key frame extraction [3], the key frames are intended for use in video summarization, feature extraction [2] and other processing, so the key frame extraction algorithm should not be very complex or time consuming. To detect the key frame of a scene, we have to find a clear difference between two successive frames. In a video stream, each frame varies only slightly from the previous one. However, whenever the scene changes, the visual contents and objects of the current frame and the next one are clearly different. Hence, we use this difference as the key for key frame detection. The proposed system is based on the DWT wavelet [3]. Whenever the visual contents of the original image change, the frequency components of the transformed image also change. When the scene changes, the visual contents of the current and later frames are not the same, so the frequency components of the frames also differ. Using this difference, we can detect the key frame of the next scene. The algorithm for key frame extraction can be divided into four steps [3]:

Input: Video V, containing N frames

Output: Key frames for the input video

1. Two successive frames are read and transformed with DWT to achieve four sub-bands, LL, HL, LH and HH.

Fig. 4: Illustration of DWT wavelet transformation of an image

Of the four sub-bands, only the three detail sub-bands, HL, LH and HH, are used to detect key frames:

cH1 = HL band of current frame
cV1 = LH band of current frame
cD1 = HH band of current frame
cH2 = HL band of next frame
cV2 = LH band of next frame
cD2 = HH band of next frame

For each sub-band, a difference value is estimated by subtracting the detail component values of the current and next frames.

2. The mean and standard deviation are computed from the difference values of each sub-band.

3. Threshold value for each sub-band is calculated by adding the Mean and Standard Deviation.

4. The threshold and difference value of each band are compared. If the difference values of any two sub-bands exceed their respective thresholds, the last frame of the pair is considered a key frame [3]. The above algorithm is not very complex and can detect the key frames of a video clip easily and quickly.
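The four steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation of [3]: frames are grayscale lists of rows with even height and width, a one-level Haar transform stands in for the DWT, the "difference value" of a sub-band is taken to be the mean absolute coefficient difference, and the mean and standard deviation are computed over all consecutive pairs of the clip. The function names are hypothetical.

```python
from statistics import mean, pstdev


def haar_detail_bands(img):
    """One-level 2-D Haar transform; returns the three detail bands (cH, cV, cD).

    Assumption: which of these is labelled HL or LH differs between libraries.
    """
    cH, cV, cD = [], [], []
    for r in range(0, len(img), 2):
        row_h, row_v, row_d = [], [], []
        for c in range(0, len(img[0]), 2):
            a, b = img[r][c], img[r][c + 1]
            d, e = img[r + 1][c], img[r + 1][c + 1]
            row_h.append((a + b - d - e) / 2)  # change between the two rows
            row_v.append((a - b + d - e) / 2)  # change between the two columns
            row_d.append((a - b - d + e) / 2)  # diagonal change
        cH.append(row_h)
        cV.append(row_v)
        cD.append(row_d)
    return cH, cV, cD


def band_differences(frame1, frame2):
    """Step 1: one difference value per detail band for a pair of frames."""
    diffs = []
    for band1, band2 in zip(haar_detail_bands(frame1), haar_detail_bands(frame2)):
        values = [abs(x - y) for r1, r2 in zip(band1, band2) for x, y in zip(r1, r2)]
        diffs.append(mean(values))
    return diffs


def key_frames(frames):
    """Steps 2 to 4: threshold = mean + standard deviation, per sub-band."""
    diffs = [band_differences(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]
    thresholds = [mean(col) + pstdev(col) for col in zip(*diffs)]
    keys = []
    for i, pair_diff in enumerate(diffs):
        over = sum(1 for value, t in zip(pair_diff, thresholds) if value > t)
        if over >= 2:           # at least two sub-bands exceed their threshold
            keys.append(i + 1)  # index of the last frame of the pair
    return keys
```

Calling `key_frames(frames)` returns the indices of the frames that follow a detected scene change.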

2) Key Frame Extraction Based on Block Based Histogram Difference and Edge Detection

The method for key frame extraction consists of three steps:

1. Input a video and calculate the block based histogram difference of each pair of consecutive frames [4].

The flow chart for detecting the candidate key frames from a video is shown in Fig. 5. First, the original video is converted into frames. Then each frame is decomposed into three components (red, green and blue images), and each component of every frame is divided into nine blocks. The histogram of each block is computed, and the histogram difference between successive frames is found block by block. If the histogram difference is greater than the threshold value, a candidate key frame is detected. The flowchart is as follows [4]:

Fig. 5: Flowchart of Histogram Difference and Edge Detection
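The candidate detection stage described above can be sketched as follows. This is a hedged illustration only: frames are assumed to be lists of rows of (R, G, B) tuples with 8-bit values, the nine blocks are taken as a 3x3 grid, the histograms use 8 bins, and the per-block differences are summed into one value. The bin count, the summation and the function names are assumptions, not details taken from [4].

```python
def block_histograms(frame, grid=3, bins=8):
    """Histogram of every block of every colour channel (3 * grid * grid histograms)."""
    height, width = len(frame), len(frame[0])
    histograms = []
    for channel in range(3):
        for by in range(grid):
            for bx in range(grid):
                hist = [0] * bins
                for y in range(by * height // grid, (by + 1) * height // grid):
                    for x in range(bx * width // grid, (bx + 1) * width // grid):
                        hist[frame[y][x][channel] * bins // 256] += 1
                histograms.append(hist)
    return histograms


def histogram_difference(frame1, frame2, grid=3, bins=8):
    """Sum of absolute bin differences, accumulated block by block."""
    h1 = block_histograms(frame1, grid, bins)
    h2 = block_histograms(frame2, grid, bins)
    return sum(abs(a - b) for b1, b2 in zip(h1, h2) for a, b in zip(b1, b2))


def candidate_key_frames(frames, threshold):
    """Indices of frames whose difference from the previous frame exceeds the threshold."""
    return [i + 1 for i in range(len(frames) - 1)
            if histogram_difference(frames[i], frames[i + 1]) > threshold]
```

The threshold is left as a parameter because the surviving text does not state how it is chosen.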

3) Key Frame Extraction using Human Motion Detection Technique

We take the input video from the user and perform image acquisition [2]. We then run the motion detection algorithm, which gives key frames as output. A HOG recognition engine [1] then detects whether each detected object is human or not. Restoration is performed on the output of this step, and the restored frames are merged into a video to obtain the summarized and restored video.

Fig. 6: Detailed Description of Video Summarization and Restoration

1. Image Acquisition

The system developed here can capture a sequence of images either from real-time video from a camera or from a recorded sequence in ".avi" format. This is the initial step of most motion detection algorithms [2]. Images are captured at a resolution of 300x200. Whatever the camera's property settings, our prototype system uses only 300x200 resolution for its calculations. Since the camera we used is capable of more than 300x200 image resolution, we assume that other cameras with the same or better specifications can run our prototype program without problems. Speed performance issues were a further reason for developing this project at this resolution [2].

2. Image subtraction

Here two subsequent images are compared using the arithmetic operation of subtraction on the pixel values of the images. We implemented an absolute subtraction technique [2], so it makes no difference which image is subtracted from which; both orders give the same resulting image. Since color images are usually used, the algorithm considers the color data of the images. The technique first separates the images into three planes or channels. It then performs the arithmetic subtraction on each plane or channel. After that, the results are combined again to form a color image. Thus, the result of subtraction [2] is in an inverted color format. The reason for the inversion is simply that a subtraction of the pixel values was performed.
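A minimal sketch of this split, subtract and merge sequence is given below, assuming images are lists of rows of (R, G, B) tuples. The helper names are hypothetical and the prototype itself relied on OpenCV routines rather than this code.

```python
def split_channels(image):
    """Separate an RGB image into three single-channel planes."""
    return [[[pixel[c] for pixel in row] for row in image] for c in range(3)]


def merge_channels(planes):
    """Combine three planes back into one image of (R, G, B) tuples."""
    height, width = len(planes[0]), len(planes[0][0])
    return [[tuple(planes[c][y][x] for c in range(3)) for x in range(width)]
            for y in range(height)]


def absolute_subtraction(image_a, image_b):
    """Per-channel |A - B|; the result is the same for either argument order."""
    planes_a, planes_b = split_channels(image_a), split_channels(image_b)
    difference = [[[abs(a - b) for a, b in zip(row_a, row_b)]
                   for row_a, row_b in zip(plane_a, plane_b)]
                  for plane_a, plane_b in zip(planes_a, planes_b)]
    return merge_channels(difference)
```

Because the absolute value is taken, `absolute_subtraction(a, b)` and `absolute_subtraction(b, a)` give identical images, which is the property the text relies on.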

3. Image processing

4. Motion Detection

Here we have implemented two algorithms [2] to update the background image used in the image subtraction stage of the motion detection algorithm [2], as shown in Fig. 6. We discuss the two algorithms in the following sections [2]:

a. Spatial Update

The first algorithm was designed so that the background model [2] is updated based on spatial data instead of temporal data. By temporal data we mean the time or frame number of the images in the sequence. Since this method uses only spatial data, the background is updated based on the percentage of pixels that changed in the subtraction result. For example, let A be the current frame being processed and let B be the previous or initial background image. After the image subtraction stage of the motion detection algorithm, the resulting subtraction image is considered. The percentage of pixel change is then calculated from the number of nonzero pixels in the image resulting from the subtraction.

To be exact:

\[ \text{Percentage} = \frac{\text{number of nonzero pixels in the subtraction result } |A-B|}{\text{total number of pixels in } A \text{ (equally } B \text{ or } |A-B|\text{)}} \times 100 \]

We set the threshold for the updating process to 80%, meaning that the background is replaced by a completely new image (the current frame), without any further computation, when the percentage calculated with the above formula is greater than or equal to 80. With this algorithm [2], processing speed increases simply because less computation is done; there is no need to collect statistics on temporal information. The assumption made here is that the background changes only if the camera is repositioned or reinitialised, which produces spatial data in which the pixel change is higher than 80%. This method [2] also helps to eliminate the need for a tracking algorithm, since even non-moving objects that were not in the initial background image are detected. However, it suffers several drawbacks, such as giving more false alarms, which can be removed only by reinitialising the background image.
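The spatial update rule can be sketched as follows, assuming grayscale images stored as lists of rows and the 80% threshold stated above. The function names are hypothetical.

```python
def changed_percentage(frame, background):
    """Percentage of pixels that are nonzero in |frame - background|."""
    total = len(frame) * len(frame[0])
    nonzero = sum(1 for row_f, row_b in zip(frame, background)
                  for a, b in zip(row_f, row_b) if abs(a - b) != 0)
    return 100.0 * nonzero / total


def update_background(frame, background, threshold=80.0):
    """Replace the background with the current frame when enough pixels changed."""
    if changed_percentage(frame, background) >= threshold:
        return [row[:] for row in frame]  # completely new background
    return background                     # otherwise keep the old one
```

For a 10-pixel image in which 8 pixels differ from the background, the percentage is exactly 80, so the background is replaced.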

b. Temporal Update + Edge Detection

For the second algorithm [2], we implemented a usual motion detection approach for updating the background image. Compared to the first algorithm [2], this method uses many more computation steps and therefore responds more slowly. The information used in the computations comes from subsequent frames and does not depend on a single subtraction result; in fact, it does not involve any subtraction results at all. It uses a ready-made running average method provided by the Intel OpenCV group. The method described as accumulative difference is similar to the method [2] that we present here for the second algorithm of the background update model [2]. All frames are used in the computation to produce a running average image. A basic running average is defined as the sum of the previous value of a pixel at a certain location and the new value taken by the corresponding pixel in the next frame of the image sequence, with a certain factor degrading the old pixel value. For example, let A represent the current frame being processed and let B represent the next frame in the image sequence. The factor degrading the old pixel is denoted W. The running average image R is computed as [2]:

R[x,y] = (1-W) • A[x,y] + W • B[x,y]

As the computation continues, the function replaces the A[x,y] pixel values with the R[x,y] values, forming a recurrence equation [2]:

R[x,y] = (1-W) • R[x,y] + W • B[x,y]

Here x and y denote the location of the pixel referred to in the computation. All pixels of the images are computed in this way to give a running average image of the same size, which is used as the background image. The background image is therefore formed after a certain period of time and is constantly updated at the frequency set for that period of update [2]. We implemented a double layered running average: the same formula is applied to a number of images resulting from the running average formula to form a new result, which is used as our background instead of the first-layer results directly. Two timer macros are therefore needed. One defines the frequency at which the temporal data obtained from the first layer of the running average function is fed as input to the second layer [2], and the other defines the frequency at which the background is updated using the second layer. In algorithm 2, the moving edge concept is also implemented, because the result returned is completely different from the result obtained by algorithm 1. The clear human shape obtained by algorithm 1 is not obtained here, so using edges helps considerably in distinguishing human objects from non-human ones. The AND operator was not used to combine the edges obtained from the current frame with the result of subtraction. Instead, the implemented moving edge technique applies a Canny edge operator [2] directly to the image resulting from the subtraction process.
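The recurrence R[x,y] = (1-W) • R[x,y] + W • B[x,y] and a simplified double layered variant can be sketched as follows. Images are assumed to be grayscale lists of rows. The single `period` parameter is a simplification of the two timer macros described above, and all names are hypothetical.

```python
def running_average(average, frame, w):
    """One update step: R = (1 - W) * R + W * B, applied to every pixel."""
    return [[(1 - w) * r + w * b for r, b in zip(row_r, row_b)]
            for row_r, row_b in zip(average, frame)]


def two_layer_background(frames, w1, w2, period):
    """First layer averages every frame; second layer averages the first layer
    once every `period` frames and serves as the background image."""
    first = [[float(v) for v in row] for row in frames[0]]
    second = [row[:] for row in first]
    for n, frame in enumerate(frames[1:], start=1):
        first = running_average(first, frame, w1)
        if n % period == 0:
            second = running_average(second, first, w2)
    return second
```

A larger W makes the average follow new frames more quickly, while a smaller W keeps more of the old background.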

5. Recognition Engine

consecutive video frames and determine the absolute difference. A threshold value is used to determine the results. If I_n is the intensity of the nth frame image from the real-time video and I_{n-1} is that of the previous frame, then the pixel-wise difference function D_n is defined as [1],

Accordingly, the motion image can be calculated as follows:

where i and j are the pixel positions along the horizontal and vertical directions.
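The two equations referred to in this passage are not present in this copy. A plausible reconstruction from the surrounding description (absolute difference of consecutive frames followed by thresholding) is given below. The threshold symbol T, the motion image symbol M_n and its binary form are assumptions, and the exact form used in [1] may differ.

```latex
\[ D_n(i,j) = \left| I_n(i,j) - I_{n-1}(i,j) \right| \]

\[ M_n(i,j) =
   \begin{cases}
     1, & D_n(i,j) > T \\
     0, & \text{otherwise}
   \end{cases} \]
```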

a. Architecture of Recognition Engine

Fig. 7: Motion Detection System

After the motion image is determined, several motion regions can be extracted. Fig. 7 shows the motion extraction results. From Fig. 7, we can see that the walking pedestrian region is extracted, most of the original image is filtered out by motion detection, and the computational complexity is reduced. Motion region 2 is bigger than a pedestrian area because a tree branch is swinging in the wind, so the walking pedestrian and the branch form one moving region; the person can be found by the subsequent detection algorithm. As shown in Fig. 7, motion region 3 contains two people; we may call it a "multi-person" region. Because the distance between the people is small, they are seen as one region after motion detection; the sliding window and human detection algorithm can divide them into two individuals. Motion region 4 is a moving car, which can be classified as a non-human object.

b. HOG Descriptor

This section gives an introduction to the HOG feature extraction algorithm [1]. The method is based on evaluating well-normalized local histograms of image gradient orientations in a dense grid. The basic idea is that local object appearance and shape can often be characterized rather well by the distribution of local intensity gradients or edge directions, even without precise knowledge of the corresponding gradient or edge positions. HOG features [1] are similar to the SIFT descriptor [1] in that gradient orientation information is used to embody the characteristics of objects; the latter, however, is calculated on the basis of key point detection and is used for image matching, while HOG features are descriptions of orientation histograms and are calculated as follows. The first stage applies a simple template to calculate the gradients of every point along two directions. In the second stage, the image detection window is divided into small spatial regions called "cells". For each cell, a local 1-D histogram of gradient or edge orientations over all the pixels in the cell is accumulated. This is the "orientation histogram" representation [1]. The gradient magnitudes of the pixels in the cell are used to vote into the orientation histogram. Several cells form a group called a "block". The third stage is normalization: for better invariance to illumination and shadowing, it is useful to contrast-normalize the local responses. This can be done by accumulating a measure of local histogram "energy" over each block and using the result to normalize each cell in the block. A cell thus appears several times in the final output vector with different normalizations. The normalized block descriptors are the Histogram of Oriented Gradient (HOG) descriptors [1]. HOG descriptors capture local contour information, that is, the edge and gradient structure, and reflect the characteristics of local shape. HOG features can be used for pedestrian detection.
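The three stages can be illustrated with a small pure-Python sketch. The parameters are assumptions chosen for illustration and are not taken from [1]: a centred [-1, 0, 1] gradient template, 4x4-pixel cells, 9 unsigned orientation bins over 0 to 180 degrees, hard bin assignment without interpolation, and L2 block normalization. The function names are hypothetical.

```python
import math


def gradients(image):
    """Stage 1: gradient magnitude and unsigned orientation (degrees) per pixel."""
    height, width = len(image), len(image[0])
    magnitude = [[0.0] * width for _ in range(height)]
    angle = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            gx = image[y][min(x + 1, width - 1)] - image[y][max(x - 1, 0)]
            gy = image[min(y + 1, height - 1)][x] - image[max(y - 1, 0)][x]
            magnitude[y][x] = math.hypot(gx, gy)
            angle[y][x] = math.degrees(math.atan2(gy, gx)) % 180.0
    return magnitude, angle


def cell_histogram(magnitude, angle, y0, x0, cell=4, bins=9):
    """Stage 2: magnitude-weighted orientation histogram of one cell."""
    hist = [0.0] * bins
    for y in range(y0, y0 + cell):
        for x in range(x0, x0 + cell):
            index = int(angle[y][x] / (180.0 / bins)) % bins
            hist[index] += magnitude[y][x]
    return hist


def normalize_block(cell_histograms, eps=1e-6):
    """Stage 3: concatenate the cells of a block and L2-normalize the result."""
    vector = [v for hist in cell_histograms for v in hist]
    norm = math.sqrt(sum(v * v for v in vector) + eps * eps)
    return [v / norm for v in vector]
```

A full descriptor would slide overlapping blocks over the detection window and concatenate the normalized block vectors, which is why each cell contributes several times.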

c. SVM Classifier

Where parameter b is estimated using the Kuhn-Tucker conditions [1] and parameter a(i) is computed by minimizing the quadratic problem,

positive definite kernel [1]. Training samples with non-zero a(i) are the support vectors. The second stage is testing: the classifier function C(x) is applied to image windows detected as road signs to estimate the confidence in the class value. According to the analysis above, C(x) is the estimated classification value. We link this value with human detection. If C(x) > 0, the sliding window can be regarded as a human region; the higher C(x), the higher the confidence. Conversely, if C(x) < 0, the detection window can be regarded as another object.
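The testing stage can be sketched as follows. Since the classifier equation itself is missing from this copy, the sketch assumes the standard kernel SVM decision function, C(x) = sum over i of a(i) * y(i) * K(x(i), x) + b, with a linear kernel. The support vectors, coefficients and function names are illustrative placeholders, not trained values.

```python
def linear_kernel(u, v):
    """Dot product; a simple positive definite kernel used here as an assumption."""
    return sum(a * b for a, b in zip(u, v))


def decision_value(x, support_vectors, alphas, labels, b, kernel=linear_kernel):
    """C(x): weighted sum of kernel evaluations against the support vectors."""
    return sum(a * y * kernel(sv, x)
               for sv, a, y in zip(support_vectors, alphas, labels)) + b


def is_human(x, support_vectors, alphas, labels, b):
    """A window is regarded as a human region when C(x) > 0."""
    return decision_value(x, support_vectors, alphas, labels, b) > 0
```

In the pipeline described here, x would be the HOG descriptor of a sliding window, and the magnitude of C(x) acts as the confidence.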

6. Area Mapping

After the bounding rectangles are identified, their positions are mapped to the source image and the rectangles are drawn there. Mapping [2] is done for the area of the bounding boxes drawn in the source image and for the corresponding area in the binary processed image. The areas from the processed binary image are the ones used by the recognition engine [1]. This project does not cover the recognition engine and the output component. The recognition engine [1] is taken from the work of the partner in this project.

7. Bounding Rectangle Calculations

Some operations are performed to remove overlapping boxes or rectangles when they are drawn on the source image. Rectangles or boxes that are near one another and almost cross one another's edges are joined to form a larger rectangle or box [2]. This helps eliminate the problem of only a portion of a human being returned for recognition.
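One way to join nearby rectangles is sketched below. Rectangles are assumed to be (x, y, width, height) tuples, and "near" is taken to mean overlapping or separated by at most `gap` pixels; both the representation and the gap value are assumptions, since the text does not specify them.

```python
def near(r1, r2, gap):
    """True if the rectangles overlap or lie within `gap` pixels of each other."""
    x1, y1, w1, h1 = r1
    x2, y2, w2, h2 = r2
    return (x1 - gap <= x2 + w2 and x2 - gap <= x1 + w1 and
            y1 - gap <= y2 + h2 and y2 - gap <= y1 + h1)


def union(r1, r2):
    """Smallest rectangle that contains both inputs."""
    x = min(r1[0], r2[0])
    y = min(r1[1], r2[1])
    x_end = max(r1[0] + r1[2], r2[0] + r2[2])
    y_end = max(r1[1] + r1[3], r2[1] + r2[3])
    return (x, y, x_end - x, y_end - y)


def merge_rectangles(rectangles, gap=5):
    """Repeatedly join near rectangles until no pair can be merged."""
    rects = list(rectangles)
    merged = True
    while merged:
        merged = False
        for i in range(len(rects)):
            for j in range(i + 1, len(rects)):
                if near(rects[i], rects[j], gap):
                    rects[i] = union(rects[i], rects[j])
                    del rects[j]
                    merged = True
                    break
            if merged:
                break
    return rects
```

The loop restarts after every merge because a newly enlarged rectangle may now be near a rectangle it was previously far from.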

C. Image Restoration

Image restoration [6] is the operation of taking a corrupt or noisy image and estimating the clean, original image. Corruption may come in many forms, such as motion blur, noise and camera mis-focus. Image restoration is performed by reversing the process that blurred the image. This is done by imaging a point source and using the point source image, which is called the Point Spread Function (PSF) [6], to restore the image information lost in the blurring process.

1) Sharpening Enhancement

Image sharpening consists of adding to the original image a signal that is proportional to a high-pass filtered version of the original image. The original image is first filtered by a high-pass filter that extracts the high-frequency components, and then a scaled version of the high-pass filter [6] output is added to the original image, producing a sharpened image. Note that the homogeneous regions of the signal, i.e., where the signal is constant, remain unchanged.
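The sharpening rule (output = original + k × high-pass output) can be sketched as follows. The 4-neighbour Laplacian used as the high-pass filter, the scale factor k, the border handling and the clipping to 0..255 are assumptions made for this illustration.

```python
def high_pass(image):
    """4-neighbour Laplacian with clamped borders; zero wherever the image is flat."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            up = image[max(y - 1, 0)][x]
            down = image[min(y + 1, height - 1)][x]
            left = image[y][max(x - 1, 0)]
            right = image[y][min(x + 1, width - 1)]
            out[y][x] = 4 * image[y][x] - up - down - left - right
    return out


def sharpen(image, k=0.5):
    """Add a scaled high-pass response to the original and clip to 0..255."""
    response = high_pass(image)
    return [[min(255.0, max(0.0, f + k * d)) for f, d in zip(row_f, row_d)]
            for row_f, row_d in zip(image, response)]
```

On a constant image the high-pass response is zero everywhere, so the output equals the input, which matches the remark about homogeneous regions.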

2) Contrast Enhancement

Two commonly used point processes are multiplication and addition with a constant [6]:

\[ g(x) = \alpha f(x) + \beta \]

The parameters \(\alpha\) and \(\beta\) are often called the gain and bias parameters; sometimes these parameters are said to control contrast and brightness respectively. You can think of f(x) as the source image pixels and g(x) as the output image pixels. Then, more conveniently, we can write the expression as [6]:

\[ g(i,j) = \alpha f(i,j) + \beta \]

where i and j indicate that the pixel is located in the i-th row and j-th column.
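A minimal sketch of this point operation is shown below, assuming the usual linear form in which each output pixel is the gain times the source pixel plus the bias. Rounding and saturation to the 8-bit range 0..255 are added assumptions, as is the function name.

```python
def adjust_contrast_brightness(image, gain, bias):
    """Apply output = gain * input + bias to every pixel, saturating at 0 and 255."""
    return [[min(255, max(0, round(gain * pixel + bias))) for pixel in row]
            for row in image]
```

For example, a gain of 1.5 and a bias of 10 maps the pixel values 0, 100 and 200 to 10, 160 and 255, the last one being clipped.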

3) Deblurring

We used the deconvblind function [6] to deblur an image using the blind deconvolution algorithm [6]. The algorithm maximizes the likelihood that the resulting image, when convolved with the resulting PSF, is an instance of the blurred image, assuming Poisson noise statistics. The blind deconvolution algorithm can be used effectively when no information about the distortion (blurring and noise) is known. The deconvblind function restores the image and the PSF simultaneously, using an iterative process similar to the accelerated, damped Lucy-Richardson algorithm [6]. The deconvblind function, just like the deconvlucy function, implements several adaptations to the original Lucy-Richardson maximum likelihood algorithm, which address complex image restoration tasks. Using these adaptations, you can:

• Reduce the effect of noise on the restoration
• Account for nonuniform image quality (e.g., bad pixels)
• Handle camera read-out noise
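To give a feel for the iterative process, the sketch below implements the basic Richardson-Lucy update on a 1-D signal with a known PSF. This is deliberately a simpler technique than the one used in the paper: it is not blind (the PSF is given rather than estimated), it has no acceleration or damping, and it is not the deconvblind implementation. All names are hypothetical.

```python
def convolve_same(signal, kernel):
    """Same-size convolution with zero padding; the kernel length must be odd."""
    n, k = len(signal), len(kernel)
    half = k // 2
    out = [0.0] * n
    for i in range(n):
        for j in range(k):
            index = i + half - j
            if 0 <= index < n:
                out[i] += signal[index] * kernel[j]
    return out


def richardson_lucy(blurred, psf, iterations=10, eps=1e-12):
    """Basic Richardson-Lucy iterations: estimate *= (blurred / reblurred) * flipped PSF."""
    estimate = [max(v, eps) for v in blurred]  # start from the observation
    flipped = psf[::-1]
    for _ in range(iterations):
        reblurred = convolve_same(estimate, psf)
        ratio = [b / max(r, eps) for b, r in zip(blurred, reblurred)]
        correction = convolve_same(ratio, flipped)
        estimate = [e * c for e, c in zip(estimate, correction)]
    return estimate
```

Each iteration re-blurs the current estimate, compares it with the observation, and multiplies the estimate by the back-projected ratio, so the estimate stays non-negative throughout.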

D. Video Summarization

1. After reading the image from the current folder, convert the image to a movie frame using the 'addframe' function [5].

2. Write the frame to the file, read the next image, and repeat the process of converting and writing until the last image has been processed and written. After loading the first image, convert it to a bitmap (.bmp), then create a new AVI file, add a new video stream and one frame to the new file, and dispose of the bitmap frame [5]; continue for all the frames.

IV. TEST CASES

A. Case 1

Technique | Extracted Key Frames
Block Based Histogram | 135
DWT Wavelet | 162
Human Motion Detection | 24

B. Case 2

Original video length: 106 seconds

Total no. of frames: 2756

Technique | Extracted Key Frames
Block Based Histogram | 292
DWT Wavelet | 327
Human Motion Detection | 51
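The Case 2 figures can be turned into the share of frames retained by each technique with a short calculation. The helper name is hypothetical; the numbers are the ones reported above (2756 frames in total, 106 seconds of video).

```python
def retained_percent(key_frames, total_frames):
    """Share of the original frames that survive as key frames, in percent."""
    return 100.0 * key_frames / total_frames


case2_total = 2756
case2_counts = {
    "Block Based Histogram": 292,
    "DWT Wavelet": 327,
    "Human Motion Detection": 51,
}
case2_share = {name: retained_percent(count, case2_total)
               for name, count in case2_counts.items()}
frames_per_second = case2_total / 106  # average frame rate of the Case 2 video
```

With these inputs the block based histogram and DWT wavelet techniques keep roughly 11 to 12 percent of the frames, while human motion detection keeps just under 2 percent.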

V. EXPERIMENTAL RESULT

VI. CONCLUSION

The proposed system performs video summarization and restoration. Video summarization is implemented through key frame extraction, which extracts the key frames from the input video. We have presented key frame extraction algorithms using DWT wavelet statistics, Block based Histogram Difference with Edge Matching Rate, and Human Motion Detection. The objective of the DWT wavelet method is to generate key frames for video summarization faster than other existing methods. The method based on the image histogram difference and edge matching rate avoids shot segmentation and has good adaptability. However, future work on the DWT wavelet technique needs pattern extraction and feature recognition for better performance. Block based Histogram Difference is based on the block based histogram difference and edge matching rate. First, the histogram difference of every frame is calculated, and then the edges of the candidate key frames are extracted by the Prewitt operator. Histogram plots are used to better understand how frequently or infrequently certain values occur in a given set of data. A limitation of the present work lies in the tool used, which has certain memory-related constraints. The Human Motion Detection technique focuses mainly on motion-related problems in which detecting motion alone is not sufficient: further steps are taken to obtain the object that causes the motion, and a recognition engine, rather than simple image matching, is used to classify whether the object is human or not. This technique is therefore not fixed to the first application area discussed; it can also be applied in control or analysis areas where human motion must be distinguished from other objects that cause motion. Once we have the key frames, we restore them by sharpening, deblurring and contrast enhancement, and then merge them to obtain the summarized video.

VII. FUTURE SCOPE

Video summarization using the Motion Detection System can be used in surveillance and security systems in a fixed restricted area. Object detection is necessary for surveillance applications, guidance of autonomous vehicles, efficient video compression, smart tracking of moving objects, automatic target recognition (ATR) systems and many other applications. In all of these applications the proposed technique can provide faster results. Only some enhancements or additions are needed if the algorithms discussed are to be ported to other applications, such as those in control application areas. For example, to provide inputs to virtual reality or augmented reality applications, we need extra recognition of the human motion returned by the algorithms. We could exploit a multi-view approach and adopt an improved model based on localized parts of the image. The technique is limited to human object identification and can be extended to include object recognition or even face detection. For a computer vision system, the ability to cope with moving and changing objects, changing illumination and changing viewpoints is essential for several tasks. Hence the technique should be able to handle low-contrast videos taken in low lighting conditions. Further enhancements can be made to deal with automated systems for three-dimensional point positioning.

ACKNOWLEDGMENT

us during our work in this paper. Also not to forget those who have helped us in our findings and development of the prototype system. We hope that this paper can be as informational as possible to you.

REFERENCES

[1] Hou Beiping and Zhu Wen, "Fast Human Detection Using Motion Detection and Histogram of Oriented Gradients," Journal of Computers, vol. 6, no. 8, August 2011.

[2] Ashish Kumar Sahu and Abha Choubey, "Motion Detection Surveillance System Using Background Subtraction Algorithm," International Journal of Advance Research in Computer Science and Management Studies, vol. 1, issue 6, November 2013.

[3] Khin Thandar Tint and Dr. Kyi Soe, "Key Frame Extraction for Video Summarization Using DWT Wavelet Statistics," International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), vol. 2, no. 5, May 2013.

[4] Kintu Patel, "Key Frame Extraction Based on Block based Histogram Difference and Edge Matching Rate," Oriental Institute of Science & Technology, Bhopal, MP, India, International Journal of Scientific Engineering and Technology.

[5] https://www.pantechsolutions.net/blog/matlab-code-to-convert-video-to-sequence-of-frames/