3.5 Experiment
3.5.6 Analysis
In this analysis, we show how the hybrid method supports frame-level filtering. Figure 3.8 summarizes two major benefits. The first benefit is a compressed search space. Without the unique object table, frame-level filtering has to scan the occurrence table in a brute-force manner when given a set of visual objects. Figure 3.8(a) shows the large difference between scanning the whole occurrence table and scanning the unique object table: without the unique object table, the filtering process spends much more time scanning the visual objects. The second benefit is enriched object occurrences. Frame-level filtering with visual objects should provide
FIGURE 3.8: The benefits from unique object table and occurrence table.
precise temporal information. If most of the object occurrences are missing, the temporal information is no longer precise. Figure 3.9(a) shows how many occurrences are recovered by the proposed method. The hybrid method significantly increases the number of object occurrences compared with using detection only, which indicates that frame-level filtering obtains more precise temporal information.
FIGURE 3.9: Share comparison. (a) occurrences; (b) pair-wise co-occurrences.
The above analysis shows that the number of object occurrences increases after propagation. The result is promising, but it raises another question: does the increase in occurrences also lead to an increase in co-occurrences? Co-occurrences between objects are often leveraged in more complex frame-level filtering when a set of objects is given. If the number of co-occurrences
does not increase significantly, the recovered occurrences cannot support complex frame-level filtering well. To answer this question, we count the number of pair-wise co-occurrences between objects that appear on the same video frames and display the share comparisons in Figure 3.9: Figure 3.9(a) shows the shares of detection and propagation in the occurrences, and Figure 3.9(b) shows the shares of detection and propagation in the co-occurrences. The results show that the number of co-occurrences increases significantly as well, which indicates that complex frame-level filtering can be served well by the proposed hybrid method.
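The counting step described above can be sketched as follows. This is only an illustration under stated assumptions: the occurrence table is taken to be a list of `(frame_id, object_id)` rows, the hybrid table is assumed to contain every detected row plus the propagated ones, and the function names are hypothetical rather than taken from the thesis.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(occurrences):
    """Count pair-wise co-occurrences of objects that share a frame.

    `occurrences` is an iterable of (frame_id, object_id) rows; this row
    layout is an assumption made for the sketch.
    """
    objects_by_frame = {}
    for frame_id, object_id in occurrences:
        objects_by_frame.setdefault(frame_id, set()).add(object_id)

    pairs = Counter()
    for objects in objects_by_frame.values():
        # Every unordered pair of distinct objects on a frame counts once.
        for a, b in combinations(sorted(objects), 2):
            pairs[(a, b)] += 1
    return pairs


def detection_share(detected_rows, hybrid_rows):
    """Return detection's share of occurrences and of co-occurrences."""
    occ_share = len(set(detected_rows)) / len(set(hybrid_rows))
    detected_pairs = sum(count_cooccurrences(detected_rows).values())
    hybrid_pairs = sum(count_cooccurrences(hybrid_rows).values())
    co_share = detected_pairs / hybrid_pairs if hybrid_pairs else 0.0
    return occ_share, co_share


# Toy data (made up): rows found by detection and rows added by propagation.
detected = [(1, "a"), (1, "b"), (2, "a"), (3, "c")]
propagated = [(2, "b"), (3, "a"), (3, "b")]
occ_share, co_share = detection_share(detected, detected + propagated)
```

In this toy case the remaining share, `1 - occ_share` or `1 - co_share`, is the part contributed by propagation, which is the quantity the two panels of Figure 3.9 compare.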
3.6 Summary
In this chapter, we study how to support frame-level filtering with detected visual objects. The key task is to generate the unique object table and the occurrence table accurately and efficiently.
Based on our literature review, the visual objects used in previous methods are manually labeled. The process is top-down and relies heavily on human labor, which makes previous methods inapplicable to dynamic or large-scale datasets. To address these problems, we propose to use detected visual objects instead. However, object detection alone cannot identify and connect objects the way a human annotator can.
Several auxiliary methods exist, but all of them have notable drawbacks. Accordingly, we propose a hybrid method consisting of local merge, propagation and global merge to better serve frame-level filtering. The experiments show that the unique object table and occurrence table generated by the proposed method are better than those generated by the existing methods. Our further analysis shows that the unique object table and occurrence table generated by the proposed method support more accurate and efficient frame-level filtering.
Chapter 4
Video-level Filtering Using Small Non-textual Content Set
4.1 Introduction
Unlike frame-level filtering, which decomposes the video into frames, video-level filtering treats the video as a whole during the process. The most widely used application is keyword-based video filtering, where each video in the database is associated with text collected from web users or video producers. When a set of keywords is given, the videos whose text is irrelevant to the keywords are filtered out. Keyword-based filtering is inapplicable when the text is sparse or meaningless.
Accordingly, non-textual content-based filtering was introduced and has been widely applied in video classification [52], event detection [59, 4] and so on.
Non-textual content-based filtering starts with user-specified videos, which are regarded as positive exemplars in the learning process. The system then extracts non-textual contents and vectorizes them to obtain content vectors. After the vectors are generated, a classification model is trained using the content vectors of some background videos as negatives. The classification model is then used to predict scores for all the videos, and the top-k videos with the highest scores are selected as the result. Different content types usually produce different prediction scores and therefore dissimilar rankings, so a fusion process is applied when the filtering process uses more than one content type. Recent systems [60, 4, 123] exploit as many content types as possible. In other words, all of them try to exploit a rich content set to perform video-level filtering.
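The pipeline described above (train one model per content type, score every video, fuse the scores, keep the top-k) can be sketched as below. This is a minimal illustration, not the implementation of any cited system: a small logistic regression stands in for the unspecified classification model, fusion is assumed to be a plain average of per-content-type scores, and the content-type names and toy vectors are hypothetical.

```python
import math


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(positives, negatives, epochs=200, lr=0.5):
    """Fit a tiny logistic regression by batch gradient descent.

    Stand-in for the classification model; the source does not name one.
    """
    data = [(x, 1.0) for x in positives] + [(x, 0.0) for x in negatives]
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in data:
            p = _sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            for i in range(dim):
                grad_w[i] += (p - y) * x[i]
            grad_b += p - y
        w = [wi - lr * gi / len(data) for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b / len(data)
    return w, b


def score(model, x):
    """Predicted probability that vector x belongs to the positive class."""
    w, b = model
    return _sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def filter_top_k(database, positives, negatives, k):
    """Rank videos by the average of their per-content-type scores.

    database:  {video_id: {content_type: vector}}
    positives: {content_type: [vectors of user-specified videos]}
    negatives: {content_type: [vectors of background videos]}
    """
    models = {t: train_logistic(positives[t], negatives[t]) for t in positives}
    fused = {
        vid: sum(score(models[t], vecs[t]) for t in models) / len(models)
        for vid, vecs in database.items()
    }
    return sorted(fused, key=fused.get, reverse=True)[:k]


# Toy 2-d content vectors for two hypothetical content types.
positives = {"motion": [[1.0, 0.0], [0.9, 0.1]], "appearance": [[0.0, 1.0], [0.1, 0.9]]}
negatives = {"motion": [[0.0, 1.0], [0.1, 0.9]], "appearance": [[1.0, 0.0], [0.9, 0.1]]}
database = {
    "v1": {"motion": [1.0, 0.0], "appearance": [0.0, 1.0]},
    "v2": {"motion": [0.0, 1.0], "appearance": [1.0, 0.0]},
    "v3": {"motion": [0.8, 0.2], "appearance": [0.5, 0.5]},
}
top2 = filter_top_k(database, positives, negatives, k=2)
```

Averaging is only one possible fusion rule; a weighted combination or rank-based fusion would slot into `filter_top_k` in the same place.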
In some specific areas such as surveillance, the rich content set is not applicable for video-level
TABLE 4.1: Differences between normal and surveillance videos
                      Normal    Surveillance
untrimmed             ×         ✓
muted                 ×         ✓
scene independent     ×         ✓
noise from crowd      little    much
filtering. In Table 4.1, we list the major differences between normal and surveillance videos, which prevent video-level filtering from exploiting the rich content set [59, 4, 52]:
• Surveillance videos are untrimmed: In previous works [59, 4, 52], the input videos are trimmed, so the non-textual contents of a video can be used entirely as positive or entirely as negative. In contrast, surveillance videos are untrimmed: a video may contain positive and negative contents at the same time, which damages the discriminative ability of the content vectors;
• Surveillance videos are muted: Audio is an important content source for video-level filtering [123]. However, surveillance videos are usually muted, so audio contents are unavailable and many audio content vectors cannot be used for video-level filtering;
• Surveillance videos are scene independent: Scene contents are often exploited for video classification [94, 116]. They are useful because many events correlate with the scene, such as playing football with a playground, or swimming with a swimming pool. However, in surveillance videos, the scene is always the same under the same camera. Therefore, the scene content vectors are useless for video-level surveillance filtering;
• Surveillance videos have noise from crowd: Surveillance videos record the daily activities under particular cameras. If the filtering process tries to retain video clips that correlate with specific individual activities, it is inevitably interfered with by the noise from the crowd.
One application of video-level filtering is surveillance event detection (SED), which aims to raise an alarm when predefined events occur in surveillance videos. However, the differences in
Table 4.1 make the filtering process difficult on surveillance videos, because many content types exploited by previous works are inapplicable. This leaves motion contents as the only choice for video-level surveillance filtering. The state-of-the-art system proposed in [60] uses two motion contents: spatial temporal interest points (STIP) [62] and motion SIFT (MoSIFT) [12].
These motion contents use sparse sampling to extract interest points from the frames, and calculate the optical flow between temporally adjacent points to describe motions. They are ineffective when the motions in the SED videos are complex. To address this problem, we introduce a new content set consisting of dense trajectory (DT) [107] and improved dense trajectory (IDT) [109] to improve the accuracy of SED. Our internal experiments show that the new content set significantly improves the accuracy, and this finding helped us win the TRECVID SED competition in 2015.
In summary, we make the following contributions in this work:
• By analyzing the characteristics of surveillance videos and uncovering the mechanism of recent motion features, we push the accuracy of video-level surveillance video filtering to a new level by exploiting a new content set consisting of improved dense trajectory (IDT) and dense trajectory (DT). The new content set outperforms all the previous content sets on the TRECVID SED competitions of the recent five years.
• We conduct extensive experiments and show how different settings influence the accuracy of video-level surveillance video filtering, which provides a performance benchmark for future work.
4.2 Problem Statement
4.2.1 Preliminaries
The basic elements of a video-level surveillance filtering system are videos and annotations. The videos are untrimmed, so they cannot be used directly as negatives or positives. Additionally, they are usually captured from several cameras, so the scenes are not helpful for the filtering. The annotations are grouped by event and contain the event intervals in the videos. The negatives and positives can be parsed from the annotations. In real-world surveillance, the number of annotations is usually small. To perform video-level surveillance filtering, the users need to provide annotations of the events on the training videos. The system then learns the prediction models, filters out the irrelevant parts of the test videos and returns the high-confidence parts.
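As an illustration of how positives and negatives could be parsed from the annotations of an untrimmed video, the sketch below splits a video into labelled segments. It rests on assumptions not stated in the text: intervals are half-open `(start, end)` pairs in frame units, overlapping annotations of the same event are merged, and everything outside the annotated intervals is treated as negative. The function name is hypothetical.

```python
def split_segments(video_length, event_intervals):
    """Split [0, video_length) into positive and negative segments.

    event_intervals: annotated (start, end) pairs for one event, assumed
    half-open and expressed in frames.
    """
    merged = []
    for start, end in sorted(event_intervals):
        # Clip to the video and drop empty intervals.
        start, end = max(0, start), min(video_length, end)
        if start >= end:
            continue
        if merged and start <= merged[-1][1]:
            # Overlapping or touching annotations become one positive segment.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))

    negatives, cursor = [], 0
    for start, end in merged:
        if cursor < start:
            negatives.append((cursor, start))
        cursor = end
    if cursor < video_length:
        negatives.append((cursor, video_length))
    return merged, negatives


# Toy annotation (made up): three intervals on a 100-frame video.
positives, negatives = split_segments(100, [(10, 20), (15, 30), (60, 70)])
```

Content vectors would then be extracted per segment rather than per video, so that one untrimmed video can contribute both positive and negative training examples.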