Hierarchical Decision Making Scheme for Sports Video Categorisation with Temporal Post-Processing

Edward Jaser, Josef Kittler and William Christmas

Centre for Vision, Speech and Signal Processing

University of Surrey, Guildford GU2 7XH, UK

{E.Jaser, J.Kittler, W.Christmas}@eim.surrey.ac.uk

Abstract

The problem of automatic sports video classification is considered. We develop a multistage decision making system that is founded on the concept of cues, i.e. pieces of visual evidence, characteristic of certain categories of sports, that are extracted from key frames. The main decision making mechanism is a decision tree which generates hypotheses concerning the semantics of the sports video content. The final stage of the decision making process is a Hidden Markov Model system which bridges the gap between the semantic content categorisation defined by the user and the actual visual content categories. The latter is often ambiguous, as the same visual content may be attributed to different sport categories, depending on the context. We demonstrate experimentally that the contextual post-processing of the decision tree outputs by HMMs significantly improves the performance of the sports video classification system.

1. Introduction

The generation of digital multimedia content continues to witness phenomenal growth. In the particular domain of sport, many events take place every day, and an overwhelming amount of sports video material is being recorded and stored. Ideally, and to ensure usability, all this sports material should be annotated, and the meta-data generated from it should be stored in a database along with the video data. This would allow the retrieval of any important event at a later date. Such a system has many uses, such as in the production of television sport programmes and documentaries.

Due to the large amount of material being generated, manual annotation is both impractical and very expensive.

In this paper we consider the problem of automatic sports video categorisation. This problem arises during multidisciplinary events such as the Olympic Games, where a huge volume of video material is recorded, with the content randomly switching from one discipline to another. A coarse automatic annotation in terms of sport identity would aid the production of event summaries for newscasts and other applications.

Much research in the field of multimedia analysis and retrieval is targeting the domain of sport videos. The reason is that most sport videos have a well-defined content structure and official rules and procedures, compared with videos from other domains. A sport can be defined as a set of one or more fundamental semantic events. The event life cycle is characterised by a starting stage, an action and a terminal stage. The action stage can be skipped depending on the status of the starting stage. The play is usually suspended at the end of each event. The repetition of these events in some order defines higher-level events and forms the structure of the sport. Moreover, most sporting events take place in one location. That means only a limited number of cameras, most at fixed positions, are needed to cover the play area and capture the event.

Figure 1: Sport views (global view, crowd, zoom-in and close-up views for swimming and hockey)

The camera that best captures the event taking place at a certain time is selected for broadcasting. Therefore, a set of characteristic views recorded by the cameras can be defined and associated with the events. Figure 1 gives an example of some characteristic views that exist in two sport disciplines, swimming and hockey. Between the end of one event and the start of the following one, while play is suspended, other events that are either related to the sport (replay, close-ups) or have nothing to do with it (crowd, commercials) are broadcast.

Figure 2: Structure of sport video. (Diagram labels: play event with start, play and end stages; break with close-up, replay and crowd.)

Tennis, for example, has one basic event that represents a point. The event starts with a serve, followed by the ball being hit back and forth over the net until a winner is scored or an unforced/forced error is committed. There are the cases of “ace” and “double-fault” in which the event terminates with no need for ball exchange. Between points, play is suspended and a replay of the last point played, players’ close-ups, crowd events and/or adverts are shown to the viewers. Higher level concepts in tennis (e.g. game and set) can be defined in terms of the point event. Figure 2 summarises the process of capturing a sport.

Other work, specific to some form of sports annotation, includes [11], in which the authors addressed the problem of segmenting soccer videos into two basic semantic units: “play” and “break”. This was done by processing a sequence generated from classifying frames into three predefined views according to the video shooting scale using a domain-specific feature. Chang et al. [1] proposed a statistical method aiming at the automatic extraction of predefined highlight segments in a baseball game video using a Hidden Markov Model (HMM) built for each class of highlight. HMMs were also used by [3] for tennis scene classification and segmentation. An HMM was used to fuse audio and visual information. They also used HMMs to model tennis syntax and the hierarchical structure of a tennis match.

In this paper we propose a multistage decision making system that is founded on the concept of visual cues, i.e. pieces of visual evidence, characteristic of certain categories of sports, that are extracted from key frames. The main decision-making mechanism is a decision tree which generates hypotheses concerning the semantics of the sports video content. The final stage of the decision making process is an HMM system which bridges the gap between the semantic content categorisation defined by the user and the actual visual content categories. The latter is often ambiguous, as the same visual content may be attributed to different sport categories, depending on the context.

The paper is organised as follows. In Section 2 we give an overview of the system. In Section 3 we briefly describe the cue concept and the cue detector methods used to search for the cues deemed indicative of sport types. Section 4 describes in detail the decision tree classifier used for generating sports video content hypotheses, and the post-processing of the decision tree outputs using HMMs. The results of experiments designed to demonstrate the system performance are presented in Section 5. The paper is concluded in Section 6.

2. System Overview

In this section we give an overview of the system (see Figure 3) and describe its various elements. Given a video stream that contains sports material from one or more disciplines, our goal in this paper is to automatically segment the stream into sequences and label each sequence with the corresponding sport label. First, the video stream is segmented into shots, which are the basic temporal units in our system. For each shot, a number of key frames are extracted. The first stage of the decision making process is cue detection. Cue detectors operate on the key frames and generate judgements about the presence or absence of the objects they try to detect. After this stage the shot is represented in what we call the cues format or representation. This is distinct from the conventional approaches, which are based on low level generic image features derived from colour and texture. Cues offer a higher level representation which is application domain specific. Most importantly, they transform diverse input data structures into a standard form which facilitates the decision making process and promotes modularity (i.e. exploiting additional cues).

The second stage aims at classifying each shot to one of the characteristic views, defined for each sport, using the information provided by the cue detectors. The functionality of this stage is realised by a decision tree classifier. The knowledge embodied in the decision tree is learnt from a set of labeled training samples covering all views the system is trying to detect. The decision tree is then used to classify each shot into one of the sport view categories.

The output of the decision tree may be subject to error due to errors in the cue extraction or genuine ambiguity, i.e. the presence of cues that are characteristic of more than one discipline (e.g. crowd views). The third stage is designed to minimise this error by exploiting the temporal context using HMMs. HMMs, which process the sequence generated by the decision tree, bridge the gap between the semantic video content labelling by human observer and the data-driven hypotheses generated by automatic classification methods. The individual stages of the system will be described in more detail.

Figure 3: Proposed System (video stream, shot detector, cue detectors, decision tree, HMM models λ1, λ2, ..., λX with post-processing, and the resulting label)

3. Cue detectors

The objective of the automatic annotation of video material is to provide indices that describe video content as usefully as possible. In much of the previous work in this area, the annotation consisted of the output of various feature detectors (e.g. MPEG7 descriptors). By itself, this information bears no semantic connection to the actual scene content — it is simply the output of some image processing algorithm.

In our approach we are taking the process one stage further. By means of a set of training processes, we attempt to generate an association between low-level image data outputs and the semantics of the scene content. Thus for example we might train the system to associate the output of a texture feature detector with crowds of people in the scene. We can then use this mechanism to generate confidence values for the presence of a crowd in a scene, based on the scene texture. We denote the output of this process as a “cue”.

These cues can then be combined to generate higher-level information, e.g. the type of sport being played. [4] contains a complete description of what cues are and how to generate them.

Figure 4: Creating Cue Evidence (off-line processes for classifier training and histogram generation, and a run-time process applied to new data)

Figure 4 illustrates the cue generation process, which involves three phases. Different cue detection methods have been developed. Each method can be used to form a number of different cue-detectors provided that suitable training data is available. These methods are:

neural network ([7]): Each cue-detector is a neural network trained on colour and texture descriptors computed at a pixel level on image regions containing the cue of interest and on negative examples, i.e. on image regions which are known not to contain the cue.

multimodal neighbourhood signature ([6]): In the Multimodal Neighbourhood Signature approach, object colour structure is represented by a set of invariant features computed from image neighbourhoods with a multimodal colour probability density function. The method is image-based – the representation is computed from a set of examples of the cues of interest.

texture codes ([5]): In the training stage, image regions representing each cue of interest are selected from the key-frames. Several examples are needed for each cue to account for appearance variations. Textural descriptors are extracted from these regions using a texture analysis module based on a bank of Gabor filters. The outputs of the Gabor filters are coded and the texture codes together with the statistics of their occurrence form a model for the cue.

4. High-level classification

In this section we describe the process of classifying shots into a predefined set of views, as well as the method developed to group consecutive shots into one scene. The approach relies on using a decision tree classifier to achieve the initial sport views classification. Hidden Markov Models (HMMs) then process the output of the decision tree to group shots. First, the video sequence is automatically segmented into shots. For each shot, a number of key frame images are extracted. Let us suppose that we have a set of trained cue-detectors. Each cue detector operates on the key frame images and generates two pdf values $p(m \mid C)$ and $p(m \mid \bar{C})$, where $C$ is the cue looked for by the cue detector and $m$ denotes the measurement vector used. Assuming equal prior probabilities, we can estimate the a posteriori probability of an instance of a cue, $C$, existing in the image as follows:

$$P(C \mid m) = \frac{p(m \mid C)}{p(m \mid C) + p(m \mid \bar{C})} \qquad (1)$$

Thus, for each key frame, we will obtain $K$ values, one for each cue. A shot $s$ can then be represented by a vector

$$[c_1, c_2, \dots, c_j, \dots, c_K]$$

where $c_j$ is the mean value of the posterior probabilities computed by the $j$-th cue-detector on the key frames that belong to $s$.

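The posterior estimate under equal priors and the per-shot averaging described above can be sketched in Python as follows. This is a minimal illustration, not the system's implementation: it assumes each cue detector has already returned its pair of pdf values for every key frame, and the function names and numbers are hypothetical.

```python
from statistics import mean

def cue_posterior(p_m_given_cue, p_m_given_not_cue):
    """Posterior P(C|m) under equal priors: p(m|C) / (p(m|C) + p(m|not C))."""
    total = p_m_given_cue + p_m_given_not_cue
    if total == 0.0:
        return 0.5  # assumption: with no evidence either way, stay undecided
    return p_m_given_cue / total

def shot_cue_vector(key_frames):
    """key_frames: one entry per key frame, each a list of
    (p(m|C), p(m|not C)) pairs with one pair per cue.
    Returns the mean posterior of each cue over the key frames."""
    n_cues = len(key_frames[0])
    return [
        mean(cue_posterior(*frame[j]) for frame in key_frames)
        for j in range(n_cues)
    ]

# Hypothetical shot with two key frames and three cues
shot = [
    [(0.8, 0.2), (0.1, 0.9), (0.5, 0.5)],
    [(0.6, 0.4), (0.3, 0.7), (0.5, 0.5)],
]
vector = shot_cue_vector(shot)
```

Averaging over key frames gives every shot a vector of the same length regardless of how many key frames it has, which is what lets a single classifier consume all shots.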
4.1 View Classification

Based on earlier experience [2], we opted for a decision tree learning algorithm to build the model for solving the problem of classifying a shot into one of the predefined sport views. The C4.5 algorithm [9] for building decision trees was adopted. Let $\mathcal{S} = \{s_1, s_2, \dots, s_R\}$ be a set of $R$ shots, and let $V = \{v_1, v_2, \dots, v_M\}$ be a set of $M$ sport views being investigated. The process of constructing a decision tree classifier requires a training set $\mathcal{T} = \{(s_i, v_j) \mid s_i \in \mathcal{S},\ v_j \in V\}$ of shots, each being assigned a label of the sport view it represents, a splitting criterion and stopping rules. The splitting criterion is used to recursively partition the training set in a way that increases the homogeneity of its partitions. The most popular splitting criterion is the information impurity (also called entropy impurity):

$$\mathrm{info}(\mathcal{T}) = -\sum_{i} P(\omega_i) \log_2 P(\omega_i) \qquad (2)$$

where $P(\omega_i)$ is the frequency of patterns in category $\omega_i$. Suppose that $X$ is a test, testing one of the cue values $c_j$, that partitions $\mathcal{T}$ into $\mathcal{T}_1, \mathcal{T}_2, \dots, \mathcal{T}_L$; then the weighted average information impurity over these partitions is computed by

$$\mathrm{info}_X(\mathcal{T}) = \sum_{i=1}^{L} \frac{|\mathcal{T}_i|}{|\mathcal{T}|}\, \mathrm{info}(\mathcal{T}_i) \qquad (3)$$

In order to evaluate the goodness of the partitioning using $X$, the information gain, expressed as the reduction in information impurity obtained if $X$ is applied, i.e.

$$\mathrm{gain}(X) = \mathrm{info}(\mathcal{T}) - \mathrm{info}_X(\mathcal{T}) \qquad (4)$$

is computed. At each node, all the possible tests are investigated and the test with the highest information gain is selected to define the partition. The partitioning stops when one of the stopping rules is triggered at a node. This node becomes a class node, and a label $v_j$ which represents the sport view with the largest number of shots is attached to it.
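The splitting criterion can be illustrated with a short Python sketch. It is a simplified illustration rather than the C4.5 implementation: it assumes a binary test of the form "cue value at most a threshold", and the training shots, view labels and threshold are hypothetical.

```python
from collections import Counter
from math import log2

def info(labels):
    """Information (entropy) impurity of a list of view labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def gain(samples, cue_index, threshold):
    """Information gain of the binary test 'cue value <= threshold'.
    samples: list of (cue_vector, view_label) pairs."""
    labels = [label for _, label in samples]
    left = [label for vec, label in samples if vec[cue_index] <= threshold]
    right = [label for vec, label in samples if vec[cue_index] > threshold]
    # Weighted average impurity over the non-empty partitions
    weighted = sum(
        len(part) / len(labels) * info(part) for part in (left, right) if part
    )
    return info(labels) - weighted

# Hypothetical training shots: (cue vector, view label)
training = [
    ([0.9, 0.1], "swimming_global"),
    ([0.8, 0.3], "swimming_global"),
    ([0.2, 0.7], "hockey_global"),
    ([0.1, 0.9], "hockey_global"),
]
best_gain = gain(training, cue_index=0, threshold=0.5)
```

In this toy set the test on the first cue separates the two views perfectly, so the gain equals the full impurity of the unsplit set (1 bit).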

The classification of a shot $s$ using decision trees proceeds from top to bottom. Depending on its cue values, $s$ navigates through the decision tree till it reaches a class node. The navigation is guided by the rules of the decision nodes visited. The shot $s$ is assigned the label $v_j$, $v_j \in V$, attached to the class node at which its navigation terminates.
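The top-to-bottom navigation can be sketched as a simple loop. The nested-dictionary tree layout, the cue indices, the thresholds and the view labels below are all hypothetical and chosen only to make the traversal concrete.

```python
def classify_shot(tree, cue_vector):
    """Walk from the root to a class node and return its view label."""
    node = tree
    while "label" not in node:  # decision nodes carry a test, class nodes a label
        branch = "low" if cue_vector[node["cue"]] <= node["threshold"] else "high"
        node = node[branch]
    return node["label"]

# Hypothetical two-level tree
tree = {
    "cue": 0, "threshold": 0.5,
    "low": {"label": "crowd"},
    "high": {
        "cue": 1, "threshold": 0.4,
        "low": {"label": "swimming_global"},
        "high": {"label": "hockey_global"},
    },
}
view = classify_shot(tree, [0.9, 0.2])
```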

The accuracy of a class node over the shots in the training set can be used to estimate the reliability of the classification arising from that node. If $n$ is the number of training shots in the subset for a class node and $e$ is the number of training shots that do not have the label of that class node, then $P(v_j \mid s)$ is estimated as $\frac{n - e}{n}$ (central estimate) or $\frac{n - e - \frac{1}{2}}{n}$ (pessimistic estimate) [8].

4.2 Post-processing using HMMs

The HMM (described in detail in [10]) is a powerful tool and a popular technique widely used in pattern recognition. Its capabilities have attracted many researchers working on problems in video analysis and classification. An HMM $\lambda$ is characterised by a finite number of states:

$$\{S_1, S_2, \dots, S_N\} \qquad (5)$$

The HMM is always in one of these states; the state at time $t$ is denoted as $q_t$. Generally the states are interconnected in such a way that any state can be reached from any other state. The transition probability distribution between states is represented as a matrix:

$$A = \{a_{ij}\} \quad \text{where} \quad a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i), \quad 1 \le i, j \le N \qquad (6)$$

Per state, there is a number of distinct observation symbols. In our case, the set of observation symbols is the possible output of the decision tree classifier, i.e. $V = \{v_1, v_2, \dots, v_M\}$. Any state, $S_j$, contains the observation symbol probability distribution:

$$B = \{b_j(v_k)\} \quad \text{where} \quad b_j(v_k) = P(O_t = v_k \mid q_t = S_j), \quad 1 \le j \le N,\ 1 \le k \le M \qquad (7)$$

The initial state distribution is given by:

$$\pi = \{\pi_i\} \quad \text{where} \quad \pi_i = P(q_1 = S_i), \quad 1 \le i \le N \qquad (8)$$

Given the observation sequence $O = O_1 O_2 \cdots O_T$ (where each $O_t$ is one of the symbols in $V$, and $T$ is the sequence length), generated from classifying shots with the decision tree classifier, our aim is to compute the probability of the observation sequence given an HMM $\lambda$, i.e. $P(O \mid \lambda)$. This is achieved using the forward algorithm in which the forward variable $\alpha_t(i)$ is defined as:

$$\alpha_t(i) = P(O_1 O_2 \cdots O_t,\ q_t = S_i \mid \lambda) \qquad (9)$$

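The forward algorithm can be summarised as follows:

Algorithm 1: Forward procedure

1. Initialisation: $\alpha_1(i) = \pi_i\, b_i(O_1), \quad 1 \le i \le N$

2. For $t = 1$ to $T - 1$ compute: $\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \right] b_j(O_{t+1}), \quad 1 \le j \le N$

3. Return: $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$.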
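The three steps of the forward procedure translate almost line for line into Python. The sketch below is a plain, unscaled version for a discrete HMM; the two-state model and its probabilities are hypothetical and serve only as a usage example.

```python
def forward(pi, A, B, observations):
    """Return P(O | lambda) for a discrete HMM.
    pi[i]:   initial probability of state i
    A[i][j]: transition probability from state i to state j
    B[j][k]: probability of emitting symbol k in state j
    observations: list of symbol indices (decision tree view labels)"""
    n_states = len(pi)
    # Step 1: initialisation
    alpha = [pi[i] * B[i][observations[0]] for i in range(n_states)]
    # Step 2: induction over the remaining observations
    for obs in observations[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n_states)) * B[j][obs]
            for j in range(n_states)
        ]
    # Step 3: termination
    return sum(alpha)

# Hypothetical two-state model over two view symbols
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
likelihood = forward(pi, A, B, [0, 1])
```

For long shot sequences the product of probabilities underflows, so a practical version would rescale the forward variables at each step or work with logarithms, as described in [10].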

We construct and train an HMM for each sport we want to detect. We are considering two scenarios. In the first scenario the video sequence represents only one sport event. After extracting the shots and representing them as cue vectors, we engage the decision tree classifier to classify each shot based on its cues. The sequence generated by the above process is classified as that sport for which the HMM generated the highest response, i.e. the highest $P(O \mid \lambda)$.

In the second scenario, the video sequence may contain more than one discipline. We cannot use the previous method since we do not know the boundaries of the subsequences. One solution to this problem is to obtain $L_t = \log[P(O_1 O_2 \cdots O_t \mid \lambda)]$ at each $t$, $1 \le t \le T$. Figure 5(a) shows the output of four HMMs operating on a sequence. Note that the HMM output of a sport exhibits the minimum changes over the subsequence representing that sport. To exploit this, we compute the discrete derivative $\Delta L_t$ of each HMM output (Figure 5(b)) and look for the HMM whose output changes least for all the points in a subsequence. The subsequence is labelled by the identity of this HMM.
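One way to sketch this segmentation idea in Python is shown below. Several details are assumptions rather than the method as implemented: the discrete derivative is taken as the difference between consecutive running log-likelihoods, each shot is labelled independently with the model whose log-likelihood drops least, consecutive shots with the same label are merged into a subsequence, and all emission probabilities are assumed non-zero. The single-state models are hypothetical.

```python
from math import log

def log_likelihood_steps(pi, A, B, observations):
    """Per-shot increments of log P(O_1..O_t | lambda), computed with a
    rescaled forward recursion. Their cumulative sum is the running
    log-likelihood; each increment is its discrete derivative."""
    n = len(pi)
    alpha = [pi[i] * B[i][observations[0]] for i in range(n)]
    steps = []
    for t, obs in enumerate(observations):
        if t > 0:
            alpha = [
                sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs]
                for j in range(n)
            ]
        scale = sum(alpha)
        steps.append(log(scale))            # increment of the log-likelihood
        alpha = [a / scale for a in alpha]  # renormalise to avoid underflow
    return steps

def segment(models, observations):
    """models: dict mapping a sport name to its (pi, A, B) tuple.
    Returns a list of (sport, first_shot, last_shot) subsequences."""
    steps = {name: log_likelihood_steps(*m, observations) for name, m in models.items()}
    # Least change means the increment closest to zero, i.e. the largest one
    labels = [max(steps, key=lambda name: steps[name][t]) for t in range(len(observations))]
    segments = []
    for t, label in enumerate(labels):
        if segments and segments[-1][0] == label:
            segments[-1] = (label, segments[-1][1], t)
        else:
            segments.append((label, t, t))
    return segments

# Hypothetical single-state models over two view symbols
models = {
    "hockey": ([1.0], [[1.0]], [[0.9, 0.1]]),
    "swimming": ([1.0], [[1.0]], [[0.1, 0.9]]),
}
segments = segment(models, [0, 0, 1, 1])
```

A per-shot decision like this is sensitive to isolated misclassified shots; smoothing the derivative over a short window would be a natural refinement, though it is not something the text specifies.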

5. Experimental Results

In this section, we describe the experiments aimed at evaluating the proposed system. The experimental data is composed of video material from the 1992 Barcelona Olympic Games. The material includes four Olympic sport disciplines (hockey, swimming, track events, yachting). The material was manually ground-truthed and split into three sets. One set was reserved for training the decision tree classifier, the second set was used for HMM training and the remaining set was reserved for testing the system. Thirty-seven visual cues were identified, trained and used to generate cue evidence for the study. A decision tree classifier was trained to carry out the classification task and its results were compared with the performance of the proposed system, which incorporates temporal contextual post-processing.

The results of the experiments are summarised in Table 1 and Table 2. The average classification rate achieved by the decision tree classifier was 82.5%. The proposed method shows an encouraging improvement in shot classification. The overall accuracy with temporal post-processing increased to 91.6%.

Actual    H    S    T    Y   Recall  Precision
H        67    6   14    0    77%      82%
S        12  186   13   19    81%      93%
T         2    2   71    0    95%      72%
Y         1    7    1   38    81%      67%

Table 1: Confusion matrix for sports video shot classification using decision tree classifier

As far as the accuracy of classification for individual sports is concerned, we noticed that the hockey and swimming classification rates increased, yachting remained unchanged and track events deteriorated. However, it should be noted that the precision for all the sports increased with the proposed system.

Actual    H    S    T    Y   Recall  Precision
H        80    3    4    0    92%      85%
S         8  218    2    2    95%      94%
T         6    3   66    0    88%      92%
Y         0    9    0   38    81%      95%

Table 2: Confusion matrix for sports shot classification of the proposed System
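The recall and precision columns follow directly from the counts, with rows taken as the ground-truth sport and columns as the assigned sport. A short Python sketch using the Table 2 counts makes the arithmetic explicit; the function name is hypothetical.

```python
def recall_precision(matrix):
    """matrix[i][j]: number of shots of true class i assigned to class j.
    Returns per-class recall, per-class precision and overall accuracy."""
    size = len(matrix)
    recall = [matrix[i][i] / sum(matrix[i]) for i in range(size)]
    precision = [matrix[j][j] / sum(row[j] for row in matrix) for j in range(size)]
    accuracy = sum(matrix[i][i] for i in range(size)) / sum(map(sum, matrix))
    return recall, precision, accuracy

# Counts from Table 2 (rows and columns ordered H, S, T, Y)
table2 = [
    [80, 3, 4, 0],
    [8, 218, 2, 2],
    [6, 3, 66, 0],
    [0, 9, 0, 38],
]
recall, precision, accuracy = recall_precision(table2)
```

Recall divides each diagonal entry by its row total and precision by its column total, which reproduces the percentages listed in the table.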

One advantage of the proposed system is its ability to segment the sequence and label it. This information can be used to perform further analysis on the sequence. Our initial results show that ^j¯0® of the sequences generated by the system were mislabelled. Just over ^±1²0® of the remaining sequences were labelled with the correct sport label, but their boundaries did not exactly correspond to the ground-truthed test data. It was noticed that some of the errors are due to genuine ambiguity in the material rather than to the classification system, e.g. a crowd shot at the end of one sport event followed by another crowd shot at the beginning of another sport discipline.

In summary, the preliminary results obtained are encouraging. In our future research, more experiments are planned to test the system behaviour when the number of competing disciplines increases.

6. Conclusion and Future Work

In this paper, a multi-stage decision making system for solving the problem of sports video classification was proposed.

Figure 5: Output from four HMMs (hockey, swimming, track, yachting) operating on a sequence generated from the decision tree classifier. (a) The output of four competing HMMs: log[P(O|model)] against shots. (b) The processed output of four competing HMMs: discrete derivative of log[P(O|model)] against shots.

The first stage of the decision making process detects application specific cues. The second stage attaches a label, from a set of prototypical views of each sport, to each shot using the information provided by the cue detection stage. The functionality of this stage is realised by a decision tree classifier. The third stage uses HMMs to process the sequence of view labels generated by the decision tree. The output of this stage is a final decision regarding the identity of the sport represented by the sequence, taking advantage of the temporal context.

Our future plans include providing cues that deal with modalities other than the visual one. They are expected not only to improve the sports video categorisation performance but also to help detect highlights such as hockey goals. It is intended to use audio, speech and motion cues for this purpose.

Acknowledgements

This work was supported by the IST-2001-34401 VAMPIRE project funded by the European IST Programme.

References

[1] P. Chang, M. Han, and Y. Gong. Extract Highlights From Baseball Game Video With Hidden Markov Models. In Proceedings of ICIP’2002, 2002.

[2] E. Jaser, J. Kittler, and W. Christmas. Building Classifier Ensembles for Automatic Sports Classification. In MCS, pages 366–374, 2003.

[3] E. Kijak, G. Gravier, P. Gros, L. Oisel, and F. Bimbot. HMM Based Structuring of Tennis Videos Using Visual and Audio Cues. In ICME, pages 309–312, 2003.

[4] J. Kittler, K. Messer, W. Christmas, B. Levienaise-Obadia, and D. Koubaroulis. Generation of Semantic Cues for Sports Video Annotation. In ICIP, pages 26–29, 2001.

[5] B. Levienaise-Obadia, J. Kittler, and W. Christmas. Defining Quantisation Strategies and a Perceptual Similarity Measure for Texture-Based Annotation and Retrieval. In ICPR, 2000.

[6] J. Matas, D. Koubaroulis, and J. Kittler. Colour Image Retrieval and Object Recognition Using the Multimodal Neighbourhood Signature. In ECCV, pages 48–64, 2000.

[7] K. Messer and J. Kittler. A Region-Based Image Database System Using Colour and Texture. Pattern Recognition Letters, pages 1323–1330, 1999.

[8] J. R. Quinlan. Decision Trees as Probabilistic Classifiers. In Fourth International Workshop on Machine Learning. Morgan Kaufmann, 1987.

[9] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[10] Lawrence R. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2):257–286, 1989.

[11] P. Xu, L. Xie, S. Chang, A. Divakaran, A. Vetro, and S. Sun. Algorithms and System for Segmentation and Structure Analysis in Soccer Video. In ICME, 2001.