The two approaches used to detect human actions in this thesis are Histogram of Oriented Gradient of Motion History Images (MHIHOG) and Contour Features. In the next two sections we explain how each approach is implemented to detect human actions. As is the case when training any machine learning system, providing the classifier with features that give a strong generic representation of the class in question is vital to obtaining sufficiently accurate results. Several preprocessing steps are necessary to obtain visual features that will help the classifier in the training and testing stages.
Figure 3.5: Indoor camera view
3.5.1 Histogram of Oriented Gradient of Motion History Image
This approach is based on Motion History/Energy Images, which were discussed in Section 2.2.6 and are commonly used to detect human actions by representing human motion using temporal templates. Our preliminary research into Motion History Images (MHI) concluded that this approach is most effective at representing basic human actions. More complex actions tend to obscure information related to the beginning of the movement, which makes this approach unsuitable for complex human actions such as boxing or running, where there are repetitive limb movements. For example, in rapid actions such as jogging or boxing, the rapid movement of the arm forwards and backwards obscures some of the visual detail captured by later frames. Another challenge related to using traditional MHIs is how best to represent the information from the resulting images. In [45], the authors used Hu Moments [75] to represent basic actions, and these are known to give a good representation in a scale-invariant and view-invariant manner.
To mitigate these drawbacks, we implemented a method based on MHI and Histogram of Oriented Gradients (HOG), called MHIHOG. The method was first described in [77]; however, our approach is a variation of that detailed by the authors and differs in several respects. One significant difference is that we normalise the HOG features so that the features from human subjects of different sizes can be compared. The second difference is that we train an Instance Based Learning classifier on our features to determine how accurately the HOG features can be classified. Our approach also obtains a HOG of the foreground MHI, unlike the method detailed in [77]. The overview diagram of our proposed method is shown in Figure 3.6.
To generate a motion history image, our approach processes four images per second. We investigated different sampling frequencies, and our tests showed that, owing to the diversity of the actions in the dataset, a low sampling rate gives better classification results. The method has two steps. The first creates an MHI, which represents the action, using frame differencing. The second computes the HOG of the MHI to extract features. As in [77], we then train a classifier to learn the actions using the HOG features of the MHI.
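The two steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this thesis: the function names, the difference threshold `diff_thresh`, the decay of one unit per frame and the choice of `tau` are all hypothetical, and the histogram is a single-cell simplification of HOG (no cells or overlapping blocks).

```python
import math

def update_mhi(mhi, prev, curr, tau, diff_thresh=30):
    """One MHI update using frame differencing.

    Pixels whose grey level changed by more than diff_thresh (assumed value)
    are set to tau; every other pixel decays by 1 towards 0.
    """
    return [
        [tau if abs(c - p) > diff_thresh else max(0, m - 1)
         for m, p, c in zip(mrow, prow, crow)]
        for mrow, prow, crow in zip(mhi, prev, curr)
    ]

def motion_history_image(frames, diff_thresh=30):
    """Build an MHI from a list of greyscale frames (2-D lists of ints)."""
    tau = len(frames) - 1  # assumption: history spans the whole clip
    height, width = len(frames[0]), len(frames[0][0])
    mhi = [[0] * width for _ in range(height)]
    for prev, curr in zip(frames, frames[1:]):
        mhi = update_mhi(mhi, prev, curr, tau, diff_thresh)
    return mhi

def orientation_histogram(img, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations.

    Simplified, single-cell stand-in for HOG: central-difference gradients
    on interior pixels, angles folded into [0, 180) degrees, L2-normalised.
    """
    height, width = len(img), len(img[0])
    hist = [0.0] * bins
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = img[y][x + 1] - img[y][x - 1]
            gy = img[y + 1][x] - img[y - 1][x]
            mag = math.hypot(gx, gy)
            if mag == 0:
                continue
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            idx = min(int(angle / (180.0 / bins)), bins - 1)
            hist[idx] += mag
    norm = math.sqrt(sum(v * v for v in hist))
    return [v / norm for v in hist] if norm > 0 else hist
```

Calling `orientation_histogram(motion_history_image(frames))` on frames sampled at four images per second would give one feature vector per action clip. More recent motion leaves higher values in the MHI, so the gradient directions encode the direction of movement.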
3.5.2 Contour Features
Contour features are commonly used to identify basic human actions from a 2-D scene [35] [56]. The concept of contour features is to connect the centre to the peripheral extremities of a human contour, as illustrated in Figure 3.7. After segmenting each action into four-second video clips using frame differencing, we split each clip into 120 frames in total (30 fps). For each frame, we extract the human subject using foreground extraction, and the resulting human silhouette is then sliced into pie segments (16 in this case). The distance from the centroid of the human foreground to the furthest foreground pixel along each pie line is then calculated in a clockwise or counter-clockwise order. The end result is that for each image we obtain 16 features which give a good representation of the human posture. Contour features are calculated for each image in the action sequence.
Figure 3.6: Overview of the process of extracting Histogram of Oriented Gradient of Motion History Images
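The contour feature extraction for a single frame can be sketched as below. This is an illustrative sketch, assuming the silhouette is available as a binary mask; the function name `contour_features` is hypothetical, and assigning each foreground pixel to one of the 16 angular sectors is an assumption about how "along the pie line" is interpreted.

```python
import math

def contour_features(mask, n_sectors=16):
    """Return n_sectors distances describing a binary silhouette.

    mask is a 2-D list of 0/1 values (1 = foreground). Each feature is the
    distance from the foreground centroid to the furthest foreground pixel
    falling inside that angular sector.
    """
    pixels = [(x, y) for y, row in enumerate(mask)
              for x, value in enumerate(row) if value]
    if not pixels:
        return [0.0] * n_sectors

    # Centroid of the human foreground.
    cx = sum(x for x, _ in pixels) / len(pixels)
    cy = sum(y for _, y in pixels) / len(pixels)

    sector_width = 2 * math.pi / n_sectors
    features = [0.0] * n_sectors
    for x, y in pixels:
        dx, dy = x - cx, y - cy
        dist = math.hypot(dx, dy)
        if dist == 0:
            continue  # the centroid pixel itself has no direction
        # Image rows grow downwards, so increasing angle runs clockwise
        # on screen; sectors are therefore visited in a fixed order.
        angle = math.atan2(dy, dx) % (2 * math.pi)
        idx = min(int(angle / sector_width), n_sectors - 1)
        features[idx] = max(features[idx], dist)
    return features
```

Applying this to each of the 120 frames of a clip gives 120 vectors of 16 values. A sector that contains no foreground pixel is reported as 0.0 in this sketch.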
3.5.3 Action Classification
Using an Instance Based Learning classifier, a model can be generated for each of the actions that will be recognised. Instance Based Learning classifiers have proved suitable for our needs, and the time taken to generate classification models is only a few seconds, so we do not apply any linear or nonlinear dimension reduction techniques. In both feature recognition approaches we normalise the features to compensate for height differences between human subjects. The normalisation step scales the features so that the variance is equal to 1. Normalising the features makes the classifier more accurate at finding similarities between the actions performed by people of different sizes. The human subjects wear clothing of different colours, but this does not cause a problem for foreground extraction, in contrast to optical flow, which can sometimes be adversely affected by different colours of clothing.
Figure 3.7: Extracting contour features from the human silhouette
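The effect of the normalisation step can be illustrated with a short sketch. It assumes that each feature vector is scaled on its own by its population standard deviation, without mean centring; the text above only states that the resulting variance is 1, so these details and the function name `scale_to_unit_variance` are assumptions.

```python
import statistics

def scale_to_unit_variance(features):
    """Scale one feature vector so that its variance equals 1.

    Assumption: per-vector scaling by the population standard deviation.
    A constant vector (zero variance) is returned unchanged.
    """
    std = statistics.pstdev(features)
    if std == 0:
        return list(features)
    return [value / std for value in features]

# Two hypothetical subjects in the same posture; the second is twice as
# tall, so every distance feature is twice as large.
short_subject = [1.0, 2.0, 3.0, 4.0]
tall_subject = [2.0, 4.0, 6.0, 8.0]

normalised_short = scale_to_unit_variance(short_subject)
normalised_tall = scale_to_unit_variance(tall_subject)
```

Under this assumption the two normalised vectors coincide, because a uniform change of scale multiplies the standard deviation by the same factor as the features themselves.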
Once the training model has been learned using the method in Figure 3.9(a), an unknown input action is tested against it. Figure 3.9(b) shows the workflow used to test each input action. The classifier then predicts which action from the training set is most similar to the unknown input action.
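The prediction step can be sketched with the simplest instance-based learner, a 1-nearest-neighbour classifier using Euclidean distance. The specific Instance Based Learning algorithm and distance measure are not stated here, so both are assumptions, and the function name and action labels below are hypothetical placeholders.

```python
import math

def predict_action(training_set, query):
    """Return the label of the stored instance closest to the query.

    training_set is a list of (feature_vector, label) pairs; the "model"
    of an instance-based learner is simply these stored instances.
    """
    best_label, best_distance = None, math.inf
    for features, label in training_set:
        distance = math.dist(features, query)  # Euclidean distance
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label

# Hypothetical two-dimensional feature vectors, for illustration only.
training_set = [
    ([0.0, 0.0], "action_a"),
    ([0.5, 0.2], "action_a"),
    ([10.0, 10.0], "action_b"),
]
predicted = predict_action(training_set, [9.0, 8.0])
```

Because nothing is fitted beyond storing the training instances, building such a model is fast, which is consistent with the short model generation times reported above.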