The appearance feature set performed well against that of Marini et al. on all four standard classifiers. All datasets and methods performed better with the RF classifier than with the other classifiers, and the Wilcoxon signed-rank test showed that the difference in its correct classification rate relative to the other classifiers is statistically significant. On the seven-class dataset, the proposed appearance features achieved classification rates 6-13% higher than Marini et al.’s; on the thirteen-class dataset, they were 6-14% higher. Finally, on the Caltech-UCSD Birds-200-2011 dataset, the proposed method again outperformed Marini et al.’s, by approximately 5-12% using 2 species, 12-16% using 5, 13-17% using 17 and only 3-4% using 200.
The misclassification of bird species was due to illumination and to similar colour patterns in some species. Marini et al.’s method uses only colour features and therefore produced more misclassifications among species with similar colour patterns than the proposed method. For the seven-class dataset, the misclassification of Budgerigar (wild-type) as Nanday parakeets was reduced by 9.1% when the proposed method was used. Similarly, with the extended dataset, misclassification of Alexandrine parakeets as Nanday parakeets was reduced by 0.5%; Nanday parakeets as Alexandrine parakeets by 3.0%; and Blue-crowned parakeets as Alexandrine parakeets by 1.8%, when the proposed appearance method was used. Finally, due to the distinct colour features of the three Budgerigar colour forms, misclassification among them is low.
Marini et al. (2013) have shown that increasing the number of species reduces the correct classification rate. Classification rates dropped significantly when moving from seven to thirteen classes: whilst the RF classifier remains effective, correct classification rates fell by approximately 10%. When the number of classes was increased from 2 to 5, 17 and 200, the classification rates dropped for both the proposed method and that of Marini et al. Therefore, the conclusion is that increasing the number of classes whilst using only appearance features may result in a drop in the classification rates, irrespective of the appearance features used.
The next chapter will identify relevant motion features which can be extracted from video of birds in flight, and use them to classify species automatically. The appearance and motion features will be effectively combined and evaluated to determine whether this
Classification of Bird Species using Motion Features
In the previous chapter, species were classified using appearance features and the results compared with the state-of-the-art method proposed by Marini et al. (2013) using the three datasets. Experimental results revealed that the proposed appearance features outperformed Marini et al.’s on all three datasets using all four standard classifiers (Naive Bayes (NB), Random Forest (RF), Random Tree (RT) and Support Vector Machine (SVM)). In particular, using the random forest classifier, the proposed method greatly improved correct classification rates over that of Marini et al. (2013) by about 6% on the seven species dataset and 9% on the thirteen classes dataset. There was also an improvement in correct classification when using the CUB-200-2011 dataset, by approximately 6-12% using 2 species, 12-16% using 5, 13-17% using 17 and 2-4% using 200 species.
The methods used in Chapter 4 use single images and appearance-based models for classification; however, bird species also exhibit distinguishing behaviours (flying, moving, poses, etc.) which could also be used to support robust automated identification. This is particularly relevant to the identification of birds in flight, especially at a distance, where appearance-based features such as colour tend to attenuate whilst motion features remain discernible. The aim of the work presented in this chapter is to investigate the potential of motion-based features for differentiation of species with closely related appearances, and also to determine whether motion and appearance features can be merged to produce
results which exceed either set alone. Firstly, a richer feature set based on motion is introduced and used to determine whether it can classify species across the two datasets introduced in the previous chapter. In particular, motion features were investigated to determine whether they can discriminate between species with similar appearances (that is, species which were less well differentiated using appearance features in Chapter 4). Motion and appearance features were then fused and, using standard classifiers, evaluated to determine whether this combination is more effective than either set alone. This chapter is structured into the following sections:
• In Section 5.1 the datasets used and the processing techniques applied before motion feature extraction are introduced.
• The set of motion features used for all the experiments in this chapter is described in Section 5.2.
• In Section 5.3 the experimental work is described.
• Results from the experimental work, including those for the motion feature set and the full feature set, and evaluations to determine the real-time performance of all the models are presented in Section 5.4.
• Finally, conclusions for the chapter are drawn in Section 5.5, which summarises all results in the chapter and briefly introduces what will be described in the next chapter.
5.1 Datasets, Methods and Preprocessing
The extended dataset detailed in Chapter 4 has been used for all experiments presented in this chapter. As a reminder, this is "Dataset #2", which is an extended set of videos covering thirteen classes made up of eleven bird species, one of which (Budgerigar, Melopsittacus undulatus) has three colour forms.
For each video, appearance features are calculated per frame starting from the 64th frame, while motion features are calculated over windows of 64 frames, in strides of one frame. The first set of motion features from a video is calculated using the first 64 frames, and this is merged with the appearance features from the 64th frame to form the first combined feature of that video. Thus, for experiments in this chapter and beyond, videos that are shorter than 64 frames are not included in the dataset. Therefore, all experiments performed using the combined or motion features had fewer videos than those performed using appearance features. Likewise, the number of images in these experiments is also lower, as the first 63 frames are not used in the computation of the appearance features that are merged with the motion features to form the combined set (see Table 5.1).
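The pairing of motion windows with appearance frames described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the function name `combined_feature_indices` is hypothetical, and the pairing of every later window with its own last frame is an assumption extrapolated from the stated first case (frames 1 to 64 paired with frame 64).

```python
Q = 64  # window size in frames, as stated in the text


def combined_feature_indices(num_frames, window=Q):
    """Return (window_start, window_end, appearance_frame) triples, 1-indexed.

    Assumption: each motion window is paired with the appearance features
    of its last frame, generalising the first case given in the text.
    """
    if num_frames < window:
        # Videos shorter than the window yield no combined features
        # and are therefore excluded from the dataset.
        return []
    return [(k, k + window - 1, k + window - 1)
            for k in range(1, num_frames - window + 2)]


if __name__ == "__main__":
    pairs = combined_feature_indices(120)
    print(len(pairs), pairs[0], pairs[-1])
```

Under these assumptions, a video of N frames contributes N − 63 combined feature vectors, which is why the image counts in Table 5.1 are lower than those of the appearance-only experiments.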
Table 5.1: Number of videos and images in the thirteen-class dataset when features are combined. There are fewer videos and images than in the original dataset used for the appearance-features-only experiments.
Species                   Number of videos   Number of images
Alexandrine Parakeet                    77              7,845
Nanday Parakeet                         59              6,246
Blue-crowned Parakeet                   58              5,332
Common House Martin                    114             17,896
Eastern Rosella                         40              3,247
Budgerigar (yellow)                     47              4,329
House Sparrow                           74              5,318
Budgerigar (wild-type)                  41              3,349
Common Wood Pigeon                      35              2,027
Black-headed Gull                      142             29,695
Cockatiel                               58              5,687
Budgerigar (blue)                       76              7,030
Common Starling                         71              5,392
Total                                  892            103,393
Before extracting the motion features, the centroids of the segmented bird silhouette are first extracted from each frame of the videos, after performing the pre-processing
described in Chapter 4. The 2D centroid positions are used to form a trajectory in the image frame. For any bird tracked throughout N frames, such a trajectory is described as:
$$T = \{(x_1, y_1), \ldots, (x_N, y_N)\} \tag{5.1}$$
Here $j \in \{1, \ldots, N\}$ is the frame index, and $T$ represents the entire trajectory of a bird as a series of $x$ and $y$ coordinates of the centroid in the image frame. Using the entire trajectory (Eqn. 5.1) for each video, shorter overlapping sub-trajectories $t_k$ were defined, each starting on frame $k$ of the video, with $k$ ranging from 1 to $N - Q + 1$, where $Q$ is the window size:
$$t_k = \{(x_k, y_k), \ldots, (x_{k+Q-1}, y_{k+Q-1})\} \tag{5.2}$$
In this case $Q = 64$. The overlapping windows advance in steps of one frame. For example, in a video of $N = 120$ frames, the first window spans frames 1 to $Q = 64$, the second spans frames 2 to $Q + 1 = 65$, and so on. In general, the $k$-th window spans frames $k$ to $k + Q - 1$. The total number of short overlapping trajectories in this example is therefore $N - Q + 1 = 57$.

A box filter (Gonzalez and Woods, 2002) with a 1 × 3 kernel was then applied to reduce the effect of noise in the trajectory. The box filter simply replaces each trajectory value with the mean of that value and its immediate neighbours. It is usually thought of as a convolution filter: like other convolutions, it is based around a kernel, which represents the shape and size of the neighbourhood to be sampled when calculating the mean. Given the trajectory $t_k$, the smoothed trajectory $st_k$ is the convolution of the kernel $ker$ with the trajectory $t_k$, given by Eqn. 5.3:
$$st_k = ker \ast t_k \tag{5.3}$$
where the kernel $ker$ is
$$ker = \frac{1}{3}\begin{bmatrix} 1 & 1 & 1 \end{bmatrix}$$
Smoothing in this way reduces the noise (that is, the variance), leading to a more accurate estimate of the trajectory. A Gaussian filter (Gonzalez and Woods, 2002) can also be applied for noise smoothing in trajectories; it is similar to the box filter, but uses a kernel that represents the shape of a Gaussian, whereas the box filter weights every sample in the window equally. The motion features were then extracted from the set of smoothed short trajectories (see Eqn. 5.2) to form a feature sequence which was used for classification. The motion features extracted are described in the following section.
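The windowing of Eqn. 5.2 and the smoothing of Eqn. 5.3 can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the code used in the experiments: the function names `sub_trajectories` and `box_smooth` are hypothetical, the filter is assumed to act on the x and y coordinates independently, and the handling of the two end points (repeating the edge value so that each smoothed window keeps Q samples) is an assumption, since the text does not specify it.

```python
import numpy as np

Q = 64  # window size in frames, as stated in the text


def sub_trajectories(traj, window=Q):
    """Split an (N, 2) centroid trajectory into N - window + 1 overlapping
    sub-trajectories with a stride of one frame (Eqn. 5.2)."""
    traj = np.asarray(traj, dtype=float)
    n = traj.shape[0]
    if n < window:
        return np.empty((0, window, 2))
    return np.stack([traj[k:k + window] for k in range(n - window + 1)])


def box_smooth(sub_traj):
    """Convolve one (Q, 2) sub-trajectory with the kernel [1, 1, 1] / 3
    (Eqn. 5.3), treating x and y as separate 1D signals."""
    kernel = np.ones(3) / 3.0
    # Assumption: repeat the first and last points so the output has Q samples.
    padded = np.pad(sub_traj, ((1, 1), (0, 0)), mode="edge")
    return np.column_stack(
        [np.convolve(padded[:, d], kernel, mode="valid") for d in range(2)]
    )


if __name__ == "__main__":
    # Synthetic straight-line trajectory of N = 120 frames (illustrative only).
    frames = np.arange(120)
    traj = np.column_stack([frames, 2 * frames])
    windows = sub_trajectories(traj)
    smoothed = np.stack([box_smooth(w) for w in windows])
    print(windows.shape, smoothed.shape)
```

For the N = 120 example in the text this yields 57 windows of 64 points each. A straight-line trajectory is left unchanged away from the end points, because the mean of three equally spaced samples is the middle sample.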