Temporal Trajectory Segmentation based on Statistical Sequential

4.2 ADL Representation

4.2.3 Temporal Trajectory Segmentation based on Statistical Sequential

Current SoA action recognition methods that incorporate trajectory information in a BoVW representation scheme [21] consider that trajectories have a fixed (manually derived) temporal length. However, in practice there may be large variations in the way each activity is carried out, as similar actions are often performed at different time scales, dependent on camera properties like frames per second being recorded, or anthropometric variations between the human subjects that perform the activities.

To address these issues, we propose a novel solution for the extraction of temporal trajectories based on statistical sequential change detection (SSCD) [95], [96], which finds moments of change in the HOFs corresponding to meaningful changes in the activities taking place. The Cumulative Sum(CUSUM) [97] test is applied to each trajectory’s HOFs for meaningful temporal segmentation and consequently improved recognition results. A strength of CUSUM, and SSCD in general, is that it is able to detect an unknown change point between two unknown data distributions [98]. Additionally, sequential algorithms only process the data made available up to the current point in time, allowing them to operate in an online manner.

In order for CUSUM to detect changes in the video’s motion, the original and current distributions of the data need to be extracted empirically [96]. Data distributions can be approximated and changes between them can be detected by using as few as W₀= 8 − 10 video frames. In this work, the data comprises of a set of HOFs, i.e. a set of N × 1 histograms, so we are essentially finding the distribution of these histograms. The initial distribution f0, is approximated from the first W0 motion histograms (HOFs) f0 = f {¯h₁, ¯h₂, ..., ¯h_W₀} and the current data distribution f₁, uses the latest W₁ observations, including the current frame’s HOF ¯hk: f1 = f {¯h_k−(W₁−1), ¯hk−W1, ..., ¯hk}. For simplicity, in the sequel we consider that W0 = W1, as this does not affect the accuracy of the distribution approximation. Thus, for modeling the distribution of the HOFs, we have as input data a W0× N matrix, where W₀ is the number of samples and N the number of bins, i.e. the initial length of each HOF. The minimum time that SSCD takes to converge (W0+ W1) is on average 1 second (20-30 frames), while we do not consider that trajectories that are larger than 4 seconds (80-120 frames). This algorithm performs well when camera frequency is larger than 8 frames per seconds.

4.2.3.1 Multivariate Gaussian HOF Modeling

We make the assumption that the initial and current distributions of the HOFs can be approximated by a multivariate Gaussian distribution. We validate this assumption by applying the Mardia tests, which are detailed below. The Gaussian distributions of the HOF pdfs are given by:

These pdfs can be approximated by estimating their mean ¯µ_l, l = {0, 1}:

We make the assumption that the HOFs, representing the distribution of the directions of optical flow in each temporal window, are uncorrelated, as in most cases there are no great variations in motion over time and between nearby spatial neighborhoods. Under the Gaussian assumption, this leads to the representation of HOFs as independent and identically distributed (i.i.d.) in each temporal window. In practice, the joint pdf of W₀ HOFs can then be represented by the product of the individual HOFs in each frame k = {1, 2, ..., W₀}. The HOFs’ covariance matrix is diagonal, as the data is independent between different time instances i 6= j, with zero cross-correlation for different trajectories and auto-correlation equal to the square of the variance:

C_0|1(i, j) =

( 0 i 6= j, σ² i = j,

In order to validate the Gaussian approximation of our data, we applied appropriate tests of multivariate normality namely the Mardia tests [99], [100]. The Mardia tests involve the calculation of (a) the multivariate skewness, (b) the multivariate skewness corrected for small samples and (c) the multivariate kurtosis. When the null hypothesis for all these tests is validated, it implies that the residuals are normally distributed. In order to verify the Gaussianity of the data, we normalize samples ¯h_i via ¯h^norm_i = (¯h_i− ¯µ)/σ, and apply the Mardia tests to ¯h^norm_i to examine its normality.

For a set of L d-dimensional random variables X = [¯x1, ¯x2, ..., ¯xL], ∀¯xi ∈ R^d, where d is

where Σ is the sample covariance matrix of X. Mardia showed [99] that under the null hypothesis of multivariate normality, ^N₆b1,d is asymptotically distributed as a Chi-square distribution with d(d+1)(d+2)

6 degrees of freedom. A measure of the multivariate kurtosis is given by:

Under the null Mardia hypothesis, b_2,d is asymptotically normally distributed with mean d(d + 2) and variance 8d(d + 2)/L.

Taking into consideration these measures, we applied the Mardia skewness and kurtosis tests to 24 different samples from the KTH [9] activity recognition dataset, in order to validate the Gaussian assumption of our HOF data. The KTH videos contain a variety of human actions, like walking, jogging, boxing etc, performed by various subjects both indoors and outdoors. Our experimental results showed that the skewness null hypothesis is accepted in more than 99,6% of the cases, while the kurtosis null hypothesis is always accepted, indicating that our data (the HOF orientations) can indeed be modeled by a multivariate normal distribution.

4.2.3.2 Statistical Sequential Change Detection (SSCD)

Based on the empirical Gaussian approximations of the HOFs’ distribution before and after a change, we approximate the log-likelihood ratio, i.e. the test statistic T_k for frame k in the implementation of CUSUM [101], by:

T_k= ln f1(¯h_k) f0(¯hk)

= ln(f1(¯h_k)) − ln(f0(¯h_k)) (4.3)

CUSUM is applied via the computationally efficient iterative form of Page [101]:

Sk= max(0, Sk−1+ Tk), S0 = 0

From Eq. (4.3) we can see that when the current histogram is close to the initial one (i.e. there has been no change in motion), the log-likelihood ratio will be nearly zero, while when the current histogram deviates from the initial one, there is an increase in T_k and consequently in S_k. Thus, changes are detected at the points in time where there

is a sharp increase in the Sk curve. Replacing Eq. (4.2) into Eq. (4.3), we obtain the Considering that our covariance matrix is diagonal, its inverse form is given by:

C_l⁻¹= elements of the covariance matrix are non-zero, since they correspond to the autocorrela-tion of the HOFs, so the covariance is indeed invertible. Under the i.i.d. data assumpautocorrela-tion, all values of the variance are σ²_l,n= σ², l = {0, 1}, n = [1, 2, ..., W0], as explained in the previous section.

By plugging in Tk calculated at each frame, we get a value for the test statistic Sk which significantly increases when there is a change in our data, i.e. the motion features. In order to detect this increase, we compare the test statistic S_k at each frame with a threshold derived from its previous values. In the statistical literature, this threshold cannot be determined in a theoretical manner. The widely accepted solution is to experimentally find a threshold that leads to few false alarms [102]. We automatically estimate the threshold at each time instant k based on the recent test statistics values:

η_k = mean[S_k−1] + c · std[S_k−1]

where mean[...] is the mean of the test statistic’s values until frame k − 1, std[...] is their standard deviation and c is the threshold parameter, empirically selected to equal 2.5.

We decided to use this value after cross validation performed on numerous characteristic ADL videos for a set of eight different numbers in the interval [0.001, 10]. When the test statistic S_k surpasses this threshold η_k, a change in motion is detected, signifying the end of that trajectory.

This procedure leads to the temporal segmentation of the extracted trajectories based on actual changes in the activities taking place, rather than by using a manually selected constant threshold as in the literature. Introducing this non ad-hoc approach to segmenting trajectories indeed improves the recognition results of our method, as can be

Figure 4.8: Optimal trajectories depicted for several activities. Answer phone, eat snack, eat banana are depicted for URADL in the top row, while handwave,jog and run are depicted from KTH in the bottom. Green lines denote the extra extension from

fixed trajectory with temporal window of 15 frames.

seen in the experiments of Sec. 4.4. In Fig.4.8 we can see several examples of optimal trajectories in several activities taken from URADL and KTH dataset. Green lines depict the extension size that optimal trajectories introduce to a fixed size trajectory of 15 frames. We can see that trajectories are terminated in a more meaningful way when a change in the motion pattern occurs or when the activity stops.

In document Video processing and background subtraction for change detection and activity recognition. (Page 85-89)