Model Based Tracking - Model based methods for locating, enhancing and recognising low resoluti

Tracking describes the process of locating an object in successive frames of a video se-quence, where the location of the object is usually defined in respect to the camera co-ordinate system. In 2D tracking the transformation between video frames can be de-scribed with affine transformations or projective transformation, e.g. homography, while 3D tracking requires the calculation of the 3D pose and orientation of the object. The pose parameters T^extare then defined as translations [dx, dy, dz] and rotations [φx, φy, φz] around the camera axis.

Most 2D tracking approaches work well for planar objects and under constant lighting conditions. However, when tracking non-planar objects, large view point changes may alter the appearance of the object and cause most 2D based trackers to fail. Model based tracking tries to overcome these limitations by incorporating the 3D model of an object to help find the new pose and orientation in the next frame. By knowing the 3D geometry of the object, changes in pose and lighting can be modelled and thus, better accounted for.

Model based tracking can either be feature based or appearance based as shown in Fig-ure2.2. Feature based methods track single features from the previous frame to the current frame and then estimate the pose parameters and the deformation parameters of the ob-ject model by fitting it to retrieved feature points in the current frame. Appearance based tracking methods used the entire texture of the face, rather than single feature points, to find the new position in the next frame. Representatives of both methods are described in the following sections.

2.2.1 Feature Based Tracking

Feature based tracking methods usually calculate a motion field first and then deform the object model to fit the new feature points. Optical flow (Gautama and van Hulle, 2002) is commonly used for calculating a two-dimensional motion field based on image intensities. Given a feature at image location (x, y) in frame i the optical flow assumes that the intensity of this feature is constant in the next frame i + 1 as:

I(x, y, i) = I(x + ∆x, y + ∆y, i + 1) (2.16) where I(x, y, i) is the intensity of the image feature at location (x, y) in frame i and (∆x, ∆y) is the feature point displacement in image coordinates.

The position of each feature (x + ∆x, y + ∆y) in frame i + 1 can be calculated using block-matching algorithms (Shi and Tomasi, 1994) or differential methods, for example.

A good overview including a performance evaluation is given byBarron et al. (1994). The accuracy of the motion field is highly dependent on the selected features. Feature points lacking a rich local texture are more likely to be mismatched, i.e. their position may not be recovered accurately.

The facial tracking method proposed in Tao and Huang(1999) uses a template matching approach to estimate a motion field in two consecutive frames. The 3D motion of the entire mask, i.e. translation and rotation as well as the expression parameters are then calculated using a least square estimator by linearising the derivatives. This way single mismatched image features are corrected by using the deformability of the mask as constraint. This tracking approach is extended to learn new facial expressions that are not yet represented by the face model. After each frame the resulting tracking error is analysed and the predefined deformation parameters are adjusted, if required.

2.2.2 Appearance Tracking

Appearance based tracking methods use a textured 3D model of the object to find the location of that object in the next frame. Instead of an accurate 3D face model the method proposed by Cascia et al. (2000) uses a cylindrical head model to track the motion of a person’s head. It is argued that a more complex head model does not improve the quality of the track as it requires more parameters and is less robust to perturbations in the initial position. The cylindrical model is initialised automatically using the result of a

face detector. The authors reported that it was not possible to automatically initialise a complex human head model.

The cylindrical model is projected into the first frame of the tracking sequence and a refer-ence texture J0 is extracted. This reference texture is used to calculate so called warping templates by perturbing the initial position. A displacement matrix N_a = [n₁, n₂, ..., n_K] stores the displacements of each pose parameter T = [dx, dy, dz, φ_x, φ_y, φ_z, ]. Each warping template ok is then calculated as:

J₀ = P(I₀, T₀)

o_k= J₀− P(I₀, T₀+ n_k)

where P projects the cylindrical head model into image I₀ using parameters T₀ and J₀ is the resulting reference texture. In practise, four displacements per parameter are sufficient.

The illumination is modelled similarly as a linear system using illumination templates.

These templates U are calculated by applying Singular Value Decomposition (SVD) to a large set of training images of different persons under different lighting conditions. The difference in illumination between two frames can then be approximated as:

J − J₀ ≈ U c (2.17)

where the columns of U are the illumination templates and c is a vector of coefficients.

Tracking is then realised by calculating the difference between the reference texture J₀ and the texture of the current frame J . Using the warping templates and the illumination templates this difference is approximated as:

J − J₀ ≈ Oq + U c (2.18)

where the columns of O contain the warping templates. The linear coefficients q and c are estimated as a weighted and regularised least square solution.

According to Wen and Huang(2005) the best model-based tracking results in low resolu-tion video are obtained by combining feature based and appearance based methods. They initialised the face mask manually in the first frame and for each new frame two texture residuals are calculated using an appearance based and a feature based tracking approach.

The method that results in the smallest texture residual is then chosen for the current frame.

2.2.3 Particle Filters

Particle filters (Kitagawa, 1987; Doucet et al., 2000) are statistical tools to model non-linear discrete time series and are used for feature based tracking as well as appearance based tracking. They are a special variant of the Kalman Filter that allows the state variables to be non-Gaussian distributed. The theory of Monte Carlo samples is applied to estimate the posterior probability distribution of the current state xt. When used for tracking, each state represents the location and/or pose parameters of the tracked object in the current video frame t. The posterior of each state is factorised as follows:

P (xt|y_1:t) ∝ P (x1) can be low level image features, for example. Each state usually describes the location and/or pose parameters of frame t and all observations are conditionally independent given the state.

The conditional state density P (xt|y_1:t) at time t is represented by a set of samples, i.e. particles, x⁽ⁱ⁾_t with corresponding weights w⁽ⁱ⁾_t representing the sample probability.

Given a set of particles at time t − 1, new samples at time t are drawn depending on the sampling scheme (MacKay,1998). The bootstrap particle filter uses importance sampling and Monte Carlo samples x⁽ⁱ⁾_t are drawn as:

x⁽ⁱ⁾_t ∼ P (x_t|x_t−1) (2.20)

we_t⁽ⁱ⁾= P (y_t|x_t), w_t⁽ⁱ⁾= we_t⁽ⁱ⁾ P

i we_t⁽ⁱ⁾

(2.21)

where ∼ means “sample from” and w⁽ⁱ⁾_t is the weight associated with sample i. After each time step t all particles are resampled (Doucet et al.,2000).

There are a number of variations to this particle filter approach which differ in the way the samples are drawn, or the resampling rate. The annealing particle filter (Deutscher and Reid,2005) for example decreases the spread of sampling distribution in order to ensure convergence.

Such an annealing particle filter tracking approach, combined with incremental weighted Principal Component Analysis (PCA), is used in Tu et al. (2006) to track faces in low

resolution. They use a face model that only covers the top part of the face as this part is mostly undisturbed by facial expressions. The samples are used to represent translation and rotation parameters in 3D space and incremental PCA is applied to model appearance variations and thus, track the face.

In document Model based methods for locating, enhancing and recognising low resolution objects in video (Page 38-42)