4.0.2 Alignment based methods
Alignment based methods fit a model to an image by minimising the error between a rendered image of the face and an input image: the rendered face is adjusted until it matches the input image as closely as possible. In these systems an Appearance Model, or Morphable Model, describes both the shape and colour of the face. Pose and lighting are separated from the Appearance Model by modelling these attributes physically, with a set of adjustable parameters so that the simulated physical attributes can be made to match those depicted in the image. The parameters of both the Appearance Model and the physical model are adjusted iteratively so that the error function is reduced towards a global minimum that represents the best match between the rendered face model and the input image. The error function is normally the l2-norm, or sum of squared pixel errors, although variants such as the weighted l2-norm exist, which are potentially more robust to noise and occlusions in the image. Normalized cross-correlation can also be used, in which case the function is maximized rather than minimised.
The error functions are normally minimised by relating the changes in pixel intensity brought about by varying the model's physical and appearance parameters to the differences in intensity between the rendered and input images. In general, however, this relationship is not linear. Many changes in both the physical parameters (e.g. a translation) and the shape parameters of the model do not produce a linear change in the intensity values of the pixels. The problem is compounded in three dimensions, because a transform that is linear in three dimensions is not necessarily linear when projected onto a two-dimensional plane; this is the case with rotations. Alterations of parameters such as rotation, the position of the lighting and some shape changes alter the face's silhouette, or the distribution of shadows, and so can have a marked effect on pixel intensity values even when the change in the parameter itself is small. Finally, part of the face can be occluded by a non-face object, resulting in pixel values that are unrelated to the face model. It is in tackling these problems that much of the variety between fitting methods arises.
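As a rough illustration of the error functions mentioned above, the following Python sketch computes the sum of squared pixel errors, a weighted variant, and normalized cross-correlation on flattened lists of pixel intensities. The function names, the binary confidence mask and the toy pixel values are assumptions made for illustration; they are not taken from any of the cited systems.

```python
import math

def ssd(rendered, target):
    """Sum of squared pixel errors (squared l2-norm); lower is better."""
    return sum((r - t) ** 2 for r, t in zip(rendered, target))

def weighted_ssd(rendered, target, weights):
    """Weighted l2 error: each pixel contributes according to its confidence."""
    return sum(w * (r - t) ** 2 for r, t, w in zip(rendered, target, weights))

def ncc(rendered, target):
    """Normalized cross-correlation in [-1, 1]; higher is better."""
    n = len(rendered)
    mean_r = sum(rendered) / n
    mean_t = sum(target) / n
    num = sum((r - mean_r) * (t - mean_t) for r, t in zip(rendered, target))
    den = math.sqrt(sum((r - mean_r) ** 2 for r in rendered)
                    * sum((t - mean_t) ** 2 for t in target))
    return num / den if den > 0.0 else 0.0

# Toy data (hypothetical): a rendered face patch and an input patch in which
# one pixel is hidden by an occluder.
rendered = [0.1, 0.4, 0.8, 0.5, 0.2]
occluded_input = [0.1, 0.4, 0.0, 0.5, 0.2]
mask = [1.0, 1.0, 0.0, 1.0, 1.0]  # assumed confidence: 0 where occluded

plain_error = ssd(rendered, occluded_input)                  # roughly 0.64
masked_error = weighted_ssd(rendered, occluded_input, mask)  # occluder ignored
```

In this toy case the single occluded pixel accounts for the whole of the plain error, while the weighted version ignores it. Normalized cross-correlation, by contrast, is unchanged by a positive gain or an offset applied to one of the images, which is one reason it is sometimes preferred when overall brightness is poorly modelled.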
By separating the key elements of alignment based fitting we can get an overview of the various directions researchers have taken in tackling the problem of fitting a face model to an image.
1. The sophistication of the rendering model: The more accurately the rendering model can synthesise human face images in a variety of physical conditions, the more accurately it can match the pixel values in a particular image, given the correct parameters. Rendering models range from a simple point-source and ambient lighting model [16], through a 9-D Spherical Harmonic basis for lighting [95], to the detection and modelling of specular highlights [67].
2. The error function: The error function describes the difference between the rendered face model and an input image in a sensible manner. The ideal error function both ignores irrelevant features (e.g. occlusions and shadows) and weights the fitting towards features relevant to the face image. It should also be continuously differentiable so that a gradient descent method can be used to find the global minimum. The correctly matched face must minimise this function. Most fitting methods, in both two and three dimensions, use a squared pixel difference metric. Patterson et al. [58] evaluated the l2-norm together with the Mutual Information, in terms of the individual and joint entropy of the rendered and input images, and a correlation ratio between the images.
3. Outlier removal: Many methods make use of, or are derived from, the l2-norm of pixel differences between the input and rendered image. This has the disadvantage of being highly vulnerable to outliers. Outliers in the pixel difference can indicate occlusion of the face by a foreign object, or the presence of a facial blemish (e.g. a mole) or facial hair that is not captured by the Appearance Model. One method of removing outliers involves weighting the l2-norm according to a measure of confidence that the sample point is part of the face image and not part of the background. Romdhani et al. [69] use a Talwar function to remove occluded parts of the face model. Romdhani also used a Lorentzian estimator to remove outliers [67]. De Smet et al. [76] modelled a visibility map as a binary Markov Random Field to take advantage of the spatial coherence of outliers. In two dimensions, Theobald et al. [81] compared the effectiveness of a number of different methods of detecting outliers on AAMs and concluded that, where known, a weighted probability function based on the per-pixel standard deviation of the residuals obtained when fitting AAMs to unoccluded images was superior to the other methods tested. Where this data is unavailable, the Talwar or Cauchy functions outperform the others. The choice of functions investigated was limited to those that required minimal re-computation of the Hessian matrix.
4. The method of minimization (or maximization): The algorithm used to minimize the chosen error function(s). Most of the error functions used in face fitting suffer from local minima, and a variety of methods have been developed to avoid these. Blanz and Vetter used a stochastic gradient descent method [16]. Romdhani and Vetter [69] adapted the inverse compositional alignment method of Baker and Matthews [10] to 3DMMs; this method is based on Gauss-Newton gradient descent. These methods rely on the computation of the derivatives of the cost function in the image-space or some suitable proxy. This is possible using an l2-norm based function, but non-trivial in other cases. They also suffer from local-minima problems when the data is noisy or the error function is non-convex. Moghaddam used a downhill Simplex method to minimize a cost function based on aligning silhouettes [54]. Learning methods such as the regression technique of Cootes et al. [20] have also been used.
5. The method of combining different sources of error: Some of the algorithms use multiple error functions, or separately apply multiple filters to the images (both input and rendered). These are either applied to the model separately, resulting in an iterative multi-pass algorithm [95, 76], or combined using a Bayesian metric to create a single parameter update at each iteration [67].
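To make the interaction between outlier weighting (item 3) and Gauss-Newton minimization (item 4) more concrete, the sketch below fits a single parameter with iteratively reweighted Gauss-Newton steps. It is a toy under stated assumptions, not the algorithm of any cited paper: the one-parameter "renderer" `render`, the threshold `c`, the starting value and the synthetic samples are all hypothetical, and the Talwar and Cauchy weights are written in their common reweighting form.

```python
import math

def talwar_weight(r, c):
    """Talwar: full weight inside the threshold, zero outside (hard rejection)."""
    return 1.0 if abs(r) <= c else 0.0

def cauchy_weight(r, c):
    """Cauchy: weight decays smoothly as the residual grows."""
    return 1.0 / (1.0 + (r / c) ** 2)

def unit_weight(r, c):
    """Constant weight: recovers the plain l2-norm."""
    return 1.0

def render(x, p):
    """Hypothetical one-parameter 'renderer': intensity at sample position x."""
    return math.exp(-p * x)

def fit(xs, ys, p0, weight_fn, c=0.4, iters=20):
    """Reweighted Gauss-Newton on sum_i w_i * (y_i - render(x_i, p))**2."""
    p = p0
    for _ in range(iters):
        num = 0.0
        den = 0.0
        for x, y in zip(xs, ys):
            r = y - render(x, p)      # residual at the current parameter
            j = -x * render(x, p)     # derivative of render with respect to p
            w = weight_fn(r, c)       # confidence that this sample is an inlier
            num += w * j * r
            den += w * j * j
        if den == 0.0:                # every sample rejected: stop updating
            break
        p += num / den                # Gauss-Newton update for one parameter
    return p

# Synthetic input (assumed): generated with p = 1.0, last sample "occluded".
xs = [0.25 * i for i in range(9)]
ys = [render(x, 1.0) for x in xs]
ys[-1] = 1.0

p_robust = fit(xs, ys, 0.5, talwar_weight)  # occluded sample is rejected
p_plain = fit(xs, ys, 0.5, unit_weight)     # occluded sample biases the fit
```

With these toy values the Talwar-weighted fit recovers the generating parameter, whereas the unweighted l2 fit is pulled noticeably away from it by the single occluded sample. The hard threshold works here only because the starting point is close enough that inlier residuals stay below `c`; with a poor initial estimate a hard rejection rule can discard valid samples, which is one motivation for smoother weights such as the Cauchy function.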