Review of Literature
2.2. HOLISTIC METHODS 17
formed by affine or rigid transformations. Statistical shape models construct a space using eigenvectors, where each point represents a different face and each eigenvector represents an exemplar face that contributes to an overall face. Deformable mesh models consist of a template mesh that is wrapped around the face or object. Each point in the template mesh is moved close to the target surface using affine transformations.
Deformable Mesh
The deformable mesh representation is characterised by using a template mesh and applying piece-wise affine transformations. Each vertex of the mesh is transformed to align with the target shape.
This is a non-rigid 3D registration where correspondences are implied by the final registration of the template to the target shape.
Bulpitt and Efford[36] utilise a deformable triangular mesh for segmentation in magnetic resonance (MR) images. The template mesh begins as a sphere around the object being modelled and is iteratively contracted around the target shape. In each iteration, all vertices are moved to a new position determined by minimising the energy:
$$E = \sum_{i=1}^{N} \left( \alpha_i E_{\mathrm{cont}} + \beta_i E_{\mathrm{curv}} + \gamma_i E_{\mathrm{img}} + E_{\mathrm{ext}} \right), \tag{2.1}$$
which is calculated over the neighbourhood N of each vertex. The $E_{\mathrm{cont}}$ term is based on the centroid of the neighbourhood points and ensures the vertices are evenly spaced. $E_{\mathrm{curv}}$ is an approximate measure of curvature at the measured points, determined using the distance of the vertex from the average plane defined by its neighbours. Together, these measures form a stiffness constraint on the mesh deformation. The $E_{\mathrm{img}}$ term relates to the normalised grey level of the image and drives the fit once it is close to a final solution. Finally, $E_{\mathrm{ext}}$ relates to external constraints on the model, such as invalid positions or minimum node spacing. As the mesh contracts around an object, vertices may be added where it becomes overly stretched to allow more detail, while redundant vertices are removed in smooth areas. This process is applied to segmenting a head and a metacarpal in MR images.
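As a rough illustration of how the two stiffness terms in (2.1) could be evaluated for a single vertex, the sketch below measures continuity as the squared distance of the vertex from the centroid of its neighbours, and curvature as its distance from a plane fitted through those neighbours. The function names, the squared-distance form and the least-squares plane fit are assumptions made for illustration; they are not taken from Bulpitt and Efford[36].

```python
import numpy as np

def continuity_energy(vertex, neighbours):
    """Squared distance from the vertex to the centroid of its neighbours (assumed form)."""
    centroid = neighbours.mean(axis=0)
    return float(np.sum((vertex - centroid) ** 2))

def curvature_energy(vertex, neighbours):
    """Distance of the vertex from a least-squares plane through its neighbours (assumed form)."""
    centroid = neighbours.mean(axis=0)
    # The plane normal is the direction of least variance of the centred neighbours,
    # i.e. the last right singular vector.
    _, _, vt = np.linalg.svd(neighbours - centroid)
    normal = vt[-1]
    return float(abs(np.dot(vertex - centroid, normal)))

def stiffness_energy(vertex, neighbours, alpha=1.0, beta=1.0):
    """Weighted sum of the continuity and curvature terms for one vertex."""
    return (alpha * continuity_energy(vertex, neighbours)
            + beta * curvature_energy(vertex, neighbours))

# Four neighbours lying in the plane z = 0, with the vertex lifted above their centroid.
neighbours = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0], [0.0, -1.0, 0.0]])
vertex = np.array([0.0, 0.0, 0.5])
energy = stiffness_energy(vertex, neighbours, alpha=1.0, beta=2.0)
```

A vertex that sits at the centroid of a flat neighbourhood has zero stiffness energy under these definitions, so minimising the sum favours evenly spaced, locally smooth meshes. The image and external terms are omitted here because they depend on data not described in this section.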
In establishing a point-to-point correspondence and registering a template to a target shape, Allen et al.[37] and Amberg et al.[38] take a similar approach. Allen et al.[37] focus on registering a deformable whole body template to scans while Amberg et al.[38] present a general, non-rigid Iterative Closest Point (ICP)[39] method that is applied to faces. Both deform a template mesh using affine transformations at each vertex. Each vertex is moved towards an implied correspondence given by the closest point on the target shape. Allen et al.[37] optimise a weighted energy function using a Newton-type optimiser (L-BFGS-B by Zhu et al.[40]) to find a set of affine transformations,
one for each template vertex. The energy function contains three error terms: data, smoothness and marker error. The data error is the summed squared distance between each transformed template vertex and the corresponding surface point. Smoothness error acts as a stiffness term, penalising vertex transforms that differ greatly from their neighbours. The marker error term provides an initial alignment for the registration; this requires landmarks to be placed on the scanned surface that correspond to landmarks on the template mesh. Without the marker term, the registration could only function if the template and target surface had a close initial alignment; otherwise the optimisation would tend to be caught in local minima. Amberg et al.[38] use a similar weighted error term for their template registration, with a stiffness term based on a weighted difference of the transformations of neighbouring vertices. The primary difference between the methods of Allen et al.[37] and Amberg et al.[38] is the optimisation method. While Allen et al.[37] use a Newton-type optimiser, Amberg et al.[38] modify the ICP algorithm introduced by Besl and McKay[39]. The ICP algorithm is a rigid registration method that aligns two point sets based on the implied correspondences of the closest points between them. Like the optimisation method used by Allen et al.[37], ICP requires a good initial registration in order to avoid local minima and align the point sets correctly. Amberg et al.[38] use a non-rigid ICP where the energy function is minimised in each iteration. Additionally, the weighting of the stiffness term is reduced in each iteration, allowing for an initial global alignment and then progressively greater local deformations;
this allows for markerless registration in some cases. The method of Amberg et al.[38], with markers, is later used by Paysan et al.[26] to construct the Basel face model.
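To make the closest-point idea concrete, the following is a minimal sketch of the rigid ICP loop described above: match each source point to its nearest target point, solve for the least-squares rotation and translation, apply them, and repeat. It is not the non-rigid variant of Amberg et al.[38], which replaces the single rigid transform with per-vertex affine transforms and a stiffness term. The function names, the brute-force nearest-neighbour search, the SVD-based (Kabsch) solution and the fixed iteration count are all assumptions made for illustration.

```python
import numpy as np

def closest_points(source, target):
    """For each source point, return the nearest target point (brute-force search)."""
    d2 = ((source[:, None, :] - target[None, :, :]) ** 2).sum(axis=2)
    return target[d2.argmin(axis=1)]

def best_rigid_transform(source, matched):
    """Least-squares rotation r and translation t such that r @ p + t is close to q."""
    mu_s, mu_m = source.mean(axis=0), matched.mean(axis=0)
    h = (source - mu_s).T @ (matched - mu_m)      # 3x3 cross-covariance matrix
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))        # guard against a reflection
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    t = mu_m - r @ mu_s
    return r, t

def rigid_icp(source, target, iterations=20):
    """Iteratively align source to target using implied closest-point correspondences."""
    current = source.copy()
    for _ in range(iterations):
        matched = closest_points(current, target)
        r, t = best_rigid_transform(current, matched)
        current = current @ r.T + t
    return current

# A small, well-separated point set and a slightly rotated and shifted copy of it.
target = np.array([[0, 0, 0], [2, 0, 0], [0, 3, 0], [0, 0, 4],
                   [2, 3, 0], [2, 0, 4], [0, 3, 4], [1, 1, 1]], dtype=float)
angle = 0.05
rot = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                [np.sin(angle),  np.cos(angle), 0.0],
                [0.0, 0.0, 1.0]])
source = target @ rot.T + np.array([0.05, -0.03, 0.02])
aligned = rigid_icp(source, target)
```

The example starts from a small misalignment on purpose. With a larger initial rotation the closest-point matches would be wrong and the loop could settle in a local minimum, which is the sensitivity to initialisation noted in the text.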
Deformable mesh models are able to fill in holes and provide an accurate reconstruction of a surface using non-rigid registration techniques. The deformable mesh provides a point-to-point correspondence between the template and the fitted surface, and between two surfaces by fitting them both to the same template. Correspondence can only be achieved if the template settles in a global minimum state and, as such, landmark points are often hand labelled on the target surfaces.
These types of models can also separate the style and content of the modelled shape. Anguelov et al.[29] use a model fitted in a similar manner, in which the body pose (style) is modelled separately from the body shape (content).
A similar method to the deformable mesh is presented by Bronstein et al.[41] in the Face2Face system. In this system different expressions are modelled as isometries of the facial surface. This has been empirically shown to be valid by Bronstein et al.[42]. Correspondences between facial surfaces are found using a generalised multi-dimensional scaling algorithm (GMDS)[43] on a sample of points from the surface. Using an initial face scan as a template, the GMDS algorithm minimises the distortion of geodesic distances on the surface between faces. This establishes a correspondence
between two surfaces. The Face2Face system is used for virtual make-up, where a 3D video of a face has a virtual make-up applied to the whole sequence of scans based on the initial frame. The GMDS establishes a correspondence between the initial frame and other frames in the sequence.
Statistical Shape Models
Statistical shape models represent another method for modelling a surface that is distinct from the deformable mesh models. Where deformable mesh models have a template mesh that is transformed to conform around the target shape, statistical shape models build a multi-dimensional object space to capture the variation within a population of objects. When applied to faces, each point in this face space represents a different face and the basis vectors of the space are exemplar faces. With a statistical shape model, an object is defined by a vector in the space. This parameter vector and the model can be used to fully reconstruct the object surface with dense shape models.
Having a dense correspondence between surfaces is critical in both building and fitting a statistical shape model. A measure of how each point on a surface changes within a population of objects like faces is required for a statistical shape model to be constructed. Therefore, a dense correspondence across the population is required. Styner et al.[44] note that statistical shape models provide excellent promise but require a correct dense correspondence for a good parametrisation.
When applied to faces, the modelled population must be in the same coordinate frame and a dense correspondence known between each individual. Only then can an accurate model be constructed.
Without this dense correspondence, the parametrisation of the model will be faulty and at worst contain no meaning at all.
In the seminal work by Blanz and Vetter[45], a 3D morphable model, also known as a statistical shape model, is constructed for the synthesis of faces from 2D images. The model is constructed from 200 faces where a dense correspondence is found using a gradient based optic flow algorithm.
In the morphable model, the shape and texture of the faces are modelled separately. Each face is represented by a shape vector $S = (x_1, y_1, z_1, \ldots, x_n, y_n, z_n)^T \in \mathbb{R}^{3n}$ and a texture vector $T = (r_1, g_1, b_1, \ldots, r_n, g_n, b_n)^T \in \mathbb{R}^{3n}$. The model is constructed from the eigenvectors of a principal component analysis (PCA) transform on the zero-meaned shape and texture data. The eigenvectors are an orthogonal basis for the face space. A face consisting of shape data $S_0$ and texture $T_0$ is parametrised by:
$$S_0 = \bar{S} + \sum_{i} \alpha_i s_i, \qquad T_0 = \bar{T} + \sum_{i} \beta_i t_i,$$
where $\bar{S}$ and $\bar{T}$ are the mean shape and texture, $s_i$ and $t_i$ are eigenvectors in the shape and texture model and $\alpha_i$ and $\beta_i$ are their associated parameters. Each eigenvector $s_i$ and $t_i$ represents an exemplar face with some attribute, so by adjusting the parameters $\alpha_i$ and $\beta_i$ a new face is synthesised.
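The construction can be sketched in a few lines, assuming the usual PCA form in which a face is the mean plus a weighted sum of eigenvectors. The sketch builds a shape model from flattened shape vectors that are already in dense correspondence, projects a shape onto the basis to obtain its parameter vector, and reconstructs it again. The function names and the random stand-in data are hypothetical, and the SVD route to the eigenvectors is one common choice rather than the procedure of Blanz and Vetter[45]; a texture model would be built the same way from the texture vectors.

```python
import numpy as np

def build_pca_model(shapes):
    """shapes: (m, 3n) array, one flattened shape vector per row, in dense correspondence."""
    mean = shapes.mean(axis=0)
    centred = shapes - mean
    # Rows of vt are orthonormal eigenvectors of the sample covariance matrix.
    _, singular_values, vt = np.linalg.svd(centred, full_matrices=False)
    variances = singular_values ** 2 / (shapes.shape[0] - 1)
    return mean, vt, variances

def project(shape, mean, basis):
    """Parameter vector for a shape that is already in correspondence with the model."""
    return basis @ (shape - mean)

def reconstruct(alpha, mean, basis):
    """Mean shape plus the weighted sum of eigenvectors."""
    return mean + alpha @ basis

# Stand-in data: 5 'faces' with 4 vertices each (3 * 4 = 12 coordinates per face).
rng = np.random.default_rng(0)
shapes = rng.normal(size=(5, 12))
mean, basis, variances = build_pca_model(shapes)
alpha = project(shapes[2], mean, basis)
rebuilt = reconstruct(alpha, mean, basis)
```

Because the rows of the basis are orthonormal, projection is a single matrix product, which is why parameters are easy to obtain once a scan has been brought into correspondence with the model. Changing entries of `alpha` and calling `reconstruct` synthesises a new shape.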
When synthesising a 3D face to match a 2D image or fitting the model to a 3D scan, a stochastic gradient descent algorithm is used to determine the shape and texture parameters for a given face.
For the 2D image matching case, the model is rendered using a perspective projection rendering and modelled lighting conditions. To fit the model and avoid local minima, landmarks are placed on the image to ensure an initial fit[25, 46].
Later, Blanz and Vetter[25] use this shape and texture model to perform face recognition in 2D images. When using 2D images there are limiting factors in pose, illumination and perspective that must all be accounted for. Using a model to perform face recognition is beneficial because the parameter vectors defining faces are easily compared and represent only shape and texture information. Since the parameter vectors represent only information intrinsic to the face, straightforward vector comparisons of parameters indicate the similarity between faces. At the same time, limiting factors in imaging, such as illumination and pose, are discarded. Similar models used to synthesise 3D faces from 2D images have also been presented by Lee[47] and Pighin et al.[48].
Paysan et al.[26] present another 3D deformable model primarily for fitting 3D faces, the Basel face model (BFM). This model is constructed from 200 database faces, consisting of 100 males and 100 females using high resolution scans. The model is constructed in the same manner as Blanz and Vetter’s[45] model using PCA, where texture and shape are modelled separately. The dense correspondence of the scans is found using the deformable mesh model presented by Amberg et al.[38]. This method implicitly fills any holes in the input scan without an explicit hole filling process. When using the model to perform 3D face recognition, only shape information is utilised.
The non-rigid ICP method of Amberg et al.[38] is used to capture the complete shape and correct model correspondence of an input face scan. This is initialised by localising the nose tip using the method by Haar and Veltkamp[49]. Using the registered template from the method of Amberg et al.[38], which has a dense correspondence with the model, finding parameter vectors is trivial; faces are compared by measuring the angle between parameter vectors.
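The comparison step can be illustrated with a short sketch that measures the angle between two parameter vectors and picks the gallery face with the smallest angle to a probe. The function names, the gallery dictionary and the example vectors are hypothetical, and whether the parameters are scaled before comparison is not stated here, so the plain unscaled angle is an assumption.

```python
import math

def angle_between(a, b):
    """Angle in radians between two parameter vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine)

def most_similar(probe, gallery):
    """Return the name of the gallery entry with the smallest angle to the probe."""
    return min(gallery, key=lambda name: angle_between(probe, gallery[name]))

# Hypothetical parameter vectors for two enrolled faces and one probe scan.
gallery = {"face_a": [1.0, 0.0, 0.2], "face_b": [0.0, 1.0, -0.3]}
probe = [0.9, 0.1, 0.25]
best = most_similar(probe, gallery)
```

An angle-based measure ignores the length of the parameter vector, so two faces that differ from the mean in the same direction but by different amounts are treated as similar.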
Both Blanz and Vetter[45] and Paysan et al.[26] model the shape and texture in separate models. They also model different regions of the face separately to allow for a better fit. Another form of deformable model is one in which the expression and shape are separated[50]; these are also examples of separating content and style as suggested by Tenenbaum and Freeman[35].