2.2
Shape from Shading
The most important depth cue in a single image is shading. Recovering depth from a single image is the subject of the research area known as shape from shading (SFS) [HP86, Fau99]. SFS techniques can be categorized as local, linear, minimization and propagation approaches.
Traditionally, shape from shading (SFS) techniques rely on a number of limiting assumptions, including orthographic projection, Lambertian reflection, constant albedo and infinitely distant light sources. Later SFS-based techniques have incorporated perspective projection and variable albedo. SFS techniques also suffer from concave-convex shape ambiguities and are sensitive to shadows.
Shape reconstruction using SFS may give acceptable results for simple shapes such as spheres, but it often fails for complex shapes with variable albedo and texture, such as the 3D face surface, as shown in figure 2.1. The problems are caused by specular reflectance and the presence of facial hair. The mesh shown in figure 2.1 has sunken areas around the eyes and nose because of such variations.
Figure 2.1: Illustration of typical defects in the surfaces obtained using the Tsai shape from shading algorithm [TS94], panels (a) and (b).
The traditional SFS approaches are appealing from a theoretical perspective because they try to be all-encompassing in terms of the shapes that they can recover. This generality is also a reason for the failure of these methods. There is a need for shape from shading approaches that are specific to a particular application such as face reconstruction.
Atick et al [AGR96] recovered 3D shape using shading information together with a statistical shape model. It is well known that expectation and prior knowledge, in addition to superior sensory mechanisms, play an important role in human interpretation of real-world shapes. The prior shape information is obtained from principal component analysis of about 300 laser-scanned 3D models. The ill-posed SFS problem is thus transformed into a parametric SFS problem. The proposed approach constructs a linear face model as follows:
\[ r(\theta, l) = r_0(\theta, l) + \sum_{i=1}^{N} a_i \psi_i(\theta, l) \tag{2.1} \]
In this case the shape $r$ is parameterized by the cylindrical coordinates $\theta$ and $l$, $r_0$ is the mean shape, $a_i$ is a shape parameter and $\psi_i$ is an eigenvector.
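As a rough illustration of equation 2.1, the following Python sketch synthesizes a radius map from a mean shape and a weighted sum of basis shapes. The function name `synthesize_shape`, the tiny 2x2 sampling grid and all numeric values are assumptions made for illustration only; they are not taken from [AGR96].

```python
from typing import List

Grid = List[List[float]]  # radius values sampled on a (theta, l) grid


def synthesize_shape(mean: Grid, modes: List[Grid], coeffs: List[float]) -> Grid:
    """Return r = r_0 + sum_i a_i * psi_i, evaluated at every grid sample."""
    if len(modes) != len(coeffs):
        raise ValueError("one coefficient is needed per mode")
    shape = [row[:] for row in mean]  # copy so the mean shape is left untouched
    for a_i, psi_i in zip(coeffs, modes):
        for j, row in enumerate(psi_i):
            for k, value in enumerate(row):
                shape[j][k] += a_i * value
    return shape


# Hypothetical example: a flat mean radius and two made-up modes.
mean_shape = [[10.0, 10.0], [10.0, 10.0]]
modes = [
    [[1.0, 0.0], [0.0, -1.0]],
    [[0.0, 2.0], [2.0, 0.0]],
]
shape = synthesize_shape(mean_shape, modes, [0.5, -0.25])
print(shape)  # [[10.5, 9.5], [9.5, 9.5]]
```

Setting all coefficients to zero returns the mean shape, which is why the fitting problem reduces to a search over the coefficient vector.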
With the assumption of orthographic projection and Lambertian surfaces, the rendering equation for a single light source is given by:
\[ I(x, y) = \eta(x, y)\, L \cdot n_s(x, y) \equiv \eta(x, y)\, R(L, n_s) \tag{2.2} \]
where $L = (L_x, L_y, L_z)$ is the direction of light, $n_s(x, y)$ is the normal to the surface, $\eta(x, y)$ is called the albedo and $R(L, n_s) = L \cdot n_s$ is known as the reflectance map. The normal to a surface $S$ is calculated as:
\[ n_s = \frac{1}{\sqrt{1 + p^2 + q^2}} (-p, -q, 1) \tag{2.3} \]
where $p = \partial z / \partial x$ and $q = \partial z / \partial y$ are the surface gradients. Finally, the shape from shading problem is formulated as an optimization problem that minimizes the error in shape space. The shape error is given as:
\[ E = \int (I(x, y) - R(x, y))^2 \, dx \, dy \tag{2.4} \]
The optimization process minimizes the error in equation 2.4 to obtain the optimal shape parameter vector $(a_i)$.
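Equations 2.2 to 2.4 can be sketched in a few lines of Python. The sketch below makes several assumptions that are not stated in the source: the surface is given as a height map on a unit-spaced grid, gradients are approximated by forward differences (with a zero gradient at the last row and column), negative shading values are clamped to zero, and the integral in equation 2.4 is replaced by a sum over pixels. All function names are hypothetical.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Image = List[List[float]]


def surface_normal(p: float, q: float) -> Vec3:
    """Unit normal (-p, -q, 1) / sqrt(1 + p^2 + q^2), as in equation 2.3."""
    norm = math.sqrt(1.0 + p * p + q * q)
    return (-p / norm, -q / norm, 1.0 / norm)


def lambertian_intensity(albedo: float, light: Vec3, normal: Vec3) -> float:
    """Equation 2.2: albedo times L . n_s (clamping at zero is an assumption)."""
    dot = sum(l * n for l, n in zip(light, normal))
    return albedo * max(dot, 0.0)


def render(height: Image, light: Vec3, albedo: float = 1.0) -> Image:
    """Render a height map z(x, y) using forward-difference gradients."""
    rows, cols = len(height), len(height[0])
    image = []
    for y in range(rows):
        row = []
        for x in range(cols):
            p = height[y][min(x + 1, cols - 1)] - height[y][x]  # dz/dx
            q = height[min(y + 1, rows - 1)][x] - height[y][x]  # dz/dy
            row.append(lambertian_intensity(albedo, light, surface_normal(p, q)))
        image.append(row)
    return image


def shading_error(observed: Image, rendered: Image) -> float:
    """Discrete analogue of equation 2.4: sum of squared differences."""
    return sum(
        (i - r) ** 2
        for obs_row, ren_row in zip(observed, rendered)
        for i, r in zip(obs_row, ren_row)
    )


# Hypothetical example: a ramp surface lit from the viewing direction.
light = (0.0, 0.0, 1.0)
ramp = [[0.5 * x for x in range(4)] for _ in range(3)]
flat = [[0.0] * 4 for _ in range(3)]
observed = render(ramp, light)
error_ramp = shading_error(observed, render(ramp, light))
error_flat = shading_error(observed, render(flat, light))
print(error_ramp, error_flat > error_ramp)  # 0.0 True
```

In a parametric setting, the candidate surfaces passed to `render` would be generated from a shape parameter vector, and the parameters giving the smallest `shading_error` would be retained.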
In addition to the generic shape assumption, constant albedo is a well known limitation of most SFS approaches. In [ZC00], the authors presented a model enhanced SFS method for
face recognition that is robust to illumination changes. In this approach the authors assume varying albedo for accurate shape recovery from a single face image. They propose symmetric SFS for illumination normalization; symmetry is thus recognized as a useful property of facial shape that can be exploited for shape recovery.
More recent efforts on statistical shape from shading have utilized intensity variation for evaluating the surface orientation (or surface normals) rather than as an indication of depth variation. Smith et al [SH06] capture shape variation using a statistical model of variations in the surface normal direction. The surface normals are obtained from 3D range surfaces, and the statistical model is constructed after azimuthal equidistant projection of these normals. This model is fitted to intensity images using constraints provided by Lambert’s law and image irradiance equations.
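The azimuthal equidistant projection maps a unit normal to a point in the plane whose distance from the origin equals the angle between the normal and a reference direction. The Python sketch below centers the projection on the viewing axis (0, 0, 1); this choice of reference direction is a simplifying assumption, and the reference direction used in [SH06] may differ. The function names are hypothetical.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def normal_to_plane(normal: Vec3) -> Tuple[float, float]:
    """Project a normal to the plane: radius = zenith angle, direction = azimuth."""
    nx, ny, nz = normal
    length = math.sqrt(nx * nx + ny * ny + nz * nz)
    nx, ny, nz = nx / length, ny / length, nz / length
    theta = math.acos(max(-1.0, min(1.0, nz)))  # angle from the z axis
    phi = math.atan2(ny, nx)                    # azimuth in the x-y plane
    return (theta * math.cos(phi), theta * math.sin(phi))


def plane_to_normal(u: float, v: float) -> Vec3:
    """Inverse mapping from the projection plane back to a unit normal."""
    theta = math.hypot(u, v)
    if theta == 0.0:
        return (0.0, 0.0, 1.0)
    scale = math.sin(theta) / theta
    return (u * scale, v * scale, math.cos(theta))


# Round trip for an arbitrary unit normal.
n = (0.3, -0.2, math.sqrt(1.0 - 0.3 ** 2 - 0.2 ** 2))
u, v = normal_to_plane(n)
recovered = plane_to_normal(u, v)
```

Because the projected points live in an ordinary plane, a linear statistical model such as PCA can be applied to them, and the inverse mapping returns valid unit normals.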
Despite all the effort that has gone into shape from shading, many SFS approaches have achieved only poor overall reconstruction results [ZTCS99]. The major limitations of the shape from shading approach include restrictive assumptions about albedo (constant), camera projection (orthographic versus perspective) and illumination (a single parallel light source), the reliance on a single image, the inability to handle shadow areas and the lack of prior knowledge about the shape space.
2.3
2D Shape Models
The study of shape variation in 2D face images is essential for overcoming problems due to the 3D nature of the face. This realization has led researchers to create various shape models from 2D face images to overcome pose and expression variations.
The application of principal component analysis (PCA) to face recognition was a major development for facial image analysis [Tur90, KS90], because it derived an implicit representation of the face space.
The development of active shape models (ASM) by Cootes et al [CTCG95] has been widely recognized as important for facial image analysis. The authors used manually placed landmarks to model the shape variation of a specific class of faces. The landmarks from each image were represented in the form of a shape vector and aligned with the corresponding landmarks on a reference image using Procrustes alignment.
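A minimal Python sketch of aligning one 2D landmark set to a reference is given below. It removes translation and scale by centering and normalizing both shapes, then applies the least-squares rotation, which has a closed form in 2D. This is an illustration under stated assumptions rather than the procedure of [CTCG95]: it aligns a single shape to a fixed reference and does not iterate over a training set. The function names and the square test shape are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def normalize(shape: List[Point]) -> List[Point]:
    """Move the centroid to the origin and scale the shape to unit norm."""
    cx = sum(x for x, _ in shape) / len(shape)
    cy = sum(y for _, y in shape) / len(shape)
    centered = [(x - cx, y - cy) for x, y in shape]
    norm = math.sqrt(sum(x * x + y * y for x, y in centered))
    if norm == 0.0:
        raise ValueError("all landmarks coincide")
    return [(x / norm, y / norm) for x, y in centered]


def align_to_reference(shape: List[Point], reference: List[Point]) -> List[Point]:
    """Return the shape after translation, scale and rotation alignment."""
    a, b = normalize(shape), normalize(reference)
    # Angle maximizing sum_i b_i . (R a_i), i.e. the least-squares rotation.
    num = sum(ax * by - ay * bx for (ax, ay), (bx, by) in zip(a, b))
    den = sum(ax * bx + ay * by for (ax, ay), (bx, by) in zip(a, b))
    angle = math.atan2(num, den)
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in a]


# Hypothetical example: a square that was rotated, scaled and shifted.
reference = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
rot = math.radians(30.0)
moved = [
    (2.0 * (math.cos(rot) * x - math.sin(rot) * y) + 5.0,
     2.0 * (math.sin(rot) * x + math.cos(rot) * y) - 3.0)
    for x, y in reference
]
aligned = align_to_reference(moved, reference)
target = normalize(reference)
```

After alignment, the remaining differences between shapes reflect shape variation rather than pose, which is what the subsequent statistical analysis is meant to capture.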
Applying PCA to the aligned shape vectors yields a statistical shape space in which the shape can be manipulated by varying the shape parameters. The resulting model is also known as a point distribution model (PDM). This model was later extended to include texture variation as well, and became known as the active appearance model (AAM). The texture variation is modeled by warping an input image onto the mean shape to obtain a shape-free texture. Both active shape and active appearance models are 2D models and thus do not cope well with pose variation.
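The construction of a point distribution model can be sketched as follows: compute the mean shape vector, form the covariance matrix of the training vectors and extract its principal modes. To stay within the Python standard library, the sketch below extracts only the leading mode and uses power iteration in place of a full eigendecomposition; this substitution, the function names and the three training vectors are assumptions made for illustration.

```python
import math
from typing import List, Tuple

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]


def covariance(vectors: List[Vector], mean: Vector) -> List[Vector]:
    """Sample covariance matrix of the shape vectors."""
    dim, n = len(mean), len(vectors)
    cov = [[0.0] * dim for _ in range(dim)]
    for v in vectors:
        d = [v[k] - mean[k] for k in range(dim)]
        for i in range(dim):
            for j in range(dim):
                cov[i][j] += d[i] * d[j] / (n - 1)
    return cov


def leading_mode(cov: List[Vector], iterations: int = 100) -> Tuple[float, Vector]:
    """Largest eigenvalue and its eigenvector, found by power iteration."""
    dim = len(cov)
    v = [float(k + 1) for k in range(dim)]  # arbitrary fixed start vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("start vector has no component along any mode")
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * cov[i][j] * v[j] for i in range(dim) for j in range(dim))
    return eigenvalue, v


def generate_shape(mean: Vector, mode: Vector, b: float) -> Vector:
    """Shape instance x = mean + b * mode for a single shape parameter b."""
    return [m + b * p for m, p in zip(mean, mode)]


# Hypothetical training set: two landmarks (x1, y1, x2, y2) per shape.
training = [
    [0.0, -1.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, -1.0],
]
mean = mean_vector(training)
eigenvalue, mode = leading_mode(covariance(training, mean))
instance = generate_shape(mean, mode, math.sqrt(eigenvalue))
```

The sign of the returned mode is arbitrary, so a shape parameter b and its negative simply swap roles if the sign flips.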
The AAM was extended to handle pose variation through a view-based active appearance model, which involved separate models for profile, half-profile and frontal views [CWT00]. This approach is limited by the head pose variation and illumination conditions available in the image database used during training.
Yongmin Li et al [LGH01] proposed using a 3D model constructed from 2D landmarks to handle two problems: large pose variation and modeling faces dynamically in video sequences. The 3D shape model was built using 44 landmarks. The texture was decoupled from shape and pose by warping it to the mean shape and frontal view, thus giving a shape-and-pose-free texture. The proposed approach requires a large number of images (50 per subject) to model shape variation, in addition to a large number of landmarks.
A useful addition to 2D statistical shape modeling is the multidimensional morphable model by Jones et al [JP98]. The main improvement over traditional shape models is the method for establishing correspondence. Rather than using a sparse set of landmarks, as in ASM and AAM, pixel-wise correspondence was established. This step makes it possible to represent a given class of objects as linear combinations of shape and texture vectors. A stochastic gradient descent algorithm is used to optimize over 40 randomly chosen points during each iteration of the matching algorithm. Although the results of this paper are immensely useful, the goal of handling pose variation still remains elusive in image
analysis applications.
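The idea of evaluating the matching error on only a small random subset of points at each iteration can be sketched with a toy problem. The Python example below fits the coefficients of a linear combination of two basis signals by stochastic gradient descent, sampling 40 points per step. It is not the image matching objective of [JP98]; the basis signals, step size, iteration count and function name are all assumptions chosen for illustration.

```python
import math
import random
from typing import List


def sgd_fit(target: List[float], basis: List[List[float]], n_samples: int = 40,
            steps: int = 500, rate: float = 0.05, seed: int = 0) -> List[float]:
    """Fit target ~ sum_i c_i * basis_i using a random subset of points per step."""
    rng = random.Random(seed)
    coeffs = [0.0] * len(basis)
    n_points = len(target)
    batch = min(n_samples, n_points)
    for _ in range(steps):
        grad = [0.0] * len(basis)
        for j in rng.sample(range(n_points), batch):
            residual = sum(c * b[j] for c, b in zip(coeffs, basis)) - target[j]
            for i, b in enumerate(basis):
                # Gradient of the mean squared residual over the sampled points.
                grad[i] += 2.0 * residual * b[j] / batch
        coeffs = [c - rate * g for c, g in zip(coeffs, grad)]
    return coeffs


# Hypothetical data: 200 points, two basis signals, known coefficients.
n = 200
basis = [
    [math.sin(2.0 * math.pi * j / n) for j in range(n)],
    [math.cos(2.0 * math.pi * j / n) for j in range(n)],
]
true_coeffs = [1.5, -0.7]
target = [sum(c * b[j] for c, b in zip(true_coeffs, basis)) for j in range(n)]
fitted = sgd_fit(target, basis)
```

Each step touches only 40 of the 200 points, so the cost per iteration is independent of the total number of points; this is the property that makes the sampling strategy attractive for dense, pixel-wise models.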
Xiao et al [XB+04] have proposed an active appearance model that has similar capabilities to 3D shape models. The 3D shape is represented using a manually crafted 2D triangulated mesh based on a spatial arrangement of vertices. First an AAM is constructed for 5 people using 20 images of each person. This AAM is then fitted to short videos (of 900 frames) of each of the 5 people, and the results are used to compute the 3D shape models. Thus the 3D modes are computed from the 2D AAM shape modes and the 2D AAM tracking results. It is claimed that a 2D shape model can represent anything a 3D model can represent. This point is illustrated by using randomly generated shape parameters and camera projection matrices to synthesize the 2D shapes of 60 3D model instances.
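Synthesizing a 2D shape from a 3D model instance amounts to applying a camera projection to the 3D vertices. The Python sketch below assumes a scaled orthographic (weak perspective) camera and a rotation about the vertical axis only; the camera model, the function name and the example values are assumptions and are not taken from [XB+04].

```python
import math
from typing import List, Tuple

Point2 = Tuple[float, float]
Point3 = Tuple[float, float, float]


def project_weak_perspective(points: List[Point3], yaw: float, scale: float = 1.0,
                             offset: Point2 = (0.0, 0.0)) -> List[Point2]:
    """Rotate 3D points about the y axis, drop depth, then scale and translate."""
    c, s = math.cos(yaw), math.sin(yaw)
    projected = []
    for x, y, z in points:
        x_rot = c * x + s * z  # x coordinate after rotation about the y axis
        projected.append((scale * x_rot + offset[0], scale * y + offset[1]))
    return projected


# Hypothetical 3D landmarks viewed frontally and after a 90 degree head turn.
landmarks = [(1.0, 2.0, 3.0), (-1.0, 0.5, 0.0)]
frontal = project_weak_perspective(landmarks, yaw=0.0)
profile = project_weak_perspective(landmarks, yaw=math.pi / 2, scale=2.0,
                                   offset=(10.0, 0.0))
```

The projection discards depth, so different combinations of 3D shape and pose can produce the same 2D shape.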
To overcome these problems, methods based on 3D face geometry have been proposed. Face transfer based on multilinear models has been used for transferring expression from an input video to a 3D model. The multilinear models are built using 3D face scans that encode expressions, visemes and identity [VBPP05]. Pighin et al describe a method for producing realistic expression synthesis by fitting a hand-crafted 3D model using landmarks and by blending texture from target images using appropriate weights [PHLS98].
One problem with existing methods is that expressions are not synthesized by geometric transformation of the underlying shape but by transformation of texture. To synthesize facial expression at varying degrees of expression intensity, given frontal face images with neutral expressions, the shape reconstruction approach proposed in this thesis could be combined with a 3D statistical deformation model of facial expressions [JLAG08]. The resulting high-resolution 3D models with texture could then be rendered under a variety of pose and illumination conditions.
The 3D models based on the 2D AAM are sparse when compared to 3D models obtained using scanners. It is difficult to apply these models for photorealistic rendering or expression synthesis because these applications tend to require detailed shape representation (see figure 2 in [XB+04]).
instances which are not physically realizable. Because they require a large number of images with different head poses per subject, such models are difficult to extend to novel subjects. They have therefore been applied only to specific applications such as face tracking, and it is difficult to extend their use to other applications such as face recognition.