Rochester Institute of Technology
RIT Scholar Works
Theses
Thesis/Dissertation Collections
2004
Efficient Human Facial Pose Estimation
James C. Schimmel
Follow this and additional works at:
http://scholarworks.rit.edu/theses
This Thesis is brought to you for free and open access by the Thesis/Dissertation Collections at RIT Scholar Works. It has been accepted for inclusion in Theses by an authorized administrator of RIT Scholar Works. For more information, please [email protected].
Recommended Citation
Efficient Human Facial Pose Estimation
by
James
C.
Schimmel
A Thesis Submitted
IIIPartial Fulfillment of the
Requirements for the Degree of
MASTER OF SCIENCE
IIIComputer Engineering
Approved by:
Principal
Advisor _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
_
Dr. Andreas Savakis, Associate Professor and Department Head
Committee Member
-Dr. Juan C. Cockburn, Associate Professor
Committee
Membet
__
_
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _
_
Dr. Lawrence Ray
, Research Associate, Eastman Kodak Company
Department of Computer Engineering
College cif Engineering
Rochester Institute of Technology
Rochester, New York
RELEASE PERMISSION FORM
Rochester Institute of Technology
Efficient Human Facial Pose Estimation
I,
James C. Schimmel, hereby grant pennission to the Wallace Library of the Rochester Institute
of Technology to reproduce my thesis in
whole or in
part. Any reproduction will not be for
commercial use or profit.
Abstract
Pose estimationhas become an
increasingly
important areain computer visionand morespecifically in human facial recognition and activity recognition for surveillance applications. Pose estimationis a process
by
whichthe rotation, pitch, oryaw of ahuman head is determined. Numerous methods already exist which can determine the angular change of aface, however,
these methods vary in accuracy and their computational requirements tend to be too high for real-time applications. The objective ofthis thesis is to
develop
a method forpose estimation, whichis computationallyefficient,whilestill maintainingareasonabledegreeof accuracy.Inthisthesis,afeature-basedmethod ispresentedto determinethe yawangle of ahuman facial pose using a combination of artificial neural networks and template matching. The artificial neural networks are used for the feature detection portion ofthe algorithm along with skindetectionand otherimage enhancement algorithms. Thefirst head model, referredto as the Frontal Position
Model,
determines the pose ofthe face using two eyes and the mouth. The secondmodel, referredto as the Side PositionModel,
is used whenonly one eye canbe viewed and determines pose based on a single eye, the nose tip, and the mouth. The two models are presented to demonstrate the position change offacial features due to pose andto provide the meanstodeterminetheposeasthese featureschangefromthe frontalposition. Theeffectiveness ofthispose estimation methodis examinedby looking
atboth themanual and automatic feature detection methods. Analysis is further performed on how errors in feature detection affect theresulting pose determination. The method resulted inthe detection offacial pose from 30 to -30 degrees with an average error of4.28 degrees for the Frontal Position Model and 5.79 degrees forthe Side Position Modelwithcorrectfeaturedetection.
The
Intel(R)
Streaming
SIMD Extensions(SSE)
technology
was employed to enhancethe performance of
floating
point operations. The neural networks used inthe feature detection process requirealargeamountoffloating
pointcalculations, duetothecomputation oftheimage data with weights and biases. With SSE optimization the algorithm becomes suitable forTable
ofContents
ListofFigures iv
ListofTables vii
Chapter1: Introduction 2
Chapter2: Background onPose Estimation 5
2.1 AnthropometricResearch 5
2.2 Appearance-based Approaches 7
2.3 Feature-based Approaches 9
2.4 An Overview ofSkin Detection 12
2.5 An Overview ofArtificial Neural Networks 15
Chapter 3: Feature DetectionandNormalization 19
3.1 LimitationsandDifficulties 19
3.2 Skin Detection 21
3.3 Image Normalization 27
3.4 Neural Network
Training
283.5 Feature Detection 37
3.6 Template
Matching
Algorithm 42Chapter 4: Pose Estimation
Theory
andModels 454.1 Frontal Position Head Model 45
4.2 Frontal Position Head Model Pose Estimation Process 53
4.3 Side Position Head Model 55
4.4 Side Position HeadModel-Pose Estimation Process 60
Chapter5: Pose Estimation Results and
Accuracy
645.1 Image Set 64
5.2 Manual Feature Detection 65
5.2.1 Frontal Position Model 66
5.3 Automatic FeatureDetection 74
5.4 Error AnalysisandPropagation 75
Chapter 6: Implementation andSSEOptimizations 88
6.1 Matlab Implementation 88
6.2 C++Implementation 89
6.3 IntelSSE Optimizations 91
6.4 Performance Analysis 95
Chapter 7: Conclusion andFuture Work 100
ListofReferences 102
List
ofFigures
Figure 1:Possibleanglesforpose estimation 3
Figure 2: Exampleheadmodelusinganthropometric measurements 6
Figure 3: Neuralnetwork 16
Figure 4: Nodeoperation 16
Figure 5: Algorithm flow 19
Figure6: Skindetection histogram 22
Figure 7: Original image anddetectionskin area 23
Figure 8: Resultof morphologicalfilters 23
Figure 9: Skinregionimage 24
Figure 10: Final facialregion 24
Figure 11: Original imagewithhand 25
Figure 12: Region imagewithhand 25
Figure 13: Final facialarea withhand 25
Figure 14: Effectofconnectedskinregions 26
Figure 15: Skintones 27
Figure 16: Samplefaces for
training
29Figure 17: Eye feature images 30
Figure 18: Non-eye featureimages 30
Figure 19: Mouthfeature images 30
Figure 20: Non-mouthfeatureimages 30
Figure 21: Eyenetwork configuration 32
Figure 22: Mouthnetwork configuration 32
Figure 23: Transferfunctions
[27]
32Figure 24: EyenetworkROC graph 35
Figure 25: MouthnetworkROCgraph 35
Figure 26: Processofscanningtheimage 38
Figure 27: Original colorimage 38
Figure 29: Grayscale image 39
Figure 30: Mouthweightedlocationmap 40
Figure 31: Eyeweightedlocationmap 40
Figure 32: Eyeweightlocationmap afteraveragefilter 41
Figure 33: Facialtemplates 42
Figure 34: Finalappliedtemplate 44
Figure 35: Headmodel
-overheadview 46
Figure36: Headmodel
-overheadview
during
rotation 47Figure 37: Headmodel
-sideview 48
Figure38: Geometrictemplate onthe2-Dplane 49
Figure 39:Magnification
during
image capturing 50Figure 40: Theoreticalanglesduetopose 52
Figure 41: Comparisonof
theory
anglestoa realface 53Figure42: Eyeangles alphal vs. alpha2 54
Figure 43: Inversegraph of alphal vs. alpha2 54
Figure44: Headmodel
-overhead viewwith nose 56
Figure 45: Headmodel
-rotated overheadview withnose 57
Figure 46: Headmodel- sideview with
nose 57
Figure 47: Geometrictemplate onthe2-Dplane 58
Figure48:
Theory
angles forsideviewheadmodel 59Figure 49: Comparisonof
theory
anglesto realfaceforsideviewheadmodel 60 Figure 50: Pose Anglevs. theproportion oftemplateangles 61Figure 51: Linearapproximationforpose from
theory
data 62 Figure 52: Linearapproximation forpose comparedtoimage data 62Figure 53: Sampletestimages 65
Figure54: Bestcase manualdetectionposeresults, FrontalPosition Model 68
Figure 55: Bestcase manualdetectionerror, Frontal Position Model 68
Figure 56: Worstcasemanualdetectionposeresults,Frontal Position Model 69 Figure 57: Worstcasemanualdetectionerror,Frontal Position Model 69
Figure 58: Bestcasemanualdetectionposeresults, Side Position Model 72
Figure 60: Worstcase manualdetectionposeresults, SidePosition Model 73 Figure61: Worstcase manual detectionpose error, Side Position Model 73
Figure 62: Samplecomparisonforpose results 75
Figure 63: Errorinthe template 76
Figure 64: Poseerrorduetofeatureerrorinclosest eye 78 Figure 65: Poseerrorduetofeatureerrorin farthesteye 79
Figure 66: Poseerrordueto featureerrorinmouth 80
Figure67: Poseerrordueto errorinallfeatures 81
Figure68: Allowederrorinfeature detection 81
Figure69: Graphicalrepresentation of error 82
Figure 70: 3-Dplot of pose errorduetorelative errorineyedetection 83 Figure 71: 3-Dplot of pose errorduetorelative errorinmouthdetection 83 Figure 72: Monte Carlo simulation,
introducing
errorintoallfeatures 84 Figure 73: Monte Carlo simulation, effectofvaryingtheerrorrange 85Figure 74: Effectofscalingon eyedetection 86
Figure 75: Effectofscalingonmouthdetection 86
Figure 76: Matlab functions 89
Figure 77: UML diagram 91
Figure 78: Typicalxmmregister representation 92
Figure 79: ExampleofSSE operation 93
Figure 80: ANNoptimization 94
Figure 81: Colorconversion optimization 95
Figure 82: Relative
timing
infunctionswithAthlonprocessor 96 Figure83: RelativetimeinfunctionswithPentium 4processor 96 Figure84: Relativetimein functionswithPentium 4 andSSE optimizations 97Figure 85: Effectofimage sizeonoptimizations 98
List
ofTables
Table 1: Typicalconfusion matrix 33
Table2: Eyenetwork confusion matrix 36
Table 3: Mouthnetwork confusionmatrix 36
Table4: Poseerrorfor45 to-45 degreesusingmanualfeaturedetection 66 Table 5: Poseerrorfor 30to -30degreesusingmanualfeature detection 67 Table6: Poseerrorfor 20 to-20degreesusingmanualfeaturedetection 67 Table 7: Poseerrorfor 45to -45 degreesusingmanualfeaturedetection 70 Table 8: Poseerrorfor30to-30 degreesusingmanualfeaturedetection 70
Table 9: Poseerrorfor 20 to-20degrees usingmanualfeature detection 71 Table 10: Pose errorfor 30to-30degrees usingautomaticfeaturedetection 74
Table 11: Comparisonchartforall pose methods 74
Table 12: Poseerrorduetofeatureerrorinclosest eye 77 Table 13: Pose errorduetofeatureerrorinfarthesteye 78
Table 14: Poseerrorduetofeature errorinmouth 79
Table 15: Poseerrorduetofeatureerrorinallfeatures 80 Table 16: Averageandstandarddeviationvaluesfor Monte Carlo simulations 85
Chapter
1:
Introduction
Facialpose estimationhas become animportantresearchtopic incomputer visiondueto the large number of applications that may benefit from an efficient method. Faces in images rarely appear frontal and upright, so detection of faces and their associated information has provendifficult. A face can undergo any ofthree rotations out ofthe plane ofthe image. Such rotations introduce deformations into the image affecting the facial
features,
which are presentand the overall shape ofthe head. Pose estimation introduces a way to detect the change in direction and then use this information to augment further processing ofthe image. Real-time
tracking
ofhumanfaces,
image classification, and 3-D modeling are just some ofthe currentapplications for which pose estimation can play a significant part. Current methods for facial pose estimationhave focused on a range ofmethodsfrom using wavelet networksto calculating posethroughgeometric means.
The problem offacial recognition and facial pose has been around for sometime. The humanvisual system can easily detectnot only ahuman
face,
butalsothe relative directionthatface is aligned in 3-D. In computer vision systems those same visual task have been proven difficult. Variations in
lighting,
image size, orientation viewpoint,body
movement, and facialexpressionhavecreatedawide range ofdifficultiesthatmusteventually beovercome. Thescope ofthehumanvisual recognitionisvastmaking it importantto subdividetheprobleminto smaller tasks, suchas poseestimation.
While pose estimation has been examined
by
a number ofresearchers, papers on thissubject tend to combine pose estimation with otherforms offacial detection or 3-D modeling.
Estimating
the pose of a human face may be performed in a variety ofwaysincluding
whichdirections of rotationto examine. The
'yaw'
angle of rotation occurs whenthefaceturnsrightor left from the frontal position. The
'pitch'
ofthe face is the degree to which the faced is turned
vertically up or down. The
'roll'
[Pitch!
Figure 1: Possibleanglesforposeestimation
Inthis approachtopose estimation wefocusprimarily ontheyawangle forpose change.
The yaw angle has the greatest possible benefit to applications such as
tracking
and modeling,since it is a good indication ofthe direction a person is
facing
orheading.By
focusing
on onlytheyawangle change,we eliminate manyofthecomplicationsassociated withfacial recognition
andfocus closelyontheproblem athand.
Methods of pose estimation are oftenbroken down into two categories: feature-based or
appearance-based algorithms. With feature-based pose estimation, the determination ofpose is
basedonthelocationofindividual features. Thesemethodsrequiretheuse offeaturedetectorsto determinethelocationofimportant featuresinthe face. With appearance-basedmethods,pose is
determinedbasedontheface as a whole. Oftenappearance-basedmodels incorporate sometype
offilterto analyzetheimageandthenmake adecision fromtheresultingimage. Anoverview of
feature-basedandappearance-basedtechniquesisprovidedinChapter 2.
With feature-basedapproachestoposeestimation, theway in whichfeaturesaredetected is just as important as which features are detected. For this method we have chosen to use
Artificial Neural Networks as the basis for the feature detectors ofthe system. Artificial Neural Networks
(ANNs)
are an attempttoallowcomputer systemsto learnand adaptby
modelingthestructure ofthe neurons inthe humanbrain. ANNs learn
by
adjusting theirinternalweights andOnce the features for the face have been
determined,
the question ofhow to determinepose fromthesefeatures israised. Againthereare a number ofways inwhichto determinepose
from known features points; these methods depend on the number of features and often the
methods used to acquire them. In Chapter 4 our own models are presented which use three featurepoints onthe face. The firstmodel, referred tohenceforthas theFrontal Position
Model,
makesuse ofthe detection ofthe two eyes andthemouth. This modelmakes use oftheinherent symmetry in the face and is intended primarily for detection of pose in frontal images. The secondmodel, referredtoas theSide Position
Model,
makesuse ofthedetectionof oneeye, the nosetip
and the mouth, and potentially detects pose at greater angle variations. Chapter 5discusses the results ofthese models in
detecting
pose and also introduces some error analysisregarding howerrorsinfeaturedetectionpropagatethroughthe systemeffectingpose estimation. Since the algorithm efficiency is
important,
it is necessary to observe how currenttechnologies mightbe usedto increase the system's speed. Intel's SSE
technology
provides thebest means to enhance the performance due to the algorithm's due high use of
floating
pointoperations.
By
parallelizing many of the calculations performedby
the ANN we achieve anoticeable performance improvementofthe algorithm. Chapter 6 shows howthe algorithmwas implemented into
Matlab,
C++,
and C++ with SSE optimizations. Performance analysis isconductedto showthespeedup fromSSEoptimizations.
Thetechniques presented
here,
such as theuse ofANNs for featuredetection,
may later be extended oradapted toother methods. Themodels presented are also generic enoughto later be expanded to include other angles ofdirectional change. The most important contribution ofthis thesis isan efficient, yetaccurate, means of pose estimationthat caneasily be integrated in
Chapter
2:
Background
onPose Estimation
In understandingtheorigins ofthepose estimationproblem, it is importantto lookat past
research relatedto this topic.
Comparing
previoustechniques inpose estimationhelpsserve as a referenceto themethod presentedinthisthesis.Not only is itnecessarytolookat other methodsofpose estimation, butat methods that are relatedtofacial recognition such as featuredetection
and skin detection. This chapter also provides a briefoverview ofArtificial Neural Networks withregardstotheirgenerals operation andhow ANNs have beenusedinotherresearch areas.
2.1
Anthropometric Research
Anthropometry
is afield ofstudythat involves the measurement ofthe humanbody
andits respective parts. Anthropometric researchcovers not only to the size ofhuman
features,
butalso their relative proportions across different races and ages. While this area of study is
relatively old, it still hasmanypractical uses. Withpose estimation and facial feature
detection,
anthropometric researchcanbeused asthestartingpointfor
building
ahuman headmodel. Such models can useknown anthropometricdata andinformationto determinethe relative locationoffacial features.
Examples of anthropometric study have been seen throughout human history. Most
people are familiar withLeonardo Di Vinci's "Universal Man." This workof art makes note of
many proportions ofthe human
body
that appearto be common among all humans. While this work focuses on limb length andheight,
it makes the point that using anthropometric data is well-known. In the 1970's the scientific community began to take a more serious look atanthropometric data
by developing
large databases [1]. This information has been used forapplications such as ergonomics and other biometric applications. These databases took
measurementsof various
body
parts forthousandsofindividuals. Againthisdatafocusedmostly onheight and limb proportions, but research has done withfacial measurements. Reference[2]
offersonesuch exampleofresearchthathas been done using facialproportions.
To make any pose estimation process useful it must be person-independent across the
spectrumoffacial appearances, sowithanthropometry thequestionarises astohowuseful such datamight betowards facialpose estimation. Reference
[3] by
Horprasert and Yacoob supportsboth genetic and environmental constraints. The shape ofthe eyes, therelative feature
distances,
and even the size ofthe head are all limited within certain ranges regardless ofthe individual
being
observed. Reference[3]
uses anthropometric data extensively to even classify the individuals'gender, race and age before performing its
tracking
operation. Such methods allowfornoprior knowledge ofthe individual in
tracking
or faciallocation,
andtherefore tend to bemore adaptabletoother applications.
Even throughall humanproportions are limitedto ahigh degree ofvariability, there still
exists roomforerrordueto these featuredifferences. The greatest source of error canoftencome from young children, since their relative face proportions are closer together at birth
[4],
mostnotably the distance between the eyes. As the child
develops,
and the nose becomes morepronounced, the proportions change. Race and gender also play an important role in feature
placementand proportionsbuttoalesserextent. Thegoalthenisto
develop
aheadmodelthatisgeneric enoughto fitmost
individuals,
but can usedifferences intheplacement ofthese featurestodeterminepose.
Reference
[2]
by
Alattar and Rajala offers a good example of a head model built usinganthropometricdata. Thismodelispresentedin Figure 2.
1
Figure 2: Example headmodelusinganthropometric measurements
The work presented in this reference primarilyuses elliptical approximations to first determine
the location ofthe head and then incorporates the model to
help
localize the features. ImageOften researchers use the term model-based in reference to their approach for
dealing
withfacialrecognitionor morespecificallypose estimation. Model-basedapproachessimplyuse
some formofahumanheadmodeltoaccomplishtheir goal,andanthropometric researchusually plays apart inthe development ofthis headmodel. While the bulk ofmodel-based approaches are also feature-based approaches, some appearance-based methods to pose estimation also
includetheuse ofaheadmodel and,
therefore,
canbe considered model-based. The debate overwhat
terminology
touse indealing
with certain methodsis explored inanumber ofotherpapers[3]
[4]. Since themethod presented inthis thesis is primarily a feature-based approach,breaking
down other research into either appearance-based or feature-based categories helps in
understanding how this method differs from previous methods. Model-based approaches are classified as
being
feature-based,
ifthey
make any use offacialfeatures,
otherwise,they
are considered appearance-based.2.2 Appearance-based Approaches
Appearance-based approaches,also knownasview-based or holisticapproaches, attempt to determine the pose ofa human face
by looking
at the face as a whole. Methods vary fromusingmorecomplex methods such asGaborwaveletfiltersandneuralnetworks, to more simple methods like color-based comparison. Appearance-based methods seek to avoid the
decomposition offaces into parts such as
features,
and instead look for the face in its entirety.Once the face has been determined these methods base their classification ofthe face from
preexistingdataoffacestakenat variousposes.
The advantages of appearance-based methods are often dependent on how
they
areimplemented,
but here is the general overview. Appearance-basedmodels require informationregarding the face as a whole and are not based on individual
features,
thus, the amount oftraining
and feature extraction required are reduced. These methods in general require lowresolution images. Reference
[5]
requires only a 20x20 image of the face to make itsclassification. With such low resolution, the computation required has the potential of
being
quitelow,
howeverwith presentapproachestheoppositehasprovedtrue.The disadvantages presented
by
appearance-based methods are the primaryreasons whywork into this area has been limited compared to its feature-based counter part.
Often getting the face into a form suitable for classification requires a significant amount of
preprocessing.
Finally
appearance-based methods tend to be less accurate and less robust thanother methods, since
they
require computing a global, lower-dimensional representation oftheimage
information,
although this is subject to debate [3][4][8]. Since the problems associatedwithappearance-based methods are well
known,
it is importanttolookathowother researchhas attemptedtodealwiththeproblem of robustness.An approach taken
by
some researchers[4] [6] [15]
has been to use eigenvectors andeigenvaluesto compute an eigenspace that capturesthe facialinformation in a low dimensional feature space. Aneigenspace is computedusing the eigenvectors fromthe covariance matrix of
the
training
set. The eigenvectors associated with the largest eigenvalues are used as therepresentationofthe
training
set.Eigenspacesthatare usedtorepresentthe faceare also referredto eigenfaces. The approach assumes that low-resolution facial images form a linear subspace
thatcan beused to classify images either for facial detection or pose estimation [4]. With pose
estimationthe methodbecomes more complexrequiringmultiple eigenfaces to be computedfor eachpose desired. Theuseofeigenfaceshasthebenefitsof
limiting
theinfluence oflighting
and other noise ontheimage,
whileproviding accurateclassification offaces. Reference[6]
obtainedan 83% rate ofrecognition using eigenspaces; however the computation of such methods is
relatively high.
Another area of appearance-based pose estimation research is using Gabor filters and
wavelet transforms to determine pose. Gabor kernels are sinusoidal modulated Gaussian
functions. Gaborwavelets are sets ofkernels withdifferent spatial frequencies and orientations.
A Gaborwavelet-transformisproduced
by
convolutionwithGabor kernels [4].The benefits ofusing Gabor filters and wavelets allow for better processing offacial
imagesand classification. Wavelettransforms alsohavethe advantage ofproviding some degree
ofinvariance to
illumination,
skin tone and hair color. Complex uses oftransforms even allowfor selective scalinganddeterminationoforientation. Reference
[7]
makes use ofGaborwaveletfilters in the determination offacial pose. First aclassification is setup using 1470 ofimagesat 10 degree intervals. These
training
images arethenprocessed usingdifferent Gaborfilters. Fromthe initial
training
a filter is chosen to represent the optimal filter for each pose.During
thedeterminationofpose,animage isprocessed with each optimal filter. Theresulting image isthen
were poor, achieving the determination of pose only within +/-10 degrees 90% ofthe time.
Gaborwavelets have been combined with other forms of classification such as artificial neural
networks andBayesian networks [8]. Theuse ofANN inappearance-based models is presented in section 2.5. Although Gabor filters have their limitations
they
do offer a way ofreducingmanyoftheerrorsthatcan occur with appearance-based models.
The examples of appearance-based models presented
help
show their limitations andrelative effectiveness in
detecting
pose. Theproblems of robustness in such applications andthe general agreement that appearance-based methods are more inaccurate than feature-basedmethods,
help
support why the feature-based approach was chosen as the preferred method topose estimation. In the next section we look at feature detection and a number offeature-based
approaches.
2.3 Feature-based Approaches
Feature-basedapproachesdepend onthe extraction offeaturepoints inan
image,
andthe use of these features points in some manner to estimate pose. Feature detection is simply a method ofidentifying
theposition of a specific part of animage;
inthis case an eyeor a mouth. Oftenwithfacialfeaturedetectionsuch methodsrequiretheuse of aheadmodeltoformulatethe final calculation, which is why such methods are often referred to as model-based approaches.Obviously
feature detection itself becomes the focus of most of these methods due to itsdifficulty. Sincefeaturedetection isso
important,
it is necessarytolook intoresearch onthisarea alongwithwaysdetectedfeatures have beenusedto determinepose.An importantpoint of all pose estimation methods isthe determinationofwhich human facial featuresto use.A largenumber offacial featuresoften require additional computationbut have abetterchance at
determining
pose correctly. Reference[9]
uses thedetectionofanyfacialfeatures with edges to match to a facial template for 3-D reconstruction; the number offeature points is
fairly
high and also variesdepending
on the face. Reference[10]
uses fifteen featurepoints spread around the eyes, nose, and mouth region. Reference
[11]
uses only five feature points, namelythe eyes, eyebrows and mouth todetectboth face and pose. The use ofthe eyesasfeature pointstends tobepresent inmost pose estimationtasks due its ease ofdetection from therest ofthe face.
Eyes,
especially the pupils, still have the same general shape across humanrest ofthe face.
Finally
the intra-ocular distance isthe measurement ofthe distance betweentheeyes. The intra-ocular distance varies from individual to individual sothis distance can be used
forpose estimation only ifthe distance is known for the frontal position of a specific person or
can be normalized. Reference
[12]
states that as facial pose changes, the distance between theeyes also changes in size,making theeyes andthe intra-ocular distance ideal forpose estimation
methods.
The mouth is another commonly used
feature,
but it introduces some challenges to itsdetection. Themouth tends to vary in shapeconsiderably between individuals. As theface turns,
the shape ofthe mouth deforms at a greater rate compared to the eyes.
Finally,
from a colorperspective, the natural color tone ofthe mouth often matches the surrounding skin or other
facials tones with some individuals. The color ofthe mouth can be influenced with the use of
makeup such as lipstick. This makes edge detection and color based methods difficult. The
mouthis also sensitiveto theemotion oftheindividual andis influenced
by
smilingor frowning.Thenose, onthe other
hand,
often causesmanyproblemsdue to its lackof contrast withthe rest ofthe
face; however,
methods such as references[10] [13]
still makeuse ofthis featurewith some filtering.
Depending
onthepose, thenostrils arenot always effectivedetectionpointssince changing the pitch ofthehead removesthese features fromthe 2-D image.
Also,
the noseshapeand sizemay obscure nostrilfeatures eveninthefrontalposition.
A number ofmethods attempt to make use ofthe head outline to estimate pose. These
approaches involve elliptical approximations ofthe head shape, which can be computationally
complex in applications.
However,
hair styles and background interference tend to worsen theresults of these experiments as shown in reference [14]. Facial hair causes problems with
elliptical approximations andwith the detection or use ofthe mouth as a feature point. Images
withfacial hairareusually ignored
during
testing.In orderto detectthe features in a
face,
it is often important to establishgroundtruth forthese features. With facial
images,
the use oftagging
certain facial feature regions presents thepossibility ofusingthese featurestoestablishagroundtruth or a standardforwhen the faceis in
a frontalposition. As the face is turned, theviewablefacial feature regions changeand, basedon
their size and shape, pose may be estimated. Such regions may be used in
training
a system toCootes and Tayloroffer some unique approaches to feature detection
by
introducing
theuse of active shape models
[16]
[17]. Active shape models are a way to detect features usingtrained knowledge oftheedges ofthe shape. Active shape models seekto limitthe deformation
ofthemodel to shapes found inthe
training
set.By
approachingfeature detection inthis mannerthey
seekto have variability in the feature model to allow fornoise and occlusion, but limit theamount of false detection that can occur
during
classification. Methods that involvetraining
usuallysharethissamegoal, since false detectionoffeaturescan severelyworsenresults.
An optional approach to whole feature detection is to detect only part of a feature.
Reference
[18]
uses the corners ofthe mouth fordetection,
instead ofthe entire mouth, to avoidshape changes that can occur. Reference
[
19]
uses separate detectors for the corners ofthe eyesandthepupil inorderto accuratelylocatethefeature.
While
training
isthebestexampleofusing priorknowledge forfeaturedetection,
itisnotthe only approach that can be taken. With color
images,
a histogram of the feature can bedeveloped from a numberofsample images. Featurescan thenbe detected
by dividing
an imageinto sub-images and
finding
the sub-image with the greatest correlation to the known colorhistogram,
an example ofthis approach is takenby
reference [20].However,
such methods arelimited in robustness, since
they
do not adapt to occlusions and image quality changes such aslighting
or noise. Another similar approach is the use of feature template matching. In theseapplications, ageneric template ofthe eye or mouth is first created from a test set. Features are
detected
by
correlating sub-images with the feature template and thenidentifying
sub-imagelocations most correlated with the template. These methods tend to be improved fromthe color
histogramapproach, butsufferfrom many ofthe same problemswithocclusions andnoise.
Once facial features are
determined,
it is possible to estimate the final pose in a numberof ways. The most computationally efficient way is through the use of geometric means.
Proportions betweenthe facial features and theiralignment may beusedto calculate the angle at
whichtheface ispositionedofffromcenter.
However,
geometric means ofposeestimation oftenlack the robustness ofother methods and rely
heavily
on correct feature detection. References[10] [13]
bothmake use of geometricandlinearequations todetermine pose.Template matching is the most common methodof
determining
pose through geometricmeans. A template is
typically
composed of a set of points that form some predeterminedcrosses
[11],
or pyramidal shapes[13],
although more complex patterns have been usedincluding
facial meshes or whole head generic structures. The algorithm for matching thetemplate to the model or image usually involves using detected features from the images and
deforming
the template to match the image. Often the process offitting
a template to detected points is called a "bestconstellationsearch"
[20]
and can often entail a decision makingprocessto avoid any
falsely
detected points. The change in the template can be used to determine itsorientationin 3-Dspaceusingeitherlinearequationsor sometypeofclassification system.
A step beyondtheuse oftemplate matching is using afull 3-D modelofthehead inpose
estimation. In these approaches, a 3-D model ofthe person under observation is
typically
used [13]. These methods use features extracted from 2-D images and project them onto this 3-Dmodel.
Using
the full 3-D model allows the determination ofthe rotation ofthe face in allthree possibleplanes anddoes so witha great deal ofaccuracy, often within a single degree. In orderforthese systemsto work, agreat deal ofinformation must already be known about the subject,
whichisnotpractical inmost applications.
All of the previously mentioned methods have both strengths and weakness. One
common problem among them is
dealing
withfalsely
detected features. The strategies toimprove feature detection
typically
involve waysto improve both image quality and localizationwithinthe image. Thenext section introducestheuse of skin detection as a means of
improving
feature detection.Using
skin regions canhelp
localize features and limit the search area forfeatures,
thishelpstoreduce computation andfalse detectionrates.2.4 An Overview
ofSkin Detection
Skin detection is partofthe largerproblem of skin segmentation andhas been appliedto
a number of applications ranging from scene classification to tracking. Skin detection is the process of
determining
which colors, pixels, or regions in an image are skin. Skin segmentation usesthiscolorinformationto extractskinregions out oftheimage thatmaycorrespondto humanshapes like hands or a face. The detection of skin color is independent of facial orientation
making it ideal for pose estimation. Also skin detection is one of the faster facial feature detectionmethodsmaking itsuitableforreal time systems [24].
The detection ofskin is not without its
difficulties,
and the most notable ofthese has todifferent
lighting
conditions; this problem is referred to as color constancy[4]
[24]. Skin colorappearance depends on the brightness ofthe light sourceused in obtaining the image as well as thesurfacetextureandtone ofthecolor intheimage. One approachin
dealing
withlighting
isto usedifferentcolorspacesforskin detection.The use ofthe RGB
(Red, Green,
Blue)
color space in skin detection is common due toease ofprocessing such information sincemostimages are already inthe RGBcolor space. Also theconversionfrom one color spacetoanothertends tobe computationally expensive. TheRGB
color space is also the most susceptible to
lighting
changes, which is why methods ofnormalization are often introduced. References
[4] [23]
introducethe use of colorhistograms for theredandgreen separations oftheimage. These separations are firstnormalizedusingthe blueseparation to
help
reduce variations inlighting
and also skin tones from one individual to another.Another color spacethat is used in skin detection is HSI (Hue-Saturation-Intensity). Hue
corresponds to the attributes ofcolorthat allow them to be classified as red-blue-yellow-green,
while saturation is the purity of color. This leaves
intensity
asbeing
closely related to thebrightness of the image. To make detection invariant to
lighting
we ignore theintensity
separation
during
detection. However one problem with HSI is with pixels oflowintensity
or saturation, such pixels makethedeterminationofHueunreliable [4].SimilartoHSI istheuse ofthe
YCt,Cr
color space. TheCb
componentisthe chrominanceofthe blue separation while the
Cr
component is the chrominance ofthe red separation. The Y component is the luminance ofthe image and is related to the brightness ofthe image and lighting.During
skindetection,
the Y separation istypically
ignored to make the process invariant to lighting. The equations for the conversion from RGB toYCbCr
are shown below.References
[25] [26]
usetheYCbCr
intheirskindetection method.Y= 0.257*7?+0.504*G+0.098*B+16
(1)
C
=-0.148*/?-0.291*
G+0.439*B+128
(2)
Although people generally think of skin color varying greatly among different ethnic
groups, in actualitythe skin tones cluster together once luminance factors are removed. This is another advantage ofusing either the HSI or the
YCbCr
color space [4]. It is this clustering of skin tones that is the basis for most skin detection methods.Looking
atthe distribution ofthenon-luminance color separations on a 2-D graph, one can observe that the skin pixels tend to
group with an almost Gaussian probability distribution. References
[4] [23] [24]
actually fit a Gaussian distributionto their colordata and determine skinbased on wherepixel intensities fallwithinthedistribution.
Other methods ofskin detection use more simplified methods such as
lookup
tables or histograms. Reference[11]
uses RGB values and alookup
table to determine whether or not apixel is skin. Withcolor
histograms,
ahistogram is made for each separation ofcolorusing skintone information.
During
the detection process, if a pixel's color information falls within thehistogram,
then thatpixel is classifiedas skin[23].More complex methods ofdetection expand detection to include the use of classifiers
such as Bayesians networks
[25]
or ANNs to make the final determination between skin and non-skin pixels. Classifiers combine the probability that the current pixel is skin with neighboringpixel information todetermineskinlocation.Once the pixels in an image are detected as skin, or non-skin it becomes necessary to
extract the parts ofthe image required
by
the algorithm, and this is where skin segmentation comes intoplay. Some approaches[11]
make use of elliptical approximations of skin regions to determinewhetherthey
are aface ornon-face; the largest ellipseisprocessed forfacial features. More simplified methods simply use the skin segments to localize the search for features.Methods of segmentation also employ morphological operations such as dilation and erosion to improvetheselection ofaskinregion.
Inthisthesis,skin detectionand segmentation are usedtolimitthe searchforfeatures and
to
help
normalize the image. The use of skin also helps reduce the false detectionrates from the feature detectors. The next section focuses on an overview of artificial neural networks that are2.5 An Overview
ofArtificial
Neural
Networks
An artificial neural network is a
learning
system that is capable of performing dataclassification based on input
training
data. As statedbefore,
ANN systems are motivatedby
theinternal connections ofthe neurons in the human brain. ANNs are made up ofinterconnected
processingelements callednodes,which respond inparallelto a setofinputsignals. Thenode is
the equivalent ofits braincounterpart,theneuron.
The concept for ANN is
fairly
old,dating
backto the 1940's when McCulloch and Pittsintroduced the first neural network computing model. More complex forms ofthis model were
developed into the 1950's
including
Rosenblatt's work resulted in a two-layer network, theperceptron,whichwas capable of
learning
certainclassificationsby
adjustingconnectionweights[28]. Despite
laying
the foundation for futurework, lackofcomputingpowerduring
thistime ledto a decline in the interest of neural networks until the 1980's. In 1984 DARPA began
experimenting with ANNs for both commercial and defense purposes [27]. Since that time
ANNs have been used inawidenumber ofapplications from economic
forecasting
to guidancesystems forautomobiles. Researchershave been attractedtoANNsbecauseoftheireffectiveness
in solving complex problems. The function of these networks depends
heavily
on theconstructionofthenetwork andtheway inwhich networks aretrained.
The constructionoftheANN consists ofits noderepresentation, its
interconnections,
andan activation rule. Figure 3 represents one possible construction of a neural network. All
networks consist of an input layer of nodes and an output layer ofnodes; however the hidden
layer of nodes may not always be needed
depending
on the application. More complexapplications often require multiple hidden node layers. The interconnections between the nodes
also affecttheresults ofthenetwork. Eachnode isnotnecessarilyconnectedto everyothernode
inthe
following
layer,
althoughthey
oftendo,
changingthe connections inthenetwork canresultin a change in performance. As the signals are passed from node to node
they
are also oftensubject to an activation rule. The activation rule affects when the input signals are used to
produce the corresponding output signal
by introducing
delays into the signal path of thenetwork, againtheresults ofthenetwork canbe changed.
The two most basic types ofANN architecture are the feed-forward network and the
feedbacknetwork.Afeed-forwardnetworkispresented witha set ofinputsandthesignal moves
signalmay be passedbackinto thesame layerora previous
layer,
theresults ofthe networkthenbecomemore dependent on
timing
in the system and the convergence ofthe signals [29]. Thisthesisdealsprimarilywithfeed-forwardnetworks.
Node
Interconnections
InputLayer
Hidden Layer
OutputLayer
Output Information
Figure 3: Neural network
The nodes ofthe network maybe configured in various ways to compute the results for
the network. Figure 4 shows a typical node calculation for a network, and this is also how the
nodes usedforthefeature detectors inthis thesisoperate.
Weights
(Wj
)^/
.Input(Ij) T--^(2]
-47] ? outPut
#"
/
bias
a= '^Irputs *Weights '+bias
Figure 4: Nodeoperation
The most important feature presented here is how the weights and biases are used in the
computation of the result. Each input value is multiplied
by
a weight determinedduring
thelearning
process.All oftheseweight and input combinations arethen summed and offset with afunction. Transferfunctions modifytheresult to fall within acertain rangeandaffectthe results
ofthenetwork. Examplesoffunctions used aretheinversetangentand
log
functions [27].The
learning
process, ortraining
process, greatly affectsthefunctionality
ofthenetwork. Allnetworks mustconsist ofalearning
rule thatdictates howto adjustweights andbiases basedon a given set of inputs.
Training
can be subdivided into either supervised or unsupervisedtraining. Withunsupervised
training
thenetworkisdevelopedusing onlythe inputvalues, whereas supervised
training
combines the inputs with known output values.Generally
the moreinput/output pairs used in the
training
process the better the results ofthe network. The goal oflearning
algorithms is to adjusttheweights of a network, so that as many inputs as possible aregiventhecorrect output.Thisthesiswillusetheback-propagation
training
methodtodevelop
thenetworks used in feature detection. The approach taken
by
the back-propagation algorithm is tostart at the output layer andpropagate the errorbetween the desiredand actual result backwards
through the hidden layers. Weights and biases are then adjusted at each node based on the error propagation [29]. Nowthat abasic understanding ofANNs has been established, it is important to look at how such networks have been used in the past to detect both features and with pose estimation.
While ANNs have been used in many applications and research, this section overviews
sources that are the most relevant to facial detection and the approach taken with this thesis. A
good example ofANNsusedto determinethe presence of aface is presented in [30]. Images are
firstpreprocessedto reduce noise and
lighting
effects onthe image. Thealgorithm usesANNs ina series ofstages. Stage one usesANN as
filters,
these filtersare appliedto theimage atdifferentscalesanddetermineswhetherthe20x20sub-image containsa
face,
thesame networkalsolooks for localized facial features within these sub-images,by breaking
the image down even further.The first network also makes the determination as to whether or not a face is present in the
system from the results ofthe filtered sub-image. The second stage in the method develops an ANN that works to eliminate false detection. This second network was trained using the information from several stage 1 networks, which produced different results due to training. A
face is only detected ifmore than one network agreed a face is present. This method claims an
Feature detection using ANNs is not new to image processing. Reference
[31]
is anexample ofworkusing networks for facial detection. This research does notlook fortraditional
physical
features,
such as eyes ornose,butrather uses featuredomains,
such as Pseudo MomentInvariant
(PZMI),
Zernike Moment Invariant(ZMI),
andPrinciple Component Analysis (PCA).These facial domains are first calculatedwithin an ellipse aroundtheface. The resulting features
from each of these domains are fed into a Radial Basis Function neural network where the
selected image is classified as either a face or non-face.
Using
feature domains instead oftraditional features allows their system independent of shape changes that can occur
during
directional changes in the face direction. In
fact,
the method can detect the presence of a faceeven at the 90 degreemark. An overall face detection rate of99.7% is given for the results, but
the set oftest images were
highly
constrained and are not typical of a general collection ofimages.
The methods presented in this thesis differ from previous work as follows. First two
completely independentnetworks are usedforthe detection ofeyes and mouth. Theeye network
classifies regions as either eye or non-eye and the mouth network classifies regions as either
mouth or non-mouth. The networks used here make only a localized classification of
features,
not a classification ofthe face as a whole. Most methods
dealing
with ANNs tend to be high incomputation requirements, so every attempt has been made to reduce the number of features
detected with the ANNs as well as the size on the inputted sub-image. The size ofthese
sub-images is relatively fixed comparedto themethods above. The methoddeals with scaling issues
by
using skin detection and known anthropometric proportions. Chapter 3 discusses how theChapter 3:
Feature Detection
andNormalization
The method of pose estimation presented here is
feature-based,
and the detection offeatures stronglyaffectstheresults ofthealgorithm. Thischapter is dedicatedtothepresentation
of novel work with feature detection using ANNs and template matching. Figure 5 shows the
processingthatanimageundergoes
during
theproposedpose estimationapproach.Image Result
> Skin Detection > Normalization : =.
J"*"8
> template Geometricpose detection Matching estimation
Figure 5: Algorithm flow
This chapter goes into depth on each ofthese processes except for the final estimation,
which is covered in Chapter 4. The next section introduces the limitations placed on feature
detectionandthemethod as a whole andthereasoning behindsuchlimitations.
3.1
Limitations
andDifficulties
As stated in chapter two, feature detection with faces is not a simple task and can be
approached in a variety of ways. Our goal forpose estimation is to
develop
the most efficientform of pose estimation possible, while still maintaining a good degree of accuracy. Since pose
estimation and the issue offacial recognition are large in scope, it is necessary to first place
limitations ontheexpected abilities ofthepose estimation method.
The determination offacial pose in allthree axes ofrotation,requires a complex method
as shown
by
previous examples, and may not always be possible for the task at hand.Also,
certain applications, such as
tracking
could only require one direction of change.Thus,
the firstlimitation ofthe method is the restriction to only determine the yaw angle directional change.
The determination ofthis angle has the most potential to be applied to other applications like
classification and recognition.
Focusing
on only one angle of change allows for a greaterunderstanding ofhow features are deformed and detected asthis pose changes.
Limiting
pose totranslationalposechanges,while reference
[13]
centers onpitch change. This doesnot meanthattheproposed method couldnotbe expandedto otherdirectional changes, asdiscussed later. Another limitation is the presence of occlusions in the image. Occlusions result from
objects in animage that can obstruct the viewoffeatures or the head as a whole.
Hats,
scarves,glasses and even facial hair can be considered occlusions.
Dealing
with occlusions has been themain focus of some researchers
[16],
but for the most part the problem remains unsolved. Ifa feature cannot be seen in the imageby
the human eye, most feature detectors will likewise beunableto determinethe location. Theproposedmethodassumes thatno occlusions arepresent in theimage.
In addition the method does not supportemotion inthe facial images. Ifthe person in an
image is smiling or
frowning,
the features can change intheirapparent shape. It is assumedthatthepersonhasastoic expressionwiththeirmouth closed and eyes
fairly
open.The final constraint is how far the face may turn in either direction. With the yaw and
pitch pose changesthereexists a
"vanishing
point"
for facial features. This iswhere thefeature is
no longer present or detectable in the image. With pitch, as the head descends below the horizontal plane the nostrils disappear from viewrapidly; other
features,
such asthe mouth andeyes, later become obstructed
by
thetop
ofthe head. With yaw directional changes, the eye in the directionofchange becomesthe first frontal featureto be lost fromview. Posedetectionpastthis point becomes impossible without the use of other features [19]. A second issue with features at extreme angles is their deformation. The shape ofthe eyes and the shape of mouth
will change as theface turns in one direction or another.
Training
on the mouth or eyes attheseextreme angles degrades the ANN and therefore overall detection. This method limits the
directional change in the yaw direction to between 30 and -30 degrees. In the results section
additionalresults are presented from 45 to -45, so the reader may betterobserve the degradation
in performance as the angle increases beyond design limitations. Other limitations in the
algorithm or its implementation are more trivial and were caused
by
problems inherent to thesoftware orhardware. These further limitationsare presentedintheirrespective sections.
Now that thelimitationsplaced on the method are understood, we can introduceour own
method offeature detection using ANNs. The first step inthis processis skin
detection,
whichis3.2
Skin Detection
Skin detection is anecessary step in feature detection for three reasons. First the ANN
tends to be computationally expensive, so if skin detection can be used to only search for features within a predetermined area instead ofthe entire
image,
the overall computation ofthesystem is reduced.
Second,
once the facial region is determined through skindetection,
the features ofthe face must, therefore, also be present in this region. The helps eliminatefalsely
detectedpoints.
Finally,
thefeature detectors arelimited intheirinvarianceto scale, (see Chapter5 for error analysis). The feature detectors workbest ifthe image is a certain size and the eyes
and mouth fall within the size limitations ofthe network. This greatly reducestherobustness of the algorithm, since images will rarely be ofthat specific size iftaken in the real world. Skin
detection,
therefore, allows scaling the image to the necessary size, based on the largest skinregion. The pose estimation should occur regardless ofthe initial size ofthe image as
long
as detectionof skin andfeatures is correct.Ourfirst attemptat skindetectionwas withtheuse oftheRGB histograms. Thisfirstskin
detection method was too sensitive to
lighting
conditions and tended to detect backgroundportions ofthe imageas skin. The use ofthe
YCbCr
colorhas been shown in references[4] [25]
tobe insensitive tolighting
conditionsandtherefore seemed appropriatetoour application.The first step inskindetection istobuildthegroundtruth informationthatwillbeusedto
determine the presence of skin. Color facial images of size 1600x1200 were gathered from ten
individuals ofvarying ethnic groups,theirskin regions were then selected manually.The images
were converted to the
YCbCr
color space. This resulted in millions of color tone pixels, whichcanbeused for future comparison.A2-D histogramwas developed which is shown in Figure 6. This figure is similar to other histograms presented
by
reference[25],
even thoughthey
used amuchlarger set ofweb-based images.
The next step was to process this information into a
lookup
table using theCr
andCb
color values and the probabilities from the histogram that a pixel ofthis coloring is skin. A
threshold probability was used to eliminate outliers on the histogram. The elimination ofthese
pixels helped to avoid errors caused
by
the manual detection of skinduring
the collection3D plot ofCR and CB inskin range
"
Figure 6: Skin detection histogram
Skin detection is performed on the image using this
lookup
table. The color image isscanned pixel
by
pixel and each pixel is converted from RGB toYCbCr
using Equations (l)-(3). Thevalues fortheCb
andCr
are comparedto thetable,andiftheirvalues are abovethe thresholdprobability, then thatpixel is considered to be skin. It can benotedthat other methods
[24] [25]
make more complex uses of colorprobabilityto eliminatedfalse detectionof skin. Inour methodthis is not necessary, since we use morphological operations to eliminate possible false points
and fill in anyundetectedareas.
An example ofthe results produced
by
the skin detection is shown in Figure 7. At thispoint a skin mask is created from the detected area that needs to be processed with morphologicalfiltersto remove straypixels andfill ingapsinthe image. The firstmorphological
process is closing, i.e. dilation followed
by
erosion.During
the dilation process a 3x3 block is passed over each pixel in theimage,
ifthe center pixel is detected as skin then the neighboring pixels are also marked as skin. Dilation has the overall effect ofincreasing
the size ofthe skin region and reducing thenumber ofholes in the interior.During
the erosion step a 3x3 block isother skin pixels are nowremoved.
Closing
has the effect offilling
in interiorparts ofthe image thatwerenotdetectedas skinduring
processing. Figure 8 shows theresults oftheMorphological operations. Fromthis figuretheareas forthe eyesandopenareas inthe foreheadand cheekhave beenfilledin allowing for smoother regions.20 40 60 BO 100 120 140 160 180 2CD 100 120 140 160 180 200
Figure 7: Original imageanddetectionskin area
20 40 60 ICO 133 1 160 180 230
Figure8: Resultof morphological filters
The nextstep in theprocess is imagesegmentation shown in Figure 9. In this
figure,
eachregion has been given a different gray level to make itmore distinguishable.
During
this process20 40 80 80 100 120 140 150 180 393
Figure 9: Skinregionimage
The facialregionis determinedtobe theregion with themost skinpixels; thiscan alsobe
viewed as the region with the largest area. The image goes through the final processing stage
whichremoves alltheother regions.
During
thisstagethefacialregionis alsofilledin,
removingany remaining gaps left out
by
the morphological processing. This final step is necessary toallowfor images ofdifferentsizes and viewing distancestobe processedcorrectly,closingalone
cannot always accomplishthis unless it is done multiple times which is costly. The final region
imageis shownbelow in Figure 10.
40 B0 B0 1D0 120 140 1B0 1B0 200
Figure 10: Final facialregion
Someother examples ofthe skindetectionprocess
help
to showits limitationsandrobustness. Figure 1 1 through Figure 13 showwhathappenswhenother skintoneareas are
Figure 1 1: Original image withhand
Figure 12: Region imagewithhand
The hand in the image is separated out
during
the region process and later removedleaving
onlythefacialarea. Withmultiplepeople orfacesinan imageasimilar affectoccurs, theface which is closestto the camera is determinedto bethe facial region andthe pose estimation
process will occur on that region. Figure 14 introduces some of the problems with the skin
detectionmethod.
Figure 14: Effect ofconnectedskinregions
The above figure shows what happens when another flesh tone object comes into
contact withtheheadregion. Thetransitionbetweenthe two objects isnot great enough forthem
tobe separatedout
during
the region process. Thisalsohappens ifmultiple faces arein an imageand are
touching
or at least appear thatthey
are connecting. The result is a much larger facialregionthanis actuallythere andtherefore
incorrectly
scales theimageduring
normalization. Thebest wayto deal withregions likethis isto explore methods of elliptical shape
fitting
to findtheactual face in the image.
Unfortunately
most methods that accomplish this are complex andoutsidethe scopeofthisthesis,this doeshoweverprovide an area offuturework.
Figure 15 showsthat the skintable can also handle other skintones than thosepresented
in the previous examples. With using the
YCbCr
color space a few colors are sometimes eitherdetected
falsely
or not atall. The most apparent colorthat has problems is red at either high orlow luminance. Theredness in facesthatoccurs with
blushing
has
the same chrominance asdarkor bright reds. One possible way to correct this false detection is to avoid pixels at extreme
luminance.
Likewise,
certainhairtones arealso sometimesfalsely
detectedas skin dueto similart"* fT ":'-*
-.'-'^.'^7,"i^q. 2CD 300 400 500
_
300 -100 500 600
Figure 15: Skintones
While the skin detection method is
by
no means perfect, it does satisfy the needs ofthe algorithmby
providing a method to scale the images and to limit the search area for facial features. The next section discusses how the facial region is now used to normalize the imagealongwithotherimage processing methods.
3.3
Image Normalization
In order to improve feature detection and therefore pose estimation, it is necessary to introduce methods thatimprove the quality ofthe image and make it suitable for processing
by
the ANNs. This processing isknownas image normalization.The two stepsofnormalization for thismethod arescalingand contrast stretching.
Oncethefinal skin regionhas been
located,
like in Figure10,
theoriginal image cannow be scaled. The image is scaled eitherup or downdepending
on theviewing distance and size ofthe original image. First the height and width of the facial region is measured. Next the proportion ofthe facialregion heightto the facialregion width is analyzedto seewhetheror not theneckhas been included intheregion like in Figure 13 orjustthe facial areaas in Figure 10. If the width ofthe face is not greater than 2/3rds ofthe
length,
than the region includesthe neck and mustbe scaled with adifferentbase value. The 2/3rds value was determined from previouswork related to anthropometric studies
[2]
and from our own experimentation with differentfacial proportions. The width ofthe facial region changes as pose changes become larger at
ontheheight ofthe images usedin
training
forthe ANNs. This is theheightthe image mustbeforthefeature detectorstocorrectly detecttheeyes andthemouth.
Scalerate=
(4)
HeightofFacial RegionAfter the scale rate has been determined using Equation
(4),
the image is scaled downwith bilinear interpolation. The image is then converted from 24bit RGB to 8 bit gray scale for
the nextstage ofnormalization,contrast stretching.
The contrast of an image is the distribution ofits lightand dark pixels. An imageoflow
contrasthas a small difference between itspixel values. The histogramof alowcontrast image is
usually skewed eitherto the
left,
as in the case oflightimages,
or to the right, as in the case ofdarker images. Histograms located around center are mostly gray. Contrast stretching is a
techniqueusedto stretchthehistogramof an image sothat the fullranges of pixels values in the
imageare filled. Contrast stretching helpsmakefeatures standoutmore inthe imagecomparedto
skin areas.
The method of contrast stretching implemented here creates a histogram of pixel values.
The histogram is then clipped between the 0.5 and 99.5 percentiles-to eliminate outlier values
and forrobustness. The highest and lowestvalues from the histogram are then setto 255 and 0
respectively. All other pixel values are then scaled according. This technique works best when
the histogram ofthe image is Gaussian inshape, butdoesnotmake full use allowedvalues [32].
Contrast stretchingalsohelps deal with
lighting
effectsto alimited extent.Once the imagehas beennormalizedto the correct shape and contrast, the ANNs can be
used to determine final features. The next section introduces how the ANNs were created and
trained to suittheneeds ofthefeature detectors.
3.4
Neural Network
Training
Before the ANNs can be used for detection offacial
features,
they
must first be trainedusing images of the desired features. The images used in the
training
stage were from the"Database of
Faces"
provided
by
AT&T Laboratories Cambridge [33]. This database consists ofare ten
images,
taken under controlledlighting
and distance conditions, at various poses. Theorientation ofthe faces varies slightly in both pitch and yaw, but exact angles are not given.
Frontal features are viewable in all images. Images containing facial
hair,
glasses,blinking,
oropen mouths were eliminated from the
training
sets. These images were eliminated sincelimitationsdonot allowfor facialexpressions orocclusions. Sample imagesareprovidedbelow.
Figure 16: Sample faces for
training
Thenext stepwas to create sub-images ofboththe eye andthe mouth. Foreachimage in
the
training
set, the center ofthe right eye, left eye and mouth were manually selected. Asub-image of21x11 was then extracted around the eyes. For the mouth a sub-image of 33x13 was
used. The determinationofthesub-image size isatradeoffbetween speed and accuracy.Alarger
sub-image size contains more image
information,
but also produces a much larger network andrequires more time for both
training
and detection. If the sub-image is too small, not enoughinformation is captured to make the feature different from other areas ofthe face.
Thus,
thenetworkmaytrainor perform poorly.
In orderto train anetwork correctly non-feature images needed tobe collected from the
facial images. For each feature image collected, five non-feature sub-images were collected, of
thesame size, fromthe aroundtheface. Non-feature images includeareas like facial edges,
hair,
eye
brows,
nostrils and skin areas like the cheek. The figures below show some examples ofML;: * B MB Tij
Figure 17: Eve feature images
Figure 18: Non-eve feature images
? .
_
T
Fiaure 19: Mouth feature images
Figure 20: Non-mouth feature images
The collection of sub-images in this fashion resulted in a significant number of
sub-images to be used for training. The initial
training
set for the eye consisted of276 eye imagesand 1380 non-eye images. Eye images included bo