Efficient Human Facial Pose Estimation


Rochester Institute of Technology

RIT Scholar Works

Theses

Thesis/Dissertation Collections

2004

Efficient Human Facial Pose Estimation

James C. Schimmel

Follow this and additional works at:

http://scholarworks.rit.edu/theses

This Thesis is brought to you for free and open access by the Thesis/Dissertation Collections at RIT Scholar Works. It has been accepted for inclusion in Theses by an authorized administrator of RIT Scholar Works. For more information, please contact RIT Scholar Works.


by

James C. Schimmel

A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of

MASTER OF SCIENCE in Computer Engineering

Approved by:

Principal Advisor: Dr. Andreas Savakis, Associate Professor and Department Head

Committee Member: Dr. Juan C. Cockburn, Associate Professor

Committee Member: Dr. Lawrence Ray, Research Associate, Eastman Kodak Company

Department of Computer Engineering

College of Engineering

Rochester Institute of Technology

Rochester, New York


RELEASE PERMISSION FORM

Rochester Institute of Technology

I, James C. Schimmel, hereby grant permission to the Wallace Library of the Rochester Institute of Technology to reproduce my thesis in whole or in part. Any reproduction will not be for commercial use or profit.


Abstract

Pose estimation has become an increasingly important area in computer vision and, more specifically, in human facial recognition and activity recognition for surveillance applications. Pose estimation is a process by which the rotation, pitch, or yaw of a human head is determined. Numerous methods already exist that can determine the angular change of a face; however, these methods vary in accuracy, and their computational requirements tend to be too high for real-time applications. The objective of this thesis is to develop a method for pose estimation that is computationally efficient while still maintaining a reasonable degree of accuracy.

In this thesis, a feature-based method is presented to determine the yaw angle of a human facial pose using a combination of artificial neural networks and template matching. The artificial neural networks are used for the feature detection portion of the algorithm, along with skin detection and other image enhancement algorithms. The first head model, referred to as the Frontal Position Model, determines the pose of the face using two eyes and the mouth. The second model, referred to as the Side Position Model, is used when only one eye can be viewed and determines pose based on a single eye, the nose tip, and the mouth. The two models are presented to demonstrate the position change of facial features due to pose and to provide the means to determine the pose as these features change from the frontal position. The effectiveness of this pose estimation method is examined by looking at both the manual and automatic feature detection methods. Analysis is further performed on how errors in feature detection affect the resulting pose determination. The method resulted in the detection of facial pose from 30 to -30 degrees with an average error of 4.28 degrees for the Frontal Position Model and 5.79 degrees for the Side Position Model with correct feature detection.

The Intel(R) Streaming SIMD Extensions (SSE) technology was employed to enhance the performance of floating point operations. The neural networks used in the feature detection process require a large amount of floating point calculations, due to the computation of the image data with weights and biases. With SSE optimization the algorithm becomes suitable for


Table of Contents

List of Figures
List of Tables
Chapter 1: Introduction
Chapter 2: Background on Pose Estimation
2.1 Anthropometric Research
2.2 Appearance-based Approaches
2.3 Feature-based Approaches
2.4 An Overview of Skin Detection
2.5 An Overview of Artificial Neural Networks
Chapter 3: Feature Detection and Normalization
3.1 Limitations and Difficulties
3.2 Skin Detection
3.3 Image Normalization
3.4 Neural Network Training
3.5 Feature Detection
3.6 Template Matching Algorithm
Chapter 4: Pose Estimation Theory and Models
4.1 Frontal Position Head Model
4.2 Frontal Position Head Model Pose Estimation Process
4.3 Side Position Head Model
4.4 Side Position Head Model - Pose Estimation Process
Chapter 5: Pose Estimation Results and Accuracy
5.1 Image Set
5.2 Manual Feature Detection
5.2.1 Frontal Position Model
5.3 Automatic Feature Detection
5.4 Error Analysis and Propagation
Chapter 6: Implementation and SSE Optimizations
6.1 Matlab Implementation
6.2 C++ Implementation
6.3 Intel SSE Optimizations
6.4 Performance Analysis
Chapter 7: Conclusion and Future Work
List of References


List of Figures

Figure 1: Possible angles for pose estimation
Figure 2: Example head model using anthropometric measurements
Figure 3: Neural network
Figure 4: Node operation
Figure 5: Algorithm flow
Figure 6: Skin detection histogram
Figure 7: Original image and detection skin area
Figure 8: Result of morphological filters
Figure 9: Skin region image
Figure 10: Final facial region
Figure 11: Original image with hand
Figure 12: Region image with hand
Figure 13: Final facial area with hand
Figure 14: Effect of connected skin regions
Figure 15: Skin tones
Figure 16: Sample faces for training
Figure 17: Eye feature images
Figure 18: Non-eye feature images
Figure 19: Mouth feature images
Figure 20: Non-mouth feature images
Figure 21: Eye network configuration
Figure 22: Mouth network configuration
Figure 23: Transfer functions [27]
Figure 24: Eye network ROC graph
Figure 25: Mouth network ROC graph
Figure 26: Process of scanning the image
Figure 27: Original color image
Figure 29: Grayscale image
Figure 30: Mouth weighted location map
Figure 31: Eye weighted location map
Figure 32: Eye weight location map after average filter
Figure 33: Facial templates
Figure 34: Final applied template
Figure 35: Head model - overhead view
Figure 36: Head model - overhead view during rotation
Figure 37: Head model - side view
Figure 38: Geometric template on the 2-D plane
Figure 39: Magnification during image capturing
Figure 40: Theoretical angles due to pose
Figure 41: Comparison of theory angles to a real face
Figure 42: Eye angles alpha1 vs. alpha2
Figure 43: Inverse graph of alpha1 vs. alpha2
Figure 44: Head model - overhead view with nose
Figure 45: Head model - rotated overhead view with nose
Figure 46: Head model - side view with nose
Figure 47: Geometric template on the 2-D plane
Figure 48: Theory angles for side view head model
Figure 49: Comparison of theory angles to real face for side view head model
Figure 50: Pose angle vs. the proportion of template angles
Figure 51: Linear approximation for pose from theory data
Figure 52: Linear approximation for pose compared to image data
Figure 53: Sample test images
Figure 54: Best case manual detection pose results, Frontal Position Model
Figure 55: Best case manual detection error, Frontal Position Model
Figure 56: Worst case manual detection pose results, Frontal Position Model
Figure 57: Worst case manual detection error, Frontal Position Model
Figure 58: Best case manual detection pose results, Side Position Model
Figure 60: Worst case manual detection pose results, Side Position Model
Figure 61: Worst case manual detection pose error, Side Position Model
Figure 62: Sample comparison for pose results
Figure 63: Error in the template
Figure 64: Pose error due to feature error in closest eye
Figure 65: Pose error due to feature error in farthest eye
Figure 66: Pose error due to feature error in mouth
Figure 67: Pose error due to error in all features
Figure 68: Allowed error in feature detection
Figure 69: Graphical representation of error
Figure 70: 3-D plot of pose error due to relative error in eye detection
Figure 71: 3-D plot of pose error due to relative error in mouth detection
Figure 72: Monte Carlo simulation, introducing error into all features
Figure 73: Monte Carlo simulation, effect of varying the error range
Figure 74: Effect of scaling on eye detection
Figure 75: Effect of scaling on mouth detection
Figure 76: Matlab functions
Figure 77: UML diagram
Figure 78: Typical xmm register representation
Figure 79: Example of SSE operation
Figure 80: ANN optimization
Figure 81: Color conversion optimization
Figure 82: Relative timing in functions with Athlon processor
Figure 83: Relative time in functions with Pentium 4 processor
Figure 84: Relative time in functions with Pentium 4 and SSE optimizations
Figure 85: Effect of image size on optimizations


List of Tables

Table 1: Typical confusion matrix
Table 2: Eye network confusion matrix
Table 3: Mouth network confusion matrix
Table 4: Pose error for 45 to -45 degrees using manual feature detection
Table 5: Pose error for 30 to -30 degrees using manual feature detection
Table 6: Pose error for 20 to -20 degrees using manual feature detection
Table 7: Pose error for 45 to -45 degrees using manual feature detection
Table 8: Pose error for 30 to -30 degrees using manual feature detection
Table 9: Pose error for 20 to -20 degrees using manual feature detection
Table 10: Pose error for 30 to -30 degrees using automatic feature detection
Table 11: Comparison chart for all pose methods
Table 12: Pose error due to feature error in closest eye
Table 13: Pose error due to feature error in farthest eye
Table 14: Pose error due to feature error in mouth
Table 15: Pose error due to feature error in all features
Table 16: Average and standard deviation values for Monte Carlo simulations


Chapter 1: Introduction

Facial pose estimation has become an important research topic in computer vision due to the large number of applications that may benefit from an efficient method. Faces in images rarely appear frontal and upright, so detection of faces and their associated information has proven difficult. A face can undergo any of three rotations out of the plane of the image. Such rotations introduce deformations into the image, affecting both the facial features that are present and the overall shape of the head. Pose estimation provides a way to detect the change in direction and then use this information to augment further processing of the image. Real-time tracking of human faces, image classification, and 3-D modeling are just some of the current applications in which pose estimation can play a significant part. Current methods for facial pose estimation range from using wavelet networks to calculating pose through geometric means.

The problems of facial recognition and facial pose have been around for some time. The human visual system can easily detect not only a human face, but also the relative direction in which that face is aligned in 3-D. In computer vision systems those same visual tasks have proven difficult. Variations in lighting, image size, orientation, viewpoint, body movement, and facial expression have created a wide range of difficulties that must eventually be overcome. The scope of human visual recognition is vast, making it important to subdivide the problem into smaller tasks, such as pose estimation.

While pose estimation has been examined by a number of researchers, papers on this subject tend to combine pose estimation with other forms of facial detection or 3-D modeling. Estimating the pose of a human face may be performed in a variety of ways, including the choice of which directions of rotation to examine. The 'yaw' angle of rotation occurs when the face turns right or left from the frontal position. The 'pitch' of the face is the degree to which the face is turned vertically up or down. The 'roll'

Figure 1: Possible angles for pose estimation

In this approach to pose estimation we focus primarily on the yaw angle for pose change. The yaw angle has the greatest possible benefit to applications such as tracking and modeling, since it is a good indication of the direction a person is facing or heading. By focusing on only the yaw angle change, we eliminate many of the complications associated with facial recognition and focus closely on the problem at hand.

Methods of pose estimation are often broken down into two categories: feature-based or appearance-based algorithms. With feature-based pose estimation, the determination of pose is based on the location of individual features. These methods require the use of feature detectors to determine the location of important features in the face. With appearance-based methods, pose is determined based on the face as a whole. Often appearance-based models incorporate some type of filter to analyze the image and then make a decision from the resulting image. An overview of feature-based and appearance-based techniques is provided in Chapter 2.

With feature-based approaches to pose estimation, the way in which features are detected is just as important as which features are detected. For this method we have chosen to use Artificial Neural Networks as the basis for the feature detectors of the system. Artificial Neural Networks (ANNs) are an attempt to allow computer systems to learn and adapt by modeling the structure of the neurons in the human brain. ANNs learn by adjusting their internal weights and


Once the features for the face have been determined, the question of how to determine pose from these features is raised. Again there are a number of ways in which to determine pose from known feature points; these methods depend on the number of features and often on the methods used to acquire them. In Chapter 4 our own models are presented, which use three feature points on the face. The first model, referred to henceforth as the Frontal Position Model, makes use of the detection of the two eyes and the mouth. This model makes use of the inherent symmetry in the face and is intended primarily for detection of pose in frontal images. The second model, referred to as the Side Position Model, makes use of the detection of one eye, the nose tip, and the mouth, and potentially detects pose at greater angle variations. Chapter 5 discusses the results of these models in detecting pose and also introduces some error analysis regarding how errors in feature detection propagate through the system, affecting pose estimation.

Since the efficiency of the algorithm is important, it is necessary to observe how current technologies might be used to increase the system's speed. Intel's SSE technology provides the best means to enhance the performance due to the algorithm's heavy use of floating point operations. By parallelizing many of the calculations performed by the ANN we achieve a noticeable performance improvement of the algorithm. Chapter 6 shows how the algorithm was implemented in Matlab, C++, and C++ with SSE optimizations. Performance analysis is conducted to show the speedup from SSE optimizations.

The techniques presented here, such as the use of ANNs for feature detection, may later be extended or adapted to other methods. The models presented are also generic enough to later be expanded to include other angles of directional change. The most important contribution of this thesis is an efficient, yet accurate, means of pose estimation that can easily be integrated in


Chapter 2: Background on Pose Estimation

In understanding the origins of the pose estimation problem, it is important to look at past research related to this topic. Comparing previous techniques in pose estimation helps serve as a reference to the method presented in this thesis. Not only is it necessary to look at other methods of pose estimation, but also at methods that are related to facial recognition, such as feature detection and skin detection. This chapter also provides a brief overview of Artificial Neural Networks with regard to their general operation and how ANNs have been used in other research areas.

2.1 Anthropometric Research

Anthropometry is a field of study that involves the measurement of the human body and its respective parts. Anthropometric research covers not only the size of human features, but also their relative proportions across different races and ages. While this area of study is relatively old, it still has many practical uses. With pose estimation and facial feature detection, anthropometric research can be used as the starting point for building a human head model. Such models can use known anthropometric data and information to determine the relative location of facial features.

Examples of anthropometric study have been seen throughout human history. Most people are familiar with Leonardo da Vinci's "Universal Man." This work of art makes note of many proportions of the human body that appear to be common among all humans. While this work focuses on limb length and height, it makes the point that using anthropometric data is well-known. In the 1970s the scientific community began to take a more serious look at anthropometric data by developing large databases [1]. This information has been used for applications such as ergonomics and other biometric applications. These databases took measurements of various body parts for thousands of individuals. Again this data focused mostly on height and limb proportions, but research has also been done with facial measurements. Reference [2] offers one such example of research that has been done using facial proportions.

To make any pose estimation process useful it must be person-independent across the spectrum of facial appearances, so with anthropometry the question arises as to how useful such data might be towards facial pose estimation. Reference [3] by Horprasert and Yacoob supports


both genetic and environmental constraints. The shape of the eyes, the relative feature distances, and even the size of the head are all limited within certain ranges regardless of the individual being observed. Reference [3] uses anthropometric data extensively, even classifying the individual's gender, race, and age before performing its tracking operation. Such methods require no prior knowledge of the individual in tracking or facial location, and therefore tend to be more adaptable to other applications.

Even though all human proportions are limited in their degree of variability, there still exists room for error due to these feature differences. The greatest source of error can often come from young children, since their relative face proportions are closer together at birth [4], most notably the distance between the eyes. As the child develops, and the nose becomes more pronounced, the proportions change. Race and gender also play an important role in feature placement and proportions, but to a lesser extent. The goal then is to develop a head model that is generic enough to fit most individuals, but can use differences in the placement of these features to determine pose.

Reference [2] by Alattar and Rajala offers a good example of a head model built using anthropometric data. This model is presented in Figure 2.

Figure 2: Example head model using anthropometric measurements

The work presented in this reference primarily uses elliptical approximations to first determine the location of the head and then incorporates the model to help localize the features. Image


Researchers often use the term model-based in reference to their approach for dealing with facial recognition or, more specifically, pose estimation. Model-based approaches simply use some form of a human head model to accomplish their goal, and anthropometric research usually plays a part in the development of this head model. While the bulk of model-based approaches are also feature-based approaches, some appearance-based methods for pose estimation also include the use of a head model and, therefore, can be considered model-based. The debate over what terminology to use in dealing with certain methods is explored in a number of other papers [3] [4]. Since the method presented in this thesis is primarily a feature-based approach, breaking down other research into either appearance-based or feature-based categories helps in understanding how this method differs from previous methods. Model-based approaches are classified as being feature-based if they make any use of facial features; otherwise, they are considered appearance-based.

2.2 Appearance-based Approaches

Appearance-based approaches, also known as view-based or holistic approaches, attempt to determine the pose of a human face by looking at the face as a whole. Methods vary from more complex techniques, such as Gabor wavelet filters and neural networks, to simpler ones like color-based comparison. Appearance-based methods seek to avoid the decomposition of faces into parts such as features, and instead look at the face in its entirety. Once the face has been determined, these methods base their classification of the face on preexisting data of faces taken at various poses.

The advantages of appearance-based methods often depend on how they are implemented, but a general overview follows. Appearance-based models require information regarding the face as a whole and are not based on individual features; thus, the amount of training and feature extraction required is reduced. These methods in general require only low resolution images. Reference [5] requires only a 20x20 image of the face to make its classification. With such low resolution, the computation required has the potential of being quite low; however, with present approaches the opposite has proved true.

The disadvantages presented by appearance-based methods are the primary reasons why work in this area has been limited compared to its feature-based counterpart.


Often getting the face into a form suitable for classification requires a significant amount of preprocessing. Finally, appearance-based methods tend to be less accurate and less robust than other methods, since they require computing a global, lower-dimensional representation of the image information, although this is subject to debate [3][4][8]. Since the problems associated with appearance-based methods are well known, it is important to look at how other research has attempted to deal with the problem of robustness.

An approach taken by some researchers [4] [6] [15] has been to use eigenvectors and eigenvalues to compute an eigenspace that captures the facial information in a low dimensional feature space. An eigenspace is computed using the eigenvectors from the covariance matrix of the training set. The eigenvectors associated with the largest eigenvalues are used as the representation of the training set. Eigenspaces that are used to represent the face are also referred to as eigenfaces. The approach assumes that low-resolution facial images form a linear subspace that can be used to classify images either for facial detection or pose estimation [4]. With pose estimation the method becomes more complex, requiring multiple eigenfaces to be computed for each pose desired. The use of eigenfaces has the benefits of limiting the influence of lighting and other noise on the image, while providing accurate classification of faces. Reference [6] obtained an 83% rate of recognition using eigenspaces; however, the computational cost of such methods is relatively high.
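The eigenspace construction described above can be illustrated with a minimal sketch. It is not the implementation used in references [4], [6], or [15]; it assumes flattened grayscale images stored as plain Python lists, uses power iteration (a swapped-in, simple technique) to recover only the single dominant eigenvector, and all function names are hypothetical.

```python
import math
import random

def mean_center(samples):
    """Subtract the per-pixel mean from every flattened training image."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(d)]
    centered = [[s[j] - mean[j] for j in range(d)] for s in samples]
    return mean, centered

def covariance(centered):
    """Covariance matrix (d x d) of the mean-centered training set."""
    n, d = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / n
             for j in range(d)] for i in range(d)]

def top_eigenvector(matrix, iterations=200, seed=0):
    """Power iteration: eigenvalue and unit eigenvector of largest magnitude.

    Assumes the matrix is not all zeros (the training images differ).
    """
    rng = random.Random(seed)
    d = len(matrix)
    v = [rng.random() + 0.1 for _ in range(d)]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(v[i] * mv[i] for i in range(d))
    return eigenvalue, v

def project(image, mean, eigenvector):
    """Coordinate of an image along one eigenvector of the eigenspace."""
    return sum((image[j] - mean[j]) * eigenvector[j]
               for j in range(len(image)))

# Toy "images" with four pixels each, varying only along pixels 0 and 1.
training = [[1, 1, 0, 0], [2, 2, 0, 0], [3, 3, 0, 0]]
mean, centered = mean_center(training)
eigenvalue, eigenface = top_eigenvector(covariance(centered))
coordinate = project([3, 3, 0, 0], mean, eigenface)
```

A pose classifier in this style would compare such coordinates against those of training images at known poses. Real systems keep several eigenvectors and typically avoid forming the full pixel-by-pixel covariance matrix, which is one source of the high computational cost noted above.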

Another area of appearance-based pose estimation research is using Gabor filters and wavelet transforms to determine pose. Gabor kernels are sinusoidally modulated Gaussian functions. Gabor wavelets are sets of kernels with different spatial frequencies and orientations. A Gabor wavelet transform is produced by convolution with Gabor kernels [4].
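As a hedged illustration of the definition above, the following sketch builds the real part of a Gabor kernel as a Gaussian envelope multiplied by a cosine carrier, and collects kernels of several wavelengths and orientations into a small bank. The parameter names (wavelength, theta, sigma, gamma, psi) follow a common textbook parameterization and are assumptions, not values taken from the cited references.

```python
import math

def gabor_kernel(size, wavelength, theta, sigma, gamma=1.0, psi=0.0):
    """Real part of a Gabor kernel: a Gaussian times a cosine carrier.

    size must be odd so the kernel has a center pixel; theta is the
    orientation in radians; gamma is the spatial aspect ratio.
    """
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the carrier runs along direction theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2)
                                / (2.0 * sigma ** 2))
            carrier = math.cos(2.0 * math.pi * xr / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel

def gabor_bank(size, wavelengths, orientations, sigma):
    """A set of kernels with different spatial frequencies and orientations."""
    return {(w, t): gabor_kernel(size, w, t, sigma)
            for w in wavelengths for t in orientations}

kernel = gabor_kernel(size=5, wavelength=4.0, theta=0.0, sigma=2.0)
bank = gabor_bank(5, [4.0, 8.0], [0.0, math.pi / 4, math.pi / 2], sigma=2.0)
```

Convolving an image with every kernel in such a bank yields the wavelet transform responses that a pose classifier would then consume.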

Using Gabor filters and wavelets allows for better processing and classification of facial images. Wavelet transforms also have the advantage of providing some degree of invariance to illumination, skin tone, and hair color. Complex uses of transforms even allow for selective scaling and determination of orientation. Reference [7] makes use of Gabor wavelet filters in the determination of facial pose. First a classification is set up using 1470 images at 10 degree intervals. These training images are then processed using different Gabor filters. From the initial training a filter is chosen to represent the optimal filter for each pose. During the determination of pose, an image is processed with each optimal filter. The resulting image is then


were poor, achieving the determination of pose only within +/-10 degrees 90% of the time. Gabor wavelets have been combined with other forms of classification such as artificial neural networks and Bayesian networks [8]. The use of ANNs in appearance-based models is presented in Section 2.5. Although Gabor filters have their limitations, they do offer a way of reducing many of the errors that can occur with appearance-based models.

The examples of appearance-based models presented help show their limitations and relative effectiveness in detecting pose. The problems of robustness in such applications, and the general agreement that appearance-based methods are less accurate than feature-based methods, help explain why the feature-based approach was chosen as the preferred method for pose estimation. In the next section we look at feature detection and a number of feature-based approaches.

2.3 Feature-based Approaches

Feature-based approaches depend on the extraction of feature points in an image, and the use of these feature points in some manner to estimate pose. Feature detection is simply a method of identifying the position of a specific part of an image; in this case an eye or a mouth. Often with facial feature detection such methods require the use of a head model to formulate the final calculation, which is why such methods are often referred to as model-based approaches. Feature detection itself becomes the focus of most of these methods due to its difficulty. Since feature detection is so important, it is necessary to look into research in this area along with the ways detected features have been used to determine pose.

An important point of all pose estimation methods is the determination of which human facial features to use. A large number of facial features often requires additional computation but has a better chance of determining pose correctly. Reference [9] uses the detection of any facial features with edges to match to a facial template for 3-D reconstruction; the number of feature points is fairly high and also varies depending on the face. Reference [10] uses fifteen feature points spread around the eyes, nose, and mouth region. Reference [11] uses only five feature points, namely the eyes, eyebrows, and mouth, to detect both face and pose. The use of the eyes as feature points tends to be present in most pose estimation tasks due to their ease of detection from the rest of the face. Eyes, especially the pupils, still have the same general shape across human


rest of the face. Finally, the intra-ocular distance is the measurement of the distance between the eyes. The intra-ocular distance varies from individual to individual, so this distance can be used for pose estimation only if the distance is known for the frontal position of a specific person or can be normalized. Reference [12] states that as facial pose changes, the distance between the eyes also changes in size, making the eyes and the intra-ocular distance ideal for pose estimation methods.
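To make the dependence on a known frontal distance concrete, the sketch below uses a deliberately simplified geometric assumption that is not taken from reference [12] or from the models of Chapter 4: under orthographic projection, with both eyes on a line that rotates about a vertical axis, the projected intra-ocular distance shrinks by the cosine of the yaw angle. The function name is hypothetical.

```python
import math

def yaw_magnitude_from_eye_distance(observed, frontal):
    """Estimate |yaw| in degrees from the projected eye separation.

    Assumes observed = frontal * cos(yaw), i.e. orthographic projection
    and no pitch or roll. The sign of the yaw cannot be recovered from
    the distance alone, so only the magnitude is returned.
    """
    if frontal <= 0 or observed < 0:
        raise ValueError("distances must be positive")
    # Clamp to 1.0 so measurement noise cannot push acos out of its domain.
    ratio = min(observed / frontal, 1.0)
    return math.degrees(math.acos(ratio))

frontal_distance = 60.0                                   # pixels, frontal view
turned_distance = frontal_distance * math.cos(math.radians(30.0))
estimate = yaw_magnitude_from_eye_distance(turned_distance, frontal_distance)
```

Because the cosine changes slowly near zero, a small error in the measured distance produces a large error in the estimated angle for near-frontal faces, which is one reason additional features such as the mouth are useful.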

The mouth is another commonly used feature, but it introduces some challenges to its detection. The mouth tends to vary in shape considerably between individuals. As the face turns, the shape of the mouth deforms at a greater rate compared to the eyes. Finally, from a color perspective, the natural color tone of the mouth often matches the surrounding skin or other facial tones in some individuals. The color of the mouth can also be influenced by the use of makeup such as lipstick. This makes edge detection and color-based methods difficult. The mouth is also sensitive to the emotion of the individual and is influenced by smiling or frowning.

The nose, on the other hand, often causes many problems due to its lack of contrast with the rest of the face; however, methods such as references [10] [13] still make use of this feature with some filtering. Depending on the pose, the nostrils are not always effective detection points, since changing the pitch of the head removes these features from the 2-D image. Also, the nose shape and size may obscure nostril features even in the frontal position.

A number of methods attempt to make use of the head outline to estimate pose. These approaches involve elliptical approximations of the head shape, which can be computationally complex in applications. In addition, hair styles and background interference tend to worsen the results of these experiments, as shown in reference [14]. Facial hair causes problems with elliptical approximations and with the detection or use of the mouth as a feature point. Images with facial hair are usually ignored during testing.

In order to detect the features in a face, it is often important to establish ground truth for these features. With facial images, tagging certain facial feature regions presents the possibility of using these features to establish a ground truth or a standard for when the face is in a frontal position. As the face is turned, the viewable facial feature regions change and, based on their size and shape, pose may be estimated. Such regions may be used in training a system to


Cootes and Taylor offer some unique approaches to feature detection by introducing the use of active shape models [16] [17]. Active shape models are a way to detect features using trained knowledge of the edges of the shape. Active shape models seek to limit the deformation of the model to shapes found in the training set. By approaching feature detection in this manner they seek to have variability in the feature model to allow for noise and occlusion, but limit the amount of false detection that can occur during classification. Methods that involve training usually share this same goal, since false detection of features can severely worsen results.

An alternative to whole feature detection is to detect only part of a feature. Reference [18] uses the corners of the mouth for detection, instead of the entire mouth, to avoid shape changes that can occur. Reference [19] uses separate detectors for the corners of the eyes and the pupil in order to accurately locate the feature.

While training is the best example of using prior knowledge for feature detection, it is not the only approach that can be taken. With color images, a histogram of the feature can be developed from a number of sample images. Features can then be detected by dividing an image into sub-images and finding the sub-image with the greatest correlation to the known color histogram; an example of this approach is taken by reference [20]. However, such methods are limited in robustness, since they do not adapt to occlusions and image quality changes such as lighting or noise. Another similar approach is the use of feature template matching. In these applications, a generic template of the eye or mouth is first created from a test set. Features are detected by correlating sub-images with the feature template and then identifying sub-image locations most correlated with the template. These methods tend to improve on the color histogram approach, but suffer from many of the same problems with occlusions and noise.
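The feature template matching described above can be sketched as an exhaustive search that scores every sub-image against the template. The example below is a minimal illustration, assuming grayscale images stored as lists of rows and using normalized cross-correlation as the similarity measure; that choice of measure and the function names are assumptions, not details taken from the cited work.

```python
import math

def normalized_cross_correlation(patch, template):
    """Correlation in [-1, 1] between two equally sized grayscale patches."""
    a = [v for row in patch for v in row]
    b = [v for row in template for v in row]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a)
                    * sum((y - mean_b) ** 2 for y in b))
    # A flat patch has no structure to correlate with; score it as 0.
    return num / den if den > 0 else 0.0

def best_match(image, template):
    """Slide the template over the image; return the best (row, col), score."""
    th, tw = len(template), len(template[0])
    best_score, best_pos = -2.0, None
    for r in range(len(image) - th + 1):
        for c in range(len(image[0]) - tw + 1):
            patch = [row[c:c + tw] for row in image[r:r + th]]
            score = normalized_cross_correlation(patch, template)
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos, best_score

template = [[1, 9],
            [9, 1]]
image = [[0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0],
         [0, 1, 9, 0, 0],
         [0, 9, 1, 0, 0],
         [0, 0, 0, 0, 0]]
position, score = best_match(image, template)
```

Subtracting the patch means makes the score insensitive to uniform brightness changes, but the search still fails when the feature is occluded or strongly deformed, which matches the limitation noted above.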

Once facial features are determined, it is possible to estimate the final pose in a number of ways. The most computationally efficient way is through the use of geometric means. Proportions between the facial features and their alignment may be used to calculate the angle at which the face is positioned off from center. However, geometric means of pose estimation often lack the robustness of other methods and rely heavily on correct feature detection. References [10] [13] both make use of geometric and linear equations to determine pose.

Template matching is the most common method of determining pose through geometric means. A template is typically composed of a set of points that form some predetermined [...] crosses [11], or pyramidal shapes [13], although more complex patterns have been used, including facial meshes or whole-head generic structures. The algorithm for matching the template to the model or image usually involves using detected features from the images and deforming the template to match the image. The process of fitting a template to detected points is often called a "best constellation search" [20] and can entail a decision-making process to avoid any falsely detected points. The change in the template can be used to determine its orientation in 3-D space using either linear equations or some type of classification system.

A step beyond the use of template matching is using a full 3-D model of the head in pose estimation. In these approaches, a 3-D model of the person under observation is typically used [13]. These methods use features extracted from 2-D images and project them onto this 3-D model. Using the full 3-D model allows the determination of the rotation of the face in all three possible planes and does so with a great deal of accuracy, often within a single degree. In order for these systems to work, a great deal of information must already be known about the subject, which is not practical in most applications.

All of the previously mentioned methods have both strengths and weaknesses. One common problem among them is dealing with falsely detected features. The strategies to improve feature detection typically involve ways to improve both image quality and localization within the image. The next section introduces the use of skin detection as a means of improving feature detection. Using skin regions can help localize features and limit the search area for features; this helps to reduce computation and false detection rates.

2.4 An Overview of Skin Detection

Skin detection is part of the larger problem of skin segmentation and has been applied to a number of applications ranging from scene classification to tracking. Skin detection is the process of determining which colors, pixels, or regions in an image are skin. Skin segmentation uses this color information to extract skin regions out of the image that may correspond to human shapes like hands or a face. The detection of skin color is independent of facial orientation, making it ideal for pose estimation. Also, skin detection is one of the faster facial feature detection methods, making it suitable for real-time systems [24].

The detection of skin is not without its difficulties, and the most notable of these has to [...] different lighting conditions; this problem is referred to as color constancy [4] [24]. Skin color appearance depends on the brightness of the light source used in obtaining the image as well as the surface texture and tone of the color in the image. One approach in dealing with lighting is to use different color spaces for skin detection.

The use of the RGB (Red, Green, Blue) color space in skin detection is common due to the ease of processing such information, since most images are already in the RGB color space. Also, the conversion from one color space to another tends to be computationally expensive. The RGB color space is also the most susceptible to lighting changes, which is why methods of normalization are often introduced. References [4] [23] introduce the use of color histograms for the red and green separations of the image. These separations are first normalized using the blue separation to help reduce variations in lighting and also skin tones from one individual to another.

Another color space that is used in skin detection is HSI (Hue-Saturation-Intensity). Hue corresponds to the attributes of color that allow them to be classified as red-blue-yellow-green, while saturation is the purity of color. This leaves intensity as being closely related to the brightness of the image. To make detection invariant to lighting we ignore the intensity separation during detection. However, one problem with HSI is with pixels of low intensity or saturation; such pixels make the determination of hue unreliable [4].

Similar to HSI is the use of the YCbCr color space. The Cb component is the chrominance of the blue separation while the Cr component is the chrominance of the red separation. The Y component is the luminance of the image and is related to the brightness of the image and lighting. During skin detection, the Y separation is typically ignored to make the process invariant to lighting. The equations for the conversion from RGB to YCbCr are shown below. References [25] [26] use the YCbCr color space in their skin detection method.

Y = 0.257*R + 0.504*G + 0.098*B + 16 (1)

Cb = -0.148*R - 0.291*G + 0.439*B + 128 (2)

Cr = 0.439*R - 0.368*G - 0.071*B + 128 (3)
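As a minimal sketch, the conversion can be applied to a whole image at once with a single matrix product. The function name `rgb_to_ycbcr` is hypothetical; the Y and Cb rows are the coefficients of Equations (1) and (2), and the Cr row uses the commonly published coefficients that accompany them.

```python
import numpy as np

# Rows give the Y, Cb and Cr coefficients for the R, G and B inputs.
_COEFFS = np.array([[ 0.257,  0.504,  0.098],
                    [-0.148, -0.291,  0.439],
                    [ 0.439, -0.368, -0.071]])
_OFFSETS = np.array([16.0, 128.0, 128.0])

def rgb_to_ycbcr(rgb):
    """Convert an (..., 3) array of 8-bit RGB values to Y, Cb, Cr."""
    rgb = np.asarray(rgb, dtype=float)
    return rgb @ _COEFFS.T + _OFFSETS
```

For skin detection only the last two channels of the result (Cb and Cr) would be kept, since the Y channel carries the lighting information that the method ignores.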

Although people generally think of skin color varying greatly among different ethnic groups, in actuality the skin tones cluster together once luminance factors are removed. This is another advantage of using either the HSI or the YCbCr color space [4]. It is this clustering of skin tones that is the basis for most skin detection methods. Looking at the distribution of the non-luminance color separations on a 2-D graph, one can observe that the skin pixels tend to group with an almost Gaussian probability distribution. References [4] [23] [24] actually fit a Gaussian distribution to their color data and determine skin based on where pixel intensities fall within the distribution.
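The Gaussian approach can be illustrated with a short sketch. It is not taken from the cited references: the function names are hypothetical, and scoring a pixel by the unnormalized Gaussian likelihood of its (Cb, Cr) pair is an assumption made for the example.

```python
import numpy as np

def fit_skin_gaussian(cbcr_samples):
    """Fit a 2-D Gaussian to an (N, 2) array of skin (Cb, Cr) samples."""
    samples = np.asarray(cbcr_samples, dtype=float)
    mean = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False)
    return mean, cov

def skin_likelihood(cbcr, mean, cov):
    """Return exp(-0.5 * d^2), where d is the Mahalanobis distance to the mean.

    The value is 1.0 at the center of the skin cluster and falls toward 0
    for chrominance values far from it. Accepts any (..., 2) array.
    """
    d = np.asarray(cbcr, dtype=float) - mean
    inv_cov = np.linalg.inv(cov)
    dist_sq = np.einsum('...i,ij,...j->...', d, inv_cov, d)
    return np.exp(-0.5 * dist_sq)
```

A pixel would then be classified as skin when its likelihood exceeds a chosen threshold, which plays the same role as the threshold applied to a lookup table or histogram.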

Other methods of skin detection use more simplified methods such as lookup tables or histograms. Reference [11] uses RGB values and a lookup table to determine whether or not a pixel is skin. With color histograms, a histogram is made for each separation of color using skin tone information. During the detection process, if a pixel's color information falls within the histogram, then that pixel is classified as skin [23].

More complex methods of detection expand detection to include the use of classifiers such as Bayesian networks [25] or ANNs to make the final determination between skin and non-skin pixels. Classifiers combine the probability that the current pixel is skin with neighboring pixel information to determine skin location.

Once the pixels in an image are detected as skin or non-skin, it becomes necessary to extract the parts of the image required by the algorithm, and this is where skin segmentation comes into play. Some approaches [11] make use of elliptical approximations of skin regions to determine whether they are a face or non-face; the largest ellipse is processed for facial features. More simplified methods simply use the skin segments to localize the search for features. Methods of segmentation also employ morphological operations such as dilation and erosion to improve the selection of a skin region.

In this thesis, skin detection and segmentation are used to limit the search for features and to help normalize the image. The use of skin also helps reduce the false detection rates from the feature detectors. The next section focuses on an overview of artificial neural networks that are [...]

2.5 An Overview of Artificial Neural Networks

An artificial neural network is a learning system that is capable of performing data classification based on input training data. As stated before, ANN systems are motivated by the internal connections of the neurons in the human brain. ANNs are made up of interconnected processing elements called nodes, which respond in parallel to a set of input signals. The node is the equivalent of its brain counterpart, the neuron.

The concept for ANNs is fairly old, dating back to the 1940s when McCulloch and Pitts introduced the first neural network computing model. More complex forms of this model were developed into the 1950s, including Rosenblatt's work, which resulted in a two-layer network, the perceptron, which was capable of learning certain classifications by adjusting connection weights [28]. Despite laying the foundation for future work, lack of computing power during this time led to a decline in the interest in neural networks until the 1980s. In 1984 DARPA began experimenting with ANNs for both commercial and defense purposes [27]. Since that time ANNs have been used in a wide number of applications, from economic forecasting to guidance systems for automobiles. Researchers have been attracted to ANNs because of their effectiveness in solving complex problems. The function of these networks depends heavily on the construction of the network and the way in which networks are trained.

The construction of the ANN consists of its node representation, its interconnections, and an activation rule. Figure 3 represents one possible construction of a neural network. All networks consist of an input layer of nodes and an output layer of nodes; however, the hidden layer of nodes may not always be needed, depending on the application. More complex applications often require multiple hidden node layers. The interconnections between the nodes also affect the results of the network. Each node is not necessarily connected to every other node in the following layer, although it often is; changing the connections in the network can result in a change in performance. As the signals are passed from node to node they are also often subject to an activation rule. The activation rule affects when the input signals are used to produce the corresponding output signal by introducing delays into the signal path of the network; again, the results of the network can be changed.

The two most basic types of ANN architecture are the feed-forward network and the feedback network. A feed-forward network is presented with a set of inputs and the signal moves [...] signal may be passed back into the same layer or a previous layer; the results of the network then become more dependent on timing in the system and the convergence of the signals [29]. This thesis deals primarily with feed-forward networks.

Figure 3: Neural network (diagram labels: node, interconnections, input layer, hidden layer, output layer, output information)

The nodes of the network may be configured in various ways to compute the results for the network. Figure 4 shows a typical node calculation for a network, and this is also how the nodes used for the feature detectors in this thesis operate.

a = Σ (Inputs * Weights) + bias

Figure 4: Node operation (diagram labels: inputs Ij, weights Wj, bias, output)

The most important feature presented here is how the weights and biases are used in the computation of the result. Each input value is multiplied by a weight determined during the learning process. All of these weight and input combinations are then summed and offset with a [...] function. Transfer functions modify the result to fall within a certain range and affect the results of the network. Examples of functions used are the inverse tangent and log functions [27].
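The node calculation of Figure 4 can be written in a few lines. This is a sketch only: the function names are hypothetical, and reading the "log" transfer function as the log-sigmoid is an assumption.

```python
import numpy as np

def log_sigmoid(a):
    """Log-sigmoid transfer function; maps any value into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def node_output(inputs, weights, bias, transfer=np.arctan):
    """Compute one node: a = sum(inputs * weights) + bias, then transfer(a).

    The default transfer function is the inverse tangent, which maps the
    weighted sum into the range (-pi/2, pi/2).
    """
    a = np.dot(inputs, weights) + bias
    return transfer(a)
```

For example, `node_output([1.0, 2.0], [0.5, -0.25], 0.0)` has a weighted sum of zero, so the inverse tangent gives 0 and the log-sigmoid gives 0.5.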

The learning process, or training process, greatly affects the functionality of the network. All networks must include a learning rule that dictates how to adjust weights and biases based on a given set of inputs. Training can be subdivided into either supervised or unsupervised training. With unsupervised training the network is developed using only the input values, whereas supervised training combines the inputs with known output values. Generally, the more input/output pairs used in the training process, the better the results of the network. The goal of learning algorithms is to adjust the weights of a network so that as many inputs as possible are given the correct output. This thesis will use the back-propagation training method to develop the networks used in feature detection. The approach taken by the back-propagation algorithm is to start at the output layer and propagate the error between the desired and actual result backwards through the hidden layers. Weights and biases are then adjusted at each node based on the error propagation [29]. Now that a basic understanding of ANNs has been established, it is important to look at how such networks have been used in the past for both feature detection and pose estimation.
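A single back-propagation update for a network with one hidden layer might look like the sketch below. It is illustrative only and not the training code used in this thesis: the function names, the log-sigmoid transfer function, the squared-error measure and the learning rate `lr` are all assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_step(x, target, w1, b1, w2, b2, lr=0.5):
    """Run one supervised update and return the error before the update.

    x: (n, d_in) inputs, target: (n, d_out) known outputs.
    w1, b1, w2, b2 are float arrays and are modified in place.
    """
    # Forward pass: input layer -> hidden layer -> output layer.
    h = sigmoid(x @ w1 + b1)
    y = sigmoid(h @ w2 + b2)
    # Backward pass: start with the error at the output layer ...
    delta_out = (y - target) * y * (1.0 - y)
    # ... and propagate it back through the hidden layer.
    delta_hid = (delta_out @ w2.T) * h * (1.0 - h)
    # Adjust weights and biases at each node from the propagated error.
    w2 -= lr * (h.T @ delta_out)
    b2 -= lr * delta_out.sum(axis=0)
    w1 -= lr * (x.T @ delta_hid)
    b1 -= lr * delta_hid.sum(axis=0)
    return 0.5 * np.sum((y - target) ** 2)
```

Calling `train_step` repeatedly over the set of input/output pairs should gradually reduce the returned error, which corresponds to the weight adjustment described above.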

While ANNs have been used in many applications and research, this section overviews sources that are the most relevant to facial detection and the approach taken with this thesis. A good example of ANNs used to determine the presence of a face is presented in [30]. Images are first preprocessed to reduce noise and lighting effects on the image. The algorithm uses ANNs in a series of stages. Stage one uses ANNs as filters; these filters are applied to the image at different scales and determine whether the 20x20 sub-image contains a face. The same network also looks for localized facial features within these sub-images by breaking the image down even further. The first network also makes the determination as to whether or not a face is present in the system from the results of the filtered sub-image. The second stage in the method develops an ANN that works to eliminate false detection. This second network was trained using the information from several stage 1 networks, which produced different results due to training. A face is only detected if more than one network agrees that a face is present. This method claims an [...]

Feature detection using ANNs is not new to image processing. Reference [31] is an example of work using networks for facial detection. This research does not look for traditional physical features, such as eyes or nose, but rather uses feature domains, such as Pseudo Zernike Moment Invariant (PZMI), Zernike Moment Invariant (ZMI), and Principal Component Analysis (PCA). These facial domains are first calculated within an ellipse around the face. The resulting features from each of these domains are fed into a Radial Basis Function neural network where the selected image is classified as either a face or non-face. Using feature domains instead of traditional features allows their system to be independent of shape changes that can occur during directional changes in the face direction. In fact, the method can detect the presence of a face even at the 90 degree mark. An overall face detection rate of 99.7% is given for the results, but the set of test images was highly constrained and is not typical of a general collection of images.

The methods presented in this thesis differ from previous work as follows. First, two completely independent networks are used for the detection of eyes and mouth. The eye network classifies regions as either eye or non-eye and the mouth network classifies regions as either mouth or non-mouth. The networks used here make only a localized classification of features, not a classification of the face as a whole. Most methods dealing with ANNs tend to be high in computation requirements, so every attempt has been made to reduce the number of features detected with the ANNs as well as the size of the input sub-image. The size of these sub-images is relatively fixed compared to the methods above. The method deals with scaling issues by using skin detection and known anthropometric proportions. Chapter 3 discusses how the [...]

Chapter 3: Feature Detection and Normalization

The method of pose estimation presented here is feature-based, and the detection of features strongly affects the results of the algorithm. This chapter is dedicated to the presentation of novel work with feature detection using ANNs and template matching. Figure 5 shows the processing that an image undergoes during the proposed pose estimation approach.

Figure 5: Algorithm flow (image, skin detection, normalization, feature detection, template matching, geometric pose estimation, result)

This chapter goes into depth on each of these processes except for the final estimation, which is covered in Chapter 4. The next section introduces the limitations placed on feature detection and the method as a whole and the reasoning behind such limitations.

3.1 Limitations and Difficulties

As stated in Chapter 2, feature detection with faces is not a simple task and can be approached in a variety of ways. Our goal for pose estimation is to develop the most efficient form of pose estimation possible, while still maintaining a good degree of accuracy. Since pose estimation and the issue of facial recognition are large in scope, it is necessary to first place limitations on the expected abilities of the pose estimation method.

The determination of facial pose in all three axes of rotation requires a complex method, as shown by previous examples, and may not always be possible for the task at hand. Also, certain applications, such as tracking, could require only one direction of change. Thus, the first limitation of the method is the restriction to determine only the yaw angle directional change. The determination of this angle has the most potential to be applied to other applications like classification and recognition. Focusing on only one angle of change allows for a greater understanding of how features are deformed and detected as this pose changes. Limiting pose to [...] translational pose changes, while reference [13] centers on pitch change. This does not mean that the proposed method could not be expanded to other directional changes, as discussed later.

Another limitation is the presence of occlusions in the image. Occlusions result from objects in an image that can obstruct the view of features or the head as a whole. Hats, scarves, glasses and even facial hair can be considered occlusions. Dealing with occlusions has been the main focus of some researchers [16], but for the most part the problem remains unsolved. If a feature cannot be seen in the image by the human eye, most feature detectors will likewise be unable to determine the location. The proposed method assumes that no occlusions are present in the image.

In addition, the method does not support emotion in the facial images. If the person in an image is smiling or frowning, the features can change in their apparent shape. It is assumed that the person has a stoic expression with their mouth closed and eyes fairly open.

The final constraint is how far the face may turn in either direction. With the yaw and pitch pose changes there exists a "vanishing point" for facial features. This is where the feature is no longer present or detectable in the image. With pitch, as the head descends below the horizontal plane the nostrils disappear from view rapidly; other features, such as the mouth and eyes, later become obstructed by the top of the head. With yaw directional changes, the eye in the direction of change becomes the first frontal feature to be lost from view. Pose detection past this point becomes impossible without the use of other features [19]. A second issue with features at extreme angles is their deformation. The shape of the eyes and the shape of the mouth will change as the face turns in one direction or another. Training on the mouth or eyes at these extreme angles degrades the ANN and therefore overall detection. This method limits the directional change in the yaw direction to between 30 and -30 degrees. In the results section additional results are presented from 45 to -45 degrees, so the reader may better observe the degradation in performance as the angle increases beyond design limitations. Other limitations in the algorithm or its implementation are more trivial and were caused by problems inherent to the software or hardware. These further limitations are presented in their respective sections.

Now that the limitations placed on the method are understood, we can introduce our own method of feature detection using ANNs. The first step in this process is skin detection, which is [...]

3.2 Skin Detection

Skin detection is a necessary step in feature detection for three reasons. First, the ANN tends to be computationally expensive, so if skin detection can be used to search for features only within a predetermined area instead of the entire image, the overall computation of the system is reduced. Second, once the facial region is determined through skin detection, the features of the face must, therefore, also be present in this region. This helps eliminate falsely detected points. Finally, the feature detectors are limited in their invariance to scale (see Chapter 5 for error analysis). The feature detectors work best if the image is a certain size and the eyes and mouth fall within the size limitations of the network. This greatly reduces the robustness of the algorithm, since images will rarely be of that specific size if taken in the real world. Skin detection, therefore, allows scaling the image to the necessary size, based on the largest skin region. The pose estimation should occur regardless of the initial size of the image as long as detection of skin and features is correct.

Our first attempt at skin detection was with the use of RGB histograms. This first skin detection method was too sensitive to lighting conditions and tended to detect background portions of the image as skin. The use of the YCbCr color space has been shown in references [4] [25] to be insensitive to lighting conditions and therefore seemed appropriate to our application.

The first step in skin detection is to build the ground truth information that will be used to determine the presence of skin. Color facial images of size 1600x1200 were gathered from ten individuals of varying ethnic groups; their skin regions were then selected manually. The images were converted to the YCbCr color space. This resulted in millions of color tone pixels, which can be used for future comparison. A 2-D histogram was developed, which is shown in Figure 6. This figure is similar to other histograms presented by reference [25], even though they used a much larger set of web-based images.

The next step was to process this information into a lookup table using the Cr and Cb color values and the probabilities from the histogram that a pixel of this coloring is skin. A threshold probability was used to eliminate outliers on the histogram. The elimination of these pixels helped to avoid errors caused by the manual detection of skin during the collection [...]

Figure 6: Skin detection histogram (3-D plot of Cr and Cb in the skin range)

Skin detection is performed on the image using this lookup table. The color image is scanned pixel by pixel and each pixel is converted from RGB to YCbCr using Equations (1)-(3). The values for Cb and Cr are compared to the table, and if the table's probability for those values is above the threshold probability, then that pixel is considered to be skin. It can be noted that other methods [24] [25] make more complex uses of color probability to eliminate false detection of skin. In our method this is not necessary, since we use morphological operations to eliminate possible false points and fill in any undetected areas.
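The table construction and lookup can be sketched as follows. This is an illustration under stated assumptions rather than the code used for this work: the function names are hypothetical, chrominance values are binned to whole numbers from 0 to 255, and the probability of a bin is taken as its share of all manually selected skin pixels.

```python
import numpy as np

def build_skin_table(cb_samples, cr_samples, threshold):
    """Build a 256x256 boolean lookup table indexed by [Cb, Cr].

    An entry is True when the fraction of skin samples falling in that
    (Cb, Cr) bin exceeds the threshold, which discards histogram outliers.
    """
    hist, _, _ = np.histogram2d(cb_samples, cr_samples,
                                bins=256, range=[[0, 256], [0, 256]])
    prob = hist / hist.sum()
    return prob > threshold

def detect_skin(cb, cr, table):
    """Return a boolean skin mask for arrays of Cb and Cr values."""
    cb_idx = np.clip(np.asarray(cb), 0, 255).astype(int)
    cr_idx = np.clip(np.asarray(cr), 0, 255).astype(int)
    return table[cb_idx, cr_idx]
```

Because detection reduces to one array lookup per pixel, the cost at run time is small compared with the neural network stages that follow.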

An example of the results produced by the skin detection is shown in Figure 7. At this point a skin mask is created from the detected area that needs to be processed with morphological filters to remove stray pixels and fill in gaps in the image. The first morphological process is closing, i.e. dilation followed by erosion. During the dilation process a 3x3 block is passed over each pixel in the image; if the center pixel is detected as skin then the neighboring pixels are also marked as skin. Dilation has the overall effect of increasing the size of the skin region and reducing the number of holes in the interior. During the erosion step a 3x3 block is [...] other skin pixels are now removed. Closing has the effect of filling in interior parts of the image that were not detected as skin during processing. Figure 8 shows the results of the morphological operations. From this figure, the areas for the eyes and open areas in the forehead and cheek have been filled in, allowing for smoother regions.

Figure 7: Original image and detected skin area

Figure 8: Result of morphological filters
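A 3x3 closing of a binary skin mask can be sketched with plain array operations. The function names are hypothetical, and the erosion shown is the standard binary definition (a pixel survives only if its whole 3x3 neighborhood is skin), which is assumed here to match the step described above.

```python
import numpy as np

def dilate(mask):
    """Mark a pixel as skin if any pixel in its 3x3 neighborhood is skin."""
    mask = np.asarray(mask, dtype=bool)
    padded = np.pad(mask, 1, constant_values=False)
    out = np.zeros_like(mask)
    for dr in range(3):
        for dc in range(3):
            out |= padded[dr:dr + mask.shape[0], dc:dc + mask.shape[1]]
    return out

def erode(mask):
    """Keep a pixel only if every pixel in its 3x3 neighborhood is skin."""
    mask = np.asarray(mask, dtype=bool)
    # Assumption: pixels outside the image count as skin, so the image
    # border itself does not erode the region.
    padded = np.pad(mask, 1, constant_values=True)
    out = np.ones_like(mask)
    for dr in range(3):
        for dc in range(3):
            out &= padded[dr:dr + mask.shape[0], dc:dc + mask.shape[1]]
    return out

def close_mask(mask):
    """Closing: dilation followed by erosion."""
    return erode(dilate(mask))
```

Small holes such as the eye areas are filled by the dilation, and the erosion then restores the outer boundary of the region to roughly its original extent.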

The next step in the process is image segmentation, shown in Figure 9. In this figure, each region has been given a different gray level to make it more distinguishable. During this process [...]

Figure 9: Skin region image

The facial region is determined to be the region with the most skin pixels; this can also be viewed as the region with the largest area. The image goes through the final processing stage, which removes all the other regions. During this stage the facial region is also filled in, removing any remaining gaps left by the morphological processing. This final step is necessary to allow for images of different sizes and viewing distances to be processed correctly; closing alone cannot always accomplish this unless it is done multiple times, which is costly. The final region image is shown below in Figure 10.

Figure 10: Final facial region
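Selecting the region with the most skin pixels can be sketched with a simple flood fill. The function name is hypothetical, and the use of 4-connectivity to decide which pixels belong to the same region is an assumption; the additional filling of gaps inside the facial region is not shown.

```python
from collections import deque

import numpy as np

def largest_region(mask):
    """Return a mask containing only the largest connected skin region."""
    mask = np.asarray(mask, dtype=bool)
    rows, cols = mask.shape
    visited = np.zeros_like(mask)
    best = []
    for r0 in range(rows):
        for c0 in range(cols):
            if not mask[r0, c0] or visited[r0, c0]:
                continue
            # Breadth-first flood fill collects one connected region.
            region, queue = [], deque([(r0, c0)])
            visited[r0, c0] = True
            while queue:
                r, c = queue.popleft()
                region.append((r, c))
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and mask[nr, nc] and not visited[nr, nc]):
                        visited[nr, nc] = True
                        queue.append((nr, nc))
            if len(region) > len(best):
                best = region
    out = np.zeros_like(mask)
    for r, c in best:
        out[r, c] = True
    return out
```

The bounding box of the returned region then gives the height and width of the facial region used for scaling.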

Some other examples of the skin detection process help to show its limitations and robustness. Figure 11 through Figure 13 show what happens when other skin tone areas are [...]

Figure 11: Original image with hand

Figure 12: Region image with hand

The hand in the image is separated out during the region process and later removed, leaving only the facial area. With multiple people or faces in an image a similar effect occurs: the face which is closest to the camera is determined to be the facial region and the pose estimation process will occur on that region. Figure 14 introduces some of the problems with the skin detection method.

Figure 14: Effect of connected skin regions

The above figure shows what happens when another flesh tone object comes into contact with the head region. The transition between the two objects is not great enough for them to be separated out during the region process. This also happens if multiple faces are in an image and are touching or at least appear to be connected. The result is a much larger facial region than is actually there, which causes the image to be scaled incorrectly during normalization. The best way to deal with regions like this is to explore methods of elliptical shape fitting to find the actual face in the image. Unfortunately, most methods that accomplish this are complex and outside the scope of this thesis; this does, however, provide an area of future work.

Figure 15 shows that the skin table can also handle other skin tones than those presented in the previous examples. When using the YCbCr color space a few colors are sometimes either detected falsely or not at all. The most apparent color that has problems is red at either high or low luminance. The redness in faces that occurs with blushing has the same chrominance as dark or bright reds. One possible way to correct this false detection is to avoid pixels at extreme luminance. Likewise, certain hair tones are also sometimes falsely detected as skin due to similar [...]

Figure 15: Skin tones

While the skin detection method is by no means perfect, it does satisfy the needs of the algorithm by providing a method to scale the images and to limit the search area for facial features. The next section discusses how the facial region is now used to normalize the image along with other image processing methods.

3.3 Image Normalization

In order to improve feature detection and therefore pose estimation, it is necessary to introduce methods that improve the quality of the image and make it suitable for processing by the ANNs. This processing is known as image normalization. The two steps of normalization for this method are scaling and contrast stretching.

Once the final skin region has been located, as in Figure 10, the original image can now be scaled. The image is scaled either up or down depending on the viewing distance and size of the original image. First the height and width of the facial region are measured. Next the proportion of the facial region height to the facial region width is analyzed to see whether or not the neck has been included in the region, as in Figure 13, or just the facial area, as in Figure 10. If the width of the face is not greater than 2/3rds of the length, then the region includes the neck and must be scaled with a different base value. The 2/3rds value was determined from previous work related to anthropometric studies [2] and from our own experimentation with different facial proportions. The width of the facial region changes as pose changes become larger at [...] on the height of the images used in training for the ANNs. This is the height the image must be for the feature detectors to correctly detect the eyes and the mouth.

Scale rate = [...] / Height of Facial Region (4)

After the scale rate has been determined using Equation (4), the image is scaled down with bilinear interpolation. The image is then converted from 24-bit RGB to 8-bit gray scale for the next stage of normalization, contrast stretching.

The contrast of an image is the distribution of its light and dark pixels. An image of low contrast has a small difference between its pixel values. The histogram of a low contrast image is usually skewed either to the left, as in the case of light images, or to the right, as in the case of darker images. Histograms located around center are mostly gray. Contrast stretching is a technique used to stretch the histogram of an image so that the full range of pixel values in the image is filled. Contrast stretching helps make features stand out more in the image compared to skin areas.

The method of contrast stretching implemented here creates a histogram of pixel values. The histogram is then clipped between the 0.5 and 99.5 percentiles to eliminate outlier values and for robustness. The highest and lowest values from the histogram are then set to 255 and 0 respectively. All other pixel values are then scaled accordingly. This technique works best when the histogram of the image is Gaussian in shape, but does not make full use of the allowed values [32]. Contrast stretching also helps deal with lighting effects to a limited extent.
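The clipping and rescaling steps can be sketched as follows. The function name and parameter names are hypothetical, and a linear mapping between the two clipping points is assumed.

```python
import numpy as np

def stretch_contrast(gray, low_pct=0.5, high_pct=99.5):
    """Clip a grayscale image at two percentiles and rescale it to 0..255."""
    gray = np.asarray(gray, dtype=float)
    lo, hi = np.percentile(gray, [low_pct, high_pct])
    if hi <= lo:
        # Flat image: there is no range to stretch.
        return np.zeros(gray.shape, dtype=np.uint8)
    clipped = np.clip(gray, lo, hi)          # discard outlier values
    scaled = (clipped - lo) / (hi - lo) * 255.0
    return np.round(scaled).astype(np.uint8)
```

Pixels at or below the lower percentile map to 0 and pixels at or above the upper percentile map to 255, so a few extreme pixels cannot compress the range available to the rest of the image.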

Once the image has been normalized to the correct shape and contrast, the ANNs can be used to determine final features. The next section introduces how the ANNs were created and trained to suit the needs of the feature detectors.

3.4 Neural Network Training

Before the ANNs can be used for detection of facial features, they must first be trained using images of the desired features. The images used in the training stage were from the "Database of Faces" provided by AT&T Laboratories Cambridge [33]. This database consists of [...] are ten images, taken under controlled lighting and distance conditions, at various poses. The orientation of the faces varies slightly in both pitch and yaw, but exact angles are not given. Frontal features are viewable in all images. Images containing facial hair, glasses, blinking, or open mouths were eliminated from the training sets. These images were eliminated since the limitations of the method do not allow for facial expressions or occlusions. Sample images are provided below.

Figure 16: Sample faces for training

The next step was to create sub-images of both the eye and the mouth. For each image in the training set, the centers of the right eye, left eye and mouth were manually selected. A sub-image of 21x11 was then extracted around the eyes. For the mouth a sub-image of 33x13 was used. The determination of the sub-image size is a tradeoff between speed and accuracy. A larger sub-image size contains more image information, but also produces a much larger network and requires more time for both training and detection. If the sub-image is too small, not enough information is captured to make the feature different from other areas of the face. Thus, the network may train or perform poorly.
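Cutting a fixed-size sub-image around a manually selected center can be sketched as below. The function name is hypothetical, and reading the sizes as width x height (21 wide by 11 high for the eye, 33 wide by 13 high for the mouth) is an assumption.

```python
import numpy as np

def extract_sub_image(image, center_row, center_col, width, height):
    """Return the width x height window centered on the given pixel.

    Returns None when the window would extend past the image border.
    """
    top = center_row - height // 2
    left = center_col - width // 2
    if (top < 0 or left < 0
            or top + height > image.shape[0]
            or left + width > image.shape[1]):
        return None
    return image[top:top + height, left:left + width]
```

With these sizes an eye sub-image holds 21 x 11 = 231 pixel values and a mouth sub-image holds 33 x 13 = 429, which illustrates how quickly the sub-image size drives up the number of network inputs.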

In order to train a network correctly, non-feature images needed to be collected from the facial images. For each feature image collected, five non-feature sub-images of the same size were collected from around the face. Non-feature images include areas like facial edges, hair, eyebrows, nostrils and skin areas like the cheek. The figures below show some examples of [...]

Figure 17: Eye feature images

Figure 18: Non-eye feature images

Figure 19: Mouth feature images

Figure 20: Non-mouth feature images

The collection of sub-images in this fashion resulted in a significant number of sub-images to be used for training. The initial training set for the eye consisted of 276 eye images

and 1380 non-eye images. Eye images included bo