Low-resolution facial expression recognition: A filter learning perspective


Contents lists available at ScienceDirect

Signal Processing

journal homepage: www.elsevier.com/locate/sigpro


Yan Yan a, Zizhao Zhang b, Si Chen c, Hanzi Wang a,∗

a Fujian Key Laboratory of Sensing and Computing for Smart City, School of Informatics, Xiamen University, Fujian, China

b Department of Computer Information and Science Engineering, University of Florida, Florida, USA

c School of Computer and Information Engineering, Xiamen University of Technology, Fujian, China

Article info

Article history: Received 23 April 2019; Revised 28 October 2019; Accepted 9 November 2019; Available online 12 November 2019

Keywords: Filter design; Subspace learning; Image representation; Low-resolution; Facial expression recognition

Abstract

Automatic facial expression recognition has attracted increasing attention for a variety of applications. However, low resolution generally degrades the performance of facial expression recognition methods in real-life environments. In this paper, we propose to perform low-resolution facial expression recognition from the filter learning perspective. More specifically, a novel image filter based subspace learning (IFSL) method is developed to derive an effective facial image representation. The proposed IFSL method mainly includes three steps. Firstly, we embed the image filter learning into the optimization process of linear discriminant analysis (LDA). By optimizing the cost function of LDA, a set of discriminative image filters (DIFs) corresponding to different facial expressions is learned. Secondly, the images filtered by the learned DIFs are added together to generate the combined images. Finally, a regression learning technique is leveraged for subspace learning, where an expression-aware transformation matrix is obtained using the combined images. Based on the transformation matrix, IFSL effectively removes irrelevant information while preserving useful information in the facial images. Experimental results on several facial expression datasets, including CK+, MMI, JAFFE, SFEW and RAF-DB, show the superior performance of the proposed IFSL method for low-resolution facial expression recognition, compared with several state-of-the-art methods.

1. Introduction

During the past few decades, automatic facial expression recognition has attracted extensive attention in computer vision and pattern recognition. It plays an important role in a variety of applications, including human-computer interaction (HCI) and data-driven animation [1–5]. Despite significant progress, facial expression recognition is still a difficult task, due to the variations caused by pose, illumination, and so on. One critical problem that is not well solved is low resolution (LR). In real-life environments, the resolution of facial images captured by an ordinary camera may be low. LR facial images usually lack sufficient visual information to extract informative features, which degrades the performance of facial expression recognition methods [6]. Therefore, effectively distinguishing different facial expressions based on LR facial images is very challenging but meaningful for practical tasks.

∗ Corresponding author.

E-mail addresses: [email protected] (Y. Yan), [email protected]fl.edu (Z. Zhang), [email protected] (S. Chen), [email protected] (H. Wang).

To deal with the problem of LR facial expression recognition, existing methods mainly focus on two aspects: (1) face super-resolution (SR) [7,8], and (2) facial image representation [6,9,10]. The first aspect usually adopts two criteria to perform SR: pixel-level visual fidelity and image-level face identity preservation. The second aspect aims to extract a compact and discriminative feature representation. In this paper, we mainly study the second aspect.

SR methods aim to construct a high-resolution (HR) image from the corresponding LR image [11]. Theoretically, by applying SR methods to the LR facial images to construct HR images, the reconstructed images can be used for facial expression recognition. However, the computational complexity of existing SR methods is usually high, and these SR methods cannot guarantee that the resulting HR facial images are optimal for recognition [12].

Facial image representation also plays a critical role in LR facial expression recognition. The methods of representing facial expression images can be roughly categorized into geometric feature-based methods [13] and appearance-based methods [14]. Geometric feature-based methods detect salient facial feature points and characterize the variations of these points, which can achieve good performance on action unit recognition [15]. However, precise localization of facial feature points is not a trivial task for LR facial images. Appearance-based methods represent the variations of facial appearance based on the whole face or specific regions in a facial image. These methods usually attempt to extract discriminative features in facial images to distinguish different facial expressions.

https://doi.org/10.1016/j.sigpro.2019.107370

The appearance-based methods can be further classified into handcrafted feature-based methods [16–18] and feature learning-based methods [19–21]. Representative handcrafted feature-based methods include local binary patterns (LBP) [16], Haar-like features [17] and Gabor-wavelet features [18]. However, these manually designed features may not effectively handle the challenges caused by the non-linear facial appearance variations due to different poses, occlusions, etc. More recently, feature learning-based methods, such as auto-encoders [19] and convolutional neural networks (CNNs) [20,21], have attracted much attention. Zhang et al. [19] present a spatially coherent feature-learning method for pose-invariant facial expression recognition. They combine the learning-based features and the corresponding geometry features to construct robust features. Xie and Hu [20] propose a deep comprehensive multipatches aggregation CNN method, which consists of two CNN branches that respectively extract holistic features and local features, to solve the problem of facial expression recognition. Li et al. [21] present a CNN with an attention mechanism (ACNN) for facial expression recognition in the presence of occlusions. These methods show an excellent ability to extract discriminative representations from the raw data.

Psychologists have shown that the crucial features for recognizing facial expressions are usually distributed around salient facial feature points, such as the mouth, nose and eyes [1]. The variations of the salient facial feature points are the useful information for facial expression recognition. In contrast, the facial identities of different persons are irrelevant information, which should be suppressed or removed for the task of facial expression recognition (although such information is important for identifying a person). In particular, the information in LR facial images is relatively limited, so the irrelevant information may significantly decrease the performance of LR facial expression recognition. Hence, how to extract a discriminative facial image representation from the limited information is critical.

In this paper, we propose to perform LR facial expression recognition from the filter learning perspective, where a novel and effective facial image representation is developed for facial expression recognition. The process of constructing the facial image representation for LR facial expression recognition can be considered as the process of suppressing irrelevant information (e.g., facial identity differences), while enhancing the valuable information (e.g., wrinkled eyebrows, a smiling mouth and other features that are critical for discriminating different expressions) in facial images.

More specifically, we propose a novel image filter based subspace learning (IFSL) method to achieve an effective image representation for LR facial expression recognition. In particular, we learn a discriminative image filter (DIF), based on the two-class linear discriminant analysis (LDA) technique [22], to discriminate a non-neutral expression from the neutral expression. The learned DIF extracts discriminative information by mapping the facial images to a subspace where the intra-class variations are minimized and the inter-class variations are maximized. Therefore, the DIF is able to find subtle variations of facial expression among different LR facial images. When a set of learned DIFs (corresponding to different expressions) is applied to a multi-class classification task (e.g., facial expression recognition in this paper), we propose to use a regression learning technique (i.e., the linear ridge regression (LRR) technique [23]) to derive a new facial image representation with high discriminability, based on the filtered images rather than the original images. As a result, an expression-aware transformation matrix that encodes the expression information is obtained. This strategy extends the classification ability of the DIF from two-class to multi-class classification.

In summary, we present a novel image representation method by taking advantage of the discriminative image filter and the regression learning technique.

The preliminary versions of this work were reported in [24,25]. However, we have made several significant extensions compared with the previous versions. The new contributions include:

• We provide a general formulation of the image filter learning, where the image filter can take different forms (such as element-wise product, linear transform and convolution) as long as it is differentiable.

• We reformulate the original method and develop a more general framework for image filter based subspace learning. We also offer more mathematical details and motivations of the proposed method for facial expression recognition.

• We conduct extensive experiments on both in-the-lab datasets and in-the-wild datasets to demonstrate the effectiveness of the proposed method for LR facial expression recognition.

The remainder of this paper is organized as follows. In Section 2, we review related work. In Section 3, we introduce the details of the proposed IFSL method. In Section 4, we evaluate the performance of IFSL and compare IFSL with several state-of-the-art methods on various facial expression recognition datasets. Finally, the conclusion is drawn in Section 5.

2. Related work

This section reviews related work in LR facial expression recognition. Firstly, the recently developed methods on face super-resolution and LR recognition are introduced in Section 2.1. Then, some facial image representation methods are reviewed in Section 2.2. Finally, some filter learning methods are discussed in Section 2.3.

2.1. Face super-resolution and low-resolution recognition

Traditional methods for handling LR facial images aim to generate high-resolution (HR) facial images, based on which facial expression recognition can be performed. These methods can be roughly classified into two categories: generic super-resolution (SR) methods [7], and class-specific SR methods (also called face hallucination) [8].

Generic SR methods take advantage of the priors that ubiquitously exist in natural images without relying on the face class information. For instance, Gu et al. [26] develop a convolutional sparse coding method for image SR instead of the conventional patch-based methods. Dong et al. [7] use CNNs to learn a non-linear mapping function between LR and HR images based on a large-scale image dataset.

On the other hand, face hallucination methods exploit the statistical information of faces, and they usually achieve better results than generic SR methods for facial expression recognition or face recognition. For example, Ma et al. [27] use multiple local constraints learned from exemplar facial images to perform face hallucination. Wang et al. [28] propose to apply the global constraints between LR and HR facial images, and then hallucinate HR facial images based on the eigen-transformation. However, the output HR facial images may suffer from ghosting artifacts. Note that generative adversarial networks (GANs) [29] can generate facial images with sharp details due to the discriminative networks. However, the generated images are only similar to one another in the class domain, while they differ in the appearance domain [30]. In general, the computational complexity of these face hallucination methods is usually high.

Apart from the above SR methods, some methods have been specifically developed for LR facial expression recognition or LR face recognition. These methods aim to extract resolution-insensitive features [6] or learn cross-resolution transformations [31–34].

For example, Khan et al. [6] propose a novel feature descriptor, PLBP (Pyramid of LBP), for LR facial expression recognition. In fact, LR face recognition has achieved significant progress in the past few years. Ren et al. [31] propose a coupled kernel embedding (CKE) method for feature extraction with its application to LR face recognition. Jiang et al. [32] develop a coupled discriminant multi-manifold analysis (CDMMA) for LR face recognition. By exploiting the neighborhood information and local geometric structure of the manifold, CDMMA learns two mappings to project LR and HR images to a unified discriminative feature space. Xing and Wang [33] develop couple manifold discriminant analysis with bipartite graph embedding (CMDA_BGE) to solve the problem of LR face recognition. Chu et al. [34] propose a cluster-based regularized simultaneous discriminant analysis (C-RSDA) method for LR face recognition with a single sample per person. Note that these LR face recognition methods usually match the LR probe faces against the HR gallery images. In this paper, we concentrate on the more general setting, where both the training and test images are LR.

2.2. Facial image representation

Automatic facial expression recognition usually consists of two main steps: feature extraction and facial expression classification [1]. Feature extraction extracts generative or discriminative representations from raw facial images to effectively represent the facial images. Generally, current methods for representing facial expression images can be divided into geometric feature-based methods [13] and appearance-based methods [9,16,17,35]. In this paper, we mainly review the appearance-based methods.

The representative appearance-based methods, including local binary patterns (LBP) [6,16,36,37], Haar-like features [17] and Gabor-wavelet features [18], have been successfully applied to facial expression analysis. Specifically, several LBP-based variants, such as m-LBP (representing salient micro-patterns in facial images) [36] and Boost-LBP (using a boosting algorithm to learn the most discriminative LBP histograms) [37], have been proposed and achieve state-of-the-art performance. Classical subspace learning methods, such as linear discriminant analysis (LDA) [38] and principal component analysis (PCA) [39], are also widely used for feature extraction.

The above methods consider the facial image as a whole without specifically emphasizing the important role of salient facial feature points. Actually, some local facial regions (e.g., eyes, eyebrows and mouths) contain critical information for expression recognition, since different expressions accompany the variations in different local facial regions. In recent years, some methods [9,35] have been proposed to analyze non-holistic facial images. For instance, Zhong et al. [35] propose a multi-task sparse learning framework to explore discriminative information in local facial regions for differentiating different expressions, and suggest that different local facial regions should be assigned different weights. They use LBP to partition the facial images into isometric non-overlapping regions, where the relationship among different local facial regions is exploited. Experimental results in [35] show that the most important local facial regions for recognizing the expressions are the eyes, eyebrows, nose and mouths.

Recently, deep learning has achieved outstanding performance in a variety of computer vision tasks, including facial expression recognition [20,21,40–44]. For example, Xie et al. [40] propose the deep attentive multi-path CNN (DAM-CNN) method, which not only automatically locates the expression-related regions, but also generates an effective facial expression representation. Li and Deng [41] develop a deep locality-preserving CNN (DLP-CNN) method for unconstrained facial expression recognition, which uses a new locality-preserving loss layer for deep learning. Moreover, they introduce a new real-world facial expression dataset (i.e., RAF-DB) captured from the Internet. In [42], a deep emotion-conditional adaption network (ECAN) method for unsupervised cross-dataset facial expression recognition is developed.

Recently, video-based facial expression recognition has received great interest. Compared with a static image, a video sequence not only contains the spatial appearance, but also provides facial motions. Gupta et al. [45] develop a scale invariant architecture for generating illumination invariant deep motion features for video-based facial expression recognition. Alam et al. [46] propose a biologically inspired sparse-deep simultaneous recurrent network (S-DSRN) for robust facial expression recognition. S-DSRN makes use of the weight sharing technique in the hidden recurrent layers to reduce the number of network parameters, where the simultaneous recurrency offers efficient control over the depth of the model. Chen et al. [47] combine a new feature descriptor called histogram of oriented gradients from three orthogonal planes (HOG-TOP) and a new geometric feature descriptor to respectively extract dynamic textures and facial configuration changes for video-based facial expression recognition. Moreover, the audio modality is also considered for recognition.

2.3. Filter learning

Filter learning-based methods are widely applied to many computer vision tasks, such as face recognition [48,49], visual tracking [50] and object detection [51], due to their high generalization ability and robustness. For example, Yan et al. [49] propose an effective correlation filter bank method to extract features for face recognition. Henriques et al. [50] propose to use the kernel correlation filter method for fast visual tracking. Generally speaking, the filter is designed to suppress noisy signals and amplify useful signals so that the discriminability of filtered signals can be enhanced.

It is worth pointing out that Gabor-wavelet [18] and LBP [16,36] can also be considered as special forms of filters. However, these filters are manually designed without using any learning technique. In contrast, our proposed discriminative image filter is learned via the objective function of LDA by maximizing discriminability. In addition, the recent popular CNN-based methods [3,52] use convolutional filters to obtain rich representations for accurate facial expression recognition, and these methods have achieved superior performance. The parameters of the convolutional filters can be effectively learned based on the back-propagation technique. Nevertheless, the CNN-based methods require relatively HR images as the input to extract hierarchical representations, which can make it difficult for these methods to handle the LR facial expression recognition problem [52]. Note that these CNN-based methods usually require a large amount of training data, but current facial expression datasets typically contain a small number of labelled samples. Therefore, cross-corpus training [42] or transfer learning techniques [53] (which take advantage of the extra available training data) can be used to effectively deal with facial expression recognition with limited training data. In this paper, we aim to solve the problem of limited training data from the image filtering perspective. The proposed method is a good alternative for dealing with LR facial expression recognition using limited training data. Experimental results on multiple public facial expression datasets verify the excellent performance of the proposed method for LR facial expression recognition.


Fig. 1. The framework of the proposed IFSL method.

3. Image filter based subspace learning

The proposed image filter based subspace learning (IFSL) method contains three steps. In the first step, we embed the process of discriminative image filter (DIF) learning into the optimization of two-class LDA. In the second step, we linearly combine the filtered images generated by the learned DIFs. In the third step, based on the combined results, we propose to use a regression learning approach to perform feature extraction. The overall framework of the proposed IFSL is shown in Fig. 1.

The discriminative image filter learning is introduced in Section 3.1, and the details of the optimization procedure are described in Section 3.2. The details of the second and the third steps are discussed in Section 3.3. The complete algorithm is given in Section 3.4.

3.1. Discriminative image filter learning

For the task of facial expression recognition, the significant discriminative information mainly lies in the local facial regions such as the eyes, eyebrows, nose and mouths. These local facial regions have different influence on recognizing different expressions (e.g., lips rise in a happy expression face; eyebrows wrinkle in an angry expression face; a mouth widely opens and eyebrows rise in a surprise expression face). Local facial regions like the eyebrows, mouth and chin contain more discriminative information than regions like the cheeks for identifying the angry expression. In other words, these regions play an important role in discriminating different expressions. Therefore, the objective of an image filter is to emphasize discriminative information in the crucial local facial regions while simultaneously suppressing irrelevant information in the other facial regions.

A variety of filter functions can be used. For example,

• Element-wise product: $f(\lambda_1, p) = \lambda_1 \odot p$, where $\odot$ denotes the element-wise product operator; $\lambda_1 \in \mathbb{R}^d$ is the filter; $p$ is a facial image represented as a $d$-dimensional column vector. Therefore, each element $\lambda_{1,i}$ decides how much of the intensity of the $i$th pixel $p_i$ in a facial image passes through.

• Linear transform: $f(\lambda_2, p) = \lambda_2 p$, where $\lambda_2 \in \mathbb{R}^{d \times d}$ is the transformation matrix.

• Convolution: $f(\lambda_3, p) = \lambda_3 * p$, where $*$ denotes the convolution operator and $\lambda_3 \in \mathbb{R}^d$ is the convolutional kernel.

Given a filter function $\lambda$ (defined as one of the above-mentioned three functions) and an input matrix $P = [p_1, \ldots, p_n] \in \mathbb{R}^{d \times n}$ consisting of $n$ facial images, we can obtain the output matrix as

$$f(\lambda, P) = [f(\lambda, p_1), \ldots, f(\lambda, p_n)] = [x_1, \ldots, x_n], \tag{1}$$

and we define $X = [x_1, \ldots, x_n]$. Here, $X \in \mathbb{R}^{d \times n}$ contains $n$ filtered facial images, and each filtered facial image is a $d$-dimensional column vector.
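The three filter forms and Eq. (1) can be illustrated with a minimal NumPy sketch. The function names, the toy sizes, and the choice of a 1-D `mode="same"` convolution on vectorized images are assumptions made for illustration; the paper does not specify how the convolution is implemented.

```python
import numpy as np

def elementwise_filter(lam, P):
    """Element-wise product: scale pixel i of every image by lam[i]."""
    return lam[:, None] * P            # lam: (d,), P: (d, n)

def linear_filter(Lam, P):
    """Linear transform: multiply every image by the d x d matrix Lam."""
    return Lam @ P

def conv_filter(kernel, P):
    """1-D convolution of each vectorized image with `kernel`.
    mode="same" (output length d) is an assumption of this sketch."""
    cols = [np.convolve(P[:, k], kernel, mode="same") for k in range(P.shape[1])]
    return np.stack(cols, axis=1)

# Toy data: n = 5 "images" with d = 16 pixels each, stored as columns of P.
rng = np.random.default_rng(0)
d, n = 16, 5
P = rng.random((d, n))

# Identity-like filters leave the images unchanged, which makes the
# column-wise structure X = [f(lam, p_1), ..., f(lam, p_n)] easy to check.
X_elem = elementwise_filter(np.ones(d), P)
X_lin = linear_filter(np.eye(d), P)
X_conv = conv_filter(np.array([0.0, 1.0, 0.0]), P)
```

In each case the output has the same shape as `P`, so the filtered images can be passed directly to the LDA step.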

Generally speaking, we expect that the learned image filter has the discriminative ability to extract the useful information for facial expression recognition. In other words, the filtered images corresponding to different expressions should be more separable for subsequent classification. Therefore, in order to learn a discriminative image filter (DIF), we propose to take advantage of linear discriminant analysis (LDA) during the training process.

LDA [22] is a popular subspace analysis method which projects high-dimensional samples to an optimal discriminative subspace, where the projected samples are well-separated. It can effectively extract the information from samples and compress the dimensionality of samples through a supervised learning strategy. LDA was originally proposed to handle two-class classification problems. In fact, LDA can also be extended to handle multi-class classification problems (where the inter-class matrix is the sum of the pairwise scatter matrices of any two different classes). However, multi-class LDA suffers from the problem of unbalanced pairwise distances (i.e., the distance between two different classes may be much smaller or larger than that between another two different classes), which may significantly degrade the performance in facial expression recognition [54,55]. Therefore, we mainly focus on two-class LDA in this paper.

Next we give the detailed steps of embedding the DIF learning into the optimization process of LDA. LDA attempts to seek an optimal linear transformation that minimizes the intra-class variance (characterized by an intra-class covariance $S_W$) and maximizes the inter-class variance (characterized by an inter-class covariance $S_B$). In our method, we use the facial images with a neutral expression, denoted as $P_1$, and those with a non-neutral expression (e.g., angry, disgust, fear, happy, surprise, or sad), denoted as $P_2$, as the inputs for training the two-class LDA model.


The cost function of two-class LDA is defined as

$$L(X_1, X_2) = \frac{\omega^T S_B(X_1, X_2)\, \omega}{\omega^T S_W(X_1, X_2)\, \omega}, \tag{2}$$

where $X_i = f(\lambda, P_i)$, $i = 1, 2$, and $\omega \in \mathbb{R}^d$ is the linear transformation vector. The inter-class covariance $S_B$ is defined as

$$S_B(X_1, X_2) = (m_1 - m_2)(m_1 - m_2)^T, \tag{3}$$

and the intra-class covariance $S_W$ is defined as

$$S_W(X_1, X_2) = (X_1 - M_1)(X_1 - M_1)^T + (X_2 - M_2)(X_2 - M_2)^T, \tag{4}$$

where the column vector $m_i \in \mathbb{R}^d$ is the mean of $X_i$ ($i = 1, 2$). The matrix $M_i$ consists of $n$ copies of $m_i$ as its columns.

The optimal $\omega^*$ can then be computed as [23],

$$\omega^* = \arg\max_{\omega} \frac{\omega^T S_B(X_1, X_2)\, \omega}{\omega^T S_W(X_1, X_2)\, \omega} = S_W(X_1, X_2)^{-1} (m_2 - m_1). \tag{5}$$

In two-class LDA, the linear transformation vector $\omega$ has a closed-form formulation.

During the training process, the objective of the DIF learning is to obtain an optimal DIF $\lambda$ embedded in the cost function, and this problem can be solved by gradient descent (which will be discussed in the following subsection). More specifically, by incorporating a DIF into the cost function of LDA, we obtain the following objective function:

$$O(\lambda) = -\ln L(X_1, X_2) + \tfrac{1}{2} C\, \mathrm{tr}(\lambda^T \lambda) = -\ln L\bigl(f(\lambda, P_1), f(\lambda, P_2)\bigr) + \tfrac{1}{2} C\, \mathrm{tr}(\lambda^T \lambda), \tag{6}$$

where $\mathrm{tr}(\lambda^T \lambda)$ is a regularization term which enhances the generalization capability and robustness of the learned filter, and $C$ ($\geq 0$) is a scalar parameter. Therefore, we aim to obtain the optimal DIF such that

$$\lambda^* = \arg\min_{\lambda} O(\lambda). \tag{7}$$
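As a short worked step (not stated in the original text, but following from Eqs. (2)–(5) under the assumption that $S_W$ is symmetric and invertible), substituting $\omega^*$ into the cost function removes $\omega$ from the objective. The shorthand $\delta = m_2 - m_1$ is introduced here only for compactness; both $\delta$ and $S_W$ depend on $\lambda$ through $X_i = f(\lambda, P_i)$.

```latex
\begin{aligned}
\omega^{*T} S_B\, \omega^* &= \bigl(\delta^T S_W^{-1} \delta\bigr)^2, \qquad
\omega^{*T} S_W\, \omega^* = \delta^T S_W^{-1} \delta, \\
L(X_1, X_2)\big|_{\omega = \omega^*} &= \delta^T S_W^{-1} \delta, \\
O(\lambda) &= -\ln\bigl(\delta^T S_W^{-1} \delta\bigr) + \tfrac{1}{2} C\, \mathrm{tr}(\lambda^T \lambda).
\end{aligned}
```

This form can be convenient for checking an implementation of the objective, since it needs only the class means and $S_W$ of the filtered images.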

3.2. Discriminative image filter optimization

The minimization problem in Eq. (7) can be solved via the gradient descent technique [56], since both $f$ and $L$ in $O(\lambda)$ are differentiable. It is worth pointing out that, in each iteration, $\omega^*$ in Eq. (5) is recomputed so that $\partial \omega^* / \partial \lambda$ in $L$ can be evaluated. The derivation details for optimizing $O(\lambda)$ are shown as follows.

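The partial derivative of $O(\lambda)$ with respect to $\lambda_j$ (or $\lambda_{i,j}$ if $\lambda$ is the transformation matrix) is computed. In the following, we use $\lambda_j$ for simplification without loss of generality. Thus, the partial derivative can be written as

$$\frac{\partial O(\lambda)}{\partial \lambda_j} = -\frac{\dfrac{\partial}{\partial \lambda_j} L\bigl(f(\lambda, P_1), f(\lambda, P_2)\bigr)}{L\bigl(f(\lambda, P_1), f(\lambda, P_2)\bigr)} + C \lambda_j. \tag{8}$$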

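The cost function of LDA is differentiable, and we have

$$\frac{\partial L}{\partial \lambda_j} = \frac{\dfrac{\partial}{\partial \lambda_j}\bigl(\omega^{*T} S_B\, \omega^*\bigr)}{\omega^{*T} S_W\, \omega^*} - \frac{\omega^{*T} S_B\, \omega^*}{\bigl(\omega^{*T} S_W\, \omega^*\bigr)^2}\, \frac{\partial}{\partial \lambda_j}\bigl(\omega^{*T} S_W\, \omega^*\bigr), \tag{9}$$

where

$$\frac{\partial}{\partial \lambda_j}\bigl(\omega^{*T} S_W\, \omega^*\bigr) = \left(\frac{\partial \omega^*}{\partial \lambda_j}\right)^{T} \bigl(S_W\, \omega^*\bigr) + \omega^{*T} \left(\frac{\partial S_W}{\partial \lambda_j}\, \omega^* + S_W\, \frac{\partial \omega^*}{\partial \lambda_j}\right), \tag{10}$$

and the derivation of $\partial(\omega^{*T} S_B\, \omega^*) / \partial \lambda_j$ is similar to the right-hand side of Eq. (10).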

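The partial derivative of $\omega^*$ with respect to $\lambda_j$ is calculated in each iteration:

$$\frac{\partial \omega^*}{\partial \lambda_j} = \frac{\partial}{\partial \lambda_j}\bigl[S_W^{-1}(m_2 - m_1)\bigr] = -S_W^{-1}\, \frac{\partial S_W}{\partial \lambda_j}\, S_W^{-1} (m_2 - m_1) + S_W^{-1}\, \frac{\partial}{\partial \lambda_j}(m_2 - m_1). \tag{11}$$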

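According to Eq. (3) and Eq. (4), we can compute $\partial S_W / \partial \lambda_j$ as follows:

$$\frac{\partial S_W}{\partial \lambda_j} = \frac{\partial}{\partial \lambda_j}\bigl[(X_1 - M_1)(X_1 - M_1)^T\bigr] + \frac{\partial}{\partial \lambda_j}\bigl[(X_2 - M_2)(X_2 - M_2)^T\bigr], \tag{12}$$

and similarly, $\partial S_B / \partial \lambda_j$ is computed as follows:

$$\frac{\partial S_B}{\partial \lambda_j} = \frac{\partial}{\partial \lambda_j}\bigl[(m_2 - m_1)(m_2 - m_1)^T\bigr]. \tag{13}$$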

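Finally, the partial derivative of $X$ with respect to $\lambda_j$ is obtained by computing the partial derivative of each column of $X$ with respect to $\lambda_j$, that is,

$$\frac{\partial X}{\partial \lambda_j} = \frac{\partial f(\lambda, P)}{\partial \lambda_j} = \left[\frac{\partial f(\lambda, p_1)}{\partial \lambda_j}, \ldots, \frac{\partial f(\lambda, p_n)}{\partial \lambda_j}\right]. \tag{14}$$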

The partial derivative of $f$ with respect to $\lambda_j$ can be derived according to the different forms of filters. For example, for the element-wise product, $\partial f / \partial \lambda_j = s_j \odot p$, where $s_j$ is a vector in which only the $j$th entry is 1 and the rest are zero. For the linear transformation, $\partial f / \partial \lambda_{i,j} = E_{i,j}\, p$, where $E_{i,j}$ is a $d \times d$ matrix consisting of all zeros except for the $(i, j)$th entry, which is 1.

It is worth pointing out that the optimization problem (i.e., Eq. (6)) of the proposed method is formulated by taking advantage of the Fisher criterion used in the conventional LDA method. However, the proposed method and LDA are significantly different. Firstly, LDA obtains the projection with a closed-form solution. In contrast, the proposed method obtains the filters (which can take the forms of element-wise product, linear transform or convolution) in an iterative manner. Secondly, the objective of LDA is to obtain the optimal projection (i.e., $\omega$ in Eq. (2)), while the proposed method aims to obtain the filters (i.e., $\lambda$ in Eq. (2)). In other words, the proposed method is not equivalent to LDA even if the linear transform is used.

3.3. Linear combination and regression learning

In this paper, each DIF is designed to discriminate a specific (non-neutral) facial expression from the neutral expression, which is a two-class classification problem. Suppose that there are $N$ different expressions; then $N$ DIFs (denoted as $\{\lambda_i\}_{i=1}^{N}$) will be trained. Given an image $p$ with an unknown expression, we aim to figure out the expression based on the outputs of the DIFs. Here, we denote the DIF corresponding to the expression label of the image $p$ as $\lambda_1^{+}$ and the other $(N-1)$ DIFs as $\{\lambda_i^{-}\}_{i=2}^{N}$.

To identify the expression of a test image, one simple way is to first train N expression-dependent classifiers (using the one-vs-all strategy), where each classifier is trained to discriminate a non-neutral expression from the other expressions based on the outputs of one DIF. Then, the facial expression corresponding to the classifier giving the highest probability output is identified.


However, such a way is not reliable since the correlation between different expressions is not considered and the recognition results may be inaccurate when two expressions share similar appearance variations. In other words, for a test image, the outputs of the six classifiers (corresponding to different image filters) are not discriminative enough for selecting the correct image filter. Besides, unbalanced data distribution may lead to the classifier overfitting to the majority class (note that the one-vs-all strategy is used).

In this paper, we solve the above problem by using the strategy of linear combination and regression learning. The steps are briefly given as follows. Firstly, a set of DIFs ($\{\lambda_i\}_{i=1}^{N}$) is applied to the input image $p$ so as to obtain the $N$ filtered images, defined as $\{s_i\}_{i=1}^{N}$. Note that only one filtered image (corresponding to the test expression) is enhanced while the other $N-1$ filtered images are suppressed. In other words, the enhanced image, generated by $f(\lambda_1^{+}, p)$, contains useful information for classification, while the suppressed images, generated by $f(\lambda_i^{-}, p)$, $i = 2, \ldots, N$, contain irrelevant information for classification. All the filtered images are linearly combined to generate the combined image. Secondly, based on the observation that the correlation between the filtered image $f(\lambda_1^{+}, p)$ and the input image $p$ is higher than those between the filtered images $f(\lambda_i^{-}, p)$, $i = 2, \ldots, N$, and the input image $p$, we propose to use a regression learning technique to yield an effective representation for the combined image (which generates a new image representation for the input image $p$).

Linear combination. We first linearly combine $\{s_i\}_{i=1}^{N}$ to generate a combined filtered image $s$:

$$s = \sum_{i=1}^{N} s_i = f(\lambda_1^{+}, p) + \sum_{i=2}^{N} f(\lambda_i^{-}, p). \tag{15}$$

Note that the weights corresponding to different filtered images are all set to 1. Therefore, given $n$ training images $P = [p_1, p_2, \ldots, p_n] \in \mathbb{R}^{d \times n}$ containing facial images of different expressions, $N \times n$ filtered images are generated, which are linearly combined as follows:

$$S = \sum_{i=1}^{N} f(\lambda_i, P), \tag{16}$$

where $S = [s_1, s_2, \ldots, s_n] \in \mathbb{R}^{d \times n}$ contains $n$ linearly combined filtered images. Different from the DIF learning, the computation of $S$ is general and not class-specific. Each $s_i$ ($i = 1, \ldots, n$) consists of an enhanced filtered image (corresponding to the expression label of $p_i$) and $(N-1)$ suppressed filtered images. The information in the suppressed filtered images can be considered as noise that should be removed without affecting the valuable expression information in the enhanced filtered images.
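For the element-wise filter, the combination in Eq. (16) reduces to a few lines. The sketch below assumes the $N$ learned DIFs are stacked as rows of a hypothetical `(N, d)` array named `filters`; the toy sizes are illustrative only.

```python
import numpy as np

def combine_filtered(filters, P):
    """S = sum_i f(lam_i, P) of Eq. (16), with unit weights and
    element-wise filters stored as the rows of `filters` (shape (N, d))."""
    return sum(lam[:, None] * P for lam in filters)

# Toy example: N = 6 filters, d = 16 pixels, n = 4 images.
rng = np.random.default_rng(0)
filters = rng.random((6, 16))
P = rng.random((16, 4))
S = combine_filtered(filters, P)   # one combined filtered image per column
```

Because the weights are all 1 and the element-wise filter is linear in `lam`, the result equals applying the summed filter `filters.sum(axis=0)` once, which gives a cheap consistency check.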

Regression learning. To extract the useful information and remove the irrelevant information in $S$, we propose to take advantage of the linear ridge regression (LRR) method, where we obtain a new representation for the input image. Mathematically, LRR solves the following optimization problem,

$$\min_{G} \; \|P - G^T S\|^2 + \beta\, \|G^T I\|^2, \tag{17}$$

where $I$ is a diagonal matrix (which is usually the identity matrix); $\|G^T I\|^2$ is the regularization term, and $\beta$ is the regularization parameter; $G$ is a transformation matrix, which projects the combined filtered images in $S$ onto a new space (i.e., generating new image representations); $\|\cdot\|$ denotes the Frobenius norm.

The closed-form solution of Eq. (17) can be written as:

$$G^{*} = (S S^{T} + \beta I)^{-1} S P^{T}. \tag{18}$$
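For completeness, Eq. (18) follows from setting the gradient of the objective in Eq. (17) to zero. The short derivation below assumes that $I$ is the identity matrix, so that $\|G^{T} I\|^{2} = \|G\|_F^{2}$; the symbol $J$ for the objective is introduced here only for convenience.

```latex
\begin{aligned}
J(G) &= \|P - G^{T}S\|_F^{2} + \beta\,\|G\|_F^{2},\\
\frac{\partial J}{\partial G} &= -2\,S\,(P - G^{T}S)^{T} + 2\beta\,G = 0,\\
(SS^{T} + \beta I)\,G &= S P^{T},\\
G^{*} &= (SS^{T} + \beta I)^{-1} S P^{T}.
\end{aligned}
```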

The optimal transformation matrix $G^{*}$ is expression-aware, since it encodes expression information, which not only improves the capability of distinguishing different expressions, but also reduces the influence of facial identity differences.

Based on $G^{*}$, the projected images are obtained as:

$$Y = G^{*T} S, \tag{19}$$

where $Y$ is defined as the IFSL images. Each projected image $y_n$ in $Y$ contains useful information for its corresponding expression label in $p_n$, which can be used for classification. $y_n$ and $p_n$ have the same dimension.
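Eqs. (18) and (19) translate directly into a linear solve followed by a matrix product. The sketch below is illustrative only: the function names are hypothetical, the data are random stand-ins rather than filtered facial images, and $I$ is assumed to be the identity matrix.

```python
import numpy as np

def ridge_transform(S, P, beta=2.0):
    """Closed-form solution of Eq. (18): G* = (S S^T + beta I)^(-1) S P^T."""
    d = S.shape[0]
    A = S @ S.T + beta * np.eye(d)
    # Solve the linear system instead of forming the explicit inverse.
    return np.linalg.solve(A, S @ P.T)

def project(G, S):
    """Eq. (19): Y = G*^T S."""
    return G.T @ S

rng = np.random.default_rng(0)
d, n = 64, 20                            # small sizes, for illustration only
P = rng.random(size=(d, n))              # stand-in for the training images
S = P + 0.1 * rng.normal(size=(d, n))    # stand-in for combined filtered images
G = ridge_transform(S, P, beta=2.0)
Y = project(G, S)
```

Using `np.linalg.solve` rather than an explicit inverse is a common numerical choice; it is not something the text prescribes.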

Note that the least squares (LS) method is also a popular regression learning technique. However, compared with LRR used in the proposed method, LS encounters the following problems. Firstly, LS is effective only if the independent variables are not well-correlated. However, the characteristics of facial expressions are usually well-correlated [57], which can greatly affect the performance of LS. Secondly, the variance estimation of LS may be large when the number of samples used is small. Thus, the results obtained by LS become unreliable when a limited number of training samples are used. Thirdly, suppose that $S \in \mathbb{R}^{d \times n}$ consists of $n$ $d$-dimensional feature samples obtained by the above procedures. $S S^{T}$ becomes a singular matrix if $n < d$, and thus the results obtained by LS can be overfitted. In contrast, LRR [23] effectively solves these problems by adding a regularization term to balance the deviation [55]. Therefore, the useful information in $S$ can be successfully preserved while irrelevant information is removed by using LRR.
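The singularity argument can be checked numerically. The following sketch uses random data with $n < d$ (the sizes are arbitrary choices for illustration) and compares the rank of $S S^{T}$ with that of the regularized matrix $S S^{T} + \beta I$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 10                      # fewer samples than dimensions (n < d)
S = rng.normal(size=(d, n))
gram = S @ S.T                     # d x d matrix whose rank is at most n

rank_ls = np.linalg.matrix_rank(gram)
beta = 2.0
rank_lrr = np.linalg.matrix_rank(gram + beta * np.eye(d))
```

Because $S S^{T}$ is positive semi-definite, every eigenvalue of $S S^{T} + \beta I$ is at least $\beta$, so the regularized system always has a unique solution when $\beta > 0$.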

The objective of the regression learning step is to learn an expression-aware transformation matrix (i.e., $G^{*}$). Based on $G^{*}$, we are able to obtain a new subspace, where the information in the enhanced images is preserved while that in the suppressed images is removed. This not only improves the capability of distinguishing different expressions, but also reduces the influence of facial identity differences. As a result, for an arbitrary image, we can obtain an image representation encoding effective expression information by projecting the combined filtered image onto the new subspace.

In summary, the advantages of combining linear combination and regression learning are two-fold. 1) We can effectively improve the discriminative capability for facial expression recognition by alleviating the influence of the distracting factors (such as facial identity). 2) For the test stage, we do not need to decide which image filter is to be used. Instead, we combine the filtered outputs and project them onto a subspace to obtain the facial image representation by using $G^{*}$.

3.4. The complete algorithm

In the previous subsections, we have developed all ingredients for the LR facial expression recognition method. Now we put them together to describe a complete algorithm for facial expression recognition.

The overall training stage of the proposed IFSL method is summarized in Algorithm 1, which returns a set of DIFs $\{\lambda_i\}_{i=1}^{N}$, an expression-aware transformation matrix $G^{*}$, and a classification model. The test stage is straightforward. Specifically, given a test image, IFSL firstly uses the set of DIFs $\{\lambda_i\}_{i=1}^{N}$ to generate $N$ filtered images. After linearly combining these $N$ filtered images, the transformation matrix $G^{*}$ is used to obtain the corresponding IFSL image (i.e., a new image representation for the test image). The final classification result is obtained by applying the trained model.


Algorithm 1: The training stage of the proposed IFSL method.

Input: A set of training images $P = [p_1, p_2, \ldots, p_n] \in \mathbb{R}^{d \times n}$, with the neutral expression set $P_n$ and the $N$ non-neutral expression sets $\{P_i\}_{i=1}^{N}$; the maximum number of iterations $\tau$.

Output: $\{\lambda_i\}_{i=1}^{N}$, $G^{*}$, and the facial expression classifier.

for $i = 1, \ldots, N$ do
  Randomly initialize $\lambda_i^{(0)}$;
  Select $P_n$ and $P_i$ as the inputs of Eq. (7);
  $t = 0$;
  while ($t < \tau$) do
    Compute $\omega^{*}$ according to Eq. (5);
    Compute $\partial O(\lambda_i^{(t)}) / \partial \lambda_i^{(t)}$ following Eq. (8) to Eq. (14);
    Update $\lambda_i^{(t)}$ using the conjugate gradient descent technique;
    $t = t + 1$;
  end while
  Obtain an optimal image filter $\lambda_i$ corresponding to the $i$th expression;
end for
Combine the images filtered using the learned DIFs $\{\lambda_i\}_{i=1}^{N}$ to obtain $S$ by Eq. (16);
Compute an expression-aware transformation matrix $G^{*}$ by Eq. (18);
Obtain the projected images $Y$ according to Eq. (19);
Obtain a facial expression classifier using $Y$ and the corresponding labels of the training data.
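The test stage described above can be sketched as follows. This is a hedged illustration, assuming the element-wise product filter function and that the DIFs and $G^{*}$ have already been learned; the function names, the toy classifier, and all numerical values are hypothetical placeholders.

```python
import numpy as np

def ifsl_representation(p, filters, G):
    """Map one vectorized test image p (length d) to its IFSL image.

    filters: (N, d) learned DIFs; G: (d, d) transformation matrix from Eq. (18).
    Assumes the element-wise product filter function.
    """
    s = (filters * p).sum(axis=0)   # filter with each DIF, then sum (Eq. (15))
    return G.T @ s                  # project onto the learned subspace (Eq. (19))

def classify(p, filters, G, classifier):
    """Apply a trained classifier (any callable on a feature vector)."""
    return classifier(ifsl_representation(p, filters, G))

# Toy usage with placeholder values (not learned quantities).
rng = np.random.default_rng(1)
N, d = 6, 16
filters = rng.normal(size=(N, d))
G = np.eye(d)
p = rng.random(d)
label = classify(p, filters, G, classifier=lambda y: int(y.sum() > 0))
```

Note that no expression label is needed to choose a filter at test time: all $N$ filters are applied and their outputs are summed before projection.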

3.5. Discussions

Firstly, the main advantage of the proposed IFSL method is that irrelevant or useless expression information can be significantly removed, while the useful information for LR facial expression recognition can be effectively preserved. The proposed IFSL method contains two key elements that contribute to the overall performance and effectiveness. (1) Image filtering. A discriminative image filter (DIF) is learned to distinguish a non-neutral expression from the neutral expression (by optimizing the cost function of LDA). The image filtered by the learned DIF contains the critical information for discriminating the non-neutral expression. (2) Subspace learning. An expression-aware transformation matrix is learned to encode the expression information and remove the identity information (by using the linear ridge regression technique). Fig. 2 shows the visualization of different facial expression images and the corresponding IFSL images. We can see that the similarities of the IFSL images obtained by the proposed method are higher than those of the raw images for each of the six expressions. The irrelevant information (e.g., facial identity difference) is suppressed and the valuable expression information around facial keypoints is preserved in the obtained IFSL images. Therefore, the preserved information in the IFSL images bears high discriminability for facial expression recognition.

Secondly, the reasons why the proposed IFSL method can be applied to LR facial expression recognition are two-fold: (1) The LR facial images usually contain noise due to the variations caused by illumination, pose and degradation in resolution [58]. The learned DIFs can effectively remove noise while preserving useful information in LR facial images. (2) The proposed IFSL method is a holistic recognition method, which performs subspace learning based on the whole facial appearances of LR images. Compared with the local recognition methods, the holistic recognition methods are less sensitive to the image resolution [1–3].

Finally, it is worth noting that there are some potential problems when the proposed method is applied to HR facial expression recognition. Firstly, it is difficult for the image filter to learn good parameters for recognizing HR facial images when the number of training samples is limited. This is because the number of parameters of the image filter is relatively large for HR facial images. Secondly, the proposed image filter becomes more sensitive to the misalignment problem for HR facial images. Thirdly, the proposed method suffers from high computational complexity if the sizes of images are large (see Section 4.3 for more details).

4. Experimental results

In this section, extensive experiments are conducted to evaluate the performance of the proposed IFSL method. In Section 4.1, we demonstrate the discriminability of the DIF on a synthetic dataset. In Section 4.2, we firstly introduce several popular facial expression datasets, and evaluate the performance of different filter functions. Moreover, we discuss the influence of different image sizes on the performance of IFSL. Then, we evaluate the performance of IFSL on the facial expression datasets, and compare IFSL with several state-of-the-art methods. In Section 4.3, we analyze the limitations of the proposed method.

Fig. 2. Visualization of six types of facial expression images (1st and 3rd rows) and the corresponding IFSL images (2nd and 4th rows). Here, the element-wise product filter function is used.

Fig. 3. Experimental results by applying the learned DIF to a synthetic dataset. The first row presents a randomly generated synthetic dataset, showing 10 positive samples in the left half part and 10 negative samples in the right half part. The second to the fourth rows respectively show the unfiltered images, the corresponding filtered images and the trained filters. Filter I is trained when D is set to 50 while filter II is trained when D is set to 150.

Fig. 4. Recognition results obtained by the proposed IFSL method with different filter functions on the three facial expression datasets. (a) The recognition rates obtained by IFSL with SVM. (b) The recognition rates obtained by IFSL with k-NN.

To show the influence of the classifier on the proposed method, we use two classifiers (i.e., SVM and k-NN) for comparison: (1) The support vector machine (SVM) classifier is one of the most popular classifiers for the task of facial expression recognition [57]. SVM uses a kernel function to project samples to a high-dimensional space. Popular kernels include linear, polynomial, and radial basis functions (RBF). To avoid overfitting, we use the linear kernel in the following experiments. (2) The k-nearest neighbor (k-NN) classifier is regarded as the simplest instance-based classifier [23]. A sample is classified by a majority vote of its k nearest neighbors. We set the value of k to 3 in the following experiments.
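The majority-vote rule of k-NN with k = 3 can be sketched as below. The Euclidean distance is an assumption made for this illustration (the text does not name the distance measure), and the two-dimensional toy features and the function name `knn_predict` are hypothetical.

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by a majority vote of its k nearest training samples."""
    X_train = np.asarray(X_train, dtype=float)
    # Euclidean distance from x to every training sample (assumed metric).
    dists = np.linalg.norm(X_train - np.asarray(x, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy features standing in for IFSL images; labels use the paper's abbreviations.
X_train = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]]
y_train = ["Ha", "Ha", "Ha", "Sa", "Sa"]
pred = knn_predict(X_train, y_train, [0.15, 0.15])
```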

4.1. Experiments on a synthetic dataset

In this experiment, we validate the discriminability of the proposed DIF (the element-wise product filter is employed) by using a synthetic dataset, where we visually demonstrate that the learned DIF can effectively extract discriminative information. We generate a synthetic dataset consisting of one positive class and one negative class. Note that here we do not use the linear combination step and the regression learning step (described in Section 3.3), since this is a two-class classification task.

More specifically, the synthetic dataset is comprised of D (the value of D is set to 50 and 150, respectively) synthetic samples, including D/2 positive samples and D/2 negative samples. The patch size of each synthetic sample is 16×16 (thus d=256). For each positive sample, we generate a horizontal white line at a random position crossing from the left side to the right side, while such a line does not exist for the negative samples, as illustrated in Fig. 3. Both negative and positive samples are contaminated by randomly generated white noise. The DIF is then trained using all the samples. Therefore, the synthetic dataset is used to evaluate whether the learned DIF can suppress the useless information (i.e., white noise) in the positive samples while preserving the useful information (i.e., horizontal white lines).
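A dataset of this kind can be generated along the following lines. This is a sketch under stated assumptions: the noise is taken to be Gaussian with a hypothetical standard deviation `noise_std`, since the text does not specify the noise distribution or level, and the function name `make_synthetic` is illustrative.

```python
import numpy as np

def make_synthetic(D=50, size=16, noise_std=0.3, seed=0):
    """Generate D noisy size x size patches: D/2 positives, D/2 negatives.

    Positives contain one horizontal white line at a random row.
    noise_std is a hypothetical choice; the text does not state the noise level.
    """
    rng = np.random.default_rng(seed)
    half = D // 2
    X = np.zeros((D, size, size))
    rows = rng.integers(0, size, size=half)
    X[np.arange(half), rows, :] = 1.0            # white line across the patch
    X += noise_std * rng.normal(size=X.shape)    # contaminate every sample
    y = np.array([1] * half + [0] * (D - half))
    # Return vectorized samples as columns (d x D), matching P in the text.
    return X.reshape(D, -1).T, y, rows

P, y, rows = make_synthetic(D=50)
```

Calling `make_synthetic(D=150)` gives the larger of the two settings used in the experiment.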

Fig. 3 also shows the experimental results obtained by using the proposed DIF on the synthetic dataset. We can observe that the noise in the positive samples is successfully suppressed by the learned DIF and the white lines are well preserved. Moreover, the filtering performance obtained by DIF (when D=150) is better than that obtained by DIF (when D=50), since the noise in the filtered positive samples when D=150 is much less than that when D=50. Moreover, the positive horizontal lines are effectively preserved when D=150. However, these lines are slightly suppressed when D=50 because only a limited number of samples is used. In general, the learned DIF can effectively extract useful information while at the same time filtering out irrelevant information for classification.

4.2. Experiments on facial expression recognition

In this section, we extensively demonstrate the performance of the proposed IFSL for facial expression recognition.

4.2.1. Facial expression datasets

We evaluate the performance of the proposed method on both in-the-lab facial expression datasets (including CK+, JAFFE and MMI) and in-the-wild facial expression datasets (including SFEW [62] and RAF-DB [42]). A brief introduction of these datasets is given as follows.


Fig. 5. Recognition results obtained by the proposed IFSL method with different image sizes on the three facial expression datasets. (a) The recognition rates obtained by IFSL with SVM. (b) The recognition rates obtained by IFSL with k-NN.

The extended Cohn-Kanade (CK+) facial expression dataset¹, which is an extended version of the previous Cohn-Kanade (CK) dataset, consists of 593 short videos from 123 subjects with different ages under uniform illumination [59]. All videos vary in duration (i.e., from 10 to 60 frames) and start from the first neutral frame to the last frame with a peak expression. The MMI dataset² includes more than 43 subjects, who express facial emotions non-uniformly and spontaneously. 213 video sequences in MMI have been labeled with six basic expressions, where some subjects wear hats, hoods, or glasses. The JAFFE dataset³ is an expression dataset consisting of 219 images from 10 Japanese female subjects [60]. There are three or four images for each subject with each expression. The Static Facial Expressions in the Wild (SFEW) dataset⁴ is collected by selecting static frames from Acted Facial Expressions in the Wild (AFEW) [63]. The SFEW dataset contains 95 subjects with unconstrained facial expressions (such as different poses and ages). RAF-DB⁵ is a large-scale dataset captured from the Internet. This dataset contains about 30,000 facial images of thousands of subjects annotated with basic or compound expressions by 40 trained human labelers. In our experiment, only images with basic facial expressions are used. In total, there are 1007 different facial expression images selected from CK+, 606 images selected from MMI, 219 images selected from JAFFE, 663 images selected from SFEW, and 15,339 images selected from RAF-DB.

In all experiments, six basic non-neutral expressions and one neutral expression are selected from each of the three datasets. The six basic non-neutral expressions include angry, disgust, fear, happy, surprise, and sad expressions, which are respectively abbreviated as An, Di, Fe, Ha, Su, and Sa in this paper. Thus N=6 for the following experiments.

Following [37], for each image in the dataset, we firstly manually locate the eye position and crop a 110×150 patch covering the facial region. Then, the manually cropped facial images are resized to the size of 32×32 (i.e., d=1024) and converted to gray-scale images. We conduct 10-fold cross-validation on all subjects to evaluate the performance of the proposed method, as done in [30]. The training set is used to train the six DIFs corresponding to the six expressions and learn the transformation matrix $G^{*}$ in Eq. (18). Their corresponding IFSL images are used to train a classifier. The neutral expression training images are shared during the training process of the six DIFs. For the parameter settings, we empirically set C=0.1 in Eq. (6). The value of the regularization parameter $\beta$ is set to 2.0.

¹ http://www.pitt.edu/~emotion/ck-spread.htm
² http://www.mmifacedb.com
³ http://www.kasrl.org/jaffe.html
⁴ https://cs.anu.edu.au/few/emotiw2015.html
⁵ http://www.whdeng.cn/RAF/model1.html

We report the recognition rates obtained by the proposed IFSL on each of the six expressions and the average recognition rates using either SVM or k-NN.

4.2.2. Influence of different filter functions

As mentioned previously (see Section 3.1), a variety of filter functions can be used in the optimization process of LDA. In this section, we evaluate the performance of IFSL with different filter functions, including the element-wise product (also called dot product), linear transform and convolution functions. Here, we use the three in-the-lab facial expression datasets for performance evaluation.

The performance obtained by the proposed IFSL with different filter functions is shown in Fig. 4, where the results obtained using the SVM classifier are given in Fig. 4(a) and those obtained using the k-NN classifier are shown in Fig. 4(b). We can observe that the proposed IFSL with the dot product filter function achieves better performance than that with the other two filter functions on all the three datasets. The proposed IFSL with the linear transform achieves the worst results among the three filter functions. This is mainly because the dot product filter directly influences the pixel-level values in the facial image, while the other two filter functions operate on the whole facial image. Therefore, the direct change of pixel-level values seems to be more effective for facial expression recognition since the number of training samples is limited. In other words, the discriminative local facial regions can be enhanced and the irrelevant local facial regions are suppressed by using the dot product filter. Furthermore, the recognition rates obtained by IFSL using SVM are higher than those obtained by IFSL using k-NN.

Therefore, in the following experiments, we will choose the dot product filter as the filter function of the proposed IFSL method.

4.2.3. Influence of different image sizes

In this section, we evaluate the performance of IFSL with different image sizes, including 8×8, 16×16 and 32×32. We also use the three in-the-lab facial expression datasets for performance evaluation.

The performance obtained by the proposed IFSL with different image sizes is shown in Fig. 5, where the results obtained using the SVM classifier are given in Fig. 5(a) and those obtained using the k-NN classifier are shown in Fig. 5(b). We can observe that the proposed IFSL with the image size of 32×32 achieves the best recognition rates, while IFSL with the image size of 8×8 obtains the worst results on the three datasets. The higher the image resolution is, the better the recognition performance is. This is mainly due to the fact that the proposed IFSL method can benefit from exploiting more information in the higher-resolution images. Furthermore, the recognition rates obtained by IFSL using SVM are higher than those obtained by IFSL using k-NN.

Fig. 6. Recognition results obtained by the proposed method with two different classifiers on the three facial expression datasets. (a) The recognition rate for six different expressions obtained by IFSL with SVM. (b) The recognition rate for six different expressions obtained by IFSL with k-NN. (c) The average recognition rates obtained by the proposed IFSL using either SVM or k-NN on the three datasets.

4.2.4. Performance of the proposed method

In the following experiments, we show the performance of the proposed IFSL method in handling the task of multi-class facial expression recognition.

Table 1 shows the confusion matrix obtained by the proposed IFSL using the SVM classifier on the CK+ dataset. IFSL achieves good performance on all the expressions. However, we can also observe that some samples corresponding to the sad and disgust expressions are misclassified as the angry expression. This is because both the disgust and sad expressions have wrinkled eyebrows, and they share some similarities with the angry expression. Moreover, the subjects with the sad or disgust expressions do not have obvious motions around the mouth area, which also leads to the incorrect classification of these two expressions for some samples.
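A row-normalized confusion matrix of the kind reported in Table 1 (rows as true expressions, columns as predicted expressions, entries in percent) can be computed as in the sketch below. The label sequences are toy values chosen for illustration and are not the paper's predictions; the function name is hypothetical.

```python
import numpy as np

def confusion_matrix_percent(y_true, y_pred, labels):
    """Row-normalized confusion matrix in percent (rows: true, columns: predicted)."""
    index = {lab: i for i, lab in enumerate(labels)}
    counts = np.zeros((len(labels), len(labels)))
    for t, p in zip(y_true, y_pred):
        counts[index[t], index[p]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Avoid division by zero for classes that do not occur in y_true.
    return 100.0 * counts / np.maximum(row_sums, 1)

labels = ["An", "Di", "Fe", "Ha", "Su", "Sa"]
# Toy labels for illustration only; these are not the paper's predictions.
y_true = ["An", "An", "Di", "Di", "Sa", "Sa", "Ha", "Fe", "Su"]
y_pred = ["An", "An", "An", "Di", "An", "Sa", "Ha", "Fe", "Su"]
cm = confusion_matrix_percent(y_true, y_pred, labels)
```

Each diagonal entry is then the recognition rate of one expression, and an off-diagonal entry in the An column shows how often another expression is mistaken for angry.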

We also show the classification results for each expression obtained by the proposed method using the SVM and k-NN classifiers on all the datasets in a more compact way in Fig. 6. From the classification results for the six expressions on the three datasets with SVM (see Fig. 6(a)), IFSL achieves the top recognition rates on the happy expression while it obtains the worst results on the sad expression. These results can also be observed when using k-NN (see Fig. 6(b)). This is because the happy expression has very obvious appearance variations around the mouth area compared with the other expressions. In contrast, the sad expression does not have significant appearance variations around the salient facial feature points. Actually, similar observations have been discussed in some other works [9,37]. We also observe that the sad and fear expressions sometimes show similar appearance for the subjects in the three facial expression datasets. Thus, IFSL cannot accurately discriminate these two expressions for some subjects. As we can see in Fig. 6(c), IFSL with either SVM or k-NN achieves the best performance on the CK+ dataset. This is because CK+ is the simplest and the largest dataset among these three datasets. However, IFSL achieves the worst performance on the JAFFE dataset, since the number of the training samples is limited.

Table 1
The confusion matrix obtained by IFSL with SVM on the CK+ dataset. The best results are boldfaced.

| (%) | An | Di | Fe | Ha | Su | Sa |
|---|---|---|---|---|---|---|
| An | **97.01** | 2.99 | 0 | 0 | 0 | 0 |
| Di | 2.27 | **96.59** | 0 | 1.14 | 0 | 0 |
| Fe | 0 | 0 | **100** | 0 | 0 | 0 |
| Ha | 0 | 0 | 0 | **100** | 0 | 0 |
| Su | 0 | 0 | 0 | 0 | **100** | 0 |
| Sa | 2.5 | 0 | 0 | 0 | 0 | **97.5** |

4.2.5. Comparison with the state-of-the-art methods

We compare the proposed IFSL with several state-of-the-art methods. These methods include PCA [38], multi-class LDA [38], m-LBP [36], Boosted-LBP [37], PLBP [6], CSPL [35], HMFF [61], SalientPatch [9], CS-APL [64], MSCNN [65], pACNN [21], gACNN [21], DLP-CNN [41], and DAM-CNN [40]. The choice of these competing methods is based on the following reasons: 1) PCA and multi-class LDA are two widely-used subspace learning methods for facial expression recognition. We use these two methods as the baseline. 2) The LBP-based methods (i.e., m-LBP, Boosted-LBP and PLBP) are regarded as powerful feature extraction methods which achieve the state-of-the-art performance. 3) SalientPatch, CSPL and CS-APL are proposed to address part-based image representation and they effectively extract expression information in local facial regions, which is similar to the proposed method. 4) HMFF uses a subspace analysis method based on a hierarchical feature extraction framework, which also aims to enhance the discriminability of different expressions. 5) MSCNN, pACNN, gACNN, DLP-CNN and DAM-CNN are the representative CNN-based facial expression recognition methods. MSCNN can effectively extract spatial features under the supervision of recognition and verification signals. pACNN and gACNN adopt the attention mechanism in CNN. DLP-CNN employs the deep locality-preserving feature learning for FER. DAM-CNN designs a deep multi-path convolutional neural network by taking advantage of salient region attention. For all the competing methods, we use the default parameters from the respective papers.

Comparison results on the CK+ dataset. Table 2 compares the proposed IFSL method with the state-of-the-art methods on the CK+ dataset. We can see that the proposed IFSL method achieves the best recognition rate, and significantly outperforms the traditional feature learning methods (such as CSPL, HMFF, SalientPatch) on the CK+ dataset. PCA and multi-class LDA achieve poor performance on the CK+ dataset. This is mainly because the learned subspace obtained by either PCA or multi-class LDA is not discriminative enough to distinguish different facial expressions. The performance obtained by IFSL with SVM is better than that obtained by IFSL with k-NN in terms of the average recognition performance. Moreover, compared with some competing methods, such as m-LBP, Boosted-LBP, CSPL and MSCNN, IFSL with the simple k-NN classifier achieves promising performance. As mentioned previously, k-NN makes a classification decision by using the majority vote of its k nearest neighbors, which indicates that intra-class variations are small and inter-class variations are large in the transformed subspace obtained by IFSL. In other words, the distributions of the samples corresponding to different expressions are well-separated in the subspace obtained by the proposed IFSL method. Compared with the CNN-based methods (such as MSCNN, pACNN, gACNN and DAM-CNN), the proposed IFSL still achieves better performance when only limited training data are available. Therefore, the proposed method can effectively extract the discriminative and compact features from the LR facial images.

Table 2
Comparison results obtained by all the competing methods on the CK+ dataset. The best results are boldfaced.

| Methods | Accuracy (%) |
|---|---|
| PCA (k-NN) | 43.8 |
| PCA (SVM) | 47.3 |
| multi-class LDA (k-NN) | 84.7 |
| multi-class LDA (SVM) | 87.1 |
| m-LBP [36] | 88.4 |
| Boosted-LBP [37] | 91.1 |
| PLBP [6] | 95.2 |
| CSPL [35] | 89.9 |
| HMFF [61] | 96.1 |
| SalientPatch [9] | 94.0 |
| CS-APL [64] | 98.6 |
| MSCNN [65] | 95.5 |
| pACNN [21] | 97.0 |
| gACNN [21] | 96.4 |
| DAM-CNN [40] | 95.9 |
| IFSL (SVM) | **98.7** |
| IFSL (k-NN) | 96.6 |

Table 3
Comparison results obtained by all the competing methods on the MMI dataset. The best results are boldfaced.

| Methods | Accuracy (%) |
|---|---|
| PCA (k-NN) | 65.6 |
| PCA (SVM) | 67.9 |
| multi-class LDA (k-NN) | 68.3 |
| multi-class LDA (SVM) | 71.0 |
| Boosted-LBP [37] | 86.9 |
| PLBP [6] | 91.0 |
| CSPL [35] | 73.5 |
| MSCNN [65] | 77.1 |
| pACNN [21] | 70.4 |
| gACNN [21] | 69.0 |
| DLP-CNN [41] | 78.5 |
| IFSL (SVM) | **92.6** |
| IFSL (k-NN) | 88.9 |

Comparison results on the MMI dataset. Table 3 shows the comparison results obtained by the proposed IFSL method and some state-of-the-art methods on the MMI dataset. MMI is a well-known challenging facial expression dataset due to non-uniformly posed expressions and various headwear. The proposed method with SVM obtains higher accuracy than PLBP [6], and it achieves much better performance than CSPL [35]. PLBP uses images with the size of 110×150 and CSPL uses images with the size of 95×95. The image resolutions used in these two methods are much larger than the image resolution used in IFSL (i.e., 32×32). From Table 3, we can see that PCA obtains a much worse recognition rate than multi-class LDA. This is because multi-class LDA effectively reduces the within-class scatter while enlarging the between-class scatter. However, multi-class LDA is not able to discriminate the classes close to each other since large class distances are often overemphasized during training. In contrast, the proposed IFSL method benefits from the class-specific filter learning and linear ridge regression techniques, which can distinguish a specific facial expression from the neutral expression and extract discriminative expression information from the facial image, respectively. IFSL also achieves better performance than MSCNN, which shows the effectiveness of the proposed method for LR facial expression recognition. The main reason is that MSCNN suffers from the problem of insufficient training data. pACNN, gACNN and DAM-CNN achieve worse results than the proposed method on the MMI dataset. This is mainly because these methods use the CNN model trained on other datasets for feature extraction without fine-tuning.

Table 4
Comparison results obtained by all the competing methods on the JAFFE dataset. The best results are boldfaced.

| Methods | Accuracy (%) |
|---|---|
| PCA (k-NN) | 52.4 |
| PCA (SVM) | 55.6 |
| multi-class LDA (k-NN) | 62.7 |
| multi-class LDA (SVM) | 64.4 |
| Boosted-LBP [37] | 82.0 |
| MSCNN [65] | 85.1 |
| DAM-CNN [40] | **99.3** |
| IFSL (SVM) | 88.2 |
| IFSL (k-NN) | 76.4 |

Table 5
Comparison results obtained by all the competing methods on the SFEW dataset. The best results are boldfaced.

| Methods | Accuracy (%) |
|---|---|
| PCA (k-NN) | 23.4 |
| PCA (SVM) | 28.1 |
| multi-class LDA (k-NN) | 34.9 |
| multi-class LDA (SVM) | 39.3 |
| MSCNN [65] | 47.9 |
| gACNN [21] | **51.7** |
| pACNN [21] | 49.8 |
| DLP-CNN [41] | 51.1 |
| IFSL (SVM) | 46.5 |
| IFSL (k-NN) | 43.2 |

Comparison results on the JAFFE dataset. Table 4 shows the comparison results obtained by the proposed IFSL method and some state-of-the-art methods on the JAFFE dataset. The performance of only a few existing methods is evaluated on JAFFE, since it is a small dataset. IFSL achieves relatively lower accuracy on JAFFE than that on MMI and CK+. Similarly, Boosted-LBP also achieves the worst performance on JAFFE, compared with its performance on the other two datasets. This observation is especially obvious for multi-class LDA and PCA. This is mainly because JAFFE has a very small number of samples for training, which makes the obtained facial expression features less effective. Note that IFSL achieves worse performance than DAM-CNN. The main reason is that DAM-CNN fine-tunes the VGG model trained on a large-scale dataset. Moreover, the input image size of DAM-CNN (i.e., 224×224) is much larger than that of IFSL (i.e., 32×32). In contrast, IFSL learns the parameters of DIF by only using the small training set.

Comparison results on the SFEW dataset. Table 5 compares the proposed IFSL method with several state-of-the-art methods on the SFEW dataset. Among all the competing methods, gACNN and DLP-CNN respectively achieve the best and second-best performance (51.7% and 51.1%, respectively), which are better than the proposed IFSL with SVM by a moderate margin (5.2% and 4.6%, respectively). This