Figure 6.4. Forward and backward motion estimation. Adapted from [39, Fig. 5.5].

anchor and tracked frames, respectively. In general, we can represent the motion field as $\mathbf{d}(\mathbf{x};\mathbf{a})$, where $\mathbf{a} = [a_1, a_2, \ldots, a_L]^T$ is a vector containing all the motion parameters. Similarly, the mapping function can be denoted by $\mathbf{w}(\mathbf{x};\mathbf{a}) = \mathbf{x} + \mathbf{d}(\mathbf{x};\mathbf{a})$. The motion estimation problem is to estimate the motion parameter vector $\mathbf{a}$. Methods that have been developed can be categorized into two groups: feature-based and intensity-based. In the feature-based approach, correspondences between pairs of selected feature points in two video frames are first established. The motion model parameters are then obtained by a least squares fitting of the established correspondences into the chosen motion model. This approach is only applicable to parametric motion models and can be quite effective in, say, determining global motions. The intensity-based approach applies the constant intensity assumption or the optical flow equation at every pixel and requires the estimated motion to satisfy this constraint as closely as possible. This approach is more appropriate when the underlying motion cannot be characterized by a simple model and an estimate of a pixel-wise or block-wise motion field is desired.

In this chapter, we only consider intensity-based approaches, which are more widely used in applications requiring motion-compensated prediction and filtering. In general, the intensity-based motion estimation problem can be converted into an optimization problem, and three key questions need to be answered: i) how to parameterize the underlying motion field? ii) what criterion to use to estimate the parameters? and iii) how to search for the optimal parameters? In this section, we first describe several ways to represent a motion field. Then we introduce different types of estimation criteria. Finally, we present search strategies commonly used for motion estimation.

6.2.1 Motion Representation

A key problem in motion estimation is how to parameterize the motion field. As shown in Sec. 5.5, the 2D motion field resulting from a camera or object motion can usually be described by a few parameters. However, there are usually multiple objects in the imaged scene that move differently. Therefore, a global parameterized model is usually not adequate. The most direct and unconstrained approach is to specify the motion vector at every pixel. This is the so-called pixel-based representation. Such a representation is universally applicable, but it requires the estimation of a large number of unknowns (twice the number of pixels!) and the solution can often be physically incorrect unless a proper physical constraint is imposed during the estimation step. On the other hand, if only the camera is moving or the imaged scene contains a single moving object with a planar surface, one could use a global motion representation to characterize the entire motion field. In general, for scenes containing multiple moving objects, it is more appropriate to divide an image frame into multiple regions so that the motion within each region can be characterized well by a parameterized model. This is known as region-based motion representation, which consists of a region segmentation map and several sets of motion parameters, one for each region. The difficulty with such an approach is that one does not know in advance which pixels have similar motions. Therefore, segmentation and estimation have to be accomplished iteratively, which requires an intensive amount of computation that may not be feasible in practice.

One way to reduce the complexity associated with region-based motion representation is by using a fixed partition of the image domain into many small blocks. As long as each block is small enough, the motion variation within each block can be characterized well by a simple model and the motion parameters for each block can be estimated independently. This brings us to the popular block-based representation. The simplest version models the motion in each block by a constant translation, so that the estimation problem becomes that of finding one MV for each block. This method provides a good compromise between accuracy and complexity, and has found great success in practical video coding systems. One main problem with the block-based approach is that it does not impose any constraint on the motion transition between adjacent blocks. The resulting motion is often discontinuous across block boundaries, even when the true motion field is changing smoothly from block to block. One approach to overcome this problem is by using a mesh-based representation, by which the underlying image frame is partitioned into non-overlapping polygonal elements. The motion field over the entire frame is described by the MVs at the nodes (corners of polygonal elements) only, and the MVs at the interior points of an element are interpolated from the nodal MVs. This representation induces a motion field that is continuous everywhere. It is more appropriate than the block-based representation over interior regions of an object, which usually undergoes a continuous motion, but it fails to capture motion discontinuities at object boundaries. Adaptive schemes that allow discontinuities when necessary are needed for more accurate motion estimation. Figure 6.5 illustrates the effect of several motion representations described above for a head-and-shoulder scene. In the next few sections, we will introduce motion estimation methods using different motion representations.

Figure 6.5. Different motion representations: (a) global, (b) pixel-based, (c) block-based, and (d) region-based. From [38, Fig. 3].

Footnote: Region-based motion representation is sometimes called object-based motion representation [27]. Here we use the word "region-based" to acknowledge the fact that we are only considering 2D motions, and that a region with

6.2.2 Motion Estimation Criteria

For a chosen motion model, the problem is how to estimate the model parameters. In this section, we describe several different criteria for estimating motion parameters.

Criterion Based on Displaced Frame Difference

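The most popular criterion for motion estimation is to minimize the sum of the errors between the luminance values of every pair of corresponding points between the anchor frame $\psi_1$ and the tracked frame $\psi_2$. Recall that $\mathbf{x}$ in $\psi_1$ is moved to $\mathbf{w}(\mathbf{x};\mathbf{a})$ in $\psi_2$. Therefore, the objective function can be written as

\[
E_{\mathrm{DFD}}(\mathbf{a}) = \sum_{\mathbf{x}\in\Lambda} \left| \psi_2(\mathbf{w}(\mathbf{x};\mathbf{a})) - \psi_1(\mathbf{x}) \right|^p, \tag{6.2.1}
\]

where $\Lambda$ is the domain of all pixels in $\psi_1$, and $p$ is a positive number. When $p = 1$, the above error is called the mean absolute difference (MAD), and when $p = 2$, the mean squared error (MSE). The error image, $e(\mathbf{x};\mathbf{a}) = \psi_2(\mathbf{w}(\mathbf{x};\mathbf{a})) - \psi_1(\mathbf{x})$, is usually called the displaced frame difference (DFD) image, and the above measure is called the DFD error.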
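To make the criterion concrete, the following minimal Python sketch evaluates Eq. (6.2.1) for the simplest motion model, a single integer translation applied to the whole frame. The function name `dfd_error`, the list-of-lists frame layout, and the rule of skipping pixels whose match falls outside the frame are assumptions made for this illustration; they are not taken from the text.

```python
def dfd_error(anchor, tracked, d, p=2):
    """Sum of |psi2(x + d) - psi1(x)|^p over pixels whose match stays in the frame.

    anchor, tracked: 2-D lists of luminance values indexed [row][col].
    d: integer displacement (dx, dy) shared by all pixels (pure translation).
    """
    dx, dy = d
    rows, cols = len(anchor), len(anchor[0])
    total = 0.0
    for y in range(rows):
        for x in range(cols):
            xs, ys = x + dx, y + dy
            if 0 <= xs < cols and 0 <= ys < rows:  # assumption: ignore pixels mapped outside
                total += abs(tracked[ys][xs] - anchor[y][x]) ** p
    return total


# Anchor frame with a bright 2x2 patch; in the tracked frame it sits one pixel to the right.
anchor = [
    [0, 0, 0, 0, 0],
    [0, 9, 9, 0, 0],
    [0, 9, 9, 0, 0],
    [0, 0, 0, 0, 0],
]
tracked = [
    [0, 0, 0, 0, 0],
    [0, 0, 9, 9, 0],
    [0, 0, 9, 9, 0],
    [0, 0, 0, 0, 0],
]

err_true = dfd_error(anchor, tracked, (1, 0), p=1)  # candidate equal to the true motion
err_zero = dfd_error(anchor, tracked, (0, 0), p=1)  # candidate assuming no motion
```

With these frames the candidate (1, 0) gives a zero error, while the zero-motion candidate leaves a nonzero residual along the left and right edges of the patch.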

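The necessary condition for minimizing $E_{\mathrm{DFD}}$ is that its gradient vanishes. In the case of $p = 2$, this gradient is

\[
\frac{\partial E_{\mathrm{DFD}}}{\partial \mathbf{a}} = 2 \sum_{\mathbf{x}\in\Lambda} \left( \psi_2(\mathbf{w}(\mathbf{x};\mathbf{a})) - \psi_1(\mathbf{x}) \right) \frac{\partial \mathbf{d}(\mathbf{x})}{\partial \mathbf{a}} \nabla\psi_2(\mathbf{w}(\mathbf{x};\mathbf{a})), \tag{6.2.2}
\]

where

\[
\frac{\partial \mathbf{d}}{\partial \mathbf{a}} =
\begin{bmatrix}
\frac{\partial d_x}{\partial a_1} & \frac{\partial d_x}{\partial a_2} & \cdots & \frac{\partial d_x}{\partial a_L} \\
\frac{\partial d_y}{\partial a_1} & \frac{\partial d_y}{\partial a_2} & \cdots & \frac{\partial d_y}{\partial a_L}
\end{bmatrix}^T.
\]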

Criterion Based on Optical Flow Equation

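Instead of minimizing the DFD error, another approach is to solve the system of equations established based on the optical flow constraint given in Eq. (6.1.3). Let $\psi_1(x,y) = \psi(x,y,t)$ and $\psi_2(x,y) = \psi(x,y,t+d_t)$. If $d_t$ is small, we can assume $\frac{\partial\psi}{\partial t} d_t = \psi_2(\mathbf{x}) - \psi_1(\mathbf{x})$. Then, Eq. (6.1.3) can be written as

\[
\frac{\partial\psi_1}{\partial x} d_x + \frac{\partial\psi_1}{\partial y} d_y + (\psi_2 - \psi_1) = 0
\quad\text{or}\quad
\nabla\psi_1^T \mathbf{d} + (\psi_2 - \psi_1) = 0. \tag{6.2.3}
\]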

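This discrete version of the optical flow equation is more often used for motion estimation in digital videos. Solving the above equations for all $\mathbf{x}$ can be turned into a minimization problem with the following objective function:

\[
E_{\mathrm{flow}}(\mathbf{a}) = \sum_{\mathbf{x}\in\Lambda} \left| \nabla\psi_1(\mathbf{x})^T \mathbf{d}(\mathbf{x};\mathbf{a}) + \psi_2(\mathbf{x}) - \psi_1(\mathbf{x}) \right|^p. \tag{6.2.4}
\]

The gradient of $E_{\mathrm{flow}}$ is, when $p = 2$,

\[
\frac{\partial E_{\mathrm{flow}}}{\partial \mathbf{a}} = 2 \sum_{\mathbf{x}\in\Lambda} \left( \nabla\psi_1(\mathbf{x})^T \mathbf{d}(\mathbf{x};\mathbf{a}) + \psi_2(\mathbf{x}) - \psi_1(\mathbf{x}) \right) \frac{\partial \mathbf{d}(\mathbf{x})}{\partial \mathbf{a}} \nabla\psi_1(\mathbf{x}). \tag{6.2.5}
\]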

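If the motion field is constant over a small region $\Lambda_0$, i.e., $\mathbf{d}(\mathbf{x};\mathbf{a}) = \mathbf{d}_0$ for $\mathbf{x}\in\Lambda_0$, then Eq. (6.2.5) becomes

\[
\frac{\partial E_{\mathrm{flow}}}{\partial \mathbf{d}_0} = 2 \sum_{\mathbf{x}\in\Lambda_0} \left( \nabla\psi_1(\mathbf{x})^T \mathbf{d}_0 + \psi_2(\mathbf{x}) - \psi_1(\mathbf{x}) \right) \nabla\psi_1(\mathbf{x}). \tag{6.2.6}
\]

Setting the above gradient to zero yields the least squares solution for $\mathbf{d}_0$:

\[
\mathbf{d}_0 = \left( \sum_{\mathbf{x}\in\Lambda_0} \nabla\psi_1(\mathbf{x}) \nabla\psi_1(\mathbf{x})^T \right)^{-1} \left( \sum_{\mathbf{x}\in\Lambda_0} \left( \psi_1(\mathbf{x}) - \psi_2(\mathbf{x}) \right) \nabla\psi_1(\mathbf{x}) \right). \tag{6.2.7}
\]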

When the motion is not a constant, but can be related to the model parameters linearly, one can still derive a similar least-squares solution. See Prob. 6.6 in the Problem section.
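The closed-form solution in Eq. (6.2.7) amounts to solving a 2x2 linear system. The sketch below is one possible illustration, assuming central differences for the spatial gradient of the anchor frame and using all interior pixels of a small frame as the region. The helper names `central_gradients` and `constant_flow_ls` and the test pattern `psi` are hypothetical and introduced only for this example.

```python
def central_gradients(frame):
    """Central-difference gradients (gx, gy) at the interior pixels of a 2-D list."""
    rows, cols = len(frame), len(frame[0])
    grads = {}
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            gx = (frame[y][x + 1] - frame[y][x - 1]) / 2.0
            gy = (frame[y + 1][x] - frame[y - 1][x]) / 2.0
            grads[(x, y)] = (gx, gy)
    return grads


def constant_flow_ls(anchor, tracked):
    """Least squares MV of Eq. (6.2.7), shared by all interior pixels."""
    a11 = a12 = a22 = b1 = b2 = 0.0
    for (x, y), (gx, gy) in central_gradients(anchor).items():
        diff = anchor[y][x] - tracked[y][x]  # psi1(x) - psi2(x)
        a11 += gx * gx                       # entries of sum(grad * grad^T)
        a12 += gx * gy
        a22 += gy * gy
        b1 += diff * gx                      # entries of sum((psi1 - psi2) * grad)
        b2 += diff * gy
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("gradient matrix is singular; the MV is not determined")
    # Explicit inverse of the symmetric 2x2 matrix applied to (b1, b2).
    return (a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det


def psi(x, y):
    return x * x + 2.0 * y * y + x * y  # hypothetical smooth test pattern


# The tracked frame is the pattern moved by d = (0.2, -0.1): psi2(x) = psi1(x - d).
anchor = [[psi(x, y) for x in range(6)] for y in range(6)]
tracked = [[psi(x - 0.2, y + 0.1) for x in range(6)] for y in range(6)]
d0 = constant_flow_ls(anchor, tracked)  # close to (0.2, -0.1), not exact
```

The estimate is only approximately equal to the true motion because the optical flow equation is a first-order approximation; the residual bias shrinks as the motion gets smaller.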

An advantage of the above method is that the function being minimized is a quadratic function of the MVs when $p = 2$. If the motion parameters are linearly related to the MVs, then the function has a unique minimum and can be solved easily. This is not true with the DFD error given in Eq. (6.2.1). However, the optical flow equation is valid only when the motion is small, or when an initial motion estimate $\tilde{\mathbf{d}}(\mathbf{x})$ that is close to the true motion can be found and one can pre-update $\psi_2(\mathbf{x})$ to $\psi_2(\mathbf{x} + \tilde{\mathbf{d}}(\mathbf{x}))$. When this is not the case, it is better to use the DFD error criterion, and find the minimal solution using the gradient descent or exhaustive search method.

Regularization

Minimizing the DFD error or solving the optical flow equation does not always give a physically meaningful motion estimate. This is partially because the constant intensity assumption is not always correct. The imaged intensity of the same object point may vary after an object motion because of various reflectance and shadowing effects. Secondly, in a region with flat texture, many different motion estimates can satisfy the constant intensity assumption or the optical flow equation. Finally, if the motion parameters are the MVs at every pixel, the optical flow equation does not constrain the motion vector completely. These factors make motion estimation an ill-posed problem.

To obtain a physically meaningful solution, one needs to impose additional constraints to regularize the problem. One common regularization approach is to add a penalty term to the error function in (6.2.1) or (6.2.4), which should enforce the resulting motion estimate to bear the characteristics of common motion fields. One well-known property of a typical motion field is that it usually varies smoothly from pixel to pixel, except at object boundaries. To enforce the smoothness, one can use a penalty term that measures the differences between the MVs of adjacent pixels, i.e.,

\[
E_s(\mathbf{a}) = \sum_{\mathbf{x}\in\Lambda} \sum_{\mathbf{y}\in\mathcal{N}_{\mathbf{x}}} \left\| \mathbf{d}(\mathbf{x};\mathbf{a}) - \mathbf{d}(\mathbf{y};\mathbf{a}) \right\|^2, \tag{6.2.8}
\]

where $\|\cdot\|$ represents the 2-norm and $\mathcal{N}_{\mathbf{x}}$ represents the set of pixels adjacent to $\mathbf{x}$. Either the 4-connectivity or the 8-connectivity neighborhood can be used.

The overall minimization criterion can be written as

\[
E = E_{\mathrm{DFD}} + w_s E_s. \tag{6.2.9}
\]

The weighting coefficient $w_s$ should be chosen based on the importance of motion smoothness relative to the prediction error. To avoid over-blurring, one should reduce the weighting at object boundaries. This, however, requires accurate detection of object boundaries.
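As an illustration of Eqs. (6.2.8) and (6.2.9), the following sketch computes the smoothness penalty of a pixel-based motion field with the 4-connectivity neighborhood and combines it with a given DFD error. The function names, the field layout (a 2-D list of `(dx, dy)` tuples), and the numeric values are assumptions chosen for this example.

```python
def smoothness_penalty(field):
    """E_s of Eq. (6.2.8) with 4-connectivity.

    field: 2-D list of motion vectors (dx, dy) indexed [row][col].
    Every adjacent pair is visited twice, once from each pixel, as in the double sum.
    """
    rows, cols = len(field), len(field[0])
    total = 0.0
    for y in range(rows):
        for x in range(cols):
            for nx, ny in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)):
                if 0 <= nx < cols and 0 <= ny < rows:
                    ddx = field[y][x][0] - field[ny][nx][0]
                    ddy = field[y][x][1] - field[ny][nx][1]
                    total += ddx * ddx + ddy * ddy  # squared 2-norm of the MV difference
    return total


def regularized_error(e_dfd, field, w_s):
    """Overall criterion of Eq. (6.2.9) for a DFD error computed elsewhere."""
    return e_dfd + w_s * smoothness_penalty(field)


uniform = [[(1, 0), (1, 0)], [(1, 0), (1, 0)]]  # perfectly smooth field
outlier = [[(1, 0), (1, 0)], [(1, 0), (3, 0)]]  # one MV deviates from its neighbors

penalty_uniform = smoothness_penalty(uniform)
penalty_outlier = smoothness_penalty(outlier)
total_outlier = regularized_error(5.0, outlier, 0.5)
```

A field with one deviating MV is penalized while a uniform field is not, so a larger `w_s` pushes the minimizer toward smoother fields at the cost of a larger prediction error.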

Bayesian Criterion

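The Bayesian estimator is based on a probabilistic formulation of the motion estimation problem, pioneered by Konrad and Dubois [22, 38]. Under this formulation, given an anchor frame $\psi_1$, the image function at the tracked frame $\psi_2$ is considered a realization of a random field $\Psi$, and the motion field $\mathbf{d}$ is a realization of another random field $\mathcal{D}$. The a posteriori probability distribution of the motion field $\mathcal{D}$ given a realization of $\Psi$ and $\psi_1$ can be written, using the Bayes rule, as

\[
P(\mathcal{D} = \mathbf{d} \mid \Psi = \psi_2; \psi_1) = \frac{P(\Psi = \psi_2 \mid \mathcal{D} = \mathbf{d}; \psi_1)\, P(\mathcal{D} = \mathbf{d}; \psi_1)}{P(\Psi = \psi_2; \psi_1)}. \tag{6.2.10}
\]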

In the above notation, the semicolon indicates that subsequent variables are deterministic parameters. An estimator based on the Bayesian criterion attempts to maximize the a posteriori probability. But for given $\psi_1$ and $\psi_2$, maximizing the above probability is equivalent to maximizing the numerator only. Therefore, the maximum a posteriori (MAP) estimate of $\mathbf{d}$ is

\[
\mathbf{d}_{\mathrm{MAP}} = \arg\max_{\mathbf{d}} \left\{ P(\Psi = \psi_2 \mid \mathcal{D} = \mathbf{d}; \psi_1)\, P(\mathcal{D} = \mathbf{d}; \psi_1) \right\}. \tag{6.2.11}
\]

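The first probability denotes the likelihood of an image frame given the motion field and the anchor frame. Let $\mathcal{E}$ represent the random field corresponding to the DFD image $e(\mathbf{x}) = \psi_2(\mathbf{x}+\mathbf{d}) - \psi_1(\mathbf{x})$ for given $\mathbf{d}$ and $\psi_1$; then

\[
P(\Psi = \psi_2 \mid \mathcal{D} = \mathbf{d}; \psi_1) = P(\mathcal{E} = e),
\]

and the above equation becomes

\[
\mathbf{d}_{\mathrm{MAP}} = \arg\max_{\mathbf{d}} \left\{ P(\mathcal{E} = e)\, P(\mathcal{D} = \mathbf{d}; \psi_1) \right\}
= \arg\min_{\mathbf{d}} \left\{ -\log P(\mathcal{E} = e) - \log P(\mathcal{D} = \mathbf{d}; \psi_1) \right\}. \tag{6.2.12}
\]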

From the source coding theory (Sec. 8.3.1), the minimum coding length for a realization $x$ of a source $X$ is $-\log P(X = x)$, whose average over all realizations is the entropy of the source. We see that the MAP estimate is equivalent to minimizing the sum of the coding length for the DFD image $e$ and that for the motion field $\mathbf{d}$. As will be shown in Sec. 9.3.1, this is precisely what a video coder using motion-compensated prediction needs to code. Therefore, the MAP estimate for $\mathbf{d}$ is equivalent to a minimum description length (MDL) estimate [34]. Because the purpose of motion estimation in video coding is to minimize the bit rate, the MAP criterion is a better choice than minimizing the prediction error.

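The most common model for the DFD image is a zero-mean independently identically distributed (i.i.d.) Gaussian field, with distribution

\[
P(\mathcal{E} = e) = (2\pi\sigma^2)^{-|\Lambda|/2} \exp\left( -\frac{\sum_{\mathbf{x}\in\Lambda} e^2(\mathbf{x})}{2\sigma^2} \right), \tag{6.2.13}
\]

where $|\Lambda|$ denotes the size of $\Lambda$ (i.e., the number of pixels in $\Lambda$) and $\sigma^2$ is the variance of the field. With this model, minimizing the first term in Eq. (6.2.12) is equivalent to minimizing the previously defined DFD error in Eq. (6.2.1) with $p = 2$.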

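For the motion field $\mathcal{D}$, a common model is a Gibbs/Markov random field [11]. Such a model is defined by a neighborhood structure called a clique. Let $C$ represent the set of cliques; the model assumes

\[
P(\mathcal{D} = \mathbf{d}) = \frac{1}{Z} \exp\left( -\sum_{c\in C} V_c(\mathbf{d}) \right), \tag{6.2.14}
\]

where $Z$ is a normalization factor. The function $V_c(\mathbf{d})$ is called the potential function, which is usually defined to measure the difference between the MVs of pixels in the same clique:

\[
V_c(\mathbf{d}) = \sum_{(\mathbf{x},\mathbf{y})\in c} \left| \mathbf{d}(\mathbf{x}) - \mathbf{d}(\mathbf{y}) \right|^2. \tag{6.2.15}
\]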

Under this model, minimizing the second term in Eq. (6.2.12) is equivalent to minimizing the smoothing function in Eq. (6.2.8). Therefore, the MAP estimate is equivalent to the DFD-based estimator with an appropriate smoothness constraint.
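As a worked check of this equivalence, the derivation below substitutes the Gaussian model (6.2.13) and the Gibbs model (6.2.14)-(6.2.15) into Eq. (6.2.12). It assumes that each clique is a pair of adjacent pixels, so that a pair counted twice in the double sum of Eq. (6.2.8) is counted once in $C$, and that the prior does not depend on $\psi_1$. The resulting weight $w_s = \sigma^2$ follows from these assumptions and is not stated in the text.

```latex
\begin{align*}
-\log P(\mathcal{E} = e)
  &= \frac{|\Lambda|}{2}\log(2\pi\sigma^2)
     + \frac{1}{2\sigma^2}\sum_{\mathbf{x}\in\Lambda} e^2(\mathbf{x})
   = \text{const} + \frac{1}{2\sigma^2}\, E_{\mathrm{DFD}}\big|_{p=2}, \\
-\log P(\mathcal{D} = \mathbf{d})
  &= \log Z + \sum_{c\in C} V_c(\mathbf{d})
   = \text{const} + \tfrac{1}{2}\, E_s, \\
\mathbf{d}_{\mathrm{MAP}}
  &= \arg\min_{\mathbf{d}} \left\{ \frac{1}{2\sigma^2}\, E_{\mathrm{DFD}} + \tfrac{1}{2}\, E_s \right\}
   = \arg\min_{\mathbf{d}} \left\{ E_{\mathrm{DFD}} + \sigma^2 E_s \right\}.
\end{align*}
```

Under these assumptions the MAP criterion has the form of Eq. (6.2.9), with the smoothness weight growing with the variance of the DFD field: the noisier the prediction error is assumed to be, the more the estimate relies on the smoothness prior.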

6.2.3 Minimization Methods

The error functions presented in Sec. 6.2.2 can be minimized using various optimization methods. Here we only consider exhaustive search and gradient-based search methods. Usually, for the exhaustive search, the MAD is used for reasons of computational simplicity, whereas for the gradient-based search, the MSE is used for its mathematical tractability.

Obviously, the advantage of the exhaustive search method is that it guarantees reaching the global minimum. However, such a search is feasible only if the number of unknown parameters is small, and each parameter takes only a finite set of discrete values. To reduce the search time, various fast algorithms can be developed, which achieve sub-optimal solutions.

The most common gradient descent methods include the steepest gradient descent and the Newton-Raphson method. A brief review of these methods is provided in Appendix B. A gradient-based method can handle unknown parameters in a high-dimensional continuous space. However, it can only guarantee convergence to a local minimum. The error functions introduced in the previous section are in general not convex and can have many local minima that are far from the global minimum. Therefore, it is important to obtain a good initial solution through the use of a priori knowledge, or by adding a penalty term to make the error function convex.

With the gradient-based method, one must calculate the spatio-temporal gradients of the underlying signal. Appendix A reviews methods for computing first and second order gradients from digital sampled images. Note that the methods used for calculating the gradient functions can have a profound impact on the accuracy and robustness of the associated motion estimation methods, as has been shown by Barron et al. [4]. Using a Gaussian pre-filter followed by a central difference

One important search strategy is to use a multi-resolution representation of the motion field and conduct the search in a hierarchical manner. The basic idea is to first search the motion parameters in a coarse resolution, propagate this solution into a finer resolution, and then refine the solution in the finer resolution. It can combat both the slowness of exhaustive search methods and the non-optimality of gradient-based methods. We will present the multi-resolution method in more detail in Sec. 6.9.

6.3 Pixel-Based Motion Estimation

In pixel-based motion estimation, one tries to estimate a motion vector for every pixel. Obviously, this problem is ill-defined. If one uses the constant intensity assumption, for every pixel in the anchor frame, there are many pixels in the tracked frame that have exactly the same image intensity. If one uses the optical flow equation instead, the problem is again indeterminate, because there is only one equation for two unknowns. To circumvent this problem, there are in general four approaches. First, one can use the regularization technique to enforce some smoothness constraints on the motion field, so that the motion vector of a new pixel is constrained by those found for surrounding pixels. Second, one can assume the motion vectors in a neighborhood surrounding each pixel are the same, and apply the constant intensity assumption or the optical flow equation over the entire neighborhood. The third approach is to make use of additional invariance constraints. In addition to intensity invariance, which leads to the optical flow equation, one can assume that the intensity gradient is invariant under motion, as proposed in [29, 26, 15]. Finally, one can also make use of the relation between the phase functions of the frame before and after motion [9]. In [4], Barron et al. evaluated various methods for optical flow computation by testing these algorithms on both synthetic and real-world imagery. In this section, we will describe the first two approaches only. We will also introduce the pel-recursive type of algorithms, which were developed for video compression applications.

6.3.1 Regularization Using Motion Smoothness Constraint

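Horn and Schunck [16] proposed to estimate the motion vectors by minimizing the following objective function, which is a combination of the flow-based criterion and a motion smoothness criterion:

\[
E(\mathbf{v}(\mathbf{x})) = \sum_{\mathbf{x}\in\Lambda} \left[ \left( \frac{\partial\psi}{\partial x} v_x + \frac{\partial\psi}{\partial y} v_y + \frac{\partial\psi}{\partial t} \right)^2 + w_s \left( \|\nabla v_x\|^2 + \|\nabla v_y\|^2 \right) \right]. \tag{6.3.1}
\]

In their original algorithm, the spatial gradients of $v_x$ and $v_y$ are approximated by

\[
\nabla v_x = [\, v_x(x,y) - v_x(x-1,y),\; v_x(x,y) - v_x(x,y-1) \,]^T, \qquad
\nabla v_y = [\, v_y(x,y) - v_y(x-1,y),\; v_y(x,y) - v_y(x,y-1) \,]^T.
\]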

Nagel and Enkelmann conducted a comprehensive evaluation of the effect of smoothness constraints on motion estimation [30]. In order to avoid over-smoothing of the motion field, Nagel suggested an oriented-smoothness constraint in which smoothness is imposed along the object boundaries, but not across the boundaries [29]. This has resulted in significant improvement in motion estimation accuracy [4].

6.3.2 Using a Multipoint Neighborhood

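In this approach, when estimating the motion vector at a pixel $\mathbf{x}_n$, we assume that the motion vectors of all the pixels in a neighborhood $B(\mathbf{x}_n)$ surrounding it are the same, being $\mathbf{d}_n$. To determine $\mathbf{d}_n$, one can either minimize the prediction error over $B(\mathbf{x}_n)$, or solve the optical flow equation using a least squares method. Here we present the first approach. To estimate the motion vector $\mathbf{d}_n$ for $\mathbf{x}_n$, we minimize the DFD error over $B(\mathbf{x}_n)$:

\[
E_n(\mathbf{d}_n) = \frac{1}{2} \sum_{\mathbf{x}\in B(\mathbf{x}_n)} w(\mathbf{x}) \left( \psi_2(\mathbf{x}+\mathbf{d}_n) - \psi_1(\mathbf{x}) \right)^2, \tag{6.3.2}
\]

where $w(\mathbf{x})$ is the weight assigned to pixel $\mathbf{x}$. Usually, the weight decreases as the distance from $\mathbf{x}$ to $\mathbf{x}_n$ increases.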

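The gradient with respect to $\mathbf{d}_n$ is

\[
\mathbf{g}_n = \frac{\partial E_n}{\partial \mathbf{d}_n} = \sum_{\mathbf{x}\in B(\mathbf{x}_n)} w(\mathbf{x})\, e(\mathbf{x};\mathbf{d}_n) \left. \frac{\partial\psi_2}{\partial\mathbf{x}} \right|_{\mathbf{x}+\mathbf{d}_n}, \tag{6.3.3}
\]

where $e(\mathbf{x};\mathbf{d}_n) = \psi_2(\mathbf{x}+\mathbf{d}_n) - \psi_1(\mathbf{x})$ is the DFD at $\mathbf{x}$ with the estimate $\mathbf{d}_n$. Let $\mathbf{d}_n^{l}$ represent the estimate at the $l$-th iteration. The first order gradient descent method then yields the following update algorithm:

\[
\mathbf{d}_n^{l+1} = \mathbf{d}_n^{l} - \alpha\, \mathbf{g}_n(\mathbf{d}_n^{l}), \tag{6.3.4}
\]

where $\alpha$ is the step size. From Eq. (6.3.3), the update at each iteration depends on the sum of the image gradients at various pixels scaled by the weighted DFD values at those pixels.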
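The following is a minimal sketch of the first-order update in Eq. (6.3.4), under strong simplifying assumptions: the two frames are analytic functions rather than sampled images (so the tracked frame and its gradient can be evaluated at non-integer positions without interpolation), the neighborhood is a 3x3 window around the origin with all weights equal to 1, and the step size 0.02 and the iteration count were picked by hand for this example. All names are hypothetical.

```python
D_TRUE = (0.6, -0.4)  # motion used to synthesize the tracked frame (assumed for the demo)


def psi1(x, y):
    """Hypothetical anchor frame: a smooth quadratic intensity pattern."""
    return x * x + y * y


def psi2(x, y):
    """Tracked frame: the anchor pattern moved by D_TRUE, psi2(x) = psi1(x - D_TRUE)."""
    return psi1(x - D_TRUE[0], y - D_TRUE[1])


def grad_psi2(x, y):
    """Analytic spatial gradient of the tracked frame."""
    return 2.0 * (x - D_TRUE[0]), 2.0 * (y - D_TRUE[1])


# Neighborhood B(x_n) around x_n = (0, 0), with w(x) = 1 for every pixel.
WINDOW = [(x, y) for y in (-1, 0, 1) for x in (-1, 0, 1)]


def gradient(d):
    """g_n of Eq. (6.3.3): DFD-weighted sum of tracked-frame gradients at x + d."""
    gx = gy = 0.0
    for x, y in WINDOW:
        e = psi2(x + d[0], y + d[1]) - psi1(x, y)  # DFD at x for the current estimate
        px, py = grad_psi2(x + d[0], y + d[1])
        gx += e * px
        gy += e * py
    return gx, gy


def descend(d, alpha=0.02, iters=40):
    """Repeat the update d <- d - alpha * g_n(d) a fixed number of times."""
    for _ in range(iters):
        gx, gy = gradient(d)
        d = (d[0] - alpha * gx, d[1] - alpha * gy)
    return d


estimate = descend((0.0, 0.0))  # starts from zero motion and approaches D_TRUE
```

For real sampled frames, evaluating the tracked frame and its gradient at `x + d` requires interpolation, and a step size that is too large makes the iteration diverge.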

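One can also derive an iterative algorithm using the Newton-Raphson method. From Eq. (6.3.3), the Hessian matrix is

\[
\mathbf{H}_n = \frac{\partial^2 E_n}{\partial \mathbf{d}_n^2}
= \sum_{\mathbf{x}\in B(\mathbf{x}_n)} \left[ w(\mathbf{x}) \left. \frac{\partial\psi_2}{\partial\mathbf{x}} \frac{\partial\psi_2}{\partial\mathbf{x}}^T \right|_{\mathbf{x}+\mathbf{d}_n} + w(\mathbf{x})\, e(\mathbf{x};\mathbf{d}_n) \left. \frac{\partial^2\psi_2}{\partial\mathbf{x}^2} \right|_{\mathbf{x}+\mathbf{d}_n} \right]
\approx \sum_{\mathbf{x}\in B(\mathbf{x}_n)} w(\mathbf{x}) \left. \frac{\partial\psi_2}{\partial\mathbf{x}} \frac{\partial\psi_2}{\partial\mathbf{x}}^T \right|_{\mathbf{x}+\mathbf{d}_n}.
\]

The Newton-Raphson update algorithm is then (see Appendix B):

\[
\mathbf{d}_n^{l+1} = \mathbf{d}_n^{l} - \mathbf{H}_n(\mathbf{d}_n^{l})^{-1}\, \mathbf{g}_n(\mathbf{d}_n^{l}). \tag{6.3.5}
\]

This algorithm converges faster than the first order gradient descent method, but it requires more computation in each iteration.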

Instead of using gradient-based update algorithms, one can also use exhaustive search to find the $\mathbf{d}_n$ that yields the minimal error within a defined search range. This will lead to the exhaustive block matching algorithm (EBMA) to be presented in Sec. 6.4.1. The difference from the EBMA is that the neighborhood used here is a sliding window and a MV is determined for each pixel by minimizing the error in its neighborhood. The neighborhood in general does not have to be a rectangular block.
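The sliding-window exhaustive search can be sketched as follows. This illustration assumes integer-valued candidate MVs, a square window with uniform weights, and the sum of absolute differences as the error; candidates whose window or match leaves the frame are simply rejected. The function names and the synthetic frames are hypothetical.

```python
def window_error(anchor, tracked, cx, cy, d, half):
    """Sum of absolute DFD over a (2*half+1)^2 window centered at pixel (cx, cy).

    Returns None when the window or its displaced match leaves the frame.
    """
    rows, cols = len(anchor), len(anchor[0])
    dx, dy = d
    total = 0
    for y in range(cy - half, cy + half + 1):
        for x in range(cx - half, cx + half + 1):
            xs, ys = x + dx, y + dy
            inside = 0 <= x < cols and 0 <= y < rows
            match_inside = 0 <= xs < cols and 0 <= ys < rows
            if not (inside and match_inside):
                return None
            total += abs(tracked[ys][xs] - anchor[y][x])
    return total


def exhaustive_mv(anchor, tracked, cx, cy, search=2, half=1):
    """Integer MV in [-search, search]^2 with the smallest window error at (cx, cy)."""
    best_d, best_err = None, None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            err = window_error(anchor, tracked, cx, cy, (dx, dy), half)
            if err is not None and (best_err is None or err < best_err):
                best_d, best_err = (dx, dy), err
    return best_d, best_err


# Synthetic 8x8 frames: the tracked frame is the anchor pattern moved one pixel to the right.
anchor = [[x * x + 3 * y for x in range(8)] for y in range(8)]
tracked = [[(x - 1) ** 2 + 3 * y for x in range(8)] for y in range(8)]

mv_center = exhaustive_mv(anchor, tracked, 4, 4)  # window fully inside the frame
mv_corner = exhaustive_mv(anchor, tracked, 0, 0)  # window crosses the frame border
```

Calling `exhaustive_mv` once per pixel yields a dense motion field. The cost grows with both the search range and the window size, which is why the fixed-block variant is preferred in practical coders.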

6.3.3 Pel-Recursive Methods

In a video coder using motion-compensated prediction, one needs to specify both the MVs and the DFD image. With a pixel-based motion representation, one would need to specify a MV for each pixel, which is very costly. In pel-recursive motion

In document Video Processing & Communications - Wang (Page 172-200)