Figure 6.4. Forward and backward motion estimation. Adapted from [39, Fig. 5.5].
anchor and tracked frames, respectively. In general, we can represent the motion field as $\mathbf{d}(\mathbf{x}; \mathbf{a})$, where $\mathbf{a} = [a_1, a_2, \ldots, a_L]^T$ is a vector containing all the motion parameters. Similarly, the mapping function can be denoted by $\mathbf{w}(\mathbf{x}; \mathbf{a}) = \mathbf{x} + \mathbf{d}(\mathbf{x}; \mathbf{a})$. The motion estimation problem is to estimate the motion parameter vector $\mathbf{a}$. Methods that have been developed can be categorized into two groups: feature-based and intensity-based. In the feature-based approach, correspondences between pairs of selected feature points in two video frames are first established. The motion model parameters are then obtained by a least squares fitting of the established correspondences to the chosen motion model. This approach is only applicable to parametric motion models and can be quite effective in, say, determining global motions. The intensity-based approach applies the constant intensity assumption or the optical flow equation at every pixel and requires the estimated motion to satisfy this constraint as closely as possible. This approach is more appropriate when the underlying motion cannot be characterized by a simple model and an estimate of a pixel-wise or block-wise motion field is desired.
In this chapter, we consider only intensity-based approaches, which are more widely used in applications requiring motion-compensated prediction and filtering. In general, the intensity-based motion estimation problem can be converted into an optimization problem, and three key questions need to be answered: (i) how to parameterize the underlying motion field? (ii) what criterion to use to estimate the parameters? and (iii) how to search for the optimal parameters? In this section, we first describe several ways to represent a motion field. Then we introduce different types of estimation criteria. Finally, we present search strategies commonly used for motion estimation.
6.2.1 Motion Representation
A key problem in motion estimation is how to parameterize the motion field. As shown in Sec. 5.5, the 2D motion field resulting from a camera or object motion can usually be described by a few parameters. However, there are usually multiple objects in the imaged scene that move differently. Therefore, a global parameterized model is usually not adequate. The most direct and unconstrained approach is to specify the motion vector at every pixel. This is the so-called pixel-based representation. Such a representation is universally applicable, but it requires the estimation of a large number of unknowns (twice the number of pixels!), and the solution can often be physically incorrect unless a proper physical constraint is imposed during the estimation step. On the other hand, if only the camera is moving or the imaged scene contains a single moving object with a planar surface, one could use a global motion representation to characterize the entire motion field. In general, for scenes containing multiple moving objects, it is more appropriate to divide an image frame into multiple regions so that the motion within each region can be characterized well by a parameterized model. This is known as region-based motion representation,³ which consists of a region segmentation map and several sets of motion parameters, one for each region. The difficulty with such an approach is that one does not know in advance which pixels have similar motions. Therefore, segmentation and estimation have to be accomplished iteratively, which requires an intensive amount of computation that may not be feasible in practice.
One way to reduce the complexity associated with region-based motion representation is to use a fixed partition of the image domain into many small blocks. As long as each block is small enough, the motion variation within each block can be characterized well by a simple model, and the motion parameters for each block can be estimated independently. This brings us to the popular block-based representation. The simplest version models the motion in each block by a constant translation, so that the estimation problem becomes that of finding one MV for each block. This method provides a good compromise between accuracy and complexity, and has found great success in practical video coding systems. One main problem with the block-based approach is that it does not impose any constraint on the motion transition between adjacent blocks. The resulting motion is often discontinuous across block boundaries, even when the true motion field changes smoothly from block to block. One approach to overcoming this problem is to use a mesh-based representation, in which the underlying image frame is partitioned into non-overlapping polygonal elements. The motion field over the entire frame is described by the MVs at the nodes (corners of the polygonal elements) only, and the MVs at the interior points of an element are interpolated from the nodal MVs. This representation induces a motion field that is continuous everywhere. It is more appropriate than the block-based representation over the interior regions of an object, which usually undergoes a continuous motion, but it fails to capture motion discontinuities at object boundaries. Adaptive schemes that allow discontinuities when necessary are needed for more accurate motion estimation. Figure 6.5 illustrates the effect of several motion representations described above for a head-and-shoulder scene. In the next few sections, we will introduce motion estimation methods using different motion representations.

³ This is sometimes called object-based motion representation [27]. Here we use the word "region-based" to acknowledge the fact that we are only considering 2D motions, and that a region with

Figure 6.5. Different motion representations: (a) global, (b) pixel-based, (c) block-based, and (d) region-based. From [38, Fig. 3].
6.2.2 Motion Estimation Criteria
For a chosen motion model, the problem is how to estimate the model parameters. In this section, we describe several different criteria for estimating motion parameters.
Criterion Based on Displaced Frame Difference
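The most popular criterion for motion estimation is to minimize the sum of the errors between the luminance values of every pair of corresponding points in the anchor frame $\psi_1$ and the tracked frame $\psi_2$. Recall that $\mathbf{x}$ in $\psi_1$ is moved to $\mathbf{w}(\mathbf{x}; \mathbf{a})$ in $\psi_2$. Therefore, the objective function can be written as
\[ E_{\mathrm{DFD}}(\mathbf{a}) = \sum_{\mathbf{x} \in \Lambda} \left| \psi_2(\mathbf{w}(\mathbf{x}; \mathbf{a})) - \psi_1(\mathbf{x}) \right|^p, \tag{6.2.1} \]
where $\Lambda$ is the domain of all pixels in $\psi_1$, and $p$ is a positive number. When $p = 1$, the above error is called the mean absolute difference (MAD), and when $p = 2$, the mean squared error (MSE); in both cases the name holds up to normalization by the number of pixels in $\Lambda$. The error image, $e(\mathbf{x}; \mathbf{a}) = \psi_2(\mathbf{w}(\mathbf{x}; \mathbf{a})) - \psi_1(\mathbf{x})$, is usually called the displaced frame difference (DFD) image, and the above measure the DFD error.
A necessary condition for minimizing $E_{\mathrm{DFD}}$ is that its gradient vanishes. In the case of $p = 2$, this gradient is
\[ \frac{\partial E_{\mathrm{DFD}}}{\partial \mathbf{a}} = 2 \sum_{\mathbf{x} \in \Lambda} \left( \psi_2(\mathbf{w}(\mathbf{x}; \mathbf{a})) - \psi_1(\mathbf{x}) \right) \frac{\partial \mathbf{d}(\mathbf{x})}{\partial \mathbf{a}} \nabla \psi_2(\mathbf{w}(\mathbf{x}; \mathbf{a})), \tag{6.2.2} \]
where
\[ \frac{\partial \mathbf{d}}{\partial \mathbf{a}} = \begin{bmatrix} \frac{\partial d_x}{\partial a_1} & \frac{\partial d_x}{\partial a_2} & \cdots & \frac{\partial d_x}{\partial a_L} \\ \frac{\partial d_y}{\partial a_1} & \frac{\partial d_y}{\partial a_2} & \cdots & \frac{\partial d_y}{\partial a_L} \end{bmatrix}^T. \]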
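As an illustration of Eq. (6.2.1), the sketch below evaluates the DFD error for the simplest motion model, a single global translation with integer components. The function name `dfd_error`, the restriction to integer displacements, and the choice to sum only over pixels whose displaced position stays inside the frame are assumptions of this example, not part of the text.

```python
import numpy as np

def dfd_error(anchor, tracked, d, p=2):
    """Sum of |psi2(x + d) - psi1(x)|^p for a global integer translation d = (dx, dy).

    Only anchor pixels whose displaced position falls inside the tracked frame
    are included (a boundary-handling assumption of this sketch).
    """
    dx, dy = d
    h, w = anchor.shape
    x0, x1 = max(0, -dx), min(w, w - dx)   # valid anchor columns
    y0, y1 = max(0, -dy), min(h, h - dy)   # valid anchor rows
    a = anchor[y0:y1, x0:x1].astype(float)
    t = tracked[y0 + dy:y1 + dy, x0 + dx:x1 + dx].astype(float)
    return float(np.sum(np.abs(t - a) ** p))

rng = np.random.default_rng(0)
psi1 = rng.random((16, 16))
# Shift the content right by 2 and down by 1, so psi2(x + d) = psi1(x) for d = (2, 1).
psi2 = np.roll(psi1, shift=(1, 2), axis=(0, 1))
err_true = dfd_error(psi1, psi2, (2, 1))   # evaluated at the true displacement
err_zero = dfd_error(psi1, psi2, (0, 0))   # evaluated at zero displacement
```

In this synthetic case the error vanishes at the true displacement and is positive at zero displacement, which is the behavior the criterion relies on.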
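Criterion Based on Optical Flow Equation
Instead of minimizing the DFD error, another approach is to solve the system of equations established from the optical flow constraint given in Eq. (6.1.3). Let $\psi_1(x, y) = \psi(x, y, t)$ and $\psi_2(x, y) = \psi(x, y, t + d_t)$. If $d_t$ is small, we can assume $\frac{\partial \psi}{\partial t} d_t = \psi_2(\mathbf{x}) - \psi_1(\mathbf{x})$. Then Eq. (6.1.3) can be written as
\[ \frac{\partial \psi_1}{\partial x} d_x + \frac{\partial \psi_1}{\partial y} d_y + (\psi_2 - \psi_1) = 0 \quad \text{or} \quad \nabla \psi_1^T \mathbf{d} + (\psi_2 - \psi_1) = 0. \tag{6.2.3} \]
This discrete version of the optical flow equation is more often used for motion estimation in digital video. Solving the above equations for all $\mathbf{x}$ can be turned into a minimization problem with the following objective function:
\[ E_{\mathrm{flow}}(\mathbf{a}) = \sum_{\mathbf{x} \in \Lambda} \left| \nabla \psi_1(\mathbf{x})^T \mathbf{d}(\mathbf{x}; \mathbf{a}) + \psi_2(\mathbf{x}) - \psi_1(\mathbf{x}) \right|^p. \tag{6.2.4} \]
The gradient of $E_{\mathrm{flow}}$ is, when $p = 2$,
\[ \frac{\partial E_{\mathrm{flow}}}{\partial \mathbf{a}} = 2 \sum_{\mathbf{x} \in \Lambda} \left( \nabla \psi_1(\mathbf{x})^T \mathbf{d}(\mathbf{x}; \mathbf{a}) + \psi_2(\mathbf{x}) - \psi_1(\mathbf{x}) \right) \frac{\partial \mathbf{d}(\mathbf{x})}{\partial \mathbf{a}} \nabla \psi_1(\mathbf{x}). \tag{6.2.5} \]
If the motion field is constant over a small region $\Lambda_0$, i.e., $\mathbf{d}(\mathbf{x}; \mathbf{a}) = \mathbf{d}_0$ for $\mathbf{x} \in \Lambda_0$, then Eq. (6.2.5) becomes
\[ \frac{\partial E_{\mathrm{flow}}}{\partial \mathbf{d}_0} = 2 \sum_{\mathbf{x} \in \Lambda_0} \left( \nabla \psi_1(\mathbf{x})^T \mathbf{d}_0 + \psi_2(\mathbf{x}) - \psi_1(\mathbf{x}) \right) \nabla \psi_1(\mathbf{x}). \tag{6.2.6} \]
Setting the above gradient to zero yields the least squares solution for $\mathbf{d}_0$:
\[ \mathbf{d}_0 = \left( \sum_{\mathbf{x} \in \Lambda_0} \nabla \psi_1(\mathbf{x}) \nabla \psi_1(\mathbf{x})^T \right)^{-1} \left( \sum_{\mathbf{x} \in \Lambda_0} \left( \psi_1(\mathbf{x}) - \psi_2(\mathbf{x}) \right) \nabla \psi_1(\mathbf{x}) \right). \tag{6.2.7} \]
When the motion is not constant but can be related to the model parameters linearly, one can still derive a similar least squares solution. See Prob. 6.6 in the Problems section.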
An advantage of the above method is that the function being minimized is a quadratic function of the MVs when $p = 2$. If the motion parameters are linearly related to the MVs, then the function has a unique minimum and can be solved easily. This is not true of the DFD error given in Eq. (6.2.1). However, the optical flow equation is valid only when the motion is small, or when an initial motion estimate $\tilde{\mathbf{d}}(\mathbf{x})$ that is close to the true motion can be found and one can pre-update $\psi_2(\mathbf{x})$ to $\psi_2(\mathbf{x} + \tilde{\mathbf{d}}(\mathbf{x}))$. When this is not the case, it is better to use the DFD error criterion and find the minimal solution using the gradient descent or exhaustive search method.
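The least squares solution of Eq. (6.2.7) can be sketched in a few lines of NumPy. The function name `flow_least_squares`, the use of `np.gradient` (central differences) for the spatial derivatives, and the choice of the whole frame as the region $\Lambda_0$ are assumptions made for this illustration only. The test images are smooth and the displacement is a fraction of a pixel, so the small-motion condition discussed above holds.

```python
import numpy as np

def flow_least_squares(psi1, psi2):
    """Estimate one constant displacement d0 = (dx, dy) over the frame via Eq. (6.2.7)."""
    psi1 = psi1.astype(float)
    psi2 = psi2.astype(float)
    # np.gradient returns the derivative along axis 0 (rows, y) first, then axis 1 (columns, x).
    gy, gx = np.gradient(psi1)
    G = np.stack([gx.ravel(), gy.ravel()], axis=1)   # one row grad(psi1)^T per pixel
    A = G.T @ G                                       # sum of grad * grad^T (2 x 2)
    b = G.T @ (psi1 - psi2).ravel()                   # sum of (psi1 - psi2) * grad
    return np.linalg.solve(A, b)

def f(x, y):
    """Smooth synthetic intensity pattern (hypothetical test signal)."""
    return np.sin(0.2 * x) + np.cos(0.3 * y)

ys, xs = np.mgrid[0:32, 0:32].astype(float)
d_true = np.array([0.3, -0.2])
psi1 = f(xs, ys)
psi2 = f(xs - d_true[0], ys - d_true[1])   # psi2(x + d_true) = psi1(x)
d0 = flow_least_squares(psi1, psi2)
```

The estimate is only approximate, because both the optical flow equation and the finite-difference gradients are first-order approximations.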
Regularization
Minimizing the DFD error or solving the optical flow equation does not always give a physically meaningful motion estimate. This is partly because the constant intensity assumption is not always correct: the imaged intensity of the same object point may vary after an object motion because of various reflectance and shadowing effects. Second, in a region with flat texture, many different motion estimates can satisfy the constant intensity assumption or the optical flow equation. Finally, if the motion parameters are the MVs at every pixel, the optical flow equation does not constrain the motion vector completely. These factors make motion estimation an ill-posed problem.

To obtain a physically meaningful solution, one needs to impose additional constraints to regularize the problem. One common regularization approach is to add a penalty term to the error function in Eq. (6.2.1) or (6.2.4), which should force the resulting motion estimate to bear the characteristics of common motion fields. One well-known property of a typical motion field is that it usually varies smoothly from pixel to pixel, except at object boundaries. To enforce this smoothness, one can use a penalty term that measures the differences between the MVs of adjacent pixels, i.e.,
\[ E_s(\mathbf{a}) = \sum_{\mathbf{x} \in \Lambda} \sum_{\mathbf{y} \in \mathcal{N}_{\mathbf{x}}} \left\| \mathbf{d}(\mathbf{x}; \mathbf{a}) - \mathbf{d}(\mathbf{y}; \mathbf{a}) \right\|^2, \tag{6.2.8} \]
where $\|\cdot\|$ represents the 2-norm and $\mathcal{N}_{\mathbf{x}}$ represents the set of pixels adjacent to $\mathbf{x}$. Either the 4-connectivity or the 8-connectivity neighborhood can be used.
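The smoothness penalty of Eq. (6.2.8) is straightforward to evaluate for a dense motion field. The sketch below assumes a 4-connected neighborhood and a motion field stored as an array of shape (H, W, 2); the function name `smoothness_penalty` and this storage layout are choices made for the example.

```python
import numpy as np

def smoothness_penalty(d):
    """E_s of Eq. (6.2.8) for a 4-connected neighborhood.

    d: array of shape (H, W, 2) holding one motion vector per pixel.
    Each adjacent pair appears twice in the double sum (once with each pixel
    playing the role of x), hence the factor of 2.
    """
    dh = d[:, 1:, :] - d[:, :-1, :]   # differences between horizontal neighbors
    dv = d[1:, :, :] - d[:-1, :, :]   # differences between vertical neighbors
    return 2.0 * float(np.sum(dh ** 2) + np.sum(dv ** 2))

const = np.ones((4, 5, 2))            # constant field: perfectly smooth
ramp = np.zeros((4, 5, 2))
ramp[..., 0] = np.arange(5)           # horizontal component grows by 1 per column
e_const = smoothness_penalty(const)
e_ramp = smoothness_penalty(ramp)
```

A constant field gives zero penalty, while the ramp field has 16 horizontal neighbor pairs with unit difference, each counted twice.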
The overall minimization criterion can be written as
\[ E = E_{\mathrm{DFD}} + w_s E_s. \tag{6.2.9} \]
The weighting coefficient $w_s$ should be chosen based on the importance of motion smoothness relative to the prediction error. To avoid over-blurring, one should reduce the weighting at object boundaries. This, however, requires accurate detection
Bayesian Criterion
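The Bayesian estimator is based on a probabilistic formulation of the motion estimation problem, pioneered by Konrad and Dubois [22, 38]. Under this formulation, given an anchor frame $\psi_1$, the image function at the tracked frame $\psi_2$ is considered a realization of a random field $\Psi$, and the motion field $\mathbf{d}$ is a realization of another random field $\mathcal{D}$. The a posteriori probability distribution of the motion field $\mathcal{D}$, given a realization of $\Psi$ and $\psi_1$, can be written using Bayes' rule as
\[ P(\mathcal{D} = \mathbf{d} \mid \Psi = \psi_2; \psi_1) = \frac{P(\Psi = \psi_2 \mid \mathcal{D} = \mathbf{d}; \psi_1) \, P(\mathcal{D} = \mathbf{d}; \psi_1)}{P(\Psi = \psi_2; \psi_1)}. \tag{6.2.10} \]
In the above notation, the semicolon indicates that subsequent variables are deterministic parameters. An estimator based on the Bayesian criterion attempts to maximize the a posteriori probability. But for given $\psi_1$ and $\psi_2$, maximizing the above probability is equivalent to maximizing the numerator only. Therefore, the maximum a posteriori (MAP) estimate of $\mathbf{d}$ is
\[ \mathbf{d}_{\mathrm{MAP}} = \arg\max_{\mathbf{d}} \left\{ P(\Psi = \psi_2 \mid \mathcal{D} = \mathbf{d}; \psi_1) \, P(\mathcal{D} = \mathbf{d}; \psi_1) \right\}. \tag{6.2.11} \]
The first probability denotes the likelihood of an image frame given the motion field and the anchor frame. Let $\mathcal{E}$ represent the random field corresponding to the DFD image $e(\mathbf{x}) = \psi_2(\mathbf{x} + \mathbf{d}) - \psi_1(\mathbf{x})$ for given $\mathbf{d}$ and $\psi_1$; then
\[ P(\Psi = \psi_2 \mid \mathcal{D} = \mathbf{d}; \psi_1) = P(\mathcal{E} = e), \]
and the above equation becomes
\[ \mathbf{d}_{\mathrm{MAP}} = \arg\max_{\mathbf{d}} \left\{ P(\mathcal{E} = e) \, P(\mathcal{D} = \mathbf{d}; \psi_1) \right\} = \arg\min_{\mathbf{d}} \left\{ -\log P(\mathcal{E} = e) - \log P(\mathcal{D} = \mathbf{d}; \psi_1) \right\}. \tag{6.2.12} \]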
From source coding theory (Sec. 8.3.1), the minimum coding length for a realization $x$ of a source $X$ is $-\log P(X = x)$, whose average over the source is its entropy. We see that the MAP estimate is equivalent to minimizing the sum of the coding length for the DFD image $e$ and that for the motion field $\mathbf{d}$. As will be shown in Sec. 9.3.1, this is precisely what a video coder using motion-compensated prediction needs to code. Therefore, the MAP estimate for $\mathbf{d}$ is equivalent to a minimum description length (MDL) estimate [34]. Because the purpose of motion estimation in video coding is to minimize the bit rate, the MAP criterion is a better choice than minimizing the prediction error alone.
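The most common model for the DFD image is a zero-mean, independent and identically distributed (i.i.d.) Gaussian field, with distribution
\[ P(\mathcal{E} = e) = (2\pi\sigma^2)^{-|\Lambda|/2} \exp\left( -\frac{\sum_{\mathbf{x} \in \Lambda} e^2(\mathbf{x})}{2\sigma^2} \right), \tag{6.2.13} \]
where $|\Lambda|$ denotes the size of $\Lambda$ (i.e., the number of pixels in $\Lambda$). With this model, minimizing the first term in Eq. (6.2.12) is equivalent to minimizing the previously defined DFD error with $p = 2$.

For the motion field $\mathcal{D}$, a common model is a Gibbs/Markov random field [11]. Such a model is defined by a neighborhood structure called a clique. Let $\mathcal{C}$ represent the set of cliques; the model assumes
\[ P(\mathcal{D} = \mathbf{d}) = \frac{1}{Z} \exp\left( -\sum_{c \in \mathcal{C}} V_c(\mathbf{d}) \right), \tag{6.2.14} \]
where $Z$ is a normalization factor. The function $V_c(\mathbf{d})$ is called the potential function, which is usually defined to measure the difference between pixels in the same clique:
\[ V_c(\mathbf{d}) = \sum_{(\mathbf{x}, \mathbf{y}) \in c} \left| \mathbf{d}(\mathbf{x}) - \mathbf{d}(\mathbf{y}) \right|^2. \tag{6.2.15} \]
Under this model, minimizing the second term in Eq. (6.2.12) is equivalent to minimizing the smoothing function in Eq. (6.2.8). Therefore, the MAP estimate is equivalent to the DFD-based estimator with an appropriate smoothness constraint.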
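To make this equivalence concrete, the following worked step takes negative logarithms of the Gaussian model and the Gibbs model and substitutes them into the MAP criterion, dropping terms that do not depend on the motion field. Reading the resulting weight $2\sigma^2$ as the smoothness weight $w_s$ is an interpretation that assumes the cliques are the same adjacent-pixel pairs counted in the smoothness penalty; a different pair counting changes the weight by a constant factor.

```latex
\begin{align*}
-\log P(\mathcal{E} = e) &= \frac{|\Lambda|}{2} \log(2\pi\sigma^2)
    + \frac{1}{2\sigma^2} \sum_{\mathbf{x} \in \Lambda} e^2(\mathbf{x}), \\
-\log P(\mathcal{D} = \mathbf{d}) &= \log Z
    + \sum_{c \in \mathcal{C}} \sum_{(\mathbf{x}, \mathbf{y}) \in c}
      \left| \mathbf{d}(\mathbf{x}) - \mathbf{d}(\mathbf{y}) \right|^2, \\
\mathbf{d}_{\mathrm{MAP}} &= \arg\min_{\mathbf{d}} \Big\{
    \sum_{\mathbf{x} \in \Lambda} e^2(\mathbf{x})
    + 2\sigma^2 \sum_{c \in \mathcal{C}} \sum_{(\mathbf{x}, \mathbf{y}) \in c}
      \left| \mathbf{d}(\mathbf{x}) - \mathbf{d}(\mathbf{y}) \right|^2 \Big\}.
\end{align*}
```

The last line follows by discarding the constants $\frac{|\Lambda|}{2} \log(2\pi\sigma^2)$ and $\log Z$ and multiplying the remaining sum by $2\sigma^2$, neither of which changes the minimizer.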
6.2.3 Minimization Methods
The error functions presented in Sec. 6.2.2 can be minimized using various optimization methods. Here we consider only exhaustive search and gradient-based search methods. Usually, the MAD is used for exhaustive search for reasons of computational simplicity, whereas the MSE is used for gradient-based search for its mathematical tractability.

The advantage of the exhaustive search method is that it guarantees reaching the global minimum. However, such a search is feasible only if the number of unknown parameters is small and each parameter takes only a finite set of discrete values. To reduce the search time, various fast algorithms can be developed, which achieve sub-optimal solutions.
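A minimal sketch of exhaustive search follows, assuming the unknown is a single integer MV for one square block and the criterion is the MAD. The function name `exhaustive_search`, its parameters, and the rule of skipping candidates that fall outside the frame are assumptions of this example.

```python
import numpy as np

def exhaustive_search(anchor, tracked, top, left, size, search_range):
    """Return the integer MV (dx, dy) that minimizes the MAD for one block, and that MAD."""
    block = anchor[top:top + size, left:left + size].astype(float)
    h, w = tracked.shape
    best_mv, best_err = (0, 0), np.inf
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            r, c = top + dy, left + dx
            if r < 0 or c < 0 or r + size > h or c + size > w:
                continue  # candidate block falls outside the tracked frame
            cand = tracked[r:r + size, c:c + size].astype(float)
            err = np.mean(np.abs(cand - block))
            if err < best_err:
                best_mv, best_err = (dx, dy), err
    return best_mv, best_err

rng = np.random.default_rng(1)
psi1 = rng.random((32, 32))
# Content moves 3 pixels left and 2 pixels down: psi2(x + d) = psi1(x) with d = (-3, 2).
psi2 = np.roll(psi1, shift=(2, -3), axis=(0, 1))
best_mv, best_err = exhaustive_search(psi1, psi2, top=8, left=8, size=8, search_range=4)
```

Every one of the (2R + 1)^2 candidates in the search range R is tested, which is why the cost grows quickly with the range and with the number of parameters.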
The most common gradient descent methods include the steepest gradient descent and the Newton-Raphson method. A brief review of these methods is provided in Appendix B. A gradient-based method can handle unknown parameters in a high-dimensional continuous space. However, it can only guarantee convergence to a local minimum. The error functions introduced in the previous section are in general not convex and can have many local minima that are far from the global minimum. Therefore, it is important to obtain a good initial solution through the use of a priori knowledge, or to add a penalty term that makes the error function convex.

With the gradient-based method, one must calculate the spatio-temporal gradients of the underlying signal. Appendix A reviews methods for computing first- and second-order gradients from digitally sampled images. Note that the methods used for calculating the gradient functions can have a profound impact on the accuracy and robustness of the associated motion estimation methods, as has been shown by Barron et al. [4]. Using a Gaussian pre-filter followed by a central difference
One important search strategy is to use a multi-resolution representation of the motion field and conduct the search in a hierarchical manner. The basic idea is to first search for the motion parameters at a coarse resolution, propagate this solution to a finer resolution, and then refine the solution at the finer resolution. This can combat both the slowness of exhaustive search methods and the non-optimality of gradient-based methods. We will present the multi-resolution method in more detail in Sec. 6.9.
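The coarse-to-fine idea can be sketched with two resolution levels and a single global integer translation. This is only an illustration under stated assumptions, not the method of Sec. 6.9: the helper names `mad_search`, `downsample`, and `hierarchical_search`, the 2x2 averaging used to build the coarse level, and the refinement radius of one pixel are all choices made here.

```python
import numpy as np

def mad_search(anchor, tracked, center, radius):
    """Full search for a global integer MV (dx, dy) within `radius` of `center`,
    using the MAD over the region where the displaced frames overlap."""
    h, w = anchor.shape
    best, best_err = center, np.inf
    for dy in range(center[1] - radius, center[1] + radius + 1):
        for dx in range(center[0] - radius, center[0] + radius + 1):
            x0, x1 = max(0, -dx), min(w, w - dx)
            y0, y1 = max(0, -dy), min(h, h - dy)
            if x1 <= x0 or y1 <= y0:
                continue  # no overlap for this candidate
            err = np.mean(np.abs(tracked[y0 + dy:y1 + dy, x0 + dx:x1 + dx]
                                 - anchor[y0:y1, x0:x1]))
            if err < best_err:
                best, best_err = (dx, dy), err
    return best

def downsample(img):
    """Halve the resolution by averaging non-overlapping 2x2 blocks."""
    h, w = img.shape
    return img[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def hierarchical_search(anchor, tracked, coarse_radius=2):
    # 1) Search at the coarse level, where one pixel spans two fine-level pixels.
    cdx, cdy = mad_search(downsample(anchor), downsample(tracked), (0, 0), coarse_radius)
    # 2) Propagate (scale by 2) and refine within +/- 1 pixel at the fine level.
    return mad_search(anchor, tracked, (2 * cdx, 2 * cdy), 1)

rng = np.random.default_rng(2)
psi1 = rng.random((32, 32))
psi2 = np.roll(psi1, shift=(-2, 4), axis=(0, 1))   # true MV: d = (4, -2)
mv = hierarchical_search(psi1, psi2)
```

Here the coarse search tests 25 candidates and the refinement 9, whereas a single-level full search covering the same +/- 5 pixel range would test 121.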
6.3 Pixel-Based Motion Estimation
In pixel-based motion estimation, one tries to estimate a motion vector for every pixel. Obviously, this problem is ill-defined. If one uses the constant intensity assumption, then for every pixel in the anchor frame there are many pixels in the tracked frame that have exactly the same image intensity. If one uses the optical flow equation instead, the problem is again indeterminate, because there is only one equation for two unknowns. To circumvent this problem, there are in general four approaches. First, one can use the regularization technique to enforce some smoothness constraints on the motion field, so that the motion vector of a new pixel is constrained by those found for surrounding pixels. Second, one can assume that the motion vectors in a neighborhood surrounding each pixel are the same, and apply the constant intensity assumption or the optical flow equation over the entire neighborhood. The third approach is to make use of additional invariance constraints. In addition to intensity invariance, which leads to the optical flow equation, one can assume that the intensity gradient is invariant under motion, as proposed in [29, 26, 15]. Finally, one can also make use of the relation between the phase functions of the frame before and after motion [9]. In [4], Barron et al. evaluated various methods for optical flow computation by testing these algorithms on both synthetic and real-world imagery. In this section, we describe the first two approaches only. We also introduce the pel-recursive type of algorithms, which were developed for video compression applications.
6.3.1 Regularization Using Motion Smoothness Constraint
Horn and Schunck [16] proposed estimating the motion vectors by minimizing the following objective function, which is a combination of the flow-based criterion and a motion smoothness criterion:
\[ E(\mathbf{v}(\mathbf{x})) = \sum_{\mathbf{x} \in \Lambda} \left( \frac{\partial \psi}{\partial x} v_x + \frac{\partial \psi}{\partial y} v_y + \frac{\partial \psi}{\partial t} \right)^2 + w_s \left( \|\nabla v_x\|^2 + \|\nabla v_y\|^2 \right). \tag{6.3.1} \]
In their original algorithm, the spatial gradients of $v_x$ and $v_y$ are approximated by
\[ \nabla v_x = [v_x(x, y) - v_x(x-1, y), \; v_x(x, y) - v_x(x, y-1)]^T, \quad \nabla v_y = [v_y(x, y) - v_y(x-1, y), \; v_y(x, y) - v_y(x, y-1)]^T. \]
Nagel and Enkelmann conducted a comprehensive evaluation of the effect of smoothness constraints on motion estimation [30]. In order to avoid over-smoothing of the motion field, Nagel suggested an oriented-smoothness constraint in which smoothness is imposed along the object boundaries, but not across the boundaries [29]. This has resulted in significant improvement in motion estimation accuracy [4].
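The objective of Eq. (6.3.1) can be evaluated directly for a candidate flow field, using the backward differences given above for the gradients of $v_x$ and $v_y$. The sketch below only evaluates the objective; it does not implement the minimization. The function name `horn_schunck_energy` and the convention that differences reaching outside the frame are set to zero are assumptions of this example.

```python
import numpy as np

def horn_schunck_energy(psi_x, psi_y, psi_t, vx, vy, w_s):
    """Evaluate Eq. (6.3.1) for a flow field (vx, vy).

    psi_x, psi_y, psi_t: spatio-temporal derivatives of the image, shape (H, W).
    Differences that would reach outside the frame are taken as zero
    (a boundary-handling assumption of this sketch).
    """
    def grad_sq(v):
        gx = np.zeros_like(v)
        gy = np.zeros_like(v)
        gx[:, 1:] = v[:, 1:] - v[:, :-1]   # v(x, y) - v(x-1, y)
        gy[1:, :] = v[1:, :] - v[:-1, :]   # v(x, y) - v(x, y-1)
        return gx ** 2 + gy ** 2
    flow_term = (psi_x * vx + psi_y * vy + psi_t) ** 2
    smooth_term = grad_sq(vx) + grad_sq(vy)
    return float(np.sum(flow_term + w_s * smooth_term))

shape = (4, 4)
zeros = np.zeros(shape)
psi_x, psi_y, psi_t = np.ones(shape), np.zeros(shape), -0.5 * np.ones(shape)
# A constant flow vx = 0.5 satisfies the flow equation exactly and is perfectly smooth.
e_exact = horn_schunck_energy(psi_x, psi_y, psi_t, 0.5 * np.ones(shape), zeros, w_s=1.0)
# Zero flow violates the flow equation by 0.5 at each of the 16 pixels.
e_zero = horn_schunck_energy(psi_x, psi_y, psi_t, zeros, zeros, w_s=1.0)
# A ramp in vx on a textureless image is penalized only by the smoothness term.
ramp = np.tile(np.arange(4.0), (4, 1))
e_ramp = horn_schunck_energy(zeros, zeros, zeros, ramp, zeros, w_s=2.0)
```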
6.3.2 Using a Multipoint Neighborhood
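In this approach, when estimating the motion vector at a pixel $\mathbf{x}_n$, we assume that the motion vectors of all the pixels in a neighborhood $\mathcal{B}(\mathbf{x}_n)$ surrounding it are the same, being $\mathbf{d}_n$. To determine $\mathbf{d}_n$, one can either minimize the prediction error over $\mathcal{B}(\mathbf{x}_n)$ or solve the optical flow equation using a least squares method. Here we present the first approach. To estimate the motion vector $\mathbf{d}_n$ for $\mathbf{x}_n$, we minimize the DFD error over $\mathcal{B}(\mathbf{x}_n)$:
\[ E_n(\mathbf{d}_n) = \frac{1}{2} \sum_{\mathbf{x} \in \mathcal{B}(\mathbf{x}_n)} w(\mathbf{x}) \left( \psi_2(\mathbf{x} + \mathbf{d}_n) - \psi_1(\mathbf{x}) \right)^2, \tag{6.3.2} \]
where $w(\mathbf{x})$ is the weight assigned to pixel $\mathbf{x}$. Usually, the weight decreases as the distance from $\mathbf{x}$ to $\mathbf{x}_n$ increases.
The gradient with respect to $\mathbf{d}_n$ is
\[ \mathbf{g}_n = \frac{\partial E_n}{\partial \mathbf{d}_n} = \sum_{\mathbf{x} \in \mathcal{B}(\mathbf{x}_n)} w(\mathbf{x}) \, e(\mathbf{x}; \mathbf{d}_n) \left. \frac{\partial \psi_2}{\partial \mathbf{x}} \right|_{\mathbf{x} + \mathbf{d}_n}, \tag{6.3.3} \]
where $e(\mathbf{x}; \mathbf{d}_n) = \psi_2(\mathbf{x} + \mathbf{d}_n) - \psi_1(\mathbf{x})$ is the DFD at $\mathbf{x}$ with the estimate $\mathbf{d}_n$. Let $\mathbf{d}_n^{l}$ represent the estimate at the $l$-th iteration. The first-order gradient descent method then yields the following update algorithm, with step size $\alpha$:
\[ \mathbf{d}_n^{l+1} = \mathbf{d}_n^{l} - \alpha \, \mathbf{g}_n(\mathbf{d}_n^{l}). \tag{6.3.4} \]
From Eq. (6.3.3), the update at each iteration depends on the sum of the image gradients at the various pixels, scaled by the weighted DFD values at those pixels.
One can also derive an iterative algorithm using the Newton-Raphson method. From Eq. (6.3.3), the Hessian matrix is
\[ \mathbf{H}_n = \frac{\partial^2 E_n}{\partial \mathbf{d}_n^2} = \sum_{\mathbf{x} \in \mathcal{B}(\mathbf{x}_n)} \left[ w(\mathbf{x}) \left. \frac{\partial \psi_2}{\partial \mathbf{x}} \left( \frac{\partial \psi_2}{\partial \mathbf{x}} \right)^T \right|_{\mathbf{x} + \mathbf{d}_n} + w(\mathbf{x}) \, e(\mathbf{x}; \mathbf{d}_n) \left. \frac{\partial^2 \psi_2}{\partial \mathbf{x}^2} \right|_{\mathbf{x} + \mathbf{d}_n} \right] \approx \sum_{\mathbf{x} \in \mathcal{B}(\mathbf{x}_n)} w(\mathbf{x}) \left. \frac{\partial \psi_2}{\partial \mathbf{x}} \left( \frac{\partial \psi_2}{\partial \mathbf{x}} \right)^T \right|_{\mathbf{x} + \mathbf{d}_n}. \]
The Newton-Raphson update algorithm is then (see Appendix B):
\[ \mathbf{d}_n^{l+1} = \mathbf{d}_n^{l} - \mathbf{H}_n(\mathbf{d}_n^{l})^{-1} \mathbf{g}_n(\mathbf{d}_n^{l}). \tag{6.3.5} \]
This algorithm converges faster than the first-order gradient descent method, but it requires more computation in each iteration.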
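The Newton-Raphson iteration with the approximated Hessian can be sketched as follows. To avoid interpolation, the tracked frame and its spatial gradient are supplied as callables that can be evaluated at non-integer positions; this interface, the function name `estimate_mv`, the analytic test pattern, the uniform weights, and the fixed number of iterations are all assumptions of this illustration. With sampled images one would instead interpolate the tracked frame and use numerical gradients.

```python
import numpy as np

def estimate_mv(psi1_vals, pts, psi2, grad_psi2, weights, n_iter=10):
    """Iterate d <- d - H^{-1} g using the approximated Hessian.

    psi1_vals: anchor intensities at the neighborhood pixels, shape (N,)
    pts:       pixel coordinates x in the neighborhood, shape (N, 2)
    psi2, grad_psi2: hypothetical callables returning the tracked-frame
                     intensity (N,) and its spatial gradient (N, 2)
    weights:   w(x) for each pixel, shape (N,)
    """
    d = np.zeros(2)
    for _ in range(n_iter):
        e = psi2(pts + d) - psi1_vals            # DFD e(x; d)
        G = grad_psi2(pts + d)                   # gradient of psi2 at x + d
        g = G.T @ (weights * e)                  # gradient of E_n, Eq. (6.3.3)
        H = G.T @ (weights[:, None] * G)         # approximated Hessian
        d = d - np.linalg.solve(H, g)            # Newton-Raphson update
    return d

def f(x, y):
    """Smooth synthetic intensity pattern (hypothetical test signal)."""
    return np.sin(0.3 * x) + np.cos(0.3 * y)

d_true = np.array([0.8, -0.5])
# Tracked frame defined so that psi2(x + d_true) = psi1(x) = f(x).
psi2 = lambda u: f(u[:, 0] - d_true[0], u[:, 1] - d_true[1])
grad_psi2 = lambda u: np.stack([0.3 * np.cos(0.3 * (u[:, 0] - d_true[0])),
                                -0.3 * np.sin(0.3 * (u[:, 1] - d_true[1]))], axis=1)
ys, xs = np.mgrid[0:21, 0:21]
pts = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
d_hat = estimate_mv(f(pts[:, 0], pts[:, 1]), pts, psi2, grad_psi2, np.ones(len(pts)))
```

Replacing the solve step with `d = d - alpha * g` for a small step size `alpha` gives the first-order gradient descent variant, which needs no matrix inversion but typically many more iterations.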
Instead of using gradient-based update algorithms, one can also use exhaustive search to find the $\mathbf{d}_n$ that yields the minimal error within a defined search range. This leads to the exhaustive block matching algorithm (EBMA) to be presented in Sec. 6.4.1. The difference from the EBMA is that the neighborhood used here is a sliding window, and an MV is determined for each pixel by minimizing the error in its neighborhood. The neighborhood in general does not have to be a rectangular block.
6.3.3 Pel-Recursive Methods
In a video coder using motion-compensated prediction, one needs to specify both the MVs and the DFD image. With a pixel-based motion representation, one would need to specify an MV for each pixel, which is very costly. In pel-recursive motion