Rochester Institute of Technology
RIT Scholar Works
Theses
Thesis/Dissertation Collections
2004
Feasibility analysis of correlation based prefetching
using digital signal processing
David L. Morse
Follow this and additional works at:
http://scholarworks.rit.edu/theses
This Thesis is brought to you for free and open access by the Thesis/Dissertation Collections at RIT Scholar Works. It has been accepted for inclusion
in Theses by an authorized administrator of RIT Scholar Works. For more information, please contact
Recommended Citation
Morse, David L., "Feasibility analysis of correlation based prefetching using digital signal processing" (2004). Thesis. Rochester
Institute of Technology. Accessed from
FEASIBILITY ANALYSIS OF CORRELATION
BASED PREFETCHING USING DIGITAL SIGNAL
PROCESSING
By
David L. Morse
August 2004
A thesis submitted in
partial
fulfillment of
the
requirements for the degree of
Master of
Science in
Computer Engineering
Rochester Institute of Technology
Approved by:
Greg P. Semeraro
Primary
Advisor:
Dr
. Greg
Philip
Semeraro, Assistant Professor
Juan Cockburn
Advisor:
Dr. Juan
Carlos Cockburn, Associate Professor
M.Shaaban
THESIS RELEASE PERMISSION FORM
Rochester Institute of Technology
Kate Gleason College of Engineering
Title: Feasibility
Analysis
of Correlation Based Pre fetching using Digital Signal
Processing
I, David Morse, hereby grant pennission to the Wallace Memorial Library to
reproduce my thesis in whole or in part.
David Morse
David Morse
rt
1)3.
b06
ivj
I
I
\
Dedication
This
workis
dedicated
to
my
mother whodidn't
makeit
long
enoughto
see meAcknowledgements
The
author wouldlike
to thank
Dr.
Greg
Semeraro,
Dr Juan
Cockburn,
andDr.
Muhammad Shaaban for participating in
the
author'sthesis
committee.The
author would alsolike
to
thankthe
Rochester Institute
ofTechnology
andin
particularthe
Computer
Engineering
Department
ofthe
Kate Gleason College
ofEngineering
andits
entirefaculty
for
the
exemplary
job it
has done in preparing
the
authorfor
the
researchworkrequiredby
this
thesis.ABSTRACT
FEASIBILITY
ANALYSIS OF
CORRELATION
BASED
PREFETCHING
USING
DIGITAL
SIGNAL
PROCESSING
by
David Morse
Supervising
Professor:
Dr.
Greg
P.
Semeraro
Department
ofComputer
Engineering
As
the
gap between
processor performance andmemory
performance continuesto
broaden
withtime,
techniques
to
hide memory
latency
such as correlationbased
prefetching become
exceedingly
important.
When
amemory
referenceissued
by
the
processor missesthe
level
onecache,
the
request propagatesdown
the
memory
hierarchy
untilit
finally
finds
the
requesteddatum. With
eachlayer traversed,
the
latency
grows exponentially.Prefetching
is
atechnique
usedto
hide
this
latency by
attempting
to
predict whichmemory
references willbe
requestedin
the
nearfuture,
andthenload
theminto
cachebefore
they
are needed.This
workinvestigates
the
use ofdigital
signalprocessing
techniques
in
designing
aneffective prefetch algorithm.
The
algorithm proposedin
this
work usesthe
Kalman
Filter
asthe
basic digital
signal-processing
block.
The
sequence ofmemory
addressfiltering
techniques,
a robust prediction algorithmis
presentedto
predictfuture
miss referencesbased
onthepattern of previousmiss references.The
algorithm was simulatedusing 40
benchmark
programsfrom
the
Olden,
MediaBench,
andSPEC benchmark
suitesfor
the
Alpha
21264
andthe
PISA (a
MIPS-like
ISA)
instruction
set architectures.A
maindifference between
these two
ISAs is
that
the
Alpha
21264
ISA
contains software prefetchinstructions,
andthe
PISA
instruction
set architecturedoes
not.The
simulations place aprefetcher unitbetween
the
level
onedata
cache andthe
level
two
unified cache.SimpleScalar
simulationresults
for
abroad
set ofbenchmark
programsusing
32 Kalman filter
blocks
show anaverage of
6.5% speedup for
theAlpha
21264
ISA,
and an average of5.6%
speedup
for
the
PISA
instruction
set architecturefor
thosebenchmark
programs whichhave
aTABLE
OF CONTENTS
Dedication
iii
Acknowledgements
iv
ABSTRACT
vGLOSSARY
ix
Chapter 1
1
Introduction
1
Chapter 2
6
Motivation
6
Chapter 3
15
Prefetching
andthe
Kalman Filter
15
Compiler
based prefetching
15
Stream Buffers
16
Stride
Prefetchers
19
Correlation based prefetching
the
Markov Predictor
21
The Kalman Filter
24
Chapter 4
30
Algorithm
Design
30
Chapter
5
45
Algorithm
Evaluation
45
Chapter 6
63
Simulation
63
Evaluation
metrics63
Integration
withSimpleScalar
64
Simulated
system68
Results
69
Compiler
Optimization
Levels
76
Chapter
7
79
Discussion
79
Timeliness
82
Feasibility
andpracticality
84
Chapter 8
88
Conclusions
88
Future
Directions
90
LIST OF FIGURES
AND TABLES
Number
Page
Figure
1.1:
Memory hierarchy
for
architecturesimulated2
Figure 2.1: Excerpt
from data level
one cache miss reference streamfor
equake(Alpha)
12
Figure
3.1: The
streambuffer,
(taken
from
[22])
17
Figure 3.2: Stride
prefetch algorithmhardware (taken from
[35])
19
Figure 3.3: Hardware implementation
ofthe
Markov
Predictor
(taken from
[4])
22
Figure 4.1: General
algorithmlayout using
ninternal filters
33
Figure 4.2: Flowchart
of control unit'slogic
whenreceiving
amiss reference38
Figure 4.3: Flowchart
of control unit'slogic
whenreceiving
a cachehit
40
Figure 4.4: Simplified implementation
diagram
of controllogic
andprefetchlogic
44
Table 5.1:
Benchmarks
used48
Table 5.2:
Algorithm
evaluationparameters49
Figure 5.3: Percentage
of correctprefetchesout ofprefetchesissued
for
Alpha
50
Figure 5.4: Percentage
ofmemory
accesscorrectly
prefetchedfor Alpha
51
Table 5.5:
Accuracy
of algorithmfor Alpha
ISA
52
Table 5.6: Effectiveness
of algorithmfor Alpha
ISA
53
Figure 5.7:
Accuracy
of algorithmfor
PISA
55
Figure 5.8: Effectiveness
of algorithmfor
PISA
56
Table 5.9:
Accuracy
of algorithmfor
PISA
57
Table 5.10: Effectiveness
of algorithmfor
PISA
58
Table 5.11: Effectiveness
of algorithmfor Alpha
withlarge
NKF
values59
Table 5.12: Effectiveness
of algorithmfor PISA
withlarge NKF
values60
Figure 5.13: Effectiveness
of algorithm vs.NKF
61
Figure
6.1:
Flowchart
ofSimpleScalar
logic
to
implement
the
algorithm67
Table 6.2:
Architecture
configuration usedfor
the
simulations68
Figure 6.3:
Speedup
onAlpha benchmarks
withpotentialspeedup
greaterthan
10%
70
Table 6.4:
Average
speedupsfor
Alpha benchmarks
with potentialspeedup
greaterthan
10%
70
Figure 6.5:
Speedup
onAlpha benchmarks
withpotentialspeedup less
than
10%
72
Figure 6.6:
Speedup
onPISA
benchmarks
with potentialspeedup
greaterthan
10%
73
Table
6.7: Average
speedupsfor
PISA benchmarks
with potentialspeedup
greaterthan
10%
74
Figure 6.8:
Speedup
onPISA benchmarks
withpotentialspeedup less
than
10%
75
Figure 6.9:
Average speedup for
various optimizationlevels
ofAlpha
76
GLOSSARY
Access Pattern
The
sequenceofmemory
accessesissued
by
aprogramGUI
Graphical
User Interface
Memory
Reference Stream
Sequence
ofmemory
addresses referencesby
an execution of a programMiss
Reference
Stream
Sequence
ofcache missesthat
are causedby
an execution of a programNKF
Number
ofKalman Filters
-the
number offilters
usedin
the
algorithm.Dictates
the
implementation complexity
ofthe
algorithm.RDM
Recycling
Delta
Multiplier
-the
factor
that
multiplies theamount of cycles
between
consecutiveinput
pointsfor
aparticularfilter. Used in
conjunction withthe
RT
to
recyclefilters back
to
theavailable pool.RT
Recycling
Threshold
thenumber of cyclesin
which afilter
willbe
recycledif
it
does
not receiveanotherinput
point.Sim-outorder
A
productofthe
SimpleScalar
CPU Simulation
suitethat
simulates an out of orderissue
superscalarmicroprocessorSRAM
Static
Random
Access
Memory
VSTT
Virtual Stream
Thresholding
Tolerance
the threshold
Chapter 1
Introduction
The
problem ofcache misslatency
continuesto
grow as processor speedsincrease
faster
than
their
memory
system counterparts.During
a given execution of aprogram on a
microprocessor,
according
to the
instructions in
the
program,
the processor will needto
accessmemory in
variouslocations
to
performits
task.Since
processor performanceis
increasing
at abetter
ratethan
memory
performance[31],
the
penalty
incurred
for accessing memory
continuesto
increase.
To
combatthis
problem,
the
cache wasdesigned
to
serve as anintermediate step
in
thememory
hierarchy
thatholds
the
currently
being
usedmemory
valuesin
ahigh
speedSRAM (Static Random
Access
Memory). The
concepts oftemporallocality
and spatiallocality
play
apartin
traditional cachedesign,
because
they
assumethat
memory
addresses thathave
recendy been
accessedhave
ahigher probability
ofbeing
accessed againsoon,
andthememory
addresses nearthat address alsohave
ahigher
probability
ofbeing
accessed.A
cache worksby bringing
blocks
ofmemory
into
the
cache as the processorrequeststhem.
Once
the
block is in
memory,
the
processor can access andmodify
these
memory
values asthough
it
weredirectly
modifying
the
valuesin
the main memory.from
the
cacheto
make roomfor
new cacheblocks. At
this
time
the
contents ofthe
cache
block
arewrittenback
to
mainmemory
and a new cacheblock is loaded into
the
previouscache
block's
space.There
areseveralvariations of cachedesigns
andpoliciesthat
attemptto
maximizeperformance,
but
the
algorithm presentedin
this
workdoes
notrequire
any
specific cachedesign.
The
algorithm usedin
this
workis
simulated on an architecturethat
usesthree
caches,
an
instruction level
onecache,
adata level
onecache,
and a unifiedlevel
two
cache.They
are organizedin
the
memory
hierarchy
asdescribed in figure
1.1
[31].
Processor
Data LI
cacheInstruction
LI
cache 1 i r i i1
rUnified L2
cache 1 if
LMain
memory
Figure 1.1:
Memory hierarchy
for
architecture simulatedAs
the processor executes aprogram,
it
will encounterinstructions
that
referenceand write values
from
andto
memory.These memory instructions
aregivenmemory
addressesas
their
arguments.This memory
addressis
then
transferred
to the
data level
one
cache,
whichsearches throughits
contentsto
seeif it has
the
particularvaluein its
cache.
If
it
does,
it
quickly
returnsthe
value containedin
that
address.If
not,
it
willsendan
inquiry
to
the
unifiedlevel
two
cacheto
seeif it has
the value.Since
the
level
two
cacheis
muchlarger
that the
level
onecaches,
it may have
this
value even when thelevel
one cachedoes
not.If
it
does,
it
returnsthe
cacheblock
to the
level
onecache
that
inserts it into its
owncache,
and returnsthe
value requestedto
the
processor.
If
it does
not,
it in
turn
inquires
the
mainmemory (or
nextlevel
of cacheif
there
are morelevels)
for
the address.For
eachlevel
ofthe
memory
hierarchy
that
has
to
be
traversedto
find
amemory
value,
thelatency
increases,
andthe
processorhas
to
wait a
longer
amount oftime for
the
valueit
needsto
continue processing.Prefetching
is
aprocessin
whichthese
memory
references arepredicted,
andloaded into
cachebefore
theprocessor asks
for
thevalues.The
goalis
tohave
a netdecrease in
executiontime
of a program.
The
sequence ofthese
memory
addressesis
the
memory
referencestream,
and
the
sequence ofthese
referencesthat
are cache misses(the
cachedoes
nothave
the
value),
is
referredto
asthe
miss reference stream.The
algorithm presentedin
this
workfocuses
onprefetching for
the
data level
onecache,
thoughit
couldbe
appliedto the
instruction
level
one cache andthe
unifiedlevel
two
cache as well.It
is based
onthe
presence of accesspatterns
in
thememory
predicted.
This
algorithmtreats
the
memory
reference streamas adigital
signal,
whichis
simply
a series ofdiscrete
points,
and asthe
basis
ofthe
prefetcher predictionlogic,
traditionaldigital
signalprocessing
techniques
are applied.The design
ofthe
algorithmpresented
in
this
workis
configuredto
recognizeandlock
onto
strided accesspatternsin
thedata.
The
resultis
a prefetcherthat
is
shownin
simulationusing SimpleScalar
[15]
to
actually
decrease
executiontime
for benchmarks
from
theOlden
[32],
MediaBench
[33],
andSPEC2000
[11]
benchmark
suites.The
evaluations show anaverage reduction
in
miss references of29.4%
for
the
Alpha
ISA
benchmark
programs,
and40.9%
for
the
PISA
architecturebenchmark
programs.The
SimpleScalar
simulations ofthis
algorithm show an averagespeedup
of6.5% for
the
benchmark
programsfor
theAlpha
ISA,
and5.6% for
benchmark
programsfor
thePISA
architecture.The
remainder ofthis
document is
organized asfollows:
Chapter
2
describes
themotivation
for
this
work,
in
terms
ofexisting
prefetchdesigns
andhow
this
algorithmcan
improve
onthem.In Chapter
3,
detailed
explanations of somecurrently
proposedprefetch algorithms are
discussed,
along
with atheoretical
background
oftheKalman
Filter
algorithm.In Chapter
4,
the
design
of the prefetch algorithm proposedby
this
work
is
presented.In Chapter
5,
a theoretical evaluation ofthe effectiveness ofthis
algorithm
is
conducted.In
Chapter
6,
the code modificationsto
SimpleScalar
[15],
andthe
results ofthe
SimpleScalar
simulationsare presented.In Chapter
7,
adiscussion
ofthe
algorithm and a comparison ofthe
theoreticalresults with the simulation resultsChaptet2
Motivation
There
aremany
classes of applicationsthat
sufferfrom
the
latency
ofmemory
operations,
and asprocessorclockspeedsincrease,
thisdisparity
growslarger
[1, 2,
14].
One
majortechnique to
hide
this
memory
latency
is
prefetching.Prefetching
schemeshave been
proposedusing both
software[19]
andhardware
[1, 2,
3,
4,
5,
10,
16]
techniques.
Prior
researchhas found
that
many
schemeshave
been
provento
have
apositive
impact
onreducing
the
number of cachemissesfor
variousprograms,
but
noprefetcher
is ideal in
all situations.In
addition,
to
be
effective aprefetching
algorithmmustalso
be
shownto
actually
reduce executiontime
[12].
The
mostbasic
prefetch schemeis
to
simply increase
the
cacheline
size[20].
This
relies on
the
concept of spatiallocality,
andautomatically
prefetchesdata
atmemory
addresses
that
are nearthe
original miss address.This
idea is
extendedin
[21]
withSequential
Prefetching. Sequential
Prefetching
does
notjust
fetch
one cacheline,
but
fetches
the
nextfew
cachelines
aswell,
in
hopes
they
willbe
accessed soon.A further
improvement
onthis
schemeis
presentedin
[22],
the
streambuffer. The
streambuffer
stores prefetched
data in
a separatehardware
unitfrom
the
cache,
and asdata is
consumedfrom
the
streambuffer,
it
continuesto
prefetch sequential cachelines.
These
streambuffers
arefetched using
aunitstride,
fetching
each cacheline
one after[23, 35],
which calculated a non-unit stridebetween memory
accessesby
a givenload
instruction.
Prefetchers
that
arebased
on spatiallocality
such asthe
streambuffer
and strideprefetcher provide a reduction
in
cache missesfor
applicationsthat
accessdata in
regular
patterns,
such asmany
scientific applications.However,
for
applicationsthat
access
data in
non-regular patterns such as applicationsthatuselinked data
structures,
these
prefetchersdo
notachievethe
same success thanthey
do in
other applications[1,4].
For
this
reason,
correlationbased
prefetchershave been
proposed[1, 2,
4, 5,
24].
Correlation based
prefetchers workby
attempting
to
exploitany
correlationbetween
previous
memory
accesses and other execution parameters withfuture memory
accesses.
If
these
future memory
accesses canbe
predicted with some amount ofaccuracy,
then a prefetcher canbe
employedto
bring
thesememory
addressesinto
cacheearly
andhide
memory
latency.
Typical
correlationbased
prefetchersemploy
the
use of a correlationtable[4]
in
whichmemory
accesses are storedin
a tablealong
withfuture memory
accesses.Joseph
andGrunwald
[4]
propose the use of aMarkov
model togeneratethe
probabilitiesthat
agiven
memory
access willfollow
some othermemory
access.These
patterns are storedin
the
correlationtable
structurealong
withtheir transitionprobabilities.When
there
is
a miss
reference,
the
tableis
consultedto
see whatthe
next miss references arelikely
to
ninning,
the
transition
probabilities areconstantly
being
updatedto
determine
the
most
likely
patterns ofmissreferences.The
downside
ofthis
modelis
that
it
cantake
along
time
for
the
systemto
build
a suitabletable
oftransition
probabilities given acertain
program,
and alsorequiresalarge
amount ofhardware
to
storethe
correlationtable.
This
work proposes a new approach to correlationbased prefetching
that
does
notrequire
large
tables
of valuesto
be
storedin
memory,
andhas
the
potential of notneeding
long
periods oftime
andmany
repeatedmemory
accesspatternsto
be
ableto
accurately
issue
prefetches.This
work evaluatesthe
feasibility
ofusing digital
signalprocessing
toimplement
aflexible
prefetch algorithm.Digital
signalprocessing
has
previously been
proposedfor
usein
microarchitecturefor
branch
prediction[17].
The FAB Predictor
usesFourier
analysis onbranch
outcomesto
predictfuture
branches
and achieve processor speedup.The
mainbenefit
that the
FAB
predictorhas
over otherbranch
prediction schemesis
that
it
can representpatterns of
branch
outcomesthatarevery
long,
and wouldnormally
takean unrealisticamount of
memory
to
store.By
transforming
branch
history
patternsfrom
the
time
domain
to the
frequency
domain,
it
is
possibleto
store patternsthat
wouldtake
2
bits
in
the
time domain in only
52
bits in
thefrequency
domain. This
allowsdetection
ofpatterns
that
have
avery
long
period.Digital
signalprocessing is
also usedin
otherclock signal
is divided
acrossthe
processordie into
multipledomains. To
controlthe
frequencies
of eachdomain,
anAttack-Decay
algorithmbased
ondigital
controltechniques
[25]
is
used,
withthe
objectivebeing
to
saveenergy
by
reducing
the
clockfrequency
in domains
that
arenotfully
activefor
givenportionsoftime.
The
miss reference stream of aprogramexecutionis defined
asthe
sequences oflevel-one cache miss addresses.
In
this
work,
the
term miss reference steam refersto the
level
onedata
cache miss referencestream,
unless otherwise noted.The
miss referencestreamof a program execution can
be
treatedas adigital
signal with respectto
time.
Each
miss addressis
adiscrete
signal point.As
mentionedearlier,
correlationbased
prefetchers attemptto extrapolate correlations
between
past miss referencesto
predictfuture
references,
but many
ofthe
proposed prefetchers accomplish thisby
creating
and
maintaining large
history
tables which are examinedto
seeif
a prefetch couldbe
made.
The
algorithm proposedin
this
workinvestigates
the use ofthe
Kalman Filter
[6, 7, 8,
9]
to
analyzethe
miss reference stream and usethe
prediction capabilities ofthe
Kalman Filter
to
determine
prefetch addresses.The Kalman Filter
was chosenbecause
ofits
flexibility,
ease ofimplementation,
andability
to
be
pipelined[7,
26].
Because
ofthis
andthe
algorithm's tolerancefor
latency,
animplementation
ofthis
algorithm would not
impact
processor cycletime.
Using
the
miss reference stream asthe
input
signalfor
theKalman
Filter,
thepredictions are stored and when apreviously
predicted value matches
the
nextmissreference,
a prefetchis issued for
the
next valuepredictable manner
according
to
the state model assumptions ofthe
Kalman
filter,
then predictions can continue
to
be
made,
and prefetches can continueto
be issued.
Since
the
Kalman Filter only
needsto
storeits internal
stateparameters,
large
tables
ofvalues
do
not needto
be
storedto
implement
the
prefetcher.The
mainlimitation
ofthis
prefetch algorithmis
the
computationalcomplexity
ofimplementing
the
Kalman
Filter
algorithm,
but
the
International
Technology Roadmap
for Semiconductors
(ITRS)
[30]
predictsthat the trend
ofincreasing
transistor
counts on microprocessorswill
continue,
enabling
these
filters
to
be
implemented.
An
argument canalsobe
madeto
simply
increase
sizes oflook-up
tables
to
enhance otherexisting
prefetchalgorithms,
but
asmemory
sizesincrease,
the speedstend to
decrease. The
algorithmproposed
in
this
workdoes
nothave
that
limitation.
The
algorithm proposedhere
consists of a variable number ofKalman Filters. Each
Kalman Filter
is
responsiblefor
analyzing
its
own virtual miss reference stream.The
overallmiss reference stream
is
a conglomerate of allthe
miss referencesissued
by
the
program,
and canbe
thought of as the superposition of multipleindividual
missreference
streams,
which whendivided
canbe
analyzedby
aKalman filter
to
producecorrect predictions.
Each
virtual miss reference streamis
one ofthese
individual
reference streams.
In
orderto
determine
which miss referencebelongs
to whichstream;,
simplethresholding
with respectto
missaddressis
used.If
the
miss addressis
within the threshold value of
the
previous miss address that was routedto
a givenfilters
had
a previous value thatis
withinthe
threshold,
then
a newfilter is initialized
with the current point.
If
allfilters
arecurrently in
use,
the
pointis ignored.
A
mainbenefit
ofthis
algorithm andits
methodofusing
the
Kalman
filter
ratherthan
storing
large
tables
of valuesis in
the
context-switching
environment of an actualoperating
system.When
a context switchoccurs,
typical
prefetch algorithms needto
flush
their predictortables
andbegin
learning
againif
they
are made aware ofthe
context
switch,
or willjust
begin running
the unrelated code withinaccurate
data
structuresguiding
their prediction algorithms.The
algorithm proposedin
this workcan
typically
begin
to
prefetchaccurately
withinfour
cachemiss references of a given virtualstream,
so a context switchdoes
nothave
as much of apenalty
asit does
onother prefetch algorithms.
Figure 2.1
shows an example ofthe
miss reference streamfor
equake(Alpha 21264
ISA). There
are several virtual streamsvisibly
present,
each of whichhaving
a stridedX107
MissReference StreamforEquake [DataLevel 1cachej
1.73 -i 1.725 -i/p V 172 r. i/i 5 1.715 1.71 1.705 I 1 1 iii, -4.8 4.9 5 5.1 Cyclenumber 5.2 5.3 5.4 x107
Figure 2.1: Excerpt from data level
one cache missreferencestreamfor
equake(Alpha)
The
algorithmproposedin
this
workis
functionally
similarto the
DSTRIDE
predictor[10]. The
DSTRIDE
predictor uses the miss reference stream and maintains aStride
Prediction
Table
ofprevious miss addresses.Comparisons
aremadebetween
the
pastfew
missaddresses andthe
currentmissaddressto
determine
allpossiblestride values.The
algorithm also uses aSteady
Stride Table
with statebits
to
determine
when asteady
strideexists,
andissues
prefetchesbased
onthat.
The
algorithmproposedhere
is
similarin
the
respectthat the
proposed algorithm uses aparticle-with-constant-velocity
state modelfor
the
Kalman Filters.
These
Kalman
Filters
aredesigned
to
be
algorithm
does
not needto
storetables
ofpredictionvalues.Each Kalman Filter
usedmaintains
the
history
ofits
miss reference streamviaits
recursive nature.Typical
stride prefetchers also requireknowledge
ofthe
programcounter,
orthe
nature of
the
currentinstruction
being
executed[23]. The
algorithm proposedin
thiswork
does
not requirethis
information. It only
needs accessto the
miss referencesissued
by
the
data
level-
1
cache,
and adata
port onthe
data
level-1
cache with whichto
insert
the
prefetchesinto
the
data
stream.Due
to
this
design,
a physicalimplementation
ofthis
prefetching
algorithm can executein
parallelwiththe
processor suchthat
it
would not affectthe
mannerin
which current processors aredesigned.
In
fact,
it
wouldbe
quite simpleto
augmentmany existing
processordesigns
totake
advantage of
this
algorithm.The
objective of this workis
to
present an algorithm which usesdigital
signalprocessing techniques,
specifically
theKalman
Filter,
to
perform prefetching.The
algorithm will then
be
evaluatedin
atheoretical
mannerusing
pre-generated missreference streams
to
determine its
effectivenessin
being
ableto
predictfuture
missreferences.
It
willthen
be
integrated into
a cycle-by-cycle super-scalar processorsimulator
to
determine its
effect ofexecutiontime
for
a set ofbenchmark
programs.The
experiments conducted showthat
it is
in
fact feasible
to
usethese
digital
signalthis
workto
microarchitecturalresearch consists of an algorithmthat
is
shownto
be
Chapter 3
Prefetching
andthe
Kalman Filter
There
have been many
prefetch algorithmspreviously
proposed,
andmany
ofthe
algorithms
have been
shownto
be
effectivein reducing
the
numberof cachemisses aprogram execution experiences
for
certain conditions.Both
software[19]
andhardware
[1, 2,
3,
4,
5,
10,
16]
approacheshave been
proposed.Most
prefetchalgorithms are
designed
by
first
identifying
predictable patternsin
the
memory
accessesin
aprogram,
andthen
trying
to
exploitthose
patterns.This
section willdescribe
afew
of
the
most wellknown
prefetchalgorithms.Compiler based prefetching
One
technique
for prefetching
is done
at compiletime.
The
compiler analyzesthe
workload and
determines if it is
appropriatetoinsert
specialPREFETCH
instructions,
which will
load memory into
the
cacheearly,
in
an attemptto
have it ready in
cachewhen the processorneeds
it.
Mowry
et al.[19]
proposes an algorithmfor embedding
compiler-timeprefetches
in
compiled codefor
scientificapplications,
especially
those
that
deal
withdense
matrices.One
ofthemainissues
with softwareprefetching is
that
nothing is known
at runtime about thedynamics
of anexecution,
andthe
prefetchalgorithm cannot adapt.
Unnecessary
prefetches are also anissue
in
this
algorithm,
compiledcode
does
notknow
whatis in
the
cache atthe
time it
executesthe
prefetch,
these
prefetchinstructions may be using
processor cycletime
withoutadding any
benefit.
The
simulation ofthis
algorithmin
[19]
showedthat
for
the
selectedbenchmarks,
50%
to
90%
ofthe
processor cycles used while stalledwaiting for
memory
accesses wereeliminated,
andit
requiredon averageless
than
a15% increase
in
theinstruction
countsfor
these
programs.The
existence of compilerbased prefetching is important
to this
workbecause
the
Alpha
21264
architecture[34]
does
contain specialinstructions for
prefetching,
which willload
cacheblocks into
the
data level
one cacheif
they
arenotalready
present.Of
course,
if
the
blocks
arealready in
cache,
then this
is
a wastedinstruction.
The PISA
architecturehas
noanalogousinstruction
in its
ISA.
Stream Buffers
Jouppi
[22]
proposedthe
use of a streambuffer,
whichis
aFIFO
queuefull
of cachelines
that
have been
prefetcheddue
to
a cachemiss reference.This
technique
employsa
secondary hardware
unitthat
is
checkedin
parallel with thelevel
one cache.When
there
is
a missreference,
the
streambuffer
will prefetchthe nextfew
cachelines
and store themin
thisqueue,
in hopes
that
the processor willneedthese
cachelines
next.When
thecacheis
checkedfor
existence of amemory
address,
thehead
ofthe
streambuffer
queueis
also checked.If
the
requested cacheline is in
thestreambuffer,
then
it
Under
goodcircumstances,
this
process will continue andthe
streambuffer
willbe
able
to
continually
providethe
next requested cacheline.
Fromprocessor Toprocessor
Head entry
Tad entry
Directmapped cache
Stream buffer (FIFO Queue version)
Tonextlowercache Fromnextlowercache
Figure 3.1: The
streambuffer,
(taken
from
[22])
The
streambuffer FIFO
as shownin
figure
3.1
is
madeup
of a sequence ofentries,
each of which
has
a cacheline tag,
an availablebit,
and a cacheline
ofdata.
When
aline is
prefetched,
the
tag
is inserted into
the
queue with the availablebit
set tofalse.
When
the
cacheline
arrives,
the
availablebit is
setto
true.In
themostbasic
version ofthe stream
buffer,
only
thehead
ofthe
FIFO
is
comparedin
parallelwiththe
cacheto
determine if
the
cacheline is
available,
so evenif
therequested cacheline is later
onin
parallel concurrent
searching
ofthe
entire queuefor
the
requested cacheline,
atthe
cost of additional
hardware
complexity.The
streambuffer
works wellfor
the
instruction
cache,
but does
nothave
the
samesuccess with
the
data
cache,
sincedata
accesses aretypically
interleaved from different
sources(the
algorithm proposedin
chapter4
refersto
these streams as virtual missreference streams).
An
expansion on the streambuffer is
themulti-way
streambuffer
which
is actually
a collection of streambuffers,
each of whichis
working
onits
ownvirtual miss reference stream.
When
a cache accessoccurs,
the
head
of each ofthe
queues
is
examinedin
parallelfor
a match.If
there
is
amiss referencethat
is
notin
any
of
the
queues,
then the
least recently
used queueis flushed
and assignedto the
new stream.Furthermore,
if
the
memory
accesses are notsequential or unitstride,
then the streambuffer has
difficulty. If
the
data
access patternis
a strided cache-block accesspattern,
then
the
streambuffer
does
not work as well.The
standard streambuffer
algorithmproposed
in
[22]
showeda reductionin
data
level
one cache missesranging from
about6%
to
25% for
the
selectedbenchmarks.
The instruction
level-one
cache misseswerereduced
by
figures
between
60%
and80%,
due
to
the
sequential nature of theStride Prefetchers
Fu
et al.[35]
propose a strideprefetcher,
whichis
anotherhardware-based
prefetching
algorithm
that
is designed
to
detect
and prefetchbased
on stridesin
the
memory
accesses.
It
worksby
rnaintaining
aStride Prediction
Table
(SPT)
in
a cache-likehardware
structure.The SPT
contains a series of entriesthat
consist of aninstruction
address
(IA) field,
alast memory
address(MA)
accessedfield,
and a valid(V)
bit.
As
instructions
that
referencememory
areexecuted,
they
are storedin
the
SPT along
withthe
memory
addressthey
referenced.Figure 3.2
showsthe
design
of the strideprefetcher.
instruction
addressmemory
address<JA>
(MA)
INSTRUCTION
ADDRESS
(IA)
l_t
LAST MEMORY
ADDRESS
(MA)
ranparattH
subtracter adderspthit prefetchaddress
When
aninstruction
references amemory
address,
the
SPT is
checked.If
there
is
amatching IA in
the
SPT
withthe
V bit
notset,
the
IA
andMA
pairwill overwritethis
entry
and setthe
V
bit;
this
is
anSPT
miss.If
the
IA is in
theSPT
andthe
V bit
is
set,
then the
stride willbe
calculatedbased
onthe
currentmemory
address andthe
MA in
the
SPT. If
the strideis
notzero,
then
aprefetch canbe
issued. The
SPT
entry
is
then
updated withthe
newIA/MA
pair.Chen
andBaer
[23]
proposea similar stride prefetch algorithmthat
relies on aLook-ahead
PC
andbranch
predictionto
calculate strides andissue
prefetches.This
algorithm addressesthe
issue
oftimeliness in
prefetching,
whereasthe
algorithmproposed
in
[35]
does
not.The
look-ahead PC is designed
tolook
ahead at possibleinstruction
referencesthat
mightbe issued
a number of cyclesin
the
future
equalto
the
memory latency.
If
the
look-ahead
PC
determines it
canmake aprefetch,
the
idea
is
thatby
thetime
the realPC
reachesthis
instruction,
the
memory
latency
willhave
already
passedandthedata
willbe
immediately
availablein
the
cachefor
the
processorto
access.The
stride prefetcheris
ableto
prefetchmemory
accessesin
patternsthat
the
traditional stream
buffer
cannot,
due
to the streambuffer's limitations
onits
fetching
patterns.
A
majorimplementation issue is
that the
stride prefetcherrequires accessto
the
instruction
stream.A
stride prefetcher thushas
ahigh impact
onprocessordesign
Correlation based
prefetching
the
Markov Predictor
Correlation based
prefetching
[1, 2,
4,
5,
24]
is
adifferent
style ofprefetching
compared
to
compilerbased
prefetching,
streambuffers,
and stride prefetchers.Correlation
based prefetching
attemptsto
find
a correlationbetween
past missreferences and
future
missreferences,
ratherthan
just prefetching
severalblocks
ofsequentialcache
lines
wheneverthere
is
amissreference,
such asin
the
streambuffer
case.
One
correlation-basedprefetcheris
Joseph
andGrunwald's
[4]
Markov Predictor.
The Markov
predictorbuilds
and utilizesMarkov
prediction sequencesbased
on thehistory
ofmissesthat
werepreviously
encounteredin
an execution.By inferring
that
a given pattern on cache misses willprobably
repeatin
thefuture
with someprobability,
these
patternscanbe
storedin
a cache-like structure and canbe
usedto
predictfuture
memory
access patterns.Figure
3.3
showstheportion ofthe
Markov
prefetcherdesign
Current
Miss Reference
Address
Next Address
Prediction Registers
< M ,MtssAddr
1
1-Pred
2"Pred
a*"Pred
4*Pred
**1-Pred
2ndPred
3"*Pred
4"Pred
-
1-Pred
2-Pred
3*Pred
4thPred
MissAddrN
1"Pred
2"dPred
3dPred
4*Pred
Prefetch
Request
Queue
Prefetch Request
Prefetch Request
I
CPU Address
MUX
L2
I
Request
iUrji
To
CPU
Figure 3.3: Hardware implementation
oftheMarkov Predictor (taken from
[4])
The Markov Predictor
worksby
watching
miss reference patternsfrom
a programexecution and
storing
the missreferencesthat
aremostlikely
to
be issued
after a givenmissreference.
The implementation does
this
by
using
atable
ofentries,
each of whichcontains a miss
address,
andfour
possible miss addressesthat
may
occurnext,
based
on previous executionhistory. In
one ofthe
schemevariations,
eachmiss address alsopossible prefetch addresses
to
issue
a prefetch on.As
codeexecutes,
this
table
is
continually
updated and usedto
issue
prefetches.When
a miss referenceis
encountered,
prefetchesareissued for
the
possiblefuture
miss referencesand placedin
a
16-entry
FIFO
queue.These
prefetches arethen
issued according
to
bandwidth
availability.
Demand-fetch
cachemissesalwayshave
thehighest
priority.When
a prefetchis
satisfied,
it is
placedin
afully
associative stream-bufferlike
structure
instead
ofthe
level
one cache(not
shownin
Figure 3.3). This
structureis
thenexamined
in
parallelwiththe
level
onecacheto
determine
if
thereis
a cachehit.
If
anentry
from
this
structureis
used,
it is
transferredto the
reallevel
one cache.In
contrast
to the
traditional streambuffer,
any
of the entriesmay be
used,
and thecorresponding
later
entries willbe
shiftedup
one slotif
an earlierentry
is
removed.This
approach showedto
work wellin
tracedriven
simulation when combined withstride prediction and stream
buffers
[4].
The
problem withthis
algorithmis
that the
correlation
tables
are createdfrom
past executions of programs suchthat
future
executions of
the
same programs willthen see the mostbenefit.
It
is
proposedin
[4]
that
it
wouldbe
used across all executions on a processorin
order to populatethe
table.
However,
no mentionis
madeto
the effect ofmultiple programsrunning in
different
memory
spaces andpolluting
the tables.The
tablemay
not reflectthe
propermemory
transitionsby
thetime
a given programis
executed,
profiledinto memory
memory
allocatedto
it,
andit
might requireavery large
table to
be
ableto
workin
the proposed manner.The
size ofthe table
is limited
to the
amount ofmemory
availableto the
prefetcherimplementation,
but
there are no simulationsthat
deal
with amulti-program environment
like
that
that
atypical
microprocessor wouldbe
subjectedto,
only
singleexecutiontraces.
Many
ofthe
hardware
prefetch algorithmsdiscussed
makeuse of ahardware
unitthat
storesthe
prefetchedvalues,
whichis
then
checkedin
parallelwith thelevel
one cache.This
worksto
reducepolluting
the
cache with useless prefetch values andreplacing
usable cache
blocks from
the
cache.This
algorithmproposedin
section4
does
not usesuch a
structure,
but
that
may be
apossible enhancementto the
proposed architecture.The Kalman Filter
In
1960,
R. E. Kalman
[6]
proposedhis Kalman filter
as anewrecursive optimallinear
estimator.
The
Kalman
filter is
not as much of a 'filter'as
it is
adata
processing
algorithmthat
recursively
operates ondiscrete data
pointswithoutthe
needto
store allpast
data
points.Take
for instance
a physicalprocess,
which needsto
be
estimated.First,
amathematicalmodel ofthe
processis
derived,
andrepresentedin
a statematrixtransition
form.
However,
since thismodelis
notprecisely
accurate withrespectto the
actual
process,
there
areinherent
errorsin
the
mathematical calculations.The
Kalman
filter
maintains a statemodel ofvarious stateparameters,
which aregenerally
the
statedata (measurements
ofthe
process)
that
is
assumedto
have
aGaussian
errorprobability
density
function,
incorporates
past measurements with currentmeasurements and
in
accordance with aspecificdegree
ofreliability,
estimatesthe
newvalues of the state variables
based
onthe
measurements.Basically,
aKalman
filter
canproduce
the
best
estimatefor
a set ofstate variablesbased
onthe
measurements andthe
knowledge
ofhow
accuratethe
measurementsandthe
processstatetransitions
are.With
respectto the
physical processandthe
measurementofthat
physicalprocess,
the
Kalman
filter
can produce an optimal estimation ofthe
values of those stateparameters.
In
[7]
and[8],
a more simplifiedintroduction
to the
Kalman
filter is
given,
andthe
most relevant points are summarized
here. The Kalman
filter is defined
by
five
matrices,
the n x n system matrixA
and optional n x1
input
matrixB
,the
m x nobservation matrix
H
,the
n x n system covariance matrixQ,
andthe
m x mobservation covariance matrix
R
.These
matrices representthe
statetransition
parameters,
the
observablevariables,
theuncertainty in
the
statetransitions,
andthe
uncertainty
in
the
observation,
respectively.The Kalman filter
attemptsto
estimate thelinear
stochasticdifference
equationfor
jcett":
with aprocess
measurement
z 5Rmdetermined
by:
zk=Hxk+vk.
(3.2)
The
variableswk
andvk
are random variablesrepresenting
the
process andmeasurement
uncertainty,
respectively.These
arerandom variablesthat
are assumedto
be
independent,
white,
andhave
normalprobability distributions
governedby:
p(w)~N(0,Q).
(3.3)
p(v)~N(0,R).
(3.4)
Two
other variablespertaining
to
the
algorithm are the n x n estimation errorcovariance matrix
Pk
andthe
n x mKalman
gain matrixKk
.The
derivations
ofthese
parameters
have been
omittedfor
brevity,
but
they
representthe
covariance ofthe
error
between
the estimationof the current state and the actual currentstate,
andthe
blending
factor
between
the
measurement andthe
previouspredictionthat
minimizesthe
covariance ofthe
errorin
the currentmeasurement,
respectively.These
matricesgovern the
balance between
how
much a measurementis
trusted vs.how
much aprediction
is
trustedduring
theKalman filter
updates.The discrete Kalman
filter
algorithm consists of anongoing
processbetween
two
phases:
the
time
update,
or predictionstage,
and the measurementupdate,
orused
to
predict whatthe
currentstatevariables willbe. This
is
theapriori estimate,
andis denoted
with a superscript minussign,
e.g.xk
.Note
the
hat
symbol(A),
whichdenotes
thisis
a predicted value.In
thetime
updatestep,
the estimation errorcovariance
Pk
is
also updated.The time
update equations are:x'k
=Axk_x+Buk_x
(3.5)
Pk-=AP^AT+Q
(3.6)
These
equationspredictfrom
the
previousiteration's time step
k-1
to the
currenttime
step k. After
the time
update equationshave
been
processed,
these
apriori
estimatevalues are subjected
to
the
measurement updateequations,
whichincorporate
thecurrent measurement value
to
determine
the
new estimated state.The
measurementupdate equations are:
Kk=PkrHT(HPk-Hr+R)-1
(3.7)
xk=x-k+Kk(zk-Hx-k)
(3.8)
Pk=(I-KkH)Pk~
(3.9)
The
measurement equationsincorporate
the
observationdata zk
to
producethe
acovariance matrix
R
andthe
apriori
estimatefor
the
estimation errorcovariancematrixPk
in
(3.7).
This
new gain representshow
muchthe
new observationis
trusted
in
comparison
to the
estimated value.Measurement
update equation(3.8)
computesthe
aposteriori
estimate ofthe
state variables and usesthis
new gain matrixto
integrate
aportion of
the
new measurement with the previous estimation value.Finally,
the
aposteriori
estimate ofthe
estimation error covariancematrixis
calculatedfrom
equation(3.9).
The
time
update andmeasurementupdate steps continueto
alternate,
producing
newpredictions and
continually
correcting its
predictions with measurementinformation.
Because
ofthis
predictiveproperty
that
theKalman
filter
has,
it is
afeasible
choicefor
the
prefetching
algorithmdiscussed
later in
this work.To
usethe
Kalman
filter
algorithm
for
predicting,
afterthemeasurement update equationshave been
appliedto
integrate
a newdata
pointinto
the estimatedmodel,
the
time
update equations canbe
applied
to the
state variablesto
attain a new apriori
estimatefor
the
next state.This
estimationcanthen
be
used asa predictionfor
the
nextprefetchaddress.The
system covariance matrixQ
andthe
observation covariance matrixR
aretreated
as static matrices
that
do
not change overtime in
this
description,
but in fact
these
matrices are notalways stable
in
reallife
processes.For
the
sake ofsimulation ofthis
covariance matrix
Pk
andthe
Kalman
gain matrixKk
will stabilizequickly
and remain constantunder these conditions[7].
Since
the
Kalman
filter is
avery
generalalgorithm,
it
was chosenfor
use as thedigital
signal-processing
elementin
this
work.A
simple modificationto the
Kalman
filter's
internal
parameters canmakethe
filter
predictin different
manners,
so an algorithmthat
usesthese
filters
as abasic logic
processing block
canpotentially have
alarge
number of opportunities
to
usethe
samefilter
implementation
andmechanism,
andbe
able
to
producemany different
types
of resultsby
simply
changing
theinternal
parameters ofthe
filters.
Another
highlight
ofthe
Kalman
filter is
thatit does
notneedto
storelarge
amounts ofdata,
sinceit is
a recursive algorithm.This
eliminatesthe
need
for
storing
large
data tables,
which arepresentin
many existing
correlationbased
prefetch algorithms today.
In
chapter4,
an algorithmis
presentedthat
treats
theKalman filter
as ablack
box
predictionmechanism,
and usesits
predictive capabilitiesChapter 4
Algorithm Design
The
algorithmpresentedin
this
work appliesthe
principles ofdigital
signalprocessing,
in
particulardigital
signalprocessing using
the
Kalman
filter,
to
analyze miss referencestreams and produce predictions
that
are used as prefetch addresses.The
algorithmencapsulates a given number of
filters,
each of whichhaving
its
owninternal Kalman
filter
parameters,
such asits
state model and covariance.Before
discussing
the
generalalgorithm
design,
theinternal
Kalman
filter design
chosenfor
this
workis
covered.The Kalman
filter
was chosen asthe
internal filter for
this
algorithmbecause
ofits
flexibility
andapplicability
in
other areas ofengineering
and science[7]. The Kalman
filter
algorithm can alsobe implemented in
apipelinedfashion
[26],
so onemustonly
wait
for
the
initial
latency
for
the
first
calculation,
andthen
a new calculation canbe
produced eachCPU
cycle afterward.In
fact,
this
prefetching
algorithm could use another type offilter
orpredictor asits internal
predictiondevice.
The
filter is
treated
as a
black
box
withrespectto the
general algorithmdesign,
and couldbe
changedto
adifferent
filter,
or could evenbe
ahybrid
ofKalman filters
and othertypes
offilters
working
in
parallel.For simplicity
ofthiswork,
all ofthe
filters
usedin
the
algorithm willbe Kalman
filters
that
use a simple particlemoving
with constantvelocity in
a singledimension
model.Note
that
this
modelis
notfinely
tunedfor
optimalperformance;
it simply
servesas afilter
that
is
ableto
predict constant stride patterns.In
orderto
comeup
withthe
exact
numbers,
intuition
and experimentation with various numbers was used.It
wasfound
that the
numbers selected produced afilter
that was ableto
lock
ontothe
desired
access patternquickly
andefficiently,
and alsohad
the
property
ofcorrecting
itself
quickly
in
the
event of adisruption in
the
pattern.The
specificKalman
filter
model used
is:
System Matrices
Observation Matrix
A
=1
1
0
1
H=[l
0]
System
Covariance
Matrix
Q
=l
o
0
1
Observation
Covariance
Matrix R
=[.001
J
This
model's system matrix resembles a particlewith constant velocity.The
two
stateparameters are the position and
the
velocity
ofthe
particle,
and with eachtime step it
is
predictedthat the
position state variable will changeby
an amount givenby
the
velocity
state variable.The
observable parameteris
the
position ofthe
particle,
whichin
thisalgorithmis
analogousto the
memory
address ofthe
data
point.The
velocity
ofThis filter
modelis designed
to
allowthe
internal filters
to
be
ableto
recognize andpredict stride patterns
in
the
cache accesses.Other
filter
parameters couldbe
substituted
for
these
in
orderto
makethe
filters be
ableto
recognize more complexmiss reference
patterns,
but
asmentioned
earlier,
the
algorithmdiscussed in
this
workwill use these
filter
parameters andfocus
onpredicting
the
stride-like patternsthat
appear
in
the
miss reference streams.A
miss reference characterization of variousbenchmark
programs showsthat
stride patterns and next-block patterns(the
nextcache miss occurs
in
the
next cacheblock,
or unitstride)
accountfor
a considerable portion ofthe
cache miss patternsin many benchmark
programs[27]. In
addition,
the
success of stride prefetchers
in
general[28]
contributesto the
decision
to
usethis
model
for
theinternal
Kalman
filter for
this
algorithm.The
algorithmlayout
consists ofthree
majorlogical
blocks.
The first is
the
controllogic
block,
whichis
responsiblefor watching
the
cache access patterns andde
multiplexing
these cache accessesinto data
pointinputs for
the
variousfilters. The
second
logical
block
is
the
filter
array,
which receivesinput data
pointsfrom
the
control
logic
block. The
third
logical
block is
the
prefetchlogic
block,
whichwatchesthe
predictions madeby
the
Kalman
filters,
anddetermines
whento
issue
prefetchesbased
on whether or notthe
a givenKalman
filter is accurately
predicting
its
virtualreference stream.
Cache
accessStream
Control
Logic
wFilter
1
Prefetch
Logic
Prefetches
toinsert into
cache miss stream W ^Filter
2
wFilter
3
w wFilter
n ^W i i.Filter
predictionfeedback
Figure 4.1:
General
algorithmlayout using
ninternal
filters
As
shownin figure
4.1,
there
are a variable number offilters in
the
filter
array,
whichintroduces
the
first
general algorithmparameter,
the number ofKalman filters
(NKF)
to
use.Experimentation
showsthat the
marginal return of each additionalfilter
usedin
the
algorithmdiminishes in
terms
of miss reference prevention(see
chapter
5).
All
ofthese
Kalman
filters
that
areinternal
to thealgorithm arein
one oftwo states at
any
giventime,
available or waiting.All
oftheinternal Kalman
filters
begin in
the availablepool,
and asthey
begin
to
listen
to
missreferencestreams,
they
enter
the
waiting
stage wherethey
will waitfor
miss referenceaddressesthat
are giventhem
using
their
Kalman
filter
algorithms,
andproducea prediction pointthat
may
ormay
notbe
used as aprefetch,
based
onthe
successofpastpredictions.The
algorithm analyzesthe
miss reference stream generatedby
the
missesfrom
the
data level-one
cache.The
algorithm thenin
turndetermines
which,
if
any,
internal
Kalman
filter
shouldreceivethis
input
point.The
algorithmmustkeep
track
ofthe
last
data
point thateachinternal
Kalman
filter has
processed,
along
withthe
predictionit
made
for
the nextdata
point.By dividing
the
aggregate miss reference streaminto
theseseparate streams
intended for
eachinternal Kalman