The Effects of the Architectural Design, Replacement Algorithm, and Size Parameters of Cache Memory in Uniprocessor Computer Systems


Rochester Institute of Technology

RIT Scholar Works

Theses

Thesis/Dissertation Collections

11-1-1998

Eric Berzofsky

THE EFFECTS OF THE ARCHITECTURAL DESIGN, REPLACEMENT ALGORITHM, AND SIZE PARAMETERS OF CACHE MEMORY IN UNIPROCESSOR COMPUTER SYSTEMS

by

Eric Benjamin Berzofsky

A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of

MASTER OF SCIENCE

in Computer Engineering

Approved by:

Principal Advisor: Roy S. Czernikowski, Professor and Department Head

Committee Member: Muhammad E. Shaaban, Assistant Professor

Committee Member: Kenneth W. Hsu, Professor

Department of Computer Engineering

College of Engineering

Rochester Institute of Technology

Rochester, New York

November 1998


RELEASE PERMISSION FORM

Rochester Institute of Technology

The Effects of the Architectural Design, Replacement Algorithm, and Size Parameters of Cache Memory in Uniprocessor Computer Systems

I, Eric Benjamin Berzofsky, hereby grant permission to any individual or organization to reproduce this thesis in whole or in part for non-commercial and non-profit purposes only.

Eric Benjamin Berzofsky


ABSTRACT

To investigate the effects of cache coherency on multiprocessors, it is helpful to first explore coherency issues within uniprocessors, working with a small part of a big problem instead of attacking the big problem from the start. This thesis will investigate the design and implementation of three different cache designs, varying the mapping strategy, replacement algorithm, and size parameters to determine the effects each has on the cache miss ratio, coherency, and average memory access time.

VHDL is used to create software models of each cache design investigated, so that parameter values can be easily changed, and so that no money or time is wasted by first prototyping the cache design in actual hardware. These VHDL implementations are presented, along with several test-bench programs that were used not only to validate the performance of the VHDL implementation, but also to explore the program-dependent performance factors and coherency.

Several snoopy cache coherence protocols are presented at the end of the thesis, in order to suggest future research into the VHDL


TABLE OF CONTENTS

LIST OF FIGURES V

LIST OF TABLES VI

LIST OF EQUATIONS VIII

GLOSSARY IX

1 THE MEMORY HIERARCHY 1

1.1 Inclusion, Coherence, and Locality 2

1.1.1 Inclusion 3

1.1.2 Coherence 3

1.1.3 Locality 3

2 INTRODUCTION TO CACHE MEMORY 5

2.1 Types of Cache Memories 5

2.1.1 Direct Mapped 6

2.1.2 Fully Associative 6

2.1.3 Set Associative 6

2.1.4 The Effect of Associativity on the Cache Performance 6

2.2 Finding a Block Within the Cache 7

2.3 Writing to the Cache 7

2.3.1 Write-Through 7

2.3.2 Write-Back 8

2.3.3 Write-Once 8

2.3.4 Comparing the Strategies 9

2.3.5 Writing to the Cache on a Cache Write Miss 9

2.3.5.1 Write-Allocate 9

2.3.5.2 No-Write-Allocate 10

2.4 Sources of Cache Misses 10

2.4.1 Compulsory Cache-Misses 10

2.4.2 Capacity Cache-Misses 10

2.4.3 Conflict Cache Misses 10

2.4.4 The Overall Effect of The Three C's on the Design of Cache Memory 11

2.4.5 Reducing the Effect of The Three C's 12

2.5 Replacing a Block on a Cache-Miss 14

2.5.1 Random Replacement 15

2.5.2 Least Recently Used (LRU) 15

2.5.3 First In First Out (FIFO) 15

2.5.4 Pseudo-Random Replacement 15

2.5.5 Comparing the Replacement Schemes 16

3 IMPLEMENTING UNIPROCESSOR CACHE MEMORIES IN VHDL 17

3.1 What is VHDL? 17

3.2 Why Use VHDL to Implement Cache Memory? 18

3.3 Description of Special Functions Used Throughout the VHDL Implementations 18

3.3.1 Log_Base_2 18

3.3.3 lv_to_integer 19

3.4 VHDL Implementation of a Memory Unit 20

3.4.1 Description of the Constants and Types Used Throughout the VHDL Implementation of a Memory Unit 21

3.4.2 Description of Signals Used Within the Memory Unit Implementation 22

3.4.3 Initialization Process 24

3.4.4 Memory Process 24

3.4.5 Modify Process 25

3.4.6 Modify Process 25

3.4.7 Modify Process 25

3.5 VHDL Implementation of the FIFO Replacement Algorithm 26

3.6 VHDL Implementation of the LRU Replacement Algorithm 26

3.7 VHDL Implementation of the Cache Architectures 27

3.7.1 Description of the Constants and Types Used Throughout the VHDL Implementation of a Cache 27

3.7.2 Description of the VHDL Implementation of a Write-Through Cache Using No-Write-Allocate on a Write-Miss 32

3.7.2.1 Description of the Signals Used Within the Write-Through Cache Implementations 32

3.7.2.2 Read_or_Write Process 34

3.7.2.3 Read_Write_Miss Process 35

3.7.2.4 Direct_Mapped_Add, Fully_Associative_Add, and Set_Associative_Add Processes 36

3.7.2.5 Dump Process 37

3.7.3 Description of the VHDL Implementation of a Write-Back Cache Using Write-Allocate on a Write-Miss 37

3.7.3.1 Description of the Signals Used Within the Write-Back Cache Implementations 38

3.7.3.2 Read_or_Write Process 39

3.7.3.4 Direct_Mapped_Add, Fully_Associative_Add, and Set_Associative_Add Processes 40

3.7.3.5 Dump Process 42

3.7.4 Description of the VHDL Implementation of a Write-Once Cache 42

3.7.4.1 Description of Signals Used Within the Implementation of a Write-Once Cache 43

3.7.4.2 Read_or_Write Process 44

3.7.4.4 Direct_Mapped_Add, Fully_Associative_Add, and Set_Associative_Add Processes 46

3.7.4.5 Dump Process 47

4 CACHE AND PROCESSOR PERFORMANCE PARAMETERS 49

4.1 CPU Time 49

4.2 Average Memory Access Time 51

4.3 Keeping Track of the Memory Accesses in the VHDL Implementations 51

4.4 Keeping Track of the Cache Performance 51

5 TESTING THE VHDL CACHE IMPLEMENTATIONS 53

5.1 The Read Test 54

5.2 The Write Test 54

5.2.1 Write-Back Cache Implementations 55

5.2.2 Write-Once Cache Implementations 56

5.2.3 Write-Through Cache Implementations 56

5.3 The Read-Write-Write-Read (RWWR) Test 56

5.3.1 Write-Back Cache Implementations 57

5.3.2 Write-Once Cache Implementations 59

5.3.3 Write-Through Cache Implementations 60

5.4 The Sum Test 61

5.4.1 Write-Back Direct Cache Implementation and Fully-Associative Cache Implementation with

5.4.2 Write-Back Fully-Associative Cache Implementation with LRU Replacement Algorithm 64

5.4.3 Write-Back Set-Associative Cache Implementation 65

5.4.4 Write-Once Direct Cache Implementation and Fully-Associative Cache Implementation with FIFO Replacement Algorithm 66

5.4.5 Write-Once Fully-Associative Cache Implementation with LRU Replacement Algorithm 67

5.4.6 Write-Once Set-Associative Cache Implementation 68

5.4.7 Write-Through Direct Cache Implementation and Fully-Associative Cache Implementation with FIFO Replacement Algorithm 70

5.4.8 Write-Through Fully-Associative Cache Implementation with LRU Replacement Algorithm 71

5.4.9 Write-Through Set-Associative Cache Implementation 72

5.5 Summary of the Coherency Between the Cache and Main Memory for All of the Test-Benches 74

6 IMPROVING CACHE MEMORY PERFORMANCE 75

6.1 Reducing the Miss Rate by Increasing the Associativity of the Cache 76

6.2 Reducing the Miss Rate by Using Victim Caches 82

6.3 Reducing the Miss Rate by Using Hardware Prefetching of Data 82

6.4 Reducing the Miss Rate by Using Compiler Optimizations 83

6.5 Reducing the Miss Penalty by Giving Priority to Read Misses Over Writes 85

6.6 Reducing the Miss Penalty by Using Sub-block Placement 86

6.7 Reducing the Miss Penalty by Using Early Restart and Critical Word First Methods 88

6.8 Reducing the Miss Penalty by Using Second Level Caches 88

6.9 Reducing the Hit Time by Using a Small and Simple Cache Design 92

6.10 Cache Optimization Summary 92

7 INTRODUCTION TO SHARED-MEMORY MULTIPROCESSORS 95

7.1 The Cache Coherence Problem 95

7.2 The General Categories of Solutions to the Cache Coherence Problem 96

7.2.1 Disallowing Private Caches 97

7.2.2 Allowing Private Caches 97

7.2.3 Non-Cacheable Shared Writeable Data 97

7.2.4 Allowing Shared Writeable Data 98

7.2.5 Bus-Oriented Multiprocessors 98

7.2.6 Examples of Cache Coherence Solutions in Existing Multiprocessors 99

8 SNOOPY CACHE COHERENCE PROTOCOLS FOR MULTIPROCESSORS 100

8.1 Write Through Protocol 100

8.2 Write Back Protocol 100

8.3 Comparing Write Back to Write Through 101

8.4 Write Once Protocol 101

8.5 Comparing Write Once to Write Back and Write Through 103

8.6 Papa's Protocol 104

8.7 Read Broadcast Protocol 105

8.8 Read Write Broadcast Protocol 108

8.9 Berkeley Ownership Protocol 111

8.10 Comparison of the Berkeley Ownership Protocol to the Write Once Protocol 112

8.11 Firefly Protocol 113

8.12 Advantages and Disadvantages of the Firefly Protocol 114

9 FUTURE WORK 116

9.1 Implementing Multiprocessor Cache Coherence Solutions in VHDL 116

9.2 Using the Existing VHDL Code as a Teaching Aide 118

LIST OF FIGURES

Figure 1. Design of a Five-Level Memory Hierarchy [Hwang 1993, p. 189] 1

Figure 2. The Inclusion Property and Data Transfers Between Adjacent Levels of a Memory Hierarchy [Hwang 1993, p. 191] 4

Figure 3. Number of Memory Accesses vs. Degree of Associativity for the Write-Back Cache 78

Figure 4. Number of Memory Accesses vs. Degree of Associativity for the Write-Once Cache 78

Figure 5. Number of Memory Accesses vs. Degree of Associativity for the Write-Through Cache 79

Figure 6. Miss Rates vs. Degree of Associativity for All of the Cache Implementations 80

Figure 7. Average Memory Access Time Versus Degree of Associativity 81

Figure 8. Placement of the Victim Cache in the Memory Hierarchy [Hennessy 1996, p. 398] 82

Figure 9. The Sub-block Placement Strategy [Hennessy 1996, p. 413] 87

Figure 10. Write Once Protocol [Hwang 1993, p. 354] 103

Figure 11. Papa's Protocol 105

Figure 12. RB (Read Broadcast) Protocol [Rudolph 1984] 108

Figure 13. RWB (Read and Write Broadcast) Protocol [Rudolph 1984] 110

LIST OF TABLES

Table 1. Memory Characteristics of a Typical Mainframe [Hwang 1993, p. 190] 2

Table 2. The Increase of Access Time and Decrease in Bandwidth as One Moves Away from the CPU [Hennessy 1996, p. 41] 5

Table 3. The Total Miss Rate for Each Cache Size and Percentage of Each According to the Three C's [Hennessy 1996, p. 391] 11

Table 4. Design Target Miss Ratios for a Unified Cache [Smith 1987] 13

Table 5. Actual Miss Rate Versus Block Size for Five Different-Sized Caches [Hennessy 1996, p. 394] 13

Table 6. Miss Rates Comparing LRU to Random Replacement for Several Cache Sizes and Associativities [Hennessy 1996, p. 379] 16

Table 7. Description of the Performance of the lv_to_integer Function 20

Table 8. Description of the Constants and Types Used in the VHDL Implementation of a Memory Unit 21

Table 9. Description of the Signals Used in the VHDL Implementation of a Memory Unit 23

Table 10. Description of the Constants and Types Used in the VHDL Implementation of a Cache 29

Table 11. Data Analysis of the Read Test 54

Table 12. Data Analysis of the Write Test 55

Table 13. Data Analysis of the RWWR Test 57

Table 14. Description of the Cache and Main Memory Performance for the RWWR Test on the Write-Back Cache Implementations 58

Table 15. Description of the Cache and Main Memory Performance for the RWWR Test on the Write-Once Cache Implementations 59

Table 16. Description of the Cache and Main Memory Performance for the RWWR Test on the Write-Through Cache Implementations 61

Table 17. Data Analysis of the Sum Test 62

Table 18. Description of the Cache and Main Memory Performance for the Sum Test on the Write-Back Direct-Mapped and Fully-Associative Mapped with FIFO Replacement Algorithm Cache Implementation 63

Table 19. Description of the Cache and Main Memory Performance for the Sum Test on the Write-Back Fully-Associative Mapped with LRU Replacement Algorithm Cache Implementation 64

Table 20. Description of the Cache and Main Memory Performance for the Sum Test on the Write-Back Set-Associative Mapped Cache Implementation 65

Table 21. Description of the Cache and Main Memory Performance for the Sum Test on the Write-Once Direct-Mapped Cache Implementation and Fully-Associative Mapped with FIFO Replacement Algorithm Cache Implementation 66

Table 22. Description of the Cache and Main Memory Performance for the Sum Test on the Write-Once Fully-Associative Mapped with LRU Replacement Algorithm Cache Implementation 68

Table 23. Description of the Cache and Main Memory Performance for the Sum Test on the Write-Once Set-Associative Mapped Cache Implementation 69

Table 24. Description of the Cache and Main Memory Performance for the Sum Test on the Write-Through Direct-Mapped Cache Implementation and Fully-Associative Mapped with FIFO Replacement Algorithm Cache Implementation 70

Table 25. Description of the Cache and Main Memory Performance for the Sum Test on the Write-Through Fully-Associative Mapped with LRU Replacement Algorithm Cache

Table 26. Description of the Cache and Main Memory Performance for the Sum Test on the Write-Through Set-Associative Mapped Cache Implementation 73

Table 27. The Coherency Status of Each Test-Bench on Each Cache Architecture 74

Table 28. Average Memory Access Time for Different Cache Sizes and Associativities [Hennessy 1996, p. 397] 76

Table 29. Cache and Main Memory Performance of the Sum Test for Different Degrees of Associativity 77

Table 30. Miss Rates for the Varying Degrees of Associativity 79

Table 31. Average Memory Access Time for Varying Degrees of Associativity 81

Table 32. Typical Values and Parameters of a Second-Level Cache [Hennessy 1996, p. 471] 91

Table 33. Summary of Cache Optimizations and Impact on the Three Aspects of Cache

LIST OF EQUATIONS

Equation 1. The number of bits needed for addressing main memory 22

Equation 2. Address Bits Needed in a Direct-Mapped Cache 30

Equation 3. Address Bits Needed in the Set-Associative Mapped Cache 31

Equation 4. Converting the Address Portion of a Cache Block in a Direct-Mapped Cache to Compare to the Requested Address 34

Equation 5. Converting the Address Portion of a Block in a Set-Associative Cache to Compare to the Requested Address 35

Equation 6. CPU Time [Hennessy 1996, p. 385] 49

Equation 7. Memory Stall Cycles per Read-Miss and Write-Miss Separately [Hennessy 1996, p. 386] 49

Equation 8. Memory Stall Cycles per Memory Access [Hennessy 1996, p. 386] 50

Equation 9. CPU Time in Terms of Memory Accesses per Instruction [Hennessy 1996, p. 386] 50

Equation 10. Miss Rate in Terms of Misses per Instruction [Hennessy 1996, p. 386] 50

Equation 11. CPU Time Using the Miss Rate in Terms of Misses per Instruction [Hennessy 1996, p. 386] 50

Equation 12. Average Memory Access Time [Hennessy 1996, p. 384] 51

Equation 13. Average Memory Access Time Broken Down into Memory Accesses Due to Instruction and Memory Accesses Due to Data [Hennessy 1996, p. 385] 51

Equation 14. Average Memory Access Time for Two-Level Cache [Hennessy 1996, p. 417] 89

Equation 15. Miss Penalty of the First Level of Cache [Hennessy 1996, p. 417] 89

Equation 16. Expanded Average Memory Access Time Formula for Two-Level Cache

GLOSSARY

2 to 1 cache rule of thumb

A main rule of cache memory which states that a direct-mapped cache of size N has about the same miss rate as a 2-way set-associative cache of size N/2 12

access time

The total time it takes to access the CPU from the ith level of the memory hierarchy 1

address_in_cache

An input signal indicating the address referenced by the processor 33

address_in_mem

An input signal controlled by the processor that denotes which address has been referenced by the processor 22

address_needed_in_cache

An output signal that is sent to main memory which contains the address that was referenced by the processor, but which does not exist in the cache 33

address_to_evict

An output signal sent to main memory which contains the address of a dirty block that was evicted from the cache 38

address_to_evict_from_cache

An input signal that receives the address that was evicted from the cache 22

average memory access time

A performance parameter that is a measure of the hit time, miss rate, and miss penalty associated with the cache 51

B

bandwidth

The rate at which information is transferred between two levels of the memory hierarchy 1

Berkeley Ownership cache coherence protocol

A cache coherence protocol which uses an ownership-based multiprocessor cache consistency protocol. 111

BLOCKS_PER_SET

A VHDL constant storing the number of blocks per set, or the degree of associativity 28

byte

A unit of measure that measures 8 bits 2

cache block

The unit of transfer between the cache and main memory 3

cache coherence problem

A problem that occurs in a shared-memory multiprocessor when two or more processors each have a local copy of a data value, and one of the processors changes the data value within its local cache, as part of an instruction that is executed in its program 95

cache flush

Invalidating the contents of the cache. This usually occurs after a critical section has been left during the execution of a program on a processor 98

cache interrogate signal

A signal used to obtain exclusive ownership of a shared block so that modifications of this block can be performed without jeopardizing the coherency of this block with respect to other processors that may share its value 98

A small, fast memory unit that is usually placed between the CPU and main memory and which stores the most recently used data values of the program executing on the CPU 5

CACHE_BLOCK_DATA_BITS

A VHDL constant storing the number of bits used for data 28

CACHE_BLOCK_STATUS_BITS

A VHDL constant storing the number of status bits to use with the cache architecture 28

cache_dump

by the processor that requests that the contents of the cache, and the FIFO queue or LRU array where applicable, be written, or "dumped," to a file 33

CACHE_SIZE

A VHDL constant storing the cache size in units of kilobytes 28

cache-miss

When a referenced block is not present in the cache 10

capacity cache-miss

A type of cache miss that occurs when the cache is not capable of storing all of the data needed within a cache block during the execution of the program on the CPU 10

centralized global table

A table that is used to store the status of each block in the cache that is shared by more than one cache. 98

coherence property

A fundamental property of the memory hierarchy which states that copies of the same information item at higher levels of the memory hierarchy must be consistent with those of the lower levels 3

coherent

A scenario in which every read by any processor always returns the value produced by the last previous write, no matter which processor performed the write 95

cold start miss See compulsory cache miss.

collision misses See conflict cache miss

compulsory cache-miss

A type of cache miss that occurs when the very first access to a needed block results in a miss 10

conflict cache-miss

A type of cache miss that occurs when a block is discarded in order to make room for another block that maps to the same location or same set. This type of cache miss can only occur within a set-associative or direct-mapped cache 10

copy back See write back cache

cost per byte

The cost per byte of the ith level of the memory hierarchy. This quantity is usually estimated as the product of the cost and the size of the memory level 1

CPI

Cycles Per Instruction. A measure of the number of CPU cycles required to execute the instruction. 50

CPU

Central Processing Unit. The part of the computer that performs the execution of a program on the computer. Also, the highest level of the memory hierarchy 1

CPU time

A performance parameter that is a measure of the total number of clock cycles that the CPU spends executing its program, and the time that the CPU spends waiting for a memory access to return with the necessary data 49

critical word first

A cache improvement technique in which the required word is read into the cache first, sent to the CPU, and then the rest of the block is read into the cache 88

D

data_in_cache

An input signal indicating the data associated with the address specified by the address_in_cache signal.

data_in_from_mem

An input signal that receives the data contents from main memory of the corresponding block in main memory that is indicated by the address_in_cache signal 33

data_in_mem

An input signal controlled by the processor that denotes the data to write to the cache block 22

data_out_from_mem

An output signal that returns to the cache the data corresponding to the address supplied to memory in the address_in_mem signal 23

data_read_out_cache

An output signal that returns the requested data contents to the processor 33

data_to_evict

An output signal sent to main memory that contains the data corresponding to the evicted block whose address is address_to_evict 43

data_to_evict_from_cache

An input signal to main memory that receives the data corresponding to the evicted cache block whose address is address_to_evict_from_cache 22

data_to_write_from_cache

An output signal that is sent to main memory which contains the data to write to a block in main memory when an update of a block must be performed 33

degree of associativity See n-way set associative

design target miss ratio

A cache design parameter that is used to achieve a rough estimate of the expected miss ratio as a function of the cache size 12

direct mapped cache

A type of cache mapping strategy in which there is only one place that a referenced block could reside. 6

DIRECT_CACHE_BLOCK_ADDRESS_BITS

A VHDL constant storing the number of address bits to use in a direct-mapped cache 28

DIRECT_CACHE_BLOCK_SIZE

A VHDL constant storing the size of a block in a direct-mapped cache 28

direct_cache_block

A VHDL sub-type used to describe a cache block in a direct-mapped cache 28

direct_cache_unit

A VHDL type used to describe a direct-mapped cache 29

direct_mapped_add process

A VHDL process used within the cache implementations of the thesis to handle the adding of a block to a direct-mapped cache 36

dirty

A state in the write-once cache in which the block has been locally modified more than once, and hence its data value is inconsistent with that in main memory 9

dirty bit

A bit that is used to determine the status of a block in the cache. If this bit is a one, the block is clean. If it is a zero, the block is dirty 8

dirty block

A block in a cache that has been written to and whose modification has not yet been reported to main memory 8

DRAM

Dynamic Random Access Memory. A design technology using dynamic memory cells in the design of a memory unit. In a dynamic memory cell, the contents must be occasionally refreshed so that they are not lost. Usually used with the design of a main memory unit 2

dump process

early restart

A cache improvement technique in which the requested word is sent to the CPU as soon as it arrives into the cache so that the CPU may continue with its execution as soon as possible 88

ECL

Emitter Coupled Logic. A device technology used in the design of CPUs and other computer hardware 2

exclusive-modified state

A state used in Papa's protocol that indicates that no other cache has this block and that the data in the block is inconsistent with that in main memory since the data has been locally modified 104

exclusive-unmodified state

A state used in Papa's protocol that indicates that no other cache has this block, and that the data in the block is consistent with that in main memory 104

fetch on write See write-allocate

FIFO replacement strategy

First In First Out. A replacement method in which the first block that was placed into the cache is the first to be evicted from it 15

Firefly

A cache coherence protocol which allows multiple caches to contain a writeable cache block simultaneously, with no pre-arrangement required for a processor to write to a shared location 113

first reference miss See compulsory cache miss.

first-write

A state used in the RWB protocol that indicates the first write to a block in the cache 109

fully associative cache

A cache mapping strategy in which a referenced block could reside anywhere within the cache 6

fully_associative_add

A VHDL process used within the cache implementations of the thesis to handle the adding of a block to a fully-associative mapped cache 36

FULLY_CACHE_BLOCK_SIZE

A VHDL constant storing the size of a block in a fully-associative mapped cache 28

fully_cache_block

A VHDL sub-type used to describe a cache block in a fully-associative mapped cache 28

FULLY_CACHE_BLOCK_ADDRESS_BITS

A VHDL constant storing the number of address bits to use in a fully-associative mapped cache 28

fully_cache_unit

A VHDL type used to describe a fully-associative mapped cache 29

fusing

The joining of two loops that access the same array with the same loops, but perform different computations on the common data in order to reduce the number of memory accesses required to perform the calculations specified in the loops 84

Gbyte or GB

Gigabyte. A unit of measure that measures 2^30, or 1073741824 bytes 2

global miss rate

The number of misses in the cache divided by the total number of memory accesses generated by the CPU 89

I

Institute of Electrical and Electronics Engineers. The governing body that determined the standards for VHDL 17

inclusion property

A fundamental property of the memory hierarchy which states that all information contained in level i is also present in level i+1 3

init

by the processor that initializes the contents of main memory 23

initialization process

A VHDL process used within the thesis to initialize the contents of the main memory 24

integer_to_lv function

A function used within the thesis that takes an integer and writes it in binary 19

invalid state

A state in the write-once cache in which the block contains no data 8

K

Kbytes or KB

Kilobytes. A unit of measure measuring 2^10, or 1024 bytes 2

local configuration

A configuration used in the RB protocol that indicates that a variable X that is local to processing element i will be in the local state in cache i and in the invalid state in any other cache containing variable X. This configuration allows cache i to have exclusive ownership of X, in order to be able to modify the contents of it 106

local miss rate

The number of misses in the cache divided by the total number of memory accesses to the cache 89

local state

A state used with the RB protocol that indicates that the data in the cache block can be read or written to locally, causing no bus activity 106

log_base_2 function

A function used within the thesis that receives an integer and determines the number of bits needed to express it in binary 18

LRU replacement algorithm

Least Recently Used. A replacement method in which the block that has not been used in the longest amount of time is evicted from the cache 15

lv_to_integer

A function used within the thesis that takes a standard logic vector and converts it to its decimal integer equivalent 19

M

Mbytes or MB

Megabytes. A unit of measure that measures 2^20, or 1048576 bytes 2

memory size

The number of bytes in level i of the memory hierarchy 1

memory stall cycles

A performance parameter that is a measure of the read misses per program and the write misses per program, and the penalty associated with each 49

memory access process

A VHDL process used within the thesis to handle basic queries from the cache about different locations in main memory 24

memory_block

A VHDL constant used to store the number of bits in the memory block that are used for addressing 21

MEMORY_BLOCK_DATA_BITS

A VHDL constant used to store the number of bits in the memory block that are used for data 21

MEMORY_BLOCK_SIZE

A VHDL constant used to store the size of a block in main memory, in units of bits 21

MEMORY_SIZE

A VHDL constant used to store the value of the size of main memory in units of blocks 21

MEMORY_TYPE

A VHDL package used within the thesis which stores the constants used to implement a memory unit. 21

memory_unit

A VHDL type used to describe a memory unit 22

miss-ratio

A ratio of the number of cache misses that occur in the program to the number of memory references in the program 6

modify_evict process

A VHDL process used within the thesis to handle updates of memory blocks when a block has been evicted from the cache 25

modify_write_hit process

A VHDL process used within the thesis to handle updates of memory blocks when a block is modified for the first time in a write-once cache 25

modify_write_miss process

A VHDL process used within the thesis to handle the update of memory blocks when a write-miss occurs in the write-once cache for the first write of the referenced block 25

ms

Millisecond. A unit of measure that measures 10^-3 seconds 2

multi-level inclusion property of second-level caches

The inclusion principle of the memory hierarchy that requires the second-level cache to contain all of the data that appears in the first-level cache 91

N

no-write-allocate

A cache design option in which, on a write-miss, the referenced block is modified directly in main memory without the block being brought into the cache first 10

ns

Nanosecond. A unit of measure that measures 10^-9 seconds 2

NUMBER_OF_SETS

A VHDL constant storing the number of sets in the cache 28

n-way set associative

A notation used to denote the number of blocks within a set. N represents the number of blocks in each set of the cache 6

O

Owned Exclusively state

A state used in the Berkeley Ownership protocol that indicates that the owning cache holds the only cached copy of the block. Updates can occur locally without first informing the other caches 111

Owned NonExclusively state

A state used in the Berkeley Ownership protocol that indicates that other caches have a copy of the cache block and must be informed as to any changes made locally in a cache that contains the block 111

ownership-based multiprocessor cache consistency protocol

A cache coherency protocol in which a processor must own a block of memory prior to being allowed to

package

A term used in VHDL to describe a file that contains frequently used functions and parameters. Instead of typing the same code in every file, the code is typed once, the function is given a name, and the package is instantiated in all files that will use the function 18

pages

The unit of transfer between external disks and main memory 3

Papa's cache coherence protocol

A cache coherence protocol whose goal is to reduce bus traffic and thus decrease the time that a processor must wait prior to accessing the bus 104

PE

Processing Element 106

pseudo-random replacement algorithm

A replacement algorithm used within the set-associative cache implementations of this thesis. A block is randomly chosen; however, some guidelines are followed 16

Q

queue

A VHDL type used to describe the LRU array or FIFO queue 29

queue_entry

A VHDL sub-type used to describe an entry in either the LRU array or FIFO queue 28

R

random replacement algorithm

A replacement method in which a block in the cache is randomly chosen to be evicted when the cache is full 15

Read Broadcast (RB)

A cache coherence protocol which is based on the write once protocol, but uses the bus broadcast capabilities more efficiently for both data and event broadcasting 105

read test

A test-bench program designed to verify the read operation of the cache implementations 54

Read Write Broadcast (RWB)

A cache coherence protocol, similar to the RB protocol, in which all of the caches read the data on the bus on both bus reads and bus writes 108

read_miss_from_cache

An input signal to main memory that receives the fact that a read-miss occurred within the cache 23

read_miss_out

An output signal that is sent to main memory which indicates that a read-miss occurred in the cache 33

read_not_write

An input signal indicating whether a read or a write occurred in the program executed by the processor. 33

read_or_write process

A VHDL process used within the cache implementations of the thesis to perform either a read or a write of a block in the cache 34

read_write_miss process

A VHDL process used within the cache implementations of the thesis to report read-misses, write-misses, and updates to main memory 35

readable state

A state used with the RB protocol that indicates that the contents of the cache block are valid and consistent with memory 106

Read-For-Ownership operation

A bus operation used in the Berkeley Ownership protocol that is similar to a normal read, except that the requesting cache becomes the exclusive owner after the read completes, and all other caches

(20)

Read-Shared operation

A bus operation used in the Berkeley Ownership protocol that is a conventional read that gives the cache an UnOwned copy of the block

read-write-write-read (RWWR) test

A test-bench program designed to verify the operation of the cache implementations when reads are followed by writes to different locations, which are in turn followed by writes to the blocks that were first read, which are finally followed by reads of the blocks that were first written to

requested word first

See critical word first

reserved state

A state in the write-once cache in which the block has been locally modified exactly once and the results of this modification have been reported to main memory

S

segments

The unit of storage of information in the back-up storage device

sequential locality

A fundamental property of the memory hierarchy which states that the execution of a program tends to follow a certain sequential order

set associative cache

A cache mapping strategy in which a referenced block is first mapped to a set and then can reside anywhere within the bounds of that set

set_associative_add

A VHDL process used within the cache implementations of the thesis to handle the adding of a block to a fully-associative mapped cache

SET_CACHE_BLOCK_ADDRESS_BITS

A VHDL constant storing the number of address bits to use in a set-associative mapped cache

SET_CACHE_BLOCK_SIZE

A VHDL constant storing the size of a block in a set-associative mapped cache

setcacheblock

A VHDL sub-type used to describe a cache block in a set-associative mapped cache

set_cache_unit

A VHDL type used to describe a set-associative mapped cache

shared configuration

A configuration used in the RB protocol that implies that the shared, read-only variable Y is in the readable state in all caches that contain it. This allows any cache to read the data value associated with Y and be ensured that it is receiving the most up-to-date value

shared-memory multiprocessors

A computer system that consists of at least two independent processor modules executing either a small task of a larger program, or completely independent programs. All of the processors make references to instructions and data that reside in a main memory module that is shared among the processors

shared-unmodified state

A state used in Papa's protocol that indicates that some other cache(s) may have this block and that the data in the block is consistent with that in main memory

snoopy cache coherence protocol

A cache coherence protocol that requires that the responsibility of maintaining cache coherence is distributed among the local caches

snoopy cache controller

A bus-watching mechanism used in shared-memory multiprocessors that watches the communication bus for all actions affecting shared data

spatial locality

A fundamental property of the memory hierarchy which states that a process tends to access items whose


SRAM

Static Random Access Memory. A design technology using static memory cells in the design of a memory unit. In a static memory cell, the contents are retained until power is eliminated, at which point the contents stored in the memory cell are lost. Usually used with the design of a cache memory unit

standardlogicvector

A VHDL term used to describe an array of bits in the form of a binary number. Each bit in the array can be individually accessed

store back

See write-back cache

store through

See write-through cache

sub-block placement

A cache improvement strategy in which a cache block is divided into several smaller blocks, called sub-blocks, and a valid bit is associated with each of these sub-blocks

subtype

A VHDL term used to describe a user-defined variable type that will help describe another user-defined type

sumtest

A test-bench program designed as a sample application that could be run on the processor. This program finds the cumulative sum of all the data values in main memory

Synopsys

A synthesizer that accepts VHDL code and returns a gate-level schematic performing the functions of the VHDL code

T

Tbytes or TB

Terabytes. A unit of measure that measures 2^40, or 1,099,511,627,776 bytes

temporal locality

A fundamental property of the memory hierarchy which states that recently referenced items tend to be referenced again in the near future

thrashing

When the upper level of memory is much smaller than what is needed for a program, causing the CPU to run close to the lower-level memory speed, since that is where most of the data referenced resides

transfer bandwidth

The rate at which information is transferred between levels i and i+1 of the memory hierarchy

type

A VHDL term used to describe user-defined variables that help the programmer implement the goals of his/her program

U

unit of transfer

The grain size for a data transfer from level i to i+1

UnOwned state

A state used in the Berkeley Ownership protocol that indicates that several caches may have copies of this block. The block contains valid data that is possibly shared among other caches

valid state

A state in the write-once cache in which the cache block contains data which has been read from main memory, but has not yet been modified

VHDL

VHSIC Hardware Description Language. An industry standard language used to describe hardware from the abstract to the concrete level


VHSIC

Very High Speed Integrated Circuit program. The predecessor to VHDL

victim cache

A small fully-associative cache placed between the main cache and main memory that contains only blocks that are evicted from the cache, in order to give them a "second chance" to remain in the cache prior to being reported in main memory

W

wrapped fetch

See critical word first

write around

See no-write-allocate

write back cache coherence protocol

A cache coherence protocol in which the contents of the cache block are written to main memory only when the block is requested by another CPU and its contents have been modified

write buffer

A device added to the design of the cache which stores blocks that need to be written to main memory so that the CPU can resume the execution of its program

write once cache coherence protocol

A cache coherence protocol that requires the first write to any cache entry to be written through to main memory, using the write through protocol, and any subsequent write to be reported to memory after the block is evicted from the cache, using the write back protocol

Write operation

A bus operation used in the Berkeley Ownership protocol that is a conventional write that causes main memory to be updated and all cached copies to be invalidated

write stall

When the CPU is halted due to the cache having to continuously write to main memory

writetest

A test-bench program designed to verify the write operation of the cache implementations

write through cache coherence protocol

A cache coherence protocol in which all cache updates are reported to main memory

WRITE_BACK_CACHE_TYPE

A VHDL package used within the thesis to store the parameters used within the implementations of a write-back cache

writehitaddressmodify

An output signal sent to main memory that contains the address of a block that is being modified for the first time as the result of a write-hit in a write-once cache

writehitaddressmodifyfromcache

An input signal to main memory that is used with the write-once cache implementations. This signal receives the address of the block that has been modified if the block was first read into the cache, and later modified, indicating that the first write to this block occurred as the result of a cache write-hit

writehitdatamodify

An output signal sent to main memory that contains the data associated with the block addressed by the writehitaddressmodify signal

writehitdatamodifyfromcache

An input signal that receives the data corresponding to the address indicated by writehitaddressmodifyfromcache

writemissaddressmodify

An output signal sent to main memory that contains the address of a block that is being modified for the first time as a result of a write-miss in a write-once cache

write_miss_address_modify_from_cache

An input signal that is used with the write-once cache implementations. This signal receives the address of a block referenced by a write; however, the block does not yet exist within the cache, indicating that a write-miss has occurred within the cache


An output signal sent to main memory that contains the data associated with the block addressed by the write_miss_address_modify signal

write_miss_data_modify_from_cache

An input signal that receives the data corresponding to the address indicated by write_miss_address_modify_from_cache

writemissfromcache

An input signal to main memory that receives the fact that a write-miss occurred within the cache

writemissout

An output signal that is sent to main memory which indicates that a write-miss occurred in the cache

WRITE_ONCE_CACHE_TYPE

A VHDL package used within the thesis to store the parameters used within the implementations of a write-once cache

WRITE_THROUGH_CACHE_TYPE

A VHDL package used within the thesis to store the parameters used within the implementations of a write-through cache

write-allocate

A cache design option in which, on a write-miss, the needed block is brought into the cache and modified without main memory being informed of the modification

write-back cache

A cache design in which information on a write is written only to the cache

Write-For-Invalidation operation

A bus operation used in the Berkeley Ownership protocol that is a quick version of the conventional write, but does not report the modification of the data value to main memory

write-miss

An event that occurs in the cache when the required data block needs to be modified, but is not resident in the cache

write-once cache

A cache design in which the first write to a block in the cache is reported to main memory, but all subsequent writes to this block are not reported to main memory until the block is evicted from the cache

write-through cache

A cache design in which information on a write is written to both the cache and to main memory

Write-Without-Invalidation operation

A bus operation in the Berkeley Ownership protocol that causes main memory to be updated with the


1 The Memory Hierarchy

When discussing the interactions between a cache and main memory, it is important to understand the concept of the memory hierarchy and the properties that justify the different levels into which the hierarchy is divided. Once the memory hierarchy is explained and its rules are presented, one can better understand not only the communications between the cache and main memory, but also those of the entire memory structure.

The memory hierarchy basically consists of five levels: the registers in the Central Processing Unit (CPU), the cache, the main memory, a disk storage device, and backup units, such as magnetic tapes. These levels are shown in Figure 1.

[Figure: a pyramid of five levels. Level 4: tape units (magnetic tapes, optical disks). Level 3: disk storage (solid state, magnetic). Level 2: main memory (DRAMs). Level 1: cache (SRAMs). Level 0: registers in CPU. Capacity and access time increase toward Level 4; cost per bit increases toward Level 0.]

Figure 1. Design of a Five-Level Memory Hierarchy [Hwang1993, p. 189]

There are also five parameters that help characterize the levels: the access time ($t_i$), memory size ($s_i$), cost per byte ($c_i$), the transfer bandwidth ($b_i$), and the unit of transfer ($x_i$). The access time is the total time it takes for an access between the CPU and the ith level of the memory hierarchy. The memory size refers to the number of bytes in level i of the memory hierarchy. The cost of the ith level is usually estimated as the product of the cost per byte and the size, or $c_i s_i$. The bandwidth is the rate at which information is transferred between levels i and i+1. Finally, the unit of transfer refers to the grain size for a data transfer from level i to i+1. As a general rule, the memory devices at a lower level are faster to access, are smaller in size, are more expensive per byte, have a higher bandwidth, and use a smaller unit of transfer as compared with those at a higher level of the memory hierarchy pyramid [Hwang1993, p. 188]. In symbolic terms, we have the following:

$t_{i-1} < t_i$: access time increases as one moves up the memory hierarchy pyramid

$s_{i-1} < s_i$: size increases as one moves up the memory hierarchy pyramid

$c_{i-1} > c_i$: cost per bit decreases as one moves up the memory hierarchy pyramid

$b_{i-1} > b_i$: bandwidth decreases as one moves up the memory hierarchy pyramid

$x_{i-1} < x_i$: unit of transfer increases as one moves up the memory hierarchy pyramid

These relationships are illustrated by the values depicted in Table 1.

| Characteristics | Level 0: CPU Registers | Level 1: Cache | Level 2: Main Memory | Level 3: Disk Storage | Level 4: Tape Storage |
|---|---|---|---|---|---|
| Device technology | ECL | 256K-bit SRAM | 4M-bit DRAM | 1-Gbyte magnetic disk unit | 5-Gbyte magnetic tape unit |
| Access time, $t_i$ | 10 ns | 25-40 ns | 60-100 ns | 12-20 ms | 2-20 min (search time) |
| Capacity, $s_i$ (in bytes) | 512 bytes | 128 Kbytes | 512 Mbytes | 60-228 Gbytes | 512 Gbytes-2 Tbytes |
| Cost, $c_i$ (in cents/KB) | 18,000 | 72 | 5.6 | 0.23 | 0.01 |
| Bandwidth, $b_i$ (in MB/s) | 400-800 | 250-400 | 80-133 | 3-5 | 0.18-0.23 |
| Unit of transfer, $x_i$ | 4-8 bytes per word | 32 bytes per block | 0.5-1 Kbytes per page | 5-512 Kbytes per file | Backup storage |
| Allocation management | Compiler assignment | Hardware control | Operating system | Operating system/user | Operating system/user |

Table 1. Memory Characteristics of a Typical Mainframe [Hwang1993, p. 190]
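The five relationships listed before Table 1 can be checked against the table's own numbers. The following is only a sketch: it takes the lower bound of each range in Table 1, reads the disk cost entry as 0.23 cents/KB, and uses variable and function names that are illustrative rather than taken from the thesis.

```python
# Lower bound of each range in Table 1, ordered from Level 0 to Level 4.
# All names here are illustrative; they do not come from the thesis.
access_time_s = [10e-9, 25e-9, 60e-9, 12e-3, 2 * 60.0]                     # t_i
capacity_bytes = [512, 128 * 2**10, 512 * 2**20, 60 * 2**30, 512 * 2**30]  # s_i
cost_cents_per_kb = [18000, 72, 5.6, 0.23, 0.01]                           # c_i
bandwidth_mb_per_s = [400, 250, 80, 3, 0.18]                               # b_i
# Level 4 has no numeric unit of transfer in Table 1, so x_i covers Levels 0-3.
unit_of_transfer_bytes = [4, 32, 512, 5 * 2**10]                           # x_i


def increases_with_level(values):
    """True if every level's value is smaller than the next higher level's."""
    return all(lower < higher for lower, higher in zip(values, values[1:]))


def decreases_with_level(values):
    """True if every level's value is larger than the next higher level's."""
    return all(lower > higher for lower, higher in zip(values, values[1:]))


relations_hold = {
    "t": increases_with_level(access_time_s),
    "s": increases_with_level(capacity_bytes),
    "c": decreases_with_level(cost_cents_per_kb),
    "b": decreases_with_level(bandwidth_mb_per_s),
    "x": increases_with_level(unit_of_transfer_bytes),
}
print(relations_hold)
```

Under these assumptions every entry of `relations_hold` is true, which matches the direction of each inequality.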

1.1 Inclusion, Coherence, and Locality

When discussing issues within the memory hierarchy, it is important to understand the three fundamental concepts of inclusion, coherence, and locality. In the following sections, it is assumed that the cache memory is the lowest level $M_1$ and communicates directly with the CPU and its registers, labeled $M_0$. The highest level is labeled $M_n$, and contains all of the information words stored in the memory hierarchy, as

1.1.1 Inclusion

The inclusion property is stated as $M_1 \subset M_2 \subset M_3 \subset \cdots \subset M_n$. This property implies that all information items are originally stored in the highest level $M_n$. During the execution of a program, portions, or subsets, of $M_n$ are copied into $M_{n-1}$, and portions of $M_{n-1}$ are further copied to level $M_{n-2}$. Thus, if a word is found in level $M_i$, it will also be found in level $M_{i+1}$, $M_{i+2}$, and so on up the chain, until the highest level $M_n$ is reached. A word miss occurs when a word is searched for in level $M_i$, but is not found. If a word miss occurs in level $M_i$, it also means that a word miss occurred in all lower levels $M_{i-1}$, $M_{i-2}$, ..., $M_1$.
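The inclusion property and its consequence for word misses can be sketched with ordinary sets. This is a minimal illustration only: each level is modeled as a Python set of words, and the function names and the three-level example are assumptions made here, not part of the thesis.

```python
def satisfies_inclusion(levels):
    """levels[0] is M1 (lowest) and levels[-1] is Mn (highest).

    Returns True if every level is a subset of the next higher level.
    """
    return all(lower <= higher for lower, higher in zip(levels, levels[1:]))


def lowest_level_holding(levels, word):
    """Return the 1-based index i of the lowest level M_i containing word, or None."""
    for i, level in enumerate(levels, start=1):
        if word in level:
            return i
    return None


# Hypothetical contents: M1 is a subset of M2, which is a subset of M3.
m1 = {"a", "b"}
m2 = {"a", "b", "c", "d"}
m3 = {"a", "b", "c", "d", "e", "f"}
hierarchy = [m1, m2, m3]

print(satisfies_inclusion(hierarchy))
print(lowest_level_holding(hierarchy, "e"))
```

In this example the word "e" misses in M2, and, as the inclusion property requires, it therefore also misses in the lower level M1.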

Another concept associated with inclusion in the memory hierarchy is the method of information transfer between two levels of the hierarchy. The CPU and cache communicate through words (typically 4 or 8 bytes each). Due to the inclusion principle, the size of a cache block must then be bigger than the size of the memory word, and is typically 32 bytes, or 8 words. The cache and main memory communicate through blocks. The main memory is divided into pages (typically 128 bytes). Pages are the units of information transfer between the disk and main memory. At the highest level of the memory hierarchy, the pages of the main memory are stored as segments in the back-up storage device. These terms, and the principle of inclusion, are illustrated in Figure 2.

1.1.2 Coherence

The coherence property requires that copies of the same information item at higher levels of the memory hierarchy be consistent. This implies that if a word is modified in the cache, for example, copies of that word must be updated either immediately (write-through method) or eventually (write-back method) at all higher levels of the memory hierarchy.

1.1.3 Locality

The memory hierarchy was developed based on a behavior characteristic observed within the CPU that

[Figure: levels M0 (CPU registers), M1 (cache), M2 (main memory), M3 (disk storage), and M4 (magnetic tape unit for backup storage), with the transfers between adjacent levels. Access by word (4 bytes) from a cache block of 32 bytes, such as block a. Access by block (32 bytes) from a memory page of 32 blocks or 1 KByte, such as block b from page B. Access by page (1 KByte) from a file consisting of many pages, such as page A and page B from segment F. Segment transfer with different number of pages.]

Figure 2. The Inclusion Property and Data Transfers Between Adjacent Levels of a Memory Hierarchy [Hwang1993, p. 191]

In designing a certain level of the memory hierarchy, the above dimensions of locality offer some suggestions for an effective design. The temporal locality dimension leads to the popular use of the Least Recently Used (LRU) replacement algorithm and helps determine the size of memory at successive levels of the memory hierarchy. The spatial locality dimension assists in determining the size of

2 Introduction to Cache Memory

Cache memory is a small, fast memory unit that is usually placed between the CPU and the physical memory. It typically stores the most recently used instructions and/or data under the assumption that these instructions/data will be used again shortly. Since instructions are rarely written to, this thesis will deal solely with accesses to data residing in the cache.

The cache memory is faster to access than the physical memory. As can be seen in Table 2, as one moves further and further away from the CPU, the size of the memory storage unit increases, as does the access time. Thus, memory units close to the CPU, such as the internal registers and the cache, require less time to access than the physical memory and external disk storage devices.

| Level | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Called | Registers | Cache | Main Memory | Disk Storage |
| Typical size | <1 KB | <4 MB | <4 GB | >1 GB |
| Implementation technology | Custom memory with multiple ports, CMOS or BiCMOS | On-chip or off-chip CMOS SRAM | CMOS DRAM | Magnetic disk |
| Access time (in ns) | 2-5 | 3-10 | 80-400 | 5000000 |
| Bandwidth (in MB/sec) | 4000-32000 | 800-5000 | 400-2000 | 4-32 |
| Managed by | Compiler | Hardware | Operating System | Operating System and User |
| Backed by | Cache | Main Memory | Disk | Tape |

Table 2. The Increase of Access Time and Decrease in Bandwidth as One Moves Away from the CPU [Hennessy1996, p. 41]

2.1 Types of Cache Memories

Within the cache memory, there must be a way for the CPU to know where the necessary information resides. This enforces a mapping on the data/instruction blocks in the cache. There are three general formats for the mapping of a block into the cache: direct-mapped, fully-associative mapped, and set-associative mapped.

2.1.1 Direct Mapped

In a direct-mapped cache, each block has only one place that it can go. Thus, when the CPU needs a certain data block, there is only one place that it could possibly reside in the cache. If it is not there, then the needed block must be fetched from the higher level of the memory hierarchy, and the block that was previously in this position in the cache must be evicted. In order to determine where the block resides in the cache, the block frame address is divided (modulo) by the number of blocks in the cache. The remainder of this integer division gives the position in the cache where the requested data block would reside.
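The modulo calculation described above can be written in one line. The sketch below is illustrative only; the function name and the 8-block cache in the example are assumptions, not values from the thesis.

```python
def direct_mapped_position(block_frame_address, num_blocks):
    """Return the single cache position a block can occupy in a direct-mapped cache."""
    return block_frame_address % num_blocks


# In a hypothetical 8-block cache, block frame addresses 12 and 20 compete
# for the same position, so loading one evicts the other.
print(direct_mapped_position(12, 8))
print(direct_mapped_position(20, 8))
```

Two blocks whose addresses differ by a multiple of the number of cache blocks always map to the same position, which is why such blocks cannot be resident at the same time.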

2.1.2 Fully Associative

In a fully-associative cache, there are no restrictions as to where the block can be placed. When the CPU needs a certain data block, it must check each block that resides in the cache to determine if the required information is present in the cache. A block is only evicted from a fully-associative cache if the cache is full.

2.1.3 Set Associative

In a set-associative cache, the data block is restricted to a certain set of places. A set consists of a group of two or more blocks in the cache. When placing a block from main memory into the cache, the block is first mapped to a set, and then the block is free to go anywhere within that set. Thus, set-associative mapping combines the features of both direct mapping and fully-associative mapping in that the block is directly mapped to a set, and then is fully associative within that set. To determine the set that the block should be placed in, the block frame address is divided (modulo) by the number of sets that are in the cache. The remainder of this integer division gives the set number (starting from zero and going up to the number of sets minus one) that the requested block would be mapped to. A cache is said to be n-way set associative if there are n blocks in each set.
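The set calculation can be sketched in the same way as the direct-mapped case. The code below is a hedged illustration: the function names are hypothetical, and numbering the frames of a set contiguously is an assumption made here for readability, not something the thesis specifies.

```python
def set_index(block_frame_address, num_blocks, ways):
    """Return the set number for a block in an n-way set-associative cache."""
    if num_blocks % ways != 0:
        raise ValueError("the number of blocks must be a multiple of the associativity")
    num_sets = num_blocks // ways
    return block_frame_address % num_sets


def candidate_frames(block_frame_address, num_blocks, ways):
    """Return the cache frames the block may occupy.

    Assumes (for illustration) that set s owns frames s*ways .. s*ways + ways - 1.
    """
    first = set_index(block_frame_address, num_blocks, ways) * ways
    return list(range(first, first + ways))


# Hypothetical 8-block cache, block frame address 12:
print(candidate_frames(12, 8, 1))  # one block per set behaves like direct mapping
print(candidate_frames(12, 8, 2))  # 2-way: four sets of two blocks
print(candidate_frames(12, 8, 8))  # a single set behaves like a fully-associative cache
```

Setting the associativity to one block per set reproduces direct mapping, and setting it to the total number of blocks leaves a single set, which is the fully-associative case.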

2.1.4 The Effect of Associativity on the Cache Performance

Most caches today are set-associative, and increasing the degree of associativity has the effect of decreasing the miss ratio, since more blocks are allowed into the cache before one needs to be evicted in order to make more room in a set [Smith1987]. The highest miss ratios are thus observed in direct mapping, in which there is only one block that can be mapped to each set. Two-way associativity is slightly better by allowing two blocks within each set, and, as expected, three-way associativity is better than two-way [Smith1987]. Eventually a point is reached where further increases in the associativity have no effect on

Increasing the degree of associativity also has some disadvantages. First of all, the normal parallel implementation of a cache requires that the number of comparators and data readout paths equal the degree of associativity. As the degree of associativity is increased, the hardware cost of implementing it increases and becomes very expensive [Smith1987]. A study on increasing the degree of associativity within a cache was performed in this thesis and is discussed on page 76.

2.2 Finding a Block Within the Cache

When the CPU needs a certain memory block, it should first check the cache to see if the block resides there. Accessing the needed block from the cache memory will be faster than retrieving it from a higher level of the memory hierarchy.

To determine if the block resides in the cache, each block in the cache includes an address tag that gives the block frame address. Each appropriate tag in the cache is checked in order to see if the block it corresponds to contains the information needed. These tags can be checked in parallel to speed up the memory access. Note that in a direct-mapped cache, only one block needs to be checked, while in a set-associative cache, all blocks within the set need to be checked. The worst case is a fully-associative cache, where all the blocks in the cache must be checked to determine if they contain the requested information; however, once the requested block is found, the remaining blocks do not need to be checked.
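The number of tags that must be examined under each mapping can be illustrated with a small sequential search. This is a sketch under stated assumptions: the cache is modeled as a list of sets, each tag stores the full block frame address, and the search is sequential so that comparisons can be counted, whereas hardware would normally compare the tags of a set in parallel.

```python
def lookup(cache_sets, block_frame_address):
    """Search one set for a block and count the tag comparisons made.

    cache_sets is a list of sets; each set is a list of stored block frame
    addresses, with None marking an empty frame. Returns (hit, comparisons).
    """
    index = block_frame_address % len(cache_sets)
    comparisons = 0
    for stored_tag in cache_sets[index]:
        comparisons += 1
        if stored_tag == block_frame_address:
            return True, comparisons  # stop as soon as the block is found
    return False, comparisons


# The same four frames organized three ways (hypothetical contents).
direct_mapped = [[8], [5], [None], [3]]   # four sets of one block
two_way = [[8, None], [5, 3]]             # two sets of two blocks
fully_associative = [[8, 5, None, 3]]     # one set of four blocks

print(lookup(direct_mapped, 5))
print(lookup(two_way, 3))
print(lookup(fully_associative, 9))
```

A direct-mapped lookup never needs more than one comparison, while a miss in the fully-associative organization requires examining every frame.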

2.3 Writing to the Cache

When the CPU needs to modify the value of a data block, the cache should first be checked to see if the requested block is present. If it is, three schemes exist for writing data to the cache: write-through, write-back, and write-once. These schemes also represent the three different cache architectures that were implemented in this thesis.

2.3.1 Write-Through

In a write-through cache (also known as store through), the information is written both to the block in the cache and to the corresponding block in main memory. With this strategy, both the cache and main memory contain the updated value. This is important when working with external input and output

One problem with the write-through cache is that the CPU could have to wait for all of the writes to main memory to finish before continuing with the execution of its program. When the CPU is halted due to the cache having to write the updated data value to main memory, a write stall has occurred. To reduce the effect of write stalls, a write buffer is usually included in the design of a cache. The write buffer stores blocks that need to be written to main memory so that the CPU can resume the execution of its program while the write buffer writes its contents back to main memory in parallel with the execution of the CPU. Write buffers of different sizes have been implemented in various cache designs; however, their implementation is beyond the scope of this thesis.

2.3.2 Write-Back

In a write-back cache (also known as copy back, or store back), the information is written only to the block in the cache. Thus, the modified value will be written to main memory only when this block is evicted from the cache, if it ever is. While this modified block still resides in the cache, it is referred to as a dirty block, meaning that the block has been modified while it was in the cache, but its contents have not yet been written back to main memory. The status of a block is usually determined by a dirty bit, which tells whether the cache data block is clean (the contents of the cache block are the same as those in memory) or dirty (the contents of the cache block differ from those in memory). If the block is clean, there is no need to report its contents to the higher level of memory when it is evicted, since its contents have not been changed. The dirty bit feature is especially attractive to multiprocessors since it speeds up the memory access by not requiring every write to go directly to the main memory [Hennessy1996, p. 380].
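The difference between the write-through and write-back handling of a write hit, and the role of the dirty bit on eviction, can be sketched as follows. The class and function names are hypothetical, main memory is modeled as a dictionary, and a block holds a single value; none of this reflects the VHDL implementations of the thesis.

```python
class CacheBlock:
    """One cache block holding a single data value (a simplification)."""

    def __init__(self, address, data):
        self.address = address
        self.data = data
        self.dirty = False  # clean: same contents as main memory


def write_hit(block, value, memory, policy):
    """Apply a write to a block that is already in the cache."""
    block.data = value
    if policy == "write-through":
        memory[block.address] = value  # main memory is updated on every write
    elif policy == "write-back":
        block.dirty = True             # main memory is updated only on eviction
    else:
        raise ValueError(f"unknown policy: {policy}")


def evict(block, memory):
    """Write the block back only if it is dirty; return True if memory was written."""
    wrote = block.dirty
    if wrote:
        memory[block.address] = block.data
        block.dirty = False
    return wrote


memory = {0x10: 1}
block = CacheBlock(0x10, 1)
write_hit(block, 7, memory, "write-back")
print(memory[0x10], block.dirty)  # memory still holds the old value
print(evict(block, memory), memory[0x10])
```

With the write-back policy the memory copy stays stale until the eviction, whereas a clean block (including any block under write-through) is evicted without a memory write.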

2.3.3 Write-Once

The write-once protocol, proposed by John Goodman [Goodman1983], combines the write-through and write-back protocols into one, in order to get the benefits of both. The write-once protocol requires the first write to any cache entry to be written through to main memory, using the write-through protocol. Any subsequent write to that cache block will be done locally in the cache, and the modifications will only be written to main memory after the block is evicted from the cache, hence the use of the write-back protocol.

To implement the write-once protocol, two bits are associated with each cache block. The two bits distinguish among the four states that a cache block may reside in: invalid, valid, reserved, or dirty. In the invalid state, the cache block contains no data. In the valid state, the cache block contains data which has been read from main memory and has not yet been modified. Hence, the cache and main memory are consistent with respect to this block. The reserved state indicates that the block has been locally modified exactly once and that the results of this modification have been reported to main memory. The final state, dirty, indicates that the cache block has been modified more than once since it has been brought into the cache, and the latest change has not yet been transferred to main memory. In this state, the cache and main memory are inconsistent with respect to this cache block.
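The four write-once states can be sketched as a small state machine for a single cache block. This is an illustration only: it covers local reads, writes, and evictions in one cache and ignores bus activity from other caches, the names are hypothetical, and treating a write to an invalid block as a first (written-through) write is an assumption of the sketch.

```python
from enum import Enum


class State(Enum):
    INVALID = "invalid"
    VALID = "valid"
    RESERVED = "reserved"
    DIRTY = "dirty"


def step(state, event):
    """Return (next_state, writes_main_memory) for a local 'read', 'write', or 'evict'."""
    if event == "read":
        # Reading an invalid block fetches it from main memory; nothing is written.
        return (State.VALID if state is State.INVALID else state), False
    if event == "write":
        if state in (State.INVALID, State.VALID):
            # First write: written through to main memory. Handling a write to
            # an invalid block the same way is an assumption of this sketch.
            return State.RESERVED, True
        # Second and later writes stay local to the cache.
        return State.DIRTY, False
    if event == "evict":
        # Only a dirty block holds changes that main memory has not yet seen.
        return State.INVALID, state is State.DIRTY
    raise ValueError(f"unknown event: {event}")


def run(events, state=State.INVALID):
    """Apply a sequence of events and return the (state, memory_write) trace."""
    trace = []
    for event in events:
        state, wrote = step(state, event)
        trace.append((state, wrote))
    return trace


for state, wrote in run(["read", "write", "write", "write", "evict"]):
    print(state.value, wrote)
```

In this trace main memory is written twice: once on the first write (the write-through part) and once on eviction of the dirty block (the write-back part).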

2.3.4 Comparing the Strategies

The write-through protocol has been the most common approach since it is somewhat simpler to implement than the write-back protocol, and also since it is never necessary to report anything to memory after a block is evicted from the cache. Another advantage of the write-through protocol is that the cache and main memory are always consistent with each other, whereas in the write-back protocol, the most up-to-date copy may reside in the cache. This advantage may be useful in shared-memory multiprocessors, discussed on page 95.

Despite the above advantages of the write-through protocol over the write-back protocol, the write-back protocol is starting to become very popular because it avoids the bandwidth bottleneck of the write-through protocol [Smith1987]. Since not every write to the cache is immediately broadcast to main memory in the write-back protocol, as is done in the write-through protocol, considerable cache to main memory bandwidth is preserved, thus making the write-back protocol preferable in shared-memory multiprocessors in order to reduce the traffic to main memory from each processor. However, the write-back protocol suffers from the cache coherence problem, described later in this thesis. Hence, shared-memory multiprocessors would want a write-through cache in order to keep the cache and memory consistent. Multiprocessor cache coherence protocols will be discussed on page 96.

2.3.5 Writing to the Cache on a Cache Write Miss

The algorithms presented above only work when the memory block needed is in the cache. When the desired block is not in the cache, a write-miss has occurred, and the cache has two options for dealing with it: write-allocate or no-write-allocate.

2.3.5.1 Write-Allocate

In a write-allocate cache (also known as fetch on write), the needed block is loaded into the cache from the main memory, and the new data value is written to the block while it is in the cache. The modified data value is not reported to main memory until the block is removed from the cache and written back to main memory.

2.3.5.2 No-Write-Allocate

In a no-write-allocate cache (also known as write around), the block is modified directly in main memory and is not first brought into the cache. This scheme is generally used with the write-through cache scheme since the two strategies are very similar.
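The two write-miss options can be contrasted with a short sketch. The cache and main memory are modeled here as dictionaries from block address to data, and the function name is hypothetical; this is an illustration of the two policies as described above, not the thesis implementation.

```python
def handle_write_miss(cache, memory, address, value, policy):
    """Handle a write to a block that is not in the cache.

    cache and memory are dicts mapping a block address to its data.
    """
    if address in cache:
        raise ValueError("not a write-miss: the block is already in the cache")
    if policy == "write-allocate":
        cache[address] = memory[address]  # fetch on write: load the block first
        cache[address] = value            # then modify only the cached copy
    elif policy == "no-write-allocate":
        memory[address] = value           # write around: the cache is left untouched
    else:
        raise ValueError(f"unknown policy: {policy}")


cache, memory = {}, {5: 0}
handle_write_miss(cache, memory, 5, 42, "write-allocate")
print(cache, memory)  # the new value lives only in the cache

cache, memory = {}, {5: 0}
handle_write_miss(cache, memory, 5, 42, "no-write-allocate")
print(cache, memory)  # the new value lives only in main memory
```

After a write-allocate miss the cache holds the new value while main memory is stale; after a no-write-allocate miss the opposite is true and the block is still absent from the cache.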

2.4 Sources of Cache Misses

If a needed block is not found within a cache, a cache-miss occurs and the needed block must be obtained from main memory. It is important to investigate the reasons for cache-misses so that future cache designs can take these scenarios into account and arrive at better cache architectures. The three main sources of cache-misses are known as The Three C's: compulsory, capacity, and conflict.

2.4.1 Compulsory Cache-Misses

A compulsory cache-miss means that the very first access to the needed block results in a miss. Under this type of cache-miss, the needed block must be brought into the cache from main memory. This type of cache-miss is also known as a cold start miss or first reference miss.

2.4.2 Capacity Cache-Misses

During the execution of a large program or process, it may be impossible for the cache to contain all the blocks needed during its execution. Therefore, a capacity miss occurs when the needed block has been recently evicted from the cache to make room for another block and now must be placed back in the cache since it is needed again.

2.4.3 Conflict Cache-Misses

Conflict misses only occur in set-associative or direct-mapped caches, since these are the only cache architectures in which a block is evicted in order to make room for another block in the same set. Thus, if a set-associative or direct-mapped placement strategy is used, conflict misses, in addition to compulsory and capacity misses, will occur since too many blocks will be mapped to a set, or a particular position. Conflict

2.4.4 The Overall Effect of The Three C's on the Design of Cache Memory

To determine the overall effect of the Three C's on a cache memory, a simulation was performed on a cache with 32-byte blocks and using an LRU replacement scheme on a DECstation 5000 computer. The results of the simulation are reported in [Hennessy1996, p. 391] and are shown in Table 3.

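| Cache Size | Degree of Associativity | Total Miss-Rate | Compulsory | % of Total | Capacity | % of Total | Conflict | % of Total |
|---|---|---|---|---|---|---|---|---|
| 1 KB | 1-way | 0.133 | 0.002 | 1% | 0.080 | 60% | 0.052 | 39% |
| 1 KB | 2-way | 0.105 | 0.002 | 2% | 0.080 | 76% | 0.023 | 22% |
| 1 KB | 4-way | 0.095 | 0.002 | 2% | 0.080 | 84% | 0.013 | 14% |
| 1 KB | 8-way | 0.087 | 0.002 | 2% | 0.080 | 92% | 0.005 | 6% |
| 2 KB | 1-way | 0.098 | 0.002 | 2% | 0.044 | 45% | 0.052 | 53% |
| 2 KB | 2-way | 0.076 | 0.002 | 2% | 0.044 | 58% | 0.030 | 39% |
| 2 KB | 4-way | 0.064 | 0.002 | 3% | 0.044 | 69% | 0.018 | 28% |
| 2 KB | 8-way | 0.054 | 0.002 | 4% | 0.044 | 82% | 0.008 | 14% |
| 4 KB | 1-way | 0.072 | 0.002 | 3% | 0.031 | 43% | 0.039 | 54% |
| 4 KB | 2-way | 0.057 | 0.002 | 3% | 0.031 | 55% | 0.024 | 42% |
| 4 KB | 4-way | 0.049 | 0.002 | 4% | 0.031 | 64% | 0.016 | 32% |
| 4 KB | 8-way | 0.039 | 0.002 | 5% | 0.031 | 80% | 0.006 | 15% |
| 8 KB | 1-way | 0.046 | 0.002 | 4% | 0.023 | 51% | 0.021 | 45% |
| 8 KB | 2-way | 0.038 | 0.002 | 5% | 0.023 | 61% | 0.013 | 34% |
| 8 KB | 4-way | 0.035 | 0.002 | 5% | 0.023 | 66% | 0.010 | 28% |
| 8 KB | 8-way | 0.029 | 0.002 | 5% | 0.023 | 79% | 0.004 | 15% |
| 16 KB | 1-way | 0.029 | 0.002 | 7% | 0.015 | 52% | 0.012 | 42% |
| 16 KB | 2-way | 0.022 | 0.002 | 9% | 0.015 | 68% | 0.005 | 23% |
| 16 KB | 4-way | 0.020 | 0.002 | 10% | 0.015 | 74% | 0.003 | 17% |
| 16 KB | 8-way | 0.018 | 0.002 | 10% | 0.015 | 80% | 0.002 | 9% |
| 32 KB | 1-way | 0.020 | 0.002 | 10% | 0.010 | 52% | 0.008 | 38% |
| 32 KB | 2-way | 0.014 | 0.002 | 14% | 0.010 | 74% | 0.002 | 12% |
| 32 KB | 4-way | 0.013 | 0.002 | 15% | 0.010 | 79% | 0.001 | 6% |
| 32 KB | 8-way | 0.013 | 0.002 | 15% | 0.010 | 81% | 0.001 | 4% |
| 64 KB | 1-way | 0.014 | 0.002 | 14% | 0.007 | 50% | 0.005 | 36% |
| 64 KB | 2-way | 0.010 | 0.002 | 20% | 0.007 | 70% | 0.001 | 10% |
| 64 KB | 4-way | 0.009 | 0.002 | 21% | 0.007 | 75% | 0.000 | 3% |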

64KB

8-way

0.009 0.002 22% 0.007 78% 0.000