A Shared Memory Multiprocessor System Architecture Utilizing a Uniformly Shared Level 2 Data-Only Cache


Rochester Institute of Technology

RIT Scholar Works

Theses

Thesis/Dissertation Collections

8-1-1998


Frank Casilio


A Shared Memory Multiprocessor System Architecture

Utilizing a Uniformly Shared Level 2 Data-Only Cache

by

Frank Casilio

A Thesis Submitted

In

Partial Fulfillment of the

Requirements for the Degree of

MASTER OF SCIENCE

In

Computer Engineering

Approved by:

Committee Member:

Date:


Roy S. Czernikowski, Professor and Department Head

Date:

Tony H. Chang, Professor

Committee Member:

Department of Computer Engineering

College of Engineering

Rochester Institute of Technology

Rochester, New York


THESIS RELEASE PERMISSION FORM

Rochester Institute of Technology

College of Engineering

A Shared Memory Multiprocessor System Architecture

Utilizing a Uniformly Shared Level 2 Data-Only Cache

I, Frank Casilio, hereby grant permission to any individual or organization to reproduce

this thesis in whole or in part for non-commercial and non-profit purposes only.

Frank Casilio


Abstract

Due to VLSI lithography problems and the limitation of additional architectural enhancements, uniprocessor systems are nearing the end of their life cycle. Therefore, it is believed that Symmetric Multiprocessing (SMP) systems will be the next mainstream computer. These systems allow multiple processors, accessing the same memory image, to cooperate on a number of computational tasks as a single entity.

While multiprocessor systems can offer a substantial performance increase compared to uniprocessor systems, major design considerations must be addressed to achieve desired system efficiency levels. Managing cache coherence is a significant problem in multiprocessor systems. Current implementations cope with this problem by utilizing a cache coherence protocol. This protocol puts a large amount of overhead on the system bus to ensure proper program execution, effectively decreasing overall system performance. This thesis approaches the cache coherence problem from a new angle. Instead of utilizing a cache coherence protocol, a new memory system is proposed which eliminates the need for a cache coherence protocol by utilizing a shared level 2 data-only cache. This new architecture allows for better utilization of the system and improved performance and scalability.

A data rate analysis is conducted to demonstrate the potential performance increase from the proposed architecture over conventional approaches. The data rate model clearly shows an increase in system performance and utilization when using the architecture proposed in this thesis.

To My Parents, without their constant love and support this milestone in my career could not have been accomplished

Acknowledgements

I would like to thank the following individuals for their support during the completion of this thesis.

First and foremost, I would like to thank my graduate committee members, Dr. Roy S. Czernikowski, Dr. Tony Chang, and especially Dr. Muhammad Shaaban for the help and insight he offered into this thesis.

Secondly, I would like to thank all of my professors, managers, coworkers, and peers who have given me the privilege to learn, experience, and grow with them during

Trademarks

Intel and Pentium are trademarks of Intel Corporation

Table of Contents

Abstract
Acknowledgements
Trademarks
Table of Contents
List of Figures
List of Tables
List of Equations
Glossary
1 Introduction
1.1 VLSI Advancements
1.2 Architectural Advancements
1.2.1 Pipelining
1.2.2 Branch Prediction
1.2.3 Superscalar Design
1.2.4 Cache
1.3 Flynn's Classification of Computer Architectures
1.3.1 SISD
1.3.2 MISD
1.3.3 SIMD
1.3.4 MIMD
1.4 The Quest for a Mainstream Supercomputer Architecture
1.4.1 SMP
2 Cache Coherence
2.1 Cache Basics
2.1.1 Cache Organization
2.1.1.1 Direct Mapped Cache
2.1.1.2 Fully Associative
2.1.1.3 Set Associative
2.1.2 Cache Block Lookup
2.1.3 Replacement Strategy
2.1.3.1 Least Recently Used (LRU)
2.1.3.2 Random
2.1.3.3 First-In, First-Out (FIFO)
2.1.4 Write Policy
2.1.4.1 Write-Through
2.1.4.2 Write-Back
2.2 Multiprocessor Cache Coherence
2.2.1 Data Sharing
2.2.2 Process Migration
2.3 Ways to Handle Cache Coherence
2.3.7 Consistency
2.4.2 Weak Consistency
3 Design of MP Architectures
3.1 Current Multiprocessor Implementation
3.1.1 The Chipset
3.1.1.1 Circuit Switched Buses
3.1.1.2 Split Transaction Bus
3.1.2 Memory Type
3.1.2.1 Reading a Cache Block From Memory
3.1.3 The MESI Cache Coherence Protocol
3.2 Proposed Multiprocessor Architecture
3.2.1 Cache Arbitration Unit (CAU)
3.2.2 Shared L2* Cache
3.2.3 Shared L2* bus
3.2.4 Processor Requirements
4 Performance Analysis
4.1 Performance Analysis Methods
4.2 Data Rate Analysis
4.2.1 Data Rate Analysis for Current SMP Memory Systems
4.2.2 Data Rate Analysis of the Proposed Architecture
4.2.3 Invalidation Overhead
4.2.4 Comparison of Data Rate Models
5 Conclusions
5.1 Future Work
6 References

List of Figures

Figure 1-1: Grand Challenge Applications
Figure 1-2: Chip Density for Intel Microprocessors
Figure 1-3: Projected CPU Frequency for next 15 years
Figure 1-4: Five stage pipeline
Figure 1-5: Memory Hierarchy
Figure 1-6: Cache Effect on a System's Performance
Figure 1-7: Block Diagram of an SMP System with 4 Processors
Figure 1-8: Cache effect on a SMP system
Figure 2-1: Cache Organization Schemes
Figure 2-2: A Typical Cache Block Address
Figure 2-3: Data Inconsistency due to Data Sharing
Figure 2-4: Data Inconsistency due to Process Migration
Figure 2-5: Initial State of the Memory System
Figure 2-6: State of the System after a Write-Invalidation Operation
Figure 2-7: State of the Memory System after a Write-Update Operation
Figure 2-8: State Diagram for the Write-Once Protocol
Figure 2-9: The Sequential Consistency Model
Figure 2-10: The TSO Weak Consistency Model
Figure 3-1: The Intel Dual Pentium II Processor Memory System
Figure 3-2: A Circuit Switched Bus
Figure 3-3: A Split Transaction Bus
Figure 3-4: The MESI Write-Invalidate Protocol with Write-Back
Figure 3-5: Modified Architecture to Support a L2* Cache
Figure 3-6: Modification to the CPU Packaging
Figure 4-1: Data Rate of a Typical Program
Figure 4-2: Program Data Rate of Current SMP Memory Systems
Figure 4-3: Program Data Rate of Proposed Memory System
Figure 4-4: Effect of Invalidation on Cache Misses while Varying Block Size
Figure 4-5: Effect of Invalidation on Cache Misses while Varying Cache Size
Figure 4-6: Effect of Data Sharing on Bus Utilization
Figure 4-7: Performance Comparison when Programs Exhibit Fine Grain Sharing
Figure 4-8: Performance Difference when Programs Exhibit per-processor locality

List of Tables

List of Equations

Equation 1: Mapping for Block in a Direct Mapped Cache
Equation 2: Mapping for Block in a Set Associative Cache
Equation 3: Execution Time for the Current Architecture
Equation 4: Execution Time for the Proposed Architecture

Glossary

Bus, A set of conductors connecting various functional units in a computer. A shared memory bus specifically denotes a bus connecting the processors to the chipset

Branch Prediction, A method to predict the destination of conditional branch instructions to reduce stalls in the instruction pipeline

Cache, A relatively small amount of high-speed memory that contains frequently used instructions and data. It is intended to reduce the access times to the next higher level of the memory hierarchy

Cache Coherence, A problem which occurs in multiprocessor systems when multiple private caches have different values of the same cache block

Cache Hit, The data block requested by the processor exists in cache

Cache Miss, The data word requested by the processor does not exist in cache. The entire cache block containing the data word must be read from the next higher level of memory

Central Processing Unit (CPU), Responsible for processing instructions in the computer system and managing cache coherence in multiprocessor systems

Chipset, Responsible for controlling all major functions in computer systems. The chipset controls all access to memory and controls the system bus

Circuit Switched Bus, A bus arbitration scheme which gives the bus master exclusive control over the bus until its request has been filled

Consistency Model, Specifies the order by which the events from one process should be observed by other processes in the machine

Direct Mapped Cache, A cache organization that allows a block to be placed in only one specific location inside the cache

Dynamic Random Access Memory (DRAM), A type of semiconductor memory in which the information is stored in capacitors on an integrated circuit. Typically each bit is stored as an amount of electrical charge in a storage cell consisting of a capacitor and a transistor.

Extended Data Out Dynamic Random Access Memory (EDO DRAM), Allows the data outputs from memory to be kept active after the control signals have gone inactive. This can be used in pipelined systems for overlapping accesses where the next cycle is started before the data from the last cycle is removed from the bus.

First-In, First-Out (FIFO) Replacement, A cache block replacement strategy that removes the block that has been in the cache for the longest period of time

Fully Associative Cache, A cache organization that allows a block to be placed anywhere inside the cache

Least Recently Used (LRU) Replacement, A cache block replacement strategy that removes the block which has not been used in the longest period of time.

Level 2 (L2) Cache, A second level of cache that exists between the processor and main memory

Massively Parallel Processor (MPP), A computer system made from commodity processors that uses physically distributed memory to achieve a high level of parallelism through a high bandwidth interconnect

Multiprocessor (MP), See Symmetric Multiprocessing

Multiple Instruction Multiple Data (MIMD), Each processor fetches its own instruction and operates on its own data

Multiple Instruction Single Data (MISD), Each processor fetches its own instruction, but all processors operate on the same data

Pipelining, An architectural enhancement where multiple instructions are overlapped in execution

Rambus DRAM (RDRAM), Intended to replace SDRAM in future computer systems. It offers sustained transfer rates of around 1000 Mbps, so faster buses can be implemented.

Random Access Memory (RAM), A data storage device for which the order of access to different locations does not affect the speed of access

Reduced Instruction Set Computer (RISC), A processor whose design is based on the rapid execution of a sequence of simple instructions rather than on the provision of a large variety of complex instructions

Scalability, The measure of how system performance increases as system resources are increased

Set Associative Cache, A cache organization which divides the entire cache into separate sets, each of which can house a specific set of blocks

Single Instruction Single Data (SISD), See Uniprocessor

Single Instruction Multiple Data (SIMD), The same instruction is executed by multiple processors using different data

Snoopy Bus, A bus based protocol, commonly utilized in shared memory multiprocessor cache coherence protocols

Split Transaction Bus (STP), A bus arbitration scheme by which a master does not hold onto the bus if the slave device cannot respond immediately. Instead control is given to another device which can use it at that moment

Superscalar, An architectural enhancement for microprocessors by which multiple instructions are processed simultaneously using dynamic scheduling along with compiler optimizations

Symmetric Multiprocessing (SMP), A system configuration in which multiple identical processors are connected together via the same shared bus and have equal access to all resources

Synchronous Dynamic Random Access Memory (SDRAM), A form of DRAM which adds a separate clock signal to the control signals. SDRAM chips can contain more complex state machines, allowing them to support "burst" access modes that clock out a series of successive bits

Uniprocessor, A computer that has only one processor

Write Back, A caching mechanism where cache blocks are written back to the next level in the memory hierarchy only when needed

Write-Once Protocol, A cache coherence mechanism which forces a cache block to be written back to the next higher level of memory only after the first write by the processor

Write Invalidate, A type of cache coherence protocol that invalidates all other copies of the cache block in other processors' L2 caches

Write Through, A caching mechanism by which cache blocks are written back to the next higher level of memory after each write to the cache block

Write Update, A type of cache coherence protocol that updates all other copies of the cache block in other processors' L2 caches

Very Large Scale Integration (VLSI), A term describing semiconductor integrated circuits composed of hundreds of thousands of logic elements or memory cells.

1 Introduction

For the past 20 years the majority of improvements in computing have come from more powerful processors. In today's information age computing power is being challenged at all levels, from multimedia applications to the grand challenge problems. In 1992 the President instituted the five-year federal High Performance Computing and Communications Initiative. This has spurred the development of advanced processor technology and was initially focused on the solution of the grand challenges shown in Figure 1-1. These are fundamental problems in science and engineering, with broad economic and scientific impact, whose solution could be advanced by applying high performance computing techniques and resources.

Figure 1-1: Grand Challenge Applications (performance, from 100 MEGAFLOPS to TERAFLOP, versus year; grand challenges listed include fluid turbulence, pollution dispersion, human genome, ocean circulation, pharmaceutical design, quantum chromodynamics, semiconductor modeling, combustion systems, and vision and cognition)

What was once considered a supercomputer dedicated to solving particular problems now functions in tiny handheld devices. Hence, the need for faster processors will always exist. The ability to produce faster processors has been possible due to advances in VLSI (Very Large Scale Integration) technology and computer architecture over the past 20 years.

1.1 VLSI Advancements

In 1965 Gordon Moore observed that the number of transistors per square inch on integrated circuits had doubled every year since the integrated circuit was invented. Moore predicted that this trend would continue for the foreseeable future. In subsequent years the pace slowed down a bit, but data density has doubled approximately every 18 months. To this point that theory has held, and in many cases the actual increase has exceeded Moore's prediction. Figure 1-2 [INTEL] shows the transistor count for Intel microprocessors since the mid-1970s.
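To make the difference between the two doubling rates concrete, the short sketch below compares growth under a steady doubling period. It is only an illustration: the function name `growth_factor` and the six-year span are assumptions chosen for the example, not values taken from [INTEL].

```python
def growth_factor(years: float, doubling_period_years: float) -> float:
    """Multiplicative growth after `years` if density doubles every
    `doubling_period_years` (an idealized, steady-rate assumption)."""
    return 2.0 ** (years / doubling_period_years)


# Hypothetical six-year span, chosen only for illustration.
span_years = 6
yearly = growth_factor(span_years, 1.0)       # doubling every year
eighteen_mo = growth_factor(span_years, 1.5)  # doubling every 18 months

print(f"doubling every 12 months: x{yearly:.0f}")
print(f"doubling every 18 months: x{eighteen_mo:.0f}")
```

Over the same span the slower rate yields a quarter of the growth, which is why the choice of doubling period matters so much for long-range projections.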

Figure 1-2: Chip Density for Intel Microprocessors (transistor count, logarithmic scale, versus year from 1975 to a projected 2015; processors plotted from the 8086 through the Pentium Pro, projected to 1 billion transistors)

At the current rate of growth, processors with 1 billion transistors should surface around 2010. At this point, clock frequencies of processors will be around 10 GHz as shown in Figure 1-3 [INTEL].

Figure 1-3: Projected CPU Frequency for next 15 years (clock frequency in MHz, logarithmic scale, versus year; processors plotted from the 8086 through the Pentium Pro, projected to 10 GHz)

This exponential increase has been accomplished by the incredible advancements in VLSI technology. The feature size of modern processors has reached the 0.25 µm mark and is dropping further. This smaller feature size has allowed designers to produce smaller, cooler processors with a higher clock frequency. However, it is believed that current lithography techniques for silicon will not be applicable at feature sizes less than 0.1 µm. Even with a 0.1 µm feature size, 1 billion transistors would occupy an enormous amount of space and consume a large amount of power. It is predicted that within the next 5 years current silicon VLSI technology will reach its limit.

Once this point is reached an alternative to silicon, such as Gallium Arsenide (GaAs), will be needed. However, this new technology would require comprehensive modification of current VLSI technology and a complete retooling of fabrication facilities, which would take an outrageous amount of time and money. This leaves architectural advancements as the only viable alternative to achieve higher levels of performance.

1.2 Architectural Advancements

The second reason that microprocessors were able to keep up with the desire for more power is the advancement in computer architecture. Computer architecture deals largely with the instruction set architecture and the performance enhancement issues of CPU design.

There have been a number of dramatic changes to CPU architecture since the first IC-based CPU was created. The list below is by no means a complete list of advancements in computer architecture, but it serves as a point of reference for the impact that architectural advancements have made.

1.2.1 Pipelining

Pipelining has had the most dramatic impact on the performance of the CPU. This architectural improvement is an implementation technique that exploits parallelism among instructions in a sequential instruction stream. It has the substantial advantage that, unlike some speedup techniques, it is not visible to the programmer. Most modern processors use some type of linear synchronous pipeline with added features such as data forwarding and branch prediction.

A linear pipelined processor is a cascade of processing stages that are linearly connected to perform a fixed function over a stream of data flowing from one end to the other. The intent is to be able to introduce a new instruction into the pipeline at every clock cycle so that no stage in the pipeline is ever left idle. If this is accomplished then

As stated previously, linear pipelined processors are constructed with k processing stages. External inputs (operands) are fed into the pipeline at the first stage S_1. The processed results are passed from stage S_i to stage S_(i+1), for all i = 1, 2, ..., k-1. The final result emerges at the last stage, S_k. Each result is passed to the next stage based upon the clock cycle of the pipeline. Ideally, we expect the clock pulses to arrive at all the stages at the same time. However, due to a problem known as clock skewing, the same clock may arrive at different stages with a time offset. To account for this, the clock cycle of the pipeline must be at least the sum of the execution time of the longest stage of the pipeline and its clock skew offset. The block diagram of a five stage pipeline is shown in Figure 1-4 [HENPAT96].
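The clock cycle rule described above can be sketched numerically. The following is a minimal illustration, assuming a textbook timing model in which n instructions need k + n - 1 cycles to drain through a k-stage pipeline with no stalls; the stage delays and skew are hypothetical values, and the function names are introduced here only for the example.

```python
def pipeline_cycle_ns(stage_delays_ns, skew_ns):
    """Clock period: the slowest stage plus the clock skew offset."""
    return max(stage_delays_ns) + skew_ns


def pipelined_time_ns(n_instructions, stage_delays_ns, skew_ns):
    """Time for n instructions in an ideal (stall-free) k-stage pipeline:
    k cycles to fill, then one instruction completes per cycle."""
    k = len(stage_delays_ns)
    cycles = k + n_instructions - 1
    return cycles * pipeline_cycle_ns(stage_delays_ns, skew_ns)


def unpipelined_time_ns(n_instructions, stage_delays_ns):
    """Time when each instruction passes through every stage before
    the next one starts."""
    return n_instructions * sum(stage_delays_ns)


# Hypothetical five-stage pipeline; delays are illustrative only.
delays = [8, 10, 9, 10, 7]
skew = 1

cycle = pipeline_cycle_ns(delays, skew)           # 11 ns
piped = pipelined_time_ns(100, delays, skew)      # 104 cycles * 11 ns
serial = unpipelined_time_ns(100, delays)         # 100 * 44 ns
print(cycle, piped, serial, round(serial / piped, 2))
```

Under these assumed numbers the speedup is well short of the stage count of five, because the slowest stage and the skew set the clock for every stage.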

Figure 1-4: Five stage pipeline (instruction fetch, instruction decode/register fetch, execute/address calculation, memory access, write back)

Each stage of the pipeline performs one part of the processing of an instruction.

While pipelining has resulted in a tremendous increase in the throughput of a CPU, it has a large amount of overhead associated with its implementation. In addition, resource and data dependencies among the instructions being processed in the pipeline prevent full utilization of the pipeline. This manifests itself in terms of pipeline stall cycles. Therefore, pipelining complicates the traditional processor by introducing the need for additional advanced architectural concepts such as data forwarding and branch prediction.

1.2.2 Branch Prediction

Data dependencies and branch instructions limit the performance of pipelined processors due to the additional logic needed to cope with them. Branch instructions are very common in any process due to the behavior of the program being executed. When a pipeline is full and a branch is encountered, the address of the next instruction to be processed is not known until the previous instruction has finished executing, since the processor condition codes will not have been set correctly yet. When a branch is encountered a branch prediction unit is responsible for picking the next instruction to be executed in the pipeline. There are many advanced algorithms for choosing the correct instruction, based on the past history of execution. This architectural enhancement is essential to maintain the throughput of the pipeline at an acceptable level. Some branch prediction schemes have been able to reach 95% accuracy. If the wrong branch is taken, the pipeline must be flushed and execution is restarted from the last correct instruction.
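As one illustration of prediction based on past history, the sketch below implements a two-bit saturating counter per branch, a common textbook scheme. It is not necessarily the scheme used by any processor discussed in this thesis; the class name, the initial counter state, and the branch trace are all assumptions made for the example.

```python
class TwoBitPredictor:
    """Per-branch two-bit saturating counter.

    States 0-1 predict not taken, states 2-3 predict taken. Each branch
    starts at 1 (weakly not taken), an arbitrary choice for this sketch.
    """

    def __init__(self):
        self.counters = {}  # branch address -> state in 0..3

    def predict(self, pc):
        return self.counters.get(pc, 1) >= 2

    def update(self, pc, taken):
        state = self.counters.get(pc, 1)
        # Move one step toward the observed outcome, saturating at 0 and 3.
        self.counters[pc] = min(3, state + 1) if taken else max(0, state - 1)


def accuracy(predictor, trace):
    """Fraction of branches in `trace` predicted correctly."""
    correct = 0
    for pc, taken in trace:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct / len(trace)


# Hypothetical trace: a loop branch at address 0x40 that is taken nine
# times and then falls through, with the whole loop run ten times.
trace = ([(0x40, True)] * 9 + [(0x40, False)]) * 10
print(accuracy(TwoBitPredictor(), trace))  # 0.89
```

On this trace the predictor misses once while warming up and once at each loop exit, so a single fall-through does not flip its prediction for the next run of the loop.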

1.2.3 Superscalar Design

Superscalar designs incorporate additional functional units that are used to process a number of instructions simultaneously. These processors are sometimes called multiple-issue processors, since more than one instruction can be issued to functional units in a single clock cycle. The processor issues a varying number of instructions per clock, which may be statically scheduled by the compiler or dynamically scheduled. Usually, these instructions must be independent and will have to satisfy dependency constraints. Such dependency constraints include resource, control and data dependencies. If some instruction in the instruction stream is dependent or doesn't meet the issue criteria, only the instructions preceding that one in the sequence will be issued, hence the variability in issue rate. Most modern processors have a superscalar design, with some being able to issue up to 6 instructions at once if the conditions are appropriate.
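The variable issue rate described above can be illustrated with a deliberately simplified in-order issue model. The sketch assumes that only data dependencies within the same issue group block issue (resource and control dependencies, and instruction latencies, are ignored); the `Instr` type, register names, and program are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    dest: str               # register written
    srcs: Tuple[str, ...]   # registers read


def issue_groups(program: List[Instr], width: int) -> List[List[Instr]]:
    """Group instructions into per-cycle issue groups, in program order.

    A group closes when it reaches `width` instructions or when the next
    instruction reads or rewrites a register written earlier in the group.
    """
    groups = []
    i = 0
    while i < len(program):
        group, written = [], set()
        while i < len(program) and len(group) < width:
            ins = program[i]
            if written & set(ins.srcs) or ins.dest in written:
                break  # dependent on an instruction in this group
            group.append(ins)
            written.add(ins.dest)
            i += 1
        groups.append(group)
    return groups


# Hypothetical five-instruction sequence on a 4-wide machine.
program = [
    Instr("r1", ("r2", "r3")),
    Instr("r4", ("r5", "r6")),
    Instr("r7", ("r1", "r4")),  # needs r1 and r4 from the first group
    Instr("r8", ("r7", "r2")),  # needs r7
    Instr("r9", ("r5", "r6")),
]
print([len(g) for g in issue_groups(program, width=4)])  # [2, 1, 2]
```

Even with four issue slots available, the dependencies in this sequence limit issue to two, one, and two instructions in successive cycles.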

1.2.4 Cache

Along the same lines as pipelining, cache has an enormous impact on the throughput of a CPU. In the earliest microprocessor days instructions and data were stored in main memory, which is not located directly on the processor chip. This involves incurring latency due to the memory access time. When dealing with main memory, latency can lead to a lot of processor idle time since an external memory bus access request is issued. To deal with this, a small, fast chunk of memory is placed close to the CPU to hold intermediate data that might be needed again soon. This small amount of memory is

Locality states that most programs do not access all code or data uniformly. This principle, plus the guideline that smaller hardware is faster, led to the hierarchy based on memories of different speeds and sizes. Since fast memory is expensive, a memory hierarchy is organized into several levels, each smaller, faster and more expensive per byte than the next level. The goal is to provide a memory system with cost almost as low as the cheapest level of memory and speed almost as fast as the fastest level. The levels of the hierarchy usually subset one another; all data in one level is also found in the level below, and all data in that lower level is found in the one below it.

The memory hierarchy of a computer system starts at the processor level with its internal registers. These are the fastest and easiest for the processor to access. Next in line is the Level 1 (L1) cache stored on the same die as the processor. This level of cache also has a very low latency since it is operating at the same speed as the chip. The Level 2 cache, which is the next level in the hierarchy, can be on the same package as the CPU or on the main board. In either case it has a higher latency since data must come through the memory bus into the CPU. Main memory is the next level and is much larger (by many orders of magnitude) than L2 cache. Programs that are currently running are stored in main memory and are accessed through the memory bus when they are needed. Hard disk is generally considered the lowest level on the memory hierarchy chain. This level is the largest and by far the slowest level since it involves an actual physical movement of the read head. Figure 1-5 gives a visual depiction of the memory hierarchy.

Figure 1-5: Memory Hierarchy (levels shown from the chip outward: cache, main memory, secondary memory; physical scale labels: chip, board, box, room)

Table 1 [HENPAT96] shows the range of sizes and access times of each level in the memory hierarchy.

Level | 1 | 2 | 3 | 4
Called | Registers | Cache | Main Memory | Disk Storage
Typical Size | <1 KB | <4 MB | <4 GB | >1 GB
Implementation Technology | CMOS or BiCMOS | CMOS SRAM | CMOS DRAM | Magnetic disk
Access Time (ns) | 2-5 | 3-10 | 80-400 | 5,000,000
Bandwidth (MB/sec) | 4000-32,000 | 800-5000 | 400-2000 | 4-32
Managed by | Compiler | Hardware | Operating System | Operating System/User
Backed by | Cache | Main Memory | Disk | Tape/CD

Table 1: Range of sizes and access times in each level in the memory hierarchy
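The goal of approaching the speed of the fastest level can be illustrated with the standard average memory access time formula (hit time plus miss rate times miss penalty). This formula is a common textbook model rather than one stated in this chapter. The access times below are taken from the upper ends of the cache and main memory ranges in Table 1, while the 5% miss rate is purely an assumed value.

```python
def average_access_time_ns(hit_time_ns, miss_rate, miss_penalty_ns):
    """Average memory access time for a single cache level:
    every access pays the hit time, and misses also pay the penalty."""
    return hit_time_ns + miss_rate * miss_penalty_ns


# Cache and main memory times from the upper ends of the Table 1 ranges.
cache_ns = 10
main_memory_ns = 400
assumed_miss_rate = 0.05  # hypothetical value, for illustration only

amat = average_access_time_ns(cache_ns, assumed_miss_rate, main_memory_ns)
print(f"average access time: {amat:.1f} ns")  # 30.0 ns
```

Under these assumptions the average access costs about 30 ns, far closer to the cache than to main memory, even though the cache holds only a small fraction of the data.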

The need for cache is due to the fact that CPU performance has advanced faster than memory performance. CPU performance improved 35% per year until 1986 and 55% per year since, while memory performance improved only 7% per year. With this in mind, cache has proven to be a very effective way to improve overall system performance.
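The growth rates quoted above compound quickly. The following sketch computes how far processor performance pulls ahead of memory performance when both improve at steady annual rates; the ten-year horizon and the function name are assumptions chosen for illustration.

```python
def relative_gap(years, cpu_rate=0.55, memory_rate=0.07):
    """Ratio of CPU improvement to memory improvement after `years`,
    assuming both rates stay constant (an idealization)."""
    return ((1.0 + cpu_rate) / (1.0 + memory_rate)) ** years


# Rates from the text: 55% per year for CPUs, 7% per year for memory.
for years in (1, 5, 10):
    print(f"after {years:2d} years the gap has grown x{relative_gap(years):.1f}")
```

At these rates the gap grows by roughly a factor of 40 over a decade, which is the widening difference that caches are meant to hide.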

Figure 1-6 shows the speed of the dot product of two vectors on the cache-based RS/6000-980. For vector lengths greater than 2000 the cache cannot accommodate all relevant data and the performance drops as more data has to be transported to/from memory.

Figure 1-6: Cache Effect on a System's Performance (speed of dot product on IBM RS/6000-980 versus vector length, 0 to 10,000)

While each of these enhancements individually can increase processor performance, modern processors use them all to make an extremely advanced design. These enhancements, coupled with VLSI technology advancements, have made it possible for computing power to grow at an exponential rate.

1.3 Flynn's Classification of Computer Architectures

All uniprocessor systems follow the von Neumann model. The von Neumann architecture is characterized by a CPU and central memory system, with instructions and data being read from memory. After an instruction has been read, it is decoded, any relevant operands are fetched from memory, the instruction is executed, and the result is stored back in memory. The single data path between the CPU and memory over which both instructions and data must pass, and the sequential nature of instruction execution, together limit the performance possible from the computer. This is sometimes known as the von Neumann bottleneck. In uniprocessors this bottleneck is alleviated by pipelining and superscalar design enhancements.

One form of classification for von Neumann machines is based on the number of instructions that can be executed at any one time and on the number of chunks of data that can be operated on at a time. In 1972 Michael Flynn introduced a classification of various computer architectures based on notions of instruction and data streams. The number of instructions is given as either SI for single instruction or MI for multiple instruction, and the number of pieces of data is given as either SD for single data or MD for multiple data. Machines can thus be classified as SISD, MISD, SIMD or MIMD.
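The naming scheme above is mechanical enough to express in a few lines. The sketch below simply combines the SI/MI and SD/MD labels from the stream counts; the function name `flynn_class` is introduced here for illustration.

```python
def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Return the Flynn class name for the given stream counts."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("a machine needs at least one stream of each kind")
    instr = "SI" if instruction_streams == 1 else "MI"
    data = "SD" if data_streams == 1 else "MD"
    return instr + data


print(flynn_class(1, 1))  # SISD: a classical uniprocessor
print(flynn_class(1, 8))  # SIMD: one instruction stream, many data streams
print(flynn_class(4, 1))  # MISD
print(flynn_class(4, 4))  # MIMD
```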

1.3.1 SISD

The classical von Neumann machine can be regarded as a single-instruction-single-data machine in that at any one time only a single instruction is being executed, and only a single piece of data is being operated upon. This is where part of the problem arises, since we often want to perform the same instruction on many different pieces of data, and the von Neumann machine requires us to fetch the same instruction many times, once for each piece of data. In fact the situation is much worse since a von Neumann machine will usually require us to create a loop, and so we will need to execute many instructions for each piece of data. This can slow the machine down many times over what the arithmetic unit is capable of performing.

1.3.2 MISD

The multiple instruction single data (MISD) architecture is the most uncommon one. In this architecture, the same data stream flows through a linear array of processors, executing different instructions on the stream. This kind of architecture is also known as a systolic array for pipelined execution of specific algorithms.

1.3.3 SIMD

For problems in which the same operation needs to be performed on many pieces of data, particularly those involving vectors and arrays, SIMD (single-instruction multiple-data) architectures are often capable of high speeds. A single CPU controls many arithmetic units, each of which operates on its own data. Each arithmetic unit executes the same instruction as determined by the CPU, but uses data found in its own memory. Thus all the elements of two vectors could be added together simultaneously, increasing the speed of the operation many times over a SISD machine.

In practice, the provision of many arithmetic units is expensive, particularly since many of them will not be in use at any given time. Even if a large number of arithmetic units are provided, the size of vectors and arrays will rarely be a multiple of the number of arithmetic units and so some inefficiency in the use of the arithmetic units will arise.
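The inefficiency that arises when the vector length is not a multiple of the number of arithmetic units can be quantified with a small sketch. It assumes an idealized machine in which every pass occupies all units for one step, whether or not each unit has an element to work on; the vector lengths and unit count are hypothetical.

```python
import math


def simd_passes(vector_length: int, units: int) -> int:
    """Number of lock-step passes needed to cover the whole vector."""
    return math.ceil(vector_length / units)


def simd_utilization(vector_length: int, units: int) -> float:
    """Fraction of arithmetic-unit slots that do useful work."""
    return vector_length / (simd_passes(vector_length, units) * units)


# Hypothetical machine with 64 arithmetic units.
for n in (64, 100, 128, 129):
    print(n, simd_passes(n, 64), f"{simd_utilization(n, 64):.2f}")
```

A 100-element vector on 64 units needs two passes and leaves 28 unit slots idle in the second pass, while a 129-element vector needs a third pass for a single element.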

A more effective use of hardware can be obtained by pipelining the arithmetic unit. A hardware floating point accelerator will already contain dedicated hardware for each part of the calculation of a floating point operation. By pipelining the use of this hardware, significant improvements can be made in processor performance. This technique will not give as high a performance as a true SIMD machine, but the improvements can be significant.

1.3.4 MIMD

The most general form of von Neumann architecture is the multiple-instruction multiple-data machine. A MIMD machine is usually a number of separate processors connected together through some interconnection network. The actual format of interconnection between the processors can take many forms, depending on the type of problem which the machine is designed to solve. This is the most common architecture chosen for multiple processor machines because modern processors have the control logic for parallel systems built in. Therefore, this is attractive since software, replacement parts and additions to the system are easily accessible.

1.4 The Quest for a Mainstream Supercomputer Architecture

As stated previously, CPU performance advancements have come from two main areas: VLSI and computer architecture improvements. It seems certain that the advancements in VLSI technologies are hitting the limits. Also, most architectural enhancements have been implemented in current designs. With this in mind, the future of mainstream computing is in need of an alternative computer platform. This alternative lies in parallel processing. Parallel processing involves utilizing more than one CPU in a computer system, working cooperatively to achieve increased performance. As shown in the previous section, there are four architectures that could be used to implement parallel machines. It is believed that the appropriate choice for future machines will be of the shared memory MIMD "tightly-coupled" variety.

These systems will usually contain between 2 and O(10) CPUs on a single system board, with uniformly shared memory between the processors and an interconnection network on the board. Boards of this nature are referred to as "tightly-coupled" due to the fact that the processors lie close to each other on the same system board. Systems created in this manner are referred to as SMP machines.

1.4.1 SMP

An SMP node contains several identical processors, each typically with its own on-chip cache and a larger off-chip cache, which have uniform access to a shared memory and other resources such as the network interface. Figure 1-7 shows a block diagram of a symmetric multiprocessing system.

[image:31.576.153.424.193.397.2]

Figure 1-7: Block Diagram of an SMP System with 4 Processors

In this scenario there are four processors, each of which has its own local L2 cache outside of the CPU in addition to the L1 cache inside of the processor. In SMP systems each processor shares the same memory image. This means that if two different processors accessed the same memory location, they would receive identical values.

Some important characteristics of SMPs include:

High-Speed Memory Bus - Since several processors need to get access to main memory, a dedicated, high-throughput memory bus is required. Design of the memory bus is critical in producing an efficient SMP architecture.

Separate Secondary Cache - Each processor in the system has its own secondary (level 2) cache. The provision of separate caches for each processor requires complex logic in the cache controller to make sure that a processor never works on data that has been updated in another processor's cache. This problem is addressed through cache coherence protocols that make sure the most recent value in the processor's cache corresponds to the data in memory. The primary advantage of a dedicated-cache design is the ability to increase the number of processors in a system without saturating the memory bus. This approach seems to be the most popular for high-end multiprocessor servers because it ensures optimum performance even when a system is scaled to its maximum configuration. The size of the cache itself is also relevant to performance. As a general rule, the larger the secondary cache, the better an SMP system will scale as extra processors are added. Figure 1-8 shows the effect in TPS (transactions per second) of on-chip cache in an SMP system with two and four processors.

[Plot: vertical axis ticks from 0.000 to 800.000; horizontal axis: Mix Name; series: 4 CPU 1M L2, 4 CPU 512K L2, 2 CPU 1M L2, 2 CPU 512K L2]

[image:32.577.100.483.420.640.2]

Figure 1-8: Cache effect on an SMP system

I/O to Memory Bus Bridge - In systems today the I/O bus interfaces with the memory bus rather than directly to a CPU. This creates even more contention in SMP systems since the CPUs must go through it to access resources. Therefore a high-speed I/O to Memory Bus Bridge is required.

Multiprocessor systems will be the main thrust once VLSI and architectural advancements have reached the end of their lifetime. These systems will be found in homes and businesses alike.

The MIMD architecture seems to be the future of mainstream high performance computing. However, it involves system design complexity. A major obstacle in these systems is cache coherence. Since there are multiple processors working cooperatively on a single or multiple tasks, data is constantly being shared between the processors. When one processor makes a change to some piece of shared data (currently stored in its local cache) the other processors need to know about it in case they will need to use the same piece of data. If other processors are not informed immediately, the value for the data that they use may not be the most current. This problem is called cache consistency.

To alleviate this problem current systems use a cache coherence protocol to ensure that the data in a processor's local cache is always up to date. This extra processing and memory bus access results in a large amount of overhead for SMP systems. Unfortunately, there are no current implementations that can eliminate this problem. Instead current systems aim to deal with the problem in different ways, resulting in a large overhead due to the coherence protocols used.

This thesis will present a new architecture for multiple processor systems, which removes the cache coherence protocol required in shared memory MIMD architectures.

In chapter 2, we will investigate cache coherence and the various approaches to cope with it. We follow by focusing chapter 3 on analyzing the current MIMD architectures versus the proposed architecture. Then in chapter 4 we will compare benefits gained from using the new architecture along with the changes that need to be made for it to be implemented feasibly. To this end, a data rate model of SMP systems is employed to illustrate performance of each of the architectures. Finally, chapter 5 will conclude the thesis and present directions for future work.

2 Cache Coherence

As discussed in section 1.2.4, L2 cache exists between the CPU and main memory in a computer system. The purpose of L2 cache is to further reduce effective memory access time by reducing the L1 cache miss penalty, since main memory's speed is much slower than that of the CPU's internal registers and L1 cache.

In uniprocessor systems, cache is easily implemented with very little added design overhead and complexity. When the processor needs information that does not exist in its internal registers or L1 cache it checks for the data in the L2 cache. If the data does not exist in L2 cache, then the data is read from main memory, which may in turn need to go to a mass storage device to retrieve the information, generating a page fault. If a piece of data is found in a cache it is considered a cache hit, otherwise it is a cache miss. The cache miss ratio is simply one minus the cache hit ratio, which is the ratio of the number of items that are found in cache to the number of items requested. Once the data is found in memory it is transferred to L2 cache in the form of a cache block. When the processor is finished using a piece of data, it is updated in L1 and L2 cache. Main memory updates depend on the actual write policy used, either write-through or write-back, discussed in section 2.1.4. This method of program execution is the backbone of all uniprocessors following the von Neumann model.

A sufficiently fast memory bus must be implemented between L2 cache and main memory to meet the demand for instructions and data by the processor.

2.1 Cache Basics

Before the cache coherence problem can be detailed it is important to have a good understanding of how caches handle data. The memory hierarchy of computers breaks information up into blocks of data. These blocks of data are moved in and out of cache as needed. An entire block of data (which includes many memory locations) is moved at a time, due to the principle of spatial locality. Spatial locality states that items whose addresses are near one another tend to be referenced close together in time [HENPAT96]. Therefore, when a new block is brought into the cache it is beneficial to bring in the surrounding data also, since they will most likely be needed in the near future. The design of cache subsystems involves four major issues that need to be addressed.

2.1.1 Cache Organization

The organization of a cache dictates where a block can be placed when it is brought in from main memory. There are three cache organizations used today: direct mapped, fully associative, and set associative. Figure 2-1 visually describes each of the three organizations. Their descriptions are contained in the following sections.

[Figure labels: Fully associative: block 12 can go anywhere. Direct mapped: block 12 can go only into block 4 (12 mod 8). Set associative: block 12 can go anywhere in set 0 (12 mod 4). The cache has block frames 0 to 7, grouped into sets 0 to 3 in the set associative case; memory block frame addresses run from 0 to 31.]

[image:37.576.114.461.60.336.2]

Figure 2-1: Cache Organization Schemes

2.1.1.1 Direct Mapped Cache

In a direct mapped cache, each block has only one place where it can go in the cache. The mapping for a block in a direct mapped cache is shown in Equation 1.

(block address) MOD (number of blocks in cache)

Equation 1: Mapping for Block in a Direct Mapped Cache

2.1.1.2 Fully Associative

In a fully associative cache, each block can appear anywhere in the cache. Fully associative caches are extremely easy to implement, because of the minimal amount of overhead involved.

2.1.1.3 Set Associative

Finally, set associative caches limit the number of places a block can be placed. Blocks in the cache are broken off into groups called sets. Each block in memory is mapped into a single set, generally using Equation 2.

(block address) MOD (number of sets in the cache)

Equation 2: Mapping for Block in a Set Associative Cache

Once a memory block has been assigned to a set it can be placed anywhere inside the set.
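
The two mappings can be checked against the block 12 example used in Figure 2-1. The sketch below is illustrative; the helper name `candidate_frames` and the way frames are numbered within sets are assumptions made for this example.

```python
def candidate_frames(block_address, num_blocks, ways):
    """Cache block frames in which a memory block may be placed.

    `ways` is the number of frames per set (an assumed parameter):
    ways == 1 gives a direct mapped cache (Equation 1),
    ways == num_blocks gives a fully associative cache,
    anything in between is set associative (Equation 2).
    """
    num_sets = num_blocks // ways
    set_index = block_address % num_sets          # the MOD mapping
    first = set_index * ways
    return list(range(first, first + ways))

# Memory block 12 in an 8-block cache:
print(candidate_frames(12, 8, 1))   # [4]        direct mapped, 12 mod 8
print(candidate_frames(12, 8, 2))   # [0, 1]     4 sets, set 0 = 12 mod 4
print(candidate_frames(12, 8, 8))   # [0, 1, 2, 3, 4, 5, 6, 7]  anywhere
```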

2.1.2 Cache Block Lookup

Now that it is understood how caches get data into them, the process of reading from a cache will be detailed next. Each block in a cache has an address tag associated with it. When a processor wishes to retrieve data from the cache it uses the block's address tag to reference the data. The tag of each block (the actual number of blocks checked depends upon the organization of the cache) is compared against the tag requested by the CPU. The figure below shows the layout of an address for a piece of data in a cache.

[Address layout: Block address (Tag, Index) followed by Block offset]

Figure 2-2: A Typical Cache Block Address

The index portion of the address is used to select the set in the cache, while the block offset is used to select the actual piece of data in the cache block. The tag portion of the address is compared to the processor's requested tag.

A fully associative cache would have no index field since a block is not restricted to any single set. Note that in a set associative cache the index field would be used to select the set that contains the data, while in a direct mapped cache the index field would select the actual block containing the data.
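
The field layout of Figure 2-2 can be sketched as a few bit operations. This is a minimal illustration that assumes the block size and the number of sets are powers of two; the function name and the example cache geometry (32-byte blocks, 128 sets) are hypothetical.

```python
def split_address(address, block_size, num_sets):
    """Split a byte address into (tag, index, block_offset).

    Assumes block_size and num_sets are powers of two. With
    num_sets == 1 (fully associative) the index field is empty and
    the whole block address becomes the tag.
    """
    offset_bits = block_size.bit_length() - 1
    index_bits = num_sets.bit_length() - 1
    block_offset = address & (block_size - 1)
    index = (address >> offset_bits) & (num_sets - 1)
    tag = address >> (offset_bits + index_bits)
    return tag, index, block_offset

print(split_address(0x12345, block_size=32, num_sets=128))  # (18, 26, 5)
print(split_address(0x12345, block_size=32, num_sets=1))    # (2330, 0, 5)
```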

2.1.3 Replacement Strategy

The replacement strategy dictates which block is replaced when a cache miss occurs. The actual process of selecting the block to be replaced when a cache miss occurs is done by the cache controller. A cache miss occurs when the tag requested by the CPU was not found in the cache. When a miss occurs, a block in the cache must be replaced with a block from the next higher level of memory. In a direct mapped cache there is no need for a replacement strategy since there is only one location that a block is capable of going into. There are many replacement strategies that are used in cache controllers. Three replacement strategies are Least Recently Used, Random, and First-In, First-Out (FIFO).

2.1.3.1 Least Recently Used (LRU)

The LRU replacement strategy records all accesses to cache blocks. When a cache miss occurs the cache block that is replaced is the one that has gone unused for the longest amount of time. This follows along the same lines as temporal locality. Temporal locality states that a cache block that has been recently used is likely to be used again in the near future. LRU replacement can become extremely expensive, especially in large caches, since all accesses need to be recorded internally in the cache.

2.1.3.2 Random

The simplest strategy to employ is a random replacement strategy that is spread uniformly across the cache. When a cache miss occurs a random block number is selected and the selected block is replaced. Studies have found that while the random replacement strategy may not be the most intuitive strategy its results are quite impressive. The attractiveness of a random replacement strategy is in the ease of implementation.

2.1.3.3 First-In, First-Out (FIFO)

The FIFO replacement strategy replaces the cache block that has been in cache for the longest period of time. This strategy has proven to yield worse results than the LRU and random replacement strategies.
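
The difference between LRU and FIFO can be seen on a short reference trace. The sketch below models a single fully associative set; the function names and the trace are assumptions chosen for illustration, and one trace is not evidence of how the strategies compare in general.

```python
from collections import OrderedDict, deque

def count_misses_lru(trace, capacity):
    """Misses when the least recently used block is replaced."""
    cache = OrderedDict()              # oldest use first, newest last
    misses = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)   # record the access
        else:
            misses += 1
            if len(cache) == capacity:
                cache.popitem(last=False)   # evict least recently used
            cache[block] = True
    return misses

def count_misses_fifo(trace, capacity):
    """Misses when the block resident longest is replaced."""
    arrival_order = deque()
    resident = set()
    misses = 0
    for block in trace:
        if block not in resident:
            misses += 1
            if len(arrival_order) == capacity:
                resident.discard(arrival_order.popleft())
            arrival_order.append(block)
            resident.add(block)
    return misses

trace = [1, 2, 3, 1, 4, 1, 5, 1]       # block 1 is reused frequently
print(count_misses_lru(trace, 3))      # 5
print(count_misses_fifo(trace, 3))     # 6
```

On this trace FIFO evicts block 1 even though it was just used, which costs one extra miss. Note that LRU has to update its bookkeeping on every hit, which reflects the recording cost mentioned in section 2.1.3.1.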

2.1.4 Write Policy

The final aspect of caches is the write policy. When a data value has been modified in the processor registers, it is immediately written back to L1 cache and L2 cache. Updating the data in main memory depends on the particular write policy being employed in the system. There are two write policies that are used in cache design today.

2.1.4.1 Write-Through

In a write-through cache, when data is written to L2 cache it is also written to main memory simultaneously. Therefore, main memory always contains an exact copy of the data that is in the L2 cache of the processor. Write-through caches put a large amount of overhead onto the memory bus since it is not always necessary to have an updated copy of the data in main memory. For this reason, write-through caches are not widely used. However, write-through caches are extremely easy to implement since all writes are sent to L2 and main memory at the same time and no additional logic is needed in the cache.

2.1.4.2 Write-Back

In write-back caches, data is written back only to the L2 cache. When the cache becomes full or the cache block is being replaced, the data is then updated in main memory. Therefore, all writing to main memory from L2 cache is done when a cache miss is encountered. The advantage of the write-back cache is that all writes from the processor occur locally at the speed of the cache memory (much faster than main memory). Since main memory is only updated when the cache block is replaced, the memory bandwidth requirements of a write-back policy are much more lenient than those of the write-through policy. This frees up bandwidth for other devices in the system, most importantly, other processors in multiprocessor systems.
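
The bandwidth difference between the two write policies can be illustrated by counting main-memory writes for the same access trace. This is a simplified sketch, assuming a direct mapped cache that allocates a block on a write miss and a final flush of dirty blocks; the function name and the trace are hypothetical.

```python
def memory_writes(trace, num_blocks, policy):
    """Count writes to main memory for a trace of (op, block_address).

    op is "r" or "w"; policy is "write-through" or "write-back".
    Assumes a direct mapped cache that allocates on write misses.
    """
    frames = {}                      # frame index -> (block, dirty flag)
    writes = 0
    for op, block in trace:
        index = block % num_blocks
        resident, dirty = frames.get(index, (None, False))
        if resident != block:        # miss: the frame is replaced
            if dirty:
                writes += 1          # write-back of the evicted block
            resident, dirty = block, False
        if op == "w":
            if policy == "write-through":
                writes += 1          # every write also goes to memory
            else:
                dirty = True         # memory is updated only on eviction
        frames[index] = (resident, dirty)
    # Count blocks that are still dirty, so both policies end with
    # main memory fully up to date.
    writes += sum(1 for _, dirty in frames.values() if dirty)
    return writes

trace = [("w", 0)] * 10 + [("w", 8)]   # blocks 0 and 8 share frame 0
print(memory_writes(trace, 8, "write-through"))  # 11
print(memory_writes(trace, 8, "write-back"))     # 2
```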

2.2 Multiprocessor Cache Coherence

In a multiprocessor system, data inconsistency can occur in adjacent levels of memory or within the same level. Therefore, it is possible for the current data in main memory to differ from its most recent value, since the most recent value would be stored only in the processor's local cache (depending on the write policy, it may also be in main memory). In addition, L2 caches of other processors may contain even older data values of the same memory location. This is not possible in uniprocessor systems since there is only one processor in the system that will ever modify the data. There are two possible ways inconsistent data can appear in a cache: data sharing or process migration.

2.2.1 Data Sharing

Since data in a multiprocessor system is commonly shared between many processes executing on different processors, it is possible for the private caches on each processor to contain different copies of the same shared data. Figure 2-3 [HWANG93] shows how inconsistency can occur when dealing with shared data.

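[Figure panels, each showing processors P1 and P2, their caches, the bus, and shared memory. Before update: both caches hold X and shared memory holds X. Write-through: P1's cache holds X', P2's cache holds X, shared memory holds X'. Write-back: P1's cache holds X', P2's cache holds X, shared memory holds X.]

Figure 2-3: Data Inconsistency due to Data Sharing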

2.2.2 Process Migration

In multiprocessor systems, processes frequently migrate from one processor to another. Unfortunately, shared data that is residing in a processor's local cache does not migrate with the process. Therefore, it is possible for a processor to update a shared data value in its L2 cache, get interrupted, and hand the process over to the OS, which in turn hands it over to another processor. This problem is known as process migration and is a common occurrence resulting in data inconsistencies. Figure 2-4 [HWANG93] visually depicts how data becomes inconsistent due to process migration.

[Figure panels: Before Migration, Write-through, Write-back. Each panel shows processors P1 and P2, their caches, the bus, and shared memory holding copies of X.]

[image:43.576.119.444.52.191.2]

Figure 2-4: Data Inconsistency due to Process Migration

2.3 Ways to Handle Cache Coherence

Multiprocessor systems have widely varying ways to handle inconsistent data in memory. In massively parallel processors (MPPs) a directory-based scheme is used, while in SMP systems snoopy bus protocols are used.

2.3.1

features to keep data coherent.

Since caches are used in these designs, there will be data consistency problems in main memory since the most recent data values would be stored in caches. To this end, SMP systems employ a cache coherence mechanism to ensure that the value a processor is reading from its cache is the most current one. Cache coherence requires both hardware and software support to achieve acceptable levels of performance.

The hardware support for cache coherence comes in the form of a snoopy bus, operating under a snoopy-bus protocol. When multiple private caches are tied together on a single bus, the methods used to ensure consistency entail changes to the write policies. The snoopy bus allows all processors in the system to monitor the traffic on the memory bus. The processor is allowed to take appropriate action depending upon the write policy. Figure 2-5 shows an SMP system with a shared memory variable loaded into each of the processors' local caches.

[Figure labels: Shared Memory, Bus, Caches, Processors]

[image:44.576.180.395.565.652.2]

Figure 2-5: Initial State of the Memory System

In the next sections we will see how the caches and memory are modified to cope with cache coherence.

2.3.2.1 Write-Invalidate Policy

In a write-invalidate policy, when a processor writes a value to a cache block in its private cache it also sends an invalidation signal to all other caches which contain the cache block, including main memory. This signal notifies them that the data has changed.

[Figure labels: Shared Memory, Bus, Caches, Processors]

[image:45.576.179.395.300.385.2]

Figure 2-6: State of the System after a Write-Invalidation Operation

Figure 2-6 [HWANG93] shows the state of the system after a write-invalidate operation by P1. Since P1 has a write-through cache, main memory contains the updated value that it received over the snoopy bus from P1.
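
The write-invalidate behaviour can be sketched with a toy model of one shared variable. This is an illustration only, assuming write-through caches as in the example above; the class and method names are hypothetical and the model ignores bus timing and arbitration.

```python
class WriteInvalidateSystem:
    """Toy model: one shared variable, write-through private caches."""

    def __init__(self, num_processors, initial_value):
        self.memory = initial_value
        # None marks an invalid (or absent) cache copy.
        self.caches = [initial_value] * num_processors

    def read(self, cpu):
        if self.caches[cpu] is None:          # miss: refetch from memory
            self.caches[cpu] = self.memory
        return self.caches[cpu]

    def write(self, cpu, value):
        self.caches[cpu] = value
        self.memory = value                   # write-through to memory
        for other in range(len(self.caches)):
            if other != cpu:
                self.caches[other] = None     # snooped invalidation

system = WriteInvalidateSystem(3, "X")
system.write(0, "X'")
print(system.caches, system.memory)   # ["X'", None, None] X'
print(system.read(1))                 # X'
```

After the write, the other processors no longer hold a stale copy; their next read misses and fetches the current value from memory.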

2.3.2.2 Write-Update Policy

In a write-update policy, when a processor writes a value to a cache block in its private cache it also updates other caches (if they contain the cache block) and main memory with the new value. The update is done using the features of the snoopy bus, which allows other processors to monitor bus activity. Figure 2-7 [HWANG93] shows the state of an SMP system after the write-update operation, in which all caches now contain the updated value.

[Figure labels: Shared Memory, Bus, Caches, Processors]

[image:46.576.180.404.161.247.2]

Figure 2-7: State of the Memory System after a Write-Update Operation

The write-update policy is extremely effective at ensuring data consistency. However, it places an unnecessarily large amount of traffic on the memory bus since not all processors may need the updated value.
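
The bus traffic cost of write-update can be illustrated by counting broadcasts for repeated writes by one processor. The sketch below is a simplified model with hypothetical names; it assumes one bus broadcast per write and a single shared variable.

```python
def write_update(caches, writer, value):
    """Apply one write under a write-update policy.

    `caches` holds one entry per processor; None means the processor
    has no copy of the block. Returns the new main-memory value.
    Each call stands for one broadcast on the bus.
    """
    caches[writer] = value
    for cpu, copy in enumerate(caches):
        if cpu != writer and copy is not None:
            caches[cpu] = value       # snooping cache picks up the update
    return value                      # main memory is updated as well

caches = ["X", "X", None]             # P3 holds no copy of the block
memory = "X"
bus_broadcasts = 0
for i in range(1, 6):
    memory = write_update(caches, writer=0, value=f"X{i}")
    bus_broadcasts += 1

print(caches, memory, bus_broadcasts)  # ['X5', 'X5', None] X5 5
```

In this run the second processor never reads the intermediate values, so four of the five broadcasts carried data that nobody used.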

2.3.2.3 Write-Once Protocol

James Goodman in 1983 proposed a cache coherence protocol for bus-based multiprocessors. In order to reduce unnecessary bus traffic, the very first write of a cache block uses a write-through policy. This will keep main memory consistent with the local cache after the first write. After the first write, memory is updated using the write-back policy [GOOD83]. Figure 2-8 [HWANG93] details Goodman's protocol, which uses 4 states to describe its execution.

[Figure labels: state transitions triggered by P-Read, P-Write, Write-inv, and Read-inv]

[image:47.576.198.380.64.240.2]

Figure 2-8: State Diagram for the Write-Once Protocol

Each transaction in the figure represents extra overhead that is placed on the memory bus. This traffic reduces the amount of utilization of the bus by other processors, which in turn degrades the overall performance of the machine.
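
The write-once protocol can be sketched as a transition table. The state names (Invalid, Valid, Reserved, Dirty) and the Read-blk command are assumptions based on how the protocol is commonly described, since only some event labels are legible in the figure; the table is a sketch and not a transcription of Figure 2-8.

```python
# (state, processor event) -> (next state, bus command issued or None)
PROCESSOR_EVENTS = {
    ("Invalid", "P-Read"): ("Valid", "Read-blk"),
    ("Invalid", "P-Write"): ("Dirty", "Read-inv"),
    ("Valid", "P-Read"): ("Valid", None),
    ("Valid", "P-Write"): ("Reserved", "Write-inv"),  # the write-through
    ("Reserved", "P-Read"): ("Reserved", None),
    ("Reserved", "P-Write"): ("Dirty", None),         # write-back from here
    ("Dirty", "P-Read"): ("Dirty", None),
    ("Dirty", "P-Write"): ("Dirty", None),
}

# (state, bus command snooped from another cache) -> next state
SNOOPED_COMMANDS = {
    ("Valid", "Write-inv"): "Invalid",
    ("Valid", "Read-inv"): "Invalid",
    ("Reserved", "Read-inv"): "Invalid",
    ("Reserved", "Read-blk"): "Valid",
    ("Dirty", "Read-inv"): "Invalid",
    ("Dirty", "Read-blk"): "Valid",
}

def local_access(state, event):
    """Next state and bus command for a read or write by this processor."""
    return PROCESSOR_EVENTS[(state, event)]

def snoop(state, bus_command):
    """Next state after observing another cache's bus command."""
    return SNOOPED_COMMANDS.get((state, bus_command), state)

state = "Invalid"
traffic = []
for event in ["P-Read", "P-Write", "P-Write", "P-Write"]:
    state, bus_command = local_access(state, event)
    traffic.append(bus_command)

print(state, traffic)   # Dirty ['Read-blk', 'Write-inv', None, None]
```

In this trace only the read miss and the first write use the bus; the later writes stay local, which is the traffic reduction the protocol aims for.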

2.4 Consistency Models

Parallel applications executing on a parallel machine require data to be used in multiple CPUs. Because of this it is very important to make other processors aware of any changes to data of which they may also have a copy. Consistency models specify the order by which the events from one process should be observed by other processes in the machine [HWANG93]. The two main consistency models are sequential and weakened consistency.

2.4.1 Sequential Consistency

Sequential consistency is when "the result of any execution [of the program] is the same as if the operations of all processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program" [LAMP79]. Since data is loaded and stored identically to uniprocessor systems the coherence mechanism must be able to respond rapidly to changes.

[Figure labels: Processors, switch, Single-Port Memory, Shared Memory System]

[image:48.576.167.413.143.260.2]

Figure 2-9: The Sequential Consistency Model

Figure 2-9 [HWANG93] shows how the sequential consistency model can be described. Each processor is connected to memory through the same switch, ensuring that no processor can update main memory out of order.

The single-ported memory ensures that there is only one memory access operation in progress at any one time. Therefore some queuing mechanism is needed to order and serialize the memory references while they wait to be serviced.

In 1992 Sindhu, Frailong, and Cekleov specified that for sequential consistency to exist the following five axioms must be true [SINDHU92]:

1) A load by a processor always returns the value written by the latest store to the same location by other processors.

2) The memory order conforms to a total binary order in which shared memory is accessed in real time over all loads and stores with respect to all processor pairs and location pairs.

3) If two operations appear in a particular program order, then they appear in the same memory order.

4) The swap operation is atomic with respect to other stores, meaning that no other store can intervene between the load and store parts of a swap.

5) All stores and swaps must eventually terminate.

The sequential consistency model is enforced in hardware, on the fly. In this model, all memory accesses are atomic and tightly ordered to ensure the accuracy of the model. Therefore, all memory accesses must be global and a processor cannot issue another memory access until the most recent shared memory access by a processor has been performed globally. These mechanisms ensure that the correct program order is observed.
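
The definition of sequential consistency can be explored by enumerating every interleaving of two small programs that keeps each processor's program order. The sketch below is illustrative; the two programs, the variable names, and the helper functions are assumptions chosen for this example.

```python
def interleavings(a, b):
    """Yield every merge of a and b that preserves each one's order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

# Each operation is (kind, location, value to store or register to load).
P1 = [("store", "x", 1), ("load", "y", "r1")]
P2 = [("store", "y", 1), ("load", "x", "r2")]

def run(schedule):
    """Execute one serialized schedule against a single shared memory."""
    memory = {"x": 0, "y": 0}
    registers = {}
    for kind, location, arg in schedule:
        if kind == "store":
            memory[location] = arg
        else:
            registers[arg] = memory[location]
    return registers["r1"], registers["r2"]

outcomes = {run(schedule) for schedule in interleavings(P1, P2)}
print(sorted(outcomes))   # [(0, 1), (1, 0), (1, 1)]
```

The result (0, 0) never appears: under sequential consistency at least one processor must observe the other's store, because every schedule is a single serialized order of the four operations.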

2.4.2 Weak Consistency

The sequential consistency model demands the most memory bandwidth and additional support (both hardware and software) to ensure its accuracy. To reduce the amount of bandwidth and extra work needed, various degrees of weaker consistency models have been created. The TSO weak consistency model was develop