
Rochester Institute of Technology

RIT Scholar Works

Theses

Thesis/Dissertation Collections

8-1-1998

A Shared memory multiprocessor system

architecture utilizing a uniform

Frank Casilio

Follow this and additional works at: http://scholarworks.rit.edu/theses

This Thesis is brought to you for free and open access by the Thesis/Dissertation Collections at RIT Scholar Works. It has been accepted for inclusion in Theses by an authorized administrator of RIT Scholar Works. For more information, please [email protected].

Recommended Citation


A Shared Memory Multiprocessor System Architecture

Utilizing a Uniformly Shared Level 2 Data-Only Cache

by

Frank Casilio

A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of

MASTER OF SCIENCE

in Computer Engineering

Approved by:

Committee Member:

Date:


Roy S. Czernikowski, Professor and Department Head

Date:

Tony H. Chang, Professor

Committee Member:

Department of Computer Engineering

College of Engineering

Rochester Institute of Technology

Rochester, New York

THESIS RELEASE PERMISSION FORM

Rochester Institute of Technology

College of Engineering

A Shared Memory Multiprocessor System Architecture Utilizing a Uniformly Shared Level 2 Data-Only Cache

I, Frank Casilio, hereby grant permission to any individual or organization to reproduce this thesis in whole or in part for non-commercial and non-profit purposes only.

Frank Casilio

Abstract

Due to VLSI lithography problems and the limitation of additional architectural enhancements, uniprocessor systems are nearing the end of their life cycle. Therefore, it is believed that Symmetric Multiprocessing (SMP) systems will be the next mainstream computer. These systems allow multiple processors, accessing the same memory image, to cooperate on a number of computational tasks as a single entity.

While multiprocessor systems can offer a substantial performance increase compared to uniprocessor systems, major design considerations must be addressed to achieve desired system efficiency levels. Managing cache coherence is a significant problem in multiprocessor systems. Current implementations cope with this problem by utilizing a cache coherence protocol. This protocol puts a large amount of overhead on the system bus to ensure proper program execution, effectively decreasing overall system performance. This thesis approaches the cache coherence problem from a new angle. Instead of utilizing a cache coherence protocol, a new memory system is proposed which eliminates the need for a cache coherence protocol by utilizing a shared level 2 data-only cache. This new architecture allows for better utilization of the system and improved performance and scalability.

A data rate analysis is conducted to demonstrate the potential performance increase from the proposed architecture over conventional approaches. The data rate model clearly shows an increase in system performance and utilization when using the architecture proposed in this thesis.

To My Parents, without their constant love and support this milestone in my career could not have been accomplished

Acknowledgements

I would like to thank the following individuals for their support during the completion of this thesis. First and foremost, I would like to thank my graduate committee members, Dr. Roy S. Czernikowski, Dr. Tony Chang, and especially Dr. Muhammad Shaaban for the help and insight he offered into this thesis. Secondly, I would like to thank all of my professors, managers, coworkers, and peers who have given me the privilege to learn, experience, and grow with them during

Trademarks

Intel, Pentium are trademarks of Intel Corporation

Table of Contents

Abstract
Acknowledgements
Trademarks
Table of Contents
List of Figures
List of Tables
List of Equations
Glossary
1 Introduction
1.1 VLSI Advancements
1.2 Architectural Advancements
1.2.1 Pipelining
1.2.2 Branch Prediction
1.2.3 Superscalar Design
1.2.4 Cache
1.3 Flynn's Classification of Computer Architectures
1.3.1 SISD
1.3.2 MISD
1.3.3 SIMD
1.3.4 MIMD
1.4 The Quest for a Mainstream Supercomputer Architecture
1.4.1 SMP
2 Cache Coherence
2.1 Cache Basics
2.1.1 Cache Organization
2.1.1.1 Direct Mapped Cache
2.1.1.2 Fully Associative
2.1.1.3 Set Associative
2.1.2 Cache Block Lookup
2.1.3 Replacement Strategy
2.1.3.1 Least Recently Used (LRU)
2.1.3.2 Random
2.1.3.3 First-In, First-Out (FIFO)
2.1.4 Write Policy
2.1.4.1 Write-Through
2.1.4.2 Write-Back
2.2 Multiprocessor Cache Coherence
2.2.1 Data Sharing
2.2.2 Process Migration
2.3 Ways to Handle Cache Coherence
2.3.1 Directory Based Protocols
2.3.2 Snoopy Bus Protocols
2.3.2.1 Write-Invalidate Policy
2.3.2.2 Write-Update Policy
2.3.2.3 Write-Once Protocol
2.4 Consistency Models
2.4.1 Sequential Consistency
2.4.2 Weak Consistency
3 Design of MP Architectures
3.1 Current Multiprocessor Implementation
3.1.1 The Chipset
3.1.1.1 Circuit Switched Buses
3.1.1.2 Split Transaction Bus
3.1.2 Memory Type
3.1.2.1 Reading a Cache Block From Memory
3.1.3 The MESI Cache Coherence Protocol
3.2 Proposed Multiprocessor Architecture
3.2.1 Cache Arbitration Unit (CAU)
3.2.2 Shared L2* Cache
3.2.3 Shared L2* Bus
3.2.4 Processor Requirements
4 Performance Analysis
4.1 Performance Analysis Methods
4.2 Data Rate Analysis
4.2.1 Data Rate Analysis for Current SMP Memory Systems
4.2.2 Data Rate Analysis of the Proposed Architecture
4.2.3 Invalidation Overhead
4.2.4 Comparison of Data Rate Models
5 Conclusions
5.1 Future Work
6 References

List of Figures

Figure 1-1: Grand Challenge Applications
Figure 1-2: Chip Density for Intel Microprocessors
Figure 1-3: Projected CPU Frequency for next 15 years
Figure 1-4: Five stage pipeline
Figure 1-5: Memory Hierarchy
Figure 1-6: Cache Effect on a Systems Performance
Figure 1-7: Block Diagram of an SMP System with 4 Processors
Figure 1-8: Cache effect on a SMP system
Figure 2-1: Cache Organization Schemes
Figure 2-2: A Typical Cache Block Address
Figure 2-3: Data Inconsistency due to Data Sharing
Figure 2-4: Data Inconsistency due to Process Migration
Figure 2-5: Initial State of the Memory System
Figure 2-6: State of the System after a Write-Invalidation Operation
Figure 2-7: State of the Memory System after a Write-Update Operation
Figure 2-8: State Diagram for the Write-Once Protocol
Figure 2-9: The Sequential Consistency Model
Figure 2-10: The TSO Weak Consistency Model
Figure 3-1: The Intel Dual Pentium II Processor Memory System
Figure 3-2: A Circuit Switched Bus
Figure 3-3: A Split Transaction Bus
Figure 3-4: The MESI Write-Invalidate Protocol with Write-Back
Figure 3-5: Modified Architecture to Support a L2* Cache
Figure 3-6: Modification to the CPU Packaging
Figure 4-1: Data Rate of a Typical Program
Figure 4-2: Program Data Rate of Current SMP Memory Systems
Figure 4-3: Program Data Rate of Proposed Memory System
Figure 4-4: Effect of Invalidation on Cache Misses while Varying Block Size
Figure 4-5: Effect of Invalidation on Cache Misses while Varying Cache Size
Figure 4-6: Effect of Data Sharing on Bus Utilization
Figure 4-7: Performance Comparison when Programs Exhibit Fine Grain Sharing
Figure 4-8: Performance Difference when Programs Exhibit per-processor locality

List of Tables

List of Equations

Equation 1: Mapping for Block in a Direct Mapped Cache
Equation 2: Mapping for Block in a Set Associative Cache
Equation 3: Execution Time for the Current Architecture
Equation 4: Execution Time for the Proposed Architecture

Glossary

Bus, A set of conductors connecting various functional units in a computer. A shared memory bus specifically denotes a bus connecting the processors to the chipset.

Branch Prediction, A method to predict the destination of conditional branch instructions to reduce stalls in the instruction pipeline.

Cache, A relatively small amount of high-speed memory that contains frequently used instructions and data. It is intended to reduce the access times to the next higher level of the memory hierarchy.

Cache Coherence, A problem which occurs in multiprocessor systems when multiple private caches have different values of the same cache block.

Cache Hit, The data block requested by the processor exists in cache.

Cache Miss, The data word requested by the processor does not exist in cache. The entire cache block containing the data word must be read from the next higher level of memory.

Central Processing Unit (CPU), Responsible for processing instructions in the computer system and managing cache coherence in multiprocessor systems.

Chipset, Responsible for controlling all major functions in computer systems. The chipset controls all access to memory and controls the system bus.

Circuit Switched Bus, A bus arbitration scheme which gives the bus master exclusive control over the bus until its request has been filled.

Consistency Model, Specifies the order by which the events from one process should be observed by other processes in the machine.

Direct Mapped Cache, A cache organization that allows a block to be placed in a specific location only inside the cache.

Dynamic Random Access Memory (DRAM), A type of semiconductor memory in which the information is stored in capacitors on an integrated circuit. Typically each bit is stored as an amount of electrical charge in a storage cell consisting of a capacitor and a transistor.

Extended Data Out Dynamic Random Access Memory (EDO DRAM), Allows the data outputs from memory to be kept active after the control signals have gone inactive. This can be used in pipelined systems for overlapping accesses where the next cycle is started before the data from the last cycle is removed from the bus.

First-In, First-Out (FIFO) Replacement, A cache block replacement strategy that removes the block that has been in the cache for the longest period of time.

Fully Associative Cache, A cache organization that allows a block to be placed anywhere inside the cache.

Least Recently Used (LRU) Replacement, A cache block replacement strategy that removes the block which has not been used in the longest period of time.

Level 2 (L2) Cache, A second level of cache that exists between the processor and main memory.

Massively Parallel Processor (MPP), A computer system made from commodity processors that uses physically distributed memory to achieve a high level of parallelism through a high bandwidth interconnect.

Multiprocessor (MP), See Symmetric Multiprocessing.

Multiple Instruction Multiple Data (MIMD), Each processor fetches its own instruction and operates on its own data.

Multiple Instruction Single Data (MISD), Each processor fetches its own instruction, but all processors operate on the same data.

Pipelining, An architectural enhancement where multiple instructions are overlapped in execution.

Rambus DRAM (RDRAM), Intended to replace SDRAM in future computer systems. It offers sustained transfer rates of around 1000 Mbps, so faster buses can be implemented.

Random Access Memory (RAM), A data storage device for which the order of access to different locations does not affect the speed of access.

Reduced Instruction Set Computer (RISC), A processor whose design is based on the rapid execution of a sequence of simple instructions rather than on the provision of a large variety of complex instructions.

Scalability, The measure of how system performance increases as system resources are increased.

Set Associative Cache, A cache organization which divides the entire cache into separate sets which can house a specific set of blocks.

Single Instruction Single Data (SISD), See Uniprocessor.

Single Instruction Multiple Data (SIMD), The same instruction is executed by multiple processors using different data.

Snoopy Bus, A bus based protocol, commonly utilized in shared memory multiprocessor cache coherence protocols.

Split Transaction Bus (STP), A bus arbitration scheme by which a master does not hold onto the bus if the slave device cannot respond immediately. Instead control is given to another device which can use it at that moment.

Superscalar, An architectural enhancement for microprocessors by which multiple instructions are processed simultaneously using dynamic scheduling along with compiler optimizations.

Symmetric Multiprocessing (SMP), A system configuration in which multiple identical processors are connected together via the same shared bus and have equal access to all resources.

Synchronous Dynamic Random Access Memory (SDRAM), A form of DRAM which adds a separate clock signal to the control signals. SDRAM chips can contain more complex state machines, allowing them to support "burst" access modes that clock out a series of successive bits.

Uniprocessor, A computer that has only one processor.

Write Back, A caching mechanism where cache blocks are written back to the next level in the memory hierarchy only when needed.

Write-Once Protocol, A cache coherence mechanism which forces a cache block to be written back to the next higher level of memory only after the first write by the processor.

Write Invalidate, A type of cache coherence protocol that invalidates all other copies of the cache block in the other processors' L2 caches.

Write Through, A caching mechanism by which cache blocks are written back to the next higher level of memory after each write to the cache block.

Write Update, A type of cache coherence protocol that updates all other copies of the cache block in the other processors' L2 caches.

Very Large Scale Integration (VLSI), A term describing semiconductor integrated circuits composed of hundreds of thousands of logic elements or memory cells.

1 Introduction

For the past 20 years the majority of improvements in computing have come from more powerful processors. In today's information age, computing power is being challenged at all levels, from multimedia applications to the grand challenge problems. The President instituted, in 1992, the five-year federal High Performance Computing and Communications Initiative. This has spurred the development of advanced processor technology and was initially focused on the solution of the grand challenges shown in Figure 1-1. These are fundamental problems in science and engineering, with broad economic and scientific impact, whose solution could be advanced by applying high performance computing techniques and resources.

[Figure 1-1: Grand Challenge Applications. Performance, from 100 megaflops to a teraflop, plotted against year. Grand challenges listed: integrated fluid and structural airframe simulation, fluid turbulence, pollution dispersion, human genome, ocean circulation, pharmaceutical design, quantum chromodynamics, semiconductor modeling, combustion systems, vision and cognition.]

What was once considered a supercomputer dedicated to solving particular problems now functions in tiny handheld devices. Hence, the need for faster processors will always exist. The ability to produce faster processors has been possible due to advances in VLSI (Very Large Scale Integration) technology and computer architecture over the past 20 years.

1.1 VLSI Advancements

In 1965 Gordon Moore observed that the number of transistors per square inch on integrated circuits had doubled every year since the integrated circuit was invented. Moore predicted that this trend would continue for the foreseeable future. In subsequent years, the pace slowed down a bit, but data density has doubled approximately every 18 months. To this point that theory has held, and in many cases the actual increase has exceeded Moore's prediction. Figure 1-2 [INTEL] shows the transistor count for Intel microprocessors since the mid-1970's.
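The doubling rule quoted above can be turned into a small projection. The following is only a sketch of the arithmetic: the function name, the default 18-month doubling period, and the baseline figures in the usage note are illustrative assumptions, not data taken from Figure 1-2.

```python
def projected_transistors(base_count, base_year, target_year, doubling_months=18.0):
    """Project a transistor count assuming one doubling every `doubling_months`.

    All arguments are illustrative inputs; nothing here is read from the
    thesis figures.
    """
    elapsed_months = (target_year - base_year) * 12.0
    doublings = elapsed_months / doubling_months
    return base_count * 2.0 ** doublings


if __name__ == "__main__":
    # Hypothetical baseline: 5 million transistors in 1998.
    for year in (1998, 2001, 2004, 2007, 2010):
        print(year, round(projected_transistors(5e6, 1998, year)))
```

With an 18-month doubling period the count grows fourfold every three years, which is why a projection of this kind reaches the billion-transistor range within roughly a decade of a baseline in the millions.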

[Figure 1-2: Chip Density for Intel Microprocessors. Transistor count versus year, 1975 to 2015, from the 8086 through the Pentium Pro processor, with projected values beyond.]

At the current rate of growth, processors with 1 billion transistors should surface around 2010. At this point, clock frequencies of processors will be around 10 GHz as shown in Figure 1-3 [INTEL].

[Figure 1-3: Projected CPU Frequency for next 15 years. Clock frequency in MHz versus year, from the 8086 through the Pentium Pro processor, with a projection to 10 GHz.]

This exponential increase has been accomplished by the incredible advancements in VLSI technology. The feature size of modern computers has reached the 0.25 µm mark and is dropping further. This smaller feature size has allowed designers to produce smaller, cooler processors with a higher clock frequency. However, it is believed that current lithography techniques for silicon will not be applicable at feature sizes less than 0.1 µm. Even with a 0.1 µm feature size, 1 billion transistors would occupy an enormous amount of space and consume a large amount of power. It is predicted that within the next 5 years current silicon VLSI technology will reach its limit.

Once this point is reached an alternative to silicon, such as Gallium Arsenide (GaAs), will be needed. However, this new technology would require comprehensive modification of current VLSI technology and a complete retooling of fabrication facilities, which would take an outrageous amount of time and money. This leaves architectural advancements as the only viable alternative to achieve higher levels of performance.

1.2 Architectural Advancements

The second reason that microprocessors were able to keep up with the desire for more power is due to the advancements in computer architecture. Computer architecture deals largely with the instruction set architecture, and performance enhancement issues of CPU design.

There have been a number of dramatic changes to CPU architecture since the first IC-based CPU was created. The list below is by no means a complete list of advancements in computer architecture, but it serves as a point of reference to the impact that architectural advancements have made.

1.2.1 Pipelining

Pipelining has had the most dramatic impact on the performance of the CPU. This architectural improvement is an implementation technique that exploits parallelism among instructions in a sequential instruction stream. It has the substantial advantage that, unlike some speedup techniques, it is not visible to the programmer. Most modern processors use some type of linear synchronous pipeline with added features such as data forwarding and branch prediction.

A linear pipelined processor is a cascade of processing stages that are linearly connected to perform a fixed function over a stream of data flowing from one end to the other. The intent is to be able to introduce a new instruction into the pipeline at every clock cycle so that no stage in the pipeline is ever left idle. If this is accomplished then

As stated previously, linear pipelined processors are constructed with k processing stages. External inputs (operands) are fed into the pipeline at the first stage S1. The processed results are passed from stage Si to stage Si+1, for all i = 1, 2, ..., k-1. The final result emerges at the last stage, Sk. Each result is passed to the next stage based upon the clock cycle of the pipeline. Ideally, we expect the clock pulses to arrive at all the stages at the same time. However, due to a problem known as clock skewing the same clock may arrive at different stages with a time offset. To avoid this the clock cycle of the pipeline must be the combined maximum of the execution time of the longest stage of the pipeline and its clock skewed offset. The block diagram of a five stage pipeline is shown in Figure 1-4 [HENPAT96].
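The clock-cycle rule described above (longest stage delay plus clock skew) can be sketched numerically. The stage delays and skew below are hypothetical values chosen for illustration, and the (k + n - 1) cycle count for n instructions on a k-stage pipeline is the usual textbook idealization with no stalls; it is an added assumption, not a formula stated in this text.

```python
def pipeline_clock_period(stage_delays, skew):
    """Clock period: the delay of the slowest stage plus the clock skew offset."""
    return max(stage_delays) + skew


def pipelined_time(n_instructions, stage_delays, skew):
    """Ideal time for n instructions: k cycles to fill, then one result per cycle."""
    k = len(stage_delays)
    return (k + n_instructions - 1) * pipeline_clock_period(stage_delays, skew)


def unpipelined_time(n_instructions, stage_delays):
    """Each instruction passes through every stage before the next one starts."""
    return n_instructions * sum(stage_delays)


if __name__ == "__main__":
    delays_ns = [2, 3, 2, 4, 2]  # hypothetical delays for a five stage pipeline
    skew_ns = 1                  # hypothetical clock skew
    t_pipe = pipelined_time(100, delays_ns, skew_ns)
    t_seq = unpipelined_time(100, delays_ns)
    print(t_pipe, t_seq, t_seq / t_pipe)
```

The sketch makes one consequence of the rule visible: because every stage is clocked at the pace of the slowest stage plus skew, the speedup stays below the stage count k unless the stages are well balanced.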

[Figure 1-4: Five stage pipeline. Stages shown: instruction fetch, instruction decode/register fetch, execute/address calculation, memory access, write back.]

Each stage of the pipeline performs one part of the processing of an instruction.

While pipelining has resulted in a tremendous increase in the throughput of a CPU, it has a large amount of overhead associated with its implementation. In addition, resource and data dependencies among the instructions being processed in the pipeline prevent full utilization of the pipeline. This manifests itself in terms of pipeline stall cycles. Therefore, pipelining complicates the traditional processor by introducing the need for additional advanced architectural concepts such as data forwarding and branch prediction.

1.2.2 Branch Prediction

Data dependencies and branch instructions limit the performance of pipelined processors due to the additional logic needed to cope with them. Branch instructions are very common in any process due to the behavior of the program being executed. When a pipeline is full and a branch is encountered, the address of the next instruction to be processed is not known until the previous instruction has finished executing, since the processor condition codes will not have been set correctly yet. When a branch is encountered a branch prediction unit is responsible for picking the next instruction to be executed in the pipeline. There are many advanced algorithms for choosing the correct instruction, based on the past history of execution. This architectural enhancement is essential to maintain the throughput of the pipeline at an acceptable level. Some branch prediction schemes have been able to reach 95% accuracy. If the wrong branch is taken, the pipeline must be flushed and execution is restarted from the last correct instruction.
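As an illustration of prediction based on past execution history, the sketch below implements a two-bit saturating counter per branch address. This is one classic scheme chosen for the example; the text does not say which algorithm any particular processor uses, and the class name, the initial counter state, and the sample trace are all assumptions.

```python
class TwoBitPredictor:
    """One 2-bit saturating counter per branch address.

    Counter states 0 and 1 predict "not taken"; states 2 and 3 predict "taken".
    """

    def __init__(self, initial_state=1):
        self.initial_state = initial_state  # assumed starting state: weakly not taken
        self.counters = {}                  # branch address -> counter state (0..3)

    def predict(self, address):
        return self.counters.get(address, self.initial_state) >= 2

    def update(self, address, taken):
        state = self.counters.get(address, self.initial_state)
        if taken:
            state = min(state + 1, 3)
        else:
            state = max(state - 1, 0)
        self.counters[address] = state


def prediction_accuracy(predictor, trace):
    """Fraction of (address, taken) outcomes in the trace that were predicted correctly."""
    correct = 0
    for address, taken in trace:
        if predictor.predict(address) == taken:
            correct += 1
        predictor.update(address, taken)
    return correct / len(trace)


if __name__ == "__main__":
    # Hypothetical loop branch: taken nine times, then falls through; repeated ten times.
    loop_trace = ([(0x40, True)] * 9 + [(0x40, False)]) * 10
    print(prediction_accuracy(TwoBitPredictor(), loop_trace))
```

Each misprediction in such a trace corresponds to one pipeline flush, so the accuracy figure translates directly into the stall cost discussed above.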

1.2.3 Superscalar Design

Superscalar designs incorporate additional functional units that are used to process a number of instructions simultaneously. These processors are sometimes called multiple-issue processors, since more than one instruction can be issued to functional units in a single clock cycle. The processor issues a varying number of instructions per clock, which may be statically scheduled by the compiler or dynamically scheduled. Usually, these instructions must be independent and will have to satisfy dependency constraints. Such dependency constraints include resource, control and data dependencies. If some instruction in the instruction stream is dependent or doesn't meet the issue criteria, only the instructions preceding that one in the sequence will be issued, hence the variability in issue rate. Most modern processors have a superscalar design, with some being able to issue up to 6 instructions at once if the conditions are appropriate.
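The variable issue rate described above can be sketched with a simplified in-order issue model. The representation of an instruction as a (destination, sources) tuple, the register names, and the restriction to data dependencies within a single issue group are assumptions made for the example; resource and control dependencies, latencies, and write-after-read hazards are ignored.

```python
def issue_groups(instructions, width):
    """Split an instruction stream into per-cycle issue groups.

    Each instruction is a (dest_register, source_registers) tuple. Issue stops
    for the current cycle at the first instruction that reads or rewrites a
    register written earlier in the same group, so only the instructions
    preceding it are issued together.
    """
    groups = []
    i = 0
    while i < len(instructions):
        group = []
        written = set()
        while i < len(instructions) and len(group) < width:
            dest, sources = instructions[i]
            if written.intersection(sources) or dest in written:
                break  # dependent on an instruction already in this group
            group.append(instructions[i])
            written.add(dest)
            i += 1
        groups.append(group)
    return groups


if __name__ == "__main__":
    program = [
        ("r1", ("r2", "r3")),
        ("r4", ("r5", "r6")),
        ("r7", ("r1", "r4")),
        ("r8", ("r7",)),   # depends on the previous instruction
        ("r9", ("r2",)),
    ]
    for cycle, group in enumerate(issue_groups(program, width=2)):
        print(cycle, [dest for dest, _ in group])
```

The first instruction of every group always issues because nothing has been written yet in that cycle, so the loop is guaranteed to make progress.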

1.2.4 Cache

Along the same lines as pipelining, cache has an enormous impact on the throughput of a CPU. In the earliest microprocessor days instructions and data were stored in main memory, which is not located directly on the processor chip. This involves incurring latency due to the memory access time. When dealing with main memory, latency can lead to a lot of processor idle time since an external memory bus access request is issued. To deal with this, a small, fast chunk of memory is placed close to the CPU to hold intermediate data that might be needed again soon. This small amount of memory is

Locality states that most programs do not access all code or data uniformly. This principle, plus the guideline that smaller hardware is faster, led to the hierarchy based on memories of different speeds and sizes. Since fast memory is expensive, a memory hierarchy is organized into several levels, each smaller, faster and more expensive per byte than the next level. The goal is to provide a memory system with cost almost as low as the cheapest level of memory and speed almost as fast as the fastest level. The levels of hierarchy usually subset one another; all data in one level is also found in the level below, and all data in the lower level is found in the one below it.

The memory hierarchy of a computer system starts at the processor level with its internal registers. These are the fastest and easiest for the processor to access. Next in line is the Level 1 (L1) cache stored on the same die as the processor. This level of cache also has a very low latency since it is operating at the same speed as the chip. The Level 2 cache, which is the next level in the hierarchy, can be on the same package as the CPU or on the mainboard. In either case it has a higher latency since data must come through the memory bus into the CPU. Main memory is the next level and is much larger (by many orders of magnitude) than L2 cache. Programs that are currently running are stored in main memory and are accessed through the memory bus when they are needed. Hard disk is generally considered the lowest level on the memory hierarchy chain. This level is the largest and by far the slowest level since it involves an actual physical movement of the read head. Figure 1-5 gives a visual depiction of the memory hierarchy.

[Figure 1-5: Memory Hierarchy. Packaging levels shown: chip, board, box, room. Memory levels shown: cache, main memory, secondary memory.]

Table 1 [HENPAT96] shows the range of sizes and access times of each level in the memory hierarchy.

Level                       1                2            3                  4
Called                      Registers        Cache        Main Memory        Disk Storage
Typical Size                <1 KB            <4 MB        <4 GB              >1 GB
Implementation Technology   CMOS or BiCMOS   CMOS SRAM    CMOS DRAM          Magnetic disk
Access Time (ns)            2-5              3-10         80-400             5,000,000
Bandwidth (MB/sec)          4000-32,000      800-5000     400-2000           4-32
Managed by                  Compiler         Hardware     Operating System   Operating System/User
Backed by                   Cache            Main Memory  Disk               Tape/CD

Table 1: Range of sizes and access times in each level in the memory hierarchy
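To see how the access times in Table 1 combine, the sketch below computes an average access time for a chain of levels. The model (every access that reaches a level pays that level's access time, and a miss falls through to the next level) and the hit rates in the usage note are assumptions for illustration; the table itself gives no hit rates.

```python
def average_access_time_ns(levels):
    """Average access time for a hierarchy listed from fastest to slowest level.

    `levels` is a list of (access_time_ns, hit_rate) pairs. The last level is
    expected to have hit_rate 1.0 so that every access is eventually served.
    """
    total = 0.0
    fraction_reaching = 1.0  # fraction of accesses that get as far as this level
    for access_time_ns, hit_rate in levels:
        total += fraction_reaching * access_time_ns
        fraction_reaching *= 1.0 - hit_rate
    return total


if __name__ == "__main__":
    # Access times taken from the upper ends of the Table 1 ranges; the 95%
    # cache hit rate is a hypothetical value.
    cache_then_memory = [(10.0, 0.95), (400.0, 1.0)]
    print(average_access_time_ns(cache_then_memory))
```

Under these assumed numbers a small miss fraction still contributes a large share of the average, which illustrates why the hierarchy aims for speed close to the fastest level at a cost close to the cheapest.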

The need for cache is due to the fact that CPU performance has advanced faster than memory performance. CPU performance improved 35% per year until 1986 and 55% per year since, while memory performance improved only 7% per year. With this in mind, cache has proven to be a very effective way to improve overall system performance.

Figure 1-6 shows the speed of the dot product of two vectors on the cache based RS/6000-980. For vector lengths greater than 2000 the cache cannot accommodate all relevant data and the performance drops as more data has to be transported to/from memory.
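The annual improvement rates quoted above compound, which is what makes the gap between CPU and memory performance grow so quickly. The following sketch works through that arithmetic; the function names are illustrative, and the default rates are the 55% and 7% per-year figures from the text.

```python
def relative_performance(annual_rate, years):
    """Performance relative to year zero after compounding `annual_rate` for `years`."""
    return (1.0 + annual_rate) ** years


def cpu_memory_gap(years, cpu_rate=0.55, memory_rate=0.07):
    """Ratio of CPU improvement to memory improvement after `years` years."""
    return relative_performance(cpu_rate, years) / relative_performance(memory_rate, years)


if __name__ == "__main__":
    for years in (1, 5, 10):
        print(years, round(cpu_memory_gap(years), 1))
```

Under these rates the ratio grows by a factor of about 1.45 each year, so after a decade the processor has pulled ahead of memory by a factor on the order of forty.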

[Figure 1-6: Cache Effect on a Systems Performance. Speed of the dot product plotted against vector length, 2000 to 10000.]

While each of these enhancements individually can increase processor performance, modern processors use them all to make an extremely advanced design. These enhancements, coupled with VLSI technology advancements, have made it possible for computing power to grow at an exponential rate.

1.3 Flynn's Classification of Computer Architectures

All uniprocessor systems follow the von Neumann model. The von Neumann architecture is characterized by a CPU and central memory system, with instructions and data being read from memory. After an instruction has been read, the instruction is decoded, any relevant operands are fetched from memory, the instruction is executed, and the result is stored back in memory. The single data path between the CPU and memory over which both instructions and data must pass, and the sequential nature of instruction execution, together limit the performance possible from the computer. This is sometimes known as the von Neumann bottleneck. It is alleviated in uniprocessors by pipelining and superscalar design enhancements.

One form of classification for von Neumann machines is based on the number of instructions that can be executed at any one time and on the number of chunks of data that can be operated on at a time. In 1972 Michael Flynn introduced a classification of various computer architectures based on notions of instruction and data streams. The number of instructions is given as either SI for single instruction or MI for multiple instruction, and the number of pieces of data is given as either SD for single data or MD for multiple data. Machines can thus be classified as SISD, MISD, SIMD or MIMD.
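The naming rule just described is mechanical enough to write down directly. The sketch below is only an illustration of that rule; the function name and the choice to describe a machine by two stream counts are assumptions made for the example.

```python
def flynn_class(instruction_streams, data_streams):
    """Return the Flynn class name for the given numbers of concurrent streams."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("a machine needs at least one instruction and one data stream")
    instruction_part = "SI" if instruction_streams == 1 else "MI"
    data_part = "SD" if data_streams == 1 else "MD"
    return instruction_part + data_part


if __name__ == "__main__":
    print(flynn_class(1, 1))  # a uniprocessor
    print(flynn_class(1, 8))  # one control unit driving eight arithmetic units
    print(flynn_class(4, 4))  # four processors, each with its own instructions and data
```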

1.3.1 SISD

The classical von Neumann machine can be regarded as a single-instruction-single-data machine in that at any one time only a single instruction is being executed, and only a single piece of data is being operated upon. This is where part of the problem arises, since we often want to perform the same instruction on many different pieces of data, and the von Neumann machine requires us to fetch the same instruction many times, once for each piece of data. In fact the situation is much worse since a von Neumann machine will usually require us to create a loop, and so we will need to execute many instructions for each piece of data. This can slow the machine down many times over what the arithmetic unit is capable of performing.

1.3.2 MISD

The multiple instruction single data (MISD) architecture is the most uncommon one. In this architecture, the same data stream flows through a linear array of processors, executing different instructions on the stream. This kind of architecture is also known as a systolic array for pipelined execution of specific algorithms.

1.3.3 SIMD

For problems in which the same operation needs to be performed on many pieces of data, particularly those involving vectors and arrays, SIMD (single-instruction multiple-data) architectures are often capable of high speeds. A single CPU controls many arithmetic units, each of which operates on its own data. Each arithmetic unit executes the same instruction as determined by the CPU, but uses data found in its own memory. Thus all the elements of two vectors could be added together simultaneously, increasing the speed of the operation many times over a SISD machine.

In practice, the provision of many arithmetic units is expensive, particularly since many of them will not be in use at any given time. Even if a large number of arithmetic units are provided, the size of vectors and arrays will rarely be a multiple of the number of arithmetic units and so some inefficiency in the use of the arithmetic units will arise.

A more effective use of hardware can be obtained by pipelining the arithmetic unit. A hardware floating point accelerator will already contain dedicated hardware for each part of the calculation of a floating point operation. By pipelining the use of this hardware, significant improvements can be made in processor performance. This technique will not give as high a performance as a true SIMD machine, but the improvements can be significant.

1.3.4 MIMD

The most general form of von Neumann architecture is the multiple-instruction-multiple-data machine. A MIMD machine is usually a number of separate processors connected together through some interconnection network. The actual format of interconnection between the processors can take many forms, depending on the type of problem which the machine is designed to solve. This is the most common architecture chosen for multiple processor machines because modern processors have the control logic for parallel systems built in. Therefore, this is attractive since software, replacement parts and additions to the system are easily accessible.

1.4 The Quest for a Mainstream Supercomputer Architecture

As stated previously, CPU performance advancements have come from two main areas: VLSI and computer architecture improvements. It seems certain that the advancements in VLSI technologies are hitting their limits. Also, most architectural enhancements have been implemented in current designs. With this in mind, the future of mainstream computing is in need of an alternative computer platform. This alternative lies in parallel processing. Parallel processing involves utilizing more than one CPU in a computer system, working cooperatively to achieve increased performance. As shown in the previous section, there are four architectures that could be used to implement parallel machines. It is believed that the appropriate choice for future machines will be of the shared memory MIMD "tightly-coupled" variety. These systems will usually contain between 2 and O(10) CPUs on a single system board, with uniformly shared memory between the processors and an interconnection network on the board. Boards of this nature are referred to as "tightly-coupled" due to the fact that the processors lie close to each other on the same system board. Systems created in this manner are referred to as SMP machines.

1.4.1 SMP

An SMP node contains several identical processors, each typically with its own on-chip cache and a larger off-chip cache, which have uniform access to a shared memory and other resources such as the network interface. Figure 1-7 shows a block diagram of a symmetric multiprocessing system.

Figure 1-7: Block Diagram of an SMP System with 4 Processors

In this scenario there are four processors, each of which has its own local L2 cache outside of the CPU in addition to the L1 cache inside of the processor. In SMP systems each processor shares the same memory image. This means that if two different processors accessed the same memory location, they would receive identical values.

Some important characteristics of SMPs include:

High-Speed Memory Bus - Since several processors need to get access to main memory, a dedicated, high-throughput memory bus is required. Design of the memory bus is critical in producing an efficient SMP architecture.

Separate Secondary Cache - Each processor in the system has its own secondary (level 2) cache. The provision of separate caches for each processor requires complex logic in the cache controller to make sure that a processor never works on data that has been updated in another processor's cache. This problem is addressed through cache coherence protocols that make sure the most recent value in the processor's cache corresponds to the data in memory. The primary advantage of a dedicated-cache design is the ability to increase the number of processors in a system without saturating the memory bus. This approach seems to be the most popular for high-end multiprocessor servers because it ensures optimum performance even when a system is scaled to its maximum configuration. The size of the cache itself is also relevant to performance. As a general rule, the larger the secondary cache, the better an SMP system will scale as extra processors are added. Figure 1-8 shows the effect in TPS (transactions per second) of on-chip cache in an SMP system with two and four processors.

[Figure 1-8 plot: transactions per second versus mix name, with one series each for 4 CPU 1M L2, 4 CPU 512K L2, 2 CPU 1M L2, and 2 CPU 512K L2]

Figure 1-8: Cache effect on an SMP system

I/O to Memory Bus Bridge - In systems today the I/O bus interfaces with the memory bus rather than directly to a CPU. This creates even more contention in SMP systems since the CPUs must go through it to access resources. Therefore a high-speed I/O to Memory Bus Bridge is required.

Multiprocessor systems will be the main thrust once VLSI and architectural advancements have reached the end of their lifetime. These systems will be found in homes and businesses alike.

The MIMD architecture seems to be the future of mainstream high performance computing. However, it involves system design complexity. A major obstacle in these systems is cache coherence. Since there are multiple processors working cooperatively on a single or multiple tasks, data is constantly being shared between the processors. When one processor makes a change to some piece of shared data (currently stored in its local cache), the other processors need to know about it in case they will need to use the same piece of data. If other processors are not informed immediately, the value for the data that they use may not be the most current. This problem is called cache consistency.

To alleviate this problem current systems use a cache coherence protocol to ensure that the data in a processor's local cache is always up to date. This extra processing and memory bus access results in a large amount of overhead for SMP systems. Unfortunately, there are no current implementations that can eliminate this overhead. Instead current systems aim to deal with the problem in different ways, resulting in a large overhead due to the coherence protocols used.

This thesis will present a new architecture for multiple processor systems, which removes the cache coherence protocol required in shared memory MIMD architectures.

In chapter 2, we will investigate cache coherence and the various approaches to cope with it. We follow by focusing chapter 3 on analyzing the current MIMD architectures versus the proposed architecture. Then in chapter 4 we will compare the benefits gained from using the new architecture along with the changes that need to be made for it to be implemented feasibly. To this end, a data rate model of SMP systems is employed to illustrate the performance of each of the architectures. Finally, chapter 5 will conclude the thesis and present directions for future work.

2 Cache Coherence

As discussed in section 1.2.4, L2 cache exists between the CPU and main memory in a computer system. The purpose of L2 cache is to further reduce effective memory access time by reducing the L1 cache miss penalty, since main memory's speed is much slower than that of the CPU's internal registers and L1 cache.

In uniprocessor systems, cache is easily implemented with very little added design overhead and complexity. When the processor needs information that does not exist in its internal registers or L1 cache, it checks for the data in the L2 cache. If the data does not exist in L2 cache, then the data is read from main memory, which may in turn need to go to a mass storage device to retrieve the information, generating a page fault. If a piece of data is found in a cache it is considered a cache hit; otherwise it is a cache miss. The cache miss ratio is simply one minus the cache hit ratio, which is the ratio of the number of items that are found in cache to the number of items requested. Once the data is found in memory it is transferred to L2 cache in the form of a cache block. When the processor is finished using a piece of data, it is updated in L1 and L2 cache. Main memory updates depend on the actual write policy used, either write-through or write-back, discussed in section 2.1.4. This method of program execution is the backbone of all uniprocessors following the von Neumann model.
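As a small worked illustration of the hit and miss ratios defined above, the following sketch counts hits for a list of requested items against a fixed set of cached items. The function name and the simplification that the cache contents do not change during the requests are assumptions made only for this example.

```python
def hit_and_miss_ratio(requested, cached):
    """Return (hit ratio, miss ratio) for a sequence of requests.

    Assumes, for illustration, that the set of cached items stays fixed
    while the requests are made.
    """
    if not requested:
        raise ValueError("at least one request is needed")
    hits = sum(1 for item in requested if item in cached)
    hit_ratio = hits / len(requested)
    # The miss ratio is one minus the hit ratio.
    return hit_ratio, 1.0 - hit_ratio

# Three of the four requested items are in the cache.
ratios = hit_and_miss_ratio([1, 2, 3, 4], {1, 2, 3})
```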

A sufficiently fast memory bus must be implemented between L2 cache and main memory to meet the demand for instructions and data by the processor.

2.1 Cache Basics

Before the cache coherence problem can be detailed it is important to have a good understanding of how caches handle data. The memory hierarchy of computers breaks information up into blocks of data. These blocks of data are moved in and out of cache as needed. An entire block of data (which includes many memory locations) is moved at a time, due to the principle of spatial locality. Spatial locality states that items whose addresses are near one another tend to be referenced close together in time [HENPAT96]. Therefore, when a new block is brought into the cache it is beneficial to bring in the surrounding data also, since they will most likely be needed in the near future. The design of cache subsystems involves four major issues that need to be addressed.

2.1.1 Cache Organization

The organization of a cache dictates where a block can be placed when it is brought in from main memory. There are three cache organizations used today: direct mapped, fully associative, and set associative. Figure 2-1 visually describes each of the three organizations. Their descriptions are contained in the following sections.

[Figure 2-1 illustration: an 8-block cache and a 32-block memory (block frame addresses 0 to 31). Fully associative: block 12 can go anywhere. Direct mapped: block 12 can go only into block 4 (12 mod 8). Set associative: block 12 can go anywhere in set 0 (12 mod 4), with the cache divided into sets 0 to 3.]

Figure 2-1: Cache Organization Schemes

2.1.1.1 Direct Mapped Cache

In a direct mapped cache, each block has only one place where it can go in the cache. The mapping for a block in a direct mapped cache is shown in Equation 1.

(block address) MOD (number of blocks in cache)

Equation 1: Mapping for Block in a Direct Mapped Cache
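Equation 1 can be checked against the example in Figure 2-1, where memory block 12 maps into block 4 of an 8-block cache. The sketch below is a minimal illustration; the function name is hypothetical.

```python
def direct_mapped_frame(block_address: int, num_cache_blocks: int) -> int:
    """Equation 1: (block address) MOD (number of blocks in cache)."""
    return block_address % num_cache_blocks

# Block 12 in an 8-block cache lands in cache block 4.
frame = direct_mapped_frame(12, 8)

# Memory blocks 4, 12, 20 and 28 all compete for that same cache block.
competitors = [b for b in range(32) if direct_mapped_frame(b, 8) == 4]
```

The second computation shows why a direct mapped cache can suffer conflicts: several memory blocks share a single possible location.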

2.1.1.2 Fully Associative

In a fully associative cache, each block can appear anywhere in the cache. Placement in a fully associative cache is simple, because no mapping function is needed; the cost is that a lookup must compare the requested tag against every block in the cache.

2.1.1.3 Set Associative

Finally, set associative caches limit the number of places a block can be placed. The blocks in the cache are divided into groups called sets. Each block in memory is mapped into a single set, generally using Equation 2.

(block address) MOD (number of sets in the cache)

Equation 2: Mapping for Block in a Set Associative Cache

Once a memory block has been assigned to a set it can be placed anywhere inside the set.
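The following sketch applies Equation 2 and lists the cache blocks a memory block may occupy. It assumes, as in Figure 2-1, that each set occupies consecutive cache blocks; the function name and the `ways` parameter (blocks per set) are introduced here for illustration only.

```python
def candidate_frames(block_address: int, num_cache_blocks: int, ways: int) -> list:
    """Cache blocks that may hold the given memory block.

    Assumes set s occupies cache blocks s*ways .. s*ways + ways - 1.
    """
    if num_cache_blocks % ways != 0:
        raise ValueError("ways must divide the number of cache blocks")
    num_sets = num_cache_blocks // ways
    set_index = block_address % num_sets          # Equation 2
    return [set_index * ways + w for w in range(ways)]

# 8-block cache, 2 blocks per set (4 sets): block 12 goes to set 0.
two_way = candidate_frames(12, 8, 2)
# One block per set behaves like a direct mapped cache.
direct = candidate_frames(12, 8, 1)
# A single set spanning the whole cache behaves like a fully associative cache.
fully = candidate_frames(12, 8, 8)
```

Seen this way, direct mapped and fully associative caches are the two extremes of set associativity.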

2.1.2 Cache Block Lookup

Now that it is understood how caches get data into them, the process of reading from a cache will be detailed next. Each block in a cache has an address tag associated with it. When a processor wishes to retrieve data from the cache it uses the block's address tag to reference the data. The tag of each block (the actual number of blocks checked depends upon the organization of the cache) is compared against the tag requested by the CPU. Figure 2-2 shows the layout of an address for a piece of data in a cache.

[Figure 2-2 layout: the address is divided into a block address, made up of a tag and an index, followed by a block offset.]

Figure 2-2: A Typical Cache Block Address

The index portion of the address is used to select the set in the cache, while the block offset is used to select the actual piece of data in the cache block. The tag portion of the address is compared to the processor's requested tag.

A fully associative cache would have no index field since a block is not restricted to any single set. Note that in a set associative cache the index field would be used to select the set that contains the data, while in a direct mapped cache the index field would select the actual block containing the data.
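The address layout described above can be sketched as a bit-field split. The field widths `offset_bits` and `index_bits` are hypothetical parameters chosen for the example; the thesis does not specify particular widths.

```python
def split_address(address: int, offset_bits: int, index_bits: int):
    """Split an address into (tag, index, block offset).

    The low offset_bits select the data within a block, the next
    index_bits select the set (or the block, if direct mapped), and
    the remaining high bits form the tag.
    """
    offset = address & ((1 << offset_bits) - 1)
    index = (address >> offset_bits) & ((1 << index_bits) - 1)
    tag = address >> (offset_bits + index_bits)
    return tag, index, offset

# Binary 1011 01 101 with a 3-bit offset and a 2-bit index.
fields = split_address(0b101101101, 3, 2)
# With index_bits = 0 (fully associative) everything above the offset is tag.
fully_associative_fields = split_address(0b101101101, 3, 0)
```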

2.1.3 Replacement Strategy

The replacement strategy dictates which block is replaced when a cache miss occurs. The actual process of selecting the block to be replaced is done by the cache controller. A cache miss occurs when the tag requested by the CPU is not found in the cache. When a miss occurs, a block in the cache must be replaced with a block from the next level of the memory hierarchy. In a direct mapped cache there is no need for a replacement strategy since there is only one location that a block is capable of going into. There are many replacement strategies that are used in cache controllers. Three replacement strategies are Least-Recently Used, Random, and First-In, First-Out (FIFO).

2.1.3.1 Least Recently Used (LRU)

The LRU replacement strategy records all accesses to cache blocks. When a cache miss occurs the cache block that is replaced is the one that has gone unused for the longest amount of time. This follows along the same lines as temporal locality. Temporal locality states that a cache block that has been recently used is likely to be used again in the near future. LRU replacement can become extremely expensive, especially in large caches, since all accesses need to be recorded internally in the cache.

2.1.3.2 Random

The simplest strategy to employ is a random replacement strategy that is spread uniformly across the cache. When a cache miss occurs a random block number is selected and the selected block is replaced. Studies have found that while the random replacement strategy may not be the most intuitive strategy, its results are quite impressive. The attractiveness of a random replacement strategy is in the ease of implementation.

2.1.3.3 First-In, First-Out (FIFO)

The FIFO replacement strategy replaces the cache block that has been in cache for the longest period of time. This strategy has proven to yield worse results than the LRU and random replacement strategies.
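The difference between the LRU and FIFO strategies can be seen on a short reference trace. The sketch below is a simplified model, assuming a single fully associative cache of `capacity` blocks; the function name and the trace are made up for illustration, and random replacement is left out because its result varies from run to run.

```python
from collections import OrderedDict

def count_misses(references, capacity, policy):
    """Count cache misses for a block reference trace under LRU or FIFO."""
    if policy not in ("LRU", "FIFO"):
        raise ValueError("policy must be 'LRU' or 'FIFO'")
    cache = OrderedDict()  # first entry is always the next victim
    misses = 0
    for block in references:
        if block in cache:
            if policy == "LRU":
                # A hit makes the block the most recently used one.
                cache.move_to_end(block)
            # Under FIFO a hit does not change the eviction order.
            continue
        misses += 1
        if len(cache) >= capacity:
            cache.popitem(last=False)  # evict the victim at the front
        cache[block] = True
    return misses

trace = [1, 2, 3, 1, 4, 1, 2]
lru_misses = count_misses(trace, 3, "LRU")
fifo_misses = count_misses(trace, 3, "FIFO")
```

On this particular trace LRU keeps the frequently reused block 1 in the cache while FIFO evicts it, so FIFO incurs one more miss. A single trace is only an illustration and does not by itself establish the general ranking of the strategies.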

2.1.4 Write Policy

The final aspect of caches is the write policy. When a data value has been modified in the processor registers, it is immediately written back to L1 cache and L2 cache. Updating the data in main memory depends on the particular write policy being employed in the system. There are two write policies that are used in cache design today.

2.1.4.1 Write-Through

In a write-through cache, when data is written to L2 cache it is also written to main memory simultaneously. Therefore, main memory always contains an exact copy of the data that is in the L2 cache of the processor. Write-through caches put a large amount of overhead onto the memory bus since it is not always necessary to have an updated copy of the data in main memory. For this reason, write-through caches are not widely used. However, write-through caches are extremely easy to implement since all writes are sent to L2 and main memory at the same time and no additional logic is needed in the cache.

2.1.4.2 Write-Back

In write-back caches, data is written back only to the L2 cache. When the cache becomes full or the cache block is being replaced, the data is then updated in main memory. Therefore, all writing to main memory from L2 cache is done when a cache miss is encountered. The advantage of the write-back cache is that all writes from the processor occur locally at the speed of the cache memory (much faster than main memory). Since main memory is only updated when the cache block is replaced, the memory bandwidth requirements of a write-back policy are much more lenient than those of the write-through policy. This frees up bandwidth for other devices in the system and, most importantly, for other processors in multiprocessor systems.
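The bandwidth difference between the two write policies can be sketched by counting main memory writes for a sequence of stores. The model below is deliberately simplified and rests on assumptions not made in the thesis: the cache is large enough that no block is replaced during the sequence, and every modified block is written to memory exactly once when it is finally replaced.

```python
def memory_writes(store_addresses, block_size, policy):
    """Count writes to main memory for a list of store addresses."""
    if policy == "write-through":
        # Every store is also sent to main memory.
        return len(store_addresses)
    if policy == "write-back":
        # Each modified block is written once, when it is replaced.
        dirty_blocks = {addr // block_size for addr in store_addresses}
        return len(dirty_blocks)
    raise ValueError("unknown policy: " + policy)

stores = [0, 4, 8, 0, 4, 64]   # hypothetical byte addresses
through = memory_writes(stores, 32, "write-through")
back = memory_writes(stores, 32, "write-back")
```

In this example six stores touch only two blocks, so the write-back count is a third of the write-through count. The ratio in a real system depends on how often the same blocks are rewritten before they are replaced.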

2.2 Multiprocessor Cache Coherence

In a multiprocessor system, data inconsistency can occur in adjacent levels of memory or within the same level. Therefore, it is possible for the current data in main memory to differ from its most recent value, since the most recent value would be stored only in the processor's local cache (depending on the write policy, it may also be in main memory). In addition, L2 caches of other processors may contain even older data values of the same memory location. This is not possible in uniprocessor systems since there is only one processor in the system that will ever modify the data. There are two possible ways inconsistent data can appear in a cache: data sharing or process migration.

2.2.1 Data Sharing

Since data in a multiprocessor system is commonly shared between many processes executing on different processors, it is possible for the private caches on each processor to contain different copies of the same shared data. Figure 2-3 [HWANG93] shows how inconsistency can occur when dealing with shared data.

[Figure 2-3 illustration: processors P1 and P2, each with a private cache, connected by a bus to shared memory holding X, shown in three panels: before update, write-through, and write-back.]

Figure 2-3: Data Inconsistency due to Data Sharing
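The data sharing problem can be sketched in a few lines. The model below is an illustration only: two private caches and a shared memory are represented as dictionaries, no coherence mechanism is modeled, and the names `p1_writes`, `P1` and `P2` are chosen for the example.

```python
def p1_writes(policy, new_value="X'"):
    """Return (P1 cache, P2 cache, shared memory) values after P1 writes X.

    Both caches start with a copy of X. No coherence protocol is
    modeled, so P2's copy is never touched.
    """
    memory = {"X": "X"}
    caches = {"P1": {"X": "X"}, "P2": {"X": "X"}}
    caches["P1"]["X"] = new_value
    if policy == "write-through":
        memory["X"] = new_value      # memory is updated with the cache
    elif policy != "write-back":     # write-back leaves memory unchanged
        raise ValueError("unknown policy: " + policy)
    return caches["P1"]["X"], caches["P2"]["X"], memory["X"]

after_write_through = p1_writes("write-through")
after_write_back = p1_writes("write-back")
```

With write-through only P2's cached copy is stale; with write-back both P2's copy and main memory are stale until the modified block is written back.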

2.2.2 Process Migration

In multiprocessor systems, processes frequently migrate from one processor to another. Unfortunately, shared data that is residing in a processor's local cache does not migrate with the process. Therefore, it is possible for a processor to update a shared data value in its L2 cache, get interrupted, and hand the process over to the OS, which in turn hands it over to another processor. This is known as process migration and is a common cause of data inconsistencies. Figure 2-4 [HWANG93] visually depicts how data becomes inconsistent due to process migration.

[Figure 2-4 illustration: processors P1 and P2, each with a private cache, connected by a bus to shared memory holding X, shown in three panels: before migration, write-through, and write-back.]

Figure 2-4: Data Inconsistency due to Process Migration

2.3 Ways to Handle Cache Coherence

Multiprocessor systems have widely varying ways to handle inconsistent data in memory. In massively parallel processors (MPPs) a directory-based scheme is used, while in SMP systems snoopy bus protocols are used.

2.3.1 Directory Based Protocols

Directory based coherence protocols are commonly used in large scale, distributed memory systems where a fast interconnection network exists between each of the nodes. In a distributed directory scheme each memory unit has a directory structure which contains listings of which caches currently have copies of its memory blocks. When a read miss occurs in a cache, a request message is sent to the memory unit from which it received the cache block. The memory unit then updates its value from the cache with the most current copy and sends a copy to the requesting cache. Central directory based schemes have a main directory which contains all the information relating to a memory block's location in a processor's cache. Lookups are done using this central directory only. Contention on the central directory has limited the adoption of this scheme in actual systems.

2.3.2 Snoopy Bus Protocols

SMP systems use shared memory connected to a high-speed memory bus. These systems also follow the von Neumann model of execution, allowing all the CPUs to access the main memory asynchronously with respect to each other. In parallel applications, data sharing between processes running on different CPUs requires advanced features to keep data coherent.

Since caches are used in these designs, there will be data consistency problems in main memory since the most recent data values would be stored in caches. To this end, SMP systems employ a cache coherence mechanism to ensure that the value a processor is reading from its cache is the most current one. Cache coherence requires both hardware and software support to achieve acceptable levels of performance.

The hardware support for cache coherence comes in the form of a snoopy bus, operating under a snoopy-bus protocol. When multiple private caches are tied together on a single bus, the methods used to ensure consistency entail changes to the write policies. The snoopy bus allows all processors in the system to monitor the traffic on the memory bus. Each processor is allowed to take appropriate action depending upon the write policy. Figure 2-5 shows an SMP system with a shared memory variable loaded into each of the processors' local caches.

[Figure 2-5 illustration: shared memory connected by a bus to the caches of the processors.]

Figure 2-5: Initial State of the Memory System

In the next sections we will see how the caches and memory are modified to cope with cache coherence.

2.3.2.1 Write-Invalidate Policy

In a write-invalidate policy, when a processor writes a value to a cache block in its private cache it also sends an invalidation signal to all other caches which contain the cache block, including main memory. This signal notifies them that the data has changed.

[Figure 2-6 illustration: shared memory, bus, caches, and processors after the write.]

Figure 2-6: State of the System after a Write-Invalidation Operation

Figure 2-6 [HWANG93] shows the state of the system after a write-invalidate operation by P1. Since P1 has a write-through cache, main memory contains the updated value that it received over the snoopy bus from P1.
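A write-invalidate operation of the kind shown in Figure 2-6 can be sketched as follows. This is a simplified software model, not a description of real bus hardware: caches are dictionaries, an invalidated block is simply removed from a cache, and the writer is assumed to use a write-through cache as in the figure. The function names are hypothetical.

```python
def write_invalidate(caches, memory, writer, address, value):
    """Writer updates its copy; every other cached copy is invalidated."""
    for i, cache in enumerate(caches):
        if i != writer:
            cache.pop(address, None)   # snooping cache drops its copy
    caches[writer][address] = value
    memory[address] = value            # writer is assumed write-through

def read(caches, memory, reader, address):
    """Read through the reader's cache, refilling from memory on a miss."""
    cache = caches[reader]
    if address not in cache:
        cache[address] = memory[address]
    return cache[address]

caches = [{"X": 1}, {"X": 1}, {"X": 1}]   # three processors share X
memory = {"X": 1}
write_invalidate(caches, memory, 0, "X", 2)
value_seen_by_p2 = read(caches, memory, 1, "X")
```

After the write only the writer holds the block. The next read by another processor misses and fetches the updated value from memory.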

2.3.2.2 Write-Update Policy

In a write-update policy, when a processor writes a value to a cache block in its private cache it also updates other caches (if they contain the cache block) and main memory with the new value. The update is done using the features of the snoopy bus, which allows other processors to monitor bus activity. Figure 2-7 [HWANG93] shows the state of an SMP system after the write-update operation, in which all caches now contain the updated value.

[Figure 2-7 illustration: shared memory, bus, caches, and processors after the update.]

Figure 2-7: State of the Memory System after a Write-Update Operation

The write-update policy is extremely effective at ensuring data consistency. However, it places an unnecessarily large amount of traffic on the memory bus since not all processors may need the updated value.
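The write-update policy can be sketched in the same simplified style, with caches and memory modeled as dictionaries. The function name and the returned count of updated caches are assumptions made for this illustration.

```python
def write_update(caches, memory, writer, address, value):
    """Writer updates its copy, main memory, and every other cached copy.

    Returns the number of other caches that were updated.
    """
    caches[writer][address] = value
    memory[address] = value
    updated = 0
    for i, cache in enumerate(caches):
        if i != writer and address in cache:
            cache[address] = value     # snooping cache takes the new value
            updated += 1
    return updated

caches = [{"X": 1}, {"X": 1}, {}]      # the third cache holds no copy of X
memory = {"X": 1}
others_updated = write_update(caches, memory, 0, "X", 2)
```

Every write is broadcast on the bus whether or not the other processors go on to read the value, which is the source of the extra traffic noted above.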

2.3.2.3 Write-Once Protocol

James Goodman in 1983 proposed a cache coherence protocol for bus-based multiprocessors. In order to reduce unnecessary bus traffic, the very first write of a cache block uses a write-through policy. This will keep main memory consistent with the local cache after the first write. After the first write, memory is updated using the write-back policy [GOOD83]. Figure 2-8 [HWANG93] details Goodman's protocol, which uses 4 states to describe its execution.

[Figure 2-8 state diagram: transitions labeled P-Read, P-Write, Read-inv, and Write-inv/Read-inv.]

Figure 2-8: State Diagram for the Write-Once Protocol

Each transaction in the figure represents extra overhead that is placed on the memory bus. This traffic reduces the amount of utilization of the bus by other processors, which in turn degrades the overall performance of the machine.
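The four-state behavior of the write-once protocol can be sketched as a transition table for a single cache block. The state names (Invalid, Valid, Reserved, Dirty), the Read-blk event, and the individual transitions below follow common textbook presentations of Goodman's protocol and are not spelled out in this section, so they should be treated as assumptions to be checked against [GOOD83] and Figure 2-8.

```python
# Events starting with "P-" come from the local processor; the others
# are bus operations snooped from other caches.
TRANSITIONS = {
    "INVALID": {"P-Read": "VALID", "P-Write": "DIRTY",
                "Read-blk": "INVALID", "Write-inv": "INVALID",
                "Read-inv": "INVALID"},
    "VALID": {"P-Read": "VALID",
              "P-Write": "RESERVED",    # first write goes through to memory
              "Read-blk": "VALID", "Write-inv": "INVALID",
              "Read-inv": "INVALID"},
    "RESERVED": {"P-Read": "RESERVED",
                 "P-Write": "DIRTY",    # later writes stay in the cache
                 "Read-blk": "VALID", "Write-inv": "INVALID",
                 "Read-inv": "INVALID"},
    "DIRTY": {"P-Read": "DIRTY", "P-Write": "DIRTY",
              "Read-blk": "VALID", "Read-inv": "INVALID"},
}

def next_state(state, event):
    """Return the next state of a cache block, or raise on an undefined pair."""
    try:
        return TRANSITIONS[state][event]
    except KeyError:
        raise ValueError("no transition for %s on %s" % (state, event))

def run(events, state="INVALID"):
    """Apply a sequence of events and return the visited states."""
    visited = []
    for event in events:
        state = next_state(state, event)
        visited.append(state)
    return visited

history = run(["P-Read", "P-Write", "P-Write", "Read-blk", "Write-inv"])
```

In this sketch the first local write moves a block from Valid to Reserved, matching the write-through of the first write, and a second local write moves it to Dirty, after which memory is updated only by write-back.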

2.4 Consistency Models

Parallel applications executing on a parallel machine require data to be used in multiple CPUs. Because of this it is very important to make other processors aware of any changes to data of which they may also have a copy. Consistency models specify the order by which the events from one process should be observed by other processes in the machine [HWANG93]. The two main consistency models are sequential and weakened consistency.

2.4.1 Sequential Consistency

Sequential consistency is when "the result of any execution [of the program] is the same as if the operations of all processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program" [LAMP79]. Since data is loaded and stored identically to uniprocessor systems, the coherence mechanism must be able to respond rapidly to changes.

[Figure 2-9 illustration: processors connected through a switch to a single-port memory, which together form the shared memory system.]

Figure 2-9: The Sequential Consistency Model

Figure 2-9 [HWANG93] shows how the sequential consistency model can be described. Each processor is connected to memory through the same switch, ensuring that no processor can update main memory out of order.

The single-ported memory ensures that there is only one memory access operation in progress at any one time. Therefore some queuing mechanism is needed to order and serialize the memory references while they wait to be serviced.

In 1992 Sindhu, Frailong, and Cekleov specified that for sequential consistency to exist the following five axioms must be true [SINDHU92]:

1) A load by a processor always returns the value written by the latest store to the same location by other processors.

2) The memory order conforms to a total binary order in which shared memory is accessed in real time over all loads and stores with respect to all processor pairs and location pairs.

3) If two operations appear in a particular program order, then they appear in the same memory order.

4) The swap operation is atomic with respect to other stores, meaning that no other store can intervene between the load and store parts of a swap.

5) All stores and swaps must eventually terminate.

The sequential consistency model is enforced in hardware, on the fly. In this model, all memory accesses are atomic and tightly ordered to ensure the accuracy of the model. Therefore, all memory accesses must be global and a processor cannot issue another memory access until the most recent shared memory access by a processor has been performed globally. These mechanisms ensure that the correct program order is observed.
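The effect of sequential consistency can be illustrated by enumerating every interleaving of two short programs that preserves each processor's program order. The two programs below (each processor stores to one location, then loads the other) are a standard illustrative example and are not taken from the thesis; all names are hypothetical.

```python
from itertools import combinations

def sc_outcomes():
    """Return the set of (r1, r2) results allowed by sequential consistency."""
    p1 = [("store", "x", 1), ("load", "y", "r1")]
    p2 = [("store", "y", 1), ("load", "x", "r2")]
    total = len(p1) + len(p2)
    outcomes = set()
    # Choosing which time slots belong to P1 fixes one interleaving;
    # each processor's own operations keep their program order.
    for p1_slots in combinations(range(total), len(p1)):
        ops1, ops2 = iter(p1), iter(p2)
        memory = {"x": 0, "y": 0}
        registers = {}
        for slot in range(total):
            op, location, arg = next(ops1) if slot in p1_slots else next(ops2)
            if op == "store":
                memory[location] = arg
            else:
                registers[arg] = memory[location]
        outcomes.add((registers["r1"], registers["r2"]))
    return outcomes

allowed = sc_outcomes()
```

Under these assumptions the result in which both loads return 0 never appears, because in any single sequential order at least one store precedes the other processor's load.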

2.4.2 Weak Consistency

The sequential consistency model demands the most memory bandwidth and additional support (both hardware and software) to ensure its accuracy. To reduce the amount of bandwidth and extra work needed, various degrees of weaker consistency models have been created. The TSO weak consistency model was develop
