EC6009 UNIT-1

(1)

EC6009 ADVANCED COMPUTER

ARCHITECTURE

Review of fundamentals of CPU, Memory and IO - Trends in Technology, Power, Energy and Cost, Dependability - Performance Evaluation.

UNIT I FUNDAMENTALS OF

COMPUTER DESIGN

(2)

INTRODUCTION

• 65-70 years ago, the first general-purpose electronic computer was created.

• Today, a mobile computer costing less than $500 has more performance, more main memory and more disk storage than a computer bought in 1985 for $1 million.

• This rapid improvement has come both from advances in the technology used to build computers and from innovations in computer design.

• RISC-based machines focused on two critical performance techniques: the exploitation of instruction-level parallelism (initially through pipelining and later through multiple instruction issue) and the use of caches.

• For many applications, the highest-performance microprocessors of today outperform the supercomputer of less than 10 years ago.

• Dramatic improvement in cost-performance leads to new classes of computers.


(4)

INTRODUCTION

• The last decade saw the rise of smart cell phones and tablet computers, which many people are using as their primary computing platform instead of PCs.

• These mobile client devices are increasingly using the Internet to access warehouses containing tens of thousands of servers.

• Mainframe computers and high-performance supercomputers are all collections of microprocessors.

• Today the nature of applications is also changing. Speech, sound, images and videos are becoming increasingly important, along with the predictable response time that is so critical to the user experience.

• An inspiring example is Google Goggles.

• This application lets you hold up your cell phone to point its camera at an object; the image is sent wirelessly over the Internet to a WSC that recognizes the object and tells you interesting information about it.

• It can also read the bar code on a book cover to tell you if the book is available online and its price.

• Since 2003, single-processor performance improvement has dropped to less than 22% per year due to the twin hurdles of maximum power dissipation and the lack of more ILP.

• In 2004, Intel canceled its high-performance uniprocessor projects and joined others in declaring that the road to higher performance would be via multiple processors per chip rather than via faster uniprocessors.

(5)

REVIEW OF FUNDAMENTALS OF CPU

The functional blocks in a computer are:

1. ALU

2. Control Unit

3. Memory

4. Input Unit

5. Output Unit

• The ALU contains the electronic circuits necessary to perform arithmetic and logical operations.

• The control unit analyses each instruction in the program and sends the relevant control signals to all other units: ALU, memory, input unit and output unit.

• The program is fed into the computer through the input unit and stored in the memory. In order to execute the program, the instructions have to be fetched from memory one by one. This fetching of instructions is done by the control unit.

• After an instruction is fetched, the control unit decodes it. According to the instruction, the control unit issues control signals to the other units.

• After an instruction is executed, its result is stored in memory or temporarily in the control unit or ALU, so that it can be used by the next instruction.

• The results of a program are taken out of the computer through the output unit.

• The control unit and ALU are collectively known as the Central Processing Unit (CPU).

(6)

REVIEW OF FUNDAMENTALS OF CPU

• The physical units in a computer, such as the CPU, memory, input and output units, form the hardware.

• The compilers as well as user programs (high-level language or machine language) form the software.

• Hardware works as dictated by the software. The operating system is special software that manages the H/W and S/W.

Arithmetic and Logic Unit

The ALU has hardware circuits which perform primitive arithmetic and logical operations. The H/W sections in the ALU are:

1. Adder

2. Accumulator

3. General Purpose Register

4. Counters

5. Shifters

6. Complementer

Adder: adds two numbers and gives the result.

Accumulator: a register which temporarily holds the results of a previous operation in

(7)

REVIEW OF FUNDAMENTALS OF CPU

General Purpose Register (GPR): When an operand is stored in main memory, it takes time to retrieve it. If it is stored within the CPU, it is immediately available to the CPU. The GPRs store different types of information:

1. Operand

2. Operand address

3. Constant

Since they are used for multiple purposes, these registers are known as GPRs.

Scratch Pad Memory or Registers: During complex operations like multiplication, division etc., it is necessary to store intermediate results temporarily. For this purpose there are usually one or more scratch pad registers. These are purely internal H/W resources and are not addressable by the program.

Shifter and Complementer: The shifter provides the left and right shifts required for various operations. The complementer provides the 2's complement of binary numbers.
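The shifter and complementer operations described above can be imitated in a few lines of Python. This is a minimal sketch assuming an 8-bit word; the function names and the word size are illustrative assumptions, not taken from a particular ALU.

```python
# Sketch of the ALU shifter and complementer on a fixed-width word.
# WORD_BITS = 8 is an illustrative assumption.
WORD_BITS = 8
MASK = (1 << WORD_BITS) - 1  # 0xFF for an 8-bit word


def shift_left(x: int, n: int = 1) -> int:
    """Logical left shift; bits shifted out of the word are discarded."""
    return (x << n) & MASK


def shift_right(x: int, n: int = 1) -> int:
    """Logical right shift; zeros enter from the left."""
    return (x & MASK) >> n


def twos_complement(x: int) -> int:
    """Complementer: invert every bit (1's complement), then add 1."""
    ones = ~x & MASK
    return (ones + 1) & MASK


if __name__ == "__main__":
    print(f"{shift_left(0b00000011, 2):08b}")   # 00001100
    print(f"{shift_right(0b00001100, 2):08b}")  # 00000011
    print(f"{twos_complement(5):08b}")          # 11111011 (-5 in 8 bits)
```

Adding a number to its 2's complement and keeping only the low 8 bits gives zero, which is why a complementer in front of the adder is enough to perform subtraction.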

(8)

REVIEW OF FUNDAMENTALS OF CPU

CONTROL UNIT:

The control unit is the most complex unit in a computer. Its main functions are:

1. Fetching instructions

2. Analyzing the OPCODE

3. Generating control signals for performing various operations

H/W resources of a control unit

Program Counter or Instruction Address Counter (IAC)

The IAC contains the memory address of the next instruction to be fetched. When an instruction is fetched, the IAC is incremented so that it points to the address of the next instruction. Every instruction contains an opcode. In addition, it may contain one or more of the following:

1. Operand

2. Operand address

3. Register address

PSW Register: It contains various status bits describing the current condition of the CPU. These are known as flags. Two such flags are:

1. Interrupt Enable: When this bit is 1, the CPU will recognize interrupt requests. When this bit is 0, interrupt requests will be ignored by the CPU and they remain pending. The NMI (non-maskable interrupt) is an exception to this.

2. Overflow: When this bit is 1, it indicates that there was an overflow condition in the ALU in the previous arithmetic operation.

(9)

MEMOR AND IO

The Memory is organi8ed in to locations# Each memory location is )nown as one Memory 7ord#

Me"r) T)pes

O!/er ."p+'ers +se agne'i. ."re e"r) 8-i!e '-e presen' /a) 8e +se Sei."n/+.'"r Me"r)#

Core memory is non-volatile where semiconductor memory is volatile# semiconductor memory is of two ty!esA /R=M and R=M#

/R=M !reserves the contents of all the locations as long as the !ower su!!ly is !resent#

R=M memory can retain the content of any location only for a few milliseconds#

Ran/" A..ess an/ Se:+en'ia! A..ess Me"ries

In a R=M access time is same for all locations# 3Core and /emiconductor Memories are R=M4

In a se0uential access memory, the read or write access is se0uential# The time ta)en for accessing the *rst location is the shortest and the time ta)en for the last location is the 2ongest# 3Magnetic ta!e4

(10)

MEMOR AND IO

Memory Organi8ationA

The Memory unit consists of the following sectionsA # Memory =ddress Register 3M=R4

9# Memory ata Register 3MR4 :# Memory Control 2ogic

<# Memory cells

5or the read o!eration, the CPU does the following se0uenceA 3i4 /ends the address to M=R#

3ii4 /ends RE= signal to memory control unit#

The Memory control unit decodes the address "its and identi*es the location

to "e accessed# Then it initiates a read o!eration of the memory# The

memory ta)es some amount of time to !resent the contents of the location

in MR#

3iii4 =fter a suFcient time interval, the CPU transfers the information from

(11)

MEMOR AND IO

5or 7rite o!eration, the CPU does the following se0uenceA 3i4 /ends address to M=R#

3ii4 /ends data to MR#

3iii4 /ends 7RITE signal to memory control unit#

The Memory control unit decodes the address "its and identi*es the location 7here the write o!eration has to "e !erformed# It then routes the MR

Contents to memory and initiates the write o!eration#
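The read and write sequences described above can be mirrored in a small Python model. This is a sketch under stated assumptions: the class and function names are hypothetical, address decoding is reduced to list indexing, access and recovery delays are not modelled, and step (iii) of the read assumes the CPU takes the word from MDR, which is where the memory places it.

```python
# Toy model of the memory unit: MAR, MDR, control logic and memory cells.
# Names are hypothetical; timing is not modelled.

class MemoryUnit:
    def __init__(self, words: int):
        self.cells = [0] * words   # memory cells, one word per location
        self.mar = 0               # Memory Address Register
        self.mdr = 0               # Memory Data Register

    def control(self, signal: str) -> None:
        """Memory control logic: decode the MAR and perform the operation."""
        location = self.mar        # address decoding is just indexing here
        if signal == "READ":
            self.mdr = self.cells[location]   # present the contents in MDR
        elif signal == "WRITE":
            self.cells[location] = self.mdr   # route the MDR contents to memory
        else:
            raise ValueError(f"unknown signal: {signal}")


def cpu_read(mem: MemoryUnit, address: int) -> int:
    mem.mar = address          # (i) send the address to MAR
    mem.control("READ")        # (ii) send READ to the memory control unit
    return mem.mdr             # (iii) take the word from MDR


def cpu_write(mem: MemoryUnit, address: int, value: int) -> None:
    mem.mar = address          # (i) send the address to MAR
    mem.mdr = value            # (ii) send the data to MDR
    mem.control("WRITE")       # (iii) send WRITE to the memory control unit


if __name__ == "__main__":
    mem = MemoryUnit(16)
    cpu_write(mem, 5, 42)
    print(cpu_read(mem, 5))    # 42
```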

Memory Access Time: The time taken by the memory to supply the contents of a location, from the time it receives 'READ', is called the memory access time. Typical values: core memory 800 ns, semiconductor memory 100 ns.

Memory Cycle Time: The memory access time plus the additional recovery time (during which the memory is busy with internal operations) is known as the memory cycle time.

Auxiliary Memory

1. Floppy disk drive 2. Hard disk drive 3. Magnetic tape drive 4. CD-ROM

Input / Output Units

Common input units are keyboard, floppy disk, hard disk, magnetic tape, mouse, light pen, scanner, optical disk, etc.

Common output units are display terminal, printer, plotter, floppy disk drive, hard disk drive, magnetic tape drive, optical disk drive, etc.

(12)

CLASSES OF COMPUTERS

PERSONAL MOBILE DEVICES (PMD)

• Collection of wireless devices with multimedia user interfaces, such as cell phones and tablet computers.

• Price of a system is $100 - $1000 and price of the microprocessor is $10 - $100.

• Energy and size requirements lead to the use of Flash memory for storage instead of magnetic disks.

• Responsiveness and predictability are key characteristics for media applications.

• For example, when playing a video on a PMD, the time to process each video frame is limited, since the processor must access and process the next frame shortly.

• The memory can be a substantial portion of the system cost

(13)

CLASSES OF COMPUTERS

DESKTOP COMPUTING

• Spans from low-end netbooks that sell for $300 to high-end, heavily configured workstations that may sell for $2500.

• Since 2008, more than half of the desktop computers made each year have been battery-operated laptop computers.

• Because of this combination of performance (measured in terms of compute performance and graphics performance) and price, the newest, highest-performance microprocessors and cost-reduced microprocessors often appear first in desktop systems.

• Desktop computing tends to be reasonably well characterized in terms of applications and

(14)

CLASSES OF COMPUTERS

SERVERS:

• The role of servers grew to provide larger-scale and more reliable file and computing services.

• For servers, different characteristics are important. First, availability is critical.

• Consider the servers running ATM machines for banks or airline reservation systems.

• Failure of such a server system is far more catastrophic than failure of a single desktop, since these servers must operate seven days a week, 24 hours a day.

• The second key feature of a server system is scalability. The ability to scale up the computing capacity, the memory, the storage and the I/O bandwidth of a server is crucial.

• Servers are designed for efficient throughput. The overall performance of a server is measured in terms of transactions per minute or web pages served per second.

• The overall efficiency and cost-effectiveness of a server is determined by how

(15)

CLASSES OF COMPUTERS

CLUSTERS/WAREHOUSE SCALE COMPUTERS

• The growth of Software As A Service (SaaS) for applications like search, social networking, video sharing, multiplayer games and online shopping has led to the growth of a class of computers: clusters.

• Clusters are collections of desktop computers or servers connected by LANs to act as a single larger computer.

• Each node runs its own operating system, and nodes communicate using a networking protocol.

• The largest clusters are called Warehouse Scale Computers (WSCs), designed so that tens of thousands of servers can act as one.

• Price-performance and power are critical to WSCs.

(16)

CLASSES OF COMPUTERS

• 80% of the cost of a $90M warehouse is associated with power and cooling of the computers inside.

• The networking gear costs another $70M, and it must be replaced every few years.

• WSCs are related to servers in that availability is critical.

• For example, Amazon.com had $13 billion in sales in the fourth quarter of 2010. As there are about 2200 hours in a quarter, the average revenue per hour was almost $6M. During a peak hour for Christmas shopping, the potential loss would be many times higher.

• Supercomputers are related to WSCs: they are equally expensive, costing hundreds of millions of dollars, but differ by emphasizing floating-point performance.
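The average revenue per hour quoted for Amazon.com follows from a single division. The short sketch below only reproduces that arithmetic with the figures given above ($13 billion in a quarter of about 2200 hours); the variable names are illustrative.

```python
# Average revenue per hour from the Amazon.com figures quoted above.
quarterly_sales = 13e9      # $13 billion in the fourth quarter of 2010
hours_per_quarter = 2200    # approximate number of hours in a quarter

revenue_per_hour = quarterly_sales / hours_per_quarter
print(f"${revenue_per_hour / 1e6:.1f}M per hour")   # $5.9M per hour
```

An hour of downtime therefore costs roughly $6M in lost sales on average, which is why availability matters so much for a WSC.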

(17)

TRENDS IN TECHNOLOGY

INTEGRATED CIRCUIT LOGIC TECHNOLOGY

• Transistor density increases by 35% per year.

• Increases in die size range from 10% to 20% per year.

• The combined effect is a growth in transistor count on a chip of about 40% to 55% per year, or doubling every 18 to 24 months.

• This trend is popularly known as Moore's Law.
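A yearly growth rate r and a doubling time are linked by doubling time = ln 2 / ln(1 + r). The sketch below applies this conversion; the function names are illustrative. For 40% and 55% per year it gives roughly 25 and 19 months, in line with the 18 to 24 months quoted above, and the same conversion applies to the DRAM, Flash and disk rates on the following slides (for example, 60% per year quadruples capacity in about 3 years).

```python
import math


def doubling_time_years(annual_growth: float) -> float:
    """Years for a quantity growing at `annual_growth` (0.40 = 40%/yr) to double."""
    return math.log(2) / math.log(1 + annual_growth)


def multiple_time_years(annual_growth: float, factor: float) -> float:
    """Years for the quantity to grow by `factor` (e.g. 4 for quadrupling)."""
    return math.log(factor) / math.log(1 + annual_growth)


if __name__ == "__main__":
    for rate in (0.40, 0.55):
        months = 12 * doubling_time_years(rate)
        print(f"{rate:.0%}/year -> doubles every {months:.1f} months")
    print(f"60%/year -> quadruples every {multiple_time_years(0.60, 4):.1f} years")
```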

(18)

TRENDS IN TECHNOLOGY

[Figure: Transistors per chip (log scale, 1,000 to 1,000,000,000) versus year (1970 to 2000) for the Intel 4004, 8008, 8080, 8086, 80286, Intel386, Intel486, Pentium, Pentium Pro, Pentium II, Pentium III and Pentium 4.]

(19)

TRENDS IN TECHNOLOGY

SEMICONDUCTOR DRAM

• Capacity per DRAM chip has increased by about 25% to 40% per year recently, doubling roughly every two to three years.

Year | DRAM growth rate | Characterization of impact on DRAM capacity
1990 | 60%/year | Quadrupling every 3 years
1996 | 60%/year | Quadrupling every 3 years
2003 | 40% - 60%/year | Quadrupling every 3 to 4 years
200? | 40%/year | Doubling every 2 years

(20)

TRENDS IN TECHNOLOGY

SEMICONDUCTOR FLASH (Electrically Erasable Programmable Read-Only Memory)

• Non-volatile semiconductor memory; the standard storage device in PMDs.

• Capacity per Flash chip has increased by about 50% to 60% per year recently, doubling roughly every two years.

• Flash memory is 15 to 20 times cheaper per bit than DRAM.

MAGNETIC DISK TECHNOLOGY

• Prior to 1990, density increased by about 30% per year, doubling in 3 years. It increased by 100% per year in 1996. Since 2004, it has dropped back to 40% per year.

• Disks are 15 to 25 times cheaper per bit than Flash.

• Disks are 300 to 500 times cheaper per bit than DRAM.

• This technology is central to server and warehouse scale storage.

NETWORK TECHNOLOGY

• Network performance depends on both the performance of

(21)

PERFORMANCE TRENDS

BANDWIDTH OR THROUGHPUT

• It is the total amount of work done in a given time, such as megabytes per second for a disk transfer.

LATENCY OR RESPONSE TIME

• It is the time between the start and completion of an event, such as milliseconds for a disk access.

TRENDS IN POWER AND ENERGY IN IC

• For CMOS chips, the primary energy consumption has been in switching transistors, also called dynamic energy.

• The energy required per transistor is proportional to the product of the capacitive load driven by the transistor and the square of the voltage.

(22)

TRENDS IN POWER AND ENERGY IN IC

• The energy of a pulse of the logic transition 0 → 1 → 0 or 1 → 0 → 1 is:

$$\text{Energy}_{\text{dynamic}} \propto \text{Capacitive load} \times \text{Voltage}^2$$

• The energy of a single transition (0 → 1 or 1 → 0) is then:

$$\text{Energy}_{\text{dynamic}} \propto \tfrac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2$$

• The power required per transistor is the energy of a transition multiplied by the frequency of transitions:

$$\text{Power}_{\text{dynamic}} \propto \tfrac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency switched}$$

• Dynamic power and energy are greatly reduced by lowering the voltage. Voltages have dropped from 5 V to just under 1 V in 20 years.
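The proportionalities above are easiest to use as ratios, since the capacitive load cancels when only voltage and frequency change. The sketch below assumes the capacitive load is unchanged; the 15% reduction in the example is a hypothetical value.

```python
# Relative dynamic energy and power when voltage and frequency are scaled.
# Only ratios are computed, so the (unchanged) capacitive load cancels out.

def relative_dynamic_energy(voltage_ratio: float) -> float:
    """E_dynamic ∝ 1/2 x C x V^2, so the ratio depends only on V^2."""
    return voltage_ratio ** 2


def relative_dynamic_power(voltage_ratio: float, frequency_ratio: float) -> float:
    """P_dynamic ∝ 1/2 x C x V^2 x f."""
    return voltage_ratio ** 2 * frequency_ratio


if __name__ == "__main__":
    print(relative_dynamic_energy(0.85))        # about 0.72 of the original energy
    print(relative_dynamic_power(0.85, 0.85))   # about 0.61 of the original power
    print(relative_dynamic_energy(1 / 5))       # 5 V -> 1 V: 1/25 of the energy
```

Because voltage enters squared, the drop from 5 V to about 1 V alone cuts the switching energy per transition by roughly a factor of 25.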

(23)

TRENDS IN PO$ER AND ENERG IN IC

D" n"'-ing 8e!!

• Most ! today turn o the cloc) of inactive modules to save energy

and dynamic !ower# 5or e1, if no Doating-!oint instructions are e1ecuting, the cloc) of the Doating !oint unit is disa"led# If some cores are idle, their cloc)s are sto!!ed#

D)nai. V"!'age@Fre:+en.) S.a!ing 1DVFS2

• PM, la!to!s and servers have !eriods of low activity where there is

no need to o!erate at the highest cloc) fre0uency and voltages#

• Modern !Bs oer a few cloc) fre0uencies and voltages & o!erate at

lower !ower and energy#

• Power savings via J5/ & a server may "e o!erated at : dierent

cloc) rates A 9#< @>8, #. @>8 and  @>8#
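Combining the dynamic power equation with the three clock rates above gives a useful observation: for a fixed amount of work, lowering the frequency alone lowers power but not dynamic energy, because the task simply takes longer; energy only falls when the voltage is lowered as well, which is why DVFS scales both. The sketch below assumes arbitrary-unit capacitance and voltage values and a hypothetical workload of 10^9 clock cycles.

```python
# DVFS sketch: dynamic power and energy for a fixed amount of work at the
# three clock rates on the slide. Capacitance, voltages and cycle count are
# hypothetical, arbitrary-unit values chosen only to show the trend.

def dvfs_point(freq_ghz: float, volts: float, cycles: float = 1e9, cap: float = 1.0):
    freq_hz = freq_ghz * 1e9
    power = 0.5 * cap * volts ** 2 * freq_hz   # P_dynamic ∝ 1/2 C V^2 f
    time_s = cycles / freq_hz                  # the same work takes longer at low f
    energy = power * time_s                    # = 1/2 C V^2 x cycles, independent of f
    return power, time_s, energy


if __name__ == "__main__":
    for f in (2.4, 1.8, 1.0):
        p, t, e = dvfs_point(f, volts=1.0)
        print(f"{f} GHz, fixed V: power={p:.3g} time={t:.3g}s energy={e:.3g}")
    print("1.0 GHz, V lowered to 0.8:", dvfs_point(1.0, volts=0.8)[2])
```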

Design for the typical case

• PMDs and laptops are often idle, so memory and storage offer low-power modes to save energy and extend battery life.

• On-chip temperature sensors to detect when activity should be

(24)

TRENDS IN POWER AND ENERGY IN IC

Overclocking

• Intel offered Turbo mode in 2008: the chip decides it is safe to run at a higher clock rate for a short time on a few cores until the temperature starts to rise.

• For single-threaded code, these microprocessors can turn off all cores but one and run it at an even higher clock rate.

• The operating system can turn off Turbo mode, but there is no notification once it is enabled, so programs may vary in performance due to room temperature.

• Static power:

$$\text{Power}_{\text{static}} \propto \text{Current}_{\text{static}} \times \text{Voltage}$$

• Static power is proportional to the number of devices. Increasing the number of transistors increases power even if they are idle.

• SRAM caches need power to maintain the stored values.

• The processor is only a portion of the whole energy cost of a system, so it can pay to use a faster, less energy-efficient processor that lets the rest of the system go into a sleep mode sooner: race-to-halt.

(25)

TRENDS IN COST

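Cost of an IC

• Cost of an IC:

$$\text{Cost of IC} = \frac{\text{Cost of die} + \text{Cost of testing die} + \text{Cost of packaging and final test}}{\text{Final test yield}}$$

• Cost of a die:

$$\text{Cost of die} = \frac{\text{Cost of wafer}}{\text{Dies per wafer} \times \text{Die yield}}$$

• Dies per wafer:

$$\text{Dies per wafer} = \frac{\pi \times (\text{Wafer diameter}/2)^2}{\text{Die area}} - \frac{\pi \times \text{Wafer diameter}}{\sqrt{2 \times \text{Die area}}}$$

• Problem 1

Find the number of dies per 300 mm (30 cm) wafer for a die that is 1.5 cm on a side and for a die that is 1.0 cm on a side.

• Dies per wafer for the 1.5 cm die (die area = 1.5 × 1.5 = 2.25 cm²):

$$\frac{\pi \times (30/2)^2}{2.25} - \frac{\pi \times 30}{\sqrt{2 \times 2.25}} \approx 314.2 - 44.4 \approx 270$$

• Dies per wafer for the 1.0 cm die (die area = 1.0 × 1.0 = 1.00 cm²):

$$\frac{\pi \times (30/2)^2}{1.00} - \frac{\pi \times 30}{\sqrt{2 \times 1.00}} \approx 706.9 - 66.6 \approx 640$$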
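The die-count formula and the Problem 1 answers can be checked with a short Python sketch. The function names are illustrative; `cost_of_die` simply encodes the cost-of-die formula above, and any wafer cost or die yield passed to it is the caller's own assumption.

```python
import math


def dies_per_wafer(wafer_diameter_cm: float, die_area_cm2: float) -> float:
    """Wafer area / die area, minus the partial dies lost around the edge."""
    wafer_area = math.pi * (wafer_diameter_cm / 2) ** 2
    edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2)
    return wafer_area / die_area_cm2 - edge_loss


def cost_of_die(wafer_cost: float, dies: float, die_yield: float) -> float:
    """Cost of die = cost of wafer / (dies per wafer x die yield)."""
    return wafer_cost / (dies * die_yield)


if __name__ == "__main__":
    for side in (1.5, 1.0):
        print(side, round(dies_per_wafer(30, side * side)))   # 270, then 640
```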

(26)

DEPENDABILITY

• Dependability is a measure of a system's availability, reliability and maintainability.

• Infrastructure providers started offering Service Level Agreements (SLAs) to guarantee that their networking or power service would be dependable.

• For example, they would pay the customer a penalty if they did not meet an agreement for more than some hours per month.

Two main measures of dependability: Module Reliability

• Mean Time To Failure (MTTF) is a reliability measure; the reciprocal of MTTF is a rate of failures.

• Service interruption is measured as Mean Time To Repair (MTTR).

• Mean Time Between Failures (MTBF) = MTTF + MTTR.
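The reliability measures above reduce to simple arithmetic. In the sketch below, `availability` uses the standard definition of module availability, MTTF / (MTTF + MTTR), which goes slightly beyond this slide; the 1,000,000-hour MTTF and 24-hour MTTR in the example are hypothetical.

```python
# Dependability arithmetic for a single module. Times are in hours;
# the MTTF and MTTR values in the example are hypothetical.

def failure_rate(mttf_hours: float) -> float:
    """Failures per hour: the reciprocal of MTTF."""
    return 1.0 / mttf_hours


def mtbf(mttf_hours: float, mttr_hours: float) -> float:
    """Mean Time Between Failures = MTTF + MTTR."""
    return mttf_hours + mttr_hours


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the module delivers service: MTTF / MTBF."""
    return mttf_hours / mtbf(mttf_hours, mttr_hours)


if __name__ == "__main__":
    mttf, mttr = 1_000_000, 24
    print(failure_rate(mttf), mtbf(mttf, mttr), availability(mttf, mttr))
```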

(27)

MEASURING REPORTING AND SUMMARIING PERFORMANCE

•

=ma8on#com administrator may say a com!uter is faster

when it com!letes more transactions !er hour#

•

The com!uter user is interested in reducing res!onse

time &the time "etween the start and the com!letion of

an event - referred as e1ecution time#

•

The o!erator of warehouse scale com!uter may "e

interested in increasing through!ut & the total amount of

wor) done in a given time#

•

7e often want to relate the !erformance of two dierent

com!uters say  and # The !hrase  is faster than  i#e

the res!onse or e1ecution time is lower on  than  for

the given tas)# In !articular  is n time faster than #

E1ecution time  N n

E1ecution time 

•

/ince E1ecution time is reci!rocal of !erformance#

 N

Performance 
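The "n times faster" relation can be computed either from execution times or from performance values; both routes give the same n. The 10 s and 15 s timings in the sketch are hypothetical.

```python
# "X is n times faster than Y" from execution times or from performance.
# The 10 s and 15 s timings are hypothetical.

def times_faster(exec_time_x: float, exec_time_y: float) -> float:
    """n = Execution time_Y / Execution time_X."""
    return exec_time_y / exec_time_x


def performance(exec_time: float) -> float:
    """Performance is the reciprocal of execution time."""
    return 1.0 / exec_time


if __name__ == "__main__":
    tx, ty = 10.0, 15.0
    print(times_faster(tx, ty))                 # 1.5
    print(performance(tx) / performance(ty))    # same ratio via performance
```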

(28)

QUANTITATIVE PRINCIPLES OF COMPUTER DESIGN

PRINCIPLE OF LOCALITY

• Programs tend to reuse data and instructions they have used recently. The principle of locality also applies to data accesses, though not as strongly as to code accesses.

• Two different types of locality have been observed:

• Temporal locality: Recently accessed items are likely to be accessed in the near future.

• Spatial locality: Items whose addresses are near one another tend to be referenced close together in time.
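One way to see both kinds of locality is to count misses in a toy cache. The sketch below swaps in a direct-mapped cache with 64-byte lines and 64 lines (4 KB); these parameters and the address streams are hypothetical, chosen only so the counts are easy to check by hand.

```python
# Toy direct-mapped cache used to count misses for different address streams.
# Line size and number of lines are hypothetical parameters.

def count_misses(addresses, line_size=64, num_lines=64):
    tags = [None] * num_lines          # one tag per cache line
    misses = 0
    for addr in addresses:
        block = addr // line_size      # which memory block holds this byte
        index = block % num_lines      # the only line this block may occupy
        tag = block // num_lines
        if tags[index] != tag:         # miss: fetch the block, replace the line
            misses += 1
            tags[index] = tag
    return misses


if __name__ == "__main__":
    sequential = [4 * i for i in range(1024)]   # walk 1024 consecutive 4-byte words
    strided = [64 * i for i in range(1024)]     # touch one word in every line
    print(count_misses(sequential))             # 64: spatial locality
    print(count_misses(sequential * 2))         # 64: second pass all hits (temporal)
    print(count_misses(strided))                # 1024: no reuse within a line
```

The sequential walk misses once per 64-byte line and then hits on the next 15 words (spatial locality); repeating it costs no further misses because the data is still cached (temporal locality).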

AMDAHLS LA$

• It states that the !erformance im!rovement can "e gained from using

faster mode of e1ecution is limited "y the fraction of the time faster mode can "e used#

• /!eedu! N !erformance for entire tas) using the enhancement when !ossi"le

!erformance for entire tas) without using the enhancement =lternatively

• /!eedu! N E1ecution time for entire tas) without using the enhancement

(29)

QUANTITATIVE PRINCIPLES OF COMPUTER DESIGN

$$\text{Execution time}_{\text{new}} = \text{Execution time}_{\text{old}} \times \left( (1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right)$$

$$\text{Speedup}_{\text{overall}} = \frac{\text{Execution time}_{\text{old}}}{\text{Execution time}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$

PROBLEM 2

• Suppose that we want to enhance the processor used for web serving. The new processor is 10 times faster on computation in the web serving application than the original processor. Assuming that the original processor is busy with computation 40% of the time and is waiting for I/O 60% of the time, what is the overall speedup gained by incorporating the enhancement?

Fraction enhanced = 0.4

Speedup enhanced = 10
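Plugging Problem 2 into the overall-speedup formula gives Speedup_overall = 1 / (0.6 + 0.4/10) = 1 / 0.64 ≈ 1.56. The sketch below encodes the same formula; the second call, with a hypothetical near-infinite enhancement, shows the Amdahl limit of 1 / 0.6 ≈ 1.67 set by the 60% of time spent waiting for I/O.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup = 1 / ((1 - F) + F / S)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


if __name__ == "__main__":
    print(round(amdahl_speedup(0.4, 10), 2))     # Problem 2: 1.56
    print(round(amdahl_speedup(0.4, 1e12), 2))   # near-infinite speedup: 1.67
```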

(30)

THE PROCESSOR PERFORMANCE EQUATION

• Essentially all computers are constructed using a clock running at a constant rate. These discrete time events are called ticks, clock ticks, clock periods, clocks, cycles or clock cycles.

• CPU time:

$$\text{CPU time} = \text{CPU clock cycles for a program} \times \text{Clock cycle time}$$

(or)

$$\text{CPU time} = \frac{\text{CPU clock cycles for a program}}{\text{Clock rate}}$$

Clock Cycles Per Instruction (CPI)

$$\text{CPI} = \frac{\text{CPU clock cycles for a program}}{\text{Instruction count}}$$
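Substituting CPI × Instruction count for the clock cycles gives CPU time = Instruction count × CPI × Clock cycle time. The sketch below encodes that combined form; the program size, CPI and 2 GHz clock rate in the example are hypothetical.

```python
def cpu_time_seconds(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU time = Instruction count x CPI / Clock rate."""
    clock_cycles = instruction_count * cpi     # CPU clock cycles for the program
    return clock_cycles / clock_rate_hz        # = cycles x clock cycle time


def cycles_per_instruction(clock_cycles: float, instruction_count: float) -> float:
    """CPI = CPU clock cycles for a program / Instruction count."""
    return clock_cycles / instruction_count


if __name__ == "__main__":
    # Hypothetical program: 10^9 instructions, CPI of 2, 2 GHz clock.
    print(cpu_time_seconds(1e9, 2.0, 2e9))    # 1.0
```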

(31)

CLASSES OF PARALLELISM

• Parallelism is a driving force of computer design, with energy and cost being the primary design constraints.

• There are basically two kinds of parallelism in applications:

1. Data-Level Parallelism (DLP): arises because there are many data items that can be operated on at the same time.

2. Task-Level Parallelism (TLP): arises because tasks of work are created that can operate independently and largely in parallel.

Computer hardware can exploit these two kinds of application parallelism in four major ways:

1. Instruction-Level Parallelism (ILP)

• Exploits DLP with compiler help.

• All processors since about 1985 use pipelining to overlap the execution of instructions and improve performance.

• This potential overlap among instructions is called instruction-level parallelism.

• The instructions can be evaluated in parallel.

2. Vector Architectures and Graphics Processor Units (GPUs)

Exploit DLP by applying a single instruction to a collection of data in parallel.

3. Thread-Level Parallelism

Exploits either DLP or TLP in a tightly coupled hardware module that allows for interaction among parallel threads.

4. Request-Level Parallelism

Exploits parallelism among largely decoupled tasks specified by the programmer or the operating system.

(32)

• Michael Flynn placed all computers into one of four categories:

1. Single Instruction stream, Single Data stream (SISD):

• Uniprocessor category.

• Standard sequential computer, but it can exploit ILP.

• SISD architectures that use ILP techniques such as superscalar execution.

2. Single Instruction stream, Multiple Data streams (SIMD):

• In a SIMD machine, the same instruction is executed by multiple processors using different data streams.

• Each processor has its own data memory, but there is only one instruction memory and control processor, which fetches and dispatches instructions.


• It exploits DLP by applying the same operations to

(33)

3. Multiple Instruction streams, Single Data stream (MISD):

• No commercial multiprocessor of this type has been built to date.

4. Multiple Instruction streams, Multiple Data streams (MIMD):

• Each processor fetches its own instructions and operates on its own data.

• These processors either use a centralized shared-memory architecture or each has its own memory, and they communicate with each other through crossbar networks.

• SIMD processors can exploit data parallelism, but are not as flexible as MIMD processors. They are suitable for algorithms with high data parallelism and little data-dependent control flow.

• MIMD processors are more flexible: they can function either as single-user machines focusing on high performance for one particular application, or as multiprogrammed machines running many tasks simultaneously.

• However, they are much more expensive and complicated due to the replication of control hardware, high instruction bandwidth requirements and synchronization of the data path.

• Besides pure SIMD and MIMD approaches, a combination of both is also possible, exploiting the advantages of both architectures.

• Tightly coupled MIMD architectures exploit TLP, since multiple cooperating threads operate in parallel.

• Loosely coupled MIMD architectures (clusters and WSCs) exploit RLP, where many independent tasks can proceed in parallel with little need for communication and synchronization.

(34)

MULTITHREADING

• Multithreading: concurrent execution of two or more threads, either on a single processor or on multiple processors.

• On a single processor, multithreading generally occurs by time-division multiplexing (TDM): the processor switches between different threads.

• On a multiprocessor, the threads or tasks actually run at the same time, with each processor or core running a particular thread or task.

Types

1. Coarse-grained multithreading

2. Fine-grained multithreading

3. Simultaneous multithreading

Advantages of Multithreading

1. If a thread gets a lot of cache misses, the other thread can continue, taking advantage of the unused computing resources. This can lead to faster overall execution, as these resources would have been idle if only a single thread was

(35)

MULTITHREADING

Disa/an'ages 

# Multi!le threads can interfere with each other, when sharing hardware

resources such as caches or T2P#

9# E1ecution time of a single threads are not im!roved, due to slower fre0uency or

adding !i!eline stages that are necessary to accommodate thread switching >?7#

:# Re0uires more changes to "oth a!!lica"le !rograms and O/ than multi!rocessing#

C"arse@graine/ M+!'i'-rea/ing

• =lso )nown as Kloc) or coo!erative multithreading#

• /im!lest ty!e of multithreading, occurs when one thread

runs until, it is

"loc)ed "y a event that normally would create a long latency stall#

• /uch a stall might "e a cache miss, that have to access

o-chi! memory &

might ta)e huge num"er of CPU cycles, for the data to return#

(36)

MULTITHREADING

Fine-grained Multithreading

• Its aim is to remove all dependency stalls from the executing pipeline.

• Since one thread is relatively independent of the other threads, there is less chance of an instruction in one pipeline stage needing an output from an older instruction in the pipeline.
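The interleaving behind fine-grained multithreading can be sketched as a round-robin issue loop: each cycle, one instruction is issued from the next thread that still has work. With N threads, consecutive instructions of the same thread are N cycles apart, which is why a dependent instruction is less likely to stall. The thread count, instruction labels and the choice to spend no cycle on finished threads are hypothetical simplifications.

```python
# Fine-grained (interleaved) multithreading sketch: each cycle the pipeline
# issues one instruction from the next thread in round-robin order.
# Thread count and instruction labels are hypothetical.

def fine_grained_issue(threads):
    """Return the issue order as (cycle, thread_id, instruction) tuples."""
    pcs = [0] * len(threads)            # per-thread program counters
    order, cycle, t = [], 0, 0
    remaining = sum(len(th) for th in threads)
    while remaining:
        if pcs[t] < len(threads[t]):    # finished threads are skipped (no cycle spent)
            order.append((cycle, t, threads[t][pcs[t]]))
            pcs[t] += 1
            remaining -= 1
            cycle += 1
        t = (t + 1) % len(threads)      # switch to the next thread every cycle
    return order


if __name__ == "__main__":
    threads = [[f"T{i}.I{j}" for j in range(3)] for i in range(4)]
    for cycle, tid, instr in fine_grained_issue(threads):
        print(cycle, tid, instr)
```

With the four three-instruction threads in the example, thread 0 issues in cycles 0, 4 and 8, leaving three cycles between its dependent instructions.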

Hardware Cost

• It has the additional cost of each pipeline stage tracking the thread ID of the instruction it is processing.

• Since more threads are being executed concurrently in the pipeline, shared resources such as caches need to be larger to avoid thrashing between the different threads.

Si+!'ane"+s M+!'i'-rea/ing