w6_internal_memory.pdf

(1)

INTERNAL MEMORY

(2)

2/40

L

ETS

R

EVIEW WHAT WE LEARNED

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(3)

3/40

R

EAD

O

NLY

M

EMORY

(ROM)

 Permanent storage

 Nonvolatile

 Microprogramming (see later)

 Library subroutines

 Systems programs (BIOS)

 Function tables

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(4)

4/40

R

EAD

O

NLY

M

EMORY

(ROM)

 A ROM is created with the data actually wired

into the chip

 Problems are:

 Data insertion step includes relatively large fixed

cost

 No room for error, if one bit is wrong, the whole ROM

must be thrown out.

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(5)

5/40

T

YPES OF

ROM

 Written during manufacture

 Very expensive for small runs

 Programmable (once)

 PROM

 Needs special equipment to program

 Read “mostly”

 Erasable Programmable (EPROM)

 Erased by UV

 Electrically Erasable (EEPROM)

 Takes much longer to write than read

 Flash memory

 Erase whole memory electrically

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(6)

6/40

R

EFRESHING

 Refresh circuit included on chip

 Disable chip

 Count through rows

 Read & Write back

 Takes time

 Slows down apparent performance

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(7)

7/40

T

YPICAL

16 M

B

DRAM (4M

X

4)

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(8)

P

ACKAGING

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(9)

9/40

I

NTERLEAVED

M

EMORY

 Collection of DRAM chips

 Grouped into memory bank

 Banks independently service read or write

requests

 K banks can service k requests simultaneously

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(10)

10/40

E

RROR

C

ORRECTION

 Hard Failure

 Permanent defect

 Soft Error

 Random, non-destructive

 No permanent damage to memory

 Detected using Hamming error correcting code

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(11)

11/40

E

RROR

C

ORRECTING

C

ODE

F

UNCTION

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(12)

12/40

A

DVANCED

DRAM O

RGANIZATION

 Basic DRAM same since first RAM chips

 Enhanced DRAM

 Contains small SRAM as well

 SRAM holds last line read (c.f. Cache!)

 Cache DRAM

 Larger SRAM component  Use as cache or serial buffer

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(13)

13/40

S

YNCHRONOUS

DRAM (SDRAM)

 Access is synchronized with an external clock

 Address is presented to RAM

 RAM finds data (CPU waits in conventional

DRAM)

 Since SDRAM moves data in time with system

clock, CPU knows when data will be ready

 CPU does not have to wait, it can do something

else

 Burst mode allows SDRAM to set up stream of

data and fire it out in block

 DDR-SDRAM sends data twice per clock cycle

(leading & trailing edge)

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(14)

14/40

SDRAM R

EAD

T

IMING

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(15)

15/40

RAMBUS

 Adopted by Intel for Pentium & Itanium

 Main competitor to SDRAM

 Vertical package – all pins on one side

 Data exchange over 28 wires < cm long

 Bus addresses up to 320 RDRAM chips at

1.6Gbps

 Asynchronous block protocol

 480ns access time  Then 1.6 Gbps

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(16)

16/40

C

ACHE

DRAM

 Mitsubishi

 Integrates small SRAM cache (16 kb) onto

generic DRAM chip

 Used as true cache

 64-bit lines

 Effective for ordinary random access

 To support serial access of block of data

 E.g. refresh bit-mapped screen

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(17)

17/40

A

SSIGNMENT

# 2

 Distinguish between SD & RD RAM & types

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(18)

18/40

CLASS 2 THURSDAY

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(19)

19/40

C

ACHE

C

OHERENCY

 Cache coherency refers to the consistency of data stored

in local caches of a shared resource.

 In a shared memory multiprocessor system with a separate

cache memory for each processor, it is possible to have

many copies of any one instruction operand: one copy in the main memory and one in each cache memory.

 When one copy of an operand is changed, the other copies

of the operand must be changed also.

 Cache coherence is the discipline that ensures that changes

in the values of shared operands are propagated throughout the system in a timely fashion.

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(20)

20/40

C

ACHE

C

OHERENCY

 There are three distinct levels of cache coherence:

1. Every write operation appears to occur instantaneously 2. All processors see exactly the same sequence of changes of

values for each separate operand

3. Different processors may see an operation and assume

different sequences of values (this is considered non-coherent behavior)

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(21)

21/40

C

ACHE

/M

AIN

M

EMORY

S

TRUCTURE

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(22)

22/40

C

ACHE

R

EAD

O

PERATION

-

F

LOWCHART

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(23)

23/40

T

YPICAL

C

ACHE

O

RGANIZATION

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(24)

24/40

C

ACHE

D

ESIGN

 Cache Size

 Mapping Function

 Replacement Algorithm

 Write Policy

 Block Size

 Number of Caches

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(25)

25/40

C

ACHE

D

ESIGN

 Cache Size

 The size of the cache to be small enough so that

the overall average cost per bit is close to that of main memory

 The size of the cache is large enough so that the

overall average access time is close to that of the cache

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(26)

26/40

M

APPING

F

UNCTION

 Because there are fewer lines than main memory

blocks, an algorithm is needed for mapping main memory blocks into cache lines

 A means is needed for determining which main

memory block currently occupies a cache line

 Direct , associative, set associative

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(27)

27/40

M

APPING

F

UNCTION

 Cache of 64kByte

 Cache block of 4 bytes

 cache is 16k (214) lines of 4 bytes each

 16 MBytes main memory

 24 bit address (224=16M)

 4M Blocks of 4 bytes each

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(28)

28/40

D

IRECT

M

APPING

 Each block of main memory maps to only one cache

line

 i.e. if a block is in cache, it must be in one specific place

 Address is in two parts

 Least Significant (w) bits identify unique word  Most Significant (s) bits specify one memory block  The MSBs are split into a cache line field r and a tag

of s-r (most significant)

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(29)

D

IRECT

M

APPING

A

DDRESS

S

TRUCTURE

 24 bit address

 2 bit word identifier (4 byte block)  22 bit block identifier

 8 bit tag (=22-14)

 14 bit slot or line

 No two blocks in the same line have the same Tag field  Check contents of cache by finding line and checking Tag

Tag s-r Line or Slot r Word w

(30)

30/40

D

IRECT

M

APPING PROS

&

CONS

 Pros

 Inexpensive

 Fixed location for given block

 Cons

 If a program accesses 2 blocks that map to the same

line repeatedly, cache misses are very high

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(31)

31/40

A

SSOCIATIVE

M

APPING

 A main memory block can load into any line of

cache

 Memory address is interpreted as tag and word

 Tag uniquely identifies block of memory

 Every line’s tag is examined for a match

 Complex circuitry required to examine tags of all

cache lines

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(32)

32/40

D

IRECT

AND

S

ET

A

SSOCIATIVE

C

ACHE

P

ERFORMANCE

D

IFFERENCES

 Cache complexity increases with associativity

 Not justified against increasing cache to 8kB or

16kB (not with increase in size)

 Above 32kB gives no improvement

 (simulation results)

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(33)

33/40

S

ET

A

SSOCIATIVE

M

APPING

 Cache is divided into v sets, each consists of k

line

m = v * k i = j mod v

i = cache set number

j = main memory block number

m = number of lines in the cache

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(34)

34/40

S

ET

A

SSOCIATIVE

M

APPING

 Block B_j can be mapped into any of the lines of

set i.

 Memory address interprets as tag, set and word

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(35)

35/40

R

EPLACEMENT

A

LGORITHMS

(1)

D

IRECT MAPPING

 No choice

 Each block only maps to one line

 Replace that line

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(36)

36/40

R

EPLACEMENT

A

LGORITHMS

(2)

A

SSOCIATIVE

& S

ET

A

SSOCIATIVE

 Hardware implemented algorithm (speed)

 Least Recently used (LRU)

 Replace that block that has been in the cache longest

with no reference to it (USE bit for 2 – way set associative)

 First in first out (FIFO)

 replace block that has been in cache longest,

implemented by round robin or circular buffer

 Least frequently used

 replace block which has had fewest hits-counter

 Random

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(37)

37/40

W

RITE

P

OLICY

 Must not overwrite a cache block unless main

memory is up to date

 Multiple CPUs may have individual caches

 I/O may address main memory directly Pre_par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(38)

38/40

W

RITE THROUGH

 All writes go to main memory as well as cache

 Multiple CPUs can monitor main memory traffic

to keep local (to CPU) cache up to date

 Lots of traffic

 Slows down writes

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(39)

39/40

W

RITE BACK

 Updates initially made in cache only

 Update bit for cache slot is set when update

occurs

 If block is to be replaced, write to main memory

only if update bit is set

 I/O must access main memory through cache

 Complex circuitry

 Potential bottleneck

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(40)

40/40

W

RITE BACK

 Cache coherency problem

 Possible approaches,

 Bus watching with write through  Hardware transparency

 Noncacheable memory

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(41)

41/40

L

INE

S

IZE

 Retrieve not only desired word but a number of

adjacent words as well

 Increased block size will at first increase hit ratio

(principle of locality)

 The hit ratio will begin to decrease as block becomes

even bigger

 Larger blocks reduce the number of blocks that fit into a

cache

 8 to 64 bytes seems reasonable

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(42)

42/40

N

UMBER OF

C

ACHES

 Multilevel Caches

 High logic density enables caches on chip

 Faster than bus access

 Frees bus for other transfers

 Common to use both on and off chip cache

 L1 and L2 on chip, L3 off chip in static RAM  L3 access much faster than DRAM or ROM  L3 often uses separate data path

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(43)

43/40

HIT RATIO (L1 & L2) FOR 8 KBYTES AND 16 KBYTE L1

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(44)

44/40

C

ACHES

FOR

D

ATA

AND

I

NSTRUCTION

U

NIFIED

OR

S

PLIT

 One cache for data and instructions or one for data

and one for instructions

 Advantages of unified cache

 Higher hit rate

 Balances load of instruction and data fetch  Only one cache to design & implement

 Advantages of split cache

 Eliminates cache contention between instruction

fetch/decode unit and execution unit

 Important in parallel instruction execution (pipelining) and pre

fetching of predicted future instructions

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(45)

45/40

P

ENTIUM

4 C

ACHE

 80386 – no cache on chip

 80486 – 8k using 16 byte lines and four way set

associative organization

 Pentium (all versions) – two on chip L1 caches one for

Data and one for instructions

 Pentium III – L3 cache added off chip  Pentium 4

 L1 caches

 8k bytes

 64 byte lines

 four way set associative

 L2 cache

 Feeding both L1 caches  256k

 128 byte lines

 8 way set associative

 L3 cache on chip

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(46)

I

NTEL

C

ACHE

E

VOLUTION

Problem Solution

Processor on which feature first appears

External memory slower than the system bus. Add external cache using faster _{memory technology.} 386

Increased processor speed results in external bus becoming a bottleneck for cache access.

Move external cache on-chip, operating at the same speed as the processor.

486

Internal cache is rather small, due to limited space on chip

Add external L2 cache using faster technology than main memory

486

Contention occurs when both the Instruction Prefetcher and the Execution Unit simultaneously require access to the cache. In that case, the Prefetcher is stalled while the Execution Unit’s data access takes place.

Create separate data and instruction caches.

Pentium

Increased processor speed results in external bus becoming a bottleneck for L2 cache access.

Create separate back-side bus that runs at higher speed than the main (front-side) external bus. The BSB is dedicated to the L2 cache.

Pentium Pro

Move L2 cache on to the processor chip.

Pentium II

Some applications deal with massive databases and must have rapid access to large amounts of data. The on-chip caches are too small.

Add external L3 cache. Pentium III

Move L3 cache on-chip. Pentium 4

Pre par ed by Eng inee

r S M

(47)

47/40

P

ENTIUM

4 B

LOCK

D

IAGRAM

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(48)

48/40

P

ENTIUM

4 C

ORE

P

ROCESSOR

 Fetch/Decode Unit

 Fetches instructions from L2 cache  Decode into micro-ops

 Store micro-ops in L1 cache

 Out of order execution logic

 Schedules micro-ops

 Based on data dependence and resources  May execute speculatively

 Execution units

 Execute micro-ops  Data from L1 cache  Results in registers

 Memory subsystem

 L2 cache and systems bus

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v

(49)

49/40

P

ENTIUM

4 D

ESIGN

R

EASONING

 Decodes instructions into RISC like micro-ops before L1

cache

 Micro-ops fixed length

 Superscalar pipelining and scheduling

 Pentium instructions long & complex

 Performance improved by separating decoding from

scheduling & pipelining

 (More later – ch14)

 Data cache is write back

 Can be configured to write through

 L1 cache controlled by 2 bits in register

 CD = cache disable

 NW = not write through

 2 instructions to invalidate (flush) cache and write back then

invalidate

 L2 and L3 8-way set-associative

 Line size 128 bytes

Pre

par

ed

by

Eng

inee

r S M

A

si

m

A

li

R

iz

v