INTERNAL MEMORY
2/40
L
ETSR
EVIEW WHAT WE LEARNEDPre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
3/40
R
EADO
NLYM
EMORY(ROM)
Permanent storage
Nonvolatile
Microprogramming (see later)
Library subroutines
Systems programs (BIOS)
Function tables
Pre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
4/40
R
EADO
NLYM
EMORY(ROM)
A ROM is created with the data actually wired
into the chip
Problems are:
Data insertion step includes relatively large fixed
cost
No room for error, if one bit is wrong, the whole ROM
must be thrown out.
Pre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
5/40
T
YPES OFROM
Written during manufacture
Very expensive for small runs
Programmable (once)
PROM
Needs special equipment to program
Read “mostly”
Erasable Programmable (EPROM)
Erased by UV
Electrically Erasable (EEPROM)
Takes much longer to write than read
Flash memory
Erase whole memory electrically
Pre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
6/40
R
EFRESHING Refresh circuit included on chip
Disable chip
Count through rows
Read & Write back
Takes time
Slows down apparent performance
Pre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
7/40
T
YPICAL16 M
BDRAM (4M
X4)
Pre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
P
ACKAGINGPre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
9/40
I
NTERLEAVEDM
EMORY Collection of DRAM chips
Grouped into memory bank
Banks independently service read or write
requests
K banks can service k requests simultaneously
Pre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
10/40
E
RRORC
ORRECTION Hard Failure
Permanent defect
Soft Error
Random, non-destructive
No permanent damage to memory
Detected using Hamming error correcting code
Pre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
11/40
E
RRORC
ORRECTINGC
ODEF
UNCTIONPre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
12/40
A
DVANCEDDRAM O
RGANIZATION Basic DRAM same since first RAM chips
Enhanced DRAM
Contains small SRAM as well
SRAM holds last line read (c.f. Cache!)
Cache DRAM
Larger SRAM component Use as cache or serial buffer
Pre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
13/40
S
YNCHRONOUSDRAM (SDRAM)
Access is synchronized with an external clock
Address is presented to RAM
RAM finds data (CPU waits in conventional
DRAM)
Since SDRAM moves data in time with system
clock, CPU knows when data will be ready
CPU does not have to wait, it can do something
else
Burst mode allows SDRAM to set up stream of
data and fire it out in block
DDR-SDRAM sends data twice per clock cycle
(leading & trailing edge)
Pre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
14/40
SDRAM R
EADT
IMINGPre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
15/40
RAMBUS
Adopted by Intel for Pentium & Itanium
Main competitor to SDRAM
Vertical package – all pins on one side
Data exchange over 28 wires < cm long
Bus addresses up to 320 RDRAM chips at
1.6Gbps
Asynchronous block protocol
480ns access time Then 1.6 Gbps
Pre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
16/40
C
ACHEDRAM
Mitsubishi
Integrates small SRAM cache (16 kb) onto
generic DRAM chip
Used as true cache
64-bit lines
Effective for ordinary random access
To support serial access of block of data
E.g. refresh bit-mapped screen
Pre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
17/40
A
SSIGNMENT# 2
Distinguish between SD & RD RAM & types
Pre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
18/40
CLASS 2 THURSDAY
Prepar
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
19/40
C
ACHEC
OHERENCY Cache coherency refers to the consistency of data stored
in local caches of a shared resource.
In a shared memory multiprocessor system with a separate
cache memory for each processor, it is possible to have
many copies of any one instruction operand: one copy in the main memory and one in each cache memory.
When one copy of an operand is changed, the other copies
of the operand must be changed also.
Cache coherence is the discipline that ensures that changes
in the values of shared operands are propagated throughout the system in a timely fashion.
Pre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
20/40
C
ACHEC
OHERENCY There are three distinct levels of cache coherence:
1. Every write operation appears to occur instantaneously 2. All processors see exactly the same sequence of changes of
values for each separate operand
3. Different processors may see an operation and assume
different sequences of values (this is considered non-coherent behavior)
Pre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
21/40
C
ACHE/M
AINM
EMORYS
TRUCTUREPre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
22/40
C
ACHER
EADO
PERATION-
F
LOWCHART
Pre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
23/40
T
YPICALC
ACHEO
RGANIZATIONPre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
24/40
C
ACHED
ESIGN Cache Size
Mapping Function
Replacement Algorithm
Write Policy
Block Size
Number of Caches
Pre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
25/40
C
ACHED
ESIGN Cache Size
The size of the cache to be small enough so that
the overall average cost per bit is close to that of main memory
The size of the cache is large enough so that the
overall average access time is close to that of the cache
Pre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
26/40
M
APPINGF
UNCTION Because there are fewer lines than main memory
blocks, an algorithm is needed for mapping main memory blocks into cache lines
A means is needed for determining which main
memory block currently occupies a cache line
Direct , associative, set associative
Pre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
27/40
M
APPINGF
UNCTION Cache of 64kByte
Cache block of 4 bytes
cache is 16k (214) lines of 4 bytes each
16 MBytes main memory
24 bit address (224=16M)
4M Blocks of 4 bytes each
Pre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
28/40
D
IRECTM
APPING Each block of main memory maps to only one cache
line
i.e. if a block is in cache, it must be in one specific place
Address is in two parts
Least Significant (w) bits identify unique word Most Significant (s) bits specify one memory block The MSBs are split into a cache line field r and a tag
of s-r (most significant)
Pre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
D
IRECTM
APPINGA
DDRESSS
TRUCTURE 24 bit address
2 bit word identifier (4 byte block) 22 bit block identifier
8 bit tag (=22-14)
14 bit slot or line
No two blocks in the same line have the same Tag field Check contents of cache by finding line and checking Tag
Tag s-r Line or Slot r Word w
30/40
D
IRECTM
APPING PROS&
CONS Pros
Inexpensive
Fixed location for given block
Cons
If a program accesses 2 blocks that map to the same
line repeatedly, cache misses are very high
Pre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
31/40
A
SSOCIATIVEM
APPING A main memory block can load into any line of
cache
Memory address is interpreted as tag and word
Tag uniquely identifies block of memory
Every line’s tag is examined for a match
Complex circuitry required to examine tags of all
cache lines
Pre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
32/40
D
IRECT
AND
S
ET
A
SSOCIATIVE
C
ACHE
P
ERFORMANCE
D
IFFERENCES
Cache complexity increases with associativity
Not justified against increasing cache to 8kB or
16kB (not with increase in size)
Above 32kB gives no improvement
(simulation results)
Pre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
33/40
S
ETA
SSOCIATIVEM
APPING Cache is divided into v sets, each consists of k
line
m = v * k i = j mod v
i = cache set number
j = main memory block number
m = number of lines in the cache
Pre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
34/40
S
ETA
SSOCIATIVEM
APPING Block Bj can be mapped into any of the lines of
set i.
Memory address interprets as tag, set and word
Pre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
35/40
R
EPLACEMENTA
LGORITHMS(1)
D
IRECT MAPPING No choice
Each block only maps to one line
Replace that line
Pre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
36/40
R
EPLACEMENT
A
LGORITHMS
(2)
A
SSOCIATIVE
& S
ET
A
SSOCIATIVE
Hardware implemented algorithm (speed)
Least Recently used (LRU)
Replace that block that has been in the cache longest
with no reference to it (USE bit for 2 – way set associative)
First in first out (FIFO)
replace block that has been in cache longest,
implemented by round robin or circular buffer
Least frequently used
replace block which has had fewest hits-counter
Random
Pre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
37/40
W
RITEP
OLICY Must not overwrite a cache block unless main
memory is up to date
Multiple CPUs may have individual caches
I/O may address main memory directly Prepar
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
38/40
W
RITE THROUGH All writes go to main memory as well as cache
Multiple CPUs can monitor main memory traffic
to keep local (to CPU) cache up to date
Lots of traffic
Slows down writes
Pre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
39/40
W
RITE BACK Updates initially made in cache only
Update bit for cache slot is set when update
occurs
If block is to be replaced, write to main memory
only if update bit is set
I/O must access main memory through cache
Complex circuitry
Potential bottleneck
Pre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
40/40
W
RITE BACK Cache coherency problem
Possible approaches,
Bus watching with write through Hardware transparency
Noncacheable memory
Pre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
41/40
L
INES
IZE Retrieve not only desired word but a number of
adjacent words as well
Increased block size will at first increase hit ratio
(principle of locality)
The hit ratio will begin to decrease as block becomes
even bigger
Larger blocks reduce the number of blocks that fit into a
cache
8 to 64 bytes seems reasonable
Pre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
42/40
N
UMBER OFC
ACHES Multilevel Caches
High logic density enables caches on chip
Faster than bus access
Frees bus for other transfers
Common to use both on and off chip cache
L1 and L2 on chip, L3 off chip in static RAM L3 access much faster than DRAM or ROM L3 often uses separate data path
Pre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
43/40
HIT RATIO (L1 & L2) FOR 8 KBYTES AND 16 KBYTE L1
Pre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
44/40
C
ACHES
FOR
D
ATA
AND
I
NSTRUCTION
U
NIFIED
OR
S
PLIT
One cache for data and instructions or one for data
and one for instructions
Advantages of unified cache
Higher hit rate
Balances load of instruction and data fetch Only one cache to design & implement
Advantages of split cache
Eliminates cache contention between instruction
fetch/decode unit and execution unit
Important in parallel instruction execution (pipelining) and pre
fetching of predicted future instructions
Pre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
45/40
P
ENTIUM4 C
ACHE 80386 – no cache on chip
80486 – 8k using 16 byte lines and four way set
associative organization
Pentium (all versions) – two on chip L1 caches one for
Data and one for instructions
Pentium III – L3 cache added off chip Pentium 4
L1 caches
8k bytes
64 byte lines
four way set associative
L2 cache
Feeding both L1 caches 256k
128 byte lines
8 way set associative
L3 cache on chip
Pre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
I
NTELC
ACHEE
VOLUTIONProblem Solution
Processor on which feature first appears
External memory slower than the system bus. Add external cache using faster memory technology. 386
Increased processor speed results in external bus becoming a bottleneck for cache access.
Move external cache on-chip, operating at the same speed as the processor.
486
Internal cache is rather small, due to limited space on chip
Add external L2 cache using faster technology than main memory
486
Contention occurs when both the Instruction Prefetcher and the Execution Unit simultaneously require access to the cache. In that case, the Prefetcher is stalled while the Execution Unit’s data access takes place.
Create separate data and instruction caches.
Pentium
Increased processor speed results in external bus becoming a bottleneck for L2 cache access.
Create separate back-side bus that runs at higher speed than the main (front-side) external bus. The BSB is dedicated to the L2 cache.
Pentium Pro
Move L2 cache on to the processor chip.
Pentium II
Some applications deal with massive databases and must have rapid access to large amounts of data. The on-chip caches are too small.
Add external L3 cache. Pentium III
Move L3 cache on-chip. Pentium 4
Pre par ed by Eng inee
r S M
47/40
P
ENTIUM4 B
LOCKD
IAGRAMPre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
48/40
P
ENTIUM4 C
OREP
ROCESSOR Fetch/Decode Unit
Fetches instructions from L2 cache Decode into micro-ops
Store micro-ops in L1 cache
Out of order execution logic
Schedules micro-ops
Based on data dependence and resources May execute speculatively
Execution units
Execute micro-ops Data from L1 cache Results in registers
Memory subsystem
L2 cache and systems bus
Pre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v
49/40
P
ENTIUM4 D
ESIGNR
EASONING Decodes instructions into RISC like micro-ops before L1
cache
Micro-ops fixed length
Superscalar pipelining and scheduling
Pentium instructions long & complex
Performance improved by separating decoding from
scheduling & pipelining
(More later – ch14)
Data cache is write back
Can be configured to write through
L1 cache controlled by 2 bits in register
CD = cache disable
NW = not write through
2 instructions to invalidate (flush) cache and write back then
invalidate
L2 and L3 8-way set-associative
Line size 128 bytes
Pre
par
ed
by
Eng
inee
r S M
A
si
m
A
li
R
iz
v