ECEN 676
Advanced Computer Architecture
Prof. Michel A. Kinsy
The course has 4 modules
Module 1
– Instruction Set Architecture (ISA)
– Simple Pipelining and Hazards
– Branch Prediction
Module 2
– Superscalar Architectures
– Vector machines
– VLIW
– Multithreading
– GPU
Module 3
– Caches
– Memory Models & Synchronization
– Cache Coherence Protocols
Module 4
– On-Chip networks
Architecture Taxonomy
Processor Organizations
§ Single instruction, single data stream (SISD)
  § Uniprocessor
§ Single instruction, multiple data stream (SIMD)
  § Vector Processor
  § Array Processor
§ Multiple instruction, single data stream (MISD)
§ Multiple instruction, multiple data stream (MIMD)
  § Shared Memory (Tightly Coupled)
    § Symmetric Multiprocessor (SMP)
    § Nonuniform Memory Access (NUMA)
  § Distributed Memory (Loosely Coupled)
    § Cluster
CPU-Memory Bottleneck
§ Performance of high-speed computers is usually limited by memory bandwidth & latency
§ Latency (time for a single access): memory access time >> processor cycle time
§ Bandwidth (number of accesses per unit time): if fraction m of instructions access memory, there are 1+m memory references / instruction
§ Ghost of the stored-program architecture
Processor-Memory Gap
§ Performance gap: CPU (55% each year) vs. DRAM (7% each year)
§ Processor operations take of the order of 1 ns
§ Memory access requires 10s or even 100s of ns
§ Each instruction executed involves at least one memory access
[Figure: processor vs. DRAM performance over time, 1980-2000, log scale from 1 to 1000. µProc ("Moore's Law") improves 60%/year, DRAM 7%/year; the processor-memory performance gap grows 50%/year]
Processor-DRAM Gap (latency)
§ Four-issue 2GHz superscalar accessing 100ns DRAM could execute 800 instructions during time for one memory access!
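The 800-instruction figure follows from simple arithmetic: issue width times the number of clock cycles that elapse during one DRAM access. A minimal sketch, assuming peak issue every cycle; the function and parameter names are illustrative and not from the slides:

```python
def issue_slots_per_memory_access(issue_width, clock_ghz, dram_latency_ns):
    """Peak number of instructions that could issue during one DRAM access.

    clock_ghz * dram_latency_ns is the number of cycles the access takes
    (GHz x ns = cycles); each cycle offers issue_width issue slots.
    """
    cycles_per_access = clock_ghz * dram_latency_ns
    return issue_width * cycles_per_access


# Four-issue, 2 GHz superscalar with a 100 ns DRAM access
lost_slots = issue_slots_per_memory_access(issue_width=4, clock_ghz=2, dram_latency_ns=100)
print(lost_slots)
```

This is an upper bound: it assumes the processor would otherwise sustain its full issue width on every cycle.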
Memory Trends
§ The fastest memories are expensive and thus not very large
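[Figure: memory hierarchy from registers to secondary memory]
Level         Capacity   Access Time   Cost (per GB)
Reg           100s B     ns            $Millions
L1 $          10s KB     few ns        $100s Ks
Ln $          MBs        10s ns        $10s Ks
Main Memory   100s MB    100s ns       $1000s
Transfer units between adjacent levels: Reg and L1 $, 4-8 bytes (word); L1 $ and Ln $, 8-32 bytes (block); Ln $ and Main Memory, 1 to 4 blocks; Main Memory and Secondary Memory, 1,024+ bytes (disk sector = page)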
Illustrative View of Memory Organization
§ A fast memory can help bridge the CPU-memory gap
Memory Technology
§ Early machines used a variety of memory technologies
§ Manchester Mark I used CRT Memory Storage
§ EDVAC used a mercury delay line
§ Core memory was first large scale reliable main memory
§ Invented by Forrester in late 40s at MIT for Whirlwind project
§ Bits stored as magnetization polarity on small
Memory Technology
§ First commercial DRAM was Intel 1103
§ 1Kbit of storage on single chip
§ Charge on a capacitor used to hold value
§ Semiconductor memory quickly replaced core in 1970s
§ Intel formed to exploit market for semiconductor memory
§ Phase change memory (PCM) looking promising for the future
Memory Technology
§ Random Access Memory (RAM)
§ Any byte of memory can be accessed without touching the preceding bytes
§ RAM is the most common type of memory found in computers and other digital devices
§ There are two main types of RAM
§ DRAM (Dynamic Random Access Memory)
§ Needs to be “refreshed” regularly (~ every 8 ms)
§ 1% to 2% of the active cycles of the DRAM
§ Used for Main Memory
§ SRAM (Static Random Access Memory)
§ Content will last until power turned off
§ Low density (6 transistor cells), high power, expensive, fast
RAM Organization
§ One memory row holds a block of data, so the column address selects the requested bit or word from that block
[Figure: memory array with a row address decoder; rows 1 to 2^N, columns 1 to 2^M]
DRAM Architecture
§ Modern chips have around 4 logical banks on each chip
§ Each logical bank physically implemented as many smaller arrays
RAM Organization
§ RAS or Row Access Strobe triggering row decoder
§ CAS or Column Access Strobe triggering column
RAM Organization
§ Latency: Time to access one word
§ Access time: time between the request and when the data is available (or written)
§ Cycle time: time between requests
§ Usually cycle time > access time
§ Bandwidth: How much data from the memory can be supplied to the processor per unit time
Typical Memory Reference Patterns
[Figure: memory address vs. time for a typical program, with regions for instruction fetches, stack accesses, and data accesses; annotations: n loop iterations, subroutine call, subroutine return, argument access, vector access]
A Typical Memory Hierarchy
[Figure: RF (multi-ported register file, part of CPU), split L1 Instruction Cache and L1 Data Cache, Unified L2 Cache, and Memory modules]
Definition of a Cache
§ A cache is simply a copy of a small data segment residing in the main memory
§ Fast but small extra memory
§ Hold identical copies of main memory
§ Lower latency
§ Higher bandwidth
Cache Structures
[Figure: Processor, CACHE, and Main Memory connected by address and data paths; each cache entry holds an address tag and a data block]
Multilevel Caches
§ Cache is transparent to user (happens automatically)
[Figure: CPU register file, cache memory, and main memory; words move between the register file and the cache, lines between the cache and main memory]
§ Data is in the cache a fraction h of the time
§ Go to main memory 1 – h of the time
For a cache with hit rate h, effective access time is:
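One common way to write this, consistent with the AMAT formula used later in these slides, is t_eff = t_cache + (1 – h) x t_main: every access pays the cache time, and the fraction 1 – h that misses also pays the main memory time. A minimal sketch under that assumption; the symbol and function names are illustrative:

```python
def effective_access_time(hit_rate, cache_time_ns, main_time_ns):
    """Effective access time for a single cache in front of main memory.

    Assumes every access first probes the cache (cache_time_ns) and that a
    miss then adds the full main-memory time (main_time_ns).
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return cache_time_ns + (1.0 - hit_rate) * main_time_ns


# Example numbers are made up for illustration
for h in (0.80, 0.95, 0.99):
    print(h, effective_access_time(h, cache_time_ns=1.0, main_time_ns=100.0))
```

Even a few percent of misses dominates the result when main memory is two orders of magnitude slower than the cache.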
Caches
§ This organization works because most programs exhibit locality
§ The principle of temporal locality says that if a program accesses one memory address, there is a good chance that it will access the same address in the near future
§ The principle of spatial locality says that if a program accesses one memory address, there is a good chance that it will also access other nearby addresses
Caching Principles
§ Cache contains copies of some of Main Memory
§ Those storage locations recently used
§ When Main Memory address A is referenced in CPU
§ Cache checked for a copy of contents of A
§ If found, cache hit
§ Copy used
§ No need to access Main Memory
§ If not found, cache miss
§ Main Memory accessed to get contents of A
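The lookup sequence above can be sketched as follows. This is a toy model, not a hardware description: the cache is an unbounded dictionary, and keeping a copy after a miss is an assumption (it is what makes the location "recently used"). The class name `TinyCache` is hypothetical.

```python
class TinyCache:
    """Toy cache holding copies of recently used main-memory locations."""

    def __init__(self, main_memory):
        self.main_memory = main_memory  # dict: address -> contents
        self.copies = {}                # cached copies: address -> contents
        self.hits = 0
        self.misses = 0

    def read(self, address):
        if address in self.copies:
            # Cache hit: use the copy, no main-memory access needed
            self.hits += 1
            return self.copies[address]
        # Cache miss: access main memory to get the contents of the address
        self.misses += 1
        value = self.main_memory[address]
        self.copies[address] = value    # assumption: keep a copy for later use
        return value


memory = {0: "a", 4: "b"}
cache = TinyCache(memory)
cache.read(0)   # miss, fetched from main memory
cache.read(0)   # hit, served from the cache
print(cache.hits, cache.misses)
```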
Caching principles
§ Cache size (in bytes or words)
§ Total cache capacity
§ A larger cache can hold more of the program’s useful data but is more costly and likely to be slower
§ Block or cache-line size
§ Unit of data transfer between cache and main memory
Caching principles
§ Placement policy
§ Determining where an incoming cache line is stored
§ More flexible policies imply higher hardware cost and may or may not have performance benefits (due to more complex data location)
§ Replacement policy
§ Determining which of several existing cache blocks (into which a new cache line can be mapped) should be overwritten
Caching Principles
§ Compulsory misses
§ With on-demand fetching, first access to any item is a miss
§ Capacity misses
§ We have to evict some items to make room for others
§ This leads to misses that are not incurred with an infinitely large cache
§ Conflict misses
§ The placement scheme may force us to displace useful items to bring in other items
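One common way to make the three categories operational (an assumption here, since the slides do not give a procedure) is to run the trace through the cache under study and, in parallel, through a fully associative LRU cache of the same capacity. A miss on a never-seen block is compulsory; a miss that the fully associative cache also takes is capacity; a miss that only the restricted placement causes is conflict. The sketch below does this for a direct-mapped cache; all names are illustrative.

```python
from collections import OrderedDict


def classify_misses(block_trace, num_blocks):
    """Classify misses of a direct-mapped cache holding num_blocks blocks."""
    direct = [None] * num_blocks   # direct-mapped cache under study
    full_lru = OrderedDict()       # reference: fully associative LRU, same capacity
    seen = set()                   # blocks referenced at least once
    counts = {"hit": 0, "compulsory": 0, "capacity": 0, "conflict": 0}

    for block in block_trace:
        # Update the fully associative reference cache
        in_full = block in full_lru
        if in_full:
            full_lru.move_to_end(block)
        else:
            if len(full_lru) == num_blocks:
                full_lru.popitem(last=False)   # evict least recently used
            full_lru[block] = True

        # Access the direct-mapped cache (placement: block mod num_blocks)
        slot = block % num_blocks
        if direct[slot] == block:
            counts["hit"] += 1
        else:
            if block not in seen:
                counts["compulsory"] += 1
            elif not in_full:
                counts["capacity"] += 1
            else:
                counts["conflict"] += 1
            direct[slot] = block
        seen.add(block)
    return counts


# Blocks 0 and 4 map to the same line of a 4-line direct-mapped cache
print(classify_misses([0, 4, 0, 4], num_blocks=4))
```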
Caching principles
§ Line width (2^W)
§ Too small a value for W causes a lot of main memory accesses
§ Too large a value increases the miss penalty and may tie up cache space with low-utility items that are replaced before being used
§ Set size or associativity (2^S)
§ Direct mapping (S = 0) is simple and fast
Cache Algorithm (Read)
§ Look at Processor Address, search cache tags to find match. Then either
§ Found in cache a.k.a. HIT: return copy of data from cache
§ Not in cache a.k.a. MISS: read block of data from Main Memory, wait …, then return data to processor and update cache
Caches
§ Local miss rate = misses in cache / accesses to cache
§ Global miss rate = misses in cache / CPU memory accesses
§ Misses per instruction = misses in cache / number of instructions
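The difference between local and global miss rates only shows up below the first level. A small sketch for a two-level hierarchy, assuming the L2 cache is accessed only on L1 misses; the counter names and example numbers are made up for illustration:

```python
def miss_rates(cpu_accesses, l1_misses, l2_misses, instructions):
    """Local and global miss rates for an L1/L2 hierarchy.

    Assumes every CPU memory access goes to L1 and only L1 misses reach L2,
    so accesses to L2 equal l1_misses.
    """
    return {
        "l1_local": l1_misses / cpu_accesses,    # equals L1 global miss rate
        "l2_local": l2_misses / l1_misses,       # misses in L2 / accesses to L2
        "l2_global": l2_misses / cpu_accesses,   # misses in L2 / CPU accesses
        "l2_misses_per_instruction": l2_misses / instructions,
    }


rates = miss_rates(cpu_accesses=1000, l1_misses=40, l2_misses=20, instructions=500)
for name, value in rates.items():
    print(name, value)
```

An L2 with a high local miss rate can still have a low global miss rate, because L1 has already filtered out most accesses.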
Cache Performance Metrics
§ Cache miss rate
§ Number of cache misses divided by number of accesses
§ Cache hit time
§ Time between sending address and data returning from cache
§ Cache miss latency
§ Time between sending address and data returning from next-level cache/memory
§ Cache miss penalty
Average Memory Access Time
§ Average Memory Access Time (AMAT)
§ AMAT = Hit time + (Miss rate x Miss penalty)
§ Memory stall cycles = Memory accesses x miss rate x miss penalty
§ CPU time = (CPU execution cycles + Memory stall cycles) x Cycle time
§ CPI = ideal CPI + average stalls per instruction
§ Having L1 and L2 Caches
§ AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
§ Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
§ AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)
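The two-level AMAT is the single-level formula applied twice, with the L2 result serving as the L1 miss penalty. A short sketch; the function names and example numbers are illustrative, and Miss Rate_L2 is taken to be the local L2 miss rate:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """AMAT = Hit time + (Miss rate x Miss penalty)."""
    return hit_time + miss_rate * miss_penalty


def amat_two_level(hit_time_l1, miss_rate_l1, hit_time_l2, miss_rate_l2, miss_penalty_l2):
    """Two-level AMAT: the L1 miss penalty is the AMAT seen at L2."""
    miss_penalty_l1 = amat(hit_time_l2, miss_rate_l2, miss_penalty_l2)
    return amat(hit_time_l1, miss_rate_l1, miss_penalty_l1)


# Made-up example in cycles: L1 hit 1, L1 miss rate 5%,
# L2 hit 10, L2 local miss rate 20%, memory penalty 100
print(amat_two_level(1, 0.05, 10, 0.20, 100))
```

With these numbers the L1 miss penalty is 10 + 0.2 x 100 = 30 cycles, so the AMAT is about 1 + 0.05 x 30 = 2.5 cycles.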
Placement Policy
[Figure: where a memory block can be placed in a cache with blocks 0-7 (set numbers 0-3 for the 2-way case). Fully associative: anywhere. (2-way) set associative: anywhere in one set. Direct mapped: only into one block]
Address Bit-Field Partitioning
§ The address (e.g., 32-bit) issued by the CPU is generally divided into 3 fields
§ Tag
§ Serves as the unique identifier for a group of data
§ Different regions of memory may be mapped to the same cache location/block
§ The tag is used to differentiate between them
§ Index
§ It is used to index into the cache structure
§ Block Offset
§ The least significant bits are used to determine the exact data word
§ If the block size is B then b = log2(B) bits will be needed in the address to specify data word
[Figure: address layout, most significant bits first: Tag | Index | Block Offset]
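Splitting an address into the three fields is a matter of shifts and masks. A minimal sketch, assuming the layout Tag | Index | Block Offset from most to least significant bit; the function name and the example field widths are illustrative:

```python
def split_address(address, index_bits, offset_bits):
    """Return (tag, index, block_offset) for the given field widths."""
    offset = address & ((1 << offset_bits) - 1)                 # lowest b bits
    index = (address >> offset_bits) & ((1 << index_bits) - 1)  # next k bits
    tag = address >> (offset_bits + index_bits)                 # remaining high bits
    return tag, index, offset


# 32-bit address, 16-byte blocks (b = 4), 256 lines (k = 8), so t = 20
tag, index, offset = split_address(0x12345678, index_bits=8, offset_bits=4)
print(hex(tag), hex(index), hex(offset))
```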
Direct-Mapped Cache
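[Figure: direct-mapped cache with 2^k lines, each holding a valid bit V, a tag, and a data block. The k-bit index selects a line, the stored t-bit tag is compared (=) with the address tag to produce HIT, and the b-bit block offset selects the data word or byte]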
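The lookup in a direct-mapped cache can be sketched in a few lines: the index picks exactly one line, and a hit requires that line to be valid with a matching tag. The model below tracks only valid bits and tags (no data), and filling the line on a miss is an assumption; the class name is hypothetical.

```python
class DirectMappedCache:
    """Tag-only model of a direct-mapped cache with 2**index_bits lines."""

    def __init__(self, index_bits, offset_bits):
        self.index_bits = index_bits
        self.offset_bits = offset_bits
        self.valid = [False] * (1 << index_bits)
        self.tags = [0] * (1 << index_bits)

    def access(self, address):
        """Return True on a hit; on a miss, install the new tag in the line."""
        index = (address >> self.offset_bits) & ((1 << self.index_bits) - 1)
        tag = address >> (self.offset_bits + self.index_bits)
        hit = self.valid[index] and self.tags[index] == tag
        if not hit:
            self.valid[index] = True
            self.tags[index] = tag
        return hit


cache = DirectMappedCache(index_bits=2, offset_bits=4)  # 4 lines of 16 bytes
for addr in (0x00, 0x04, 0x40, 0x00):
    print(hex(addr), "hit" if cache.access(addr) else "miss")
```

Addresses 0x00 and 0x40 share index 0 but differ in tag, so they evict each other even though the other three lines are empty.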
Direct Map Address Selection
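[Figure: direct-mapped cache with 2^k lines as before, but with the address fields ordered Index | Tag | Block Offset, so the k-bit index comes from the higher-order bits; tag compare (=) with valid bit V produces HIT and the data word or byte]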
Hashed Address Selection
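[Figure: cache with 2^k lines in which the address bits above the b-bit block offset are hashed to select a line; tag compare (=) with valid bit V produces HIT and the data word or byte]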
2-Way Set-Associative Cache
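[Figure: 2-way set-associative cache; address split into t-bit Tag, k-bit Index, and b-bit Block Offset; each way holds a valid bit V, a tag, and a data block, and the tag compare (=) produces HIT]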
Fully Associative Cache
Write Performance
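[Figure: cache with 2^k lines (Tag, Data, valid bit V); address split into t-bit Tag, k-bit Index, and b-bit Block Offset; tag compare (=) produces HIT and the data word or byte]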
Improving Cache Performance
§ Average memory access time = Hit time + Miss rate x Miss penalty
§ To improve performance:
§ Reduce the hit time
§ Reduce the miss rate (e.g., larger cache)
§ Reduce the miss penalty (e.g., L2 cache)
§ What is the simplest design strategy?
Effect of Cache on Performance
§ Larger cache size
§ Reduces capacity and conflict misses
§ Hit time will increase
§ Higher associativity
§ Reduces conflict misses
§ May increase hit time
§ Larger block size
§ Reduces compulsory misses
Replacement Policy
§ Which block from a set should be evicted?
§ Random
§ Least Recently Used (LRU)
§ LRU cache state must be updated on every access
§ True implementation only feasible for small sets (2-way)
§ Pseudo-LRU binary tree often used for 4-8 way
§ First In, First Out (FIFO) a.k.a. Round-Robin
§ Used in highly associative caches
§ Not Least Recently Used (NLRU)
§ FIFO with exception for most recently used block or
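True LRU for a single set can be sketched with an ordered dictionary that is reordered on every access, which mirrors the point that LRU state must be updated on every access. This is a software illustration, not how hardware tracks recency; the class name `LRUSet` is hypothetical.

```python
from collections import OrderedDict


class LRUSet:
    """One cache set with true LRU replacement over `ways` lines."""

    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()   # tag -> None, least recently used first

    def access(self, tag):
        """Return (hit, evicted_tag); evicted_tag is None if nothing was evicted."""
        if tag in self.lines:
            self.lines.move_to_end(tag)      # update recency on every access
            return True, None
        victim = None
        if len(self.lines) == self.ways:
            victim, _ = self.lines.popitem(last=False)   # evict the LRU line
        self.lines[tag] = None
        return False, victim


lru = LRUSet(ways=2)
for t in ("A", "B", "A", "C"):
    print(t, lru.access(t))
```

In the trace above, the final access to C evicts B rather than A, because A was touched more recently.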
Reducing Write Hit Time
§ Problem: Writes take two cycles in memory stage, one cycle for tag check plus one cycle for data write if hit
§ Solutions
§ Design data RAM that can perform read and write in one cycle, restore old value after tag miss
§ Fully-associative (CAM Tag) caches: Word line only enabled if hit
Victim Caches
§ Victim cache is a small associative backup cache, added to a direct-mapped cache, which holds recently evicted lines
[Figure: CPU with RF, L1 data cache, and unified L2 cache; data evicted from L1 goes into the victim cache]
1. First look up in direct mapped cache
2. If miss, look in victim cache
3. If hit in victim cache, swap hit line with line now evicted from L1
4. If miss in victim cache, L1 victim -> VC, VC victim->?
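The four steps can be sketched as a block-level model. Two choices below are assumptions that the slides leave open: the victim cache replaces its oldest entry (FIFO), and a line evicted from the victim cache is simply dropped. All names are illustrative, and `victim_entries` is expected to be at least 1.

```python
from collections import OrderedDict


class DirectMappedWithVictim:
    """Direct-mapped L1 (block granularity) backed by a small victim cache."""

    def __init__(self, num_lines, victim_entries):
        self.num_lines = num_lines
        self.l1 = [None] * num_lines
        self.victim_entries = victim_entries
        self.victim = OrderedDict()   # block -> None, oldest first

    def access(self, block):
        slot = block % self.num_lines
        # 1. First look up in the direct-mapped cache
        if self.l1[slot] == block:
            return "l1_hit"
        evicted = self.l1[slot]       # line displaced from L1 (may be None)
        self.l1[slot] = block
        # 2. If miss, look in the victim cache
        if block in self.victim:
            # 3. Hit in VC: the hit line moves to L1, the L1 victim takes its place
            del self.victim[block]
            result = "victim_hit"
        else:
            # 4. Miss in VC: block comes from the next level
            result = "miss"
        if evicted is not None:
            if len(self.victim) == self.victim_entries:
                self.victim.popitem(last=False)   # assumption: VC victim is dropped
            self.victim[evicted] = None
        return result


cache = DirectMappedWithVictim(num_lines=4, victim_entries=2)
print([cache.access(b) for b in (0, 4, 0, 4, 4)])
```

Blocks 0 and 4 conflict in L1, but after the first two misses they ping-pong between L1 and the victim cache instead of going to the next level.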
Pipelining Cache Writes
§ Data from a store hit written into data portion of cache during tag access of subsequent store
[Figure: cache write pipeline with Tags and Data arrays; labels: Address and Store Data From CPU (Tag, Index, Store Data), Delayed Write Addr., Delayed Write Data, two comparators (=?), Load/Store select, Load Data to CPU]
Write Policy Choices
§ Cache hit:
§ Write through
§ Write both cache & memory
§ Generally higher traffic but simplifies cache coherence
§ Write back
§ Write cache only (memory is written only when the entry is evicted)
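The traffic difference between the two write-hit policies can be illustrated by counting main-memory writes for a stream of stores. The sketch assumes a direct-mapped cache at block granularity with write allocate on a miss, and it does not count dirty lines still in the cache at the end; names and the example trace are made up.

```python
def count_memory_writes(write_blocks, num_lines, policy):
    """Count main-memory writes caused by a sequence of stores to blocks."""
    if policy not in ("write_through", "write_back"):
        raise ValueError("policy must be 'write_through' or 'write_back'")
    lines = [None] * num_lines     # block currently held by each line
    dirty = [False] * num_lines    # only meaningful for write back
    memory_writes = 0
    for block in write_blocks:
        slot = block % num_lines
        if lines[slot] != block:
            # Write miss: allocate the line (assumed), writing back a dirty victim
            if policy == "write_back" and dirty[slot]:
                memory_writes += 1
            lines[slot] = block
            dirty[slot] = False
        if policy == "write_through":
            memory_writes += 1     # every store also writes memory
        else:
            dirty[slot] = True     # memory is written only on eviction
    return memory_writes


trace = [0, 0, 0, 4, 4]            # blocks 0 and 4 share a line when num_lines = 4
print(count_memory_writes(trace, 4, "write_through"))
print(count_memory_writes(trace, 4, "write_back"))
```

Repeated stores to the same block are where write back saves traffic: they are absorbed by the cache until the line is evicted.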
Write Policy Choices
§ Cache miss:
§ No write allocate: only write to main memory
§ Write allocate (aka fetch on write): fetch into cache
§ Common combinations:
Reducing Read Miss Penalty
[Figure: CPU with RF, data cache, and unified L2 cache, with a write buffer between the data cache and the L2 cache]
§ Write buffer holds evicted dirty lines for writeback cache OR
§ Problem:
§ Write buffer may hold updated value of location needed by a read miss – RAW data hazard
§ Stall:
§ On a read miss, wait for the write buffer to go empty
§ Bypass:
Prefetching
§ Speculate on future instruction and data accesses and fetch them into cache(s)
§ Instruction accesses easier to predict than data accesses
§ Varieties of prefetching
§ Hardware prefetching
§ Software prefetching
§ Mixed schemes
Issues in Prefetching
§ Usefulness – should produce hits
§ Timeliness – not late and not too early
§ Cache and bandwidth pollution
§ Most recent
§ Security / side-channel issues
Hardware Instruction Prefetching
§ Instruction prefetch in Alpha AXP 21064
§ Fetch two blocks on a miss; the requested block (i) and the next consecutive block (i+1)
§ Requested block placed in cache, and next block in instruction stream buffer
§ If miss in cache but hit in stream buffer, move stream buffer block into cache and prefetch next block (i+2)
Hardware Data Prefetching
§ Prefetch-on-miss:
§ Prefetch b + 1 upon miss on b
§ One Block Lookahead (OBL) scheme
§ Initiate prefetch for block b + 1 when block b is accessed
§ Why is this different from doubling block size?
§ Can extend to N block lookahead
§ Strided prefetch
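The difference between prefetch-on-miss and one block lookahead shows up on a sequential scan. The sketch below counts demand misses under strong simplifying assumptions: the cache never evicts, and every prefetch arrives before it is needed (timeliness and pollution are ignored). The scheme names are labels chosen for this example.

```python
def count_demand_misses(block_trace, scheme):
    """Count demand misses for 'none', 'prefetch_on_miss', or 'one_block_lookahead'."""
    if scheme not in ("none", "prefetch_on_miss", "one_block_lookahead"):
        raise ValueError("unknown scheme: " + scheme)
    cached = set()    # idealized cache: unlimited capacity, no evictions
    misses = 0
    for b in block_trace:
        hit = b in cached
        if not hit:
            misses += 1
            cached.add(b)
        if scheme == "prefetch_on_miss" and not hit:
            cached.add(b + 1)     # prefetch b + 1 only upon a miss on b
        elif scheme == "one_block_lookahead":
            cached.add(b + 1)     # prefetch b + 1 on every access to b
    return misses


sequential = list(range(8))
for scheme in ("none", "prefetch_on_miss", "one_block_lookahead"):
    print(scheme, count_demand_misses(sequential, scheme))
```

Under these assumptions prefetch-on-miss still misses on every other block of a sequential scan, while lookahead on every access misses only on the first block.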
Itanium-2 On-Chip Caches
§ Level 1, 16KB, 4-way s.a., 64B line, quad-port (2 load+2 store), single cycle latency
§ Level 2, 256KB, 4-way s.a., 128B line, quad-port (4 load or 4 store), five cycle latency