ECEN 676 Advanced Computer Architecture

(1)


Prof. Michel A. Kinsy

(2)

The course has 4 modules

Module 1
– Instruction Set Architecture (ISA)
– Simple Pipelining and Hazards
– Branch Prediction

Module 2
– Superscalar Architectures
– Vector machines
– VLIW
– Multithreading
– GPU

Module 3
– Caches
– Memory Models & Synchronization
– Cache Coherence Protocols

Module 4
– On-Chip networks

(3)

Architecture Taxonomy

Processor Organizations

– Single instruction, single data stream (SISD)
  – Uniprocessor
– Single instruction, multiple data stream (SIMD)
  – Vector Processor
  – Array Processor
– Multiple instruction, single data stream (MISD)
– Multiple instruction, multiple data stream (MIMD)
  – Shared Memory (Tightly Coupled)
    – Symmetric Multiprocessor (SMP)
    – Nonuniform Memory Access (NUMA)
  – Distributed Memory (Loosely Coupled)
    – Cluster

(4)

CPU-Memory Bottleneck

§ Performance of high-speed computers is usually limited by memory bandwidth & latency

§ Latency (time for a single access): Memory access time >> Processor cycle time

§ Bandwidth (number of accesses per unit time): if fraction m of instructions access memory, there are 1+m memory references / instruction

§ Ghost of the stored-program architecture
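
As a rough illustration of the bandwidth bullet, the sketch below turns the 1+m figure into a references-per-second demand. The function name and the sample numbers (2 GHz clock, one instruction per cycle, m = 0.3) are assumptions chosen for illustration, not values from the slides.

```python
def memory_refs_per_second(clock_hz, ipc, m):
    """Memory references issued per second.

    Every instruction is fetched from memory (1 reference), and a
    fraction m of instructions also performs a data access.
    """
    refs_per_instruction = 1 + m
    return clock_hz * ipc * refs_per_instruction


# Hypothetical machine: 2 GHz, 1 instruction per cycle, 30% loads/stores.
demand = memory_refs_per_second(clock_hz=2e9, ipc=1.0, m=0.3)
print(f"{demand:.2e} memory references per second")
```

Under these assumed numbers the memory system would have to sustain well over two billion references per second, which is why bandwidth and not just latency limits performance.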

(5)

Processor-Memory Gap

§ Performance gap: CPU (55% each year) vs. DRAM (7% each year)

§ Processor operations take of the order of 1 ns

§ Memory access requires 10s or even 100s of ns

§ Each instruction executed involves at least one memory access

[Figure: Processor-Memory Performance Gap. Performance (log scale, 1 to 1000) vs. time (1980 to 2000). µProc (CPU): 60%/year ("Moore's Law"); DRAM: 7%/year; the gap grows 50%/year]

(6)

Processor-DRAM Gap (latency)

§ Four-issue 2GHz superscalar accessing 100ns DRAM could execute 800 instructions during time for one memory access!
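
The arithmetic behind the 800-instruction figure can be checked with a short sketch. The function and parameter names are hypothetical; the inputs are the ones stated on the slide.

```python
def instructions_per_memory_access(issue_width, clock_ghz, dram_latency_ns):
    """Peak instruction slots that pass during one DRAM access."""
    # A clock of f GHz completes f cycles per nanosecond.
    cycles = clock_ghz * dram_latency_ns
    return issue_width * cycles


# Four-issue, 2 GHz core, 100 ns DRAM: 2 * 100 = 200 cycles, 4 * 200 = 800.
print(instructions_per_memory_access(issue_width=4, clock_ghz=2, dram_latency_ns=100))
```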


(7)

Memory Trends

§ The fastest memories are expensive and thus not very large

Hierarchy: Reg <-> L1 $ <-> Ln $ <-> Main Memory <-> Secondary Memory

Level         Capacity   Access Time   Cost (per GB)   Transfer unit to next level
Reg           100s B     ns            $Millions       4-8 bytes (word)
L1 $          10s KB     few ns        $100s Ks        8-32 bytes (block)
Ln $          MBs        10s ns        $10s Ks         1 to 4 blocks
Main Memory   100s MB    100s ns       $1000s          1,024+ bytes (disk sector = page)

(8)

Illustrative View of Memory Organization

§ A fast memory can help bridge the CPU-memory gap

(9)

(10)

(11)

Memory Technology

§ Early machines used a variety of memory technologies

§ Manchester Mark I used CRT Memory Storage

§ EDVAC used a mercury delay line

§ Core memory was the first large-scale reliable main memory

§ Invented by Forrester in late 40s at MIT for Whirlwind project

§ Bits stored as magnetization polarity on small

(12)

Memory Technology

§ First commercial DRAM was Intel 1103

§ 1Kbit of storage on single chip

§ Charge on a capacitor used to hold value

§ Semiconductor memory quickly replaced core in 1970s

§ Intel formed to exploit market for semiconductor memory

§ Phase change memory (PCM) looking promising for the future

(13)

Memory Technology

§ Random Access Memory (RAM)

§ Any byte of memory can be accessed without touching the preceding bytes

§ RAM is the most common type of memory found in computers and other digital devices

§ There are two main types of RAM

§ DRAM (Dynamic Random Access Memory)

§ Needs to be “refreshed” regularly (~ every 8 ms)

§ 1% to 2% of the active cycles of the DRAM

§ Used for Main Memory

(14)

Memory Technology

§ SRAM (Static Random Access Memory)

§ Content will last until power turned off

§ Low density (6 transistor cells), high power, expensive, fast

(15)

RAM Organization

§ One memory row holds a block of data, so the column address selects the requested bit or word from that block

[Figure: memory array with a row address decoder; rows 1 to 2^N, columns 1 to 2^M]
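
A minimal sketch of the row/column selection described on this slide, for an array of 2^N rows and 2^M columns. The slide does not say which address bits feed the row decoder, so placing the row in the high-order bits and the column in the low-order bits is an assumption, as is the function name.

```python
def split_dram_address(addr, row_bits, col_bits):
    """Split a cell address into (row, column) for a 2^N x 2^M array.

    Assumption: the row number occupies the high-order bits and the
    column number the low-order bits.
    """
    col = addr & ((1 << col_bits) - 1)
    row = (addr >> col_bits) & ((1 << row_bits) - 1)
    return row, col


# 16 rows x 16 columns (N = M = 4): address 0b1011_0110 -> row 11, column 6.
row, col = split_dram_address(0b10110110, row_bits=4, col_bits=4)
print(row, col)
```

The row decoder would first open the whole row, and the column value then picks the requested bit or word out of that row.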

(16)

DRAM Architecture

§ Modern chips have around 4 logical banks on each chip

§ Each logical bank physically implemented as many smaller arrays


(17)

RAM Organization

§ One memory row holds a block of data, so the column address selects the requested bit or word from that block

§ RAS or Row Access Strobe triggering row decoder

§ CAS or Column Access Strobe triggering column

(18)

RAM Organization

§ Latency: Time to access one word

§ Access time: time between the request and when the data is available (or written)

§ Cycle time: time between requests

§ Usually cycle time > access time

§ Bandwidth: How much data from the memory can be supplied to the processor per unit time

(19)

Typical Memory Reference Patterns

[Figure: memory references plotted as address vs. time. Regions: instruction fetches, stack accesses, data accesses. Annotations: n loop iterations, subroutine call, subroutine return, argument access, vector access]

(20)

A Typical Memory Hierarchy

[Figure: CPU with multi-ported register file (RF, part of CPU), split L1 instruction and L1 data caches, unified L2 cache, and main memory]

(21)

Definition of a Cache

§ A cache is simply a copy of a small data segment residing in the main memory

§ Fast but small extra memory

§ Hold identical copies of main memory

§ Lower latency

§ Higher bandwidth

(22)

Cache Structures

[Figure: processor, cache, and main memory connected by address and data paths; each cache entry holds an address tag and data]

(23)

Caching & Cache Structures


(24)

Caching & Cache Structures

[Figure: processor, cache, and main memory connected by address and data paths; each cache entry holds an address tag and a data block]

(25)

Multilevel Caches

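§ Cache is transparent to user (happens automatically)

[Figure: CPU (Reg File) <-> Cache Memory <-> Main Memory. Words move between the register file and the cache, lines between the cache and main memory. Data is in the cache fraction h of the time]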

(26)

Multilevel Caches

§ Cache is transparent to user (happens automatically)

[Figure: CPU (Reg File) <-> Cache Memory <-> Main Memory. Words move between the register file and the cache, lines between the cache and main memory. Data is in the cache fraction h of the time; go to main memory 1 – h of the time]

For a cache with hit rate h, effective access time is:
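
The formula that completes this slide is not present in the text. As a hedged sketch, assuming it takes the same form as the AMAT expression used later in the deck (hit time plus miss rate times miss penalty), the effective access time can be computed as follows. The function name and the sample timings are illustrative assumptions.

```python
def effective_access_time(h, t_hit, miss_penalty):
    """Effective access time for hit rate h.

    Assumed form (matches the deck's AMAT definition): every access
    pays t_hit, and the fraction (1 - h) that misses also pays the
    miss penalty of going to main memory.
    """
    return t_hit + (1 - h) * miss_penalty


# Hypothetical numbers: 1-cycle hit, 100-cycle penalty, 95% hit rate.
print(effective_access_time(h=0.95, t_hit=1, miss_penalty=100))
```

With these assumed numbers the result is about 6 cycles, far closer to the cache's speed than to main memory's.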

(27)

Caches

§ This organization works because most programs exhibit locality

§ The principle of temporal locality says that if a program accesses one memory address, there is a good chance that it will access the same address in the near future

§ The principle of spatial locality says that if a program accesses one memory address, there is a good chance that it will also access other nearby addresses
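
To make spatial locality concrete, the sketch below compares two traversal orders of the same 2-D array and measures how often consecutive accesses fall in the same cache block. The array size (64 x 64), element size (4 bytes), and block size (64 bytes) are assumptions for illustration only.

```python
def same_block_fraction(addresses, block_size=64):
    """Fraction of consecutive access pairs that touch the same block."""
    blocks = [a // block_size for a in addresses]
    same = sum(1 for prev, cur in zip(blocks, blocks[1:]) if prev == cur)
    return same / (len(blocks) - 1)


N, ELEM = 64, 4  # hypothetical 64 x 64 array of 4-byte elements, row-major layout

# Row-major traversal walks memory sequentially.
row_major = [(i * N + j) * ELEM for i in range(N) for j in range(N)]
# Column-major traversal jumps N * ELEM bytes between accesses.
col_major = [(i * N + j) * ELEM for j in range(N) for i in range(N)]

print(f"row-major:    {same_block_fraction(row_major):.3f}")
print(f"column-major: {same_block_fraction(col_major):.3f}")
```

The sequential order reuses each fetched block many times, while the strided order touches a new block on every access, so a cache helps the first pattern far more than the second.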

(28)

Caching Principles

§ Cache contains copies of some of Main Memory

§ Those storage locations recently used

§ When Main Memory address A is referenced in CPU

§ Cache checked for a copy of contents of A

§ If found, cache hit

§ Copy used

§ No need to access Main Memory

§ If not found, cache miss

§ Main Memory accessed to get contents of A

(29)

Caching principles

§ Cache size (in bytes or words)

§ Total cache capacity

§ A larger cache can hold more of the program’s useful data but is more costly and likely to be slower

§ Block or cache-line size

§ Unit of data transfer between cache and main memory

(30)

Caching principles

§ Placement policy

§ Determining where an incoming cache line is stored

§ More flexible policies imply higher hardware cost and may or may not have performance benefits (due to more complex data location)

§ Replacement policy

§ Determining which of several existing cache blocks (into which a new cache line can be mapped) should be overwritten

(31)

Caching Principles

§ Compulsory misses

§ With on-demand fetching, first access to any item is a miss

§ Capacity misses

§ We have to evict some items to make room for others

§ This leads to misses that are not incurred with an infinitely large cache

§ Conflict misses

§ The placement scheme may force us to displace useful items to bring in other items

(32)

Caching principles

§ Line width (2^W)

§ Too small a value for W causes a lot of main memory accesses

§ Too large a value increases the miss penalty and may tie up cache space with low-utility items that are replaced before being used

§ Set size or associativity (2^S)

§ Direct mapping (S = 0) is simple and fast

(33)

Cache Algorithm (Read)

§ Look at Processor Address, search cache tags to find match. Then either

§ Found in cache, a.k.a. HIT: return copy of data from cache

§ Not in cache, a.k.a. MISS: read block of data from Main Memory, wait …, then return data to processor and update cache
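
The read algorithm above can be sketched as a tiny simulator. This is a minimal illustration that assumes a direct-mapped organization (one candidate line per block) and tracks only tags, not data; the class and method names are hypothetical.

```python
class DirectMappedCache:
    """Tag-only model of a direct-mapped cache (assumed organization)."""

    def __init__(self, num_lines, block_size):
        self.num_lines = num_lines
        self.block_size = block_size
        self.tags = [None] * num_lines  # None marks an invalid line
        self.hits = 0
        self.misses = 0

    def read(self, addr):
        """Return True on a hit, False on a miss (the line is then filled)."""
        block = addr // self.block_size
        index = block % self.num_lines   # which line the block maps to
        tag = block // self.num_lines    # identifies the block within that line
        if self.tags[index] == tag:
            self.hits += 1               # HIT: data comes from the cache
            return True
        self.misses += 1                 # MISS: read block from main memory
        self.tags[index] = tag           # ... and update the cache
        return False


cache = DirectMappedCache(num_lines=4, block_size=16)
trace = [0, 4, 16, 64, 0]
results = [cache.read(a) for a in trace]
print(results, cache.hits, cache.misses)
```

In this trace, address 4 hits because it shares a block with address 0, while address 64 maps to the same line as address 0 and evicts it, so the final read of 0 misses again.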

(34)

Caches

§ Local miss rate = misses in cache / accesses to cache

§ Global miss rate = misses in cache / CPU memory accesses

§ Misses per instruction = misses in cache / number of instructions
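
The difference between local and global miss rates shows up at the second level of a hierarchy. The sketch below assumes a two-level hierarchy in which L2 is accessed only on L1 misses; the function name and the counts are illustrative assumptions.

```python
def miss_rates(cpu_accesses, l1_misses, l2_misses):
    """Return (L1 miss rate, L2 local miss rate, L2 global miss rate).

    Assumption: every L1 miss, and nothing else, becomes an L2 access.
    """
    l1_rate = l1_misses / cpu_accesses    # local == global for L1
    l2_local = l2_misses / l1_misses      # misses in L2 / accesses to L2
    l2_global = l2_misses / cpu_accesses  # misses in L2 / CPU memory accesses
    return l1_rate, l2_local, l2_global


# Hypothetical counts: 1000 CPU accesses, 40 L1 misses, 20 L2 misses.
print(miss_rates(cpu_accesses=1000, l1_misses=40, l2_misses=20))
```

Here L2 looks poor by its local miss rate (half of its accesses miss) but only 2% of all CPU accesses go to main memory, which is what the global rate reports.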

(35)


Cache Performance Metrics

§ Cache miss rate

§ Number of cache misses divided by number of accesses

§ Cache hit time

§ Time between sending address and data returning from cache

§ Cache miss latency

§ Time between sending address and data returning from next-level cache/memory

§ Cache miss penalty

(36)

Average Memory Access Time

§ Average Memory Access Time (AMAT)

§ AMAT = Hit time + (Miss rate x Miss penalty)

§ Memory stall cycles = Memory accesses x miss rate x miss penalty

§ CPU time = (CPU execution cycles + Memory stall cycles) x Cycle time

§ CPI = ideal CPI + average stalls per instruction

§ Having L1 and L2 Caches

§ AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1

§ Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2

§ AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)
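
A small sketch of the two-level AMAT calculation, following the substitution of Miss Penalty_L1 shown on this slide. Parameter names and the sample values are assumptions; Miss Rate_L2 is taken to be the local L2 miss rate.

```python
def amat_two_level(hit_l1, miss_rate_l1, hit_l2, miss_rate_l2, miss_penalty_l2):
    """AMAT with L1 and L2 caches (times in cycles, rates as fractions)."""
    # Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
    miss_penalty_l1 = hit_l2 + miss_rate_l2 * miss_penalty_l2
    # AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
    return hit_l1 + miss_rate_l1 * miss_penalty_l1


# Hypothetical values: 1-cycle L1 hit, 4% L1 miss rate, 10-cycle L2 hit,
# 50% local L2 miss rate, 100-cycle memory penalty.
print(amat_two_level(1, 0.04, 10, 0.5, 100))
```

With these assumed values the L1 miss penalty is 60 cycles and the AMAT is about 3.4 cycles.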

(37)

Placement Policy

[Figure: an 8-block cache (blocks 0 to 7; set numbers 0 to 3) under three placement policies. Fully associative: a block can go anywhere. (2-way) set associative: anywhere in one set. Direct mapped: only into one block]

(38)

Address Bit-Field Partitioning

§ The address (e.g., 32-bit) issued by the CPU is generally divided into 3 fields

§ Tag

§ Serves as the unique identifier for a group of data

§ Different regions of memory may be mapped to the same cache location/block

§ The tag is used to differentiate between them

§ Index

§ It is used to index into the cache structure

§ Block Offset

§ The least significant bits are used to determine the exact data word

§ If the block size is B then b = log₂B bits will be needed in the address to specify data word

[Figure: Address = | Tag | Index | Block Offset |]
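
The three-field split can be sketched with shifts and masks. The function name and the example geometry (64-byte blocks, 256 index entries) are assumptions for illustration; only the tag/index/offset ordering comes from the slide.

```python
def split_address(addr, index_bits, offset_bits):
    """Split an address into (tag, index, block offset)."""
    offset = addr & ((1 << offset_bits) - 1)                 # least significant b bits
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)  # next k bits
    tag = addr >> (offset_bits + index_bits)                 # remaining high bits
    return tag, index, offset


block_size = 64                           # B bytes per block (hypothetical)
offset_bits = block_size.bit_length() - 1  # b = log2(B) for a power-of-two B
tag, index, offset = split_address(0x12345678, index_bits=8, offset_bits=offset_bits)
print(hex(tag), hex(index), hex(offset))
```

Shifting the fields back together, `(tag << 14) | (index << 6) | offset`, reproduces the original address, which is a quick way to sanity-check a chosen split.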

(39)

Direct-Mapped Cache

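[Figure: direct-mapped cache. Address fields: tag (t bits), index (k bits), block offset (b bits). The index selects one of 2^k lines, each holding a valid bit (V), a tag, and a data block; the stored tag is compared (=) with the address tag to signal HIT, and the block offset selects the data word or byte]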

(40)

Direct Map Address Selection

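[Figure: direct-mapped cache with 2^k lines (V, tag, data block). Address fields in the order index (k bits), tag (t bits), block offset (b bits); tag comparison (=)]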

(41)

Hashed Address Selection

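[Figure: cache with 2^k lines (V, tag, data block) selected by a hash of the address. Address fields: tag (t bits) and block offset (b bits); tag comparison (=)]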

(42)

2-Way Set-Associative Cache

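[Figure: 2-way set-associative cache. Address fields: tag (t bits), index (k bits), block offset (b bits). Each way holds V, tag, and data block; the tags of the selected set are compared (=) to signal HIT]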

(43)

Fully Associative Cache

(44)

Write Performance

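[Figure: cache array of 2^k lines, each with a valid bit (V), tag (t bits), and data. Address fields: tag (t bits), index (k bits), block offset (b bits); tag comparison (=)]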

(45)

Improving Cache Performance

§ Average memory access time = Hit time + Miss rate x Miss penalty

§ To improve performance:

§ Reduce the hit time

§ Reduce the miss rate (e.g., larger cache)

§ Reduce the miss penalty (e.g., L2 cache)

§ What is the simplest design strategy?

(46)

Effect of Cache on Performance

§ Larger cache size

§ Reduces capacity and conflict misses

§ Hit time will increase

§ Higher associativity

§ Reduces conflict misses

§ May increase hit time

§ Larger block size

§ Reduces compulsory misses

(47)

Replacement Policy

§ Which block from a set should be evicted?

§ Random

§ Least Recently Used (LRU)

§ LRU cache state must be updated on every access

§ True implementation only feasible for small sets (2-way)

§ Pseudo-LRU binary tree often used for 4-8 way

§ First In, First Out (FIFO) a.k.a. Round-Robin

§ Used in highly associative caches

§ Not Least Recently Used (NLRU)

§ FIFO with exception for most recently used block or
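
As an illustration of the LRU bullet, the sketch below models a single cache set with true LRU replacement. It is a software model only (hardware would use a few state bits per set); the class name and the use of `OrderedDict` to keep recency order are choices made for this example.

```python
from collections import OrderedDict


class LRUSet:
    """One cache set with true LRU replacement (software model)."""

    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()  # tag -> None, least recently used first

    def access(self, tag):
        """Return (hit, evicted_tag); evicted_tag is None if nothing was evicted."""
        if tag in self.lines:
            self.lines.move_to_end(tag)  # LRU state is updated on every access
            return True, None
        victim = None
        if len(self.lines) == self.ways:
            victim, _ = self.lines.popitem(last=False)  # evict the LRU line
        self.lines[tag] = None
        return False, victim


lru = LRUSet(ways=2)
outcomes = [lru.access(t) for t in ["A", "B", "A", "C", "B"]]
print(outcomes)
```

The re-access of A makes B the least recently used line, so C evicts B rather than A; a FIFO policy would have evicted A at that point.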

(48)

Reducing Write Hit Time

§ Problem: Writes take two cycles in memory stage, one cycle for tag check plus one cycle for data write if hit

§ Solutions

§ Design data RAM that can perform read and write in one cycle, restore old value after tag miss

§ Fully-associative (CAM Tag) caches: Word line only enabled if hit

(49)

Victim Caches

§ Victim cache is a small associative backup cache, added to a direct-mapped cache, which holds recently evicted lines

[Figure: CPU with RF, L1 data cache, and unified L2 cache; evicted data from L1 goes to the victim cache]

(50)

Victim Caches

§ Victim cache is a small associative backup cache, added to a direct-mapped cache, which holds recently evicted lines

1. First look up in direct mapped cache

2. If miss, look in victim cache

3. If hit in victim cache, swap hit line with line now evicted from L1

4. If miss in victim cache, L1 victim -> VC, VC victim->?
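
The four lookup steps can be sketched as a block-level model. The slide leaves the fate of the victim cache's own victim open ("VC victim->?"); this sketch simply discards the oldest victim-cache entry, which is an assumption, as are the class and method names.

```python
from collections import OrderedDict


class L1WithVictimCache:
    """Direct-mapped L1 plus a small fully associative victim cache."""

    def __init__(self, num_lines, victim_entries):
        self.num_lines = num_lines
        self.l1 = [None] * num_lines      # block number held by each L1 line
        self.victim = OrderedDict()       # block number -> None, oldest first
        self.victim_entries = victim_entries

    def access(self, block):
        index = block % self.num_lines
        if self.l1[index] == block:       # step 1: look up in L1
            return "l1_hit"
        evicted = self.l1[index]
        if block in self.victim:          # step 2: look in victim cache
            del self.victim[block]        # step 3: swap with the L1 line
            outcome = "victim_hit"
        else:
            outcome = "miss"              # step 4: fetch from the next level
        self.l1[index] = block
        if evicted is not None:           # L1 victim -> VC
            if len(self.victim) == self.victim_entries:
                self.victim.popitem(last=False)  # assumption: drop oldest VC entry
            self.victim[evicted] = None
        return outcome


vc = L1WithVictimCache(num_lines=4, victim_entries=2)
outcomes = [vc.access(b) for b in [0, 4, 0, 0, 4]]
print(outcomes)
```

Blocks 0 and 4 conflict in the direct-mapped L1, yet after the first two compulsory misses they ping-pong between L1 and the victim cache instead of going back to L2.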

(51)

(52)

Pipelining Cache Writes

§ Data from a store hit written into data portion of cache during tag access of subsequent store

[Figure: pipelined cache write datapath. Labels: Address and Store Data From CPU; Tag, Index, Store Data; Tags array and Data array; Delayed Write Addr. and Delayed Write Data registers; tag comparison (=?); Load/Store select; Load Data to CPU]

(53)

Write Policy Choices

§ Cache hit:

§ Write through

§ Write both cache & memory

§ Generally higher traffic but simplifies cache coherence

§ Write back

§ Write cache only (memory is written only when the entry is evicted)
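
The traffic difference between the two hit policies can be illustrated with a deliberately simplified count. The sketch assumes every store hits in the cache and that each dirty block is evicted exactly once after all stores complete; the function name and the store stream are hypothetical.

```python
def memory_writes(store_blocks, policy):
    """Count writes that reach memory for a stream of store hits.

    Simplifying assumptions: all stores hit, and under write back each
    dirty block is written to memory once, when it is eventually evicted.
    """
    if policy == "write_through":
        return len(store_blocks)       # every store also updates memory
    if policy == "write_back":
        return len(set(store_blocks))  # one writeback per dirty block
    raise ValueError(f"unknown policy: {policy}")


stores = [1, 1, 1, 2, 2]  # block numbers written by five stores
print(memory_writes(stores, "write_through"), memory_writes(stores, "write_back"))
```

Repeated stores to the same block are where write back saves traffic; with no reuse the two counts would be equal.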

(54)

Write Policy Choices

§ Cache miss:

§ No write allocate: only write to main memory

§ Write allocate (aka fetch on write): fetch into cache

§ Common combinations:

(55)

Reducing Read Miss Penalty

[Figure: CPU with RF, data cache, write buffer, and unified L2 cache]

Evicted dirty lines for writeback cache OR

(56)

Reducing Read Miss Penalty

§ Problem:

§ Write buffer may hold updated value of location needed by a read miss – RAW data hazard

§ Stall:

§ On a read miss, wait for the write buffer to go empty

§ Bypass:

(57)

Prefetching

§ Speculate on future instruction and data accesses and fetch them into cache(s)

§ Instruction accesses easier to predict than data accesses

§ Varieties of prefetching

§ Hardware prefetching

§ Software prefetching

§ Mixed schemes

(58)

Issues in Prefetching

§ Usefulness – should produce hits

§ Timeliness – not late and not too early

§ Cache and bandwidth pollution

§ Most recent

§ Security / side-channel issues

(59)

Hardware Instruction Prefetching

§ Instruction prefetch in Alpha AXP 21064

§ Fetch two blocks on a miss; the requested block (i) and the next consecutive block (i+1)

§ Requested block placed in cache, and next block in instruction stream buffer

§ If miss in cache but hit in stream buffer, move stream buffer block into cache and prefetch next block (i+2)

(60)

Hardware Data Prefetching

§ Prefetch-on-miss:

§ Prefetch b + 1 upon miss on b

§ One Block Lookahead (OBL) scheme

§ Initiate prefetch for block b + 1 when block b is accessed

§ Why is this different from doubling block size?

§ Can extend to N block lookahead

§ Strided prefetch
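
To contrast prefetch-on-miss with one block lookahead, the sketch below counts misses on a block-address stream. It assumes an unbounded cache and prefetches that arrive before the next access, so it shows only the triggering difference between the schemes; the function and scheme names are hypothetical.

```python
def count_misses(block_stream, scheme):
    """Count demand misses under 'none', 'prefetch_on_miss', or 'obl'."""
    cached = set()  # assumption: unbounded cache, prefetches complete instantly
    misses = 0
    for b in block_stream:
        hit = b in cached
        if not hit:
            misses += 1
            cached.add(b)
        # Prefetch-on-miss triggers only on a miss; OBL triggers on every access.
        if scheme == "obl" or (scheme == "prefetch_on_miss" and not hit):
            cached.add(b + 1)
    return misses


sequential = list(range(8))
for scheme in ("none", "prefetch_on_miss", "obl"):
    print(scheme, count_misses(sequential, scheme))
```

On a sequential stream, prefetch-on-miss halves the misses because every other block was prefetched, while OBL keeps running ahead and misses only on the first block.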

(61)

Itanium-2 On-Chip Caches

§ Level 1, 16KB, 4-way s.a., 64B line, quad-port (2 load+2 store), single cycle latency

§ Level 2, 256KB, 4-way s.a., 128B line, quad-port (4 load or 4 store), five cycle latency

(62)