Main Memory

Memories

SIMM: single inline memory module, 72 pins

DIMM: dual inline memory module, 168 pins

Main memory

–Static RAMs (SRAM): fast but expensive
–Dynamic RAMs (DRAM)


Memory cycle

–Address the memory cell
–Read or load the datum
–DRAM: let the memory recover

Dynamic RAMs (DRAM): the memory contents must be refreshed regularly.

[Figure: cycle time versus access time]

Memory banks

[Figure: CPU, cache, and bus connected to memory banks 0 to 3; memory interleaving]

Interleaving

Locality of memory references

In the great majority of cases, memory accesses go to successive addresses.

⇒ Accesses that follow one another in time then refer to different modules, so no access conflict arises.

[Figure: address space]
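The mapping from addresses to modules can be illustrated with a minimal sketch. It assumes low-order interleaving over four banks (the bank count matches the memory-bank figure; the function names and the word-addressed model are illustrative assumptions, not part of the slides).

```python
NUM_BANKS = 4  # assumption: four banks, numbered 0 to 3


def bank_of(address: int, num_banks: int = NUM_BANKS) -> int:
    """Low-order interleaving: successive word addresses map to successive banks."""
    return address % num_banks


def has_conflict(addresses, num_banks: int = NUM_BANKS) -> bool:
    """True if two back-to-back accesses go to the same bank."""
    banks = [bank_of(a, num_banks) for a in addresses]
    return any(b1 == b2 for b1, b2 in zip(banks, banks[1:]))


sequential = list(range(8))   # successive addresses
strided = [0, 4, 8, 12]       # stride equal to the number of banks

print([bank_of(a) for a in sequential])  # [0, 1, 2, 3, 0, 1, 2, 3]
print(has_conflict(sequential))          # False
print(has_conflict(strided))             # True
```

Sequential accesses rotate through the banks, which is the case the locality argument relies on; a stride equal to the bank count defeats the scheme because every access lands in bank 0.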


Interleaved memory cycles

[Figure: module 1 and module 2 each go through addressing, read/write, and recovery; because their cycles are offset, the recovery phase of one module is "hidden" behind the access to the other.]

Virtual memory

• The organization of the overall memory: a hierarchy of memory levels
• The hierarchy contains memories of different sizes and different access speeds

[Figure: hierarchy of registers, caches, main memory, and mass storage]

Memory management is necessary: virtual memory

Virtual Memory

• Main memory can serve as a cache for secondary storage. Advantages:
–Illusion of a large physical memory
–Program relocation
–Memory protection


Paging: page management of virtual memory

[Figure: page tables. The virtual page number indexes the page table; each entry holds a valid bit and a physical page or disk address that points into physical memory or disk storage.]

Pages: virtual memory blocks

• Page faults: the data is not in memory, so retrieve it from disk
–huge miss penalty, thus pages should be fairly large (e.g., 4 KB)
–reducing page faults is important (LRU is worth the price)
–the faults can be handled in software instead of hardware
–write-through is too expensive, so write-back is used

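[Figure: address translation. The virtual address (bits 31 to 0) consists of a virtual page number (bits 31 to 12) and a page offset (bits 11 to 0). Translation maps the virtual page number to a physical page number (bits 29 to 12); the page offset is copied unchanged into the physical address.]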
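The bit slicing behind this translation can be sketched as follows, assuming the 12-bit page offset (4 KB pages) shown in the figure. The function names and the example mapping to physical page 0x42 are hypothetical.

```python
PAGE_OFFSET_BITS = 12  # 4 KB pages: bits 11..0 are the page offset


def split_virtual_address(va: int):
    """Return (virtual page number, page offset) for a virtual address."""
    vpn = va >> PAGE_OFFSET_BITS
    offset = va & ((1 << PAGE_OFFSET_BITS) - 1)
    return vpn, offset


def make_physical_address(ppn: int, offset: int) -> int:
    """Concatenate a physical page number with the unchanged page offset."""
    return (ppn << PAGE_OFFSET_BITS) | offset


vpn, offset = split_virtual_address(0x12345678)
print(hex(vpn), hex(offset))                      # 0x12345 0x678
print(hex(make_physical_address(0x42, offset)))   # 0x42678 (hypothetical mapping)
```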


Page Tables

[Figure: page table lookup. The page table register points to the page table. The 20-bit virtual page number (bits 31 to 12 of the virtual address) indexes the table; each entry holds a valid bit and an 18-bit physical page number. If the valid bit is 0, the page is not present in memory. The physical page number (bits 29 to 12) and the 12-bit page offset together form the physical address.]

What if the data is on disk?

Load the page off the disk into a free block of memory, using a DMA transfer.

In the meantime, switch to some other process that is waiting to run.

When the DMA is complete, get an interrupt and update the process's page table.
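The steps above can be sketched as a small simulation. This is a simplified model under stated assumptions: the class and method names are hypothetical, the DMA transfer and the process switch are collapsed into a single synchronous step, and a free frame is assumed to be available (no replacement).

```python
class PageTable:
    """Toy page table: each entry is (valid, location)."""

    def __init__(self, free_frames):
        self.entries = {}                 # vpn -> (valid bit, frame number or disk block)
        self.free_frames = list(free_frames)
        self.faults = 0

    def map_on_disk(self, vpn, disk_block):
        """Record that a page currently lives on disk (valid bit = 0)."""
        self.entries[vpn] = (False, disk_block)

    def translate(self, vpn):
        valid, location = self.entries[vpn]
        if not valid:
            # Page fault: the data is on disk.
            self.faults += 1
            frame = self.free_frames.pop(0)   # assumes a free frame exists
            # A real system would start a DMA transfer here, run another
            # process, and update the table when the completion interrupt arrives.
            self.entries[vpn] = (True, frame)
            location = frame
        return location


pt = PageTable(free_frames=[7, 8])
pt.map_on_disk(vpn=3, disk_block=100)
print(pt.translate(3), pt.faults)  # 7 1  (first access faults)
print(pt.translate(3), pt.faults)  # 7 1  (second access hits)
```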


Virtual Memory vs. Cache

Virtual memory terms compared to cache terms:

Cache block --- Page or segment

Cache miss --- Page fault or address fault

How is virtual memory different from caches?

What controls replacement: hardware for cache misses, the operating system for page faults

The size of the processor address determines the size of virtual memory; cache size is independent of the processor address size

Secondary storage is used as swap space and for storing the file system

Typical Parameter Ranges for Virtual Memory and Cache

Parameter          | First-level cache | Virtual memory
block (page) size  | 16-128 bytes      | 4 KB-64 KB
hit time           | 1-2 cycles        | 40-100 cycles
miss penalty       | 8-100 cycles      | 70,000-6,000,000 cycles
miss rate          | 0.5-10%           | 0.00001-0.001%
data memory size   | 8 KB-64 KB        | 16 MB-8 GB

Virtual Memory

Four questions for VM

Q1: Where can a block be placed in main memory?
Anywhere, because of the exorbitant miss penalty

Q2: Which block should be replaced on a miss?
LRU (needs use bits)

Q3: What happens on a write?
Write-back (write-through to secondary storage is not feasible)

Q4: How is a block found if it is in the upper level?
–Page table: can be large
–Inverted page table: hashes the virtual address (the page table is then only the size of physical memory)
–Address translation time is reduced by a translation look-aside buffer (TLB)

Virtual Memory Problem

Page table too big!

4 GB virtual memory ÷ 4 KB pages ≈ 1 million page table entries, so 4 MB just for the page table (at 4 bytes per entry)
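The arithmetic can be checked with a short calculation. The 4-byte entry size is an assumption implied by the 4 MB total; the function name is illustrative.

```python
def page_table_size(virtual_address_bits: int, page_bytes: int, entry_bytes: int = 4):
    """Return (number of entries, table size in bytes) for a flat 1-level page table."""
    entries = (1 << virtual_address_bits) // page_bytes
    return entries, entries * entry_bytes


entries, size = page_table_size(32, 4 * 1024)
print(entries)           # 1048576  (about 1 million entries)
print(size // 2**20)     # 4        (MB per process)
```

The same function shows why flat tables do not scale: each additional virtual address bit doubles the table.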


2-Level Page Table

[Figure: a super page table points to second-level page tables, which map the code, static, heap, and stack regions of virtual memory into physical memory.]
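A two-level lookup can be sketched as below. The 10/10/12 split of a 32-bit virtual address and all names are assumptions for illustration; the slides do not specify the field widths. Second-level tables exist only for regions that are actually in use, which is where the space saving comes from.

```python
L1_BITS, L2_BITS, OFFSET_BITS = 10, 10, 12  # assumed split of a 32-bit virtual address


class PageFault(Exception):
    """Raised when no mapping exists for the virtual address."""


def split(va: int):
    """Return (super page table index, second-level index, page offset)."""
    offset = va & ((1 << OFFSET_BITS) - 1)
    l2 = (va >> OFFSET_BITS) & ((1 << L2_BITS) - 1)
    l1 = va >> (OFFSET_BITS + L2_BITS)
    return l1, l2, offset


def translate(super_table: dict, va: int) -> int:
    l1, l2, offset = split(va)
    second_level = super_table.get(l1)     # None: no second-level table allocated
    if second_level is None or l2 not in second_level:
        raise PageFault(hex(va))
    return (second_level[l2] << OFFSET_BITS) | offset


# Only one second-level table is allocated; the rest of the address space costs nothing.
super_table = {0: {1: 0x55}}
print(hex(translate(super_table, 0x1ABC)))  # 0x55abc
```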


Virtual to Physical Address Translation

Each program operates in its own virtual address space

Each is protected from the other

OS can decide where each goes in memory

Hardware (MMU) provides virtual -> physical mapping

[Figure: the program issues virtual addresses (instruction fetch, load, store) in its virtual address space; the hardware mapping produces physical addresses (instruction fetch, load, store) for physical memory (including caches).]

MMU

Memory management unit

[Figure: the CPU sends virtual addresses to the MMU, the MMU sends physical addresses to main memory, and the MMU can raise an exception.]

Technique for Fast Address Translation: Translation Lookaside Buffer (TLB)

Small (32-128 entries) cache of recently translated page addresses, often fully associative

TLB entry:
–the "tag" holds a portion of the virtual address
–the "data portion" holds a physical page frame number, a protection field, a valid bit, and usually a use bit and a dirty bit


Speeding Up Address Translation

• A cache for address translations: the translation lookaside buffer (TLB)

[Figure: the TLB holds a valid bit, a tag (virtual page number), and a physical page address for some page table entries; the page table holds a valid bit and a physical page or disk address for every virtual page, pointing into physical memory or disk storage.]

Typical TLB Format

Virtual address | Physical address | Dirty | Ref | Valid | Access rights

The TLB is just a cache of the page table mappings

TLB access time is comparable to cache access time (much less than main memory access time)

Ref: used to help calculate LRU on replacement

Dirty: since write-back is used, we need to know whether or not to write the page to disk when it is replaced

What Happens on a TLB Miss?

Option 1: Hardware checks the page table and loads the new page table entry into the TLB

Option 2: Hardware traps to the OS, and it is up to the OS to decide what to do

Example: MIPS follows Option 2; the hardware knows nothing about the page table format

Either way, installing the new entry means evicting an old entry from the TLB
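A small fully associative TLB with LRU replacement can be sketched as follows. This is a toy model under stated assumptions: the class name is hypothetical, a plain dictionary stands in for the page table, and the miss path plays the role of either the hardware walker (Option 1) or the OS handler (Option 2).

```python
from collections import OrderedDict


class TLB:
    """Fully associative TLB with LRU replacement (toy model)."""

    def __init__(self, capacity: int, page_table: dict):
        self.capacity = capacity
        self.page_table = page_table    # vpn -> ppn, stands in for the real page table
        self.entries = OrderedDict()    # vpn -> ppn, least recently used first
        self.hits = 0
        self.misses = 0

    def lookup(self, vpn: int) -> int:
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)      # mark as most recently used
            return self.entries[vpn]
        self.misses += 1
        ppn = self.page_table[vpn]             # miss: consult the page table
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used entry
        self.entries[vpn] = ppn
        return ppn


tlb = TLB(capacity=2, page_table={1: 10, 2: 20, 3: 30})
for vpn in [1, 2, 1, 3, 2]:
    tlb.lookup(vpn)
print(tlb.hits, tlb.misses)   # 1 4
print(list(tlb.entries))      # [3, 2]
```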

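Cache Addressing

Conventional physical cache organization: CPU → TLB → cache → memory (VA into the TLB, PA into the cache and memory)

Virtually addressed cache: CPU → cache → TLB → memory (VA into the cache, which holds VA tags; PA into memory); translate only on a miss

Overlapped cache access and VA translation: the cache (with PA tags) is accessed in parallel with the TLB and is backed by an L2 cache; this requires the cache index to remain invariant under translation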
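The invariance requirement can be expressed as a simple check: the index and block-offset bits must all come from the page offset, which holds when the cache size divided by the associativity does not exceed the page size. The function name and the example configurations are illustrative assumptions.

```python
def index_is_translation_invariant(cache_bytes: int, associativity: int, page_bytes: int) -> bool:
    """True if the cache index lies entirely within the page offset.

    One way of the cache covers cache_bytes / associativity bytes; if that
    fits in a page, index and block offset use only untranslated address bits.
    """
    return cache_bytes // associativity <= page_bytes


# 8 KB, four-way set associative, 4 KB pages: each way spans 2 KB.
print(index_is_translation_invariant(8 * 1024, 4, 4 * 1024))    # True
# 16 KB, direct mapped, 4 KB pages: the index needs translated bits.
print(index_is_translation_invariant(16 * 1024, 1, 4 * 1024))   # False
```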


TLB and Physical Cache

[Flowchart: processing a read or write through the TLB and a physical cache]
–Virtual address → TLB access
–TLB hit? No: TLB miss exception. Yes: the physical address is available.
–Write? No: try to read the data from the cache.
»Cache hit? No: cache miss stall. Yes: deliver the data to the CPU.
–Write? Yes: is the write access bit on?
»No: write protection exception.
»Yes: write the data into the cache, update the tag, and put the data and the address into the write buffer.
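The flowchart can be traced in code with a minimal sketch. Everything here is a simplification for illustration: dictionaries stand in for the TLB and the cache, a list stands in for the write buffer, the 12-bit page offset is an assumption, and the exception and function names are hypothetical.

```python
OFFSET_BITS = 12  # assumed 4 KB pages


class TLBMiss(Exception):
    pass


class WriteProtectionFault(Exception):
    pass


def access(tlb, cache, write_buffer, vpn, offset, write=False, data=None):
    """Follow the flowchart; return a (status, value) pair."""
    if vpn not in tlb:
        raise TLBMiss(vpn)                       # TLB miss exception
    ppn, write_allowed = tlb[vpn]                # TLB hit: physical page number
    pa = (ppn << OFFSET_BITS) | offset
    if write:
        if not write_allowed:
            raise WriteProtectionFault(vpn)      # write access bit is off
        cache[pa] = data                         # write the data, update the tag
        write_buffer.append((pa, data))          # data and address go to the write buffer
        return "written", None
    if pa in cache:
        return "hit", cache[pa]                  # deliver the data to the CPU
    return "miss_stall", None                    # cache miss stall


tlb = {1: (5, False), 2: (6, True)}   # vpn -> (ppn, write access bit)
cache, write_buffer = {}, []
print(access(tlb, cache, write_buffer, 2, 4))                        # ('miss_stall', None)
print(access(tlb, cache, write_buffer, 2, 4, write=True, data=99))   # ('written', None)
print(access(tlb, cache, write_buffer, 2, 4))                        # ('hit', 99)
```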

Real or Physical Cache


Memory Stage: Physical Cache

Index is part of the displacement


Memory Stage: Virtual Cache


Very complicated memory systems:

Characteristic    | Intel Pentium Pro                         | PowerPC 604
Virtual address   | 32 bits                                   | 52 bits
Physical address  | 32 bits                                   | 32 bits
Page size         | 4 KB, 4 MB                                | 4 KB, selectable, and 256 MB
TLB organization  | A TLB for instructions and a TLB for data | A TLB for instructions and a TLB for data
                  | Both four-way set associative             | Both two-way set associative
                  | Pseudo-LRU replacement                    | LRU replacement
                  | Instruction TLB: 32 entries               | Instruction TLB: 128 entries
                  | Data TLB: 64 entries                      | Data TLB: 128 entries
                  | TLB misses handled in hardware            | TLB misses handled in hardware

Characteristic       | Intel Pentium Pro                  | PowerPC 604
Cache organization   | Split instruction and data caches  | Split instruction and data caches
Cache size           | 8 KB each for instructions/data    | 16 KB each for instructions/data
Cache associativity  | Four-way set associative           | Four-way set associative
Replacement          | Approximated LRU replacement       | LRU replacement
Block size           | 32 bytes                           | 32 bytes
Write policy         | Write-back                         | Write-back or write-through

Conclusion

Apply the principle of locality recursively

• Reduce miss penalty? Add an (L2) cache
• Manage memory to disk? Treat memory as a cache
–use a page table of mappings instead of tag/data as in a cache
• Virtual-to-physical translation too slow? Add a cache of virtual-to-physical address translations, called a TLB

Conclusion

Virtual memory allows protected sharing of memory between processes, with less swapping to disk and less fragmentation than always-swap or base/bound schemes.

Spatial locality means that the working set of pages is all that must be in memory for a process to run fairly well.

A TLB reduces the performance cost of VM.

A more compact representation is needed to reduce the memory size cost of a simple 1-level page table (especially with 32- to 64-bit addresses).


Caches are Critical for Performance

• Reduce average latency
• Reduce average bandwidth demand
• What happens when a store and a load are executed on different processors?
• Many processors can share data efficiently

[Figure: several processors (P)]

Caches and Cache Coherence

• Caches play a key role in all cases
–Reduce average data access time
–Reduce bandwidth demands placed on the shared interconnect
• Private processor caches create a problem
–Copies of a variable can be present in multiple caches
–A write by one processor may not become visible to others
»They'll keep accessing the stale value in their caches
=> Cache coherence problem
• What do we do about it?
–Organize the memory hierarchy to make it go away
–Detect and take actions to eliminate the problem
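The stale-value problem can be reproduced with a toy model of two private caches. The class below is a hypothetical illustration: it uses write-through stores and no coherence actions at all, and the explicit `invalidate` call stands in for whatever action a real protocol would take.

```python
class PrivateCache:
    """Private cache with write-through stores and no coherence protocol."""

    def __init__(self, memory: dict):
        self.memory = memory
        self.lines = {}              # address -> cached value

    def load(self, addr):
        if addr not in self.lines:   # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def store(self, addr, value):
        self.lines[addr] = value
        self.memory[addr] = value    # memory is updated, other caches are not

    def invalidate(self, addr):
        self.lines.pop(addr, None)


memory = {0x100: 1}
c1, c2 = PrivateCache(memory), PrivateCache(memory)

c2.load(0x100)             # c2 now holds a copy of the old value
c1.store(0x100, 2)         # the write by processor 1 reaches memory
stale = c2.load(0x100)     # processor 2 still sees its cached copy
c2.invalidate(0x100)       # a coherence action removes the stale copy
fresh = c2.load(0x100)
print(stale, fresh)        # 1 2
```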


Snooping Caches


Contention for Cache Tags

• The cache controller must monitor both the bus and the processor
–Can be viewed as two controllers: bus-side and processor-side
–With a single-level cache: dual tags (not data) or dual-ported tag RAM
»the copies must be reconciled when updated, but usually they are only looked up
–Respond to bus transactions

[Figure: cached data with two sets of tags, one used by the bus snooper and one used by the processor]


Snoopy Cache-Coherence Protocols

• The bus is a broadcast medium, and caches know what they have
• The cache controller "snoops" all transactions on the shared bus
–a transaction is relevant if it is for a block the cache contains
–take action to ensure coherence
»invalidate, update, or supply value
–the action depends on the state of the block and the protocol

[Figure: processors P1 to Pn, each with a cache ($) holding state, address, and data per block, snoop the shared bus, which also connects memory and I/O devices and carries the cache-memory transactions.]

MESI

Reporting Snoop Results:

• In the MESI protocol, the requester needs to know
–Is the block dirty, i.e., should memory respond or not?
–Is the block shared, i.e., transition to E or S state on a read miss?
• Three wired-OR signals
–Shared: asserted if any cache has a copy
–Dirty: asserted if some cache has a dirty copy
»needn't know which, since it will do what's necessary
–Snoop-valid: asserted when it is OK to check the other two signals
»actually inhibit until OK to check

Design Choices

• The controller updates the state of blocks in response to processor and snoop events and generates bus transactions
• Snoopy protocol
–set of states
–state-transition diagram
–actions
• Basic choices
–write-through vs. write-back
–invalidate vs. update

[Figure: cache with state, tag, and data per block; the cache controller sits between the processor and the snooped bus]
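A state-transition table of this kind can be sketched for one common choice, a write-back, invalidation-based protocol with the four MESI states. This is a simplified illustration, not the full protocol: the function names are hypothetical, `BusRdX` (read exclusive) is an assumed transaction name, and bus upgrades, write-backs on replacement, and transient states are left out.

```python
def on_processor(state: str, op: str, shared: bool = False):
    """Processor-side event. Returns (new state, bus transaction or None).

    `shared` models the wired-OR Shared signal sampled on a read miss.
    """
    if op == "read":
        if state == "I":
            return ("S" if shared else "E"), "BusRd"
        return state, None                 # M, E, S: read hit, no bus traffic
    if op == "write":
        if state in ("M", "E"):
            return "M", None               # already exclusive: silent upgrade
        return "M", "BusRdX"               # S or I: other copies must be invalidated
    raise ValueError(op)


def on_snoop(state: str, transaction: str):
    """Bus-side event. Returns (new state, True if this cache must supply the dirty block)."""
    if transaction == "BusRd":
        if state == "M":
            return "S", True
        if state == "E":
            return "S", False
        return state, False
    if transaction == "BusRdX":
        return "I", state == "M"
    raise ValueError(transaction)


print(on_processor("I", "read", shared=False))  # ('E', 'BusRd')
print(on_processor("I", "read", shared=True))   # ('S', 'BusRd')
print(on_processor("S", "write"))               # ('M', 'BusRdX')
print(on_snoop("M", "BusRd"))                   # ('S', True)
```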


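Basic Design

[Figure: basic design of a snooping cache. The cache data RAM is served by a processor-side controller and a bus-side controller. Separate tag-and-state arrays exist for the processor (P) and for the snooper, each with its own comparator. Address and command lines connect to the processor on one side and to the system bus on the other; a data buffer and a write-back buffer sit between the cache and the bus, and the snoop state is reported on the bus.]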

Multilevel Cache Hierarchies

• Independent snoop hardware for each level?
–processor pins for shared bus
–contention for processor cache access?
• Snoop only at L2 and propagate relevant transactions
• Inclusion property
(1) the contents of L1 are a subset of L2
(2) any block in modified state in L1 is in modified state in L2
(1) => all transactions relevant to L1 are relevant to L2
(2) => on BusRd, L2 can wave off the memory access and inform L1
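The two conditions of the inclusion property can be written as a small check. The representation is an assumption made for illustration: each cache level is a dictionary from block address to a one-letter state, with "M" meaning modified.

```python
def inclusion_holds(l1: dict, l2: dict) -> bool:
    """Check both inclusion conditions for one processor's L1 and L2.

    l1, l2: block address -> state ("M", "E", "S", ...).
    """
    # (1) every block in L1 is also present in L2
    subset = all(block in l2 for block in l1)
    # (2) every block modified in L1 is marked modified in L2
    modified = all(l2.get(block) == "M" for block, state in l1.items() if state == "M")
    return subset and modified


l2 = {0x40: "M", 0x80: "S", 0xC0: "E"}
print(inclusion_holds({0x40: "M", 0x80: "S"}, l2))   # True
print(inclusion_holds({0x100: "S"}, l2))             # False: block missing from L2
print(inclusion_holds({0x80: "M"}, l2))              # False: L2 does not know it is modified
```

When the check holds, snooping at L2 alone is sufficient, because L2 sees every block that L1 could hold and knows which ones are dirty.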

[Figure: several processors (P), each with its own L1 and L2 cache]

Shared Cache

• Cache placement identical to a single cache
–only one copy of any cached block
• Fine-grain sharing
• Potential for positive interference
–one processor prefetches data for another
• Smaller total storage
–only one copy of code/data used by both processors
• Can share data within a line without "ping-pong"

[Figure: processors P1 to Pn connected through a switch to an (interleaved) shared cache and (interleaved) main memory]

Disadvantages

• Fundamental bandwidth limitation
• Increases latency of all accesses
–crossbar (X-bar)
–larger cache
–hit time determines processor cycle time!
• Potential for negative interference
–one processor flushes data needed by another