Main Memory
Memories
SIMM: single inline memory module, 72 pins
DIMM: dual inline memory module, 168 pins
Main memory
– Static RAMs: fast but expensive
– Dynamic RAMs
Memory cycle
– Address the memory cell
– Read/load the datum
– DRAM: let the memory recover
Dynamic RAMs (DRAM): the memory contents must be refreshed periodically.
[Figure: cycle time versus access time]
Memory banks
[Figure: CPU, cache, bus, and memory; the memory is split into memory banks 0 to 3.]
Memory interleaving
Locality of memory references
Memory accesses go to successive addresses in the vast majority of cases.
⇒ With interleaving, consecutive accesses then fall into different modules, so no access conflict arises.
[Figure: address space]
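One common way to make successive addresses fall into different modules is low-order interleaving, where the bank is selected by the address modulo the number of banks. The sketch below assumes four banks, as in the memory bank figure; the function names `bank_of` and `row_in_bank` are illustrative, not taken from the slides.

```python
def bank_of(address: int, num_banks: int = 4) -> int:
    """Low-order interleaving: the low address bits select the bank."""
    return address % num_banks


def row_in_bank(address: int, num_banks: int = 4) -> int:
    """Position of the word inside its bank."""
    return address // num_banks


# Successive addresses cycle through the banks, so a sequential access
# stream never hits the same bank twice in a row.
for address in range(8):
    print(address, "-> bank", bank_of(address), "row", row_in_bank(address))
```

With this mapping, a bank that is still recovering from one access is not needed again until the other banks have been visited.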
Interleaved memory cycles
[Figure: timing of module 1 and module 2. Each access consists of addressing, read/write, and recovery; the recovery of one module is "hidden" behind the access to the other.]
Virtual Memory
• The organization of the overall memory: a hierarchy of memory levels
• The hierarchy contains memories of different size and different access speed
[Figure: registers, caches, main memory, mass storage. Memory management is required: virtual memory.]
Virtual Memory
• The main memory can serve as a cache for secondary storage.
Advantages:
– Illusion of a large physical memory
– Program relocation
– Memory protection
Paging: page-based management of virtual memory
[Figure: page tables. The page table is indexed by virtual page number; each entry holds a valid bit and a physical page or disk address. Valid entries point into physical memory, the others to disk storage.]
Pages: virtual memory blocks
• Page faults: the data is not in memory and must be retrieved from disk
– the miss penalty is huge, so pages should be fairly large (e.g., 4 KB)
– reducing page faults is important (LRU is worth the price)
– faults can be handled in software instead of hardware
– write-through is too expensive, so write-back is used
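To illustrate why the replacement policy matters for the page fault count, here is a minimal simulation of exact LRU replacement. It is a sketch only: real systems approximate LRU with use bits, and the function name `count_page_faults` and the reference string are made up for the example.

```python
from collections import OrderedDict


def count_page_faults(references, num_frames):
    """Count page faults for a reference string under exact LRU replacement."""
    frames = OrderedDict()  # resident pages, least recently used first
    faults = 0
    for page in references:
        if page in frames:
            frames.move_to_end(page)        # hit: mark as most recently used
        else:
            faults += 1                     # page fault: page must come from disk
            if len(frames) >= num_frames:
                frames.popitem(last=False)  # evict the least recently used page
            frames[page] = True
    return faults


references = [1, 2, 3, 1, 4, 2]
print(count_page_faults(references, num_frames=3))  # 5 faults
print(count_page_faults(references, num_frames=4))  # 4 faults
```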
[Figure: address translation. A 32-bit virtual address (bits 31 to 0) consists of a virtual page number and a 12-bit page offset (bits 11 to 0). Translation replaces the virtual page number with a physical page number, giving a 30-bit physical address (bits 29 to 0) with the same page offset.]
Page Tables
[Figure: the page table register points to the page table. The 20-bit virtual page number indexes the table; each entry holds a valid bit and an 18-bit physical page number. The physical page number is concatenated with the unchanged 12-bit page offset to form the physical address. If the valid bit is 0, the page is not present in memory.]
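The translation step can be sketched in a few lines, assuming the 12-bit page offset from the figure. The dictionary-based page table, the `PageFault` exception, and the sample mappings are illustrative assumptions, not part of the slides.

```python
PAGE_OFFSET_BITS = 12  # 4 KB pages, as in the figure


class PageFault(Exception):
    """Raised when the valid bit of the page table entry is 0."""


def translate(virtual_address, page_table):
    """Map a virtual address to a physical address via a one-level page table."""
    vpn = virtual_address >> PAGE_OFFSET_BITS                  # virtual page number
    offset = virtual_address & ((1 << PAGE_OFFSET_BITS) - 1)   # unchanged by translation
    valid, ppn = page_table.get(vpn, (False, None))
    if not valid:
        raise PageFault(f"virtual page {vpn:#x} is not in memory")
    return (ppn << PAGE_OFFSET_BITS) | offset


# Hypothetical page table: vpn -> (valid bit, physical page number)
page_table = {0x00001: (True, 0x00042), 0x00002: (False, None)}
print(hex(translate(0x00001ABC, page_table)))  # 0x42abc
```

An access to virtual page 2 raises `PageFault` here; in a real system this is the point where the operating system takes over.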
What if the data is on disk?
Load the page from disk into a free block of memory, using a DMA transfer.
In the meantime, switch to some other process that is waiting to run.
When the DMA transfer completes, an interrupt is raised and the process's page table is updated.
Virtual Memory vs. Cache
Virtual memory terms compared to cache terms:
– Cache block --- Page or segment
– Cache miss --- Page fault or address fault
How is virtual memory different from caches?
– What controls replacement: hardware for cache misses, the operating system for page faults
– The size of the processor address determines the size of virtual memory; cache size is independent of the processor address size
– Secondary storage is used both as swap space and for storing the file system
Typical Parameter Ranges for Virtual Memory and Cache

Parameter          First-level cache   Virtual memory
block (page) size  16-128 bytes        4 K-64 Kbytes
hit time           1-2 cycles          40-100 cycles
miss penalty       8-100 cycles        70000-6000000 cycles
miss rate          0.5-10%             0.00001-0.001%
data memory size   8 K-64 K            16 MB-8 GB
Virtual Memory
Four questions for VM:
Q1: Where can a block be placed in main memory?
    Anywhere, because of the exorbitant miss penalty.
Q2: Which block should be replaced on a miss?
    LRU (needs use bits).
Q3: What happens on a write?
    Write-back (write-through to secondary storage is not feasible).
Q4: How is a block found if it is in the upper level?
    Page table: can be large.
    Inverted page table: hashes the virtual address (the table is only the size of physical memory).
    Address translation time is reduced by a translation look-aside buffer (TLB).
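The weight of the miss penalty can be made concrete with the standard average access time formula, hit time + miss rate × miss penalty. The numbers below are picked from inside the typical parameter ranges for illustration only; they are not measurements.

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """Average access time in cycles: hit time plus the expected miss cost."""
    return hit_time + miss_rate * miss_penalty


# First-level cache: 1 cycle hit time, 5% miss rate, 50 cycle miss penalty
cache_cycles = average_access_time(1, 0.05, 50)

# Virtual memory: 100 cycle hit time, 0.00001% miss rate, 1,000,000 cycle penalty
vm_cycles = average_access_time(100, 0.00001 / 100, 1_000_000)

print(cache_cycles, vm_cycles)
```

Even at a miss rate of one in ten million, page faults add a noticeable amount to the average, which is why placement is fully flexible and replacement is chosen carefully.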
Virtual Memory Problem
Page table too big!
4 GB virtual memory ÷ 4 KB page ≈ 1 million page table entries
⇒ 4 MB just for the page table
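The arithmetic can be checked directly. The 4 MB figure follows if each page table entry takes 4 bytes, which is an assumption here; the slide does not state the entry size.

```python
def page_table_size(virtual_bytes, page_bytes, entry_bytes=4):
    """Return (number of entries, table size in bytes) for a one-level page table.

    entry_bytes=4 is an assumed entry size, not given in the slides.
    """
    entries = virtual_bytes // page_bytes
    return entries, entries * entry_bytes


entries, size = page_table_size(4 * 2**30, 4 * 2**10)
print(entries)          # 1048576 entries (about 1 million)
print(size // 2**20)    # 4 (MB)
```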
2-Level Page Table
[Figure: a super page table points to second-level page tables, which map the regions of virtual memory (code, static, heap, stack) to physical memory.]
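A two-level lookup can be sketched as follows. The 10/10/12 bit split of a 32-bit address is an assumption chosen for illustration, as are the nested dictionaries and the name `translate_two_level`. The point is that second-level tables exist only for the regions that are actually used.

```python
OFFSET_BITS = 12   # 4 KB pages
LEVEL_BITS = 10    # assumed: 10 bits per level for a 32-bit address


def split(virtual_address):
    """Split an address into (super table index, second-level index, offset)."""
    offset = virtual_address & ((1 << OFFSET_BITS) - 1)
    second = (virtual_address >> OFFSET_BITS) & ((1 << LEVEL_BITS) - 1)
    top = (virtual_address >> (OFFSET_BITS + LEVEL_BITS)) & ((1 << LEVEL_BITS) - 1)
    return top, second, offset


def translate_two_level(virtual_address, super_table):
    """Return the physical address, or None if no mapping exists."""
    top, second, offset = split(virtual_address)
    second_level = super_table.get(top)   # missing entry: whole 4 MB region unmapped
    if second_level is None:
        return None
    ppn = second_level.get(second)
    if ppn is None:
        return None
    return (ppn << OFFSET_BITS) | offset


# Only one second-level table is allocated in this hypothetical address space.
super_table = {1: {2: 0x55}}
print(hex(translate_two_level(0x00402123, super_table)))  # 0x55123
```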
Virtual to Physical Address Translation
Each program operates in its own virtual address space
Each is protected from the others
The OS can decide where each goes in memory
Hardware (MMU) provides the virtual -> physical mapping
[Figure: the program issues virtual addresses (instruction fetch, load, store) in its virtual address space; the hardware mapping turns them into physical addresses for physical memory (incl. caches).]
MMU
Memory Management Unit
[Figure: CPU → MMU → main memory. Virtual addresses enter the MMU, physical addresses leave it; an exception signal is also shown.]
Technique for Fast Address Translation: Translation Lookaside Buffer (TLB)
• Small (32-128 entries) cache of recently translated page addresses
• Often fully associative
TLB entry:
– "tag" holds a portion of the virtual address
– "data portion" holds a physical page frame number, protection field, valid bit, and usually a use bit and a dirty bit
Speeding Up Address Translation
• Cache for address translations: Translation Lookaside Buffer (TLB)
[Figure: each TLB entry holds a valid bit, a tag (virtual page number), and a physical page address, covering a subset of the page table. Each page table entry holds a valid bit and a physical page or disk address, pointing into physical memory or disk storage.]
Typical TLB Format

Fields: Virtual Address | Physical Address | Dirty | Ref | Valid | Access Rights

• The TLB is just a cache on the page table mappings
• TLB access time is comparable to cache access time (much less than main memory access time)
• Ref: used to help calculate LRU on replacement
• Dirty bit: since write-back is used, we need to know whether or not to write the page to disk when it is replaced
What if the TLB misses?
Option 1: Hardware checks the page table and loads the new page table entry into the TLB
Option 2: Hardware traps to the OS, and it is up to the OS to decide what to do
Example: MIPS follows Option 2; the hardware knows nothing about the page table format
A - evicting an old entry from the TLB
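The TLB behaviour described above (lookup, refill on a miss, eviction of an old entry) can be sketched as a small fully associative cache with LRU replacement. The class and its fields are illustrative; a real TLB entry also carries the dirty, ref, valid, and access-rights bits, which are omitted here.

```python
from collections import OrderedDict


class TLB:
    """Tiny fully associative TLB model with LRU replacement (illustrative only)."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()  # vpn -> ppn, least recently used first
        self.hits = 0
        self.misses = 0

    def lookup(self, vpn, page_table):
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)     # mark as most recently used
            return self.entries[vpn]
        # TLB miss: the handler (hardware or OS) consults the page table
        self.misses += 1
        ppn = page_table[vpn]
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
        self.entries[vpn] = ppn
        return ppn


page_table = {0: 10, 1: 11, 2: 12}  # hypothetical vpn -> ppn mappings
tlb = TLB(capacity=2)
for vpn in [0, 1, 0, 2, 1]:
    tlb.lookup(vpn, page_table)
print(tlb.hits, tlb.misses)  # 1 4
```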
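Cache Addressing
[Figure: three organizations.
 – Conventional physical cache organization: CPU → TLB → cache → memory; the VA is translated to a PA before the cache is accessed.
 – Virtually addressed cache: CPU → cache → TLB → memory; the cache holds VA tags and translation happens only on a miss.
 – Overlap cache access with VA translation: the cache (PA tags) is accessed in parallel with the TLB, backed by an L2 cache; this requires the cache index to remain invariant.]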
TLB and Physical Cache
[Flowchart: virtual address → TLB access → TLB hit?
 – No: TLB miss exception
 – Yes: physical address → write?
   – No (read): try to read data from cache → cache hit? Yes: deliver data to the CPU. No: cache miss stall.
   – Yes (write): write access bit on? No: write protection exception. Yes: write data into cache, update the tag, and put the data and the address into the write buffer.]
Real or Physical Cache
Memory Stage: Physical Cache
Index is part of the displacement
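For the index to be part of the displacement (the page offset), the index and block-offset bits together must fit inside the page offset, which is the same as saying that the cache size divided by the associativity must not exceed the page size. The check below is a sketch; the function name is made up, and the first two calls use the L1 cache sizes and 4 KB page size listed for the Pentium Pro and PowerPC 604 in the comparison table.

```python
def index_within_page_offset(cache_bytes, associativity, page_bytes):
    """True if cache index + block offset bits fit inside the page offset.

    Equivalent to: bytes per way (cache size / associativity) <= page size.
    """
    return cache_bytes // associativity <= page_bytes


KB = 1024
print(index_within_page_offset(8 * KB, 4, 4 * KB))    # Pentium Pro L1: True
print(index_within_page_offset(16 * KB, 4, 4 * KB))   # PowerPC 604 L1: True
print(index_within_page_offset(16 * KB, 1, 4 * KB))   # hypothetical direct-mapped 16 KB: False
```

When the check holds, the cache can be indexed with untranslated address bits while the TLB translates the page number in parallel.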
Memory Stage: Virtual Cache
Very complicated memory systems:

Characteristic       Intel Pentium Pro                          PowerPC 604
Virtual address      32 bits                                    52 bits
Physical address     32 bits                                    32 bits
Page size            4 KB, 4 MB                                 4 KB, selectable, and 256 MB
TLB organization     A TLB for instructions and a TLB for data  A TLB for instructions and a TLB for data
                     Both four-way set associative              Both two-way set associative
                     Pseudo-LRU replacement                     LRU replacement
                     Instruction TLB: 32 entries                Instruction TLB: 128 entries
                     Data TLB: 64 entries                       Data TLB: 128 entries
                     TLB misses handled in hardware             TLB misses handled in hardware

Characteristic       Intel Pentium Pro                          PowerPC 604
Cache organization   Split instruction and data caches          Split instruction and data caches
Cache size           8 KB each for instructions/data            16 KB each for instructions/data
Cache associativity  Four-way set associative                   Four-way set associative
Replacement          Approximated LRU replacement               LRU replacement
Block size           32 bytes                                   32 bytes
Write policy         Write-back                                 Write-back or write-through
Conclusion
Apply the principle of locality recursively:
• Reduce miss penalty? Add an (L2) cache
• Manage memory to disk? Treat main memory as a cache
– Use a page table of mappings instead of tag/data as in a cache
• Virtual-to-physical address translation too slow? Add a cache of virtual-to-physical address translations, called a TLB
Conclusion
• Virtual memory allows protected sharing of memory between processes, with less swapping to disk and less fragmentation than always-swap or base/bound schemes.
• Spatial locality means the working set of pages is all that must be in memory for a process to run fairly well.
• The TLB reduces the performance cost of VM.
• A more compact representation is needed to reduce the memory cost of a simple 1-level page table (especially with 32- to 64-bit addresses).
Caches are Critical for Performance
• Reduce average latency
• Reduce average bandwidth
• What happens when a store and a load are executed on different processors?
• Many processors can share data efficiently
Caches and Cache Coherence
• Caches play a key role in all cases
– Reduce average data access time
– Reduce bandwidth demands placed on the shared interconnect
• Private processor caches create a problem
– Copies of a variable can be present in multiple caches
– A write by one processor may not become visible to others
» They'll keep accessing the stale value in their caches
=> Cache coherence problem
• What do we do about it?
– Organize the memory hierarchy to make it go away
– Detect and take actions to eliminate the problem
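The stale-value problem can be reproduced with a toy model of two private write-back caches in front of one shared memory. Everything here (the `Cache` class, the address `0x100`, the values) is illustrative. The last three lines show one possible "detect and act" fix: write the dirty value back and invalidate the other copy.

```python
class Cache:
    """Toy private write-back cache with no coherence mechanism."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}

    def load(self, addr):
        if addr not in self.lines:           # miss: fetch from shared memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def store(self, addr, value):
        self.lines[addr] = value             # write-back: memory is not updated yet

    def write_back(self, addr):
        self.memory[addr] = self.lines[addr]

    def invalidate(self, addr):
        self.lines.pop(addr, None)


memory = {0x100: 5}
p1_cache, p2_cache = Cache(memory), Cache(memory)

p1_cache.load(0x100)
p2_cache.load(0x100)
p1_cache.store(0x100, 7)               # P1 writes the variable
stale = p2_cache.load(0x100)           # P2 still sees 5: coherence problem

p1_cache.write_back(0x100)             # one possible fix: write back ...
p2_cache.invalidate(0x100)             # ... and invalidate the other copy
fresh = p2_cache.load(0x100)           # P2 now sees 7
print(stale, fresh)
```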
Snooping Caches
Contention for Cache Tags
• The cache controller must monitor both the bus and the processor
– Can be viewed as two controllers: bus-side and processor-side
– With a single-level cache: dual tags (not data) or dual-ported tag RAM
» the copies must be reconciled when updated, but tags are usually only looked up
– Respond to bus transactions
[Figure: cached data with two sets of tags, one of them used by the bus snooper.]
Snoopy Cache-Coherence Protocols
• The bus is a broadcast medium, and caches know what they have
• The cache controller "snoops" all transactions on the shared bus
– a transaction is relevant if it is for a block the cache contains
– take action to ensure coherence
» invalidate, update, or supply the value
– the action depends on the state of the block and the protocol
[Figure: processors P1 to Pn, each with a cache ($) holding state, address, and data per block, attached to a shared bus together with memory and I/O devices; the caches perform bus snooping and cache-memory transactions.]
MESI
Reporting Snoop Results:
• For the MESI protocol, we need to know
– Is the block dirty, i.e., should memory respond or not?
– Is the block shared, i.e., transition to the E or S state on a read miss?
• Three wired-OR signals
– Shared: asserted if any cache has a copy
– Dirty: asserted if some cache has a dirty copy
» needn't know which, since it will do what's necessary
– Snoop-valid: asserted when it is OK to check the other two signals
» actually inhibit until OK to check
Design Choices
• The controller updates the state of blocks in response to processor and snoop events and generates bus transactions
• Snoopy protocol
– set of states
– state-transition diagram
– actions
• Basic choices
– Write-through vs. write-back
– Invalidate vs. update
[Figure: processor and snoop events feed the cache controller; each cache block holds state, tag, and data.]
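One transition of such a protocol, the read miss in MESI, can be sketched from the wired-OR snoop signals. This is a simplified model under stated assumptions: only the Shared and Dirty lines are modelled, the Snoop-valid timing is ignored, and the function name and return values are illustrative.

```python
def snoop_read_miss(other_states):
    """Resolve a read miss given the MESI states of the block in the other caches.

    Returns (new state in the requesting cache, who supplies the data).
    """
    shared = any(state != "I" for state in other_states)  # wired-OR Shared line
    dirty = any(state == "M" for state in other_states)   # wired-OR Dirty line
    new_state = "S" if shared else "E"      # E only if no other cache has a copy
    supplier = "cache" if dirty else "memory"  # memory must not respond if dirty
    return new_state, supplier


print(snoop_read_miss(["I", "I"]))  # ('E', 'memory')
print(snoop_read_miss(["S", "I"]))  # ('S', 'memory')
print(snoop_read_miss(["M", "I"]))  # ('S', 'cache')
```

A full protocol would also move the other caches' copies to the S state and handle write misses and upgrades; those transitions are left out of this sketch.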
Basic Design
[Figure: block diagram of a snooping cache. The processor-side controller uses one copy of the tags and state (for P); the bus-side controller uses a second copy (for snooping). Comparators match addresses against each tag set. The cache data RAM, a data buffer, and a write-back buffer connect to the system bus, which carries address, command, and data; the snoop state is reported to the controller.]
Multilevel Cache Hierarchies
• Independent snoop hardware for each level?
– processor pins for the shared bus
– contention for processor cache access?
• Snoop only at L2 and propagate relevant transactions
• Inclusion property
(1) the contents of L1 are a subset of L2
(2) any block in modified state in L1 is in modified state in L2
(1) => all transactions relevant to L1 are relevant to L2
(2) => on a BusRd, L2 can wave off the memory access and inform L1
[Figure: processors P, each with its own L1 and L2 cache.]
Shared Cache
• Cache placement identical to a single cache
– only one copy of any cached block
• Fine-grain sharing
• Potential for positive interference
– one processor prefetches data for another
• Smaller total storage
– only one copy of code/data used by both processors
• Can share data within a line without "ping-pong"
[Figure: processors P1 to Pn connect through a switch to an (interleaved) cache backed by (interleaved) main memory.]
Disadvantages
• Fundamental bandwidth limitation
• Increases latency of all accesses
– crossbar (X-bar)
– larger cache
– hit time determines processor cycle time!
• Potential for negative interference
– one processor flushes data needed by another