UNIT-3 Cache Memory.ppt

(1)

UNIT-3

(2)

Capacity



Word size

 The natural unit of organisation



Number of words

(3)

Unit of Transfer



Internal

 Usually governed by data bus width



External

 Usually a block which is much larger than a word



Addressable unit

(4)

Access Methods (1)



Sequential

 Start at the beginning and read through in order

 Access time depends on location of data and previous location

 e.g. tape



Direct

 Individual blocks have unique address

 Access is by jumping to vicinity plus sequential search

 Access time depends on location and previous location

(5)

Access Methods (2)



Random

 Individual addresses identify locations exactly

 Access time is independent of location or previous access

 e.g. RAM



Associative

 Data is located by a comparison with contents of a portion of the store

 Access time is independent of location or previous access

(6)

Memory Hierarchy



Registers

 In CPU



Internal or Main memory

 May include one or more levels of cache

 “RAM”



External memory

(7)

(8)

Performance



Access time

 Time between presenting the address and getting the valid data



Memory Cycle time

 Time may be required for the memory to “recover” before next access

 Cycle time is access + recovery



Transfer Rate

(9)

Physical Characteristics

(10)

Organisation

(11)

Memory hierarchy

 The purpose of the memory hierarchy is to obtain the highest possible access speed while minimizing the total cost of the memory system.

(12)

Locality of reference

principle

 Locality of Reference

 - The references to memory in any given time interval tend to be confined within a localized area

 - This area contains a set of information, and its membership changes gradually as time goes by

 - Temporal Locality

 Information that will be used in the near future is likely to be in use already (e.g. reuse of information in loops)

 - Spatial Locality

 If a word is accessed, adjacent (nearby) words are likely to be accessed soon (e.g.

(13)

CACHE

 Cache

 - The property of locality of reference is what makes cache memory systems work

 - A cache is a fast, small-capacity memory that should hold the information most likely to be accessed

Main memory

Cache memory

(14)

Cache



Small amount of fast memory

(15)

(16)

Cache operation –

overview



CPU requests contents of memory location



Check cache for this data



If present, get from cache (fast)



If not present, read required block from main

memory to cache



Then deliver from cache to CPU



Cache includes tags to identify which block of

(17)

(18)

Size does matter



Cost

 More cache is expensive



Speed

 More cache is faster (up to a point)

(19)

(20)

Direct Mapping



Each block of main memory maps to only one

cache line

 i.e. if a block is in cache, it must be in one specific place



Address is in two parts



Least Significant w bits identify unique word



Most Significant s bits specify one memory block
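The address split on this slide can be sketched in a few lines. In a direct-mapped cache the s block bits are commonly divided further into a line field (which cache line the block maps to) and a tag (which block currently occupies that line). The field widths below (w = 2, a 14-bit line field) and the name `split_address` are assumptions chosen for illustration, not values from the slides.

```python
# Assumed example parameters (not taken from the slides).
WORD_BITS = 2    # w: selects a word within a block
LINE_BITS = 14   # r: selects the cache line the block maps to

def split_address(addr):
    """Return (tag, line, word) for a direct-mapped cache."""
    word = addr & ((1 << WORD_BITS) - 1)    # least significant w bits
    block = addr >> WORD_BITS               # the s-bit memory block number
    line = block & ((1 << LINE_BITS) - 1)   # block number modulo number of lines
    tag = block >> LINE_BITS                # remaining high-order bits
    return tag, line, word

tag, line, word = split_address(0x16339C)
print(f"tag={tag:#x} line={line:#x} word={word}")
```

Two addresses with the same line field but different tags compete for the same cache line, which is why the stored tag has to be compared on every access.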

(21)

(22)

CACHE- ASSOCIATIVE

MAPPING

• A main memory block can load into any line of cache

• Memory address is interpreted as 2 fields: tag and word

• Tag uniquely identifies block of memory

• Every line’s tag is examined simultaneously for a match
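A minimal sketch of the lookup described above, assuming 4 words per block (a 2-bit word field). The class name `AssociativeCache` and its methods are hypothetical. Hardware compares all tags at the same time; the loop here only imitates that comparison in software.

```python
WORD_BITS = 2  # assumed: 4 words per block

def split(addr):
    """Interpret an address as (tag, word), as in associative mapping."""
    return addr >> WORD_BITS, addr & ((1 << WORD_BITS) - 1)

class AssociativeCache:
    def __init__(self, num_lines):
        # Each entry is either None (empty) or a (tag, block) pair.
        self.lines = [None] * num_lines

    def load(self, addr, block, line_index):
        """Place a block in ANY line; the caller picks the line."""
        tag, _ = split(addr)
        self.lines[line_index] = (tag, list(block))

    def lookup(self, addr):
        """Examine every line's tag; return the word on a hit, None on a miss."""
        tag, word = split(addr)
        for entry in self.lines:
            if entry is not None and entry[0] == tag:
                return entry[1][word]
        return None

cache = AssociativeCache(num_lines=4)
cache.load(0x40, ["a", "b", "c", "d"], line_index=3)
print(cache.lookup(0x42))  # hit: word 2 of the block that contains 0x40
print(cache.lookup(0x80))  # miss: no line holds this tag
```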

(23)

(24)

(25)

Basic terminologies

• Hit: CPU finding contents of memory address in cache

• Hit rate (h) is the probability of a successful lookup in cache by the CPU.

• Miss: CPU failing to find what it wants in cache (incurs a trip to deeper levels of the memory hierarchy)

• Miss rate (m) is the probability of missing in cache and is equal to 1-h.

• Miss penalty: Time penalty associated with servicing a miss at any particular level of the memory hierarchy

• Effective Memory Access Time (EMAT): Effective access time experienced by the CPU when accessing memory.

  • Time to look up the cache to see if the memory location is already there

  • Upon a cache miss, time to go to deeper levels of the memory hierarchy

EMAT = Tc + m * Tm
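As a quick worked example of the formula, the sketch below evaluates EMAT for one set of numbers. Here Tc is read as the cache lookup time and Tm as the miss penalty, following the two components listed above. The specific values (2 ns, 100 ns, h = 0.95) are assumptions for illustration only.

```python
def emat(t_cache, miss_rate, t_miss):
    """Effective memory access time: EMAT = Tc + m * Tm."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss rate must be a probability between 0 and 1")
    return t_cache + miss_rate * t_miss

hit_rate = 0.95              # assumed h
miss_rate = 1 - hit_rate     # m = 1 - h
print(emat(t_cache=2.0, miss_rate=miss_rate, t_miss=100.0))  # about 7 ns
```

Even a 5% miss rate more than triples the 2 ns lookup time in this example, which is why small improvements in hit rate matter.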

(26)

Write Access to Cache

from CPU



Two choices

 Write through policy

 Write allocate

 No-write allocate

(27)

Write Through Policy



Each write goes to cache. Tag is set and valid

bit is set



Each write also goes to write buffer (see next

(28)

Write Through Policy



Each write goes to cache. Tag is set and valid

bit is set

 This is write allocate

 There is also a no-write-allocate variant, in which the cache is not written to on a write miss



Each write also goes to write buffer



Write buffer writes data into main memory

(29)

Write back policy



CPU writes data to cache setting dirty bit

(30)

Write back policy



We write to the cache



We don't bother to update main memory



Is the cache consistent with main memory?



Is this a problem?

(31)

Comparison of the Write

Policies



Write Through

 Cache logic simpler and faster

 Creates more bus traffic



Write back

 Requires dirty bit and extra logic



Multilevel cache processors may use both

 L1 Write through

(32)

Write Policy



Must not overwrite a cache block unless main

memory is up to date

(33)

Write through



All writes go to main memory as well as cache



Multiple CPUs can monitor main memory traffic

to keep local (to CPU) cache up to date



Lots of traffic



Slows down writes

(34)

Write back



Updates initially made in cache only



Update bit for cache slot is set when update

occurs



If block is to be replaced, write to main memory

only if update bit is set



Other caches get out of sync

(35)

Types of Cache



There are mainly three types of cache

(36)

Static Cache

 This cache has a session lifetime: once the session is complete, the cache is deleted. It brings entire

(37)

Dynamic cache

 It should be used when the databases are large. The principle behind it is that it brings records one by one into the cache from the database. If a record is already present (identified using key columns), that record is not brought into the cache. It will help when the database has lot of redundant

(38)

Persistent cache



The lifetime of this kind of cache is entire

workflow. When a particular table is being

used in many sessions across the workflow,

the table can be made a persistent cache in

the lookup table properties. It will be

(39)

Levels in cache

A computer can have several different levels of cache memory. The level numbers refer to distance from the CPU, where Level 1 is the closest. All levels of cache memory are faster than RAM. The cache closest to the CPU is the fastest but generally costs more and stores less data than the other levels of cache.


(40)

Levels of cache

 Level 1 (L1) Cache

It is also called primary or internal cache. It is built directly into the processor chip. It has a small capacity, from 8 KB to 128 KB.

 Level 2 (L2) Cache

It is slower than L1 cache. Its storage capacity is larger, i.e. from 64 KB to 16 MB. Current processors contain an advanced transfer cache on the processor chip, which is a type of L2 cache. The common size of this cache is from 512 KB to 8 MB.

 Level 3 (L3) Cache

This cache is separate from the processor chip, on the motherboard. It exists on computers that use an L2 advanced transfer cache. It is slower than L1 and L2

(41)

Split i and d Cache

 High-performance processors invariably have 2 separate L1 caches, the instruction cache and the data cache (I-cache and D-cache). This "split cache" has several advantages over a unified cache:[8]

 Wiring simplicity: the decoder and scheduler are only hooked to the I-cache; the registers and ALU and FPU are only hooked to the D-cache.

 Speed: the CPU can be reading data from the D-cache, while simultaneously loading the next instruction(s) from the I-cache.

 Multi-CPU systems typically have a separate L1 I-cache and L1 D-cache for each CPU, each one direct-mapped for speed. On the other hand, in a high-performance processor, other

(42)

Unified vs Split I and D

(Instruction and Data) Caches

 Given a fixed total size (in bytes) for the cache, is it better to have two caches, one for instructions and one for data; or is it better to have a single unified cache?

 Unified is better because it automatically performs load balancing. If the current program needs more data references than instruction references, the cache will accommodate. Similarly if more instruction references are needed.

 Split is better because it can do two references at once (one instruction reference and one data reference).

 The winner is ...

 split I and D (at least for L1).

 But unified has the better (i.e. higher) hit ratio.

 So hit ratio is not the ultimate measure of good cache

(43)

Multilevel Caches



Ubiquitous in high-performance processors

 Gap between L1 (core frequency) and main memory too high

 Level 2 usually on chip, level 3 on or off-chip, level 4 off chip



Inclusion in multilevel caches

 Multi-level inclusion holds if L2 cache is superset of L1

 Can handle virtual address synonyms

 Filter coherence traffic: if L2 misses, L1 needn’t see snoop

 Makes L1 writes simpler

(44)

Replacement Policy



When a line must be evicted from a cache to

(45)

Direct Mapped

Replacement



There is no choice about which line to evict,

(46)

Replacement Goals



The general goal of the replacement policy is

to minimize future cache misses by evicting a

line that will not be referenced often in the

future

(47)

Least-recently used

(LRU) replacement policy



The cache ranks each of the lines in a set

(48)

Random replacement

policy



A randomly selected line from the

(49)

Virtual to real translation

The cache is addressed with the real memory address, i.e. the address translated by the TLB or by the translation mechanism used for physical memory.

There are at least three important performance aspects that directly relate to virtual-to-real address translation:

1. Improperly organized or insufficiently sized TLBs may create excessive not-in-TLB faults, adding time to execution.

2. For a real cache, the TLB access must occur before the cache access, effectively extending the cache access time.

(50)



What is Virtual to physical translation?



In a virtual memory system, the program

memory is divided into fixed sized pages and

allocated in fixed sized physical memory frames.

The pages do not have to be contiguous in

memory. A page table keeps track of where each

page is located in physical memory. This allows

the operating system to load a program of any

size into any available frames. Only the currently

used pages need to be loaded. Unused pages

can remain on disk until they are referenced.
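A minimal sketch of the page-table lookup described above, assuming a 4 KiB page size and a virtual address that splits into a page number and an offset within the page. The names `translate` and `PageFault` are hypothetical, and a page missing from the table stands in for a page that is still on disk.

```python
PAGE_SIZE = 4096  # assumed page (and frame) size in bytes

class PageFault(Exception):
    """Raised when the referenced page is not in physical memory."""

def translate(vaddr, page_table):
    """Map a virtual address to a physical address via the page table."""
    page, offset = divmod(vaddr, PAGE_SIZE)
    frame = page_table.get(page)        # None means the page is not resident
    if frame is None:
        raise PageFault(f"page {page} must be loaded from disk")
    return frame * PAGE_SIZE + offset   # same offset, different frame

# Pages need not be contiguous: page 0 sits in frame 5, page 1 in frame 2.
page_table = {0: 5, 1: 2}
print(hex(translate(0x1010, page_table)))  # page 1, offset 0x10
```

A reference to page 2 in this example raises `PageFault`, which corresponds to the operating system loading the page on first reference.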

(51)

What are Flags?



The page table also includes several other flags

to keep track of memory usage.



A resident flag in the page table indicates whether or not the page is in memory.



A use flag is set whenever the page is

(52)



The addresses that appear in programs are the

virtual addresses or program addresses. For

every memory access, either to fetch an

instruction or data, the CPU must translate the

virtual address to a real physical address. A

virtual memory address can be considered to

be composed of two parts: a page number and

an offset into the page. The page number

determines which page contains the

(53)

(54)

TLB (Translation Lookaside Buffer)

(55)

(56)

Overlapping the Tcycle in

V->R translation

There are three general approaches to avoiding the serial translation step in cache access. In order to avoid the sequential translation, the translation must be arranged so that it can be performed simultaneously with data access in the cache array. This can be done by three means:

1. Using high degrees of set associativity, so that the directory index bits are not affected by the translation.

2. Using a virtual code.

(57)

Cache Write Policy

and Replacement at

hit

(58)

Need of Write Policy

• A block in the cache might have been updated, but the corresponding update in main memory might not have been done

• Multiple CPUs have individual caches, so a write by one CPU can invalidate the data in another processor's cache

(59)

Cache Write Policy

• Write through

The value is written to both the cache line and to the lower level memory.

• Write back

(60)

(61)

Write Through

• In this technique, all the write operations are made to main memory as well as to cache, ensuring MM is always valid.

• Any other processor-cache module may monitor traffic to MM to maintain consistency

DISADVANTAGE

• It generates memory traffic and may create bottleneck.

(62)

Pseudo Write Through

• Also called Write Buffer

• Processor writes data into the cache and the write buffer

• Memory controller writes contents of the buffer to memory

• FIFO (typical number of entries 4)
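The write-buffer behaviour listed above can be sketched roughly as follows. The class `WriteThroughCache` and its method names are hypothetical. Draining the oldest entry when the buffer is full stands in for the processor stalling until a slot frees up, which is an assumption rather than something the slide specifies.

```python
from collections import deque

class WriteThroughCache:
    def __init__(self, buffer_entries=4):     # 4 entries, the typical size quoted above
        self.cache = {}                        # address -> value held in the cache
        self.memory = {}                       # address -> value held in main memory
        self.buffer = deque()                  # FIFO of pending (address, value) writes
        self.buffer_entries = buffer_entries

    def write(self, addr, value):
        """Processor side: write into the cache and into the write buffer."""
        if len(self.buffer) == self.buffer_entries:
            self.drain_one()                   # assumed stall: wait for a free slot
        self.cache[addr] = value
        self.buffer.append((addr, value))

    def drain_one(self):
        """Memory-controller side: retire the oldest buffered write to memory."""
        if self.buffer:
            addr, value = self.buffer.popleft()
            self.memory[addr] = value

c = WriteThroughCache()
for i in range(5):
    c.write(i, i * 10)
print(len(c.buffer), c.memory)  # four writes still pending, the oldest one retired
```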

(63)

(64)

Write Back

• In this technique, the updates are made only in cache.

• When an update is made, a dirty bit or use bit, associated with the line is set

• Then when a block is replaced, it is written back into the main memory, iff the dirty bit is set

• Thus it minimizes memory writes

DISADVANTAGE

• Portions of MM are still invalid, hence I/O should be allowed access only through cache
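A minimal sketch of the write-back behaviour described above, under simplifying assumptions: one word per cache line, direct-mapped placement (address modulo number of lines), and allocation of the line on a write miss. The class `WriteBackCache` and its fields are hypothetical.

```python
class WriteBackCache:
    def __init__(self, num_lines, memory):
        self.num_lines = num_lines
        self.memory = memory        # dict: address -> value (main memory)
        self.lines = {}             # line index -> [address, value, dirty bit]
        self.memory_writes = 0      # counts writes that actually reach memory

    def _evict(self, index):
        """Replace a line; write it back only if its dirty bit is set."""
        line = self.lines.pop(index, None)
        if line is not None and line[2]:
            self.memory[line[0]] = line[1]
            self.memory_writes += 1

    def write(self, addr, value):
        """Update the cache only and mark the line dirty."""
        index = addr % self.num_lines
        line = self.lines.get(index)
        if line is None or line[0] != addr:
            self._evict(index)      # a different block occupies the line
        self.lines[index] = [addr, value, True]

memory = {}
cache = WriteBackCache(num_lines=2, memory=memory)
for v in (1, 2, 3):
    cache.write(0, v)               # three updates, all absorbed by the cache
print(memory, cache.memory_writes)  # memory is stale: nothing written yet
cache.write(2, 9)                   # address 2 maps to the same line, evicting address 0
print(memory, cache.memory_writes)  # only the final value of address 0 is written
```

Three writes to the same address cost a single memory write here, which is the saving the slide refers to; the stale period in between is the reason I/O has to go through the cache.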

(65)

Cache Replacement Policy

• Random

Replace a randomly chosen line

• FIFO

Replace the oldest line

• LRU (Least Recently Used)

Replace the least recently used line

• NRU (Not Recently Used)
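To make one of these policies concrete, here is a small LRU sketch for a single set of lines. The class `LRUCache` is hypothetical and tracks only tags, not data. Real hardware usually approximates this ordering with a few status bits per line rather than keeping a full ordered list.

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()   # tags ordered from least to most recently used

    def access(self, tag):
        """Return True on a hit, False on a miss (the line is then loaded)."""
        if tag in self.lines:
            self.lines.move_to_end(tag)       # mark as most recently used
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)    # evict the least recently used line
        self.lines[tag] = None
        return False

cache = LRUCache(capacity=2)
hits = [cache.access(t) for t in ["A", "B", "A", "C", "B"]]
print(hits)
```

In this trace the access to C evicts B rather than A, because A was touched more recently. A FIFO policy would evict A instead, so the final access to B would hit under FIFO but misses under LRU.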

(66)
