Cache Memory

(1)

Computer Architecture

Chapter 5

(2)

Chapter Overview

5.1 Introduction

5.2 The ABCs of Caches

5.3 Reducing Cache Misses

5.4 Reducing Cache Miss Penalty

5.5 Reducing Hit Time

5.6 Main Memory

5.7 Virtual Memory

(3)

Introduction

5.1 Introduction

5.2 The ABCs of Caches

5.3 Reducing Cache Misses

5.4 Reducing Cache Miss Penalty

5.5 Reducing Hit Time

5.6 Main Memory

5.7 Virtual Memory

5.8 Protection and Examples of Virtual Memory

The Big Picture: Where are We Now?

The Five Classic Components of a Computer: Processor (Control and Datapath), Memory, Input, Output

Topics In This Chapter:

SRAM Memory Technology

DRAM Memory Technology

Memory Organization

(4)

Levels of the Memory Hierarchy

Level | Capacity | Access Time | Cost
CPU Registers | 100s Bytes | 1s ns |
Cache | K Bytes | 4 ns | 1-0.1 cents/bit
Main Memory | M Bytes | 100ns-300ns | $.0001-.00001 cents/bit
Disk | G Bytes | 10 ms (10,000,000 ns) | 10^-5 - 10^-6 cents/bit
Tape | infinite | sec-min | 10^-8

Staging / Xfer Unit between adjacent levels:

Levels | Unit | Managed by | Size
Registers <-> Cache | Instr. Operands | prog./compiler | 1-8 bytes
Cache <-> Memory | Blocks | cache cntl | 8-128 bytes
Memory <-> Disk | Pages | OS | 512-4K bytes
Disk <-> Tape | Files | user/operator | Mbytes

Upper levels are faster; lower levels are larger.


(5)

The ABCs of Caches

In this section we will:

Learn lots of definitions about caches – you

can’t talk about something until you

understand it (this is true in computer science

at least!)

Answer some fundamental questions about

caches:

Q1: Where can a block be placed in the

upper level?

(Block placement)

Q2: How is a block found if it is in the

upper level?

(Block identification)

Q3: Which block should be replaced on a

miss?

(Block replacement)

Q4: What happens on a write?

(Write strategy)


(6)

Cache Memory

The purpose of cache memory is to speed up

accesses by storing recently used data closer to

the CPU, instead of storing it in main memory.

Although cache is much smaller than main

memory, its access time is a fraction of that of

main memory.

Unlike main memory, which is accessed by

address, cache is typically accessed by content;

hence, it is often called content addressable

memory.

Because of this, a single large cache memory

isn’t always desirable: it takes longer to search.

(7)

Cache

Small amount of fast memory

Sits between normal main memory

and CPU

May be located on CPU chip or

module

(8)

Cache/Main Memory

Structure

(9)

Cache operation – overview

CPU requests contents of memory location

Check cache for this data

If present, get from cache (fast)

If not present, read required block from main

memory to cache

Then deliver from cache to CPU

Cache includes tags to identify which block of

main memory is in each cache slot

(10)

(11)

Comparison of Cache Sizes

Processor | Type | Year of Introduction | L1 cache (a) | L2 cache | L3 cache
IBM 360/85 | Mainframe | 1968 | 16 to 32 KB | - | -
PDP-11/70 | Minicomputer | 1975 | 1 KB | - | -
VAX 11/780 | Minicomputer | 1978 | 16 KB | - | -
IBM 3033 | Mainframe | 1978 | 64 KB | - | -
IBM 3090 | Mainframe | 1985 | 128 to 256 KB | - | -
Intel 80486 | PC | 1989 | 8 KB | - | -
Pentium | PC | 1993 | 8 KB/8 KB | 256 to 512 KB | -
PowerPC 601 | PC | 1993 | 32 KB | - | -
PowerPC 620 | PC | 1996 | 32 KB/32 KB | - | -
PowerPC G4 | PC/server | 1999 | 32 KB/32 KB | 256 KB to 1 MB | 2 MB
IBM S/390 G4 | Mainframe | 1997 | 32 KB | 256 KB | 2 MB
IBM S/390 G6 | Mainframe | 1999 | 256 KB | 8 MB | -
Pentium 4 | PC/server | 2000 | 8 KB/8 KB | 256 KB | -
IBM SP | High-end server/supercomputer | 2000 | 64 KB/32 KB | 8 MB | -
CRAY MTA (b) | Supercomputer | 2000 | 8 KB | 2 MB | -
Itanium | PC/server | 2001 | 16 KB/16 KB | 96 KB | 4 MB
SGI Origin 2001 | High-end server | 2001 | 32 KB/32 KB | 4 MB | -
Itanium 2 | PC/server | 2002 | 32 KB | 256 KB | 6 MB
IBM POWER5 | High-end server | 2003 | 64 KB | 1.9 MB | 36 MB

(a) Two values separated by a slash refer to instruction and data caches
(b) Both caches are instruction only; no data caches

(12)

The Principle of Locality

The Principle of Locality:

Programs access a relatively small portion of the address space at

any instant of time.

Three Different Types of Locality:

Temporal Locality (Locality in Time): If an item is referenced, it

will tend to be referenced again soon (e.g., loops, reuse)

Spatial Locality (Locality in Space): If an item is referenced, items

whose addresses are close by tend to be referenced soon

(e.g., straightline code, array access)

Sequential Locality: Instructions tend to execute in sequential order,

except at branch instructions.

(13)

A few terms

Inclusion Property

Coherence Property

Access frequency

Access time

Cycle time

Latency

Bandwidth

Capacity

Unit of transfer

(14)

Memory Hierarchy: Terminology

Hit: data appears in some block in the upper level (example: Block X)

Hit Rate: the fraction of memory accesses found in the upper level

Hit Time: time to access the upper level, which consists of upper-level access time + time to determine hit/miss

Miss: data needs to be retrieved from a block in the lower level (Block Y)

Miss Rate = 1 - (Hit Rate)

Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor

Consider a memory with three levels

Average memory access time (assuming a hit at the 3rd level):

h1 * t1 + (1 - h1) * [t1 + h2 * t2 + (1 - h2) * (t2 + t3)]

where t1, t2 and t3 are access times at the three levels and h1, h2 are the hit rates of the first two levels

Access frequency of level Mi: fi = (1 - h1) (1 - h2) … (1 - h(i-1)) * hi

Effective access time = Σ (fi * ti), summed over all levels i
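The three-level formula and the access frequencies can be checked numerically. The sketch below is a minimal illustration with made-up hit rates and access times (all values and function names are assumptions, not from the slides). It also suggests that the frequency-weighted sum matches the three-level expression when each level is charged the cumulative time t1 + ... + ti needed to reach it.

```python
def three_level_access_time(h1, h2, t1, t2, t3):
    """Average access time for three levels, assuming level 3 always hits."""
    return h1 * t1 + (1 - h1) * (t1 + h2 * t2 + (1 - h2) * (t2 + t3))


def access_frequencies(hit_rates):
    """f_i = (1 - h_1) ... (1 - h_(i-1)) * h_i for each level i."""
    freqs = []
    missed_so_far = 1.0  # probability that every faster level missed
    for h in hit_rates:
        freqs.append(missed_so_far * h)
        missed_so_far *= 1 - h
    return freqs


# Hypothetical values: hit rates per level and access times in ns.
h = [0.95, 0.80, 1.0]
t = [1.0, 10.0, 100.0]

direct = three_level_access_time(h[0], h[1], *t)
f = access_frequencies(h)

# Cumulative time to reach level i: t_1 + ... + t_i.
cumulative = [sum(t[: i + 1]) for i in range(len(t))]
weighted = sum(fi * ti for fi, ti in zip(f, cumulative))

print(direct, weighted)
```

With these numbers both expressions come out at about 2.5 ns, which shows how strongly a high first-level hit rate dominates the average.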

(15)

Cache Measures

Hit rate: fraction found in that level

So high that we usually talk about the miss rate instead

Average memory-access time = Hit time + Miss rate x Miss penalty (ns or clocks)

Miss penalty: time to replace a block from lower level, including time to replace in CPU

access time: time to lower level = f(latency to lower level)

transfer time: time to transfer block = f(bandwidth between upper & lower levels)

(16)

Measures

CPU Execution time = (CPU Clock Cycles + Memory Stall Cycles) * Clock Cycle Time

CPU clock cycles include the time for cache hits; the CPU is stalled during a miss

Memory Stall Cycles

= Number of misses * Miss penalty

= IC * (Misses / Instruction) * Miss penalty

= IC * (Memory Accesses / Instruction) * Miss Rate * Miss penalty

Miss rates and miss penalties are different for reads and writes:

Memory Stall Cycles

= IC * (Reads / Instruction) * Read Miss Rate * Read Miss penalty

+ IC * (Writes / Instruction) * Write Miss Rate * Write Miss penalty

Misses / Instruction

= (Miss rate * Memory Accesses) / Instruction Count

= Miss rate * (Memory Accesses / Instruction)
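As a rough worked example of these measures, the following sketch plugs hypothetical numbers into the formulas. The instruction count, base CPI of 1, miss rate, and penalty are all assumptions chosen only to make the arithmetic easy to follow.

```python
def memory_stall_cycles(instruction_count, accesses_per_instr, miss_rate, miss_penalty):
    """IC * (Memory Accesses / Instruction) * Miss Rate * Miss Penalty."""
    return instruction_count * accesses_per_instr * miss_rate * miss_penalty


def cpu_execution_time(cpu_clock_cycles, stall_cycles, clock_cycle_time):
    """(CPU Clock Cycles + Memory Stall Cycles) * Clock Cycle Time."""
    return (cpu_clock_cycles + stall_cycles) * clock_cycle_time


def average_memory_access_time(hit_time, miss_rate, miss_penalty):
    """Hit time + Miss rate * Miss penalty (same unit as hit_time)."""
    return hit_time + miss_rate * miss_penalty


# Hypothetical workload: 1M instructions, CPI of 1 without stalls,
# 1.5 memory accesses per instruction, 2% miss rate, 50-cycle penalty.
ic = 1_000_000
base_cycles = ic * 1.0
stalls = memory_stall_cycles(ic, 1.5, 0.02, 50)
time_ns = cpu_execution_time(base_cycles, stalls, clock_cycle_time=1.0)  # 1 ns clock
amat_cycles = average_memory_access_time(hit_time=1, miss_rate=0.02, miss_penalty=50)

print(stalls, time_ns, amat_cycles)
```

In this made-up case the stall cycles exceed the base cycles, so a 2% miss rate more than doubles the execution time.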

(17)

(18)

Simplest Cache: Direct Mapped

[Figure: a 16-location memory (Memory Address 0 to F) next to a 4 Byte Direct Mapped Cache (Cache Index 0 to 3).]

Cache location 0 can be occupied by data from:

Memory location 0, 4, 8, ... etc.

In general: any memory location whose 2 LSBs of the address are 0s

Address<1:0> => cache index

Which one should we place in the cache?

How can we tell which one is in the cache?

(19)

Where can a block be placed in the Cache?

Block 12 is placed in an 8 block cache: fully associative, direct mapped, 2-way set associative

S.A. Mapping = Block Number Modulo Number of Sets

Direct mapped: block 12 can go only into block 4 (12 mod 8)

Set associative: block 12 can go anywhere in set 0 (12 mod 4)

Fully associative: block 12 can go anywhere

[Figure: three 8-block caches (block no. 0 to 7), the set-associative one divided into Set 0 to Set 3, above a memory with block-frame addresses 0 to 31.]
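The three placement rules are the same modulo mapping with a different number of ways. The sketch below illustrates this with a hypothetical helper, candidate_frames, and assumes the frames of a set are numbered contiguously.

```python
def candidate_frames(block_number, num_frames, ways):
    """Return the cache frames in which a memory block may be placed.

    ways = 1 gives direct mapped, ways = num_frames gives fully associative.
    Assumes the frames of each set are numbered contiguously.
    """
    num_sets = num_frames // ways
    set_index = block_number % num_sets  # block number modulo number of sets
    first_frame = set_index * ways
    return list(range(first_frame, first_frame + ways))


direct_mapped = candidate_frames(12, num_frames=8, ways=1)
two_way = candidate_frames(12, num_frames=8, ways=2)
fully_associative = candidate_frames(12, num_frames=8, ways=8)

print(direct_mapped, two_way, fully_associative)
```

For block 12 this gives frame 4 only, the two frames of set 0, and all eight frames respectively.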

(20)

How is a block found if it is in the cache?

Each entry in the cache stores words

Tag on each block

No need to check index or block offset

[Figure: address fields, including the Byte Offset.]

(21)


(22)

Cache Memory

Take advantage of spatial locality

Store multiple words

The diagram below is a schematic of what

cache looks like.

Block 0 contains multiple words from main memory, identified

with the tag 00000000. Block 1 contains words identified with

the tag 11110101.

(23)

Cache Memory

As an example, suppose a program generates

the address 1AA. In 14-bit binary, this number is:

00000110101010.

The first 7 bits of this address go in the tag field,

the next 4 bits go in the block field, and the final

3 bits indicate the word within the block.
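The field split for address 1AA can be reproduced with shifts and masks. This is a small sketch; the function name split_address is an assumption made for illustration.

```python
def split_address(address, tag_bits, block_bits, word_bits):
    """Split an address into (tag, block, word) fields, most significant first."""
    word = address & ((1 << word_bits) - 1)
    block = (address >> word_bits) & ((1 << block_bits) - 1)
    tag = (address >> (word_bits + block_bits)) & ((1 << tag_bits) - 1)
    return tag, block, word


# 14-bit address 0x1AA with a 7-bit tag, 4-bit block field, 3-bit word field.
tag, block, word = split_address(0x1AA, tag_bits=7, block_bits=4, word_bits=3)
print(f"{tag:07b} {block:04b} {word:03b}")
```

The printed fields are 0000011, 0101 and 010: the cache checks block 5 for tag 0000011 and, on a hit, returns word 2 of that block.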

(24)

Cache Organizations I: Direct Mapped Cache

[Figure: a 1 KB direct-mapped cache with 32-byte blocks. Cache Index 0 to 31 selects one line of Cache Data (Byte 0 ... Byte 31, Byte 32 ... Byte 63, ..., Byte 992 ... Byte 1023). The block address consists of the Cache Tag (address bits 31-10, example: 0x50) and the Cache Index (bits 9-5, ex: 0x01); Byte Select is bits 4-0 (ex: 0x00). The Cache Tag and Valid Bit are stored as part of the cache “state”.]

(25)

Direct Mapping Address Structure

Tag s-r (8 bits) | Line or Slot r (14 bits) | Word w (2 bits)

24 bit address

2 bit word identifier (4 byte block)

22 bit block identifier

8 bit tag (=22-14)

14 bit slot or line

No two blocks in the same line have the same Tag

field

Check contents of cache by finding line and

checking Tag
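A minimal sketch of the lookup just described, using the 8/14/2-bit split: find the line from the address, then compare the stored tag. The class name and the sample addresses are assumptions for illustration, and no data is stored, only tags.

```python
TAG_BITS, LINE_BITS, WORD_BITS = 8, 14, 2  # 24-bit address


def fields(address):
    """Split a 24-bit address into (tag, line, word)."""
    word = address & ((1 << WORD_BITS) - 1)
    line = (address >> WORD_BITS) & ((1 << LINE_BITS) - 1)
    tag = address >> (WORD_BITS + LINE_BITS)
    return tag, line, word


class DirectMappedCache:
    def __init__(self):
        self.tags = {}  # line number -> tag of the block currently held

    def access(self, address):
        """Return True on a hit; on a miss, load the block into its line."""
        tag, line, _ = fields(address)
        hit = self.tags.get(line) == tag
        if not hit:
            self.tags[line] = tag  # the only line this block can use
        return hit


cache = DirectMappedCache()
# Two hypothetical addresses that share line 0x0CE7 but differ in tag.
results = [cache.access(a) for a in (0x16339C, 0x16339C, 0x17339C, 0x16339C)]
print(fields(0x16339C), results)
```

The last access misses again because the third one evicted the block from the shared line, which is the conflict behaviour that direct mapping is prone to.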

(26)

Direct Mapping Cache Line Table

Cache line | Main Memory blocks held
0 | 0, m, 2m, 3m, …, 2^s - m
1 | 1, m+1, 2m+1, …, 2^s - m + 1
m-1 | m-1, 2m-1, 3m-1, …, 2^s - 1

(27)

(28)

(29)

Direct Mapping pros & cons

Simple

Inexpensive

Fixed location for given block

If a program accesses 2 blocks that map to the

same line repeatedly, cache misses are very high

(30)

Associative Mapping

A main memory block can load into

any line of cache

Memory address is interpreted as

tag and word

Tag uniquely identifies block of

memory

Every line’s tag is examined for a

match

(31)

(32)

(33)

Associative Mapping Address Structure

Tag (22 bits) | Word (2 bits)

22 bit tag stored with each 32 bit block of data

Compare tag field with tag entry in cache to check for hit

Least significant 2 bits of address identify which byte is required from the 32 bit (4 byte) data block

e.g.

Address Tag Data Cache line

(34)

Cache Organizations II: Set Associative Cache

[Figure: the same 1 KB of Cache Data (Byte 0 ... Byte 1023, 32-byte blocks) organized into 16 sets, Set 0 to Set 15. The Cache Index (ex: 0x01 mod 16) selects a set; the Cache Tag (example: 0x50) and Valid Bit are stored as part of the cache “state”; Byte Select (ex: 0x00) picks the byte within the block.]

(35)

Set Associative Mapping

Cache is divided into a number of

sets

Each set contains a number of lines

A given block maps to any line in a

given set

e.g. Block B can be in any line of set i

e.g. 2 lines per set

2 way associative mapping

A given block can be in one of 2 lines in only one

set

(36)

Set Associative Mapping Example

13 bit set number

Block number in main memory is taken modulo 2^13

Addresses 000000, 008000, 010000, 018000 … map to the same set

(37)

Two Way Set Associative

Cache Organization

(38)

Set Associative Mapping Address Structure

Tag (9 bits) | Set (13 bits) | Word (2 bits)

Use set field to determine cache set to look in

Compare tag field to see if we have a hit

e.g.

Address | Tag | Data | Set number
1FF 7FFC | 1FF | 12345678 | 1FFF
001 7FFC | 001 | 11223344 | 1FFF
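The two-step lookup (select the set, then compare tags) can be sketched as follows. The code assumes a 24-bit address split 9/13/2, reads the example addresses as a 9-bit tag followed by the low 15 bits, and uses hypothetical names throughout.

```python
TAG_BITS, SET_BITS, WORD_BITS = 9, 13, 2  # assumed 24-bit address layout


def fields(address):
    """Split a 24-bit address into (tag, set index, word)."""
    word = address & ((1 << WORD_BITS) - 1)
    set_index = (address >> WORD_BITS) & ((1 << SET_BITS) - 1)
    tag = address >> (WORD_BITS + SET_BITS)
    return tag, set_index, word


def is_hit(cache_sets, address):
    """cache_sets maps a set index to the list of tags held in that set."""
    tag, set_index, _ = fields(address)
    return tag in cache_sets.get(set_index, [])


# "1FF 7FFC" and "001 7FFC": 9-bit tag followed by the low 15 bits.
addr_a = (0x1FF << 15) | 0x7FFC
addr_b = (0x001 << 15) | 0x7FFC

# Both blocks fall in set 0x1FFF, so a 2-way set can hold them together.
cache_sets = {0x1FFF: [0x1FF, 0x001]}
print(fields(addr_a), fields(addr_b), is_hit(cache_sets, addr_a))
```

In a direct-mapped cache these two blocks would evict each other; with two lines per set both can stay resident.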

(39)

Two Way

Set

Associative

Mapping

(40)

Replacement Algorithms (1)

Direct mapping

No choice

Each block only maps to one line

Replace that line

(41)

Replacement Algorithms (2)

Associative & Set Associative

Hardware implemented algorithm (speed)

Least Recently used (LRU)

e.g. in 2 way set associative

Which of the 2 blocks is LRU?

First in first out (FIFO)

replace block that has been in cache longest

Least frequently used

replace block which has had fewest hits

Random
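To make the LRU policy concrete, here is a software sketch of a single cache set. Real caches implement this in hardware, as the slide notes; the class name and the tags "A", "B", "C" are assumptions for illustration.

```python
from collections import OrderedDict


class LRUSet:
    """One cache set with LRU replacement (software model only)."""

    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()  # tag -> None, least recently used first

    def access(self, tag):
        """Return True on a hit, False on a miss (loading the block)."""
        if tag in self.lines:
            self.lines.move_to_end(tag)  # mark as most recently used
            return True
        if len(self.lines) >= self.ways:
            self.lines.popitem(last=False)  # evict the least recently used
        self.lines[tag] = None
        return False


lru_set = LRUSet(ways=2)
outcomes = [lru_set.access(tag) for tag in ["A", "B", "A", "C", "B"]]
print(outcomes, list(lru_set.lines))
```

In this trace the re-use of A makes B the least recently used block, so C evicts B. Under FIFO, C would instead evict A, the block that has been in the cache longest.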

(42)

Write Policy

Must not overwrite a cache block

unless main memory is up to date

Multiple CPUs may have individual

caches

I/O may address main memory

directly

(43)

Write through

All writes go to main memory as well as cache

Multiple CPUs can monitor main memory traffic

to keep local (to CPU) cache up to date

Lots of traffic

Slows down writes

(44)

Write back

Updates initially made in cache only

Update bit for cache slot is set when update

occurs

If block is to be replaced, write to main memory

only if update bit is set

Other caches get out of sync

I/O must access main memory through cache

N.B. 15% of memory references are writes
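The difference in memory traffic between the two policies can be sketched by counting main-memory writes for one cached block. This is a simplified model under stated assumptions: the block is already in the cache, and a write-back flush counts as one memory write regardless of block size.

```python
def count_memory_writes(ops, policy):
    """Count main-memory writes caused by 'write' and 'evict' operations
    on a single cached block, for 'write-through' or 'write-back'."""
    update_bit = False  # the per-slot update (dirty) bit used by write back
    memory_writes = 0
    for op in ops:
        if op == "write":
            if policy == "write-through":
                memory_writes += 1   # every write also goes to main memory
            else:
                update_bit = True    # write back: update the cache only
        elif op == "evict":
            if policy == "write-back" and update_bit:
                memory_writes += 1   # flush the modified block once
            update_bit = False
    return memory_writes


ops = ["write", "write", "write", "evict"]
through = count_memory_writes(ops, "write-through")
back = count_memory_writes(ops, "write-back")
print(through, back)
```

Repeated writes to the same block are where write back saves traffic; a block that was never written costs no memory write under either policy.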

(45)

Let’s Do An Example: The Memory Addresses We’ll Be Using

Here are a number of addresses. We’ll ask for the data at these addresses and see what happens to the cache when we do so.

Address | Tag (bits 31-9) | Set (bits 8-5) | Offset (bits 4-0) | Result
1090 | 00000000000000000000010 | 0010 | 00010 | Miss
1440 | 00000000000000000000010 | 1101 | 00000 | Miss
5000 | xxxxxxxxxxxxxxxxxxxxxxx | xxxx | 01000 |
1470 | xxxxxxxxxxxxxxxxxxxxxxx | xxxx | xxxxx |

Cache:

1. Is Direct Mapped

2. Contains 512 bytes.

3. Has 16 sets.

4. Each set can hold 32 bytes or

1 cache line.

(46)

Here’s the Cache We’ll Be Touching

Initially the cache is empty.

Set Address | V | Tag | Data (Can hold a 32-byte cache line.)
15 (1111) | N | |
14 (1110) | N | |
13 (1101) | N | |
12 (1100) | N | |
11 (1011) | N | |
10 (1010) | N | |
9 (1001) | N | |
8 (1000) | N | |
7 (0111) | N | |
6 (0110) | N | |
5 (0101) | N | |
4 (0100) | N | |
3 (0011) | N | |
2 (0010) | N | |
1 (0001) | N | |
0 (0000) | N | |


(47)

Doing Some Cache Action

We want to READ data from address 1090 = 010|0010|00010

Cache before the access: all 16 sets, 15 (1111) down to 0 (0000), have V = N. (The Data field always holds a 32-byte cache line.)

Line loaded on the miss, into set 2 (0010): V = Y, Tag = 00000….10, Data from memory loc. 1088 - 1119

Address breakdown used in this example:

Add. | Tag | Set | Offset
5000 | 1001 | 1100 | 01000
4096 | 1000 | 0000 | 00000
2048 | 0100 | 0000 | 00000
1620 | 0011 | 0010 | 10100
1600 | 0011 | 0010 | 00000
1470 | 0010 | 1101 | 11110
1440 | 0010 | 1101 | 00000
1099 | 0010 | 0010 | 01011
1090 | 0010 | 0010 | 00010
1024 | 0010 | 0000 | 00000
512 | 0001 | 0000 | 00000
256 | 0000 | 1000 | 00000

(48)

Doing Some Cache Action

We want to READ data from address 1440 = 010|1101|00000

Set Address | V | Tag | Data (Always holds a 32-byte cache line.)
2 (0010) | Y | 00000….10 | Data from memory loc. 1088 - 1119
all other sets | N | |

Line loaded on the miss: V = Y, Tag = 00000….10, Data from memory loc. 1440 - 1471

(49)

Doing Some Cache Action

We want to READ data from address 5000 = 1001|1100|01000

Set Address | V | Tag | Data
13 (1101) | Y | 00000…0010 | Data from memory loc. 1440 - 1471
2 (0010) | Y | 00000…….10 | Data from memory loc. 1088 - 1119
all other sets | N | |

(50)

Doing Some Cache Action

We want to READ data from address 1470 = 0010|1101|11110

Set Address | V | Tag | Data
15 (1111) to 14 (1110) | N | |
13 (1101) | Y | 00000…00010 | Data from memory loc. 1440 - 1471
12 (1100) | Y | 00000….1001 | Data from memory loc. 4992 - 5023
11 (1011) to 3 (0011) | N | |

(51)

Doing Some Cache Action

We want to READ data from address 1600 = 0011|0010|00000

Set Address | V
15 (1111) | N
14 (1110) | N
13 (1101) | Y

(52)

Doing Some Cache Action

We want to WRITE data to address 256 = 0000|1000|00000

Set Address | V | Tag | Data
15 (1111) to 14 (1110) | N | |
13 (1101) | Y | |
2 (0010) | Y | 00000….0011 | Data from memory loc. 1600 - 1631
1 (0001) to 0 (0000) | N | |

Line loaded on the miss: V = Y, Tag = 00000….0000, Data from memory loc. 256 - 287

(53)

Doing Some Cache Action

We want to WRITE data to address 1620 = 0011|0010|10100

Set Address | V | Tag | Data
15 (1111) to 14 (1110) | N | |
13 (1101) | Y | |
12 (1100) | Y | 00000….1001 | Data from memory loc. 4992 - 5023
11 (1011) to 9 (1001) | N | |
8 (1000) | Y | 00000….0000 | Data from memory loc. 256 - 287
7 (0111) to 3 (0011) | N | |

(54)

Doing Some Cache Action

We want to WRITE data to address 1099 = 0010|0010|01011

Set Address | V | Tag | Data
15 (1111) to 14 (1110) | N | |
13 (1101) | Y | |
12 (1100) | Y | 00000….1001 | Data from memory loc. 4992 - 5023
11 (1011) to 9 (1001) | N | |
8 (1000) | Y | 00000….0000 | Data from memory loc. 256 - 287
7 (0111) to 3 (0011) | N | |
2 (0010) | Y | 00000…00011 | Data from memory loc. 1600 - 1631
1 (0001) to 0 (0000) | N | |
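The whole walk-through can be replayed with a short simulation of the 512-byte direct-mapped cache (16 sets, 32-byte lines). This is a sketch under one assumption that goes beyond the cache description: write misses also load the line into the cache. Only tags are tracked, not data.

```python
OFFSET_BITS = 5  # 32-byte cache line
SET_BITS = 4     # 16 sets


def split(address):
    """Return (tag, set index, offset) for an address."""
    offset = address & ((1 << OFFSET_BITS) - 1)
    set_index = (address >> OFFSET_BITS) & ((1 << SET_BITS) - 1)
    tag = address >> (OFFSET_BITS + SET_BITS)
    return tag, set_index, offset


def run(accesses):
    """Replay (operation, address) pairs on an initially empty cache."""
    cache = {}  # set index -> tag of the line held there
    results = []
    for op, address in accesses:
        tag, set_index, _ = split(address)
        hit = cache.get(set_index) == tag
        if not hit:
            cache[set_index] = tag  # assumption: write misses allocate too
        results.append((op, address, set_index, "hit" if hit else "miss"))
    return results


trace = [("READ", 1090), ("READ", 1440), ("READ", 5000), ("READ", 1470),
         ("READ", 1600), ("WRITE", 256), ("WRITE", 1620), ("WRITE", 1099)]
results = run(trace)
for row in results:
    print(row)
```

Addresses 1090, 1600, 1620 and 1099 all compete for set 2, which is why the final write to 1099 misses even though its line was loaded by the very first access.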

(55)

What happens on a write?

Write through: The information is written to both the block in the cache and to the block in the lower-level memory.

Write back: The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.

Is the block clean or dirty?

WT is always combined with write buffers so that the processor doesn’t wait for lower-level memory

(56)

Write Buffer for Write Through

A Write Buffer is needed between the Cache and Memory

Processor: writes data into the cache and the write buffer;

Memory controller: writes contents of the buffer to memory.

Write buffer is just a FIFO:

Typical number of entries: 4;

Must handle bursts of writes.

[Figure: Processor and Cache, with a Write Buffer between the cache and DRAM.]

(57)

Assume: a 16-bit (sub-block) write to memory location 0x0 causes a miss.

Do we allocate space in the cache and possibly read in the block?

Yes: Write Allocate (write back caches)

No: Not Write Allocate (write through)

Example:

WriteMem[100]

WriteMem[100]

ReadMem[200]

WriteMem[200]

WriteMem[100]

NWA: four misses and one hit

WA: two misses and three hits

Write-miss Policy: Write Allocate vs. Not Allocate
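The stated miss and hit counts can be reproduced with a small simulation. The sketch assumes an initially empty cache, that addresses 100 and 200 lie in different blocks, and a five-access trace in which WriteMem[100] is issued twice at the start; that trace is an assumption chosen because it yields exactly the counts given for NWA and WA.

```python
def count_misses_and_hits(trace, write_allocate):
    """Return (misses, hits) for a trace of (operation, block) pairs.

    Reads always allocate the block on a miss; writes allocate only
    when write_allocate is True.
    """
    cached = set()
    misses = hits = 0
    for op, block in trace:
        if block in cached:
            hits += 1
            continue
        misses += 1
        if op == "read" or write_allocate:
            cached.add(block)
    return misses, hits


# Assumed trace: two writes to 100, then read 200, write 200, write 100.
trace = [("write", 100), ("write", 100), ("read", 200),
         ("write", 200), ("write", 100)]

nwa = count_misses_and_hits(trace, write_allocate=False)
wa = count_misses_and_hits(trace, write_allocate=True)
print(nwa, wa)
```

Without write allocation, block 100 never enters the cache, so every write to it misses; only the write to 200 hits, because the preceding read brought that block in.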

(58)

Pentium 4 Cache

80386 – no on chip cache

80486 – 8k using 16 byte lines and four way set associative

organization

Pentium (all versions) – two on chip L1 caches

Data & instructions

Pentium III – L3 cache added off chip

Pentium 4

L1 caches

8k bytes 64 byte lines

four way set associative

L2 cache

Feeding both L1 caches 256k

128 byte lines

8 way set associative

(59)

Intel Cache Evolution

Problem: External memory slower than the system bus.
Solution: Add external cache using faster memory technology.
Processor on which feature first appears: 386

Problem: Increased processor speed results in external bus becoming a bottleneck for cache access.
Solution: Move external cache on-chip, operating at the same speed as the processor.
Processor on which feature first appears: 486

Problem: Internal cache is rather small, due to limited space on chip.
Solution: Add external L2 cache using faster technology than main memory.
Processor on which feature first appears: 486

Problem: Contention occurs when both the Instruction Prefetcher and the Execution Unit simultaneously require access to the cache. In that case, the Prefetcher is stalled while the Execution Unit’s data access takes place.
Solution: Create separate data and instruction caches.
Processor on which feature first appears: Pentium

Problem: Increased processor speed results in external bus becoming a bottleneck for L2 cache access.
Solution: Create separate back-side bus that runs at higher speed than the main (front-side) external bus. The BSB is dedicated to the L2 cache.
Processor on which feature first appears: Pentium Pro
Solution: Move L2 cache on to the processor chip.
Processor on which feature first appears: Pentium II

Problem: Some applications deal with massive databases and must have rapid access to large amounts of data. The on-chip caches are too small.