Memory & Caches IV. CSE 351 Winter 2021


(1)

Instructor: Mark Wyse

Teaching Assistants: Kyrie Dowling, Catherine Guevara, Ian Hsiao, Jim Limprasert, Armin Magness, Allie Pfleger

(2)

Lab 3 due Monday (2/22)

Cache parameter puzzles and code optimizations

hw16 due Mon (2/22), hw17 due Wed (2/24)

Lab 4 preparation

Remember to complete mid-quarter survey!

(3)

Terminology:

Write-hit policies: write-back, write-through

Write-miss policies: write allocate, no-write allocate

Cache blocking

(4)

▪ multiple levels of cache and main memory

❖ What to do on a write-hit?

▪ Write-through: write immediately to next level

▪ Write-back: defer write to next level until line is evicted (replaced)

• Must track which cache lines have been modified (“dirty bit”)

❖ What to do on a write-miss?

▪ Write allocate: (“fetch on write”) load into cache, then execute the write-hit policy

• Good if more writes or reads to the location follow

▪ No-write allocate: (“write around”) just write immediately to next level
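The two write-hit policies differ most when the same block is written repeatedly. The following is a minimal sketch in C, assuming a toy cache in which every write is a hit to one resident line that is evicted once at the end. The names `Policy` and `next_level_writes` are hypothetical and only illustrate the bookkeeping.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical write-hit policies for a toy one-line cache. */
typedef enum { WRITE_THROUGH, WRITE_BACK } Policy;

/*
 * Count the writes sent to the next level when the same cached block is
 * written num_writes times (all write hits) and is then evicted.
 */
size_t next_level_writes(Policy policy, size_t num_writes) {
    assert(policy == WRITE_THROUGH || policy == WRITE_BACK);
    size_t writes = 0;
    int dirty = 0;                 /* the "dirty bit" of the single line */
    for (size_t i = 0; i < num_writes; i++) {
        if (policy == WRITE_THROUGH) {
            writes++;              /* every hit is forwarded immediately */
        } else {
            dirty = 1;             /* only the cached copy changes */
        }
    }
    if (dirty) {
        writes++;                  /* one deferred write on eviction */
    }
    return writes;
}
```

Under these assumptions, five writes to the same block cost five next-level writes with write-through and one with write-back.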

(5)

example simple and focus on the cache policies

Cache: Valid = 1, Dirty = 0, Tag = G, Block Contents = 0xBEEF

Memory: Block Num F = 0xCAFE

There is only one set in this tiny cache, so the tag is the entire block number!

(6)

Cache: Valid = 1, Dirty = 0, Tag = G, Block Contents = 0xBEEF

Memory: Block Num F = 0xCAFE

Step 1: Bring F into cache

(7)

Cache: Valid = 1, Dirty = 0, Tag = F, Block Contents = 0xCAFE

Memory: Block Num F = 0xCAFE

Step 1: Bring F into cache

Step 2: Write 0xFACE to cache only and set the dirty bit

(8)

Cache: Valid = 1, Dirty = 1, Tag = F, Block Contents = 0xFACE

Memory: Block Num F = 0xCAFE

Step 1: Bring F into cache

Step 2: Write 0xFACE to cache only and set the dirty bit

(9)

Cache: Valid = 1, Dirty = 1, Tag = F, Block Contents = 0xFACE

Memory: Block Num F = 0xCAFE

Step: Write 0xFEED to cache only (and set the dirty bit)

(10)

Cache: Valid = 1, Dirty = 1, Tag = F, Block Contents = 0xFEED

Memory: Block Num F = 0xCAFE

(11)

Cache: Valid = 1, Dirty = 1, Tag = F, Block Contents = 0xFEED

Memory: Block Num F = 0xCAFE

Write Miss Write Hit Read Miss

Step 1: Write F back to memory since it is dirty

(12)

Cache: Valid = 1, Dirty = 1, Tag = F, Block Contents = 0xFEED

Memory: Block Num F = 0xFEED

Write Miss Write Hit Read Miss

Step 1: Write F back to memory since it is dirty

Step 2: Bring G into the cache so that we can copy it into %ax
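The write miss, write hit, read miss sequence above can be replayed in a few lines of C. This is a sketch under stated assumptions, not the course's simulator. It models a single-line cache with write-back and write-allocate policies. The block numbers `BLOCK_F` and `BLOCK_G`, the type names, and the assumption that memory block G holds 0xBEEF are all illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical block numbers standing in for F and G in the example. */
enum { BLOCK_F = 0, BLOCK_G = 1, NUM_BLOCKS = 2 };

typedef struct {
    int valid, dirty, tag;        /* one set, so the tag is the block number */
    uint16_t data;
} Line;

typedef struct {
    Line line;                    /* the single cache line */
    uint16_t mem[NUM_BLOCKS];     /* next level of the hierarchy */
} System;

/* Evict the current line (writing it back if dirty), then load `block`. */
static void fetch(System *s, int block) {
    if (s->line.valid && s->line.dirty)
        s->mem[s->line.tag] = s->line.data;       /* write-back */
    s->line = (Line){1, 0, block, s->mem[block]};
}

void cache_write(System *s, int block, uint16_t value) {
    assert(block >= 0 && block < NUM_BLOCKS);
    if (!s->line.valid || s->line.tag != block)
        fetch(s, block);          /* write miss: write allocate */
    s->line.data = value;         /* write hit: update the cache only */
    s->line.dirty = 1;
}

uint16_t cache_read(System *s, int block) {
    assert(block >= 0 && block < NUM_BLOCKS);
    if (!s->line.valid || s->line.tag != block)
        fetch(s, block);          /* read miss: may trigger a write-back */
    return s->line.data;
}

/* Starting state: cache holds G (clean, 0xBEEF), memory block F = 0xCAFE. */
System example_start(void) {
    System s = {
        .line = {1, 0, BLOCK_G, 0xBEEF},
        .mem = {[BLOCK_F] = 0xCAFE, [BLOCK_G] = 0xBEEF},
    };
    return s;
}
```

Calling `cache_write` on F with 0xFACE, then with 0xFEED, then `cache_read` on G reproduces the steps shown: memory block F stays 0xCAFE until the read miss evicts the dirty line, at which point it becomes 0xFEED.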

(13)

Want to play around with cache parameters and policies? Check out our cache simulator!

▪ https://courses.cs.washington.edu/courses/cse351/cachesim/

Way to use:

Take advantage of “explain mode” and navigable history to test your own hypotheses and answer your own questions

Self-guided Cache Sim Demo posted along with Section 7

Will be used in hw19 – Lab 4 Preparation

(14)

Which of the following cache statements is FALSE?

Vote in Ed Lessons

A. We can reduce compulsory misses by decreasing our block size

B. We can reduce conflict misses by increasing associativity

C. A write-back cache will save time for code with good temporal locality on writes

D. A write-through cache will always match data with the memory hierarchy level below it

(15)

Write code that has locality!

Spatial: access data contiguously

Temporal: make sure access to the same data is not too far apart in time

How can you achieve locality?

Adjust memory accesses in code (software) to improve miss rate (MR)

Requires knowledge of both how caches work as well as your system’s parameters
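As an illustration of adjusting memory accesses for spatial locality, the sketch below sums the same row-major `n × n` array in two loop orders. The function names are hypothetical. Both return the same sum, but the first walks memory with stride 1 while the second jumps `n` doubles between consecutive accesses, which tends to miss far more often once a row no longer fits in a cache block.

```c
#include <assert.h>
#include <stddef.h>

/* Stride-1 traversal: consecutive accesses touch adjacent elements. */
double sum_row_major(const double *a, size_t n) {
    assert(a != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            sum += a[i * n + j];
    return sum;
}

/* Stride-n traversal: consecutive accesses are n doubles apart. */
double sum_col_major(const double *a, size_t n) {
    assert(a != NULL);
    double sum = 0.0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            sum += a[i * n + j];
    return sum;
}
```

How large the difference is in practice depends on the system's cache parameters, so timing both versions on the target machine is the only reliable comparison.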

(16)

C = A × B, where c(i,j) is the dot product of row a(i,*) and column b(*,j)

(17)

How do cache blocks fit into this scheme?

Row major matrix in memory: Cache

(18)

// move along rows of A
for (i = 0; i < n; i++)
  // move along columns of B
  for (j = 0; j < n; j++)
    // EACH k loop reads row of A, col of B
    // Also read & write c(i,j) n times
    for (k = 0; k < n; k++)
      c[i*n+j] += a[i*n+k] * b[k*n+j];

C(i,j) = C(i,j) + A(i,:) × B(:,j)

(19)

Scenario Parameters:

Square matrix (𝑛 × 𝑛), elements are doubles

Cache block size 𝐾 = 64 B = 8 doubles

Cache size 𝐶 ≪ 𝑛 (much smaller than 𝑛)

Each iteration:

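𝑛/8 + 𝑛 = 9𝑛/8 misses (𝑛/8 for the row of A, read with stride 1 at 8 doubles per block, plus 𝑛 for the column of B, read with stride 𝑛)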

(20)

Scenario Parameters:

Square matrix (𝑛 × 𝑛), elements are doubles

Cache block size 𝐾 = 64 B = 8 doubles

Cache size 𝐶 ≪ 𝑛 (much smaller than 𝑛)

Each iteration:

𝑛/8 + 𝑛 = 9𝑛/8 misses

Afterwards in cache: (schematic)

(21)

Scenario Parameters:

Square matrix (𝑛 × 𝑛), elements are doubles

Cache block size 𝐾 = 64 B = 8 doubles

Cache size 𝐶 ≪ 𝑛 (much smaller than 𝑛)

Each iteration:

𝑛/8 + 𝑛 = 9𝑛/8 misses

Total misses: 9𝑛/8 × 𝑛² = (9/8)𝑛³

(22)

Can get the same result of a matrix multiplication by splitting the matrices into smaller submatrices (matrix “blocks”)

(23)

Matrices of size 𝑛 × 𝑛, split into a 4 × 4 grid of blocks of size 𝑟 × 𝑟 (𝑛 = 4𝑟)

C22 = A21 B12 + A22 B22 + A23 B32 + A24 B42 = Σk A2k * Bk2

Multiplication operates on small “block” matrices

C11 C12 C13 C14     A11 A12 A13 A14     B11 B12 B13 B14
C21 C22 C23 C24  =  A21 A22 A23 A24  ×  B21 B22 B23 B24
C31 C32 C33 C34     A31 A32 A33 A34     B31 B32 B33 B34
C41 C42 C43 C44     A41 A42 A43 A44     B41 B42 B43 B44

(24)

𝑟 = block matrix size (assume 𝑟 divides 𝑛 evenly)

// move by rxr BLOCKS now
for (i = 0; i < n; i += r)
  for (j = 0; j < n; j += r)
    for (k = 0; k < n; k += r)
      // block matrix multiplication
      for (ib = i; ib < i+r; ib++)
        for (jb = j; jb < j+r; jb++)
          for (kb = k; kb < k+r; kb++)
            c[ib*n+jb] += a[ib*n+kb] * b[kb*n+jb];
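The blocked loop nest can be checked against the plain triple loop on a small input. The sketch below wraps both in functions. The names `matmul_naive` and `matmul_blocked` are hypothetical, matrices are assumed to be row-major `double` arrays, and `c` is assumed to be zero-initialized by the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Reference triple loop: c += a * b for row-major n x n matrices. */
void matmul_naive(const double *a, const double *b, double *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* Blocked version: same arithmetic, visited one r x r block at a time. */
void matmul_blocked(const double *a, const double *b, double *c,
                    size_t n, size_t r) {
    assert(r > 0 && n % r == 0);   /* r must divide n evenly */
    for (size_t i = 0; i < n; i += r)
        for (size_t j = 0; j < n; j += r)
            for (size_t k = 0; k < n; k += r)
                /* multiply block A(i,k) by block B(k,j) into block C(i,j) */
                for (size_t ib = i; ib < i + r; ib++)
                    for (size_t jb = j; jb < j + r; jb++)
                        for (size_t kb = k; kb < k + r; kb++)
                            c[ib * n + jb] += a[ib * n + kb] * b[kb * n + jb];
}
```

Blocking changes only the order in which the products are visited, so both functions produce the same matrix. The choice of `r` is what determines how well the three active blocks fit in the cache.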

(25)

Scenario Parameters:

Cache block size 𝐾 = 64 B = 8 doubles

Cache size 𝐶 ≪ 𝑛 (much smaller than 𝑛)

Three blocks (𝑟 × 𝑟) fit into cache: 3𝑟² < 𝐶

Each block iteration:

𝑟²/8 misses per block (𝑟² elements per block, 8 per cache block)

2𝑛/𝑟 × 𝑟²/8 = 𝑛𝑟/4 (𝑛/𝑟 blocks in row and column)

(26)

Scenario Parameters:

Cache block size 𝐾 = 64 B = 8 doubles

Cache size 𝐶 ≪ 𝑛 (much smaller than 𝑛)

Three blocks (𝑟 × 𝑟) fit into cache: 3𝑟² < 𝐶

Each block iteration:

𝑟²/8 misses per block (𝑟² elements per block, 8 per cache block)

2𝑛/𝑟 × 𝑟²/8 = 𝑛𝑟/4 (𝑛/𝑟 blocks in row and column)

Afterwards in cache: (schematic)

(27)

Scenario Parameters:

Cache block size 𝐾 = 64 B = 8 doubles

Cache size 𝐶 ≪ 𝑛 (much smaller than 𝑛)

Three blocks (𝑟 × 𝑟) fit into cache: 3𝑟² < 𝐶

Each block iteration:

𝑟²/8 misses per block (𝑟² elements per block, 8 per cache block)

2𝑛/𝑟 × 𝑟²/8 = 𝑛𝑟/4 (𝑛/𝑟 blocks in row and column)
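Carrying the per-iteration count through to a total follows the same counting argument as the unblocked case. The derivation below is a sketch under the scenario parameters above. It ignores misses on the blocks of C and assumes 𝑟 divides 𝑛 evenly.

```latex
\begin{align*}
\text{misses per block iteration} &= \frac{2n}{r}\cdot\frac{r^2}{8} = \frac{nr}{4} \\
\text{number of block iterations} &= \left(\frac{n}{r}\right)^2 \\
\text{total misses (blocked)} &= \frac{nr}{4}\cdot\left(\frac{n}{r}\right)^2 = \frac{n^3}{4r} \\
\text{total misses (unblocked)} &= \frac{9n}{8}\cdot n^2 = \frac{9}{8}\,n^3 \\
\frac{\text{unblocked}}{\text{blocked}} &= \frac{9}{8}\,n^3\cdot\frac{4r}{n^3} = \frac{9r}{2}
\end{align*}
```

In this model the benefit grows linearly with 𝑟, which is why the largest 𝑟 satisfying 3𝑟² < 𝐶 is preferred.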

(28)

Here 𝑛 = 100, 𝐶 = 32 KiB, 𝑟 = 30

Naïve: ≈ 1,020,000 cache misses

Blocked: ≈ 90,000 cache misses

(29)

How data structures are organized

How data are accessed

• Nested loop structure

• Blocking is a general technique

All systems favor “cache-friendly code”

Getting absolute optimum performance is very platform specific

• Cache size, cache block size, associativity, etc.

Can get most of the advantage with generic code

• Keep working set reasonably small (temporal locality)

• Use small strides (spatial locality)

(30)

[Figure: read throughput (MB/s) as a function of stride (x8 bytes, s1 to s7) and size (32k to 8m), 64 B block size. Annotations: slopes of spatial locality, ridges of temporal locality, L1, L2, L3, Mem, prefetching.]

(31)

Linux:

lscpu

ls /sys/devices/system/cpu/cpu0/cache/index0/

• Example: cat /sys/devices/system/cpu/cpu0/cache/index*/size

Windows:

wmic memcache get <query> (all values in KB)

Example: wmic memcache get MaxCacheSize
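On Linux, the same sysfs files can also be read from a program. The following is a minimal sketch in C, assuming the sysfs layout shown above. The helper name `read_first_line` is hypothetical, and the call simply fails on systems where the file does not exist.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Read the first line of a text file into buf, with the newline stripped.
 * Returns 0 on success, -1 if the file cannot be opened or is empty.
 */
int read_first_line(const char *path, char *buf, size_t len) {
    assert(path != NULL && buf != NULL && len > 0);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    char *ok = fgets(buf, (int)len, f);
    fclose(f);
    if (ok == NULL)
        return -1;
    buf[strcspn(buf, "\n")] = '\0';   /* drop the trailing newline, if any */
    return 0;
}
```

For example, calling `read_first_line("/sys/devices/system/cpu/cpu0/cache/index0/size", buf, sizeof buf)` fills `buf` with that cache's size string on a Linux machine that exposes the file. The value depends on the hardware.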
