1. Caches (30 points): Consider a byte-addressable memory-system with 32-bit physical addresses.

The Level-1 instruction cache can store 32 KB of data and has 32 byte blocks.

(a) The cache is write-through, 2-way set associative and uses a least-recently used strategy to replace cache blocks. What is the total number of bits necessary to build this cache? Would this quantity increase or decrease if the cache was 4-way associative with 16 byte blocks?

Show your calculations clearly.

Solution: The cache size is 32 KB and there are 32 bytes per block, so there are 32 KB / 32 B = 1024 blocks. Since there are 2 blocks per set (2-way associative), there are 1024/2 = 512 sets. Thus the index field is log₂(512) = 9 bits long. Also the block offset is log₂(32) = 5 bits long.

To build the cache, each block stores 32×8 bits of data, 1 valid bit and 32−9−5 = 18 tag bits.

Also, each set stores one LRU bit. Hence, the total number of bits is 1024(32 × 8 + 1 + 18) + 512 = 282,112.

If the block size were decreased to 16 bytes, the number of blocks would increase to 2048, which increases the number of tag and valid bits but keeps the data bits constant (each tag also grows by one bit, to 19, because the block offset shrinks to 4 bits). If the associativity also increases to 4, the number of sets remains the same (512) but the number of LRU bits per set must increase in order to remember all possible permutations of blocks from least to most recently used. Hence, the number of bits to implement the cache increases.
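The bit count above can be cross-checked with a short C++ sketch. The helper names (`log2u`, `bits_for`, `factorial`, `cache_bits`) are hypothetical, and the LRU encoding is an assumption: each set is taken to store just enough bits to distinguish all orderings of its ways, which gives 1 bit for 2 ways and 5 bits for 4 ways. The solution itself only states that the LRU state must grow.

```cpp
#include <cstdint>

// Integer log2, valid for exact powers of two (hypothetical helper).
constexpr unsigned log2u(std::uint64_t x) {
    unsigned n = 0;
    while (x > 1) { x >>= 1; ++n; }
    return n;
}

// Smallest number of bits able to distinguish `states` different values.
constexpr unsigned bits_for(std::uint64_t states) {
    unsigned n = 0;
    while ((std::uint64_t{1} << n) < states) ++n;
    return n;
}

constexpr std::uint64_t factorial(unsigned k) {
    std::uint64_t f = 1;
    for (unsigned i = 2; i <= k; ++i) f *= i;
    return f;
}

// Per block: data + 1 valid bit + tag. Per set: LRU state.
// Assumption: LRU state encodes one of ways! orderings.
constexpr std::uint64_t cache_bits(std::uint64_t cache_bytes, unsigned block_bytes,
                                   unsigned ways, unsigned addr_bits = 32) {
    const std::uint64_t blocks = cache_bytes / block_bytes;
    const std::uint64_t sets = blocks / ways;
    const unsigned tag_bits = addr_bits - log2u(sets) - log2u(block_bytes);
    const unsigned lru_bits = bits_for(factorial(ways));
    return blocks * (8ull * block_bytes + 1 + tag_bits) + sets * lru_bits;
}

// 2-way, 32-byte blocks: 1024 * (256 + 1 + 18) + 512 * 1
static_assert(cache_bits(32 * 1024, 32, 2) == 282112, "2-way, 32-byte blocks");
// 4-way, 16-byte blocks: 2048 * (128 + 1 + 19) + 512 * 5
static_assert(cache_bits(32 * 1024, 16, 4) == 305664, "4-way, 16-byte blocks");
```

Under these assumptions the 4-way, 16-byte configuration needs more bits than the original one, which agrees with the conclusion above.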

(b) There are two buses, one for addresses and one for data, connecting the cache to memory.

Each bus is 4 bytes wide and has a 1 cycle latency. If the interleaved memory has 8 banks each with a latency L cycles, compute the miss penalty of the cache. Show your calculations clearly.

Solution: Since the cache is write-through, there are no dirty bits, so the miss penalty is simply the time to load a new cache block from memory. It takes L + 2 cycles to load 4 bytes of data from the first bank (1 cycle to send the address, L cycles for the bank latency, and 1 cycle to receive the data). Since the cache block is 32 bytes, 7 additional such transfers are needed. Because the 8 banks are interleaved, these transfers can be pipelined and complete in 7 additional cycles, for a total of L + 2 + 7 = L + 9 cycles. (Note that regardless of the value of L, there cannot be a structural hazard involving addresses and data since the buses are different.) Hence the miss penalty for this cache is L + 9 cycles.
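As a quick check of the arithmetic, the miss penalty can be written as a small function. This is a sketch: `miss_penalty` is a hypothetical name, and it assumes a 1-cycle bus and at least as many banks as bus-width words per block (8 banks for 8 words here), so that the remaining words arrive one per cycle.

```cpp
// Cycles to fetch one block: 1 (send address) + L (bank latency)
// + 1 (first word back) + one cycle for each remaining word.
// Assumption: banks >= words per block, so transfers pipeline fully.
constexpr unsigned miss_penalty(unsigned bank_latency,
                                unsigned block_bytes = 32,
                                unsigned bus_bytes = 4) {
    const unsigned words = block_bytes / bus_bytes;
    return 1 + bank_latency + 1 + (words - 1);
}

static_assert(miss_penalty(10) == 19, "L = 10 gives L + 9");
```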

(c) A MIPS program contains a loop whose body has N instructions. A variant of this program unrolls the loop so that the loop body now contains 2N instructions. It is observed that the miss-rate in the Level-1 instruction cache for this region of the code rises from nearly 0% to over 12%. Explain this observation.

Solution: Since the cache block size is 32 bytes and each MIPS instruction is 4 bytes, each cache block contains 32/4 = 8 instructions. The only way this region of the code can achieve a 0% miss-rate is if the entire loop body (4N bytes) fits within the cache in the first iteration, and then remains within the cache for all subsequent iterations. For this to happen, 4N ≤ 32 K and hence N ≤ 8 K. After unrolling, the loop body occupies 8N bytes. In order to suffer a miss-rate of more than 12%, the entire loop body cannot fit in the cache and hence 8N > 32 K, i.e., N > 4 K.

2. Virtual Memory (25 points):

Consider a virtual memory system with 36-bit virtual addresses, 30-bit physical addresses, and 4 KB pages. There is a single-level direct-mapped cache of size 8 KB with 16 byte blocks.

(a) Since the cache is direct-mapped it is possible for two addresses v₁ and v₂ to “collide” in the cache (i.e., the cache can only hold the byte at one of these addresses at a given time), but this cannot happen if v₁ and v₂ are in the same virtual page. Explain why.

Solution: Since the page size is 4 KB, the page offset is log₂(4K) = 12 bits. If v₁ and v₂ belong to the same virtual page, they are identical in their VPN fields (i.e., the upper 36 − 12 = 24 bits) and differ only in their lower 12 bits. Since the VPN bits are translated into the PPN bits, the physical addresses p₁ and p₂ corresponding to v₁ and v₂ also differ only in their lower 12 bits.

The block offset is log₂(16) = 4 bits long. The cache contains 8K/16 = 512 blocks. Since the cache is direct-mapped, it has 512 sets and hence the index field is log₂(512) = 9 bits. Together, the index and block offset span the lower 13 bits of the physical address, which include all 12 page-offset bits.

Thus, p₁ and p₂ differ either in their index field (in which case they cannot “collide” since they map to different sets) or in the block offset field (in which case they cannot “collide” since they map to different bytes within the same block).
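The argument can be checked mechanically with a short C++ sketch. All names here (`set_index`, `block_offset`, `slot_id`, `page_is_collision_free`) are hypothetical; the field widths come from the solution above. The check confirms that, for a fixed PPN, the index and block offset together reproduce the page offset, so two different offsets in one page can never share both fields.

```cpp
#include <cstdint>

constexpr unsigned kOffsetBits = 4;   // 16-byte blocks
constexpr unsigned kIndexBits  = 9;   // 512 sets, direct-mapped
constexpr unsigned kPageBits   = 12;  // 4 KB pages

constexpr std::uint32_t block_offset(std::uint32_t p) {
    return p & ((1u << kOffsetBits) - 1);
}
constexpr std::uint32_t set_index(std::uint32_t p) {
    return (p >> kOffsetBits) & ((1u << kIndexBits) - 1);
}
// Index and block offset combined: identifies one byte slot in the cache.
constexpr std::uint32_t slot_id(std::uint32_t p) {
    return (set_index(p) << kOffsetBits) | block_offset(p);
}

// True if every page offset under this PPN lands in its own slot.
constexpr bool page_is_collision_free(std::uint32_t ppn) {
    const std::uint32_t base = ppn << kPageBits;
    for (std::uint32_t off = 0; off < (1u << kPageBits); ++off) {
        // The low 12 bits of the slot id must equal the page offset,
        // which makes the offset-to-slot mapping one-to-one.
        if ((slot_id(base + off) & ((1u << kPageBits) - 1)) != off) return false;
    }
    return true;
}

static_assert(page_is_collision_free(0x2A), "no collisions inside one page");
```

The PPN value `0x2A` is arbitrary; any 18-bit PPN gives the same result because the PPN only affects the top bit of the index.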

(b) Your friend claims that the cache can simultaneously hold the first bytes from any two distinct virtual pages. Explain whether your friend is right or wrong.

Solution: Our friend is wrong. Suppose v₁ and v₂ are the first addresses in two distinct virtual pages. Hence the 12 least significant bits of v₁ and v₂ are all zero (page offset = zero).

If the VPNs of v₁ and v₂ happen to map to PPNs whose least significant bit is identical (e.g., both PPNs are even), then the 13 least significant bits of the corresponding physical addresses p₁ and p₂ will be identical. These 13 least significant bits form the index (9 bits) and block offset (4 bits). Since the two physical addresses have the same index field, they map to the same set in the cache and hence “collide”.
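A concrete counterexample is easy to construct. The sketch below uses hypothetical helpers (`set_index`, `page_start`) and arbitrary example PPNs (2, 3 and 4) that are not taken from the problem; it shows that page-start addresses collide exactly when their PPNs agree in the lowest bit.

```cpp
#include <cstdint>

// Index field of a physical address: 9 bits above the 4-bit block offset.
constexpr std::uint32_t set_index(std::uint32_t p) {
    return (p >> 4) & 0x1FFu;
}

// Physical address of the first byte of the page with the given PPN (4 KB pages).
constexpr std::uint32_t page_start(std::uint32_t ppn) {
    return ppn << 12;
}

// PPNs 2 and 4 are both even: same set (index 0), so the first bytes collide.
static_assert(set_index(page_start(2)) == set_index(page_start(4)), "collision");
// PPNs 2 and 3 differ in their low bit: sets 0 and 256, so no collision.
static_assert(set_index(page_start(2)) != set_index(page_start(3)), "no collision");
```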

(c) Based on the following quantities, compute a formula for the average time to read a single byte of data from memory. Explain your answer.

         Hit time (cycles)   Miss rate   Miss penalty (cycles)
TLB      h_t                 r_t         p_t
Cache    h_c                 r_c         p_c

Solution: Since the TLB is accessed first and the cache is accessed next (in serial), the average time to read a single byte is the AMAT of the TLB plus the AMAT of the cache i.e.,

(h_t + r_t · p_t) + (h_c + r_c · p_c)
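The formula translates directly into code. This is a minimal sketch: `amat` and `read_time` are hypothetical names, and the numbers in the check are made-up values chosen only so that the arithmetic is exact; they are not part of the problem.

```cpp
// Average memory access time of one level: hit time + miss rate * miss penalty.
constexpr double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

// TLB and cache are accessed serially, so their average times add.
constexpr double read_time(double ht, double rt, double pt,
                           double hc, double rc, double pc) {
    return amat(ht, rt, pt) + amat(hc, rc, pc);
}

// Illustrative values only: (1 + 0.25 * 40) + (2 + 0.5 * 100) = 11 + 52 = 63.
static_assert(read_time(1, 0.25, 40, 2, 0.5, 100) == 63.0, "serial TLB + cache");
```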

Latencies of key components of datapath (Problem 3)

Component                  Latency (ps)
TLB                        100
PC register read/write     10
Instruction cache          150
Register file read/write   180
ALU                        75
Data cache read/write      180
(all other latencies are negligible)

Description of the six stages: IT IF ID EX MM WB

IT: Translates PC_v into physical address PC_p; in parallel updates PC_v to PC_v + 4.

IF: Reads the instruction from the instruction cache at address PC_p computed by the IT stage.

ID: Reads register file and then (for jr instructions) updates PC_v; in parallel computes virtual branch target address, and then (for unconditional branches or conditional branches predicted taken) updates PC_v.

EX: Performs ALU operation (including resolving conditional branch decisions). For load and store instructions only, translates the ALU output into a physical address. For conditional branches, updates PC_v (if necessary) based on ALU output.

MM: Reads from/writes to data cache using physical address.

WB: Updates the register file.

3. Virtual Memory and Pipeline (30 points):

Consider the 6-stage MIPS pipelined datapath on the opposite page. The PC register stores only the virtual-PC PC_v of the next instruction.

(a) What is the minimum possible clock-cycle time for this datapath? Explain your answer.

Solution: The minimum clock-cycle time is the latency of the longest stage, which is 190 ps as explained below:

Stage   Latency (ps)                       Explanation
IT      10 + max(100, 0 + 10) = 110        read PC_v, then in parallel: use TLB to get PC_p; update PC_v
IF      150                                read instruction cache
ID      max(180 + 10, 0, 0 + 10) = 190     in parallel: read register file and possibly update PC_v; possibly update PC_v to target PC_v
EX      75 + max(100, 0, 10) = 175         use ALU, then do one of: translate address with TLB; possibly update PC_v
MM      180                                read/write from/to data cache
WB      180                                update register file
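The per-stage latencies can be recomputed from the component latencies with a few constant expressions. This is a sketch with hypothetical names (`max2`, `kIT` and so on); negligible latencies are treated as 0 and dropped from the `max` terms.

```cpp
constexpr int max2(int a, int b) { return a > b ? a : b; }

// Component latencies in ps, from the table for Problem 3.
constexpr int kTlb = 100, kPc = 10, kICache = 150;
constexpr int kRegFile = 180, kAlu = 75, kDCache = 180;

constexpr int kIT = kPc + max2(kTlb, kPc);      // read PC_v, then TLB in parallel with PC_v update
constexpr int kIF = kICache;                    // instruction cache read
constexpr int kID = max2(kRegFile + kPc, kPc);  // regfile read then PC_v update, or target PC_v update
constexpr int kEX = kAlu + max2(kTlb, kPc);     // ALU, then TLB translation or PC_v update
constexpr int kMM = kDCache;                    // data cache access
constexpr int kWB = kRegFile;                   // register file write

// The clock must accommodate the slowest stage.
constexpr int kCycle =
    max2(max2(max2(kIT, kIF), max2(kID, kEX)), max2(kMM, kWB));

static_assert(kCycle == 190, "ID is the critical stage");
```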

(b) On the above 6-stage pipeline, how many cycles are lost due to stalls/flushes for unconditional branches (j instructions) and conditional branches correctly predicted taken? Draw pipeline diagrams to explain your answer.

Solution: 2 cycles because the IT stage of the target instruction can only begin after the j instruction has computed the virtual branch target, which happens at the end of the ID stage. Hence, the IT stage of the target starts in cycle 4 instead of cycle 2 as shown below, a loss of two cycles:

Cycle       1    2    3    4    5    6
j target    IT   IF   ID*  EX   MM   WB
target           x    x    *IT  IF   ...

A similar diagram applies to conditional branches correctly predicted taken.

(c) Consider a new MIPS instruction jw imm($rs) that unconditionally jumps to the instruction at the virtual address stored at Mem[$rs + imm] (i.e., jw imm($rs) is equivalent to lw $reg, imm($rs); jr $reg). How many cycles are lost due to stalls/flushes for the jw instruction?

Solution: (Assumption: The datapath is modified so that the PC_v register is updated at the end of the MM stage to the value read from memory on a jw instruction.) In this case 4 cycles are lost:

Cycle         1    2    3    4    5    6
jw imm($rs)   IT   IF   ID   EX   MM*  WB
target             x    x    x    x    *IT
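Both penalties follow one rule: if a control-transfer instruction enters IT in cycle 1 and PC_v receives the target at the end of the stage it occupies in cycle k, the target's IT starts in cycle k + 1 instead of cycle 2, so k − 1 cycles are lost. A small sketch of that rule, where the `Stage` enum and `lost_cycles` are hypothetical names:

```cpp
// Pipeline stages numbered by the cycle in which an instruction
// that enters IT in cycle 1 occupies them.
enum class Stage { IT = 1, IF, ID, EX, MM, WB };

// Cycles lost when PC_v is redirected at the end of `update_stage`:
// the target's IT begins one cycle later, instead of in cycle 2.
constexpr int lost_cycles(Stage update_stage) {
    return static_cast<int>(update_stage) - 1;
}

static_assert(lost_cycles(Stage::ID) == 2, "j and predicted-taken branches");
static_assert(lost_cycles(Stage::MM) == 4, "jw under the stated assumption");
```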

4. Code Optimization (15 points):

Consider two versions of a function to reverse an array of n bytes. Version 1 reverses the array “in-place” whereas Version 2 allocates space for a new array:

    // Version 1 (in-place)
    char *reverse(char *a, int n) {
        int left = 0;
        int right = n-1;
        while(left < right) {
            int temp = a[left];
            a[left] = a[right];
            a[right] = temp;
            left++; right--;
        }
        return a;
    }

    // Version 2 (new array)
    char *reverse(char *a, int n) {
        char *rev = new char[n];
        for(int i = 0; i < n; ++i)
            rev[i] = a[n-1-i];
        return rev;
    }

Both versions are run on a memory system with a single-level data-cache with block-size b bytes.

The cache is write-back and implements an allocate-on-write policy. Which version of the code would utilize the data-cache better? Explain your answer.

Solution: First we analyze Version 1. In each iteration of the loop, two array values a[left] and a[right] are read, and then both these values are updated. The updates certainly result in cache hits even if the reads result in cache misses. Note that since left increases by 1 and right decreases by 1 each time, the reads are spatially local and result in a miss-rate of about 1/b (since the block-size is b bytes and each block stores b array elements). Since there are about n reads in total, the total number of cache misses is about n/b.

We now analyze Version 2. Each iteration of the loop reads one array value a[n-1-i] and updates one array value rev[i]. In this case, the reads and writes may independently cause hits and misses, but because of spatial locality and because the cache is allocate-on-write, the miss-rate will be about 1/b for each array (using the same reasoning as above). In this case, since there are n reads and n writes, the total number of cache misses is about 2n/b, which is twice as bad as Version 1.

Hence Version 1 utilizes the cache better.
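The n/b versus 2n/b estimate can be reproduced with a small counting model. This is a sketch under simplifying assumptions that are not part of the problem: the cache is large enough that no block is ever evicted (only first-touch misses are counted), a starts at a block-aligned address 0, and rev is placed immediately after a. The names `ColdMissCounter`, `misses_v1` and `misses_v2` are hypothetical, and n = 1024, b = 16 are arbitrary example sizes.

```cpp
#include <cstddef>

constexpr std::size_t kMaxBlocks = 256;  // must cover every block touched below

// Counts first-touch (cold) misses only; assumes nothing is ever evicted.
struct ColdMissCounter {
    bool seen[kMaxBlocks] = {};
    std::size_t block_bytes;
    std::size_t misses = 0;

    constexpr explicit ColdMissCounter(std::size_t b) : block_bytes(b) {}

    constexpr void access(std::size_t addr) {
        const std::size_t blk = addr / block_bytes;
        if (!seen[blk]) { seen[blk] = true; ++misses; }
    }
};

// Version 1: in-place swap; a occupies addresses [0, n). Assumes n >= 1.
constexpr std::size_t misses_v1(std::size_t n, std::size_t b) {
    ColdMissCounter c(b);
    std::size_t left = 0, right = n - 1;
    while (left < right) {
        c.access(left);  c.access(right);  // reads of a[left], a[right]
        c.access(left);  c.access(right);  // writes hit the blocks just loaded
        ++left; --right;
    }
    return c.misses;
}

// Version 2: a occupies [0, n), rev occupies [n, 2n).
constexpr std::size_t misses_v2(std::size_t n, std::size_t b) {
    ColdMissCounter c(b);
    for (std::size_t i = 0; i < n; ++i) {
        c.access(n - 1 - i);  // read a[n-1-i]
        c.access(n + i);      // write rev[i], allocated on write
    }
    return c.misses;
}

static_assert(misses_v1(1024, 16) == 64, "n / b");
static_assert(misses_v2(1024, 16) == 128, "2n / b");
```

With these sizes the model gives 64 misses for Version 1 and 128 for Version 2, matching the n/b and 2n/b estimates.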