Memory Operation Semantics - The Model of Computation of CUDA and its Formal Semantics

We have already defined the semantics of a store to global memory with the .wb cache operation in section 4.1. In this section, we present additional memory programs that define the semantics of other memory operations supported by PTX. We do not present programs for all supported operations, only for those that illustrate points worth mentioning. As already stated above, the documentation does not explain how volatile memory operations affect matching cache lines in the caches, hence we leave volatile operations entirely unspecified.

Memory program 4.2 defines the semantics of a read of global or local memory with the .ca cache operation. First, we check whether the requested address is already stored in the L1 cache. If so, we read the data from L1 in line 22. We know the data is cached, but we do not know whether the cache line is valid, so readL1 might be instantaneous or it might defer execution. In any case, we yield execution after we have read the value to allow other programs to continue. When program 4.2 executes its next micro step, the results of the read are written to the destination registers. At this point in time, the L1 cache might have already evicted the cache line; this is not a problem, however, as the data read from the cache is stored in the program’s implicit state. After yielding, we release the destination registers in line 28 and end the program.

     1  if (!isCachedL1)
     2  {
     3      reserveL1;
     4      yield;
     5
     6      if (!isCachedL2)
     7      {
     8          reserveL2;
     9          yield;
    10
    11          readMem;
    12          yield;
    13
    14          updateL2;
    15      }
    16
    17      readL2;
    18      yield;
    19      updateL1;
    20  }
    21
    22  readL1;
    23  yield;
    24
    25  writeToRegs;
    26  yield;
    27
    28  releaseRegs;
    29  end;

Listing 4.2: Memory program for global and local reads using the .ca cache operation

On the other hand, if the L1 cache does not contain a matching cache line, we have to retrieve the value from L2 and cache the data in L1. Thus, we reserve a cache line in line 3. reserveL1 is either instantaneous or defers execution, depending on whether there is an eligible cache line that is either unused or can be evicted. Reserving also ensures that the cache line cannot be evicted until we read the value in line 22. Once reserveL1 succeeds, we yield execution. In the next micro step, we check whether the L2 cache contains a matching cache line. If so, we retrieve the value from L2 in line 17 and yield again once the read succeeds. Next, we update the reserved L1 cache line. updateL1 is definitely instantaneous, as the matching cache line is pinned by the program. However, updateL1 might not write the value retrieved from L2 into L1, because another memory program might have written a newer value to L1 while the value was being retrieved from L2. That cannot happen for global data, as global data is never written to L1, but it might happen if a thread reads local data and writes to the same location without waiting for the previous read to complete. Whether this situation can actually arise is unknown. In any case, readL1 instantaneously reads the value and unpins the cache line, and the program continues to execute as outlined above.

If the data is not cached in L2 either, the data has to be retrieved from global or local memory and must be cached in L2 first. This works in an analogous manner to updating L1, with the exception that readMem is used to retrieve the value from memory.
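The control flow of program 4.2 can be mimicked with a Python generator in which every `yield` ends a micro step. The following is only a minimal sketch under simplifying assumptions, not part of the formalization: the caches are plain dictionaries, a reserved line is represented by `None`, reservations never defer, eviction is not modeled, and the names `read_ca`, `run`, `l1`, `l2`, `mem`, and `regs` are hypothetical.

```python
def read_ca(addr, dest, l1, l2, mem, regs):
    """Generator sketch of program 4.2; every `yield` ends one micro step."""
    if addr not in l1:              # !isCachedL1
        l1[addr] = None             # reserveL1 (assumed to succeed at once)
        yield
        if addr not in l2:          # !isCachedL2
            l2[addr] = None         # reserveL2 (assumed to succeed at once)
            yield
            value = mem[addr]       # readMem
            yield
            if l2[addr] is None:    # updateL2: keep a newer value if present
                l2[addr] = value
        value = l2[addr]            # readL2
        yield
        if l1[addr] is None:        # updateL1: keep a newer value if present
            l1[addr] = value
    value = l1[addr]                # readL1
    yield
    regs[dest] = value              # writeToRegs
    yield
    # releaseRegs: register blocking is not modeled in this sketch


def run(program):
    """Drive one program to completion and return its number of micro steps."""
    steps = 1                       # the segment before the first yield
    for _ in program:
        steps += 1
    return steps


mem, l1, l2, regs = {0x10: 42}, {}, {}, {}
cold = run(read_ca(0x10, "r1", l1, l2, mem, regs))   # misses in L1 and L2
warm = run(read_ca(0x10, "r2", l1, l2, mem, regs))   # hits in L1
```

In this sketch, the cold read takes seven micro steps and the warm read three, which matches the number of yield statements on the respective paths through listing 4.2.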

    writeL1;
    classifyFirstL1;
    yield;

    releaseRegs;
    end;

Listing 4.3: Memory program for local writes using the .cg cache operation

The semantics of local writes with cache operation .cg are defined by program 4.3. The data is written to the L1 cache, which either happens instantaneously or is deferred. Afterwards, the cache line’s eviction class is set to first as specified by [9, Table 81]. The program ends after releasing the source registers.

     1  evictL1;
     2  yield;
     3
     4  if (!isCachedL2)
     5  {
     6      reserveL2;
     7      classifyNormalL2;
     8      yield;
     9
    10      readMem;
    11      yield;
    12
    13      updateL2;
    14      yield;
    15  }
    16
    17  readL2;
    18  atomOp;
    19  writeL2;
    20  yield;
    21  writeToRegs;
    22  yield;
    23  releaseRegs;
    24  end;

Listing 4.4: Memory program for atomic operations on global memory

Memory program 4.4 defines the semantics of an atomic memory operation on global memory. First, the matching cache line in the L1 cache is evicted. This seems to be a reasonable thing to do, although the PTX specification does not explicitly mention this step. References [7, 12] hint that atomic operations are performed by the raster operation units on data cached in L2. Consequently, we check whether the requested address is cached in L2 and load it from global memory if no matching cache line is found. This works in a similar manner to program 4.2. However, as we are dealing with atomic operations, micro steps play an important role here. Loading the data into L2 is by no means atomic. After the data is written to L2, the program yields execution and other operations, including other atomic operations, may access the cache line. In any case, the cache line cannot be evicted, because updateL2 in line 13 does not undo the pinning of the cache line caused by reserveL2 in line 6. In line 17, program 4.4 executes the core of the atomic operation: the data is read from L2 either instantaneously or at a later point in time if isCachedL2 was true but the cache line was only reserved. Once the read succeeds, the atomic reduction is performed and the computed value is written back to L2 during the same micro step, i.e. atomically. After the successful write, which is guaranteed to be instantaneous because the address is cached, the micro step ends. During the subsequent micro steps, the destination registers are updated with the result values and released.
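To illustrate why performing readL2, atomOp, and writeL2 within a single micro step suffices for atomicity, the following sketch interleaves two instances of program 4.4 under a round-robin schedule. It is a simplified model under stated assumptions: the L2 cache is a dictionary, a reserved line is represented by `None`, a deferred readL2 is approximated by yielding until the line is filled, L1 is not modeled, and the atomic operation is an addition that returns the value read before the update, as PTX `atom` does. The names `atom_add_global` and `round_robin` are hypothetical.

```python
def atom_add_global(addr, operand, dest, l2, mem, regs):
    """Generator sketch of program 4.4; every `yield` ends one micro step."""
    yield                           # evictL1 (L1 is not modeled here)
    if addr not in l2:              # !isCachedL2
        l2[addr] = None             # reserveL2, classifyNormalL2
        yield
        value = mem[addr]           # readMem
        yield
        if l2[addr] is None:        # updateL2: keep a newer value if present
            l2[addr] = value
        yield
    while l2[addr] is None:         # readL2 defers while the line is only reserved
        yield
    old = l2[addr]                  # readL2
    l2[addr] = old + operand        # atomOp and writeL2 in the same micro step
    yield
    regs[dest] = old                # writeToRegs
    yield
    # releaseRegs: register blocking is not modeled in this sketch


def round_robin(programs):
    """Execute one micro step of each live program in turn until all end."""
    live = list(programs)
    while live:
        for program in list(live):
            try:
                next(program)
            except StopIteration:
                live.remove(program)


mem, l2, regs = {0: 10}, {}, {}
round_robin([
    atom_add_global(0, 1, "a", l2, mem, regs),
    atom_add_global(0, 2, "b", l2, mem, regs),
])
```

Although the second program reaches readL2 while the first one is still loading the cache line, no update is lost: with this schedule `l2[0]` ends up as 13, and the two programs return 12 and 10 respectively.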

    lock;
    readMem;
    yield;

    atomOp;
    yield;

    writeMem;
    yield;

    unlock;
    yield;

    writeToRegs;
    yield;
    releaseRegs;
    end;

Listing 4.5: Memory program for atomic operations on shared memory

Although not officially documented, atomic operations on shared memory are executed differently by the hardware than atomic operations on global memory. As explained in reference to program 4.4, the raster operation units appear to contain dedicated hardware units to process atomic operations. In contrast, atomic operations on shared memory are emulated by the compiler as can be seen by using Nvidia’s cuobjdump disassembler to inspect a compiled program containing atomic shared memory instructions. The value at the requested address is first loaded into a register and the computation of the atomic reduction is performed using the regular instruction set of the CUDA cores. Afterwards, the computed value is written to shared memory. The compiler uses a special flag of the load and store instructions to signal to the streaming multiprocessor that all accessed shared memory banks should be locked or unlocked. We assume that loads and stores of shared memory that do not have the lock flag set do not respect locked banks and access them anyway. This would explain why atomic operations on shared memory are not atomic with respect to non-atomic operations on the same address [9, Table 105]. Our formalization treats atomic operations on both global and shared memory as parts of the memory model for reasons of brevity. We are confident that the semantics of the PTX compiler’s emulation is equivalent to our formalization.

Program 4.5 defines the formal semantics of atomic operations on shared memory. In the first micro step, the program tries to acquire the lock for the accessed shared memory banks. If any of the banks are already locked, lock defers execution and therefore guarantees that at most one atomic operation operates on the same address at any time. Additionally, no deadlocks can occur because either all or none of the accessed banks are locked. Once the program has acquired the lock, it reads the value, computes the reduction, and writes the result in three separate micro steps. Due to the locking, the program is atomic with respect to other atomic operations accessing the same shared memory locations, but a non-atomic store is indeed able to write a new value to the requested address at some point in between. Once the result is written to shared memory, the banks are unlocked and the program yields again. Writing to and releasing the destination registers is done in the usual way.
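The difference between the two guarantees can be made concrete with a small simulation of program 4.5. Again, this is only a sketch under stated assumptions: locks are tracked per address rather than per bank, a deferred lock is approximated by yielding until the lock is free, the atomic operation is an addition, and a non-atomic store ignores locks, as assumed in the text above. The names `atom_add_shared`, `plain_store`, and `round_robin` are hypothetical.

```python
def atom_add_shared(addr, operand, dest, smem, locks, regs):
    """Generator sketch of program 4.5; every `yield` ends one micro step."""
    while addr in locks:            # lock defers while the location is locked
        yield
    locks.add(addr)                 # lock
    old = smem[addr]                # readMem
    yield
    new = old + operand             # atomOp
    yield
    smem[addr] = new                # writeMem
    yield
    locks.discard(addr)             # unlock
    yield
    regs[dest] = old                # writeToRegs
    yield
    # releaseRegs: register blocking is not modeled in this sketch


def plain_store(addr, value, smem):
    """Non-atomic store that does not respect locks (assumption of the text)."""
    smem[addr] = value
    yield


def round_robin(programs):
    """Execute one micro step of each live program in turn until all end."""
    live = list(programs)
    while live:
        for program in list(live):
            try:
                next(program)
            except StopIteration:
                live.remove(program)


# Two atomic operations on the same address serialize through the lock.
smem, locks, regs = {0: 10}, set(), {}
round_robin([
    atom_add_shared(0, 1, "a", smem, locks, regs),
    atom_add_shared(0, 2, "b", smem, locks, regs),
])
both_atomic = smem[0]

# A non-atomic store slips in between the read and the write of the atomic.
smem2, locks2, regs2 = {0: 10}, set(), {}
round_robin([
    atom_add_shared(0, 1, "a", smem2, locks2, regs2),
    plain_store(0, 100, smem2),
])
with_plain_store = smem2[0]
```

Under this schedule, `both_atomic` is 13, whereas `with_plain_store` is 11: the value 100 written by the non-atomic store is overwritten by the result computed from the previously read value.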

    writeToRegs;
    yield;
    releaseRegs;
    end;

Listing 4.6: Memory program for writes to registers

An instance of program 4.6 is launched whenever threads update the value of a register, i.e. for all arithmetic, logic, and shift instructions supported by PTX. As the writes are performed by the decomposition function, there is not much work left for the program. It merely invokes the decomposition function and releases the blocked registers during a subsequent macro step.
