Implicit Program State - Formalization and Formal Semantics of Memory Programs

4.4 Formalization and Formal Semantics of Memory Programs

4.4.1 Implicit Program State

Memory programs cannot use variables to store data and pass information around. Instead, all relevant state is implicit and inaccessible to the program. The semantics of Boolean and operational expressions, however, pass parts of the implicit state as input to the cache and memory operations defined in the previous sections. Hence, we have to formalize the implicit state of memory programs before we define the semantics of expressions.

The program state $\Sigma$ as specified below might seem bloated. However, we deliberately do not define it in a more fine-grained way: the coarse structure of the state avoids a multitude of projections in function definitions and the like. We do, however, partition the state domain into four sub-domains: $\Sigma_{\text{atomic}}$ for atomic memory operations, $\Sigma_{\text{device}}$ for DRAM memory operations, $\Sigma_{\text{shared}}$ for shared memory operations, and $\Sigma_{\text{reg}}$ for operations that only update registers.

$$\sigma \in \Sigma = \Sigma_{\text{atomic}} \cup \Sigma_{\text{device}} \cup \Sigma_{\text{shared}} \cup \Sigma_{\text{reg}}$$

Except for $\Sigma_{\text{reg}}$, which is somewhat of a special case, all of these program states contain at least one byte list that holds either the value read from a cache or memory or the value provided by the program that should be written to a cache or memory. Reading from and writing to these byte lists is handled implicitly by the program semantics, not by the programs themselves. To highlight the intended usage of the byte lists, we introduce the following three aliases:

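$$v_{\text{in}} \in \mathit{Value}_{\text{in}} = \mathit{Byte}^{*}$$

$$v_{\text{i/o}} \in \mathit{Value}_{\text{in/out}} = \mathit{Byte}^{*}$$

$$v_{\text{out}} \in \mathit{Value}_{\text{out}} = \mathit{Byte}^{*}$$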

Typically, read requests from threads of the same warp are coalesced into fewer, but larger memory transactions. When that happens, it is unclear which value should be written to which output register. As we do not know how requests are coalesced (this is done by an abstract function defined later on), we also do not know how the value retrieved by the memory program should be decomposed and copied into the destination registers of the requesting threads. We solve this issue by assuming that a decomposition function is given to each program that has to decompose a retrieved value. The abstract function that translates thread requests into one or more memory operations creates a decomposition function that reverses the effects of coalescing and stores it in the program’s implicit state. The program semantics can then use the decomposition function to write the retrieved values to the correct destination registers.

$$\mathit{df} \in \mathit{DecompFunc} = \mathit{RegFile} \times \Sigma \to \mathit{RegFile}$$

Suppose two threads execute the statement ld.global.ca.s32 r, [a], where r is the destination register that stores the result of the read and a is the location of the value to read. Furthermore, suppose that the values of r and a are $r_1$ and $a_1$ for the first thread and $r_2$ and $a_2$ for the second one. Moreover, assume that $a_2 = a_1 + 4$. Because both threads read a 4-byte signed integer, the request might be coalesced into one 8-byte request at location $a_1$. After the completion of the corresponding memory program, the program’s state contains the retrieved value in its value list $\vec{b}$. The decomposition function subsequently copies $\vec{b}_0 \ldots \vec{b}_3$ into register $r_1$ and $\vec{b}_4 \ldots \vec{b}_7$ into register $r_2$.
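The two-thread example can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the names `make_decomp_func`, `RegAddr`, and `RegFile` and the (thread id, register name) encoding are hypothetical, and the sketch passes only the retrieved byte list to the decomposition function, whereas the formal definition passes the whole program state.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical encodings, not taken from the formalization:
RegAddr = Tuple[int, str]        # (thread id, register name)
RegFile = Dict[RegAddr, bytes]   # register file as a map from address to raw bytes


def make_decomp_func(
    dests: List[Tuple[RegAddr, int, int]],
) -> Callable[[RegFile, bytes], RegFile]:
    """Build a decomposition function from (register, offset, size) triples.

    Each triple states which slice of the coalesced value belongs to which
    destination register.
    """
    def decompose(regs: RegFile, value: bytes) -> RegFile:
        updated = dict(regs)     # leave the input register file untouched
        for reg, offset, size in dests:
            updated[reg] = value[offset:offset + size]
        return updated
    return decompose


# Two 4-byte reads at a1 and a2 = a1 + 4, coalesced into one 8-byte read at a1.
r1, r2 = (0, "r"), (1, "r")
df = make_decomp_func([(r1, 0, 4), (r2, 4, 4)])

b = bytes([10, 11, 12, 13, 20, 21, 22, 23])   # retrieved value list b0..b7
regs = df({}, b)
```

After the call, `regs[r1]` holds the bytes b0 to b3 and `regs[r2]` holds b4 to b7, which reverses the coalescing of the two requests.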

When a thread issues a read request to memory, the destination register of the read is blocked until the read is completed, i.e., until the value is written to the register. For a write request, the source register is blocked until the memory subsystem has picked up the value. A thread cannot be scheduled if its next instruction accesses a blocked register. CUDA does not specify when exactly registers are unblocked; it is only said that registers used in a write request are unblocked “much more quickly” than those used in a read request [9, 6.6]. Therefore, we make it the responsibility of the memory programs to unblock blocked registers at some point during the lifetime of the program. Consequently, the program’s state contains a list of all registers that were blocked when the request was issued. All of these registers are unblocked at the program’s command.
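The blocking rule can be illustrated with a small Python sketch. The class `BlockedRegisters` and its methods are hypothetical names introduced here for illustration; the formalization itself only requires that the program state carries the list of registers to release.

```python
from typing import Iterable, Set, Tuple

RegAddr = Tuple[int, str]   # hypothetical encoding: (thread id, register name)


class BlockedRegisters:
    """Tracks which registers are currently blocked by pending requests."""

    def __init__(self) -> None:
        self.blocked: Set[RegAddr] = set()

    def block(self, regs: Iterable[RegAddr]) -> None:
        # Called when a request is issued.
        self.blocked.update(regs)

    def unblock(self, regs: Iterable[RegAddr]) -> None:
        # Called at the memory program's command.
        self.blocked.difference_update(regs)

    def can_schedule(self, accessed: Iterable[RegAddr]) -> bool:
        # A thread may run only if its next instruction touches no blocked register.
        return self.blocked.isdisjoint(accessed)


tracker = BlockedRegisters()
pending = [(0, "r")]                              # destination register of a read by thread 0
tracker.block(pending)
before = tracker.can_schedule([(0, "r")])         # thread 0 wants to use r
other = tracker.can_schedule([(1, "r")])          # thread 1 is unaffected
tracker.unblock(pending)                          # the memory program releases its list
after = tracker.can_schedule([(0, "r")])
```

Here `pending` plays the role of the register list stored in the program state: the memory program hands exactly this list back when it decides to unblock.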

All programs are identified by a unique memory program index, and their states optionally contain a decomposition function and a list of blocked registers that the program should release. Furthermore, all states store the index of the streaming multiprocessor (SM) of the threads that issued the request. The SM index determines which register file, L1 cache, or shared memory the program manipulates. Additionally, $\Sigma_{\text{atomic}}$ stores the location and the size of the value that the program manipulates. It also comprises the atomic operation to perform as well as the state space, global or shared, that is accessed. We assume that a single memory program performs only one atomic operation. Several atomic operations on distinct addresses could be executed in parallel just as well. The CUDA specification does not give any hints as to what the hardware really does, but as atomic operations on distinct addresses cannot affect one another, this has no effect on program behavior. On the other hand, the PTX specification clearly states that the memory model does not guarantee atomicity of atomic operations on shared memory with respect to other, non-atomic memory operations on the same address [9, Table 105]; neither does the formalization presented in this chapter. Depending on the type of the atomic operation, an atomic memory request has either one or two input values. An atomic operation writes the original value at the accessed location to the destination register.

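$$\begin{aligned}
\sigma_a \in \Sigma_{\text{atomic}} = {} & \mathit{PhysMemAddr} \times \mathit{MemOpSize} \times \mathit{ProcIdx} \times \mathit{StateSpace}_{\text{atomic}} \times \mathit{AtomicOp} \\
& \times \mathit{Value}_{\text{out}} \times \mathit{Value}_{\text{in}} \times \mathit{Value}_{\text{in}} \times \{\mathit{DecompFunc} \times \mathit{RegAddr}\}^{*} \times \mathit{MemOpIdx}
\end{aligned}$$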
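To make the roles of the two input values and the output value concrete, the following Python sketch applies an atomic operation to a byte-addressed memory and returns the original value. It is an illustration, not the semantics defined in this chapter: the function name `apply_atomic`, the string encoding of operations, the little-endian byte order, and the unsigned wrap-around addition are all assumptions made for the example.

```python
from typing import Optional


def apply_atomic(
    memory: bytearray,
    addr: int,
    size: int,
    op: str,
    v1: bytes,
    v2: Optional[bytes] = None,
) -> bytes:
    """Perform one atomic operation and return the original value (Value_out)."""
    old = bytes(memory[addr:addr + size])
    if op == "add":          # one input value
        total = int.from_bytes(old, "little") + int.from_bytes(v1, "little")
        new = (total % (1 << (8 * size))).to_bytes(size, "little")
    elif op == "exch":       # one input value
        new = v1
    elif op == "cas":        # two input values: compare value and swap value
        if v2 is None:
            raise ValueError("cas needs two input values")
        new = v2 if old == v1 else old
    else:
        raise ValueError(f"unsupported atomic operation: {op}")
    memory[addr:addr + size] = new
    return old


def word(n: int) -> bytes:
    return n.to_bytes(4, "little")


mem = bytearray(8)
mem[0:4] = word(5)
old_add = apply_atomic(mem, 0, 4, "add", word(3))            # memory becomes 8
old_cas = apply_atomic(mem, 0, 4, "cas", word(8), word(42))  # 8 matches, so swap in 42
```

The returned bytes correspond to the value that is later written to the destination register; `add` and `exch` use only the first input value, while `cas` uses both.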

The program state for DRAM memory operations is used by programs for both read and write operations. Consequently, the byte list in the program’s state is used either for the input or output value. The state also stores the request’s address and the size.

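$$\begin{aligned}
\sigma_d \in \Sigma_{\text{device}} = {} & \mathit{PhysMemAddr} \times \mathit{MemOpSize} \times \mathit{StateSpace}_{\text{device}} \times \mathit{CacheOp}_{\varepsilon} \times \mathit{ProcIdx} \\
& \times \mathit{Value}_{\text{in/out}} \times \mathit{DecompFunc}_{\varepsilon} \times \mathit{RegAddr}^{*} \times \mathit{MemOpIdx}
\end{aligned}$$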

As memory operations on shared memory can concurrently read or write several different locations, the state of a shared memory program contains several addresses and access sizes. Again, we assume that any bank conflicts are resolved before memory programs and their states are created: conflicting accesses are split into several conflict-free ones. Just like the state of DRAM memory operations, the shared memory operation state uses the byte lists for both input and output purposes.

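$$\begin{aligned}
\sigma_s \in \Sigma_{\text{shared}} = {} & \mathit{ProcIdx} \times \mathit{PhysMemAddr}^{*} \times \mathit{MemOpSize}^{*} \times \mathit{Value}_{\text{in/out}}^{*} \times \mathit{DecompFunc} \\
& \times \mathit{RegAddr}^{*} \times \mathit{MemOpIdx}
\end{aligned}$$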
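The splitting of conflicting shared memory accesses is left abstract in this chapter. Purely as an illustration of what such a step could look like, the following Python sketch greedily groups addresses into conflict-free sets. The bank layout (32 banks of 4-byte words), the treatment of accesses to the same word as conflict-free, and the function name `split_conflict_free` are assumptions made for this example and are not part of the formalization.

```python
from typing import Dict, List

NUM_BANKS = 32    # assumed number of shared memory banks
BANK_WIDTH = 4    # assumed bank width in bytes


def split_conflict_free(addrs: List[int]) -> List[List[int]]:
    """Greedily split byte addresses into groups without bank conflicts.

    Two accesses conflict if they hit the same bank but different words.
    Accesses to the same word are assumed not to conflict.
    """
    used: List[Dict[int, int]] = []   # per group: bank -> word occupying it
    groups: List[List[int]] = []
    for addr in addrs:
        word = addr // BANK_WIDTH
        bank = word % NUM_BANKS
        for banks, members in zip(used, groups):
            if banks.get(bank, word) == word:   # bank free or same word
                banks[bank] = word
                members.append(addr)
                break
        else:                                   # conflicts with every group
            used.append({bank: word})
            groups.append([addr])
    return groups


groups = split_conflict_free([0, 4, 128, 8])
```

Under the assumed layout, addresses 0 and 128 map to the same bank but to different words, so they end up in different groups. Each resulting group would then give rise to one shared memory program state with its own address and size lists.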

Some memory operations only update registers but do not touch other types of memory. This is a special case: because the decomposition function can update the registers, the program’s state does not have to store the register addresses or the new values. Memory programs with a $\Sigma_{\text{reg}}$ program state are issued when threads perform operations that write to registers. Basically, this includes all arithmetic, logic, and shift instructions supported by PTX. The decomposition function inherently supports coalescing register updates of several threads into one memory operation, assuming the hardware supports that.

$$\sigma_r \in \Sigma_{\text{reg}} = \mathit{ProcIdx} \times \mathit{DecompFunc} \times \mathit{RegAddr}^{*} \times \mathit{MemOpIdx}$$
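As a rough picture of how such product domains might be represented in an executable model, the following Python sketch encodes two of the four sub-domains as dataclasses and shows how a completed program applies its decomposition function and reports the registers to release. All class, field, and function names are hypothetical; the atomic and shared states are omitted for brevity, and optional components (subscript $\varepsilon$) are modelled with `Optional`.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple, Union

RegAddr = Tuple[int, str]        # hypothetical: (thread id, register name)
RegFile = Dict[RegAddr, bytes]
DecompFunc = Callable[[RegFile, "State"], RegFile]


@dataclass
class DeviceState:
    addr: int                    # PhysMemAddr
    size: int                    # MemOpSize
    state_space: str             # StateSpace_device
    cache_op: Optional[str]      # CacheOp, optional
    proc_idx: int                # ProcIdx (SM index)
    value: bytes                 # Value_in/out
    decomp: Optional[DecompFunc]
    blocked: List[RegAddr]       # RegAddr*
    op_idx: int                  # MemOpIdx


@dataclass
class RegState:
    proc_idx: int
    decomp: DecompFunc
    blocked: List[RegAddr]
    op_idx: int


State = Union[DeviceState, RegState]


def complete(state: State, regs: RegFile) -> Tuple[RegFile, List[RegAddr]]:
    """Apply the decomposition function, if any, and list registers to unblock."""
    new_regs = state.decomp(regs, state) if state.decomp is not None else regs
    return new_regs, list(state.blocked)


def write_sum(regs: RegFile, state: State) -> RegFile:
    # Stands in for the result of an arithmetic instruction of thread 0.
    out = dict(regs)
    out[(0, "r3")] = (7).to_bytes(4, "little")
    return out


sigma_r = RegState(proc_idx=0, decomp=write_sum, blocked=[(0, "r3")], op_idx=17)
regs, released = complete(sigma_r, {})
```

The register state carries neither addresses nor values: the decomposition function `write_sum` alone determines which register is updated and with what, which mirrors the special role of $\Sigma_{\text{reg}}$ described above.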

In document The Model of Computation of CUDA and its Formal Semantics (Page 48-50)