Device Semantics - The Model of Computation of CUDA and its Formal Semantics

The device represents the topmost level of the semantics hierarchy. Even though CUDA supports SLI configurations where two or more GPUs can be used by a CUDA application, our formalization focuses on the semantics of a single device only. We also do not take GPU/CPU interactions into account, as this would add yet another level of complexity to the semantics.

The domain of devices manages the list of contexts created on the device. Additionally, the device communicates with the host program using device input/output actions as defined in section 5.8.2. The memory manager establishes the connection between the program semantics and the memory environment. We take a closer look at this connection in the following section.

∆ ∈ Device = MemMgr × DeviceIO × Context∗

5.8.1 Linking the Program Semantics to the Memory Environment

Memory operations are issued at the thread level. Higher levels of the hierarchy group thread memory actions and amend additional data. At the device level, memory operations are stored in context actions. The memory environment on the other hand does not know about threads or context actions at all. It operates on memory operations which in turn are defined by a memory program, an implicit program state, and a memory operation priority. Consequently, we need to define a mapping from context actions to memory operations to connect the program semantics to the memory environment.

First, we define the domain of memory managers. A memory manager consists of a memory environment that represents the state of all types of memory supported by our formalization of CUDA’s memory model. Additionally, it manages a list of pending memory operations as well as a function that maps a thread index to memory operation indices. The latter allows us to check whether a thread has completed a memory barrier by inspecting the state of the pending memory requests issued by the thread.

mm ∈ MemMgr= MemEnv × MemOp∗× (ThreadIdx → (MemOpIdx∗)⊥)

The function memOp takes a memory manager and a list of context actions and returns an updated memory manager. This function is responsible for converting memory requests issued by some threads into memory operations understood by the memory environment. We leave the function abstract, as most of the conversion process is not sufficiently specified by the CUDA documentation.

memOp : MemMgr × ContextAction∗→_MemMgr_⊥

However, there are several assumptions we have to make about the conversion process. First of all, registers are always big enough to store a complete data word; memory operations, on the other hand, might specify smaller access sizes. This is not a problem for read operations as a smaller value can always be written into a larger register. Though for write operations,

some of the register’s bits are cut off if the access size is smaller than the size of the data

word. But this does not pose a problem because of the preprocessing we perform to convert a PTX program into a program environment: During the preprocessing, the compiler checks

whether the program is type safe, i.e. whether all memory operations operate on input operands of matching sizes. For instance, a 64 bit register cannot be the source register of a 32 bit write operation. Type safety guarantees that the source register of a store operation is of equal or smaller size than the memory access size. Consequently, cutting off the highest bits of a register does not result in a loss of data, as these bits are guaranteed to be zero in any case.

The memOp function inspects the given context actions. Empty context actions as well as thread block actions with empty warp memory actions are ignored. If the context actions indeed contain valid memory requests, the context actions contain all necessary information for the memOp function to select the correct memory program for the given memory operation type and to initialize the implicit memory program state accordingly. Memory operation coalescing is also performed by memOp. Furthermore, all registers affected by the operations are blocked and will only become available again once the associated memory program releases them.

memOp also assigns priorities to all generated memory operations. As explained in chapter 4, in most cases we do not know the order in which memory operations are processed. We do know, however, that volatile reads and writes of the same thread are guaranteed to be processed in the order they were issued, hence we assume that memOp assigns lower priorities to volatile operations that are issued later during the execution of the program. Non-volatile operations might all be prioritized equally or there might be some unknown prioritization. Additionally, if threads of the same warp write to the same location in memory, it is undefined how many writes are actually processed and which write is processed last. The same is true for several atomic operations on the same address performed by threads of the same warp. If there are n concurrent accesses to the same address, the memOp function is free to generate between one to n memory operations with any priority.

In our formalization of the PTX semantics, if a thread executes a membar instruction, the requested memory barrier level is stored in the thread’s state. As long as the memory barrier

level is not reset to ε, the warp scheduler is unable to schedule the thread’s warp. The

CUDA specification does not explain in detail when a memory barrier completes. Therefore, we define the abstract function updMemBar that checks the state of all memory operations associated with a thread waiting for a memory barrier. If all of those operations have advanced as far as necessary, the updMemBar function resets the thread’s memory barrier

to ε, allowing it to be scheduled again. updMemBar uses the information stored by the

memory manager to retrieve all memory operations associated with a thread. When the memOp function creates a new memory operation, the memory managers mapping of thread indices to memory operation indices is updated accordingly. To avoid having to clean up this mapping once a memory operation completes, we assume that the memOp function never reuses a memory operation index that it has previously used to identify a memory operation, even after this memory operation has long been completed.

updMemBar : Context × MemMgr → Context

5.8.2 Device Input/Output

We introduce device input and output messages similar to the context input/output mech-

anism defined in section 5.7.2. The device has to patch through context messages to the context level. Therefore, the device’s input actions comprise messages that contain a context

input action and the index of the context the messages should be forwarded to. Additionally, the host program can instruct the device to create or destroy a context as well as to load a program into a specific context. The domain Prog represents PTX programs after the application of the code transformations presented in section 5.1.3.

din ∈ DeviceInput ::= load Prog ContextIdx | create | destroy ContextIdx

| _{ContextInput ContextIdx}

Conversely, messages output by the device patch through message emitted by the contexts. Furthermore, messages containing the indices of newly created contexts and recently loaded programs are sent to the host program.

dout ∈ DeviceOuput ::= created ContextIdx | ContextOutput ContextIdx

| loadedProgIdx ContextIdx

Analogous to the domain of context I/O actions, we define the domain of device I/O actions that comprises lists of input and output actions. Like for contexts, we do not enforce any ordering of incoming and outgoing messages.

dio ∈ DeviceIO= DeviceInput∗×_DeviceOutput∗

Transition system →I/OD handles the I/O actions of a device. A transition can either

generate some output based on some input or silently handle some input without generating any output.

Configurations : dio ∈ DeviceIO Transitions : dio−dout−−→

din

I/OD _dio0, _{dio −−→}

din

I/OD _dio0

Once a program has been loaded, the corresponding input action is removed from the device’s list of input actions. Moreover, the index of the newly loaded program is returned to the host program later on.

(load) h(load prog cid) ◦−→din,−douti−−→ −loaded−−−−−−−−−−−progid cid→

loadprog cid

I/OD h−→_din, (loaded progid cid) ◦−_douti−−→

In a similar fashion, the index of a newly created context is returned to the host program while the corresponding input action is removed.

(create) h(create) ◦−→din,−douti−−→ −−−−−−−−→createdcid

create

I/OD h−→din, (created cid) ◦−_douti−−→

Destroying a context does not generate any output and removes the corresponding input action from the list.

(destroy) h(destroy cid) ◦−→din,−douti −−−−−−−−→−−→

destroycid

5.8.3 Device Rules

Transition system →D defines the semantics of the device. It does not generate any device

actions, because it represents the highest level of the semantics hierarchy. It can, however, receive input messages from the host program or send output messages to the CPU.

Configurations :∆ ∈ Device

Transitions :∆ −→D∆0, ∆ −−−→

dout D _∆0

, ∆−din−→D ∆0

We also finally define how the register look up functions retrieve register values from the memory environment. Given a memory environment, the regLookup function returns a grid register look up function.

regLookup : MemEnv → GridRegLookup

We define the behavior of the register look up function as follows. The function parameters are specified at different levels of the hierarchy as explained in the preceding sections. Most importantly, the register look up function does not allow retrieving a value from a blocked register.

regLookup(η)(gridSize, blockSize)(bid, pid)(tid, ro)

(regAddr)= d if (d, ready) = regFile(η, pid)(ro + regAddr)

(%tid.dim)= tiddim

(%ntid.dim)= blockSizedim

(%ctaid.dim)= biddim

(%nctaid.dim)= gridSize_dim

The (exec) rule selects a context to execute; contexts cannot be executed concurrently and the contexts scheduling happens implicitly. The regLookup function is used to create a register look up function that is passed all the way down to the thread rules and expression evaluation functions to synchronously retrieve a value from an unblocked register. The updMemBar function checks whether any of the context’s threads have completed a memory barrier. All memory requests issued by the context’s threads are converted into memory operations using the memOp function. While the selected context executes one step, the memory environment executes a macro step, allowing several memory operations to advance and several silent rules to be invoked. Program execution and memory operation processing are performed in a truly parallel manner.

(exec) hupdMemBar(ζ_i, mm) | regLookup(η)i −−→ −−→ cact C _ζ0 i hη, − → opi →M,∗hη0,−_op→0i h∆ . (mm . η,→−op), ζ₁. . . ζ_i. . . ζ_ni →D h∆ / memOp((mm / η0,−_op→0_),−_{cact), ζ}−→

1. . . ζ0_i. . . ζni

Input messages are inserted into the device’s input list and are serviced at a later execution step. There is no guaranteed ordering of input messages.

Conversely, output messages created during an earlier execution step are sent to the host program in an unpredictable order.

(out) h∆ . (dio . dout ◦−dout)i −−−→−−→

dout

D_h_{∆ / (dio /}−_dout)i−−→

Input messages addressed to a specific context are forwarded to that context using the (cin) rule.

(cin)

hζ_i |regLookup(η)i−cin−→Cζ0

h∆ . (mm . η), (dio . (cin cid) ◦−→din), ζ₁. . . ζ_i. . . ζ_ni →Dh∆ / (dio /−→din), ζ₁. . . ζ0

i. . . ζni

if ζ_i,cid = cid Output actions emitted by a context are added to the device’s output list and will be sent to the host program at a later point in time.

(cout)

hζ_i. cid | regLookup(η)i −−−→

cout C _ζ0

h∆ . (mm . η), (din .−dout), ζ−−→ ₁. . . ζ_i. . . ζ_ni →Dh∆ / (dio / (cout cid) ◦−dout), ζ−−→ ₁. . . ζ0

i. . . ζni

The abstract function newVM : MemEnv × Context∗ →_{VirtMemSpace creates a new virtual}

memory space for a newly created context. Different contexts to do not share physical memory addresses, hence newVM does not map any virtual addresses to a physical address that is already mapped by another context. Other than that, the newly created context returned by newCtx : VirtMemSpace × ContextIdx → Context is empty, i.e. it does not yet have any grids, programs, I/O messages, or a program map. The index assigned to the new context must be unique.

(create)

dio−−−−−−−−→createdcid

create

I/OD _dio0 h∆ / dio0, ζ₁. . . ζ_nnewCtx(vms, cid)i →D∆0 h∆ . (mm . η), dio, ζ₁. . . ζ_ni →D _∆0

if vms= newVM(η, ζ1. . . ζn) ∧ ∀i ∈ {1. . . n} . ζi,cid , cid A context can be destroyed at any time, implicitly aborting all of the context’s grids. As already discussed in section 5.7.3 for individual grid abortion, pending memory operations issued by the aborted grids are not canceled.

(destroy)

dio −−−−−−−−→

destroycid

I/OD _dio0 h∆ / dio0, ζ₁. . . ζ_i−1 ζ_i₊₁. . . ζ_ni →D ∆0

h∆ . dio, ζ₁. . . ζ_i−1 ζ_i ζ_i₊₁. . . ζ_ni −→D _∆0 if ζi,cid= cid

Loading a program into a context relies on the load function already defined in section 5.1.3. The newly loaded program is added to the context’s program list and the program is assigned a context-wide unique program index. The load function returns a memory environment that guarantees that all threads read the correct values of all global and constant variables for which the PTX program specifies initial values.

(load)

dio−−−−−−−−−−−−→loadedprogidζcid

loadprogζcid

I/OD _dio0 h∆ / (mm / η0

), dio0, (ζ / φ ◦ ζ_~φ) ◦ ~ζi →D_∆0

h∆ . (mm . η), dio, ζ ◦ ~ζi →D ∆0

In document The Model of Computation of CUDA and its Formal Semantics (Page 117-122)