Compression algorithm in detail - Compressed GPU hash table (recursive)


6. Compressed GPU hash table (recursive)

6.1.2 Compression algorithm in detail

See Algorithm 6.1 for a simplified version of our recursive C function treeRec(), which constructs the compression tree.

The type for references to tree nodes in the hash table is indextype, and the type for vector elements and hash table entries is inttype; in our implementation, both are aliases for an unsigned 32-bit integer (uint32_t). As a node consists of two 32-bit references or vector elements (or a mixture of both), it is 64 bits wide and occupies two 32-bit hash table entries; the MSB of the first entry is used as the root bit.
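The node layout described above can be illustrated with a small host-side C sketch. The helper names pack_node(), unpack_node() and is_root() are hypothetical, and placing the first entry in the upper half of the 64-bit word is an assumption of this sketch; the value of ROOT_BIT_32 follows from the root bit being the MSB of a 32-bit entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t inttype;    /* vector elements and hash table entries */
typedef uint32_t indextype;  /* references to tree nodes */

/* MSB of a 32-bit entry, used as root bit in the first entry of a node. */
#define ROOT_BIT_32 0x80000000u

/* Hypothetical helper: view a node (two 32-bit entries) as one 64-bit word.
 * Putting the first entry in the upper half is an assumption of this sketch. */
uint64_t pack_node(uint32_t first, uint32_t second) {
    return ((uint64_t)first << 32) | (uint64_t)second;
}

/* Hypothetical helper: split a 64-bit node back into its two entries. */
void unpack_node(uint64_t node, uint32_t out[2]) {
    assert(out != NULL);
    out[0] = (uint32_t)(node >> 32);  /* first entry (carries the root bit) */
    out[1] = (uint32_t)node;          /* second entry */
}

/* A node is a root node when the MSB of its first entry is set. */
bool is_root(uint64_t node) {
    uint32_t entries[2];
    unpack_node(node, entries);
    return (entries[0] & ROOT_BIT_32) != 0;
}
```

Treating the node as a single 64-bit word is what later makes it possible to claim a whole slot with one 64-bit atomic operation.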

To construct the tree, we start with the complete vector and recursively call treeRec() to construct the subtrees for the “left” and “right” parts of the vector; this is repeated until the base case is reached, i.e., a subvector of length 1. The base case just returns the subvector, i.e., its sole element; every other call returns a reference to the root node of the subtree it has just constructed, i.e., the index of that root node in the hash table. The calling invocation then creates a node consisting of the two returned values (references or sole vector elements), inserts the node into the hash table by calling findOrPut(), and returns a reference to it, i.e., the index of that node in the hash table.

    indextype treeRec(inttype *vector, int vectorLength) {
        indextype result;

        if (vectorLength == 1)
            result = *vector;
        else { // length > 1
            int split = (vectorLength / 2);
            indextype node[] = { treeRec(vector, split),
                                 treeRec((vector + split), // pointer arithmetic
                                         (vectorLength - split)) };
            result = findOrPut(node);
        }
        return result;
    }

Algorithm 6.1: treeRec(), the function that creates the compression tree (simplified)

Note that the created node may already exist in the hash table; in that case, the findOrPut() function returns a reference to the existing node.
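To see the node sharing at work, the recursion can be tried out on the host with a toy, sequential stand-in for the hash table. This is only a sketch under stated assumptions: the table size, the hash function, the use of linear probing, the `used` flag and the nodesStored counter are all hypothetical, and root bits, rehashing and full-table handling are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t inttype;
typedef uint32_t indextype;

#define TABLE_SLOTS 64u  /* assumed toy capacity */

/* Toy slot: one 64-bit node stored as two 32-bit entries plus a used flag. */
typedef struct { indextype entry[2]; bool used; } Slot;

Slot table[TABLE_SLOTS];   /* zero-initialised, so all slots start unused */
unsigned nodesStored = 0;  /* counts distinct nodes, for illustration only */

/* Sequential stand-in for findOrPut(): returns the slot index of the node,
 * inserting it first if it is not present yet (linear probing, assumed). */
indextype findOrPut(const indextype node[2]) {
    uint32_t h = node[0] * 31u + node[1];  /* assumed toy hash function */
    for (uint32_t probe = 0; probe < TABLE_SLOTS; probe++) {
        uint32_t i = (h + probe) % TABLE_SLOTS;
        if (!table[i].used) {              /* empty slot: insert here */
            table[i].entry[0] = node[0];
            table[i].entry[1] = node[1];
            table[i].used = true;
            nodesStored++;
            return i;
        }
        if (table[i].entry[0] == node[0] && table[i].entry[1] == node[1])
            return i;                      /* node already present: share it */
    }
    assert(!"toy table is full");          /* not handled in this sketch */
    return 0;
}

/* Same recursion as Algorithm 6.1, on top of the toy table. */
indextype treeRec(const inttype *vector, int vectorLength) {
    assert(vector != NULL && vectorLength >= 1);
    if (vectorLength == 1)
        return *vector;                    /* base case: the sole element */
    int split = vectorLength / 2;
    indextype node[2];
    node[0] = treeRec(vector, split);                         /* left part  */
    node[1] = treeRec(vector + split, vectorLength - split);  /* right part */
    return findOrPut(node);
}
```

For the vector {1, 2, 1, 2}, both halves produce the same node (1, 2), so only two nodes are stored: the shared node and the root node that references it twice. Inserting the same vector again stores nothing new and yields the same root reference.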

Wrapper function: treeFindOrPut()

The initial treeRec() call returns a reference to the root node of the compression tree for the inserted vector. Our top-level treeFindOrPut() wrapper function then uses this reference to determine whether that node was already a root node, which indicates that the vector was already present in the hash table. It does so by atomically testing and setting the root bit (bitmask ROOT_BIT_32):

    FoundOrPut treeFindOrPut(inttype *vector) {
        FoundOrPut result;

        indextype rootIndex = treeRec(vector, d_vectorLength);
        result = ((atomicOr((d_table + rootIndex), ROOT_BIT_32) & ROOT_BIT_32) ?
                  SEEN : NEW);
        return result;
    }

Algorithm 6.2: treeFindOrPut(), top-level function for inserting vectors (simplified)

In Algorithm 6.2, FoundOrPut is the enum that indicates the result of the find-or-put operation: a new vector (NEW), an existing vector (SEEN) or, in the actual implementation, a full table (the table is considered full when no hash functions are left for rehashing). The constants d_vectorLength and d_table refer to the length of each input vector and (a pointer to) the start address of the hash table, respectively. The actual implementation also handles single-element vectors: it bypasses the treeRec() function and inserts the vector, padded with an empty (32-bit) element to form a 64-bit node, directly into the hash table.
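The atomic test-and-set of the root bit can be mimicked on the host with C11 atomics, where atomic_fetch_or() returns the previous value just as CUDA's atomicOr() does. The following is a minimal sketch, not the thesis implementation: the function name markRoot() is hypothetical, and the full-table result of FoundOrPut is omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define ROOT_BIT_32 0x80000000u  /* MSB of the first entry of a node */

typedef enum { NEW, SEEN } FoundOrPut;  /* full-table case omitted here */

/* Hypothetical helper: set the root bit of a node's first entry and report
 * whether it was already set. The OR and the read of the old value happen
 * in one atomic step, so exactly one of several concurrent callers sees NEW. */
FoundOrPut markRoot(_Atomic uint32_t *firstEntry) {
    assert(firstEntry != NULL);
    uint32_t old = atomic_fetch_or(firstEntry, ROOT_BIT_32);
    return (old & ROOT_BIT_32) ? SEEN : NEW;
}
```

Because the decision is based on the value returned by the atomic operation rather than on a separate read, two invocations that insert the same vector at the same time cannot both report it as new.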

Adaptations to findOrPut()

As findOrPut() now only operates on 64-bit (tree) nodes, i.e., a combination of two 32-bit hash table entries, each node group (previously: vector group) now consists of two CUDA threads: one thread checks the first entry of a slot in the table and the other thread checks the second (last) entry. If the slot is empty, the node group leader (the first thread in a node group) tries to claim the full slot with a 64-bit atomicCAS() operation; the other thread synchronises on the result. This result now also covers the case in which a concurrent invocation has inserted the same node; the hash table reference to that node is then returned, instead of inserting the same node again (in a different slot/bucket), as the uncompressed implementation from GPUexplore 1.0 (Subsection 5.1.2) does. We have also implemented an improvement (Subsection 6.3.2) that reduces each node group to a single thread, which then checks a full 64-bit slot instead of only one 32-bit entry of it.
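The slot-claiming step, including the case of a concurrent insertion of the same node, can be sketched on the host with a C11 compare-and-swap. The names tryClaim(), ClaimResult and EMPTY_SLOT, as well as the encoding of an empty slot as 0, are assumptions made for illustration; the root bit and the synchronisation between the two threads of a node group are left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define EMPTY_SLOT 0ull  /* assumed encoding of an empty 64-bit slot */

/* Hypothetical result type for a single claim attempt. */
typedef enum { CLAIMED, ALREADY_PRESENT, OCCUPIED } ClaimResult;

/* Try to claim an empty slot for `node` with one 64-bit compare-and-swap. */
ClaimResult tryClaim(_Atomic uint64_t *slot, uint64_t node) {
    assert(slot != NULL && node != EMPTY_SLOT);
    uint64_t expected = EMPTY_SLOT;
    if (atomic_compare_exchange_strong(slot, &expected, node))
        return CLAIMED;  /* the slot was empty and now holds our node */
    /* The CAS failed, so `expected` now holds the current slot contents. */
    if (expected == node)
        return ALREADY_PRESENT;  /* someone else stored the same node */
    return OCCUPIED;             /* a different node lives here: move on */
}
```

In the ALREADY_PRESENT case a caller would return the reference to this slot rather than look for another one, which is how duplicate copies of the same node are avoided.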

The findOrPut() function now returns a reference to the inserted node, instead of returning whether the inserted vector was new or not; determining the latter is now the responsibility of the treeFindOrPut() wrapper function. The adapted findOrPut() also handles a full table by returning a reserved value in that case; the actual implementation of treeRec() propagates this value to its caller(s) and eventually returns it to the calling treeFindOrPut() invocation.

Sequential and parallel operation

Our implementation of tree-based compression supports the parallel processing of multiple vectors within a warp, i.e., by multiple bucket groups. The number of bucket groups follows directly from the configured bucket size: with a bucket size of 8, each warp has four bucket groups, each consisting of eight threads. In this case, each bucket group has four node groups, consisting of two threads each; node groups are only relevant in findOrPut().

Bucket groups are also used in the treeRec() and treeFindOrPut() functions. Hence, when a warp executes a treeRec() or treeFindOrPut() operation, each thread in the warp only operates on the vector of its bucket group. All threads in a bucket group do exactly the same, except inside the findOrPut() function, in which the threads of a bucket group are separated into node groups: each node group operates on its own slot (of the same bucket) in the hash table, and each thread in a node group operates on its own entry in that slot.
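The grouping of the threads of a warp can be made concrete with a little index arithmetic. The sketch below assumes that consecutive lanes form a bucket group and that consecutive lane pairs within it form a node group; this mapping, the LaneRole struct and the laneRole() function are hypothetical and only serve to illustrate the numbers given above.

```c
#include <assert.h>

#define WARP_SIZE 32
#define NODE_GROUP_SIZE 2  /* two threads per node group */

/* Hypothetical description of what one lane (thread in a warp) works on. */
typedef struct {
    int bucketGroup;  /* which vector of the warp this lane works on */
    int nodeGroup;    /* which slot of the bucket (findOrPut() only) */
    int entry;        /* which 32-bit entry of that slot: 0 or 1 */
    int isLeader;     /* 1 for the first thread of a node group */
} LaneRole;

/* Assumed mapping: consecutive lanes form bucket groups and node groups. */
LaneRole laneRole(int lane, int bucketSize) {
    assert(lane >= 0 && lane < WARP_SIZE);
    assert(bucketSize >= NODE_GROUP_SIZE);
    assert(WARP_SIZE % bucketSize == 0 && bucketSize % NODE_GROUP_SIZE == 0);
    LaneRole r;
    r.bucketGroup = lane / bucketSize;
    r.nodeGroup = (lane % bucketSize) / NODE_GROUP_SIZE;
    r.entry = lane % NODE_GROUP_SIZE;
    r.isLeader = (r.entry == 0);
    return r;
}
```

Under this assumed mapping and a bucket size of 8, lane 13 belongs to bucket group 1 and node group 2, and handles the second entry of its slot, so it is not a node group leader.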

Within each bucket group (vector), however, the compression tree is constructed sequentially: its nodes are inserted into the hash table in the order of the recursive calls (depth-first search). Chapter 7 introduces a method to construct the tree in parallel and bottom-up, i.e., starting at the leaf nodes and then working upwards.

In document Porting tree based hash table compression to GPGPU model checking (Page 54-56)