5. Uncompressed GPU hash table

5.1 Stand-alone implementation

5.1.2 Version without corrupted vectors (GPUexplore 1.0)

To eliminate the possibility of corrupted vectors, we reverted to the insertion procedure of GPUexplore 1.0 [5]: the algorithm tries to claim only the first entry of a bucket slot. When it succeeds, it fills the other entries. Otherwise, it immediately proceeds to the next bucket slot, without waiting for the current slot to be filled (which would be needed to check it for a duplicate vector).

Replication still occurs, but now also for vectors that consist of four or fewer elements. As the apparent scheduling of atomic memory operations per four threads is not documented anywhere and could change at any moment, we stick to 32-bit atomic operations only. This means that only bucket slots of exactly one 32-bit entry (one vector element) could now be fully claimed by a vector group (of one thread); other vector lengths would require waiting until a concurrent invocation has finished writing the full bucket slot. For simplicity, our implementation only checks whether the return value of the atomicCAS() operation is the empty entry value (i.e., whether the claim succeeded). Therefore, input with vector length 1 may also get replicated.

As we no longer assume that atomic memory requests are scheduled in half-warps, slots can again cross half-warp and half-bucket boundaries. Although replication (false negatives) is still possible, corruption of vectors (false positives) is no longer possible. False negatives hurt performance but, unlike false positives, do not affect correctness.
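The insertion procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: one thread handles a whole vector (instead of a vector group), probing is linear, and the hash is a placeholder. The names `EMPTY`, `VEC_LEN`, `NUM_SLOTS`, `find_or_put`, `insert_kernel` and `run_demo` are hypothetical.

```cuda
#include <cassert>
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

// All names and sizes are hypothetical, chosen for illustration only.
constexpr uint32_t EMPTY     = 0xFFFFFFFFu; // reserved value; vector elements never use it
constexpr int      VEC_LEN   = 4;           // entries per bucket slot (one vector)
constexpr int      NUM_SLOTS = 1024;        // bucket slots in the table

enum Result { FOUND = 0, INSERTED = 1, FULL = 2 };

// Simplification: one thread handles a whole vector and probes linearly.
__device__ Result find_or_put(uint32_t* table, const uint32_t* vec)
{
    uint32_t start = (vec[0] * 2654435761u) % NUM_SLOTS; // placeholder hash
    for (int probe = 0; probe < NUM_SLOTS; ++probe) {
        uint32_t* slot = table + ((start + probe) % NUM_SLOTS) * VEC_LEN;

        // Duplicate check with plain (non-atomic, non-volatile) reads.
        bool same = true;
        for (int i = 0; i < VEC_LEN; ++i)
            same = same && (slot[i] == vec[i]);
        if (same) return FOUND;

        // Claim only the first entry, using a 32-bit atomic operation.
        uint32_t old = atomicCAS(&slot[0], EMPTY, vec[0]);
        if (old == EMPTY) {
            // Claim succeeded: fill the remaining entries non-atomically.
            for (int i = 1; i < VEC_LEN; ++i)
                slot[i] = vec[i];
            return INSERTED;
        }
        // Claim failed: only "old == EMPTY" is checked, so the thread moves on
        // without waiting to see whether the slot will hold the same vector.
        // This is where replication (a false negative) can arise.
    }
    return FULL;
}

// Thread tid inserts a distinct vector whose elements never equal EMPTY.
__global__ void insert_kernel(uint32_t* table, int n, int* result)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    uint32_t vec[VEC_LEN];
    for (int i = 0; i < VEC_LEN; ++i)
        vec[i] = static_cast<uint32_t>(tid) * VEC_LEN + i;
    result[tid] = find_or_put(table, vec);
}

struct DemoResult { int inserted_first; int found_second; int claimed; };

// Host helper: inserts n distinct vectors in two separate kernel launches.
DemoResult run_demo(int n)
{
    assert(n > 0 && n <= NUM_SLOTS);
    const size_t table_bytes = sizeof(uint32_t) * NUM_SLOTS * VEC_LEN;
    uint32_t* table = nullptr;
    int* result = nullptr;
    cudaMalloc(&table, table_bytes);
    cudaMalloc(&result, sizeof(int) * n);
    cudaMemset(table, 0xFF, table_bytes); // every byte 0xFF, so every entry is EMPTY

    std::vector<int> h_result(n);
    DemoResult r{0, 0, 0};
    for (int round = 0; round < 2; ++round) {
        insert_kernel<<<(n + 127) / 128, 128>>>(table, n, result);
        cudaMemcpy(h_result.data(), result, sizeof(int) * n, cudaMemcpyDeviceToHost);
        for (int x : h_result) {
            if (round == 0 && x == INSERTED) ++r.inserted_first;
            if (round == 1 && x == FOUND)    ++r.found_second;
        }
    }
    std::vector<uint32_t> h_table(NUM_SLOTS * VEC_LEN);
    cudaMemcpy(h_table.data(), table, table_bytes, cudaMemcpyDeviceToHost);
    for (int s = 0; s < NUM_SLOTS; ++s)
        if (h_table[s * VEC_LEN] != EMPTY) ++r.claimed;
    cudaFree(table);
    cudaFree(result);
    return r;
}
```

The demo deliberately inserts distinct vectors and repeats the insertion in a second launch, so its outcome does not depend on thread scheduling. Replication would only show up when the same vector is inserted concurrently within one launch, in which case the number of claimed slots may exceed the number of distinct vectors.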

Reading non-volatile

When hash table entries are read to compare them with the elements of the vector that is being inserted (Algorithm 3.3 (page 23), line 11), the read is non-volatile, i.e., a stale (old) value may be read, e.g., when it has been cached (remember that cache coherence protocols do not exist in GPUs yet).

This does not lead to false positives, as the only stale value that can be read is the empty entry value, which is a restricted value that vector elements are not allowed to have; when one or more entries in a bucket slot have the empty entry value, the algorithm always concludes that the bucket slot does not contain a vector, including the vector that is being inserted.

It may, however, lead to (more) false negatives, as the algorithm may falsely conclude that a bucket slot does not contain the vector that is being inserted. The subsequent atomicCAS() operation (line 19) operates on the actual value, but as its return value is only checked for a successful claim (by comparing it to the empty entry value), the algorithm simply assumes that the bucket slot was taken by another vector and proceeds with inserting the vector into another bucket slot.

We experimented with a version that reads hash table entries in a volatile way (line 11). We could not, however, measure any difference in the amount of replication, regardless of whether L1 loading was enabled for all global reads. Runtimes were worse, especially with CC 6.1 as target architecture. Therefore, we decided to keep reading in a non-volatile way.
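The two read variants compared here can be sketched as follows. This is an illustrative sketch only: the function names, `VEC_LEN` and the host helper `compare_on_device` are hypothetical, and the thesis code may express the volatile read differently. Reading through a `volatile`-qualified pointer makes the compiler issue an actual memory load for every access instead of reusing a previously loaded value.

```cuda
#include <cassert>
#include <cstdint>
#include <cuda_runtime.h>

constexpr int VEC_LEN = 4; // hypothetical vector length (entries per bucket slot)

// Non-volatile variant: plain loads, which may be served from a cache
// or from a value the compiler already holds in a register.
__device__ bool matches_plain(const uint32_t* slot, const uint32_t* vec)
{
    for (int i = 0; i < VEC_LEN; ++i)
        if (slot[i] != vec[i]) return false;
    return true;
}

// Volatile variant: every slot entry is re-read from memory on each access.
__device__ bool matches_volatile(const uint32_t* slot, const uint32_t* vec)
{
    const volatile uint32_t* vslot = slot;
    for (int i = 0; i < VEC_LEN; ++i)
        if (vslot[i] != vec[i]) return false;
    return true;
}

__global__ void compare_kernel(const uint32_t* slot, const uint32_t* vec, int* out)
{
    out[0] = matches_plain(slot, vec) ? 1 : 0;
    out[1] = matches_volatile(slot, vec) ? 1 : 0;
}

// Host helper: returns plain_result + 2 * volatile_result for one slot/vector pair.
int compare_on_device(const uint32_t* h_slot, const uint32_t* h_vec)
{
    assert(h_slot != nullptr && h_vec != nullptr);
    const size_t bytes = sizeof(uint32_t) * VEC_LEN;
    uint32_t *d_slot = nullptr, *d_vec = nullptr;
    int* d_out = nullptr;
    int h_out[2] = {0, 0};
    cudaMalloc(&d_slot, bytes);
    cudaMalloc(&d_vec, bytes);
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_slot, h_slot, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_vec, h_vec, bytes, cudaMemcpyHostToDevice);
    compare_kernel<<<1, 1>>>(d_slot, d_vec, d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_slot);
    cudaFree(d_vec);
    cudaFree(d_out);
    return h_out[0] + 2 * h_out[1];
}
```

Without concurrent writers, as in this single-thread demo, both variants necessarily give the same answer. They can only differ while another thread is filling the slot.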

Data races

In the strict sense, the non-atomic hash table reads are racy, as there can be concurrent writes to the same hash table entries, both atomic (first entry of a bucket slot) and non-atomic (the other entries of a bucket slot). We experimented with atomic loads, but as CUDA does not support those natively, we mimicked them using atomic increments by zero, which return the original (and, in this case, also new) values. We measured a slowdown, especially with bucket sizes 8 and 16. Because of this, and because atomic loads (and stores) are apparently not needed in CUDA (and thus not supported), we decided to keep non-atomic reads (and non-atomic writes for the non-first entries of a bucket slot).
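The "atomic increment by zero" trick can be sketched as follows. The helper names `atomic_load_emulated`, `load_kernel` and `loads_agree` are hypothetical; only `atomicAdd()` is a CUDA built-in. Adding zero leaves the entry unchanged, and the returned old value serves as the loaded value.

```cuda
#include <cassert>
#include <cstdint>
#include <cuda_runtime.h>

// Emulated atomic load: atomicAdd() returns the value stored before the
// addition, and adding zero does not modify the entry.
__device__ uint32_t atomic_load_emulated(uint32_t* addr)
{
    return atomicAdd(addr, 0u);
}

__global__ void load_kernel(uint32_t* entry, uint32_t* out)
{
    out[0] = atomic_load_emulated(entry); // emulated atomic read
    out[1] = *entry;                      // plain, non-atomic read
}

// Host helper: checks that both kinds of read return the stored value.
bool loads_agree(uint32_t value)
{
    uint32_t *d_entry = nullptr, *d_out = nullptr;
    uint32_t h_out[2] = {0u, 0u};
    cudaMalloc(&d_entry, sizeof(uint32_t));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_entry, &value, sizeof(uint32_t), cudaMemcpyHostToDevice);
    load_kernel<<<1, 1>>>(d_entry, d_out);
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    (void)err; // silences the unused-variable warning when NDEBUG is defined
    cudaFree(d_entry);
    cudaFree(d_out);
    return h_out[0] == value && h_out[1] == value;
}
```

Note that the emulated load needs a non-const pointer, because `atomicAdd()` is formally a read-modify-write operation even when it adds zero.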

Low-level optimisations

We tried several low-level optimisations, on multiple types of input and on both the default bucket size of 32 entries and bucket size 8. At that time, we were still using an old hash function from Cassee, whose higher register pressure lowered occupancy.

Most optimisations have little or no effect:

● enabling L1 loading for all global reads: no effect with CC 3.0 as target architecture (sm_30), small positive effect with CC 6.1 as target architecture (sm_61)

● using restricted pointers (__restrict__), where appropriate, to indicate that the object the (array) pointer refers to is not aliased by another pointer, possibly enabling some compiler optimisations, e.g., reading from the read-only cache: almost no effect, regardless of L1 loading for all global reads or not

● forced inlining (__forceinline__), especially of the find-or-put function: no effect, regardless of L1 loading or not

● removing const qualifiers (from the input array, i.e., trying to force loading via L2 cache only): no effect, independent of L1 loading or target architecture

● forcing register spilling (maxrregcount), which impacts the optimal execution configuration: large beneficial effect with sm_61, bringing performance close to the original performance of sm_30; slight effect with sm_30

Other “optimisations” have effects, but mostly negative:

● CC 6.1 as target architecture (sm_61): negative, as this leads to a higher register pressure (and, consequently, a different optimal execution configuration)

● a mixture of CC 3.0 and CC 6.1 as target architectures (compute_30,sm_61): slightly slower (same register pressure as sm_30)
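For orientation, the qualifiers and compiler options mentioned in the two lists above could be applied as in the following sketch. The kernel, the `mix` function and the host helper are hypothetical stand-ins, not the thesis code. The nvcc options in the comments are our assumption of how each item maps to a command line, and the register limit of 32 is an arbitrary example value.

```cuda
#include <cassert>
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

// Possible build lines (assumed mapping, example values only):
//   nvcc -arch=sm_61 demo.cu                           CC 6.1 as target architecture
//   nvcc -gencode arch=compute_30,code=sm_61 demo.cu   mixture of CC 3.0 and CC 6.1
//                                                      (needs a toolkit that still supports compute_30)
//   nvcc -arch=sm_61 -Xptxas -dlcm=ca demo.cu          L1 loading for all global reads
//   nvcc -arch=sm_61 -maxrregcount=32 demo.cu          cap registers per thread, which may force spilling

// __forceinline__ asks the compiler to always inline this function.
__host__ __device__ __forceinline__ uint32_t mix(uint32_t x)
{
    return x * 2654435761u; // placeholder hash step
}

// const plus __restrict__ tells the compiler that the input is read-only and
// not aliased, which may allow loads through the read-only cache.
__global__ void hash_inputs(const uint32_t* __restrict__ input,
                            uint32_t* __restrict__ output, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        output[tid] = mix(input[tid]);
}

// Host helper: runs the kernel and compares its output with the host result.
bool device_matches_host(int n)
{
    assert(n > 0);
    std::vector<uint32_t> h_in(n), h_out(n, 0u);
    for (int i = 0; i < n; ++i)
        h_in[i] = 3u * i + 1u;

    const size_t bytes = sizeof(uint32_t) * n;
    uint32_t *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
    hash_inputs<<<(n + 127) / 128, 128>>>(d_in, d_out, n);
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    for (int i = 0; i < n; ++i)
        if (h_out[i] != mix(h_in[i])) return false;
    return true;
}
```

None of these settings changes the computed result. They only influence code generation and caching, which is why their effect has to be measured per target architecture, as reported above.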

In document Porting tree based hash table compression to GPGPU model checking (Page 36-38)