5. Uncompressed GPU hash table

5.1 Stand-alone implementation

5.1.3 Removal of replication

Although replication does not affect correctness, its performance impact can be large. Replication reduces the effective table size, and this can hurt performance in several ways. Inserting a vector may take more time, because (atomically) writing a replicated vector takes longer than reading the identical vector that has already been inserted. Even if the vector does not yet exist in the table, insertion may take longer, because finding an empty slot takes longer in a highly filled table. False negatives may also cause redundant work in the model checker that uses the table. Indeed, in our experiments with real-world data we measured replication of up to 41%, especially at lower bucket sizes. With random data we measured only very little replication, because the duplicated vectors are distributed differently: randomly (random data) versus very locally (real-world data). Very local duplication offers more opportunities for replication, as more duplicate vectors are inserted concurrently, especially at lower bucket sizes. Our results differ from those of GPUexplore (only 2% replication with real-world models), because the hash table access patterns differ and local duplicate detection takes place in GPUexplore via the shared cache. As replication inherently does not occur in the compressed hash tables of Chapters 6 and 7, we need to design and implement a replication-free uncompressed hash table to allow a fair comparison.

Replication-free implementation

To get a replication-free implementation, we re-introduced the spinlock of the multi-core implementation of Laarman et al. (Subsection 3.1.1). As our GPU implementation does not use distinct bucket and data arrays, but only a data array, we use the first entry in a bucket slot for locking purposes. Our method to get rid of replication is very similar to the solution of Laarman et al.

A bucket/vector group now tries to claim an (apparently) empty bucket slot by writing a temporary ‘writing-in-progress’ element into the first entry of that slot. The most significant bit (MSB) of this element is set, to indicate that writing is in progress (for this to work, we require that the MSB of all input vector elements is not set). The other bits contain the memoised hash.
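
The encoding of the ‘writing-in-progress’ element can be sketched as follows. This is a minimal host-side C++ illustration, not the thesis code: the 32-bit element width and all names (kWipBit, make_wip, is_wip, wip_hash, is_valid_input) are assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>

// Assumption of this sketch: elements are 32 bits wide.
constexpr std::uint32_t kWipBit = 0x80000000u;  // most significant bit

// Build the temporary 'writing-in-progress' element from a memoised hash.
// The hash is truncated to the lower 31 bits; the MSB is always set.
constexpr std::uint32_t make_wip(std::uint32_t hash) {
    return kWipBit | (hash & ~kWipBit);
}

// True if the element marks a slot that is still being written.
constexpr bool is_wip(std::uint32_t element) {
    return (element & kWipBit) != 0;
}

// Memoised hash stored in a 'writing-in-progress' element.
constexpr std::uint32_t wip_hash(std::uint32_t element) {
    return element & ~kWipBit;
}

// Input vector elements must keep the MSB clear, so that they can never
// be mistaken for a 'writing-in-progress' element.
constexpr bool is_valid_input(std::uint32_t element) {
    return !is_wip(element);
}
```

Because the marker always has its MSB set and valid input elements never do, a single comparison on the first entry of a slot distinguishes a slot that is being written from one that already holds a vector.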

If a bucket/vector group has successfully claimed an empty slot, it subsequently writes all vector elements except the first. Only once those have been written does the vector group leader write its own element (i.e., the first element of the vector) into the first entry of the slot, indicating that writing has been completed. A warp-level synchronisation primitive (__syncwarp()) ensures that this first element is written only after the other threads in the vector group have written their elements. Writing is done in a volatile way to make sure that the writes are visible to other threads (as long as they also read in a volatile way).
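
The claim-then-publish order described above can be illustrated with a single-threaded-per-vector host analogue in C++. This is a sketch under stated assumptions, not the GPU implementation: std::atomic's compare_exchange_strong plays the role of atomicCAS(), a release store on the first entry stands in for the combination of the warp-level barrier and volatile writes, and the empty value 0 as well as the names kEmpty, kWipBit and claim_and_write are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::uint32_t kEmpty  = 0u;           // assumed 'empty entry' value
constexpr std::uint32_t kWipBit = 0x80000000u;  // MSB marks writing-in-progress

// Try to claim `slot` and, on success, write `vec` (length `len`) into it.
// Returns true if this caller claimed the slot and published the vector.
// On failure, `observed` holds the value found in the first entry.
inline bool claim_and_write(std::atomic<std::uint32_t>* slot,
                            const std::uint32_t* vec, std::size_t len,
                            std::uint32_t hash, std::uint32_t& observed) {
    observed = kEmpty;
    const std::uint32_t wip = kWipBit | (hash & ~kWipBit);
    // Step 1: claim the slot by swapping the empty value for the marker.
    if (!slot[0].compare_exchange_strong(observed, wip)) {
        return false;  // `observed` now contains the actual first entry
    }
    // Step 2: write every element except the first.
    for (std::size_t i = 1; i < len; ++i) {
        slot[i].store(vec[i], std::memory_order_relaxed);
    }
    // Step 3: publish the first element last. A reader that sees vec[0]
    // via an acquire load also sees elements 1..len-1.
    slot[0].store(vec[0], std::memory_order_release);
    return true;
}
```

On the GPU the elements are written by different threads of the vector group, which is why a barrier is needed there; in this sequential analogue the ordering is carried by the release store alone.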

If a bucket/vector group did not succeed in claiming an (apparently) empty slot, it reads the value returned by the atomicCAS() operation, i.e., the actual value of the first entry of the slot. If this is the temporary ‘writing-in-progress’ element, it checks the memoised hash. In many cases, it can already conclude that the vector being inserted by the concurrent invocation is different, as the hashes do not match. But if the memoised hash is the same as its own hash, the vector being inserted is possibly identical, and the vector group spinlocks on the first entry of the slot until the ‘writing-in-progress’ element has been replaced with the first element of the vector being inserted. It can then check the full slot for an identical vector; this check is done immediately if the atomicCAS() operation did not return a ‘writing-in-progress’ element.
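
The decision procedure after a failed claim can be sketched in host-side C++ as follows. Again this is an illustration under assumptions rather than the GPU code: atomic loads stand in for the reads of the slot, and the names Probe and inspect_slot are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::uint32_t kWipBit = 0x80000000u;  // MSB marks writing-in-progress

enum class Probe { Different, Identical };

// Called after a failed claim; `observed` is the value the CAS returned,
// i.e., the actual content of the first entry of the slot.
inline Probe inspect_slot(const std::atomic<std::uint32_t>* slot,
                          const std::uint32_t* vec, std::size_t len,
                          std::uint32_t hash, std::uint32_t observed) {
    if (observed & kWipBit) {
        // A concurrent insert is in progress: compare memoised hashes first.
        if ((observed & ~kWipBit) != (hash & ~kWipBit)) {
            return Probe::Different;  // cannot be the same vector
        }
        // Same hash: spin until the marker has been replaced by the
        // first element of the vector that is being inserted.
        do {
            observed = slot[0].load(std::memory_order_acquire);
        } while (observed & kWipBit);
    }
    // The slot now holds a complete vector: compare every element.
    if (observed != vec[0]) {
        return Probe::Different;
    }
    for (std::size_t i = 1; i < len; ++i) {
        if (slot[i].load(std::memory_order_acquire) != vec[i]) {
            return Probe::Different;
        }
    }
    return Probe::Identical;
}
```

The hash comparison is what keeps spinning rare: a caller only waits when the concurrent insert could actually be a duplicate of its own vector.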

For this to work, the slot entries must be read from the hash table in a volatile way; otherwise, a stale (old) value may be read and the algorithm may falsely conclude that the vector has not been inserted yet. The initial read (before claiming an (apparently) empty slot) can still be done in a non-volatile way, as long as every bucket slot that does not contain an entire vector (i.e., one or more slot entries are empty, or the first entry in the bucket slot is a (hash-matching) ‘writing-in-progress’ element) is considered an empty slot and is subsequently read again in a volatile way.

Risk of deadlock

When two or more bucket groups in the same warp try to claim the same bucket slot (which is possible when the bucket size is lower than 32), there is a risk of deadlock: only one bucket group succeeds, and the other bucket group(s) then spinlock on (the first entry of) the bucket slot. If the threads of the warp operate in lock-step, the constant spinlocking of the non-succeeding bucket group(s) may prevent the succeeding bucket group from reaching its “unlocking” operation (i.e., replacing the ‘writing-in-progress’ element in the first entry of the bucket slot with the first element of the vector).
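
The dependence on execution order can be made concrete with a deliberately simplified toy model in C++. It assumes that, under lock-step execution, the two divergent paths of a warp are serialised and that one path runs to its end before the other is scheduled; real hardware scheduling is not specified and may behave differently. The function name warp_completes and the iteration budget are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kWipBit = 0x80000000u;  // MSB marks writing-in-progress

// Toy model of one warp with two bucket groups that diverge after the CAS:
// the holder still has to unlock the slot, the spinner waits for the unlock.
// Returns true if both paths finish, false if the spinner exhausts its
// iteration budget (standing in for a deadlock).
inline bool warp_completes(bool holder_path_first, int max_spin_iterations) {
    std::uint32_t first_entry = kWipBit | 0x2Au;  // slot claimed by the holder
    const std::uint32_t first_element = 7u;       // MSB clear: a valid element

    if (holder_path_first) {
        first_entry = first_element;  // holder unlocks before the spinner runs
    }
    // Spinner path: loops while the marker is still present.
    int iterations = 0;
    while (first_entry & kWipBit) {
        if (++iterations >= max_spin_iterations) {
            return false;  // spinner never exits, holder is never scheduled
        }
    }
    // Holder path scheduled second (a no-op if it already ran first).
    first_entry = first_element;
    return true;
}
```

In this model, warp_completes(true, n) terminates for any n, while warp_completes(false, n) reports a deadlock for any n, which mirrors the observation that the outcome depends solely on which path is executed first.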

During the experiments with our replication-free implementation, we encountered no deadlocks. Apparently, the execution of the succeeding bucket group is completed first, including the unlocking operation (the threads of the other bucket group(s) are temporarily disabled in the meantime). Only then is the execution of the non-succeeding bucket group(s) resumed; as the first entry in the bucket slot no longer contains a ‘writing-in-progress’ element, no spinlocking takes place and the entire bucket slot can be read immediately (in a volatile way).

This execution order, however, is not specified anywhere and may therefore differ between GPU architectures or compiler versions. Even then, a different order can only lead to deadlock, not to a false ‘verification successful’ claim. Moreover, the most recent NVIDIA GPU architectures feature Independent Thread Scheduling, which resolves the issue, as the threads in a warp no longer need to operate in lock-step. Therefore, we decided to keep our current replication-free implementation and did not try to design a solution that is independent of the execution order.

Experimental results

With our random-data input, the performance benefits of the replication-free implementation are very limited: up to 3%. However, replication was already very small to begin with, due to the random distribution of duplicated vectors in the random-data input. In contrast to the situation in GPUexplore [5], the spinlocking does not hurt performance either. We cannot explain this difference, as the spinlocking implementation that was tried there is not available. With our real-world data, the performance benefits are more pronounced: up to 18%. They are more pronounced because the initial replication was more severe, up to 41%, especially at lower bucket sizes, due to the very local duplication in the real-world data. As there are no false negatives anymore, the program that uses the hash table no longer needs to do redundant work, which may lead to further performance benefits. All experimental data can be found in the [U-H/R] table in Appendix B.2.

In conclusion, we designed and implemented a replication-free uncompressed hash table whose performance is equal to or even better than that of GPUexplore’s implementation with replication. The full table size can now be used effectively, and the program that uses the hash table no longer receives false negatives.

In document Porting tree based hash table compression to GPGPU model checking (Page 38-40)