Observations - Approach II: Static Placement

3.2 Approach II: Static Placement

3.2.2 Observations

We test the simple hash-based group-by described in the previous section on an NVIDIA K80 (single-GPU, Table 2.3). We use 13 thread blocks (the number of multi-processors on the GPU) and 1024 threads per block. The allocated hash tables are twice the size of the group count, leading to a fill factor of 50%. As a hash function, we apply the 32-bit FNV-1a hash [Fowler et al. 2015] and we map the hash value to a bucket ID via modulo division. The group-by is performed on a table with ⇡1.6 billion rows consisting of four INTEGER columns:

CREATE TABLE atable (

col1 INTEGER NOT NULL, col2 INTEGER NOT NULL, col3 INTEGER NOT NULL, col4 INTEGER NOT NULL ) ;

The table is stored in a column format, leading to a storage size of 6GB per column and 24GB for the whole table. A table with this size can usually not be stored directly in the GPUs memory, supporting our initial implementation choice of storing the base data in the system’s main memory. The values in col1 are independent, uniformly distributed, and random in the range

of [0, 232₎_{. Accessing the hash table based on this distribution results in scattered memory}

accesses with minimal caching opportunities. By contrast, non-randomized or skewed distri- butions generally have more localized memory accesses and make better use of the GPU cache and, thus, lead to better performance. Our data distribution can thereby be considered as the worst-case scenario. To analyze the performance as a function of the number of groups, we use the following query that utilizes the MOD operation to limit the number of groups:

Query A: _{SELECT MOD(col1, ?), COUNT(*) FROM atable GROUP BY MOD(col1, ?);}

We study the impact of the data transfers on the overall performance by extending the SELECT clause of Query A with expressions that reference more columns. Ideally, we would expect the execution time for these variants of Query A to be determined by how many columns are referenced, as shown in Figure 3.10a. The GPU can read from the host memory via zero- copy access at a peak speed of 11.8GB/s in our setup. Since we believe that the connection from GPU to host memory should be the bottleneck, we would expect an execution time that is inversely proportional to the total size of the accessed columns. This is illustrated by the four equi-spaced lines in Figure 3.10a. However, the actual performance we observe is different, as shown in Figure 3.10b. We observe that: (1) the performance does not remain constant as we increase the number of groups by adjusting the MOD parameter in the query. (2) The runtime has a high variability, when only one column is accessed. The execution time can have large jumps for certain group combinations. This is not a measurement artifact – the numbers are in fact repeatable. (3) Accessing more than one column appears to hide some of this variability. (4) The execution time increases for few groups and (5) sharply increases for 100 million groups and more.

0 2 4 6 8 10 number of groups (M) runtime (sec) 0 1 2 3 4 5 6 7 8 9 10 1e−06 1e−04 0.01 0.1 1 10 100 1000 11.8 GB/s

Query 4: MOD(col1,?), SUM(col2+col3+col4) Query 3: MOD(col1,?), SUM(col2+col3) Query 2: MOD(col1,?), SUM(col2) Query 1: MOD(col1,?), COUNT(*)

(a) Expected performance behavior.

0 2 4 6 8 10 number of groups (M) runtime (sec) 0 1 2 3 4 5 6 7 8 9 10 1e−06 1e−04 0.01 0.1 1 10 100 1000 Query 4 Query 3 Query 2 Query 1

(b) Actual performance behavior.

Figure 3.10: Performance of initial group-by implementation, measured as execution time of a table scan over ⇡1.6 billion rows (fill factor 50% and 13 x 1024 threads).

For further investigations, we focus on the 1-column query, as this query shows the performance effects more detailed than the other queries, which partially mask the effects through data transfers. For the 1-column query, we can distinguish five regions in which different phe- nomena appear to dominate. Figure 3.11 shows the ranges of these five regions enumerated as I, . . . , V. Unfortunately, we do not have insight information from NVIDIA on their hardware that would provide clear explanations for the different behaviors. In order to provide an expla- nation, we combine the publicly available hardware descriptions with extensive profiling via performance counters and benchmarks.

Region I: Contention. The group count, and thus the hash table, is small. For a globally shared hash table, this leads to contention that limits the overall performance. Beginning with the Kepler GPU architecture (this includes the K80), atomic operations on device memory are performed in ALUs in the L2 cache [Nyland and Jones 2012], which is shared by all processors on the GPU. Atomic operations, e.g., atomic additions used for updating a SUM aggregate, are treated as store instructions. The operations are then routed to the ALUs based on their target memory address. A first-in first-out (FIFO) buffer in front of each ALU enqueues the operations before they are processed and the memory locations in L2 are updated.

The downside of this approach is that updates to the same hash bucket, or to buckets that are located on the same cache line, are sent to the same ALU. For small numbers of groups or a highly skewed distribution of group-by keys, this introduces a work imbalance on the ALUs and, thus, contention.

Region II: L2 & Spiky Performance. For all regions, the execution time of our Group-By implementation is spiky. The execution times in Region II correspond to the access time for the input columns, given the PCI Express bandwidth of ⇡ 11.8GB/s. If it were not for the spikes, the performance in the second region would be entirely dominated by the available PCIe bandwidth. In Region II, the hash sizes are large enough that contention, which domi-

#groups (M) runtime (sec) 1 10 100 1e−06 1e−04 0.01 0.1 1 10 100 1000 I II III IV V

Figure 3.11: Regions with different behavior for Query A (fill factor 50% and 13 x 1024 threads). nated performance in Region I, no longer has a significant impact. The region contains hash tables that are  1.5MB, i.e., that still fit into the L2 cache on the K80. We repeated the ex- periment multiple times, assuming we would observe a non-deterministic artifact. However, the spikes persisted at the same locations. Through deeper investigation, we found out that the spikes are due to collisions of the hash mapping. The FNV-1a hash function followed by the modulo division mapped many different keys to the same or close hash buckets, leading to long sequences of linear probing. FNV-1a is known as a good hash function and we can confirm this by studying the distribution of 32-bit hash values. The hash values are uniformly distributed over the full range of 32-bit integers. The hash collisions are actually introduced when mapping the 32-bit hash values into hash buckets. This mapping does not sufficiently preserve the uniformity of the computed hash values, especially for small and compact key do-

mains (key_max key_min⌧ 220_{). By contrast, we do not see performance issues for values that}

were chosen randomly from large domains (> 220_{). However, the modulo operation in the}

grouping expression of our benchmark query effectively produces such a compact key domain. Compact key domains are common in real workloads, e.g., for sequential order keys. The collisions of the hash mapping, and therefore the spikes, can be seen for all regions, however, in other regions the effects of the collisions are masked by other effects.

Region III: Hash Table > L2 Data Cache. In Region III, we observe the expected increase in execution time when the hash table grows beyond the L2 cache capacity (1.5MB for the K80). The execution time gradually increases and remains constant when almost every access results in a cache miss and in a random access to global memory.

Region IV: TLB issues (1). Region IV starts at a hash table size of about 130MB (⇡8.5M groups). The behavior looks like missing the next level of data cache; however, L2 is already the last level cache on the GPU. The TLB is also organized as a cache, and we believe these effects can be influenced by a TLB cache. However, Nvidia does not publish architectural information, so we need to investigate further with micro-benchmarks.

Region V: TLB issues (2). The jump at the beginning of Region V starts with hash table sizes of 2GB. Again, we suspect a TLB cache to be the source of this problem and more investigation is needed.

In document Heterogeneity-Aware Placement Strategies for Query Optimization (Page 56-59)