Placement Optimization - Heterogeneity-Aware Placement Strategies for Query Optimization

4.4 Evaluation

4.4.2 Placement Optimization

Given the runtime estimation for each operator, we now look at the whole query plan and evaluate our two optimization strategies: local and global optimization. To this end, we first introduce our evaluation setup, including two benchmark workloads, before looking at the search spaces and the optimization quality.

[Figure 4.10 (diagram): 32 GB main memory (2132 MHz); CUs: AMD CPU, AMD iGPU, Nvidia K20, Nvidia GT640; link labels: 10.3 GB/s, 1.3 GB/s, 12.4 GB/s, PCIe2 x4, PCIe3 x16]

Figure 4.11: Query runtime of SSB queries with 1M different placements per query. (a) Overview (b) Focus on optimal placement

Hardware Setup For the evaluation, we use one heterogeneous system consisting of four different CUs: one CPU (AMD CPU) and three GPUs (AMD iGPU, K20, and GT640). The main memory connections are shown in Figure 4.10 (hardware properties in Table 2.3). The transfer bandwidths are peak values we measured; the real bandwidths for various data sizes can differ. We benchmark these bandwidths by transferring 16 KB to 256 MB for each $CU_i \rightarrow CU_j$ combination and, additionally, for transfers from the host memory to the CUs. The benchmark results are stored similarly to operator runtimes in our estimation model. However, we do not apply data cleaning, as the data is a small fixed set of tuples, where memory space is not an issue.
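As a rough illustration of how such stored benchmark points could be turned into a transfer estimate, the following Python sketch interpolates linearly between measured (size, time) tuples, in the spirit of the operator runtime estimation. The measured values, the name `estimate_transfer_time`, and the clamping outside the measured range are assumptions made for this example and are not taken from the evaluated system.

```python
from bisect import bisect_left

# Hypothetical benchmark points for one CU_i -> CU_j link: (bytes, seconds).
MEASUREMENTS = [
    (16 * 1024, 0.00002),
    (1024 * 1024, 0.0002),
    (16 * 1024 * 1024, 0.002),
    (256 * 1024 * 1024, 0.03),
]


def estimate_transfer_time(size_bytes, points=MEASUREMENTS):
    """Estimate a transfer time by linear interpolation between raw tuples.

    Sizes outside the measured range are clamped to the nearest measured
    point (an assumption of this sketch, not a property of the real model).
    """
    sizes = [size for size, _ in points]
    if size_bytes <= sizes[0]:
        return points[0][1]
    if size_bytes >= sizes[-1]:
        return points[-1][1]
    upper = bisect_left(sizes, size_bytes)  # first point with size >= size_bytes
    (s0, t0), (s1, t1) = points[upper - 1], points[upper]
    return t0 + (t1 - t0) * (size_bytes - s0) / (s1 - s0)


if __name__ == "__main__":
    for size in (64 * 1024, 8 * 1024 * 1024, 128 * 1024 * 1024):
        print(size, estimate_transfer_time(size))
```

One such table per CU pair (plus one per host-to-CU link) would be enough to look up transfer costs during placement optimization.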

Database Workload In our heterogeneous system, we use gpuDB [Yuan et al. 2013] to execute Star Schema Benchmark queries (SSB) [O’Neil et al. 2009] and Ocelot [Heimel et al. 2013] to execute TPC-H queries [TPC 2014]; both systems are limited to the stated benchmarks. Both systems allow OpenCL-based execution and, therefore, are able to use different CUs with one code base. However, neither system allows heterogeneous execution within a query; both support only manually fixed single-CU execution. We extend these systems to log the query structure, operator runtimes, and data sizes in order to extract information for runtime estimation and placement optimization. The required information can be retrieved by executing each query on every CU in single-CU mode. Therefore, we do not need to implement heterogeneous placement at this point in order to evaluate our approach. Through the offline evaluation, we can estimate the benefits and limitations of our approach. For gpuDB, we are able to execute all 13 SSB queries with our system. For Ocelot, only 9 queries could be executed on all four CUs, while the other queries abort with different errors on at least one CU. We use scale factor 10 for the SSB queries and scale factor 5 for the TPC-H queries.

Figure 4.12: Query runtime of TPC-H queries with 1M different placements per query. (a) Overview (b) Focus on optimal placement

Search Space and Strong Placements To illustrate the search space and the resulting differences in query runtime, we use the collected runtime information and generate 1M random hypothetical placements for each query. The performance results, including all operator runtimes and transfer costs, are shown in Figure 4.11 and Figure 4.12 as box-whisker plots. Every box represents one query and the distribution of runtimes for the different placements. The box itself spans the range from the 25% to the 75% quantile; the included line represents the median; the whiskers show ± 1.5 IQR (interquartile range, usually the size of the box); and the shown points are outliers. In addition to the box-whisker plots, we show the number of operators and the best possible execution in Figure 4.11b and Figure 4.12b. We computed the optimal placement through exhaustive search with pruning, which required several hours for most queries.
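The sampling of random hypothetical placements and the box-plot statistics can be sketched as follows. This is a simplified illustration under stated assumptions: the operator runtimes are invented, the plan is a linear chain, every CU change between neighboring operators costs the same flat amount `TRANSFER`, and base data transfers are ignored. None of these values come from the evaluation.

```python
import random
import statistics

CUS = ["CPU", "iGPU", "K20", "GT640"]

# Hypothetical estimated runtimes (ms) per operator and CU for a 5-operator chain.
RUNTIME = [
    {"CPU": 9.0, "iGPU": 4.0, "K20": 2.0, "GT640": 5.0},
    {"CPU": 6.0, "iGPU": 3.0, "K20": 3.5, "GT640": 2.5},
    {"CPU": 7.0, "iGPU": 3.0, "K20": 1.5, "GT640": 4.0},
    {"CPU": 2.0, "iGPU": 2.5, "K20": 2.2, "GT640": 2.1},
    {"CPU": 5.0, "iGPU": 2.0, "K20": 2.8, "GT640": 3.0},
]
TRANSFER = 3.0  # assumed flat cost (ms) when neighboring operators use different CUs


def query_runtime(placement):
    """Sum of operator runtimes plus transfer costs for one placement."""
    total = sum(RUNTIME[i][cu] for i, cu in enumerate(placement))
    total += sum(TRANSFER for a, b in zip(placement, placement[1:]) if a != b)
    return total


def sample_runtimes(count, seed=0):
    """Draw `count` random placements and return their estimated runtimes."""
    rng = random.Random(seed)
    return [query_runtime([rng.choice(CUS) for _ in RUNTIME]) for _ in range(count)]


runtimes = sample_runtimes(10_000)
q1, median, q3 = statistics.quantiles(runtimes, n=4)  # 25%, 50%, 75% cut points
iqr = q3 - q1

if __name__ == "__main__":
    print("box:", q1, "to", q3, "median:", median)
    print("whisker limits:", q1 - 1.5 * iqr, "to", q3 + 1.5 * iqr)
    print("best sampled placement runtime:", min(runtimes))
```

With real data, `RUNTIME` and the transfer costs would come from the logged single-CU executions and the bandwidth benchmark rather than from constants.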

With 9 to 24 operators for SSB queries, the search space contains $4^{9} = 262144$ to $4^{24} \approx 2.8 \cdot 10^{14}$ possible placements. For the TPC-H queries, there are 9 to 36 operators, leading to a maximal search space of $4^{36} \approx 4.7 \cdot 10^{21}$ possible placements. Additionally, we can see a wide range of runtimes for the different placement possibilities; therefore, it is highly important to apply placement optimization.
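The search space sizes follow directly from assigning each operator to one of the CUs independently. A minimal Python check, where the helper name `search_space` is introduced only for this example:

```python
def search_space(num_cus, num_operators):
    """Number of distinct placements when every operator can run on any CU."""
    return num_cus ** num_operators


if __name__ == "__main__":
    for operators in (9, 24, 36):
        size = search_space(4, operators)
        print(f"{operators} operators on 4 CUs: {size} ({size:.1e}) placements")
```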

[Figure 4.13: strong placements (%) per SSB query (1_1 to 4_3) with 1, 2, 3, and 4 CUs enabled]

Figure 4.14: Occurrence of strong operator placements within TPC-H queries. [Strong placements (%) per TPC-H query (Q3, Q4, Q5, Q6, Q10, Q11, Q12, Q15, Q18) with 1, 2, 3, and 4 CUs enabled]

To reduce the search space for global optimization, we proposed strong placements: single-operator placements that are always placed on one CU, even with worst-case transfers. Such strong placements do not need to be considered in the optimization anymore and, therefore, reduce the search space. For the evaluation, we build every possible subset of our four CUs. There are four subsets with only one CU, six two-CU combinations, four subsets with three CUs, and one set with four CUs. We report the average percentage of strong placements for each subset size in Figure 4.13 and Figure 4.14. If we only use one CU, all operators can be considered strong placements. For two CUs, 20% to 50% of the operators can be fixed to one CU, reducing the search space significantly. For example, for SSB query 2_1, nine out of 18 operators can be assigned as strong placements on average, leading to a search space of $2^{9} = 512$ instead of $2^{18} = 262144$. However, for more than two CUs, we see a large decrease of strong placements for SSB queries (Figure 4.13) and a smaller decrease for TPC-H queries (Figure 4.14). This decrease is caused by (1) a larger choice of CUs: with more CUs in the system, the execution times of at least two CUs are more likely to be close, and (2) more expensive worst-case transfers, which prevent a safe decision for one CU because that decision might introduce harmful transfers. All in all, we conclude that strong placements are beneficial for two CUs, while the search space reduction diminishes for systems with more than two CUs.
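The following Python sketch illustrates one possible reading of the strong-placement test together with the subset enumeration and the resulting search space reduction. The criterion in `strong_cu` (fastest CU plus its worst-case transfer cost must still beat every other enabled CU without transfers) is an assumption made for this example, as are all runtimes and function names.

```python
from itertools import combinations

CUS = ["CPU", "iGPU", "K20", "GT640"]


def strong_cu(runtime, worst_transfer, enabled):
    """Return the CU that wins even when it pays worst-case transfers, else None.

    Assumed criterion: the candidate's runtime plus its worst-case transfer
    cost must still beat every other enabled CU running without transfers.
    """
    if len(enabled) == 1:
        return enabled[0]  # with a single CU, every operator is trivially strong
    best = min(enabled, key=lambda cu: runtime[cu])
    pessimistic = runtime[best] + worst_transfer[best]
    others = [cu for cu in enabled if cu != best]
    return best if all(pessimistic < runtime[cu] for cu in others) else None


def reduced_search_space(num_cus, num_operators, num_strong):
    """Placements left once strong operators are fixed to their CU."""
    return num_cus ** (num_operators - num_strong)


# Number of CU subsets per subset size 1..4.
subset_counts = [len(list(combinations(CUS, k))) for k in range(1, 5)]

# Hypothetical operator: the K20 wins against the CPU despite its transfer penalty.
example_runtime = {"CPU": 9.0, "K20": 2.0}
example_worst_transfer = {"CPU": 0.0, "K20": 3.0}

if __name__ == "__main__":
    print(subset_counts)
    print(strong_cu(example_runtime, example_worst_transfer, ["CPU", "K20"]))
    print(reduced_search_space(2, 18, 9))
```

The sketch also shows why strong placements become rarer with more CUs: every additional enabled CU adds one more comparison that the pessimistic estimate has to win.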

Local vs. Global Optimization Finally, we want to compare the performance of local and global optimization. For this, we evaluate single-CU execution, local placement optimization and global placement optimization with different starting placements on the presented SSB queries and TPC-H queries. We use runtime estimation with the input data from all queries and all CUs and do not apply data cleaning. The results are normalized to the pre-computed ideal placement execution and are shown in Figure 4.15 for the SSB queries and TPC-H queries. We can make multiple observations.

• Single-CU execution covers the full range of runtimes: the CPU mostly shows the worst runtime, while the GT640 or the iGPU shows the best single-CU runtime because of their parallel execution and relatively fast connection to the main memory.

• Local optimization shows good results, always being in the top 2% of the query runtimes. For four SSB queries and two TPC-H queries, the simple local optimization approach is able to find the best possible placement. However, there are also queries where local optimization results in worse performance than single-CU execution because of unnecessarily introduced data transfers (SSB queries 4_1 and 4_2).

• The performance of global optimization varies with the starting placement. Starting with single-CU placements can result in a local optimum at the starting placement, i.e., no single placement adjustment can improve the performance, since transfers introduce too much additional cost. Even with better placements being possible, our greedy approach would not leave the starting placement for that reason. This can be seen for most SSB queries in Figure 4.15a. There, even some random starting placements end in a single-CU execution as local optimum.

Figure 4.15: Placement optimization results relative to the best placement’s runtime. (a) SSB queries - full scale (b) SSB queries - top 2 % (c) TPC-H queries - full scale (d) TPC-H queries - top 2 %. [Relative performance per query; series: single CU, local, global on singleCU, global on random, global on local]

• Apart from the single-CU local optima, global optimization with different single-CU starting placements and random starting placements results in good runtimes, which are partly better than those of local optimization. This shows that we need to apply global optimization with multiple different starting placements and choose the best resulting placement to achieve a good final result.

• To make sure that global optimization achieves the same or an even better placement than local optimization, we found it beneficial to combine both approaches. At compile-time, we simulate local optimization, i.e., we traverse the query plan and define the placement decision only according to input transfers and runtime estimation. The resulting placements for every operator can then be used as a starting placement for the global optimization, where we improve the placement by finding data sharing opportunities through the global view. This approach shows results that are always at least as good as local placement. While this is a good heuristic, some global optimizations with other starting placements achieved even better results (e.g., SSB queries 2_1, 2_3, 4_1, 4_2, and TPC-H query 5). Again, this shows that we need to evaluate the global optimization with multiple starting placements.

• Finally, we see that we always found good placements within only 0.5% of the ideal runtime, using our local or global optimizations. These optimizations execute in milliseconds instead of hours for the pruning-based search, making the whole placement optimization applicable for online query processing.
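The interplay of the strategies compared above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the evaluated implementation: the plan is a linear chain, runtimes are invented, each CU change between neighboring operators costs a flat `TRANSFER`, and the greedy step is modeled as plain hill climbing over single-operator moves. All function names are hypothetical.

```python
import random

CUS = ["CPU", "iGPU", "K20", "GT640"]

# Hypothetical estimated runtimes (ms) per operator and CU for a linear plan.
RUNTIME = [
    {"CPU": 8.0, "iGPU": 3.0, "K20": 2.0, "GT640": 4.0},
    {"CPU": 6.0, "iGPU": 2.5, "K20": 3.0, "GT640": 2.0},
    {"CPU": 7.0, "iGPU": 3.0, "K20": 1.5, "GT640": 3.5},
    {"CPU": 2.0, "iGPU": 2.5, "K20": 2.2, "GT640": 2.1},
]
TRANSFER = 1.0  # assumed flat cost (ms) per CU change between neighboring operators


def cost(placement):
    """Estimated query runtime: operator runtimes plus transfer costs."""
    run = sum(RUNTIME[i][cu] for i, cu in enumerate(placement))
    moves = sum(1 for a, b in zip(placement, placement[1:]) if a != b)
    return run + TRANSFER * moves


def local_optimization():
    """Traverse the plan; decide each operator from its input transfer and runtime only."""
    placement = []
    for op in RUNTIME:
        prev = placement[-1] if placement else None
        placement.append(
            min(CUS, key=lambda cu: op[cu] + (TRANSFER if prev not in (None, cu) else 0.0))
        )
    return placement


def greedy_global(start):
    """Apply the best single-operator move until no move improves the cost."""
    current = list(start)
    while True:
        best, best_cost = None, cost(current)
        for i in range(len(current)):
            for cu in CUS:
                if cu == current[i]:
                    continue
                candidate = current[:i] + [cu] + current[i + 1:]
                if cost(candidate) < best_cost:
                    best, best_cost = candidate, cost(candidate)
        if best is None:
            return current  # local optimum: no single adjustment helps
        current = best


def multi_start(num_random=5, seed=0):
    """Run the greedy search from single-CU, random, and local starting placements."""
    rng = random.Random(seed)
    starts = [[cu] * len(RUNTIME) for cu in CUS]
    starts += [[rng.choice(CUS) for _ in RUNTIME] for _ in range(num_random)]
    starts.append(local_optimization())
    return min((greedy_global(start) for start in starts), key=cost)


if __name__ == "__main__":
    local = local_optimization()
    print("local:", local, cost(local))
    print("global on local:", greedy_global(local), cost(greedy_global(local)))
    print("best of all starts:", multi_start(), cost(multi_start()))
```

Because the greedy search only accepts strictly improving moves, starting from the local result can never end up worse than local optimization, and taking the minimum over several starts guards against getting stuck in a single-CU local optimum.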

During our evaluation, we noticed two interesting trends:

(1) With changing magnitude of the transfer costs, the ideal optimization strategy changes. For example, with low or no transfer costs (e.g., because of small transferred data or high-bandwidth connections), local optimization is sufficient, as data sharing does not need to be considered. Local optimization would simply pick the best-performing CU for each operator and achieve the best possible runtime. If the transfer costs are significantly larger than the actual execution, then single-CU execution is sufficient, where transfers are only needed for base data, while all intermediate results are shared on one CU. For scenarios in between, e.g., when transfer costs are significant but not much larger than execution times, global optimization is needed to find the ideal heterogeneous placement while considering data sharing between operators. Global optimization can actually be used in all scenarios with different transfer costs. It can find highly heterogeneous placements as well as single-CU placements, depending on the magnitude of transfer costs and the operator execution times.
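This trend can be reproduced on a deliberately tiny example. The Python sketch below brute-forces the best placement of a three-operator chain on two CUs for different transfer costs. The runtimes, the flat per-change transfer cost, and the omission of base data transfers are assumptions chosen for illustration only.

```python
from itertools import product

CUS = ["CPU", "GPU"]

# Hypothetical runtimes (ms): the fastest CU alternates between operators.
RUNTIME = [
    {"CPU": 4.0, "GPU": 1.0},
    {"CPU": 1.0, "GPU": 3.0},
    {"CPU": 5.0, "GPU": 2.0},
]


def cost(placement, transfer):
    """Operator runtimes plus a flat transfer cost per CU change."""
    run = sum(RUNTIME[i][cu] for i, cu in enumerate(placement))
    moves = sum(1 for a, b in zip(placement, placement[1:]) if a != b)
    return run + transfer * moves


def best_placement(transfer):
    """Exhaustive search, feasible only because the example is tiny."""
    return min(product(CUS, repeat=len(RUNTIME)), key=lambda p: cost(p, transfer))


if __name__ == "__main__":
    for transfer in (0.0, 0.5, 1.5, 100.0):
        placement = best_placement(transfer)
        print(transfer, placement, cost(placement, transfer))
```

Without transfer costs, the optimum simply picks the fastest CU per operator; once a CU change costs more than it saves, the optimum collapses to single-CU execution.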

(2) A second trend we have seen is that local optimization might produce worse results when optimizing for many more operators. As local optimization does not consider future operators, the placement is decided locally, and wrong decisions add harmful transfer costs. With only a few operators, there are not many chances for future operators to share data or suffer from earlier decisions. With many more operators, data sharing becomes more important because every placement decision could influence many future decisions and introduce many more unnecessary transfers.

4.5 Conclusion

In this chapter, we proposed a runtime estimation approach based on online learning during execution and linear interpolation between raw tuples for the estimation. We have shown that this approach is capable of estimating runtime even for irregular behavior and that possible errors are low and do not propagate to the placement decisions to a large extent.

We also discussed two different placement optimization strategies, local and global optimization, including implementation details and evaluation. We have shown that the search space can be reduced with strong placements for two CUs in one system, while the potential for more than two CUs is limited. However, even without strong placements, local and global optimization found good placements for our evaluation queries. Global optimization is most likely to find a good resulting placement when multiple different starting placements are considered. We have shown that the large search space for global optimization is not a problem and that our greedy algorithm is suited for the placement optimization. The remaining challenge is accurate cardinality estimation, because compile-time approaches like global optimization are highly dependent on the estimated cardinality information for both the operator runtime estimation and the transfer estimation.

5 Adaptive Placement Optimization

5.1 Open Challenges

5.2 Adaptive Placement Approach
