7.7 Experiments and Optimizations
7.7.3 Non-Clustered Baseline
The results in this section were obtained using a non-clustered table layout, where each table is an unordered heap, no join indices are created to connect tables related by a foreign-key, and hash-join is typically used to evaluate queries with joins.
Without Checkpointing
Figure 7.8a shows initial results for power runs on a non-clustered (i.e. heap) table layout, without checkpointing. The graph clearly shows that query times keep rising until run 14, and that commit times, most notably for RF1, keep increasing consistently. We also see an outlier in run 25. Ignoring this outlier, total times (i.e. including RF times) for a power run degrade from 60s in the first run to 92s in run 30. During this time, PDT memory consumption grows from 0 to roughly 18GB. Note, however, that most of those updates come from the throughput runs, that are run immediately after each power run. A single power run adds 0.1% inserts and also deletes 0.1%, which results in roughly 105MB of PDT update data.
The read-only query times (i.e. queries 1-22, ignoring RF functions) increase from 51s to 64s over the first 14 runs, at which they flatten out. This is caused by PDT inserts ending up at the tail of the unordered orders and lineitem tables, introducing skew in the distribution of PDT updates over parallel threads (which use range partitioning). In general, the higher the ratio of updates within the output vector of a scan, the worse the performance, as merging of PDT inserts requires logic that resembles tuple-at-a-time processing. In this scenario, where inserts end up in the final parallel partitions (based on range partitioning), scan performance degrades from 0.7 cycles per tuple to 2.5. Once the tail partition is saturated with PDT updates, around run 14, further degradation of query times stops. PDT memory consumption, however, keeps growing linearly, as indicated by the line plot (with scale on the right y-axis). This is caused by checkpointing being disabled for this experiment.
Figure 7.8a also shows a considerable degradation of transaction commit times for RF1. This can be attributed to two factors: write-PDT to read- PDT propagation on every commit, and increasing costs for generating snapshot copies of the growing read-PDTs. Both issues are relatively easy to optimize.
The initial default behavior of Vectorwise was to trigger write-to-read prop- agation once the write-PDT exceeds a configurable, fixed threshold. In this experiment, this threshold was set to 4MB (i.e. in the order of the cache size), which happened to get exceeded on every commit. Write-to-read propagation is relatively expensive, as it generates a snapshot copy of the read-PDT, so that write-PDT updates can be migrated into an isolated copy. We concluded that reducing the number of write-to-read propagations is more important than keep- ing the write-PDT strictly in the cache, and changed the trigger to be based on a dynamic threshold, i.e. a fraction (f) of the current read-PDT size. The effect is that, assuming fixed size commits, write-to-read propagation is (eventually) only called once every 1/f’th commit.
Besides the number of read-PDT snapshot copies made, the copying process itself turned out to be expensive as well. Not only does it copy the PDT tree structure, but also the full value-space (i.e. all insert tuple data). Observing
7.7. EXPERIMENTS AND OPTIMIZATIONS 175 0 20 40 60 80 100 120 140 160 1 5 10 15 20 25 30 0 5 10 15 20 Total time (s) PDT memory (GB) Run # queries rf1 rf1 commit rf2 rf2 commit PDT memory 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 (a) Baseline 0 20 40 60 80 100 120 140 160 1 5 10 15 20 25 30 0 5 10 15 20 Total time (s) PDT memory (GB) Run # queries rf1 rf1 commit rf2 rf2 commit PDT memory 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 (b) Optimized commit
176 CHAPTER 7. INDEX MAINTENANCE
the difference between commit times for RF1 and RF2, which holds only deletes that do not store anything in value-space, we can even conclude that copying and migrating value-space data constitutes the bulk of the RF1 commit cost. Copying the PDT tree structure makes sense, as it is relatively small and sub- ject to scattered changes. Snapshot-copying the value-space data, however, is often needless, as this is an append-only data structure, stored in pre-allocated memory buffers. These buffers are subject to re-sizing, and often end up with unused space in the tail. Therefore, if the write-PDT value-space data fits in the unused memory area of the read-PDT value-space, we simply append the write-PDT data directly into the global read-PDT, avoiding a snapshot copy of the value-space part of the read-PDT. This is safe as long as only one commit can be active at any moment in time, which is the situation in Vectorwise.
After implementing both optimizations, using a write-to-read propagation threshold of 0.1 × size(read-PDT), RF1 commit times improve considerably, as shown in Figure 7.8b. The only exception is run 25, where we see a spike that can be attributed to increased RF1 commit times. Recall that each bar in the graph represents the average of 5 runs. Closer inspection of server logs showed that the spikes are actually being caused by an outlier in only one of those 5 runs, where RF1 commit suddenly takes 90s instead of the 20-25s observed for the remaining 4 runs (in the baseline experiment). The additional time is spent migrating write-PDT value-space data to the read-PDT. All memory allocations were verified to be properly aligned and the issue failed to occur with profiling enabled. The most likely explanation therefore remains that it has to do with an oddity around memory copies on the NUMA architecture, maybe in combination with Vectorwise’s own memory manager.
With Checkpointing
To avoid PDT memory consumption from growing out of bounds, we enable checkpointing to rebuild tables in the background. Figure 7.9a shows the effect of enabling automatic checkpoints, which results in both orders and lineitem tables being rebuilt independently of each other once they reach an update ratio of 2%. We can clearly observe the positive effect of those checkpoints on PDT memory consumption, which now fluctuates around a constant average value. However, we also see regular spikes in execution times, which were not present in earlier runs without checkpointing. Inspection of logs and profiles revealed that the higher peaks (i.e. runs 5, 10, 12, 17, 22 and 29 happen to overlap with the final half of the time-frame where a background checkpoint of the (larger) lineitem table is running and committing. After the commit of such a checkpoint, queries suffer from I/O misses, as the checkpointing process of the large lineitem table evicted some blocks from the buffer manager, while the compressed TPC-H data-set just about fits the 32GB buffer manager space. Doubling the buffer manager memory makes those I/O misses disappear, as illustrated by the benchmark results in Figure 7.9b, where the impact of a background checkpoint is reduced to a slight CPU overhead in the form of a single thread that scans and recompresses a checkpointed table.
The overall trend of both graphs is that, on average, query times, commit times and PDT memory consumption all stay flat. Checkpointing successfully prevents the amount of PDT updates in memory from growing out of bounds, thereby reducing both the negative effect of PDT updates on scan performance
7.7. EXPERIMENTS AND OPTIMIZATIONS 177 0 20 40 60 80 100 120 140 160 1 5 10 15 20 25 30 0 5 10 15 20 Total time (s) PDT memory (GB) Run # queries rf1 rf1 commit rf2 rf2 commit PDT memory 0 0 5 0 5 5 5 5 2 5 0 5 5 5 5 3 5 0 5 5 5 5 3 5 5 2 5 5 5 3
(a) 128GB memory configuration
0 20 40 60 80 100 120 140 160 1 5 10 15 20 25 30 0 5 10 15 20 Total time (s) PDT memory (GB) Run # queries rf1 rf1 commit rf2 rf2 commit PDT memory 0 0 5 0 5 5 5 5 0 5 0 5 5 5 5 0 5 0 5 5 5 5 0 5 5 5 5 5 5 0
(b) 256GB memory configuration (in-memory checkpointing)
178 CHAPTER 7. INDEX MAINTENANCE
and expensive PDT snapshotting times during commit.