Clustered Table Layout - Experiments and Optimizations

7.7 Experiments and Optimizations

7.7.4 Clustered Table Layout

In this section, we show TPC-H results on a clustered table layout, where tables with a foreign key may be ordered on the sort key of the referenced parent table and connected by means of a join index. This allows clustering of data to be exploited, which results in reduced scan ranges (and data volume) that can be propagated over to connected tables. Besides, clustering allows merge-join to be used for evaluating plans with joins, which is typically less CPU and memory intensive than the typical hash-join used on an unordered heap table.

Figure 7.10a depicts clustered results without automatic checkpointing, and shows a clear improvement in execution time over non-clustered runs. The read- only query times improve by roughly a factor 2, degrading from 25s in run 1 to 28s in run 30, and do not exhibit the skew where PDT updates end up in the tail parallel partition(s), which caused non-clustered runs to degrade much worse, from 51s to 64s. In terms of total time, including refresh routines, the benefit becomes slightly less than a factor two, 35-40s for clustered, versus 63- 79s (ignoring the outlier in run 25) for non-clustered, as both layouts require similar time to perform both RF1 and RF2, around 12s on average. Still, we can conclude that a clustered layout shows significant benefits over a non-clustered layout in terms of query evaluation performance, without introducing negative effects in terms of update performance.

With automatic checkpointing enabled, Figure 7.10b first illustrates what happens without the deferred index maintenance support described in Sec- tion 7.6.2. What we see is that runs 3 to 22 are continuously overlapped by a background checkpoint, but that PDT memory consumption keeps rising. In fact, we are dealing with four background checkpoints rather than one. How- ever, each of those checkpoints fails to commit, due to concurrent updates to the join index summary (JIS), which used to result in a conflict. The initial policy of Vectorwise was to abort the checkpointing transaction upon such a conflict, retrying up to three times. Clearly, all subsequent efforts fail as well, resulting in wasted resources only.

With the deferred JIS maintenance from Section 7.6.2 in place, PDT inserts into the lineitem table that occurred during a background checkpoint are properly applied to the newly checkpointed image once the checkpoint commits, eliminating the need to abort with a conﬂict. After adding this functionality, the drops in PDT memory consumption in Figure 7.11a indicate that checkpoints commit successfully now. Overall, however, PDT memory consumption stays considerably higher than in the non-clustered scenario. The reason is twofold.

First of all, rebuilding a clustered layout is more complex and expensive, as updates are scattered over the item and lineitem tables. Therefore, both tables have to be rebuilt in their entirety. Also, connected tables within a cluster need to be rebuilt in one go, using merge-joins to rebuild the join indices between parent and child tables. Non-clustered heap tables, on the other hand, have most updates in the tail, and can be checkpointed independently of each other. The second reason for higher PDT memory consumption is that, while a checkpoint takes longer to complete, the arrival rate of new updates during such a checkpoint is about a factor two higher, simply because query throughput on the clustered layout is roughly twice as high. This could be remedied by using

7.7. EXPERIMENTS AND OPTIMIZATIONS 179 0 20 40 60 80 100 120 140 160 1 5 10 15 20 25 30 0 5 10 15 20 Total time (s) PDT memory (GB) Run # queries rf1 rf1 commit rf2 rf2 commit PDT memory 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

(a) Without automatic checkpointing.

0 20 40 60 80 100 120 140 160 1 5 10 15 20 25 30 0 5 10 15 20 Total time (s) PDT memory (GB) Run # queries rf1 rf1 commit rf2 rf2 commit PDT memory 0 0 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 0 0 0 0 0 0 0 0

(b) Automatic checkpointing fails to commit

180 CHAPTER 7. INDEX MAINTENANCE 0 20 40 60 80 100 120 140 160 1 5 10 15 20 25 30 0 5 10 15 20 Total time (s) PDT memory (GB) Run # queries rf1 rf1 commit rf2 rf2 commit PDT memory 0 0 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5

(a) 128GB memory configuration

0 20 40 60 80 100 120 140 160 1 5 10 15 20 25 30 0 5 10 15 20 Total time (s) PDT memory (GB) Run # queries rf1 rf1 commit rf2 rf2 commit PDT memory 0 0 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5

(b) 256GB memory configuration (in-memory checkpointing)

7.7. EXPERIMENTS AND OPTIMIZATIONS 181

a storage subsystem that is more in balance with the processing power of our 16-core 256GB NUMA machine. The three-way HDD RAID in this platform is rather unfortunate, as reading and rebuilding the TPC-H database consumes around 5 minutes I/O-wise.

Poor I/O performance is not the only reason for slow clustered checkpointing. In Figure 7.11a we also note several high spikes in execution time. These spikes correspond to runs that overlap a committing background checkpoint, after which we switch to the new table images. A fraction of those spikes can be attributed to an increase in the read-only “queries” component. As in the non-clustered case, this is caused by I/O misses due to evicted blocks after the switch. This time, however, doubling the server memory does eliminate most of the I/O misses, as indicated by the queries component in the bars of Figure 7.11b, but fails to eliminate the spikes entirely.

Analysis of execution profiles revealed that the remaining spikes are caused by checkpoint commit taking excessively long, often around 270-370 seconds. These commits reuse Vectorwise’s general transaction commit routine, which is guarded by a global mutex. Therefore, this mutex is also used to start, abort, or commit a regular transaction, meaning that, as long as a checkpoint (or regular transaction) is committing, no other transaction can start or finish. We will not go into much technical details on this, but a brief discussion of where time is spent during checkpoint commit aids in reaching our final conclusion.

In our clustered benchmarks, a typical checkpoint commit iterates a list of roughly 150 trans-PDTs, from transactions that committed during the checkpoints lifetime. Each refresh function generates two trans-PDTs, one for orders, one for lineitem. So, overall, we are dealing with 75 transactions, containing around 4GB of PDT data, the bulk of which is consumed by roughly 22 million lineitem inserts and 5.5 million inserts into orders. First of all, these PDTs are iterated in commit order, and propagated into two fresh (write-) PDTs, which eventually become the initial read-PDTs in the newly checkpointed image. As these propagations reuse the trans-to-write propagation of regular transaction commits, a copy-on-write snapshotting mechanism was triggered on every propagation into this ever growing write-PDT. Given that we did not switch to the checkpointed image yet, this write-PDT is still free from concurrent readers, making this snapshotting a needless waste of CPU time.

Next, both these initial read-PDTs are used to perform deferred index maintenance for the orders and lineitem tables. This involves iterating over the inserts in these PDTs and calling min-max and JIS maintenance routines. Es- pecially the min-max update routine turned out to be highly ineﬃcient for such bulk updates, involving a function call per update (i.e. per PDT insert tuple), where each tuple triggers two “compare” calls per attribute, to compare against the current min and max values. Overall, this took more than a minute to han- dle our 27.5 million insert updates. One can easily envision a “bulk” algorithm that should be capable to perform these updates in a few seconds. For example, given that the min-max SID-based partitioning is known, one could turn the process around: passing SID ranges to the PDT and letting it return min and max attribute values for the inserts found in each SID range. This could even be parallelized based on the SID partitioning.

Finally, the entire 4GB PDT data, together with the catalog changes for newly rebuilt tables, are being written to the WAL. Such “object serialization” involves a lot of byte-level work, and is therefore very slow, taking around 70

182 CHAPTER 7. INDEX MAINTENANCE 0 20 40 60 80 100 120 140 160 1 5 10 15 20 25 30 0 5 10 15 20 Total time (s) PDT memory (GB) Run # queries rf1 rf1 commit rf2 rf2 commit PDT memory 0 0 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5

Figure 7.12: Clustered TPC-H power runs with improved checkpoint commit times.

seconds. If checkpointing starts a fresh, private WAL, there is no need to do such serialization while holding the global mutex. If checkpointing updates the current WAL, there should not even be a need to serialize the PDT data, as that has already been done when the transactions committed originally. This logic should be carefully analyzed and improved within Vectorwise.

To conclude, most of the checkpoint commit work can be further optimized and should be moved outside the commit mutex wherever possible. This involves a signiﬁcant overhaul of the commit infrastructure, which, due to limited time and research interest, is considered future work for Vectorwise. Disabling the copy-on-write snapshotting for the newly built “checkpoint” PDTs is the only issue that could be improved with relative ease. The ﬁnal results with that optimization in place are shown in Figure 7.12. With a properly optimized implementation of checkpoint commit, however, (in-memory) results with checkpointing enabled should match, or even beat, the results without checkpointing from Figure 7.10a.

In document Updating compressed column-stores (Page 191-195)