4.2 Vectorized in-cache execution model
4.2.2 In-cache execution
MonetDB/X100 primitives are similar to MonetDB operators in many ways: they use highly specialized code, achieve high CPU efficiency, and consume and produce arrays of values. In MonetDB this last property can lead to the materi- alization of large intermediate results, which results in a significant performance degradation, as discussed in Section 3.4 – the main memory bandwidth is often simply too low to sustain the data hunger of the CPU-efficient primitives. To avoid this problem, MonetDB/X100 exploits the fact that cache-memory band- width is significantly higher than that of RAM, and introduces a principle of in-cache execution. The idea is presented in the left-hand side of Figure 4.1: vectors are organized in such a way that their entire memory footprint is small enough to fit in the CPU cache. This allows keeping the materialized results on the CPU die, without the need of expensive writes into main memory.
RAM mul1 mul2 CPU cache add1 l_tax l_extendedprice l_quantity netto_value tax_value total_value 0.5 1 2 5 10 20 50 100 200 1 32 1K 32K 1M Execution time (cycles/operation)
Vector size (tuples) mul1 mul2 add
Figure 4.3: Plan of a simple query in MonetDB/X100, demonstrating in-RAM or in-cache data place- ment
Figure 4.4: Impact of data location and vector size on primitive performance, using a query from Figure 4.3
analyze the performance of a simple X100 query, equivalent to this SQL state- ment:
SELECT l_quantity * l_extendedprice AS netto_value, netto_value * l_tax AS tax_value,
netto_value + tax_value AS total_value FROM lineitem
The simplified plan for this computation is presented in Figure 4.3. It shows that the base columns are read from main memory (storage manager buffers), while the intermediate results stay in the CPU cache. Figure 4.4 demonstrates the speed of the primitives depending on the used vector size. For small vector sizes, the execution time is dominated by the per-vector logic. On the other hand, for large vector sizes, the vectors do not fit in the CPU cache anymore, causing expensive main-memory traffic.
The performance of the individual primitives depends on the location of their input data. As Figure 4.4 shows, the mul1() primitive is significantly slower than add1() – the reason is that it needs to read a large amount of data from main memory. In this case, with the optimal vector size, it spends 3.5 CPU cycles to add two 4-byte wide values. On a 2.16GHz machine, this results in a memory bandwidth of ca. 4.8 GB/sec. On the other hand, add1 only spends 0.9 cycles on the equivalent computation (on this platform multiplication and addition exe- cute at the same speed), because it can quickly access all its input data in the CPU cache. The performance of mul2 lands in between (2.1 cycles), as only one of its inputs is RAM-resident. This experiment demonstrates that even with se-
Section 4.2: Vectorized in-cache execution model 73
quential access keeping the data in-cache can result in a significant performance improvement and explains the performance benefit of MonetDB/X100 over the MonetDB-style processing.
4.2.2.1 Cache interference
Many cache-conscious data processing techniques implicitly assume that dur- ing the operation of a given algorithm it has an exclusive ownership of the cache [MBK02]. However, in real-life scenarios, it is often the case that multiple queries are active at the same time in the system, and the execution context of the CPU changes quite frequently. As a result, the cache content useful for one query can be replaced with other data. This problem has been demonstrated e.g. in [CAGM07, Section 8.2.5] where the performance of the hash-join algorithm based on cache-partitioning has been shown to deteriorate significantly (up to 78%) when the cache content is periodically flushed.
To analyze the impact of cache interference on the performance of Mon- etDB/X100 in-cache processing, two parts of the execution layer need to be considered: the iterator pipeline and the individual operators. The impact on the operators is algorithm-specific and not directly related to the chosen execution model. For example, the cache-partitioned hash-join will suffer from the cache interference problems in all the execution models: tuple-based, column-based, and vector-based. Therefore, we postpone the discussion of cache-conscious op- erator implementations to Chapter 5, and for now focus on the execution model itself.
An execution context change in the CPU can occur when an application performs a system call, e.g. by requesting an I/O operation. Also, the kernel can decide to preempt the current process, e.g. when an interrupt happens or the quantum of time assigned to this process is finished. In the scenarios MonetDB/X100 is targeted at, the system is typically performing relatively few and large I/Os, hence the interrupts do not occur often. With the typical kernel scheduling frequency in the range of 100 times per second (10ms), it is reasonable to assume that an executing process has an uninterrupted slice of time in the range of 0.1 to 10 ms.
To evaluate the robustness of the vectorized in-cache execution model pro- posed in MonetDB/X100, we analyze the performance of the TPC-H Query 1 with enforced cache-flushes happening at different intervals. Query 1 is a good candidate to demonstrate the impact on the iterator pipeline itself, as all the operators except for the aggregation are stateless, and the aggregation uses a very small hash-table, as it only computes 4 output values. As a result, only the
0 5 10 15 20 25 30 35 40 45 50 100 300 1000 3000
Cache-flush CPU usage / query slowdown (%)
Cache-flush sleep time (usec) Cache-flush CPU usage
X100 query slowdown
Figure 4.5: Impact of the cache-flushing on the TPC-H Query 1 performance in MonetDB X100 (2.4 GHz Athlon64, 512KB L2 cache)
vectors containing the data passed between the operators use the CPU cache. In the experiment presented in Figure 4.5, a series of queries is running, and, for part of the queries, a separate program is forcing the cache-flush by touching all the cache lines in a separate cache-sized main-memory area.
As the results show, the slowdown of the queries is in line with the CPU time taken by the flushing program, and not significantly higher. This demonstrates that the vectorized in-cache model is relatively robust to cache interference. One of the reasons the impact of the cache flushing is so marginal, compared to the results from [CAGM07], is that for most operations in this query the data accesses are fully sequential. As a result, if the cache-misses start to occur at the beginning of the vector, the hardware-prefetching mechanisms available in most modern CPUs (see Section 2.2.3) will be triggered and will start loading the remaining part of the data. Additionally, even with a 0.1ms time slice, tens or hundreds of primitives can execute, exploiting the exclusive access to the cache during this period.
4.2.2.2 Vector size and allocation
For high execution efficiency, it is important that the vector sizes in the vector- ized execution model are on the one hand as large as possible, to minimize the interpretation overhead, and on the other hand not too large, to fit in the CPU cache. Two aspects of this issue need to be analyzed: how many vectors need to be allocated, and the size of each vector.
Section 4.2: Vectorized in-cache execution model 75
In the existing MonetDB/X100 implementation, every primitive uses its in- put vectors and introduces a new result vector, with vector sizes fixed in the entire query plan. The only optimization currently applied is that if one of the primitive input vectors has only one parent, and shares the same datatype with the result vector, it can be reused. This is similar to the register allocation prob- lem, found in compiler optimization (see e.g. [HKMW66, AC76]). It is possible to see physical vectors as vector registers and then map logical vectors used by the primitives to them. Such an optimization can reduce the number of vectors within a given query.
An interesting point in vector allocation optimization is that different vectors can have different datatype widths. As such, for optimal results, the classical register allocation scheme needs to be extended to be able to e.g. reuse a 64-bit wide vector space as two 32-bit vectors. Another possible optimization exploits the fact that query graphs can often be decomposed into a collection of pipelin- able sub-graphs (separated by e.g. blocking operators). Then, vector allocation for each sub-graph can be performed separately.
Once the number of vectors for a given query sub-graph is known, we can find the vector size by dividing the available D-cache size by the number of vectors. Typically, not the entire cache can be used, as there is space overhead related to operator state, especially for blocking operators with a potentially large state (e.g. hash aggregation). Another problem is related to the trade- offs between targeting different CPU cache levels. While, as presented earlier, L1 can provide much better performance than L2, it can require significantly smaller vectors. The choice should be based on the complexity of the query: for very simple queries L1 should be targeted, and for queries with more vectors, it should be L2.