Computation at the L4 Cache - Understanding fundamental database operations on modern hardware

stored on intelligent RAM is not yet feasible and may never be. Another approach, taken by Netezza among others, is to introduce field-programmable gate arrays (FPGAs) into the data path between disk and memory. With those FPGAs it is possible to perform computations on the data before it is brought into the memory hierarchy. Such computations, for example filtering, aggregation, or compression, can increase the payload that is brought into memory. This approach has two major drawbacks. One is the energy efficiency of those FPGA elements [61]. The other arises only with the upcoming in-memory databases: there simply is no need to load any data from disk to answer queries.

5.6 Computation at the L4 Cache

The high throughput of the MVCL instruction led us to investigate the hardware implementation of this instruction further. The move of a whole memory page is performed without bringing all of its cache lines into the L1, L2, or even the L3 cache. Instead, 16 cache lines of the L4 cache are used as a buffer for the page that needs to be fetched and stored back. This leads to the idea of performing computations on this memory page while it is being moved from one page address in memory to the other, without bringing any data into CPU registers.

To evaluate such hardware we performed several simulations. The setup is as follows: We use two arrays A and B, each filled with 512 MB of data, stored in memory. These arrays correspond to columns in a column-store database, and we perform filtering and aggregation on those columns. As a baseline we perform filtering and aggregation in the CPU only. Additionally, we simulate the existence of filter and aggregation hardware as follows: First, we create and store a bit mask of the qualifying tuples upfront in memory. Afterwards, whenever the hardware should be used to filter or aggregate a page, we simply move the page to a scratch area and use the precomputed bit mask as if the move had created it. The time measurements of the simulated hardware filter and aggregation operations include the time to perform the page moves but not the time to create the bit mask in memory.
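The simulation procedure described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used for the measurements: the names `buildBitMask`, `simulatedFilterCount`, `kPageBytes`, and `kEntriesPerPage` are hypothetical, `EntryType` is assumed to be a 4-byte integer, the page size is assumed to be 4 KB, and `std::memcpy` stands in for the hardware page move.

```cpp
#include <algorithm>
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

using EntryType = std::int32_t;           // assumption: 4-byte column entries
constexpr std::size_t kPageBytes = 4096;  // assumption: 4 KB pages
constexpr std::size_t kEntriesPerPage = kPageBytes / sizeof(EntryType);

// Untimed preparation step: one bit per tuple, set if colA[i] == query.
std::vector<std::uint64_t> buildBitMask(const std::vector<EntryType>& colA,
                                        EntryType query) {
    std::vector<std::uint64_t> mask((colA.size() + 63) / 64, 0);
    for (std::size_t i = 0; i < colA.size(); ++i) {
        if (colA[i] == query) {
            mask[i / 64] |= std::uint64_t{1} << (i % 64);
        }
    }
    return mask;
}

// Timed step of the simulation: move every page of the column to a scratch
// area (memcpy is a stand-in for the page move), then count the set bits of
// the precomputed mask as if the move had produced it.
std::size_t simulatedFilterCount(const std::vector<EntryType>& colA,
                                 const std::vector<std::uint64_t>& mask) {
    assert(mask.size() == (colA.size() + 63) / 64);
    std::vector<EntryType> scratch(kEntriesPerPage);
    for (std::size_t first = 0; first < colA.size(); first += kEntriesPerPage) {
        const std::size_t n = std::min(kEntriesPerPage, colA.size() - first);
        std::memcpy(scratch.data(), colA.data() + first, n * sizeof(EntryType));
    }
    std::size_t count = 0;
    for (std::uint64_t word : mask) {
        count += std::bitset<64>(word).count();
    }
    return count;
}
```

In an actual timing run, an optimizing compiler may remove the copy into the otherwise unused scratch buffer, so a measurement based on this sketch would need to keep the copy observable.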

Listing 5.1 contains the C++ code that was used to count the qualifying entries in the table. The first, very simple query we look at counts the number of qualifying tuples under an equality predicate: SELECT COUNT(*) WHERE A=0. Figure 5.4 shows the simulated runtime when varying the selectivity of the filter condition. We see that the page move, and therefore our expected filter runtime, takes only half the time of standard CPU filtering. Comparing the green line with the red line, we also note that the cost to count the set bits in the bit mask is rather small compared with the cost to create the bit mask. Nevertheless, this additional cost can be avoided by counting the number of qualifying values already in the filter engine.

Listing 5.1: Code for counting qualifying tuples in CPU

    int queryForCount(EntryType query, EntryType *table, int tableSize) {
        int count = 0;
        for (int i = 0; i < tableSize; i++) {
            // The comparison yields 0 or 1, so no branch is needed.
            count += (table[i] == query);
        }
        return count;
    }

Figure 5.4: Simulated Filter Performance

When we now look at a slightly more complex query, namely SELECT SUM(B) WHERE A=0, we can perform it in at least three different ways. First, we can perform the aggregation in the CPU by reading elements from the array A until we find a qualifying element, then reading the corresponding element of B and adding it to the running sum. Listing 5.2 shows the code that performs the aggregation fully in the CPU. Second, we can compute a bit mask by streaming A through the new page move engine and then use this bit mask in the CPU to decide which elements to load from B. Third, after computing the bit mask by streaming A through the new page move engine, we can also stream the array B together with the bit mask through the page move engine, which now performs an aggregation-under-mask operation.


Listing 5.2: Code for aggregating all qualifying tuples in CPU

    EntryType sum_B_where_A_0(EntryType *colB, EntryType *colA, int tableSize) {
        EntryType sum = 0;
        for (int i = 0; i < tableSize; i++) {
            // Only touch colB when the tuple qualifies.
            if (colA[i] == 0) sum += colB[i];
        }
        return sum;
    }
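For comparison, the second option (using an already computed bit mask in the CPU to decide which elements of B to load) might look roughly like the following sketch. The function name `sumUnderMask` and the mask layout (one bit per tuple, 64 tuples per word, lowest bit first) are assumptions made for illustration. Unlike Listing 5.2, the sketch accumulates into a 64-bit integer to avoid overflow, and `EntryType` is assumed to be a 4-byte integer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using EntryType = std::int32_t;  // assumption: 4-byte column entries

// Sums colB[i] for every tuple i whose bit is set in the mask.
// Precondition: no bit at or beyond position colB.size() is set.
std::int64_t sumUnderMask(const std::vector<EntryType>& colB,
                          const std::vector<std::uint64_t>& mask) {
    assert(mask.size() == (colB.size() + 63) / 64);
    std::int64_t sum = 0;
    for (std::size_t w = 0; w < mask.size(); ++w) {
        std::uint64_t word = mask[w];
        // A zero word ends the inner loop immediately, so the corresponding
        // 64 entries of colB are never loaded.
        for (std::size_t bit = 0; word != 0; ++bit, word >>= 1) {
            if (word & 1u) {
                sum += colB[w * 64 + bit];
            }
        }
    }
    return sum;
}
```

When very few tuples qualify, most mask words are zero and large parts of B are skipped entirely, which is the effect the second option relies on.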

Figure 5.5: Simulated Aggregation and Filter Performance

The runtime of those three approaches can be seen in Figure 5.5. The blue line corresponds to the first option, using only the CPU; the green line to the second, using the created bit mask as a filter when performing the summation in the CPU; and the violet line to the third, creating the bit mask and performing the aggregation in the smart cache. When the selectivity is very high, i.e., almost no elements qualify, it pays off to load only the needed elements into the CPU instead of performing the aggregation in the page move engine. The break-even point is around a selectivity of 5%.
