Average Reuse in an Instance - Inside Transactions

6.6 Inside Transactions

6.6.3 Average Reuse in an Instance

Figure 6.8demonstrates the instruction reuse frequency across different instances of transactions or database operations. However, it does not indicate how frequently a memory address is reused within each instance. Therefore, we also measure the average per instruction and per data address accesses within one instance of each transaction and database operation. Figure 6.9shows the results for theAccountUpdatetransaction and theinsert tupleoperation in TPC-B. The results for TPC-C and TPC-E and the other database operations share similar trends.

Figure 6.9omits the address labels on the x-axis, but places the addresses based on their frequency across different transaction instances (from left to right the frequency increases). The addresses on the right of the light-gray vertical line appear in all the instances.Figure 6.9 highlights that the frequently reused addresses across transaction and operation instances are also frequently reused within each instance.

6.7 Conclusions

Recent studies emphasize that there is still a clear mismatch between what modern hardware offers and what OLTP systems need. Memory stalls dominate the overall execution time, and in turn OLTP performance deteriorates and the underlying hardware remains largely underutilized.

We conduct a detailed trace simulation of instruction and data misses in OLTP benchmarks modeling the memory hierarchy of one of the most commonly used hardware types. The experimental results link the most important memory-related stall types to software components in the storage manager, and quantify the effect of increasing the data size. More specifically, our results demonstrate that the capacity related L1 instruction misses are the main cause of the stalls even when working with large memory-resident data sets followed by the compulsory long-latency misses. The index probe operation, which is the most frequent routine for OLTP workloads, is the fundamental cause of both data and instruction misses coming from different levels of the cache hierarchy. The misses coming from the B-tree, lock, and buffer pool management are the essential factor in the misses observed during an index probe.

On the other hand, we also perform a memory characterization study for the same benchmarks. The goal of this study is to better understand the similarities or differences of the instruction and data footprint across transactions to be able to get further insights on improving cache locality for OLTP applications. This study demonstrates that:

• Transactions exhibit high instruction overlap because of the common database operations they execute, especially among same-type transactions. This offers an opportunity to

achieve better L1-I cache locality by scheduling transactions in a way that would enable instruction reuse across transactions based on their common actions.

• The percentage of the data that is common across transactions is very low due to infrequent reuse of the database tuples. The few frequently used data are small-sized and mostly read-only. Accordingly it may be possible topinthem in the caches to improve data cache locality.

• The cache blocks that are highly common across different transaction instances tend to be more frequently reused within each instance. Therefore, any technique for improving cache locality for the common instructions and data across different instances also has potential to improve cache locality within each transaction instance.

To achieve a more graceful integration of hardware and software for OLTP systems, both of the layers should become more aware of each other. On the software side, reducing code complexity in the components mentioned above and designing more cache-friendly index structures are crucial. On the hardware side, as the next part (Part III) of this thesis also shows, dedicating several close-by cores for specific transaction operations can help reducing the capacity misses while exploiting instruction commonality across transactions as well as create opportunities for hardware specialization.

Part III

7 Boosting Instruction Cache Reuse in

OLTP

The previous part highlighted that the performance of online transaction processing workloads suffers from instruction stalls; the instruction footprint of a typical transaction exceeds by far the capacity of an L1 cache, leading to ongoing cache thrashing. Several proposed techniques remove some instruction stalls in exchange for error-prone instrumentation of the code, or a sharp increase in the L1-I cache unit area and power. Others reduce instruction miss latency by better utilizing a shared L2 cache. This chapter presents two hardware mechanisms that change the way we traditionally schedule transactions to minimize L1 instruction misses. Both of the mechanisms exploit the observation that OLTP transactions exhibit significant intra- and inter-thread overlap in their instruction footprint (Section 6.6).

SLICC is a programmer-transparent and low-cost technique that migrates threads spreading their instruction footprint over several L1 caches. Under SLICC, a transaction’s first iteration prefetches the instructions for the subsequent iterations or similar subsequent transactions. SLICC reduces instruction misses by 60% on average for TPC-C and TPC-E, thereby improving performance by 70%.

On the other hand, even though SLICC is promising for high core counts, it performs sub- optimally or hurts performance when running on few cores. Therefore, this chapter also presents STREX, another programmer-transparent hardware technique that exploits typical transaction behavior to improve instruction reuse in the L1 caches. STREX time-multiplexes the execution of similar transactions dynamically on a single core so that instructions fetched by one transaction are reused by all other transactions executing in the system as much as possible. Both SLICC and STREX dynamically slice the execution of each transaction. Experiments show that, when compared to baseline execution on 2 – 16 cores, STREX consistently improves performance (48% on average) while reducing the number of L1 instruction and data misses by 36% and 13% on average, respectively.1

7.1 Introduction

As discussed inChapter 5, existing server infrastructures are not tailored well to the needs of OLTP applications, with memory stalls accounting for 80% of the overall execution cycles, most of which are due to first-level instruction cache misses. Transactions of canonical OLTP systems are randomly assigned to worker threads, each of which usually runs on one core of a modern multicore system. The instruction footprint of a typical transaction does not fit into a single L1-I cache, thus, thrashing the cache and incurring high instruction miss rates. Although L2 and L3 caches are growing in size, today’s technology and CPU clock cycle constraints prevent deploying L1-I caches larger than 32KB.

Several works propose to alleviate instruction stalls using hardware [30,53,91,158] or software [74,159] techniques. Existing techniques effectively reduce the number of L1-I misses or the associated penalty when running OLTP workloads, but in return either require error-prone instrumentation to the software code base or employ hardware prefetching tables that more than double the area devoted to the L1-I cache units. Others reduce instruction miss latency by better utilizing the aggregate L2 cache capacity.

AsSection 6.5.1demonstrates, the instruction footprint of a typical OLTP transaction fits comfortably in the aggregate L1-I cache capacity of modern multicore or many-core chips. Provided that there is sufficient code reuse (Section 6.6.2), spreading the footprint of transactions over multiple L1-I caches has potential to reduce instruction cache misses. There- fore, this chapter initially proposes SLICC (Self-Assembly of Instruction Cache Collectives), a hardware technique that utilizes thread migration to minimize instruction misses for OLTP workloads. SLICC divides the instruction footprint of a transaction into smaller code segments and spreads them over multiple cores, so that each L1-I cache holds part of the instruction footprint. As part of this process the L1-I caches self-assemble to form acollectivethat reduces the instruction misses for this transaction and other similar ones. SLICC exploits intra- and inter-thread instruction locality in two orthogonal ways:

• A thread looping over multiple code segments spread over multiple caches observes a lower miss rate (as opposed to a conventional system in which each segment would evict the others from the cache), thereby avoiding thrashing.

• A preamble thread effectively prefetches and distributes common code segments for subsequent threads, thereby reducing the total miss rate.

As execution progresses, old cache collectives are naturally disassembled and new ones are formed to hold the footprints of new transactions.

On the other hand, as this chapter also shows, SLICC is not as effective and mayhurt performance when the footprint of all concurrently running transactions exceeds the available aggregate L1-I capacity. While the number of cores on-chip in main-stream servers is expected to grow in the future, the number of cores per application may not always be sufficient. Data

7.1. Introduction

center design and deployment and application trends influence the available per application core count.

• Current data center design trends are toward consolidating more virtual machines on servers as this increases utilization, improves security and energy efficiency, and reduces costs and management overhead [26,85,198].

• A data center may run multiple OLTP workloads, each with different transaction types. • While L1-I capacities remain cycle-time limited, OLTP transaction instruction footprints

are increasing. Transactions are becoming more complex and thus larger as a result of additional functionality such as data analytics (e.g., WebSphere [84], WebLogic [141]), or more complex logic.

Accordingly, it is also desirable to avoid SLICC’s performance cliff and to develop an instruction stall reduction technique that is effective irrespective of the number of available cores. The temporal code overlap among similar OLTP transactions (Section 6.6.2) also motivates STREX, a transaction scheduling mechanism that exploits inter-transaction locality by group- ing and synchronizing the execution of similar transactions into time slices. During each time slice, aleadtransaction brings into L1-I an instruction code segment that all other transactions ought to reuse. Ideally, when the transactions within a group overlap perfectly, only the lead transaction incurs all necessary misses, the misses that a transaction would incur on a conventional system anyhow. As a result of STREX’s time slicing, the remaining transactions find the instructions they need in L1-I.

This chapter has the following contributions and organization:

• Based on the observation that concurrent transactions on a multicore server exhibit significant code overlap (Section 6.6),Section 7.3presents the design of SLICC, a transaction scheduling mechanism that spreads the execution of a transaction over multiple cores to both exploit aggregate L1-I capacity and enable instruction reuse.

• In order to avoid the drawbacks of SLICC under insufficient core counts,Section 7.4 proposes STREX, a transaction scheduling mechanism that batches transactions on a single core to execute their common code segments one after the other, also exploiting instruction overlap across transactions.

• Section 7.5prototypes SLICC and STREX on an x86 multicore simulator and evaluates them using the TPC-C [191] and TPC-E [193] benchmarks. The evaluation demonstrates that STREX reduces L1-I miss rates (∼36%) and improves performance (35-55%) regardless of the core counts. On the other hand, while SLICC hurts performance under low core counts, it outperforms STREX (by∼16% on average) when there are enough cores to spread the instruction footprint of the workloads.

Transaction 1

A B C A B C A B C

Transaction 2 Transaction 3

Miss Overhead

(a) Conventional on one core

A A A B B B C C C (b) STREX T1 T2 T3 T1 T2 T3 T1 T2 T3 A B C (c) Conventional on a multicore A B C A B C Core 1: T1 Core 2: T2 Core 3: T3 A A A (d) SLICC B B B C C C T1 T1 T1 T2 T3 T2 T2 T3 T3 Time

Figure 7.1: Ways of scheduling transactions.

Finally,Section 7.2discusses how transaction behavior can be potentially exploited to improve instruction reuse in caches while giving a high-level overview for SLICC and STREX.Section 7.6reviews related work andSection 7.7presents our conclusions.

In document Transactions Chasing Scalability and Instruction Locality on Multicores (Page 139-146)