7.5.6 Hardware Cost
Table 7.4 details the cost of SLICC’s and STREX’s hardware components. Section 7.3.2 describes the hardware components that Table 7.4 lists for SLICC, except for the thread queue, which holds threads waiting for cores. Each thread queue entry contains a unique numerical ID, a pointer to the thread’s context, and a core ID. The thread queues can be local to each core or centralized at one core. The table shows the cost for a centralized queue; fewer entries are required when the queues are local to each core. The team management table is responsible for forming teams of similar threads. Each entry consists of a unique numerical ID, a type ID, a team ID, an index within the team, and a timestamp. The team management table is best thought of as centralized, since every core needs to know which cores are assigned to which teams. We can have either one centralized copy or per-core copies that are kept coherent. For this work, we simulated a centralized copy at one of the cores and modeled the necessary traffic. On each core, a SLICC agent is responsible for managing the thread queue. The thread queue is a circular FIFO buffer, and the first entry is executed until it migrates, completes, or blocks for I/O. In the last case, the thread is moved to the end of the queue. With an over-provisioned thread queue of 30 threads and a copy of the team management table, per core, SLICC requires a maximum of 966 bytes in addition to logic. None of the logic operations for SLICC are on the critical path of transaction execution.
On the other hand, STREX utilizes two main units: a team formation unit and a thread scheduler unit. The team formation unit is used to group similar transactions into teams, as in SLICC. In this work, STREX searches through a window of 30 threads. The team management
Table 7.4: Hardware space cost of SLICC and STREX.

Cache Monitor Unit
Component                        SLICC                               STREX
Missed-Tag Queue (MTQ)           60 bits (16-core, matched_t = 4)    NA
Miss Shift-Vector (MSV)          100 bits                            NA
Cache Signature (Bloom Filter)   2K bits                             NA
Total                            2208 bits (276 Bytes)               0 bits

Thread Scheduler
Component                        SLICC                               STREX
Thread Queue                     30 entries (12-bit ID,              20 entries (12-bit ID,
                                 48-bit pointer to thread context,   48-bit pointer to thread context,
                                 4-bit core ID)                      1-bit lead flag)
phaseID Counter                  NA                                  8 bits
Auxiliary phaseID Table          NA                                  8 bits per cache block
                                                                     (512 cache blocks)
Total                            1920 bits (240 Bytes)               5324 bits (665.5 Bytes)

Team Formation
Component                        SLICC                               STREX
Team Management Table            60 entries (12-bit ID,              30 entries (12-bit ID,
                                 32-bit timestamp, 4-bit type ID,    32-bit timestamp, 4-bit type ID,
                                 4-bit team ID, 8-bit team index)    4-bit team ID, 8-bit team index)
Total                            3600 bits (450 Bytes)               1800 bits (225 Bytes)

Grand Total                      7728 bits (966 Bytes)               7124 bits (890.5 Bytes)
table maintains information about threads until they are dispatched to a core, and its entries consist of the same information as in SLICC. As detailed in Section 7.4.1, the thread scheduler unit is responsible for incrementing the phaseID counter, tagging cache blocks with the current phaseID value, keeping track of the lead thread, monitoring instruction cache block victims, and context switching threads. The thread queue is a circular FIFO buffer. Each entry consists of a unique ID, a pointer to the thread’s context in the L2 cache, and a lead flag bit. The size of the thread queue should be the maximum value allowed for the team_size configuration parameter. Most experiments set team_size to 10, with 20 being the maximum considered. Assuming one team management table per core, the total storage required per core by STREX is 890.5 bytes in addition to the logic.
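The totals reported in Table 7.4 can be re-derived from the entry counts and field widths. The following is a minimal sketch of that arithmetic; the bit widths are copied from the table, while the helper name table_bits and the dictionary keys are illustrative only, and the 2K-bit cache signature is assumed to mean 2048 bits (which is consistent with the 2208-bit subtotal).

```python
# Re-derive the storage totals of Table 7.4 from entry counts and field widths.
# Widths come from the table; names such as `table_bits` are illustrative.

def table_bits(entries, *field_bits):
    """Storage in bits of a table with `entries` rows of the given field widths."""
    return entries * sum(field_bits)

slicc = {
    "cache_monitor": 60 + 100 + 2048,               # MTQ + MSV + 2K-bit Bloom filter (assumed 2048)
    "thread_queue": table_bits(30, 12, 48, 4),      # ID, context pointer, core ID
    "team_table": table_bits(60, 12, 32, 4, 4, 8),  # ID, timestamp, type ID, team ID, team index
}
strex = {
    "thread_queue": table_bits(20, 12, 48, 1),      # ID, context pointer, lead flag
    "phase_id_counter": 8,
    "aux_phase_id_table": 8 * 512,                  # 8 bits per cache block, 512 blocks
    "team_table": table_bits(30, 12, 32, 4, 4, 8),
}

slicc_bits = sum(slicc.values())
strex_bits = sum(strex.values())
print(f"SLICC: {slicc_bits} bits = {slicc_bits / 8} bytes")  # 7728 bits = 966.0 bytes
print(f"STREX: {strex_bits} bits = {strex_bits / 8} bytes")  # 7124 bits = 890.5 bytes
```

The dominant STREX term is the auxiliary phaseID table (4096 bits), whereas SLICC’s cost is dominated by the larger team management table.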
7.6 Related Work
There have been several hardware and software proposals for reducing instruction stalls that are applicable to OLTP workloads such as instruction prefetching [52,53,101,105,161], computation spreading [30], and transaction batching [74].
Instruction prefetching is a well-studied research area. Stream buffers [101,161] are simple to implement in hardware, but they provide relatively low instruction coverage. More sophisticated prefetchers [52,53] utilize bookkeeping structures to record encountered instruction streams and to replay them when part of the stream is touched again. These structures increase area and energy. Moreover, prefetching, unless 100% accurate, increases miss traffic by fetching blocks that are never touched before being evicted. PIF [53] was reported to achieve near-optimal instruction coverage. Section 7.5 compares SLICC and STREX with PIF and shows that their performance is competitive while their hardware space cost is 40× lower than PIF’s. SHIFT [105] is a recent proposal that aims to minimize the space cost of PIF by sharing the instruction stream history across cores, which also exploits the observation of high temporal code overlap across concurrent threads in a system. Nevertheless, any prefetching technique is orthogonal to the scheduling mechanisms this chapter proposes. For example, STREX and SLICC can avoid many of the misses that PIF has to incur, thus possibly reducing the storage, power, and bandwidth overhead of PIF. Conversely, PIF could reduce execution time for the initial transactions, thus improving performance when used in conjunction with STREX or SLICC. Therefore, there is potential to investigate the combination of these proposals.
Chakraborty et al. show a high degree of redundancy in instruction fragments across threads concurrently running on multiple cores [30]. They propose CSP, which employs thread migration to distribute the dissimilar instruction code segments and group the similar ones together. For system code, which is commonly used by multiple threads, CSP fragments and distributes the code across a group of dedicated cores. CSP then migrates threads to these dedicated cores to execute system code. When threads are done, they return to their original cores to resume execution of the user-level code. Thus, CSP is limited to fragmenting OS code, losing opportunities for fragmentation within user code. SLICC borrows ideas from CSP but generalizes thread migration to include interleaved user-OS code fragmentation points. In addition, thread migration in SLICC is managed by the hardware, whereas with CSP the OS performs the migrations.
STEPS [74], on the other hand, is a software solution whose approach is identical in spirit to STREX. STEPS relies on manual code instrumentation, which is cumbersome, requires a high level of expertise, is error-prone, and results in code that is not portable because it is platform dependent. A slightly improved version, autoSTEPS, automates several components of the instrumentation process.
7.7 Conclusions
OLTP workloads suffer from high instruction miss stalls on high-end server processors because their transaction instruction footprints are far larger than current L1-I caches, which leads to ongoing cache thrashing. To exploit the significant temporal instruction overlap among similar transactions, this chapter presents two programmer-transparent scheduling mechanisms to increase instruction reuse in the caches. While SLICC adaptively spreads the execution of a transaction over multiple cores through thread migration, STREX time-multiplexes transactions on one core. Both enable reuse of common instructions by localizing them to cores. As a result, they improve performance over conventional transaction scheduling and exhibit performance competitive with state-of-the-art prefetchers despite significantly lower space cost. When the available aggregate L1 instruction cache capacity is enough to spread a workload’s instruction footprint, SLICC outperforms STREX, whereas under low core counts STREX should be the scheduling mechanism of choice.
8 Transaction-aware Instruction Chasing
The previous chapter (Chapter 7) aims to maximize instruction cache locality through two hardware mechanisms and surveys related work that proposes either software- or hardware-side solutions to the same problem. However, exploiting hardware resources based on hints given by the software side has not been widely studied for data management systems. This chapter presents ADDICT, a software-guided hardware mechanism that schedules transactions so as to maximize instruction cache locality.
ADDICT is based on the same observation that inspired the two hardware mechanisms in the previous chapter: concurrent transactions exhibit high instruction commonality (Section 6.6). However, ADDICT initially performs a profiling step to determine the most frequent actions of database operations, whose instruction footprint can fit in an L1 instruction cache, and assigns a core to execute each of these actions. Then, it schedules each action on its corresponding core. This way, it requires less hardware complexity and leads to more precise scheduling decisions. Our prototype implementation of ADDICT reduces L1 instruction misses by 85% and long-latency data misses by 20% compared to the conventional way of scheduling transactions. As a result, ADDICT yields up to a 50% reduction in the total execution time for the evaluated workloads. Furthermore, it is 20% and 35% faster than SLICC and STREX, respectively, on average.
8.1 Introduction
As discussed in the previous part, several workload characterization studies show that micro-architectural resources are severely underutilized when running online transaction processing (OLTP) applications [54,177,186] (see also Chapter 5). Up to 80% of the execution cycles go to memory stalls [54]. As a result, on modern processors, OLTP barely achieves one instruction per cycle (IPC), far below the processor’s peak capability of four IPC.
Previous work on reducing memory stall time for data management systems aimed at reducing cache miss rates, focusing primarily on improving locality and cache utilization for data rather than for instructions. Proposals range from cache-conscious data structures and algorithms [32,58] to sophisticated data partitioning and thread scheduling [154] on the software-side, whereas hardware techniques mainly target data prefetching [175].
However, as we have shown in Part II, for traditional transaction processing systems, the stall time due to L1 instruction misses is at least as problematic as long-latency data misses from the last-level cache. Improving code layout by writing better code or by compilation optimizations [159] does improve instruction cache utilization, but does so mainly by reducing conflict misses. Yet it is capacity misses that dominate L1 instruction misses on today’s most commonly used server hardware (Section 6.5.1); the instruction footprint of a transaction is too big to fit in the L1 caches, thus thrashing L1-I and leading to very lengthy stalls.
Chapter 7 proposes two hardware mechanisms, STREX and SLICC, which address capacity instruction misses in OLTP. STEPS [72,74] is a software mechanism with the same goal as STREX and SLICC. These proposals are motivated by the observation that threads executing transactions in parallel on a multicore server execute a significant amount of common code (Section 6.6). To be able to reuse the common instructions already brought into L1, STEPS [74] and STREX [15] time-multiplex a batch of threads on the same core, whereas SLICC [13,14] spreads the computation of a transaction to several cores to localize the common instructions to specific caches. Nevertheless, STREX and SLICC are completely oblivious to software and miss the opportunity to improve instruction locality more precisely through software guidance. STEPS, on the other hand, is a pure software technique designed to run only on a single core and requires significant manually aided instrumentation. Furthermore, all three techniques increase average transaction latency, and STREX and STEPS increase the potential for deadlocks due to extensive batching and context switching.
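To see why time-multiplexing a batch of similar threads enables instruction reuse, consider a toy model, given here only as a sketch and not as a model of STREX or STEPS themselves. It assumes a fully associative LRU instruction cache, transactions that all execute the same sequence of code phases, and phases that each just fit in the cache; the capacity, phase count, and helper names are hypothetical.

```python
from collections import OrderedDict

def count_misses(trace, capacity):
    """Count misses of a fully associative LRU cache over a block-address trace."""
    cache = OrderedDict()
    misses = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)       # hit: refresh the block's LRU position
        else:
            misses += 1
            cache[block] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used block
    return misses

# Hypothetical parameters, chosen only to make the effect visible.
CAPACITY = 32          # cache capacity in blocks
PHASES = 4             # code phases per transaction
BLOCKS_PER_PHASE = 32  # each phase just fits in the cache
THREADS = 10           # team of similar transactions

def phase_blocks(p):
    """Instruction blocks touched by phase p (identical for every thread)."""
    return [p * BLOCKS_PER_PHASE + i for i in range(BLOCKS_PER_PHASE)]

# Conventional: each transaction runs all of its phases before the next one starts.
conventional = [b for _ in range(THREADS)
                for p in range(PHASES) for b in phase_blocks(p)]
# Time-multiplexed: every thread in the team runs phase p before any runs phase p + 1.
multiplexed = [b for p in range(PHASES)
               for _ in range(THREADS) for b in phase_blocks(p)]

conventional_misses = count_misses(conventional, CAPACITY)  # 10 * 4 * 32 = 1280
multiplexed_misses = count_misses(multiplexed, CAPACITY)    # 4 * 32 = 128
print(conventional_misses, multiplexed_misses)
```

In this model, only the first thread to enter a phase pays its misses and the rest of the team hits in the cache, so misses drop by a factor equal to the team size. Real workloads overlap only partially, so the benefit is smaller in practice.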
The goal of this chapter is to better exploit the L1 caches when running transactions based solely on hints from the software side. The traditional way of scheduling transactions considers each as one big, monolithic task. Therefore, the granularity of tasks assigned to run on a core is too coarse, which leads to cache thrashing due to the large instruction footprint of the scheduled task. This work proposes to reduce the granularity of task-to-core assignment by scheduling the actions of common database operations. This approach bridges the gap between a transaction’s instruction footprint and the L1 capacity.
To assign finer-grained tasks to cores while running transactions, we design ADDICT, a transaction scheduling mechanism that chases instruction cache locality. ADDICT first segments a database operation into smaller actions, where the instruction footprint of each action fits in a single L1 instruction cache. Then, it assigns specific cores for each of these actions and migrates the transactions over multiple cores using core assignment decisions that aim to maximize instruction locality for each action.
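The two steps above can be illustrated with a small sketch. It is not ADDICT’s actual algorithm, which this section does not spell out. It assumes that profiling yields a trace of instruction block addresses for one database operation, cuts that trace greedily whenever the next block would exceed the L1-I capacity, and assigns actions to cores round-robin. The function names, the greedy cut, and the round-robin policy are all assumptions made for illustration.

```python
def segment_into_actions(trace, l1i_blocks):
    """Greedily cut a profiled block trace into actions whose footprint fits in L1-I."""
    actions, current, footprint = [], [], set()
    for block in trace:
        if block not in footprint and len(footprint) == l1i_blocks:
            actions.append(current)        # footprint is full: close this action
            current, footprint = [], set()
        current.append(block)
        footprint.add(block)
    if current:
        actions.append(current)
    return actions

def assign_cores(actions, num_cores):
    """Map action index -> core ID, round-robin (an assumed policy)."""
    return {i: i % num_cores for i in range(len(actions))}

# Hypothetical profile: one operation touching 10 distinct blocks, with two small loops.
profile = [0, 1, 2, 1, 2, 3, 4, 5, 6, 7, 6, 7, 8, 9]
actions = segment_into_actions(profile, l1i_blocks=4)
cores = assign_cores(actions, num_cores=4)

for i, action in enumerate(actions):
    print(f"action {i} -> core {cores[i]}: blocks {sorted(set(action))}")
```

A transaction executing this operation would then migrate from core to core as it moves from one action to the next, so that each core’s L1-I keeps serving the same action for all transactions.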