2.4 Genomic Multidimensional Range Query Benchmark (GMRQB)
2.4.3 Realistic Range Query Templates
In collaboration with bioinformaticians[28], we designed the Genomic Multidimensional Range Query Benchmark (GMRQB), a workload of eight realistic complete- and partial-match range query templates, which are applied to the data set described in Section 2.4.2. In addition to the eight query templates, we also provide a mixed workload that randomly interleaves all templates, which enables experiments with changing workload patterns.
These range queries retrieve specific subsets of genomic variants that are interesting for further analysis. They specify range predicates on some, most, or all dimensions. Depending on the dimension, predicates may specify either single points or ranges of different sizes. For instance, we always use a point query for the dimension gender, yet always apply range predicates to the dimension location. All queries of GMRQB restrict the genomic location, i. e., the attributes chromosome and location.
Queries in the workload are templates that have to be instantiated with meaningful values. For the genomic location, we use the RefSeq database[29] to align genomic ranges to coding regions. All other variables are filled with randomly selected values found in the original data. Listing 2.1 shows Query Template 3 from GMRQB, written as a SQL statement to ease readability[30]. Like all query templates, it restricts the search to a certain genomic region defined by a chromosome and a location range. Additionally, it retrieves only variants that were found in an individual of a particular gender.
Listing 2.1: Query Template 3 of GMRQB.
SELECT * FROM variants
WHERE chromosome BETWEEN ? AND ?
  AND location BETWEEN ? AND ?
  AND gender = ?;
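Since implementations of MDIS typically expect a range query as a lower and an upper boundary vector rather than as SQL, Query Template 3 could be translated roughly as in the following minimal sketch. The dimension order, the numeric encoding of gender, the fourth dimension quality, and all parameter values are assumptions made for illustration; dimensions without a predicate are simply left unbounded.

```python
import math

# Hypothetical dimension order. The real data set has more dimensions;
# "quality" is a made-up stand-in for any dimension the template ignores.
DIMENSIONS = ["chromosome", "location", "gender", "quality"]


def to_bounds(predicates):
    """Turn {dimension: (low, high)} into a (lower, upper) vector pair.

    Dimensions without a predicate stay unbounded, which is what makes
    the query a partial-match query.
    """
    lower = [-math.inf] * len(DIMENSIONS)
    upper = [math.inf] * len(DIMENSIONS)
    for name, (low, high) in predicates.items():
        i = DIMENSIONS.index(name)
        lower[i], upper[i] = low, high
    return lower, upper


def matches(point, lower, upper):
    """Check whether a data point lies inside the query hyperrectangle."""
    return all(lo <= v <= hi for lo, v, hi in zip(lower, point, upper))


# Query Template 3 instantiated with made-up parameter values.
template3 = {
    "chromosome": (7, 7),
    "location": (1000, 2000),
    "gender": (1, 1),  # point predicate: lower bound equals upper bound
}
lower, upper = to_bounds(template3)
```

A point predicate such as the one on gender is expressed by setting both boundaries to the same value, so point and range predicates share one representation.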
For each query template, Table 2.2 shows the average selectivity and the average number of queried dimensions. Except for Query Template 8, all query templates are partial-match queries. For completeness, Appendix A provides all query templates.
The data set has three disadvantages with regard to the evaluation of MDRQ:
• We had to use hashing to transform attributes originally stored as strings into floating-point values. Unfortunately, such transformations prevent meaningful range queries on these dimensions.
• Some dimensions, e. g., gender or reference genome, have only very few distinct values. As a consequence, range queries on these dimensions effectively turn into point queries.
• Although the queries of the benchmark resemble a real-world interactive analysis of genomic variant data, these queries were not extracted from real applications. It would be very interesting to monitor workloads from researchers in, for instance, precision medicine and add these queries to GMRQB.

GMRQB Query Template   Average Selectivity          Average Number of Queried Dimensions
Query Template 1       10.76% (σ = 7.24%)           2 (σ = 0.0)
Query Template 2       2.19% (σ = 2.27%)            5 (σ = 0.0)
Query Template 3       5.36% (σ = 3.61%)            3 (σ = 0.0)
Query Template 4       0.22% (σ = 0.15%)            4 (σ = 0.0)
Query Template 5       0.20% (σ = 0.15%)            5 (σ = 0.0)
Query Template 6       0.11% (σ = 0.11%)            6 (σ = 0.0)
Query Template 7       0.05% (σ = 0.06%)            7 (σ = 0.0)
Query Template 8       0.00001% (σ = 0.00002%)      19 (σ = 0.0)
Mixed Workload         1.58% (σ = 3.58%)            5.81 (σ = 4.11)
Table 2.2: The query templates of the GMRQB.

[28] We would like to thank the bioinformaticians from our working group, especially Yvonne Lichtblau, for their valuable feedback on the design of the GMRQ Benchmark.
[29] RefSeq: NCBI Reference Sequence Database, https://www.ncbi.nlm.nih.gov/refseq/, Last access: August 29, 2018.
[30] Implementations of MDIS typically require multidimensional range queries to be specified as two vectors, where the first (second) vector denotes the lower (upper) boundary of the range query.
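The first limitation listed above, hashing string attributes into floating-point values, can be illustrated with a small sketch. The concrete hash function (MD5, truncated to 8 bytes) and the gene names are assumptions chosen for illustration only; the text does not state which hash function or which string attributes were used.

```python
import hashlib


def hash_to_float(value: str) -> float:
    """Map a string to a float between 0 and 1 (assumed scheme: MD5 prefix)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def order_preserved(values):
    """True if sorting by hash yields the same order as sorting the strings."""
    return sorted(values) == sorted(values, key=hash_to_float)


# Made-up string attribute values.
genes = ["BRCA1", "BRCA2", "EGFR", "KRAS", "TP53"]
hashed = {g: hash_to_float(g) for g in genes}

# Point queries still work: equal strings map to equal floats.
point_hits = [g for g in genes if hashed[g] == hash_to_float("KRAS")]

# A range over the hashed values selects whatever happens to hash into the
# interval, not the strings that lie between the two bounds alphabetically.
lo, hi = sorted((hashed["BRCA1"], hashed["EGFR"]))
range_hits = [g for g in genes if lo <= hashed[g] <= hi]

print(point_hits)
print(range_hits, order_preserved(genes))
```

Because a hash function is not designed to preserve lexicographic order, only equality predicates remain meaningful on such dimensions.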
3 CSSL: Processing One-Dimensional Range Queries in Main Memory
This chapter addresses one-dimensional main-memory index structures, of which many have been proposed over the last years, e. g., the adaptive radix tree (ART) [Leis et al. 2013], the fast architecture sensitive search tree (FAST) [C. Kim et al. 2010], or the cache-sensitive B+-tree (CSB+-tree) [Rao et al. 2000]. These are typically based on the concepts of traditional index structures, e. g., B-trees [Bayer et al. 1972], radix trees [Morrison 1968], or hash tables [Garcia-Molina et al. 2000], but adapt them to the needs of main-memory settings. Like disk-based index structures, which optimize data transfers between external and main memory, in-memory index structures aim to work as much as possible on data held in higher, faster levels of the memory hierarchy when evaluating search queries, which boils down to minimizing CPU cache misses. Such optimizations are motivated by the analyses of Ailamaki et al. [Ailamaki et al. 1999], which identified last-level cache (LLC) misses as one of the major contributors to the runtimes of database workloads on modern hardware.
Existing in-memory index structures mainly focus on achieving high lookup performance, but neglect range queries, despite their numerous applications and use cases (see Introduction). Most hash tables lack pruning capabilities for range queries anyway, because they do not store data in sorted order. However, many in-memory tree variants, such as ART or the CSB+-tree, also show poor search efficiency when executing range queries. Search trees keep data in sorted order and implement range queries in two steps: they look up the smallest matching element and then iterate over all consecutive elements until a mismatch occurs. Most in-memory approaches optimize the first step, typically implemented as a lookup operation, but neglect the second step, which often requires chasing many pointers with random accesses.
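The two-step range query described above can be sketched on a chain of linked leaf nodes, as found at the bottom of a B+-tree-like structure. This is a simplified illustration and not the implementation of any of the cited index structures: the names Leaf, build_leaves, and range_query are hypothetical, and a binary search over the largest key of each leaf stands in for the descent through inner nodes.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Leaf:
    keys: List[int]
    next: Optional["Leaf"] = None  # pointer to the right sibling


def build_leaves(sorted_keys, fanout=4):
    """Pack sorted keys into leaves and link each leaf to its successor."""
    leaves = [Leaf(sorted_keys[i:i + fanout])
              for i in range(0, len(sorted_keys), fanout)]
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    return leaves


def range_query(leaves, low, high):
    """Return all keys k with low <= k <= high."""
    # Step 1: lookup. Find the first leaf whose largest key is >= low.
    last_keys = [leaf.keys[-1] for leaf in leaves]
    start = bisect_left(last_keys, low)
    if start == len(leaves):
        return []
    # Step 2: iterate over consecutive elements until the first mismatch.
    # Every hop to leaf.next is a pointer dereference and thus a potential
    # cache miss on real hardware.
    result = []
    leaf = leaves[start]
    while leaf is not None:
        for key in leaf.keys:
            if key > high:
                return result
            if key >= low:
                result.append(key)
        leaf = leaf.next
    return result


leaves = build_leaves(list(range(0, 40, 2)))
print(range_query(leaves, 5, 17))  # [6, 8, 10, 12, 14, 16]
```

The larger the result set, the more the cost of the second step dominates, which is why optimizing only the initial lookup helps little for range queries.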
The major challenge for an efficient in-memory execution of range queries, especially for queries with a moderate or a low selectivity, is random data access, which induces cache misses and leads to CPU stalls [Ailamaki et al. 1999]. Taking this observation into account, it is not surprising that, in main memory, sequential full-table scans outperform tree-based index structures for range queries with selectivities of approximately 1% or larger [Das et al. 2015].
In this chapter, we present cache-sensitive skip lists (CSSL), a novel main-memory index structure based on conventional skip lists [Pugh 1990; Munro et al. 1992]. CSSL employ a specific memory layout to take maximal advantage of the features of modern CPUs, e. g., multi-level cache hierarchies, SIMD instructions, and pipelined execution. They store data such that the range query operator can traverse matching elements almost sequentially, which exploits cache line prefetching and strongly reduces CPU cache and TLB misses. Moreover, the memory layout enables a vectorization of the range query algorithm.
In summary, this chapter makes the following contributions:
• We propose the cache-sensitive skip list, a main-memory index structure offering efficient execution of range queries.
• We show how to apply SIMD instructions to the range query operator of skip lists.
• We compare CSSL with other main-memory index structures using different workloads on synthetic and real-world one-dimensional data sets.
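The general idea of keeping index levels in contiguous arrays, so that a range query ends in a sequential scan instead of pointer chasing, can be sketched as follows. This is a deliberately simplified illustration under our own assumptions and not the CSSL layout or algorithm of Section 3.3: the class name ArrayLanes, the parameters skip and levels, and the static (update-free) design are hypothetical, and Python lists only stand in for densely packed arrays. SIMD is not modeled.

```python
class ArrayLanes:
    """Sorted keys in one array plus lanes holding every skip-th key
    of the level directly below (an assumed, simplified layout)."""

    def __init__(self, keys, skip=4, levels=2):
        self.keys = sorted(keys)
        self.skip = skip
        self.lanes = []
        below = self.keys
        for _ in range(levels):
            lane = below[::skip]  # entry j mirrors below[j * skip]
            self.lanes.append(lane)
            below = lane

    def first_at_least(self, low):
        """Index in self.keys of the first key >= low."""
        pos = 0
        for lane in reversed(self.lanes):  # start at the topmost lane
            # Move right while the next lane entry is still below `low`.
            while pos + 1 < len(lane) and lane[pos + 1] < low:
                pos += 1
            pos *= self.skip  # descend to the mirrored position below
        while pos < len(self.keys) and self.keys[pos] < low:
            pos += 1
        return pos

    def range_query(self, low, high):
        """Return all keys k with low <= k <= high."""
        start = self.first_at_least(low)
        end = start
        # Matching keys are adjacent in memory, so this is a plain scan.
        while end < len(self.keys) and self.keys[end] <= high:
            end += 1
        return self.keys[start:end]


index = ArrayLanes(range(0, 100, 3), skip=4, levels=2)
print(index.range_query(10, 25))  # [12, 15, 18, 21, 24]
```

In this sketch, the lookup touches only a few entries per lane, and the second step of the range query reads adjacent array slots, which is the access pattern that hardware prefetchers handle well.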
The remainder of this chapter is organized as follows. In Section 3.1, we present work related to CSSL. Section 3.2 introduces skip lists, the index structure that CSSL are based on. Section 3.3 presents the foundational concepts behind CSSL and describes algorithms for executing lookups and range queries; we also show how to process updates. Section 3.4 compares CSSL with state-of-the-art main-memory index structures and Section 3.5 summarizes this chapter.
Parts of this chapter have been previously published in [Sprenger et al. 2016].
3.1 Related Work
Although concepts like cache-aligned data layouts, index traversal with SIMD instructions, and pointer elimination have been investigated before [C. Kim et al. 2010; Rao et al. 1999; Rao et al. 2000], to the best of our knowledge, we are the first to combine these to accelerate range queries in one-dimensional main-memory index structures.
Skip lists [Pugh 1990] were proposed as a probabilistic alternative to B-trees [Bayer et al. 1972]. They have been applied to multiple areas and have been adapted to different purposes, e. g., lock-free skip lists [Fomitchev et al. 2004], deterministic skip lists [Munro et al. 1992], or concurrent skip lists [Herlihy et al. 2006]. In [Xie et al. 2016], Xie et al. present a parallel skip list-based main-memory index, named PI, that processes batches of queries using multiple threads. Skip lists are not only of interest for researchers, but also part of several modern database management systems. The main-memory database system MemSQL [Chen et al. 2016] uses skip lists to implement secondary indexes[1] and the key-value store Redis [Carlson 2013] employs them to manage sorted sets[2]. CSSL are based on deterministic skip lists [Munro et al. 1992], but employ a cache-friendly data layout tailored to modern CPUs and beneficial for the execution of range queries.
There are several other approaches addressing in-memory indexing [M. Böhm et al. 2011; C. Kim et al. 2010; Kissinger et al. 2012; Leis et al. 2013; Rao et al. 1999; Rao et al. 2000], yet few specifically target range queries. Cache-sensitive search trees (CSS-trees) [Rao et al. 1999] build a tree-based dictionary on top of a sorted array that is tailored to the properties of the cache hierarchy, e. g., the sizes of the cache lines, and can be searched in logarithmic time. CSS-trees are static by design and need to be completely rebuilt when ingesting updates. Rao and Ross [Rao et al. 2000] introduce the CSB+-tree, a cache-conscious B+-tree [Comer 1979], which minimizes pointer usage and reduces space consumption. As shown in Section 3.4, CSSL outperform CSB+-trees significantly for all considered workloads.
[1] The Story Behind MemSQL’s Skiplist Indexes - MemSQL Blog, http://blog.memsql.com/the-story-behind-memsqls-skiplist-indexes/, Last access: August 29, 2018.
[2] An introduction to Redis data types and abstractions - Redis, https://redis.io/topics/data-types-intro, Last access: August 29, 2018.