• No results found

Data Management Systems

N/A
N/A
Protected

Academic year: 2021

Share "Data Management Systems"

Copied!
57
0
0

Loading.... (view fulltext now)

Full text

(1)

Data Management Systems

Query Processing

Execution models

Optimization – heuristics &

rewriting

Optimization – cost models

Operators

Gustavo Alonso

Institute of Computing Platforms

Department of Computer Science

ETH Zürich

Access to base tables

Sorting and Aggregation

Joins

(2)

Starting point

Data and indexes are stored in blocks as part of extents

Data from the base tables must be read from the buffer cache and

into the working space of the query (will be processed while reading

it)

Basics:

• Minimize I/O if blocks not in memory

• Minimize accesses to tuples if data in memory • Prefer sequential access

(3)
(4)

The fastest access for single tuple: row_id

The fastest way to access one

tuple is by using its row or tuple

id:

• Row_id = Block_id, offset

• Essentially, a pointer to the tuple

Only need to access the block

where the tuple is

The row_id is found:

SELECT * FROM T

WHERE T.row_id = 123456

SELECT * FROM T

WHERE T.key = AB34TF

Bu

ff

er

(5)

When to use row_id access

SELECT * FROM T WHERE T.A = 42

index on A

• Row_id access can also be used when there is a predicate over an indexed attribute:

• use the index to find the

matching tuples and retrieve them using the row_id

• If there are many matches, it might induce many random accesses to different blocks

(6)

When to use row_id access

SELECT * FROM T

WHERE T.A = 42 AND T.B > 50

index on A

• Row_id access through an index works even with more complex predicates

• use the index to find the matching tuples using the indexed attribute

• Retrieve the tuples that match the other predicate

(7)

Which index to use?

Assume there are two indexes, one on A and one on B

SELECT * FROM T

WHERE T.A = 42 AND T.B > 50

We can use any of them to retrieve the data

• Find all tuple T.A = 42 and then check T.B > 50

• Find all the tuples T.B > 50 and then check T.A = 42

• Find all tuples T.A = 42, find all tuples T.B > 50, match the two lists of row_ids

Which one to use depends on the relative selectivities of each

(8)

The “slowest” access: full table scan

A full table scan reads all the blocks and all the tuples in each block

=> it is expensive, especially if data not in memory

But not is not the slowest, very stupid plans might be worse

SELECT * FROM T WHERE T.id =1 OR T.id=2 OR T.id=3 … SELECT * FROM T WHERE T.age <= 20 OR T.age >20

Full table scan is the upper bound in cost for retrieving data from a

base table

• Worst case scenario for query planning, if nothing else works, use a full table scan

(9)

When to use a full table scan?

When there is no other option:

• There are no indexes or indexes are not selective enough

• Predicates involving several columns of the same table (self join) SELECT * FROM T where T.A > T.B

• The amount of data retrieved is large enough that sequential access is better than many random accesses

Several ways to minimize the overhead of a full table scan:

• Shared scans = use the cursor from the scan of another query • Sample scans = do not read everything but just a sample

• Column store = scanning a compressed column using SIMD can be fast once data is in memory

(10)

Clustered indexes

A clustered index enforces that the data in the extent is organized

according to the index:

• B+ tree = data is sorted

• Hash index = tuples with same key are in the same bucket

In those cases, we might not traverse the index for each tuple:

• Find the relevant blocks

(11)

Clustered index example

SELECT * FROM T

WHERE T.A = 42 AND T.B > 50

index on A T.B > 50 YES Emit tuple NO Ignore tuple SELECT * FROM T

WHERE T.A = 42 AND T.B > 50

T.A = 42 YES Emit tuple NO Ignore tuple Clustered index on B

(12)

Lowering the costs of table scans: Zone Maps

A zone map is a combination of coarse index and statistics

• For every block in of a table

• Keep the max and min values for some or all columns • Before reading a block, check the zone map:

• If range does not match the predicate, do not read the block

It can significantly reduce I/O cost

In some cases it can replace an index

Other statistics can be kept in a zone map

(13)

Other considerations

Other factors affecting how to access a base table:

• A table scan using an index can be expensive but it will return sorted data:

• Start at the beginning of the leaves of the index and retrieve the tuples one by one (sequential access to the row_ids)

• Expensive if index not clustered but might be cheaper than sorting the data • I/O is significantly more expensive that accessing data in memory

• Random accesses are far worse for I/O than for data in memory

• Scans in memory can be reasonably fast

• Changes the decision points between random access and scans • Scans on columns not the same as scans on rows

(14)
(15)

Why sorting data?

Recall that data is not necessarily stored in a sorted manner

The query requires it

SELECT * FROM T ORDER_BY (T.age)

Some operations are easier over sorted data

SELECT DISTINCT(T.name) FROM T

sort the data by T.name and return the first for each group SELECT AVG(T.age) FROM T GROUP_BY(T.level)

sort the data by T.level and then find the average age for each group Joins, selections, intra-table predicates …

Sorting is expensive

• Requires extra space (no sorting in place for base tables) • Requires CPU (comparisons)

(16)

External sort (data does not fit in memory)

Why external sort?

• Obvious: data does not fit in main memory (data and results!!)

• Less obvious: many queries running at the same time sharing memory

Two key parameters

• N: number of pages of input • M: size of in memory buffer

Behavior of algorithm determined by many parameters: I/O, CPU, I/O

(17)

Two-phase External Sort

• N size of the input in pages, M size of the buffer in pages • Phase I: Create Runs

1. Load allocated buffer space with tuples 2. Sort tuples in buffer pool

3. Write sorted tuples (run) to disk

4. Goto Step 1 (create next run) until all tuples processed • Phase II: Merge Runs

• Use priority heap to merge tuples from runs • Special cases

• M >= N: no merge needed

(18)

For simplicity we will hide this

The size of the buffer needed:

• Minimal configuration:

• Number of blocks -1 are used to read in data blocks

• 1 block is used to write data out • Better:

• Number of blocks -1 to read data in

(19)

External Sort

97

17

3

5

27

16

2

99

13

(20)

External Sort

97

17

3

5

27

16

2

97

17

3

load

(21)

External Sort

97

17

3

5

27

16

2

99

13

3

17

97

sort

(22)

External Sort

97

17

3

5

27

16

2

3

3

17

97

17

97

write

run

(23)

External Sort

97

17

3

5

27

16

2

99

13

5

3

17

97

27

16

load

(24)

External Sort

97

17

3

5

27

16

2

5

3

17

97

5

16

27

16

27

(25)

External Sort

97

17

3

5

27

16

2

99

13

2

3

1 7

97

5

16

27

99

13

load

(26)

External Sort

97

17

3

5

27

16

2

2

3

17

97

5

16

27

2

13

99

End of Phase 1

(27)

External Sort

3

3

17

97

5

16

27

2

13

99

5

2

merge

(28)

External Sort

2

3

3

17

97

5

16

27

2

5

2

merge

(29)

External Sort

2

3

3

5

13

merge

3

17

97

5

16

27

2

13

99

(30)

External Sort

2

3

5

17

5

13

merge

3

17

97

5

16

27

2

(31)

External Sort

2

3

5

17

16

13

merge

3

17

97

5

16

27

2

13

99

(32)

External Sort

2

3

5

13

17

16

13

3

17

97

5

16

27

2

(33)

One pass vs. multi-pass sort

Previous algorithm is a one-pass algorithm (every data item is read

once and written once)

However:

• If there are many runs, I/O overhead is too high (we need to bring too many runs to memory)

(34)
(35)

Ways to speed up sorting

Prefetching and double buffering:

• Use more than one block for writing data out • Prefetch blocks as needed while processing

Take advantage of indexes

• Clustered: read the data in order as it is already sorted (no CPU cost, sequential access)

• Non-clustered: use the index to read the data in order (no CPU cost but very expensive in terms of random access), may work for small ranges

(36)

Sorting vs hashing

Sorting is relatively expensive

If not required by the query, hashing is often the better way to

answer queries of the form:

SELECT DISTINCT(T.name) FROM T

SELECT AVG(T.age) FROM T GROUP_BY(T.level)

In both cases:

• Build a hash table on the attribute of interest

(37)

Distinct with hashing

SELECT DISTINCT(T.name) FROM T

name age T Hash(name) John Mary Louis Anne Bob Tom John John Anne

Tom Tom Tom DISTINCT

(38)

Aggregation with hashing

SELECT AVG(T.age) FROM T GROUP_BY(T.level) name age T Hash(level) John, Manager, 50 Louis, Admin, 27 level Anne, Manager, 48 Tom, Manager,57 Donald, Admin, 27 Tim, Assistant, 50 Anne, CEO, 53 John, Researcher, 47 AVG(AGE)

(39)

External hashing

If the hash table does not fit in memory

• Partition the data first using a hash function • Then use the partitions to build the hash table

If the partitions do not fit in memory

• Partition the partition again until it does • Build the hash table in several runs

For sorting, we use M blocks, M-1 to read data in, 1 to write data out

For hashing, we use M buffers, 1 to read data in, M-1 to write data

(40)

External hashing

block Hash 1 In me mor y er se en t

Write to disk as buckets fill

(41)

External hashing

Blocks on disk

Hash 2 …..

(42)
(43)

Nested loop join

Actually, two nested scans

While there are tuples in R

• Get a tuple from R

• Compare with all tuples in S (scan S for matches) • Output if match

Complexity is

O(|R|*|S|)

, i.e.,

O(N

2

)

Sounds expensive but still used in practice (makes sense if, e.g., S is

(44)

Nested loop join

Outer table is the table used in the outer loop

Inner table is the table used in the inner loop

Outer table always the smaller of the two tables (in number of blocks)

• Maximizes sequential access in the inner loop

Optimization: block nested loop join

• get a block from R

• (hash and then) compare with all blocks of S

(45)

Nested loop joins, simple vs block

Each tuple in outer table brings all the blocks from the inner table

Each block in outer table brings all the blocks from the inner table

(46)

Nested loop joins on indexed table

Each tuple in outer table looks up the

Index on inner table

(47)

Nested loop joins with zone maps

Use the zone map to determine whether we need to look into a block

Outer table sorted

Min = 3 Max = 17 Min = 2 Max = 23 Min = 11 Max = 19 Min = 1 Max = 8 In range? 8

(48)

Nested loop joins on sorted input

No need to scan outer table for every tuple in inner table as order tells Outer table sorted

Scan only until matches are

(49)

Sort-merge join

sort sort T R T sorted on join attribute R sorted on join attribute

MERGE Joined table Scans the two sorted

tables but it uses the sorted order to avoid having to go back on any of the tables

(50)

Canonical Hash Join

k hash(key) 1. Build phase bucket 1 bucket n-1 bucket 0

hash table 2. Probe phase

k

R S

hash(key)

(51)
(52)
(53)

Partitioned Hash Join (Shatdal et al. 1994)

• No more cache misses during the join

k R S h1(key) h1(key) . . . 1 p 1 p k . . . . . . . . .

Idea: Partition input into disjoint chunks of cache size

① Partition ② Build ③ Probe ① Partition

h2(k)

p > #TLB-entries  TLB misses

p > #Cache-entries  Cache thrashing

(54)

Multi-Pass Radix Partitioning

Problem: Hardware limits fan-out, i.e. T = #TLB-entries (typically 64-512)

Solution: Do the partitioning in multiple passes!

1st pass h1(key) 1 T . . . 2nd Pass h2(key) 1 T . . . 2nd Pass h2(key) 1 T . . . . . . . . . ... ith pass ... partition - 1 input relation i = logT p

(55)

Thread-1 Thread-2 Thread-3 Thread-N

R el ation Local Histograms Global Histogram & Prefix Sum

P artition e d 1st Scan 2nd Scan

Each thread scatters out its tuples based on the prefix sum

Parallel Radix Join

Parallelizing the Partitioning:

Pass - 1

(56)

Parallel Radix Join

Parallelizing the Partitioning:

Pass - (2 .. i)

Thread-2 Thread-4 Thread-N

R ela tion Histogram Prefix Sum

1st Scan 2nd Scan

Each thread individually partitions sub-relations from pass-1

. . .

(57)

Summary

Database engines implement many operators with many different

variations depending on the input data

Implementations are also tailored to the particular data type being

processed (integers, strings, etc.)

Today, emphasis is on:

• Using the caches efficiently • Exploiting SIMD

References

Related documents

• Introduction: OpenFOAM and object-oriented CFD software design • Code verification: layered development and generic programming • Code validation framework: tutorials and

near-term inflows, outflows, and balances of usable resources. Such information is useful in assessing the City’s financing requirements. Specifically, unreserved

This integrated framework enables us to test new organizational performance applications against an existing understanding of organization systems and the need for alignment

The superintendent will ensure the School Health Council provides an annual report to the school board, individual schools, and the public on the implementation of and compliance

Accounting Review, African Development Review, Agricultural Economics, American Economic Review, American Journal of Agricultural Economics, Annals of Regional Science,

Scan Enter Group 9-4 Label Scan User Define Label--&gt; Scan Six Digit Labels in Table Hex (Only 3 sets of length can be defined) Scan Confirm Label in Table Hex --&gt; Scan

left outer join: Everything from right operand table, and whatever matches from the left4. If no match, then those values are set

• A right outer join between the source and target tables returns the results of an inner join and the missing rows it excluded from the target table. • A full outer join returns