Technical Deep-Dive in a Column-Oriented In-Memory Database

(1)

Technical Deep-Dive in a

Column-Oriented

In-Memory Database

Martin Faust

[email protected]

Research Group of Prof. Hasso Plattner

Hasso Plattner Institute for Software Engineering University of Potsdam

(2)

Research Group of Prof. Dr. h.c. Hasso Plattner

§  Research focuses on the technical aspects of enterprise software and

design of complex applications

§  In-Memory Data Management for Enterprise Applications

§  Human-Centered Software Design and Engineering

§  Programming Model for Enterprise

Applications

§  Maintenance and Evolution of

Service-Oriented Enterprise Software

§  RFID Technology in Enterprise Platforms

§  Teaching activities:

§  Lectures, Exercises, Seminars

§  Bachelor/Master projects

§  Master thesis

Enterprise Platform and Integration Concepts Group - Topics

(3)

The Design Thinking

Methodology

Viability Business Factors Feasibility Technical Factors Solutions Needs & Opportunities Desirability Human Factors

(4)

Demand Prediction Supply Chain Innovation Database Technology Massachusetts Institute of Technology … Software … Hardware Cloud Computing UC Berkeley Design Research In-Memory Data Management Stanford University Hasso Plattner Institute,

Enterprise Systems & Applications Chair

…

End-Users

Setup

Enables Leading Research in the Field of In-‐Memory Data Management

(5)

Learning Map of our

Online Lecture @ openHPI.de

Founda'ons for a New Enterprise Applica'on Development Era Founda'ons of Database Storage Techniques The Future of

Enterprise Compu'ng Advanced Database Storage Tech-‐ niques In-‐Memory Database Operators

(6)

Goals

Deep technical understanding of a column-‐oriented, dicAonary-‐ encoded in-‐memory database and its applicaAon in enterprise compuAng

Chapters

□  The future of enterprise compuAng

□  FoundaAons of database storage techniques

□  In-‐memory database operators

□  Advanced database storage techniques

□  ImplicaAons on ApplicaAon Development

(7)

Chapter 1:

The Future of Enterprise

Computing

(8)

New Requirements for

Enterprise Computing

□  Sensors

□  Events

□  Structured + unstructured data

□  Social networks

□  ...

(9)

Enterprise Application

Characteristics

(10)

¨  Modern enterprise resource planning (ERP) systems are challenged by mixed workloads, including OLAP-‐style queries. For example:

§  OLTP-‐style: create sales order, invoice, accounAng documents,

display customer master data or sales order

§  OLAP-‐style: dunning, available-‐to-‐promise, cross selling,

operaAonal reporAng (list open sales orders)

¨  But: Today’s data management systems are opAmized either for daily transac'onal or analy'cal workloads storing their data along rows or columns

O

n

l

ine

T

ransacAon

P

rocessing

O

_P

_rocessing

n

l

ine

A

nalyAcal

OLTP vs. OLAP

(11)

Drawbacks of the Separation

¨  OLAP systems do not have the latest data

¨  OLAP systems only have predeﬁned subset of the data

¨  Cost-‐intensive ETL processes have to sync both systems

¨  SeparaAon introduces data redundancy

¨  Diﬀerent data schemas introduce complexity for applicaAons combining sources

(12)

Enterprise Workloads are

Read Dominated

0 % 10 % 20 % 30 % 40 % 50 % 60 % 70 % 80 % 90 % 100% OLTP OLAP W orkload 0 % 10 % 20 % 30 % 40 % 50 % 60 % 70 % 80 % 90 % 100% TPC-C W orkload _Select Insert Modification Delete Write: Read: Lookup Table Scan Range Select Insert Modification Delete Write: Read:

§  Workload in enterprise applicaAons consAtutes of

§  Mainly read queries (OLTP 83%, OLAP 94%)

§  Many queries access large sets of data

(13)

Few Updates in OLTP

Pe rc en tag e of ro w s up date d

(14)

Combine OLTP and OLAP data

using

modern

hardware and database systems

to create a

single source of truth,

enable

real-‐'me analy'cs

and

simplify

applicaAons and database structures.

AddiAonally,

¨  ExtracAon, transformaAon, and loading (ETL) processes

¨  Pre-‐computed aggregates and materialized views

become (almost) obsolete

.

Vision

(15)

¨  Many columns are not used even once

¨  Many columns have a low cardinality of values

¨  NULL values/default values are dominant

¨  Sparse distribuAon facilitates high compression

Standard enterprise so`ware data is sparse and wide.

(16)

Many Columns are not

Used Even Once

55% unused columns per company in average

40% unused columns across all 12 analyzed companies

0 % 10% 20% 30% 40% 50% 60% 70% 80% 1 - 32 33 - 1023 1024 - 100000000 13 % 9 % 78 % 24 % 12 % 64 %

Number of Distinct Values

Inventory Management Financial Accounting

% of Columns

(17)

(18)

A

Changes in Hardware…

… give an opportunity to re-‐think the assump'ons of yesterday because of what is possible today.

§  Main Memory becomes cheaper and larger

■  MulA-‐Core Architecture

(128 cores per server)

■  Large main memories: 2TB /

blade

■  One blade ~$50.000 =

1 enterprise class server

■  Parallel scaling across blades

■  64-‐bit address space

■  12TB in current servers

■  Cost-‐performance raAo rapidly

declining

■  Memory hierarchies

(19)

In the Meantime

Research has come up with…

§  Column-‐oriented data organizaAon

(the column store)

§  Sequen'al scans allow best bandwidth uAlizaAon

between CPU cores and memory

§  Independence of tuples allows easy parAAoning and

therefore parallel processing

§  Lightweight Compression

§  Reducing data amount

§  Increasing processing speed through late materializaAon

§  And more, e.g., parallel scan/join/aggregaAon

… several advances in soRware for processing data

(20)

(21)

SanssouciDB: An In-Memory

Database for Enterprise Applications

Main Memory at Blade i

Log Snapshots

Passive Data (History)

Non-Volatile Memory Recovery Logging Time travel Data aging

Query Execution Metadata TA Manager

Interface Services and Session Management

Distribution Layer at Blade i

Main Store Differential Store Active Data M er ge C ol um n C ol um n C om bi ne d C ol um n C ol um n C ol um n C om bi ne d C ol um n Indexes Inverted Object Data Guide In-‐Memory Database (IMDB)

¨  Data resides permanently in

main memory

¨  Main Memory is the primary

“persistence”

¨  SAll: logging and recovery

to/from ﬂash

¨  Main memory access is

the new boTleneck

¨  Cache-‐conscious algorithms/

data structures are crucial

(locality is king)

Main Memory at Server i

Distribution Layer at Server i

(22)

Chapter 2:

Foundations of Database

Storage Techniques

(23)

Learning Map of our

Online Lecture @ openHPI.de

(24)

(25)

Memory Basics (1)

□  Memory in today’s computers has a linear address layout:

addresses start at 0x0 and go to 0xFFFFFFFFFFFFFFFF for 64bit

□  Each process has its own virtual address space

□  Virtual memory allocated by the program can distribute over

mulAple physical memory locaAons

(26)

Memory Basics (2)

□  Memory layout is only linear

□  Every higher-‐dimensional access (like two-‐dimensional

database tables) is mapped to this linear band

26

0x0 0xFFFF FFFF FFFF FFFF

FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF

(27)

Memory Hierarchy

Memory Page

Nehalem Quadcore

Core 0 Core 1 Core 2 Core 3

L3 Cache L2

L1 TLB

Main Memory Main Memory

QPI

Nehalem Quadcore

Core 0 Core 1 Core 2 Core 3

L3 Cache L2 L1 TLB QPI L1 Cacheline L2 Cacheline L3 Cacheline

(28)

Physical Data Representation

§  Row store:

§  Rows are stored consecuAvely

§  OpAmal for row-‐wise access (e.g. SELECT *)

§  Column store:

§  Columns are stored consecuAvely

§  OpAmal for akribute focused access (e.g. SUM, GROUP BY)

§  Note: concept is independent from storage type

+

28 Doc Num Doc Date Sold-To Value Status Sales Org Row 4 Row 3 Row 2 Row 1 Row-Store Column-store

(29)

Row Data Layout

□  Data is stored tuple-‐wise

□  Leverage co-‐locaAon of akributes for a single tuple

□  Low cost for tuple reconstrucAon, but higher cost for

sequenAal scan of a single akribute

C B A C B A C B A C B A C B A C B A C B A C B A Column Operation Row Operation

(30)

Columnar Data Layout

□  Data is stored akribute-‐wise

□  Leverage sequenAal scan-‐speed in main memory

for predicate evaluaAon

□  Tuple reconstrucAon is more expensive

C C C C B B B B A A A A C C C C B B B B A A A A Column Operation Row Operation

Column Column Column

(31)

Row-oriented storage

A1 B1 C1 A2 B2 C2 A3 B3 C3 A4 B4 C4

(32)

Row-oriented storage

32 A1 B1 C1 A2 B2 C2 A3 B3 C3 A4 B4 C4

(33)

Row-oriented storage

A1 B1 C1 A2 B2 C2

A3 B3 C3 A4 B4 C4

(34)

Row-oriented storage

34

A1 B1 C1 A2 B2 C2 A3 B3 C3

(35)

Row-oriented storage

(36)

Column-oriented storage

36 A1 B1 C1 A2 B2 C2 A3 B3 C3 A4 B4 C4

(37)

Column-oriented storage

B1 C1 B2 C2 B3 C3 B4 C4 A1 A2 A3 A4

(38)

Column-oriented storage

38 A1 B1 C1 A2 B2 C2 A3 B3 C3 A4 B4 C4

(39)

Column-oriented storage

(40)

(41)

Motivation

□  Main memory access is the new bokleneck

□  Idea: Trade CPU Ame to compress and decompress data

□  Compression reduces number of memory accesses

□  Leads to less cache misses due to more informaAon on a

cache line

□  OperaAon directly on compressed data possible

□  Oﬀseong with bit-‐encoded ﬁxed-‐length data types

(42)

Dictionary Encoding Example

8 billion humans □  Akributes §  ﬁrst name §  last name §  gender §  country §  city §  birthday à 200 byte

□  Each akribute is stored

dicAonary encoded 42

(43)

Sample Data

rec ID fname lname gender city country birthday … … … …

39 John Smith m Chicago USA 12.03.1964 40 Mary Brown f London UK 12.05.1964 41 Jane Doe f Palo Alto USA 23.04.1976 42 John Doe m Palo Alto USA 17.06.1952 43 Peter Schmidt m Potsdam GER 11.11.1975

(44)

Dictionary Encoding a Column

□  A column is split into a dicAonary and an akribute vector

□  DicAonary stores all disAnct values with implicit value ID

□  Akribute vector stores value IDs for all entries in the column

□  PosiAon is implicit, not stored explicitly

□  Enables oﬀseong with ﬁxed-‐length data types

44

Rec ID fname

… … 39 John 40 Mary 41 Jane 42 John 43 Peter … …

Dic'onary for “fname” Value ID Value

… … 23 John 24 Mary 25 Jane 26 Peter … …

ATribute Vector for “fname” posi'on Value ID

… … 39 23 40 24 41 25 42 23 43 26 … …

(45)

Sorted Dictionary

□  DicAonary entries are sorted either by their numeric value or

lexicographically

§  DicAonary lookup complexity: O(log(n)) instead of O(n)

□  DicAonary entries can be compressed to reduce the amount

of required storage

□  SelecAon criteria with ranges are less expensive (order-‐

(46)

Data Size Examples

46

Column Cardi-‐

nality Needed Bits Item Size Plain Size (Dic'onary + Column) Size with Dic'onary Compression Factor

First

name 5 million 23 bit 50 Byte 373 GB 238.4 MB + 21.4 GB ≈ 17

Last

name 8 million 23 bit 50 Byte 373 GB 381.5 MB + 21.4 GB ≈ 17

Gender 2 1 bit 1 Byte 7 GB 2 bit + 953.7 MB ≈ 8

City 1 million 20 bit 50 Byte 373 GB 47.7 MB + 18.6 GB ≈ 20

Country 200 8 bit 47 Byte 350 GB 9.2 KB + 7.5 GB ≈ 47

Birthday 40,000 16 bit 2 Byte 15 GB 78.1 KB + 14.9 GB ≈ 1

(47)

Chapter 3:

In-Memory

(48)

Learning Map of our

Online Lecture @ openHPI.de

(49)

Scan Performance (1)

8 billion humans ¨  Akributes §  First Name §  Last Name §  Gender §  Country §  City §  Birthday è 200 byte

¨  QuesAon: How many men/women?

(50)

Scan Performance (2)

50 Row 1 Row 2 Row 3 ... Row 8 x 10 First Name Last Name Gender Country City Birthday Table: humans 9

Row Store – Layout

¨  Table size =

8 billion tuples x 200 bytes per tuple à ~1.6 TB

¨  Scan through all rows

with 2 MB/ms/core à ~800 seconds with 1 core

(51)

Scan Performance (3)

Row 1 Row 2 Row 3 ... First Name

Last Name Country

City Birthday Table: humans Row 8 x 109 Gender

Row Store – Full Table Scan

¨  Table size =

8 billion tuples x 200 bytes per tuple à ~1.6 TB

¨  Scan through all rows

with 2 MB/ms/core à ~800 seconds with 1 core

(52)

Scan Performance (4)

52 Row 1 Row 2 Row 3 ... First Name

Last Name Country

City Birthday Table: humans Data not loaded Data loaded and used Data loaded but not used Row

8 x 109

Gender

¨  8 billion cache accesses à

64 byte

à ~512 GB

¨  Read with 2 MB/ms/core

à ~256 seconds with 1 core

(53)

Scan Performance (5)

First Name

Last Name Country

City

Birthday

Table: humans

Gender

Column Store – Layout

¨  Table size

§  Attribute vectors: ~91 GB

§  Dictionaries: ~700 MB

à Total: ~92 GB

(54)

Scan Performance (6)

54

Column Store – Full Column Scan on “Gender”

First Name

Last Name Country

City Birthday Table: humans Data not loaded Data loaded and used Data loaded but not used Gender

¨  Size of attribute vector

“gender” =

8 billion tuples x 1 bit per tuple à ~1 GB

¨  Scan through

attribute vector with 2 MB/ms/core à

(55)

Scan Performance (7)

Column Store – Full Column Scan on “Birthday”

First Name

Last Name Country

City

Birthday

Table: humans

Gender

¨  Size of attribute vector

“birthday” =

8 billion tuples x 2 Byte per tuple à ~16 GB

¨  Scan through

column with

2 MB/ms/core à

(56)

Scan Performance – Summary

56

¨  How many women, how many men?

Column

Store Row Store

Full table

scan access Stride Time in

seconds 0.5 800 256

1,600x

(57)

(58)

Tuple Reconstruction (1)

¨  All akributes are stored consecuAvely ¨  200 byte à 4 cache accesses à 64 byte à 256 byte ¨  Read with 2 MB/ms/core à ~0.128 μs with 1 core

Accessing a record in a row store

58

(59)

Tuple Reconstruction (2)

¨  All akributes are stored in separate columns

¨  Implicit record IDs are used to

reconstruct rows

Virtual record IDs

(60)

Tuple Reconstruction (3)

¨  1 cache access for each akribute ¨  6 cache accesses à 64 byte à 384 byte ¨  Read with 2 MB/ms/core à ~0.192 μs with 1 core 60

Virtual record IDs

(61)

(62)

SELECT Example

SELECT fname, lname FROM world_population

WHERE country="Italy" and gender="m"

fname lname country gender

Gianluigi Buﬀon Italy m Lena Gercke Germany f Mario Balotelli Italy m Manuel Neuer Germany m Lukas Podolski Germany m Klaas-‐Jan Huntelaar Netherlands m

(63)

Query Plan

¨  MulAple plans exist to execute query

¨  Query OpAmizer decides which is executed

¨  Based on cost model, staAsAcs and other parameters

¨  AlternaAves

¨  Scan “country” and “gender”, posiAonal AND

¨  Scan over “country” and probe into “gender”

¨  Indices might be used

¨  Decision depends on data and query parameters like e.g. selecAvity

SELECT fname, lname FROM world_population

(64)

Query Plan (i)

PosiAonal AND:

¨  Predicates are evaluated and generate posiAon lists

¨  Intermediate posiAon lists are logically combined

¨  Final posiAon list is used for materializaAon

64

country = "Italy" gender = "m"

fname, lname position list position list positional AND π

(65)

Query Execution (i)

Posi'on 0 2 Posi'on 0 2 3 4 5 Posi'on 0 2 country = 3 ("Italy") gender = 1 ("m") AND

Value ID Dic'onary for “country” 0 Algeria 1 France 2 Germany 3 Italy 4 Netherlands … Value

ID Dic'onary for “gender”

0 f 1 m

fname lname country gender

Gianluigi Buﬀon 3 1 Lena Gercke 2 0 Mario Balotelli 3 1 Manuel Neuer 2 1 Lukas Podolski 2 1 Klaas-‐Jan Huntelaar 4 1

(66)

country = "Italy" Gender fname, lname position list Probe Positional Filter gender = "m"

Query Plan (ii)

Based on posiAon list produced by ﬁrst selecAon,

gender column is probed.

66

(67)

(68)

Insert

□  Insert is the dominant modiﬁcaAon operaAon

§  Delete/Update can be modeled as Inserts as well (Insert-‐only

approach)

□  InserAng into a compressed in-‐memory persistence can be

expensive

§  UpdaAng sorted sequences (e.g. dicAonaries) is a challenge

§  InserAng into columnar storages is generally more expensive than

inserAng into row storages

(69)

Insert Example

rowID fname lname gender country city birthday 0 MarAn Albrecht m GER Berlin 08-‐05-‐1955

1 Michael Berg m GER Berlin 03-‐05-‐1970

2 Hanna Schulze f GER Hamburg 04-‐04-‐1968

3 Anton Meyer m AUT Innsbruck 10-‐20-‐1992

4 Sophie Schulze f GER Potsdam 09-‐03-‐1977

... ... ... ... ... ... ...

world_populaAon

INSERT INTO world_populaAon

(70)

INSERT (1) w/o new Dictionary entry

fname lname gender country city birthday

0 MarAn Albrecht m GER Berlin 08-‐05-‐1955 1 Michael Berg m GER Berlin 03-‐05-‐1970 2 Hanna Schulze f GER Hamburg 04-‐04-‐1968 3 Anton Meyer m AUT Innsbruck 10-‐20-‐1992 4 Sophie Schulze f GER Potsdam 09-‐03-‐1977 ... ... ... ... ... ... ...

INSERT INTO world_populaAon VALUES (Karen, Schulze, f, GER, Rostock, 11-‐15-‐2012)

0 0 1 1 2 3 3 2 4 3 0 Albrecht 1 Berg 2 Meyer 3 Schulze D AV

Akribute Vector (AV) DicAonary (D)

(71)

0 Albrecht

1 Berg

2 Meyer

3 Schulze

1.  Look-‐up on D à entry found

D

0 MarAn Albrecht m GER Berlin 08-‐05-‐1955 1 Michael Berg m GER Berlin 03-‐05-‐1970 2 Hanna Schulze f GER Hamburg 04-‐04-‐1968 3 Anton Meyer m AUT Innsbruck 10-‐20-‐1992 4 Sophie Schulze f GER Potsdam 09-‐03-‐1977 ... ... ... ... ... ... ... AV 0 0 1 1 2 3 3 2 4 3

(72)

0 0 1 1 2 3 3 2 4 3 5 3

1.  Look-‐up on D à entry found

2.  Append ValueID to AV

0 MarAn Albrecht m GER Berlin 08-‐05-‐1955 1 Michael Berg m GER Berlin 03-‐05-‐1970 2 Hanna Schulze f GER Hamburg 04-‐04-‐1968 3 Anton Meyer m AUT Innsbruck 10-‐20-‐1992 4 Sophie Schulze f GER Potsdam 09-‐03-‐1977 5 Schulze ... ... ... ... ... ... ... AV 0 Albrecht 1 Berg 2 Meyer 3 Schulze D

INSERT (1) w/o new Dictionary entry

(73)

INSERT (2) with new Dictionary Entry I/II

0 MarAn Albrecht m GER Berlin 08-‐05-‐1955 1 Michael Berg m GER Berlin 03-‐05-‐1970 2 Hanna Schulze f GER Hamburg 04-‐04-‐1968 3 Anton Meyer m AUT Innsbruck 10-‐20-‐1992 4 Sophie Schulze f GER Potsdam 09-‐03-‐1977 5 Schulze

... ... ... ... ... ... ...

0 0 1 0 2 1 3 2 4 3 D AV 0 Berlin 1 Hamburg 2 Innsbruck 3 Potsdam

(74)

1.  Look-‐up on D à no entry found

INSERT INTO world_populaAon VALUES (Karen, Schulze, f, GER, Rostock, 11-‐15-‐2012) D AV 0 0 1 0 2 1 3 2 4 3 0 Berlin 1 Hamburg 2 Innsbruck 3 Potsdam

... ... ... ... ... ... ... Akribute Vector (AV) DicAonary (D)

INSERT (2) with new Dictionary Entry I/II

(75)

2.  Append new value to D (no re-‐sorAng necessary)

0 Berlin 1 Hamburg 2 Innsbruck 3 Potsdam 4 Rostock D

... ... ... ... ... ... ... Akribute Vector (AV) DicAonary (D) AV 0 0 1 0 2 1 3 2 4 3

(76)

2.  Append new value to D (no re-‐sorAng necessary)

3.  Append ValueID to AV

0 MarAn Albrecht m GER Berlin 08-‐05-‐1955 1 Michael Berg m GER Berlin 03-‐05-‐1970 2 Hanna Schulze f GER Hamburg 04-‐04-‐1968 3 Anton Meyer m AUT Innsbruck 10-‐20-‐1992 4 Sophie Schulze f GER Potsdam 09-‐03-‐1977 5 Schulze Rostock

... ... ... ... ... ... ... Akribute Vector (AV) DicAonary (D) 0 0 1 0 2 1 3 2 4 3 5 4 0 Berlin 1 Hamburg 2 Innsbruck 3 Potsdam 4 Rostock D AV

INSERT (2) with new Dictionary Entry I/II

(77)

0 Anton

1 Hanna

2 MarAn

3 Michael

4 Sophie

... ... ... ... ... ... ...

INSERT INTO world_populaAon VALUES (Karen, Schulze, f, GER, Rostock, 11-‐15-‐2012) D AV 0 2 1 3 2 1 3 0 4 4

(78)

... ... ... ... ... ... ... Akribute Vector (AV) DicAonary (D) 0 Anton 1 Hanna 2 MarAn 3 Michael 4 Sophie D AV 0 2 1 3 2 1 3 0 4 4

INSERT (2) with new Dictionary Entry II/II

(79)

2.  Insert new value to D

... ... ... ... ... ... ... Akribute Vector (AV) DicAonary (D) 0 Anton 1 Hanna 2 Karen 3 MarAn 4 Michael 5 Sophie D AV 0 2 1 3 2 1 3 0 4 4

(80)

... ... ... ... ... ... ... Akribute Vector (AV) DicAonary (D) 0 Anton 1 Hanna 2 MarAn 3 Michael 4 Sophie D (old) AV 0 2 1 3 2 1 3 0 4 4 0 Anton 1 Hanna 2 Karen 3 MarAn 4 Michael 5 Sophie D (new)

INSERT (2) with new Dictionary Entry II/II

(81)

3.  Change ValueIDs in AV

... ... ... ... ... ... ... Akribute Vector (AV) DicAonary (D) AV (old) 0 Anton 1 Hanna 2 Karen 3 MarAn 4 Michael 5 Sophie D (new) AV (new) 0 3 1 4 2 1 3 0 4 5 0 2 1 3 2 1 3 0 4 4

(82)

3.  Change ValueIDs in AV

4.  Append new ValueID to AV

0 MarAn Albrecht m GER Berlin 08-‐05-‐1955 1 Michael Berg m GER Berlin 03-‐05-‐1970 2 Hanna Schulze f GER Hamburg 04-‐04-‐1968 3 Anton Meyer m AUT Innsbruck 10-‐20-‐1992 4 Sophie Schulze f GER Potsdam 09-‐03-‐1977 5 Karen Schulze Rostock

... ... ... ... ... ... ... Akribute Vector (AV) DicAonary (D) 0 Anton 1 Hanna 2 Karen 3 MarAn 4 Michael 5 Sophie D AV 0 3 1 4 2 1 3 0 4 5 5 2

INSERT (2) with new Dictionary Entry II/II

82

Changed Value IDs

(83)

Result

rowID fname lname gender country city birthday 0 MarAn Albrecht m GER Berlin 08-‐05-‐1955

1 Michael Berg m GER Berlin 03-‐05-‐1970

2 Hanna Schulze f GER Hamburg 04-‐04-‐1968

3 Anton Meyer m AUT Innsbruck 10-‐20-‐1992

4 Ulrike Schulze f GER Potsdam 09-‐03-‐1977

5 Karen Schulze f GER Rostock 11-‐15-‐2012

world_populaAon

INSERT INTO world_populaAon

(84)

Chapter 4:

Advanced Database

Storage Techniques

(85)

Learning Map of our

Online Lecture @ openHPI.de

(86)

(87)

Motivation

□  InserAng new tuples directly into a compressed structure can be

expensive

§  Especially when using sorted structures

§  New values can require reorganizing the dicAonary

§  Number of bits required to encode all dicAonary values can change,

(88)

Differential Buffer

¨  New values are wriken to a dedicated diﬀerenAal buﬀer (Delta)

¨  Cache SensiAve B+ Tree (CSB+) used for fast search on Delta

Table Main Store

Differential Buffer

Data Modifying Operations

Read Operations Table DicAonary Akribute Vector fname … (compr esse d) Main Store DicAonary Akribute Vector CSB+ fname …

DiﬀerenAal Buﬀer/ Delta Write Read world_population 0 0 1 1 2 1 3 3 4 2 5 1 0 Anton 1 Hanna 2 Michael 3 Sophie 0 Angela 1 Klaus 2 Andre 0 0 1 1 2 1 3 2

8 Billion entries up to 50,000 _entries

(89)

Differential Buffer

□  Inserts of new values are fast, because dicAonary and

akribute vector do not need to be resorted

□  Range selects on diﬀerenAal buﬀer are expensive

§  Unsorted dicAonary allows no direct comparison of value IDs

§  Scans with range selecAon need to lookup values in dicAonary for

comparisons

□  DiﬀerenAal Buﬀer requires more memory:

§  Akribute vector not bit-‐compressed

(90)

DiﬀerenAal Buﬀer Main Store

Tuple Lifetime

recId fname lname gender country city birthday

0 MarAn Albrecht m GER Berlin 08-‐05-‐1955 1 Michael Berg m GER Berlin 03-‐05-‐1970 2 Hanna Schulze f GER Hamburg 04-‐04-‐1968 3 Anton Meyer m AUT Innsbruck 10-‐20-‐1992 4 Ulrike Schulze f GER Potsdam 09-‐03-‐1977 5 Sophie Schulze f GER Rostock 06-‐20-‐2012 ... ... ... ... ... ... ... 8 * 109 _Zacharias _Perdopolus _m _GRE _Athen _{03-‐12-‐1979}

Main Table: world_populaAon

Michael moves from Berlin to Potsdam

UPDATE "world_populaAon" SET city="Potsdam"

WHERE fname="Michael" AND lname="Berg"

(91)

Tuple Lifetime

WHERE fname="Michael" AND lname="Berg" Main Table: world_populaAon

(92)

Tuple Lifetime

0 Michael Berg m GER Potsdam 03-‐05-‐1970

92

(93)

Tuple Lifetime

□  Tuples are now available in Main Store and DiﬀerenAal Buﬀer

□  Tuples of a table are marked by a validity vector to reduce the

required amount of reorganizaAon steps

§  AddiAonal akribute vector for validity

§  1 bit required per database tuple

□  Invalidated tuples stay in the database table, unAl the next

reorganizaAon takes place

□  Query results

§  Main and delta have to be queried

(94)

recId fname lname gender country city birthday valid

0 MarAn Albrecht m GER Berlin 08-‐05-‐1955 1 1 Michael Berg m GER Berlin 03-‐05-‐1970 0 2 Hanna Schulze f GER Hamburg 04-‐04-‐1968 1 3 Anton Meyer m AUT Innsbruck 10-‐20-‐1992 1 4 Ulrike Schulze f GER Potsdam 09-‐03-‐1977 1 5 Sophie Schulze f GER Rostock 06-‐20-‐2012 1 ... ... ... ... ... ... ...

8 * 109 _Zacharias _Perdopolus _m _GRE _Athen _{03-‐12-‐1979} ₁

Tuple Lifetime

0 Michael Berg m GER Potsdam 03-‐05-‐1970 1 DiﬀerenAal

Buﬀer Main Store

94

(95)

Tuple Lifetime

recId fname lname gender country city birthday valid

0 MarAn Albrecht m GER Berlin 08-‐05-‐1955 1 1 Michael Berg m GER Berlin 03-‐05-‐1970 0 2 Hanna Schulze f GER Hamburg 04-‐04-‐1968 1 3 Anton Meyer m AUT Innsbruck 10-‐20-‐1992 1 4 Ulrike Schulze f GER Potsdam 09-‐03-‐1977 1 5 Sophie Schulze f GER Rostock 06-‐20-‐2012 1 ... ... ... ... ... ... ...

8 * 109 _Zacharias _Perdopolus _m _GRE _Athen _{03-‐12-‐1979} ₁

0 Michael Berg m GER Potsdam 03-‐05-‐1970 1 DiﬀerenAal

Buﬀer Main Store Main Table: world_populaAon

(96)

(97)

Handling Write Operations

¨  All Write operaAons (INSERT, UPDATE) are stored within a differenAal buffer (delta) first

¨  Read-‐operaAons on diﬀerenAal buﬀer are more expensive than on main store

□  DiﬀerenAal buﬀer is merged periodically with the main store

§  To avoid performance degradaAon based on large delta

(98)

Merge Overview I/II

¨  The merge process is triggered for single tables

¨  Is triggered by:

§  Amount of tuples in buﬀer

§  Cost model to

§  Schedule

§  Take query cost into account

§  Manually

(99)

Merge Overview II/II

¨  Working on data copies allows asynchronous merge

¨  Very limited interrupAon due to short lock

¨  At least twice the memory of the table needed!

Main Store Differ-ential Buffer Before Data Modifying Operations Read Operations Main Store Differential Buffer (new)

During the Merge Process

Data Modifying Operations Read Operations Main Store (new) Differ-ential Buffer (new) After Data Modifying Operations Read Operations Main Store (new) Differ-ential Buffer Merge Operation

(100)

Attribute Merge

Step 1: Dic'onary Merge

¨  Merge main and delta dicAonary

¨  OpAonally remove unused values

¨  Merge of two sorted arrays

¨  Create mapping if valueIDs changed

Step 2: Update ATribute Vector

¨  Create new merged main parAAon

¨  Update valueIDs reﬂecAng changed dicAonary

(101)

Delta

Main Store

Example

valueID fname city

0 Albert Berlin 1 Michael London 2 Nadja recID valid 0 1 1 1 2 1

recID fname city

0 2 0

1 1 1

2 0 0

Dic'onaries ATribute Vectors Validity Vector

valueID fname city recID fname city recID valid

(102)

Delta Main Store

Example

102 recID valid 0 1 1 0 2 1

0 Michael Berlin

recID valid

0 1

0 0 0

Michael moves from London to Berlin

0 Albert Berlin

1 Michael London

2 Nadja

0 2 0

1 1 1

(103)

Main Store Delta

Example

recID valid 0 0 1 0 2 1

0 Michael Berlin

1 Nadja Potsdam

recID valid

0 1

1 1

0 0 0

1 1 1

Nadja moves from Berlin to Potsdam

0 Albert Berlin

1 Michael London

2 Nadja

0 2 0

1 1 1

(104)

Example

104 recID valid 0 0 1 0 2 1

0 Michael Berlin 1 Nadja Potsdam recID valid 0 0 1 1 2 1

0 0 0

1 1 1

2 0 1

Michael moves from Berlin to Potsdam

0 Albert Berlin

1 Michael London

2 Nadja

0 2 0

1 1 1

(105)

First Step:

Validity Detection

recID valid 0 0 1 0 2 1

0 Michael Berlin 1 Nadja Potsdam recID valid 0 0 1 1 2 1

0 0 0

1 1 1

2 0 1

Will be deleted / moved into history parAAon

0 Albert Berlin

1 Michael London

2 Nadja

0 2 0

1 1 1

(106)

Main Store

Delta

First Step:

Dictionary Merge

0 Albert Berlin

1 Michael London

2 Nadja

Dic'onaries

0 Michael Berlin

1 Nadja Potsdam

Dic'onaries

city: old to new Main

valueID (old) 0 1 valueID (new) 0 NA

fname: old to new Main

valueID (old) 0 1 2 valueID (new) 0 1 2

city: Delta to Main

valueID (Delta) 0 1 valueID (Main) 0 1

fname: Delta to Main

valueID (Delta) 0 1 valueID (Main) 1 2

Mappings

New Combined Main Store

0 Albert Berlin

1 Michael Potsdam

2 Nadja

(107)

Main Store

Delta

Second Step:

New Main Column

0 Michael Berlin

1 Nadja Potsdam

recID valid

0 1

1 1

0 1 1

1 0 1

recID valid

0 1

0 0 0

fname: old to new Main

valueID (old) 0 1 2

fname: Delta to Main

valueID (Delta) 0 1

Mappings

city: old to new Main

valueID (old) 0 1

city: Delta to Main

valueID (Delta) 0 1 valueID fname city

0 Albert Berlin

1 Michael London

(108)

Main Store

Delta

Second Step:

New Main Column

0 Albert Berlin 1 Michael Potsdam 2 Nadja recID valid 0 1 1 1 2 1

0 0 0

1 2 1

2 1 1

valueID fname city recID fname city recID valid

(109)

Chapter 5:

Implications on

(110)

Learning Map of our

Online Lecture @ openHPI.de

(111)

How does it all come together?

1. Mixed Workload combining OLTP and analytic-style queries

■  Column-Stores are best suited for analytic-style queries

■  In-memory databases enable fast tuple re-construction

■  In-memory column store allows aggregation on-the-fly

2. Sparse enterprise data

■  Lightweight compression schemes are optimal

■  Increases query execution

■  Improves feasibility of in-memory databases

3. Mostly read workload

■  Read-optimized stores provide best throughput

■  i.e. compressed in-memory column-store

■  Write-optimized store as delta partition

to handle data changes is sufficient

Changed Hardware Advances in data processing (software) Complex Enterprise Applications Our focus 111

Technical Deep-Dive in a Column-Oriented In-Memory Database