Technical Deep-Dive in a
Column-Oriented
In-Memory Database
Martin Faust
Research Group of Prof. Hasso Plattner
Hasso Plattner Institute for Software Engineering University of Potsdam
Research Group of Prof. Dr. h.c. Hasso Plattner
§ Research focuses on the technical aspects of enterprise software and
design of complex applications
§ In-Memory Data Management for Enterprise Applications
§ Human-Centered Software Design and Engineering
§ Programming Model for Enterprise
Applications
§ Maintenance and Evolution of
Service-Oriented Enterprise Software
§ RFID Technology in Enterprise Platforms
§ Teaching activities:
§ Lectures, Exercises, Seminars
§ Bachelor/Master projects
§ Master thesis
Enterprise Platform and Integration Concepts Group - Topics
The Design Thinking
Methodology
Viability Business Factors Feasibility Technical Factors Solutions Needs & Opportunities Desirability Human FactorsDemand Prediction Supply Chain Innovation Database Technology Massachusetts Institute of Technology … Software … Hardware Cloud Computing UC Berkeley Design Research In-Memory Data Management Stanford University Hasso Plattner Institute,
Enterprise Systems & Applications Chair
…
End-Users
Setup
Enables Leading Research in the Field of In-‐Memory Data Management
Learning Map of our
Online Lecture @ openHPI.de
Founda'ons for a New Enterprise Applica'on Development Era Founda'ons of Database Storage Techniques The Future of
Enterprise Compu'ng Advanced Database Storage Tech-‐ niques In-‐Memory Database Operators
Goals
Deep technical understanding of a column-‐oriented, dicAonary-‐ encoded in-‐memory database and its applicaAon in enterprise compuAng
Chapters
□ The future of enterprise compuAng
□ FoundaAons of database storage techniques
□ In-‐memory database operators
□ Advanced database storage techniques
□ ImplicaAons on ApplicaAon Development
Chapter 1:
The Future of Enterprise
Computing
New Requirements for
Enterprise Computing
□ Sensors
□ Events
□ Structured + unstructured data
□ Social networks
□ ...
Enterprise Application
Characteristics
¨ Modern enterprise resource planning (ERP) systems are challenged by mixed workloads, including OLAP-‐style queries. For example:
§ OLTP-‐style: create sales order, invoice, accounAng documents,
display customer master data or sales order
§ OLAP-‐style: dunning, available-‐to-‐promise, cross selling,
operaAonal reporAng (list open sales orders)
¨ But: Today’s data management systems are opAmized either for daily transac'onal or analy'cal workloads storing their data along rows or columns
O
n
l
ine
T
ransacAon
P
rocessing
O
P
rocessing
n
l
ine
A
nalyAcal
OLTP vs. OLAP
Drawbacks of the Separation
¨ OLAP systems do not have the latest data
¨ OLAP systems only have predefined subset of the data
¨ Cost-‐intensive ETL processes have to sync both systems
¨ SeparaAon introduces data redundancy
¨ Different data schemas introduce complexity for applicaAons combining sources
Enterprise Workloads are
Read Dominated
0 % 10 % 20 % 30 % 40 % 50 % 60 % 70 % 80 % 90 % 100% OLTP OLAP W orkload 0 % 10 % 20 % 30 % 40 % 50 % 60 % 70 % 80 % 90 % 100% TPC-C W orkload Select Insert Modification Delete Write: Read: Lookup Table Scan Range Select Insert Modification Delete Write: Read:§ Workload in enterprise applicaAons consAtutes of
§ Mainly read queries (OLTP 83%, OLAP 94%)
§ Many queries access large sets of data
Few Updates in OLTP
Pe rc en tag e of ro w s up date dCombine OLTP and OLAP data
using
modern
hardware and database systems
to create a
single source of truth,
enable
real-‐'me analy'cs
and
simplify
applicaAons and database structures.
AddiAonally,
¨ ExtracAon, transformaAon, and loading (ETL) processes
¨ Pre-‐computed aggregates and materialized views
become (almost) obsolete
.
Vision
¨ Many columns are not used even once
¨ Many columns have a low cardinality of values
¨ NULL values/default values are dominant
¨ Sparse distribuAon facilitates high compression
Standard enterprise so`ware data is sparse and wide.
Many Columns are not
Used Even Once
55% unused columns per company in average
40% unused columns across all 12 analyzed companies
0 % 10% 20% 30% 40% 50% 60% 70% 80% 1 - 32 33 - 1023 1024 - 100000000 13 % 9 % 78 % 24 % 12 % 64 %
Number of Distinct Values
Inventory Management Financial Accounting
% of Columns
A
Changes in Hardware…
… give an opportunity to re-‐think the assump'ons of yesterday because of what is possible today.
§ Main Memory becomes cheaper and larger
■ MulA-‐Core Architecture
(128 cores per server)
■ Large main memories: 2TB /
blade
■ One blade ~$50.000 =
1 enterprise class server
■ Parallel scaling across blades
■ 64-‐bit address space
■ 12TB in current servers
■ Cost-‐performance raAo rapidly
declining
■ Memory hierarchies
In the Meantime
Research has come up with…
§ Column-‐oriented data organizaAon
(the column store)
§ Sequen'al scans allow best bandwidth uAlizaAon
between CPU cores and memory
§ Independence of tuples allows easy parAAoning and
therefore parallel processing
§ Lightweight Compression
§ Reducing data amount
§ Increasing processing speed through late materializaAon
§ And more, e.g., parallel scan/join/aggregaAon
… several advances in soRware for processing data
SanssouciDB: An In-Memory
Database for Enterprise Applications
Main Memory at Blade i
Log Snapshots
Passive Data (History)
Non-Volatile Memory Recovery Logging Time travel Data aging
Query Execution Metadata TA Manager
Interface Services and Session Management
Distribution Layer at Blade i
Main Store Differential Store Active Data M er ge C ol um n C ol um n C om bi ne d C ol um n C ol um n C ol um n C om bi ne d C ol um n Indexes Inverted Object Data Guide In-‐Memory Database (IMDB)
¨ Data resides permanently in
main memory
¨ Main Memory is the primary
“persistence”
¨ SAll: logging and recovery
to/from flash
¨ Main memory access is
the new boTleneck
¨ Cache-‐conscious algorithms/
data structures are crucial
(locality is king)
Main Memory at Server i
Distribution Layer at Server i
Chapter 2:
Foundations of Database
Storage Techniques
Learning Map of our
Online Lecture @ openHPI.de
Founda'ons for a New Enterprise Applica'on Development Era Founda'ons of Database Storage Techniques The Future of
Enterprise Compu'ng Advanced Database Storage Tech-‐ niques In-‐Memory Database Operators
Memory Basics (1)
□ Memory in today’s computers has a linear address layout:
addresses start at 0x0 and go to 0xFFFFFFFFFFFFFFFF for 64bit
□ Each process has its own virtual address space
□ Virtual memory allocated by the program can distribute over
mulAple physical memory locaAons
Memory Basics (2)
□ Memory layout is only linear
□ Every higher-‐dimensional access (like two-‐dimensional
database tables) is mapped to this linear band
26
0x0 0xFFFF FFFF FFFF FFFF
FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF
Memory Hierarchy
Memory Page
Nehalem Quadcore
Core 0 Core 1 Core 2 Core 3
L3 Cache L2
L1 TLB
Main Memory Main Memory
QPI
Nehalem Quadcore
Core 0 Core 1 Core 2 Core 3
L3 Cache L2 L1 TLB QPI L1 Cacheline L2 Cacheline L3 Cacheline
Physical Data Representation
§ Row store:
§ Rows are stored consecuAvely
§ OpAmal for row-‐wise access (e.g. SELECT *)
§ Column store:
§ Columns are stored consecuAvely
§ OpAmal for akribute focused access (e.g. SUM, GROUP BY)
§ Note: concept is independent from storage type
+
28 Doc Num Doc Date Sold-To Value Status Sales Org Row 4 Row 3 Row 2 Row 1 Row-Store Column-storeRow Data Layout
□ Data is stored tuple-‐wise
□ Leverage co-‐locaAon of akributes for a single tuple
□ Low cost for tuple reconstrucAon, but higher cost for
sequenAal scan of a single akribute
C B A C B A C B A C B A C B A C B A C B A C B A Column Operation Row Operation
Columnar Data Layout
□ Data is stored akribute-‐wise
□ Leverage sequenAal scan-‐speed in main memory
for predicate evaluaAon
□ Tuple reconstrucAon is more expensive
C C C C B B B B A A A A C C C C B B B B A A A A Column Operation Row Operation
Column Column Column
Column Column Column
Row-oriented storage
A1 B1 C1 A2 B2 C2 A3 B3 C3 A4 B4 C4
Row-oriented storage
32 A1 B1 C1 A2 B2 C2 A3 B3 C3 A4 B4 C4Row-oriented storage
A1 B1 C1 A2 B2 C2
A3 B3 C3 A4 B4 C4
Row-oriented storage
34
A1 B1 C1 A2 B2 C2 A3 B3 C3
Row-oriented storage
Column-oriented storage
36 A1 B1 C1 A2 B2 C2 A3 B3 C3 A4 B4 C4Column-oriented storage
B1 C1 B2 C2 B3 C3 B4 C4 A1 A2 A3 A4Column-oriented storage
38 A1 B1 C1 A2 B2 C2 A3 B3 C3 A4 B4 C4Column-oriented storage
Motivation
□ Main memory access is the new bokleneck
□ Idea: Trade CPU Ame to compress and decompress data
□ Compression reduces number of memory accesses
□ Leads to less cache misses due to more informaAon on a
cache line
□ OperaAon directly on compressed data possible
□ Offseong with bit-‐encoded fixed-‐length data types
Dictionary Encoding Example
8 billion humans □ Akributes § first name § last name § gender § country § city § birthday à 200 byte□ Each akribute is stored
dicAonary encoded 42
Sample Data
rec ID fname lname gender city country birthday … … … …
39 John Smith m Chicago USA 12.03.1964 40 Mary Brown f London UK 12.05.1964 41 Jane Doe f Palo Alto USA 23.04.1976 42 John Doe m Palo Alto USA 17.06.1952 43 Peter Schmidt m Potsdam GER 11.11.1975
Dictionary Encoding a Column
□ A column is split into a dicAonary and an akribute vector
□ DicAonary stores all disAnct values with implicit value ID
□ Akribute vector stores value IDs for all entries in the column
□ PosiAon is implicit, not stored explicitly
□ Enables offseong with fixed-‐length data types
44
Rec ID fname
… … 39 John 40 Mary 41 Jane 42 John 43 Peter … …
Dic'onary for “fname” Value ID Value
… … 23 John 24 Mary 25 Jane 26 Peter … …
ATribute Vector for “fname” posi'on Value ID
… … 39 23 40 24 41 25 42 23 43 26 … …
Sorted Dictionary
□ DicAonary entries are sorted either by their numeric value or
lexicographically
§ DicAonary lookup complexity: O(log(n)) instead of O(n)
□ DicAonary entries can be compressed to reduce the amount
of required storage
□ SelecAon criteria with ranges are less expensive (order-‐
Data Size Examples
46
Column Cardi-‐
nality Needed Bits Item Size Plain Size (Dic'onary + Column) Size with Dic'onary Compression Factor
First
name 5 million 23 bit 50 Byte 373 GB 238.4 MB + 21.4 GB ≈ 17
Last
name 8 million 23 bit 50 Byte 373 GB 381.5 MB + 21.4 GB ≈ 17
Gender 2 1 bit 1 Byte 7 GB 2 bit + 953.7 MB ≈ 8
City 1 million 20 bit 50 Byte 373 GB 47.7 MB + 18.6 GB ≈ 20
Country 200 8 bit 47 Byte 350 GB 9.2 KB + 7.5 GB ≈ 47
Birthday 40,000 16 bit 2 Byte 15 GB 78.1 KB + 14.9 GB ≈ 1
Chapter 3:
In-Memory
Learning Map of our
Online Lecture @ openHPI.de
Founda'ons for a New Enterprise Applica'on Development Era Founda'ons of Database Storage Techniques The Future of
Enterprise Compu'ng Advanced Database Storage Tech-‐ niques In-‐Memory Database Operators
Scan Performance (1)
8 billion humans ¨ Akributes § First Name § Last Name § Gender § Country § City § Birthday è 200 byte¨ QuesAon: How many men/women?
Scan Performance (2)
50 Row 1 Row 2 Row 3 ... Row 8 x 10 First Name Last Name Gender Country City Birthday Table: humans 9Row Store – Layout
¨ Table size =
8 billion tuples x 200 bytes per tuple à ~1.6 TB
¨ Scan through all rows
with 2 MB/ms/core à ~800 seconds with 1 core
Scan Performance (3)
Row 1 Row 2 Row 3 ... First NameLast Name Country
City Birthday Table: humans Row 8 x 109 Gender
Row Store – Full Table Scan
¨ Table size =
8 billion tuples x 200 bytes per tuple à ~1.6 TB
¨ Scan through all rows
with 2 MB/ms/core à ~800 seconds with 1 core
Scan Performance (4)
52 Row 1 Row 2 Row 3 ... First NameLast Name Country
City Birthday Table: humans Data not loaded Data loaded and used Data loaded but not used Row
8 x 109
Gender
¨ 8 billion cache accesses à
64 byte
à ~512 GB
¨ Read with 2 MB/ms/core
à ~256 seconds with 1 core
Scan Performance (5)
First Name
Last Name Country
City
Birthday
Table: humans
Gender
Column Store – Layout
¨ Table size
§ Attribute vectors: ~91 GB
§ Dictionaries: ~700 MB
à Total: ~92 GB
Scan Performance (6)
54
Column Store – Full Column Scan on “Gender”
First Name
Last Name Country
City Birthday Table: humans Data not loaded Data loaded and used Data loaded but not used Gender
¨ Size of attribute vector
“gender” =
8 billion tuples x 1 bit per tuple à ~1 GB
¨ Scan through
attribute vector with 2 MB/ms/core à
Scan Performance (7)
Column Store – Full Column Scan on “Birthday”
First Name
Last Name Country
City
Birthday
Table: humans
Gender
¨ Size of attribute vector
“birthday” =
8 billion tuples x 2 Byte per tuple à ~16 GB
¨ Scan through
column with
2 MB/ms/core à
Scan Performance – Summary
56
¨ How many women, how many men?
Column
Store Row Store
Full table
scan access Stride Time in
seconds 0.5 800 256
1,600x
Tuple Reconstruction (1)
¨ All akributes are stored consecuAvely ¨ 200 byte à 4 cache accesses à 64 byte à 256 byte ¨ Read with 2 MB/ms/core à ~0.128 μs with 1 coreAccessing a record in a row store
58
Tuple Reconstruction (2)
¨ All akributes are stored in separate columns
¨ Implicit record IDs are used to
reconstruct rows
Virtual record IDs
Tuple Reconstruction (3)
¨ 1 cache access for each akribute ¨ 6 cache accesses à 64 byte à 384 byte ¨ Read with 2 MB/ms/core à ~0.192 μs with 1 core 60
Virtual record IDs
SELECT Example
SELECT fname, lname FROM world_population
WHERE country="Italy" and gender="m"
fname lname country gender
Gianluigi Buffon Italy m Lena Gercke Germany f Mario Balotelli Italy m Manuel Neuer Germany m Lukas Podolski Germany m Klaas-‐Jan Huntelaar Netherlands m
Query Plan
¨ MulAple plans exist to execute query
¨ Query OpAmizer decides which is executed
¨ Based on cost model, staAsAcs and other parameters
¨ AlternaAves
¨ Scan “country” and “gender”, posiAonal AND
¨ Scan over “country” and probe into “gender”
¨ Indices might be used
¨ Decision depends on data and query parameters like e.g. selecAvity
SELECT fname, lname FROM world_population
Query Plan (i)
PosiAonal AND:
¨ Predicates are evaluated and generate posiAon lists
¨ Intermediate posiAon lists are logically combined
¨ Final posiAon list is used for materializaAon
64
country = "Italy" gender = "m"
fname, lname position list position list positional AND π
Query Execution (i)
Posi'on 0 2 Posi'on 0 2 3 4 5 Posi'on 0 2 country = 3 ("Italy") gender = 1 ("m") ANDValue ID Dic'onary for “country” 0 Algeria 1 France 2 Germany 3 Italy 4 Netherlands … Value
ID Dic'onary for “gender”
0 f 1 m
fname lname country gender
Gianluigi Buffon 3 1 Lena Gercke 2 0 Mario Balotelli 3 1 Manuel Neuer 2 1 Lukas Podolski 2 1 Klaas-‐Jan Huntelaar 4 1
country = "Italy" Gender fname, lname position list Probe Positional Filter gender = "m"
Query Plan (ii)
Based on posiAon list produced by first selecAon,
gender column is probed.
66
Insert
□ Insert is the dominant modificaAon operaAon
§ Delete/Update can be modeled as Inserts as well (Insert-‐only
approach)
□ InserAng into a compressed in-‐memory persistence can be
expensive
§ UpdaAng sorted sequences (e.g. dicAonaries) is a challenge
§ InserAng into columnar storages is generally more expensive than
inserAng into row storages
Insert Example
rowID fname lname gender country city birthday 0 MarAn Albrecht m GER Berlin 08-‐05-‐1955
1 Michael Berg m GER Berlin 03-‐05-‐1970
2 Hanna Schulze f GER Hamburg 04-‐04-‐1968
3 Anton Meyer m AUT Innsbruck 10-‐20-‐1992
4 Sophie Schulze f GER Potsdam 09-‐03-‐1977
... ... ... ... ... ... ...
world_populaAon
INSERT INTO world_populaAon
INSERT (1) w/o new Dictionary entry
fname lname gender country city birthday
0 MarAn Albrecht m GER Berlin 08-‐05-‐1955 1 Michael Berg m GER Berlin 03-‐05-‐1970 2 Hanna Schulze f GER Hamburg 04-‐04-‐1968 3 Anton Meyer m AUT Innsbruck 10-‐20-‐1992 4 Sophie Schulze f GER Potsdam 09-‐03-‐1977 ... ... ... ... ... ... ...
INSERT INTO world_populaAon VALUES (Karen, Schulze, f, GER, Rostock, 11-‐15-‐2012)
0 0 1 1 2 3 3 2 4 3 0 Albrecht 1 Berg 2 Meyer 3 Schulze D AV
Akribute Vector (AV) DicAonary (D)
0 Albrecht
1 Berg
2 Meyer
3 Schulze
1. Look-‐up on D à entry found
INSERT INTO world_populaAon VALUES (Karen, Schulze, f, GER, Rostock, 11-‐15-‐2012)
Akribute Vector (AV) DicAonary (D)
D
fname lname gender country city birthday
0 MarAn Albrecht m GER Berlin 08-‐05-‐1955 1 Michael Berg m GER Berlin 03-‐05-‐1970 2 Hanna Schulze f GER Hamburg 04-‐04-‐1968 3 Anton Meyer m AUT Innsbruck 10-‐20-‐1992 4 Sophie Schulze f GER Potsdam 09-‐03-‐1977 ... ... ... ... ... ... ... AV 0 0 1 1 2 3 3 2 4 3
0 0 1 1 2 3 3 2 4 3 5 3
1. Look-‐up on D à entry found
2. Append ValueID to AV
INSERT INTO world_populaAon VALUES (Karen, Schulze, f, GER, Rostock, 11-‐15-‐2012)
Akribute Vector (AV) DicAonary (D)
fname lname gender country city birthday
0 MarAn Albrecht m GER Berlin 08-‐05-‐1955 1 Michael Berg m GER Berlin 03-‐05-‐1970 2 Hanna Schulze f GER Hamburg 04-‐04-‐1968 3 Anton Meyer m AUT Innsbruck 10-‐20-‐1992 4 Sophie Schulze f GER Potsdam 09-‐03-‐1977 5 Schulze ... ... ... ... ... ... ... AV 0 Albrecht 1 Berg 2 Meyer 3 Schulze D
INSERT (1) w/o new Dictionary entry
INSERT (2) with new Dictionary Entry I/II
fname lname gender country city birthday
0 MarAn Albrecht m GER Berlin 08-‐05-‐1955 1 Michael Berg m GER Berlin 03-‐05-‐1970 2 Hanna Schulze f GER Hamburg 04-‐04-‐1968 3 Anton Meyer m AUT Innsbruck 10-‐20-‐1992 4 Sophie Schulze f GER Potsdam 09-‐03-‐1977 5 Schulze
... ... ... ... ... ... ...
INSERT INTO world_populaAon VALUES (Karen, Schulze, f, GER, Rostock, 11-‐15-‐2012)
0 0 1 0 2 1 3 2 4 3 D AV 0 Berlin 1 Hamburg 2 Innsbruck 3 Potsdam
Akribute Vector (AV) DicAonary (D)
1. Look-‐up on D à no entry found
INSERT INTO world_populaAon VALUES (Karen, Schulze, f, GER, Rostock, 11-‐15-‐2012) D AV 0 0 1 0 2 1 3 2 4 3 0 Berlin 1 Hamburg 2 Innsbruck 3 Potsdam
fname lname gender country city birthday
0 MarAn Albrecht m GER Berlin 08-‐05-‐1955 1 Michael Berg m GER Berlin 03-‐05-‐1970 2 Hanna Schulze f GER Hamburg 04-‐04-‐1968 3 Anton Meyer m AUT Innsbruck 10-‐20-‐1992 4 Sophie Schulze f GER Potsdam 09-‐03-‐1977 5 Schulze
... ... ... ... ... ... ... Akribute Vector (AV) DicAonary (D)
INSERT (2) with new Dictionary Entry I/II
1. Look-‐up on D à no entry found
2. Append new value to D (no re-‐sorAng necessary)
INSERT INTO world_populaAon VALUES (Karen, Schulze, f, GER, Rostock, 11-‐15-‐2012)
0 Berlin 1 Hamburg 2 Innsbruck 3 Potsdam 4 Rostock D
fname lname gender country city birthday
0 MarAn Albrecht m GER Berlin 08-‐05-‐1955 1 Michael Berg m GER Berlin 03-‐05-‐1970 2 Hanna Schulze f GER Hamburg 04-‐04-‐1968 3 Anton Meyer m AUT Innsbruck 10-‐20-‐1992 4 Sophie Schulze f GER Potsdam 09-‐03-‐1977 5 Schulze
... ... ... ... ... ... ... Akribute Vector (AV) DicAonary (D) AV 0 0 1 0 2 1 3 2 4 3
1. Look-‐up on D à no entry found
2. Append new value to D (no re-‐sorAng necessary)
3. Append ValueID to AV
INSERT INTO world_populaAon VALUES (Karen, Schulze, f, GER, Rostock, 11-‐15-‐2012)
fname lname gender country city birthday
0 MarAn Albrecht m GER Berlin 08-‐05-‐1955 1 Michael Berg m GER Berlin 03-‐05-‐1970 2 Hanna Schulze f GER Hamburg 04-‐04-‐1968 3 Anton Meyer m AUT Innsbruck 10-‐20-‐1992 4 Sophie Schulze f GER Potsdam 09-‐03-‐1977 5 Schulze Rostock
... ... ... ... ... ... ... Akribute Vector (AV) DicAonary (D) 0 0 1 0 2 1 3 2 4 3 5 4 0 Berlin 1 Hamburg 2 Innsbruck 3 Potsdam 4 Rostock D AV
INSERT (2) with new Dictionary Entry I/II
0 Anton
1 Hanna
2 MarAn
3 Michael
4 Sophie
fname lname gender country city birthday
0 MarAn Albrecht m GER Berlin 08-‐05-‐1955 1 Michael Berg m GER Berlin 03-‐05-‐1970 2 Hanna Schulze f GER Hamburg 04-‐04-‐1968 3 Anton Meyer m AUT Innsbruck 10-‐20-‐1992 4 Sophie Schulze f GER Potsdam 09-‐03-‐1977 5 Schulze Rostock
... ... ... ... ... ... ...
INSERT INTO world_populaAon VALUES (Karen, Schulze, f, GER, Rostock, 11-‐15-‐2012) D AV 0 2 1 3 2 1 3 0 4 4
Akribute Vector (AV) DicAonary (D)
1. Look-‐up on D à no entry found
INSERT INTO world_populaAon VALUES (Karen, Schulze, f, GER, Rostock, 11-‐15-‐2012)
fname lname gender country city birthday
0 MarAn Albrecht m GER Berlin 08-‐05-‐1955 1 Michael Berg m GER Berlin 03-‐05-‐1970 2 Hanna Schulze f GER Hamburg 04-‐04-‐1968 3 Anton Meyer m AUT Innsbruck 10-‐20-‐1992 4 Sophie Schulze f GER Potsdam 09-‐03-‐1977 5 Schulze Rostock
... ... ... ... ... ... ... Akribute Vector (AV) DicAonary (D) 0 Anton 1 Hanna 2 MarAn 3 Michael 4 Sophie D AV 0 2 1 3 2 1 3 0 4 4
INSERT (2) with new Dictionary Entry II/II
1. Look-‐up on D à no entry found
2. Insert new value to D
INSERT INTO world_populaAon VALUES (Karen, Schulze, f, GER, Rostock, 11-‐15-‐2012)
fname lname gender country city birthday
0 MarAn Albrecht m GER Berlin 08-‐05-‐1955 1 Michael Berg m GER Berlin 03-‐05-‐1970 2 Hanna Schulze f GER Hamburg 04-‐04-‐1968 3 Anton Meyer m AUT Innsbruck 10-‐20-‐1992 4 Sophie Schulze f GER Potsdam 09-‐03-‐1977 5 Schulze Rostock
... ... ... ... ... ... ... Akribute Vector (AV) DicAonary (D) 0 Anton 1 Hanna 2 Karen 3 MarAn 4 Michael 5 Sophie D AV 0 2 1 3 2 1 3 0 4 4
1. Look-‐up on D à no entry found
2. Insert new value to D
INSERT INTO world_populaAon VALUES (Karen, Schulze, f, GER, Rostock, 11-‐15-‐2012)
fname lname gender country city birthday
0 MarAn Albrecht m GER Berlin 08-‐05-‐1955 1 Michael Berg m GER Berlin 03-‐05-‐1970 2 Hanna Schulze f GER Hamburg 04-‐04-‐1968 3 Anton Meyer m AUT Innsbruck 10-‐20-‐1992 4 Sophie Schulze f GER Potsdam 09-‐03-‐1977 5 Schulze Rostock
... ... ... ... ... ... ... Akribute Vector (AV) DicAonary (D) 0 Anton 1 Hanna 2 MarAn 3 Michael 4 Sophie D (old) AV 0 2 1 3 2 1 3 0 4 4 0 Anton 1 Hanna 2 Karen 3 MarAn 4 Michael 5 Sophie D (new)
INSERT (2) with new Dictionary Entry II/II
1. Look-‐up on D à no entry found
2. Insert new value to D
3. Change ValueIDs in AV
INSERT INTO world_populaAon VALUES (Karen, Schulze, f, GER, Rostock, 11-‐15-‐2012)
fname lname gender country city birthday
0 MarAn Albrecht m GER Berlin 08-‐05-‐1955 1 Michael Berg m GER Berlin 03-‐05-‐1970 2 Hanna Schulze f GER Hamburg 04-‐04-‐1968 3 Anton Meyer m AUT Innsbruck 10-‐20-‐1992 4 Sophie Schulze f GER Potsdam 09-‐03-‐1977 5 Schulze Rostock
... ... ... ... ... ... ... Akribute Vector (AV) DicAonary (D) AV (old) 0 Anton 1 Hanna 2 Karen 3 MarAn 4 Michael 5 Sophie D (new) AV (new) 0 3 1 4 2 1 3 0 4 5 0 2 1 3 2 1 3 0 4 4
1. Look-‐up on D à no entry found
2. Insert new value to D
3. Change ValueIDs in AV
4. Append new ValueID to AV
INSERT INTO world_populaAon VALUES (Karen, Schulze, f, GER, Rostock, 11-‐15-‐2012)
fname lname gender country city birthday
0 MarAn Albrecht m GER Berlin 08-‐05-‐1955 1 Michael Berg m GER Berlin 03-‐05-‐1970 2 Hanna Schulze f GER Hamburg 04-‐04-‐1968 3 Anton Meyer m AUT Innsbruck 10-‐20-‐1992 4 Sophie Schulze f GER Potsdam 09-‐03-‐1977 5 Karen Schulze Rostock
... ... ... ... ... ... ... Akribute Vector (AV) DicAonary (D) 0 Anton 1 Hanna 2 Karen 3 MarAn 4 Michael 5 Sophie D AV 0 3 1 4 2 1 3 0 4 5 5 2
INSERT (2) with new Dictionary Entry II/II
82
Changed Value IDs
Result
rowID fname lname gender country city birthday 0 MarAn Albrecht m GER Berlin 08-‐05-‐1955
1 Michael Berg m GER Berlin 03-‐05-‐1970
2 Hanna Schulze f GER Hamburg 04-‐04-‐1968
3 Anton Meyer m AUT Innsbruck 10-‐20-‐1992
4 Ulrike Schulze f GER Potsdam 09-‐03-‐1977
5 Karen Schulze f GER Rostock 11-‐15-‐2012
world_populaAon
INSERT INTO world_populaAon
Chapter 4:
Advanced Database
Storage Techniques
Learning Map of our
Online Lecture @ openHPI.de
Founda'ons for a New Enterprise Applica'on Development Era Founda'ons of Database Storage Techniques The Future of
Enterprise Compu'ng Advanced Database Storage Tech-‐ niques In-‐Memory Database Operators
Motivation
□ InserAng new tuples directly into a compressed structure can be
expensive
§ Especially when using sorted structures
§ New values can require reorganizing the dicAonary
§ Number of bits required to encode all dicAonary values can change,
Differential Buffer
¨ New values are wriken to a dedicated differenAal buffer (Delta)
¨ Cache SensiAve B+ Tree (CSB+) used for fast search on Delta
Table Main Store
Differential Buffer
Data Modifying Operations
Read Operations Table DicAonary Akribute Vector fname … (compr esse d) Main Store DicAonary Akribute Vector CSB+ fname …
DifferenAal Buffer/ Delta Write Read world_population 0 0 1 1 2 1 3 3 4 2 5 1 0 Anton 1 Hanna 2 Michael 3 Sophie 0 Angela 1 Klaus 2 Andre 0 0 1 1 2 1 3 2
8 Billion entries up to 50,000 entries
Differential Buffer
□ Inserts of new values are fast, because dicAonary and
akribute vector do not need to be resorted
□ Range selects on differenAal buffer are expensive
§ Unsorted dicAonary allows no direct comparison of value IDs
§ Scans with range selecAon need to lookup values in dicAonary for
comparisons
□ DifferenAal Buffer requires more memory:
§ Akribute vector not bit-‐compressed
DifferenAal Buffer Main Store
Tuple Lifetime
recId fname lname gender country city birthday
0 MarAn Albrecht m GER Berlin 08-‐05-‐1955 1 Michael Berg m GER Berlin 03-‐05-‐1970 2 Hanna Schulze f GER Hamburg 04-‐04-‐1968 3 Anton Meyer m AUT Innsbruck 10-‐20-‐1992 4 Ulrike Schulze f GER Potsdam 09-‐03-‐1977 5 Sophie Schulze f GER Rostock 06-‐20-‐2012 ... ... ... ... ... ... ... 8 * 109 Zacharias Perdopolus m GRE Athen 03-‐12-‐1979
Main Table: world_populaAon
Michael moves from Berlin to Potsdam
UPDATE "world_populaAon" SET city="Potsdam"
WHERE fname="Michael" AND lname="Berg"
DifferenAal Buffer Main Store
Tuple Lifetime
recId fname lname gender country city birthday
0 MarAn Albrecht m GER Berlin 08-‐05-‐1955 1 Michael Berg m GER Berlin 03-‐05-‐1970 2 Hanna Schulze f GER Hamburg 04-‐04-‐1968 3 Anton Meyer m AUT Innsbruck 10-‐20-‐1992 4 Ulrike Schulze f GER Potsdam 09-‐03-‐1977 5 Sophie Schulze f GER Rostock 06-‐20-‐2012 ... ... ... ... ... ... ... 8 * 109 Zacharias Perdopolus m GRE Athen 03-‐12-‐1979
UPDATE "world_populaAon" SET city="Potsdam"
WHERE fname="Michael" AND lname="Berg" Main Table: world_populaAon
Tuple Lifetime
recId fname lname gender country city birthday
0 MarAn Albrecht m GER Berlin 08-‐05-‐1955 1 Michael Berg m GER Berlin 03-‐05-‐1970 2 Hanna Schulze f GER Hamburg 04-‐04-‐1968 3 Anton Meyer m AUT Innsbruck 10-‐20-‐1992 4 Ulrike Schulze f GER Potsdam 09-‐03-‐1977 5 Sophie Schulze f GER Rostock 06-‐20-‐2012 ... ... ... ... ... ... ... 8 * 109 Zacharias Perdopolus m GRE Athen 03-‐12-‐1979
UPDATE "world_populaAon" SET city="Potsdam"
WHERE fname="Michael" AND lname="Berg"
DifferenAal Buffer Main Store
0 Michael Berg m GER Potsdam 03-‐05-‐1970
92
Main Table: world_populaAon
Tuple Lifetime
□ Tuples are now available in Main Store and DifferenAal Buffer
□ Tuples of a table are marked by a validity vector to reduce the
required amount of reorganizaAon steps
§ AddiAonal akribute vector for validity
§ 1 bit required per database tuple
□ Invalidated tuples stay in the database table, unAl the next
reorganizaAon takes place
□ Query results
§ Main and delta have to be queried
recId fname lname gender country city birthday valid
0 MarAn Albrecht m GER Berlin 08-‐05-‐1955 1 1 Michael Berg m GER Berlin 03-‐05-‐1970 0 2 Hanna Schulze f GER Hamburg 04-‐04-‐1968 1 3 Anton Meyer m AUT Innsbruck 10-‐20-‐1992 1 4 Ulrike Schulze f GER Potsdam 09-‐03-‐1977 1 5 Sophie Schulze f GER Rostock 06-‐20-‐2012 1 ... ... ... ... ... ... ...
8 * 109 Zacharias Perdopolus m GRE Athen 03-‐12-‐1979 1
Tuple Lifetime
UPDATE "world_populaAon" SET city="Potsdam"
WHERE fname="Michael" AND lname="Berg"
0 Michael Berg m GER Potsdam 03-‐05-‐1970 1 DifferenAal
Buffer Main Store
94
Main Table: world_populaAon
Tuple Lifetime
UPDATE "world_populaAon" SET city="Potsdam"
WHERE fname="Michael" AND lname="Berg"
recId fname lname gender country city birthday valid
0 MarAn Albrecht m GER Berlin 08-‐05-‐1955 1 1 Michael Berg m GER Berlin 03-‐05-‐1970 0 2 Hanna Schulze f GER Hamburg 04-‐04-‐1968 1 3 Anton Meyer m AUT Innsbruck 10-‐20-‐1992 1 4 Ulrike Schulze f GER Potsdam 09-‐03-‐1977 1 5 Sophie Schulze f GER Rostock 06-‐20-‐2012 1 ... ... ... ... ... ... ...
8 * 109 Zacharias Perdopolus m GRE Athen 03-‐12-‐1979 1
0 Michael Berg m GER Potsdam 03-‐05-‐1970 1 DifferenAal
Buffer Main Store Main Table: world_populaAon
Handling Write Operations
¨ All Write operaAons (INSERT, UPDATE) are stored within a differenAal buffer (delta) first
¨ Read-‐operaAons on differenAal buffer are more expensive than on main store
□ DifferenAal buffer is merged periodically with the main store
§ To avoid performance degradaAon based on large delta
Merge Overview I/II
¨ The merge process is triggered for single tables
¨ Is triggered by:
§ Amount of tuples in buffer
§ Cost model to
§ Schedule
§ Take query cost into account
§ Manually
Merge Overview II/II
¨ Working on data copies allows asynchronous merge
¨ Very limited interrupAon due to short lock
¨ At least twice the memory of the table needed!
Main Store Differ-ential Buffer Before Data Modifying Operations Read Operations Main Store Differential Buffer (new)
During the Merge Process
Data Modifying Operations Read Operations Main Store (new) Differ-ential Buffer (new) After Data Modifying Operations Read Operations Main Store (new) Differ-ential Buffer Merge Operation
Attribute Merge
Step 1: Dic'onary Merge
¨ Merge main and delta dicAonary
¨ OpAonally remove unused values
¨ Merge of two sorted arrays
¨ Create mapping if valueIDs changed
Step 2: Update ATribute Vector
¨ Create new merged main parAAon
¨ Update valueIDs reflecAng changed dicAonary
Delta
Main Store
Example
valueID fname city
0 Albert Berlin 1 Michael London 2 Nadja recID valid 0 1 1 1 2 1
recID fname city
0 2 0
1 1 1
2 0 0
Dic'onaries ATribute Vectors Validity Vector
valueID fname city recID fname city recID valid
Delta Main Store
Example
102 recID valid 0 1 1 0 2 1Dic'onaries ATribute Vectors Validity Vector
valueID fname city
0 Michael Berlin
recID valid
0 1
recID fname city
0 0 0
Dic'onaries ATribute Vectors Validity Vector
Michael moves from London to Berlin
valueID fname city
0 Albert Berlin
1 Michael London
2 Nadja
recID fname city
0 2 0
1 1 1
Main Store Delta
Example
recID valid 0 0 1 0 2 1Dic'onaries ATribute Vectors Validity Vector
valueID fname city
0 Michael Berlin
1 Nadja Potsdam
recID valid
0 1
1 1
recID fname city
0 0 0
1 1 1
Dic'onaries ATribute Vectors Validity Vector
Nadja moves from Berlin to Potsdam
valueID fname city
0 Albert Berlin
1 Michael London
2 Nadja
recID fname city
0 2 0
1 1 1
Main Store Delta
Example
104 recID valid 0 0 1 0 2 1Dic'onaries ATribute Vectors Validity Vector
valueID fname city
0 Michael Berlin 1 Nadja Potsdam recID valid 0 0 1 1 2 1
recID fname city
0 0 0
1 1 1
2 0 1
Dic'onaries ATribute Vectors Validity Vector
Michael moves from Berlin to Potsdam
valueID fname city
0 Albert Berlin
1 Michael London
2 Nadja
recID fname city
0 2 0
1 1 1
Main Store Delta
First Step:
Validity Detection
recID valid 0 0 1 0 2 1Dic'onaries ATribute Vectors Validity Vector
valueID fname city
0 Michael Berlin 1 Nadja Potsdam recID valid 0 0 1 1 2 1
recID fname city
0 0 0
1 1 1
2 0 1
Dic'onaries ATribute Vectors Validity Vector
Will be deleted / moved into history parAAon
valueID fname city
0 Albert Berlin
1 Michael London
2 Nadja
recID fname city
0 2 0
1 1 1
Main Store
Delta
First Step:
Dictionary Merge
valueID fname city0 Albert Berlin
1 Michael London
2 Nadja
Dic'onaries
valueID fname city
0 Michael Berlin
1 Nadja Potsdam
Dic'onaries
city: old to new Main
valueID (old) 0 1 valueID (new) 0 NA
fname: old to new Main
valueID (old) 0 1 2 valueID (new) 0 1 2
city: Delta to Main
valueID (Delta) 0 1 valueID (Main) 0 1
fname: Delta to Main
valueID (Delta) 0 1 valueID (Main) 1 2
Mappings
New Combined Main Store
valueID fname city
0 Albert Berlin
1 Michael Potsdam
2 Nadja
Main Store
Delta
Second Step:
New Main Column
Dic'onaries ATribute Vectors Validity Vector
valueID fname city
0 Michael Berlin
1 Nadja Potsdam
recID valid
0 1
1 1
recID fname city
0 1 1
1 0 1
Dic'onaries ATribute Vectors Validity Vector
recID valid
0 1
recID fname city
0 0 0
fname: old to new Main
valueID (old) 0 1 2
fname: Delta to Main
valueID (Delta) 0 1
Mappings
city: old to new Main
valueID (old) 0 1
city: Delta to Main
valueID (Delta) 0 1 valueID fname city
0 Albert Berlin
1 Michael London
Main Store
Delta
Second Step:
New Main Column
valueID fname city0 Albert Berlin 1 Michael Potsdam 2 Nadja recID valid 0 1 1 1 2 1
recID fname city
0 0 0
1 2 1
2 1 1
Dic'onaries ATribute Vectors Validity Vector
valueID fname city recID fname city recID valid
Dic'onaries ATribute Vectors Validity Vector
Chapter 5:
Implications on
Learning Map of our
Online Lecture @ openHPI.de
Founda'ons for a New Enterprise Applica'on Development Era Founda'ons of Database Storage Techniques The Future of
Enterprise Compu'ng Advanced Database Storage Tech-‐ niques In-‐Memory Database Operators
How does it all come together?
1. Mixed Workload combining OLTP and analytic-style queries
■ Column-Stores are best suited for analytic-style queries
■ In-memory databases enable fast tuple re-construction
■ In-memory column store allows aggregation on-the-fly
2. Sparse enterprise data
■ Lightweight compression schemes are optimal
■ Increases query execution
■ Improves feasibility of in-memory databases
3. Mostly read workload
■ Read-optimized stores provide best throughput
■ i.e. compressed in-memory column-store
■ Write-optimized store as delta partition
to handle data changes is sufficient
Changed Hardware Advances in data processing (software) Complex Enterprise Applications Our focus 111