Outline of the course
Introduction
Database Systems
(E. Almeida)
Distributed Hash Tables and P2P
(C. Cassagne)
NewSQL
(D. Kim and J. Meira)
NoSQL
(D. Kim)
MapReduce
(E. Almeida and E. Lucas)
Data Streaming
(C. Cassagne)
Machine Learning
(C. Henard)
Outline of the course
Introduction
Database Systems
(E. Almeida)
Distributed Hash Tables and P2P
(C. Cassagne)
NewSQL
(D. Kim and J. Meira)
NoSQL
(D. Kim)
MapReduce
(E. Almeida and E. Lucas)
Data Streaming
(C. Cassagne)
Machine Learning
(C. Henard)
Readings
- Joseph Hellerstein, Michael Stonebraker and James Hamilton.
Architecture of a Database System.
Material at:
http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf
- Hector Garcia-Molina, Jeffrey Ullman, Jennifer Widom.
Database Systems: The Complete Book.
History
1970's
●INGRES Project at Berkeley
●
System-R Project at IBM
-> INGRES corp, Sybase, MS SQL Server, PostgreSQL
History
1980's
●DB2
●Oracle
●Informix
●Teradata
●Sybase
History
1990's
●
Postgres
●
MySQL
●
Illustra (from Postgres)
History
2000's
●
MonetDB (as open-source)
●
Vertica
●
Greenplum
●
EnterpriseDB
History
2010's
●
VoltDB (from H-Store)
●
Facebook Presto
●
Google F1
●
Google Megastore
Database Genealogy
General DBMS Architecture
SQL Interface, Apps DBMS Database Database Comm. manager Query processor Parser OptimizerAuth. manager Executor
Trans. Manager Buffer Lock Access Methods Log Shared components Catalog Adm & Monitoring
Access control
Scheduling
DB Logic Tablespace Segments Extents DBMS page File OS page Physical
Database Layout
Tab customers
Tab products Index
Tab customers
Tab products
Index on prodID
slot1 N Slot2 (empty) Slot n ... N of slots 1 0 1
Fixed size pages
(rid, attr1, attr2, …, attr n)
Record
...
rid=(i,...) Data * Number of entries n 24 16 rid=(i,...) Free space
Variable size pages
Variable record size
For write intensive applications
N-ary Storage Model (NSM)
rid=(i,...) Data * n 24 16 rid=(i,...) Free space
NSM Example
SELECT *
FROM CUSTOMER
WHERE ID=2;
INSERT INTO T
VALUES(2, 'MARIA', 18, CURITIBA, F);
SELECT AVG(AGE)
FROM CUSTOMER;
rid=(i,...) Data * n 24 16 rid=(i,...) Free space
NSM Example
SELECT *
FROM CUSTOMER
WHERE ID=2;
- Few write operations
- Read the entire tuple at once
- Aggregations may read the entire DB!!
INSERT INTO T
VALUES(2, 'MARIA', 18, CURITIBA, F);
SELECT AVG(AGE)
FROM CUSTOMER;
(rid1,attr1) (rid1,attr2)
Columnar page
(rid1,attr3) (rid2,attr1) (rid2,attr2) (rid2,attr3) Page: Attr1 Page: Attr2 Page: Attr3
... ... ...
(rid_n,attr1) (rid_n,attr2) (rid_n,attr3)
Improves aggregation and compression
Decomposition Storage Model (DSM)
[Ailamaki, VLDB02]
(rid1,attr1) (rid1,attr2) (rid1,attr3) (rid2,attr1) (rid2,attr2) (rid2,attr3) Page: Attr1 Page: Attr2 Page: Attr3
... ... ...
(rid_n,attr1) (rid_n,attr2) (rid_n,attr3)
DSM Example
SELECT *
FROM CUSTOMER
WHERE ID=2;
INSERT INTO T
VALUES(2, 'MARIA', 18, CURITIBA, F);
SELECT AVG(AGE)
FROM CUSTOMER;
(rid1,attr1) (rid1,attr2) (rid1,attr3) (rid2,attr1) (rid2,attr2) (rid2,attr3) Page: Attr1 Page: Attr2 Page: Attr3
... ... ...
(rid_n,attr1) (rid_n,attr2) (rid_n,attr3)
DSM Example
SELECT *
FROM CUSTOMER
WHERE ID=2;
INSERT INTO T
VALUES(2, 'MARIA', 18, CURITIBA, F);
SELECT AVG(AGE)
FROM CUSTOMER;
- More write operations than NSM
- Read one attribute at a time
- Aggregations are fast do not read the entire DB!!
- Agressive compression
Quiz
What types of page layout are more likely for the following
queries? Why?
What types of page layout are more likely for the following
queries? Why?
Q1
SELECT *
FROM CUSTOMERS;
Q2
Q3
SELECT SUM(BALANCE)
FROM ACCOUNT;
SELECT A.ID, A.BALANCE, C.NAME
FROM CUSTOMERS C, ACCOUNT A
WHERE A.ID=C.ID
Q4
SELECT C.CITY, SUM(A.BALANCE)
FROM CUSTOMERS C, ACCOUNT A
Adam Betty Smith
P0 k1 P1 k2 P2 ... ... km Pm
Entrada de índice
17
5 13 27
2* 3* 5* 11* 14*
Paginas não folha (só para direcionar)
Paginas folha (entradas de dados) >17
00 01 10 11 4* 12* 32* 1* 21* 10* Directory Buckets
If insert 5*, then 5=101, bucket '01' (read backwards)
Non-clustered Clustered Records
Data Index entries
Physically sorted: only one clustered index per table
Quiz
SELECT NAME, AGE
FROM CUSTOMER
WHERE AGE BETWEEN 18 AND 25;
Considering the following queries, what types of index need
to be implemented? Why?
Considering the following queries, what types of index need
to be implemented? Why?
Q1
SELECT NAME, AGE
FROM CUSTOMER
WHERE CUST_ID= 1234;
Q2
SELECT NAME, CITY, AGE, SSN
FROM CUSTOMER
WHERE AGE BETWEEN 18 AND 25
AND CITY ;
Query processing
SELECT BALANCE
FROM ACCOUNT
WHERE BALANCE > 2500;
Query
∏ BALANCE (
σ
BALANCE > 2500 (ACCOUNT))σ
BALANCE > 2500 ( ∏ BALANCE (ACCOUNT)) Query expressionQuery processing
SELECT BALANCE
FROM ACCOUNT
WHERE BALANCE > 2500;
Query
∏ BALANCE (
σ
BALANCE > 2500 (ACCOUNT))May have n plans!!!
How does the DBMS rewrite a query?
σ
BALANCE > 2500 ( ∏ BALANCE (ACCOUNT)) Query expressionQuery rewrite
Rule: Selection operations are
commutative
Query rewrite
Rule: Join operations are commutative if
the order of the attributes is not taken
Query rewrite
Query rewrite
Rule: Natural-joins are associative.
For theta-joins, this only holds if the
following join involves attributes only
from T2 and T3.
Query rewrite
Rule: It distributes when all the
attributes in the selection condition
involve only the attributes of one of the
expressions (say T1)
Query rewrite
Rule: Conjunctive selection operations
can be split into a sequence of individual
selections
Further rules can be found in the
readings.
Quiz
Please, rewrite the following expressions using the
presented rules and the following relations:
Please, rewrite the following expressions using the
presented rules and the following relations:
Q1
Q2
Q3
∏ BALANCE, C_ID (σ BALANCE > 2500 (ACCOUNT X CUSTOMER)) ∏ BALANCE, ACC_ID (CUSTOMER X ACCOUNT)∏ BALANCE, ACC_ID (CUSTOMER X (ACCOUNT X BRANCH)) ACCOUNT(BALANCE, ACC_ID, C_ID, B_ID), CUSTOMER(C_ID), BRANCH(B_ID, B_NAME)
Q4
Q5
σ BALANCE > 2500 AND C_ID > 1000 (ACCOUNT X CUSTOMER)
Quiz
∏ BALANCE, ACC_ID (CUSTOMER X ACCOUNT)
σ BALANCE > 2500 ( ∏ BALANCE, C_ID (ACCOUNT X CUSTOMER))
∏ BALANCE, ACC_ID (BRANCH x (CUSTOMER X ACCOUNT ))
ACCOUNT(BALANCE, ACC_ID, C_ID, B_ID), CUSTOMER(C_ID), BRANCH(B_ID, B_NAME)
Please, rewrite the following expressions using the
presented rules and the following relations:
Please, rewrite the following expressions using the
presented rules and the following relations:
Q1
Q2
Q3
Q4
Q5
σ BALANCE > 2500 (σ C_ID > 1000 (ACCOUNT X CUSTOMER))
Database Catalog
DBMS and BD
Data structures
Buffer and page sizes
Indices
Views
The catalog is a meta-database that stores
informations about:
Database Catalog
Statistics
Cardinality of tables and indices
Domain values
Page sizes for tables and indices
The catalog is a meta-database that stores
informations about:
Database Catalog
Operators
I/O cost for Selection ( ∏ )
Scan
Scan on sorted data
Index scan (Clustered index)
O(M), where M = number of pages
O(log M), where M = number of pages
O(h + 1), where h = height of the index tree
Index scan (Non-Clustered index)
O(h + n), where n = number of fetched tuples.
Quiz
Compute of the cost for a query selecting 10% of a table 'R'
based on the following information from the catalog:
Compute of the cost for a query selecting 10% of a table 'R'
based on the following information from the catalog:
Relation 'R' 10,000 tuples Tuples/page ratio 100 tuples
Scan
Scan on sorted Clustered index scan
Non-clustered index scan
Quiz
Compute of the I/O cost for a query selecting 10% of a table
'R' based on the following information from the catalog:
Compute of the I/O cost for a query selecting 10% of a table
'R' based on the following information from the catalog:
Relation 'R' 10,000 tuples Tuples/page ratio 100 tuples
Scan 100 I/O Scan on sorted 6.6 I/O Clustered index scan 3 I/O
Non-clustered
Transactions
Definition: “A transaction is an atomic unit of work
that is either completed in its entirety or not done at
all.” [Navathe and Elmasri, 2005]
T1 Database r(balance) balance = 200 balance += 100 r(x)
active commitedPartially Failed Terminated Begin End Read, write abort Commit commited
Transaction State Transition
ACID properties
Atomicity
: a transaction is either completed in its
entirety or not done at all
Consistency
: a transaction is consistency preserving
if its complete execution takes the database from one
consistent state to another
Isolation
: The execution of a transaction should not be
interfered
Durability
: the changes applied by a commit must
persist in the database
Schedules
A schedule orders the execution of concurrent
transactions in an interleaving fashion.
T1 T2
r(x)
r(x) w(x) w(x)
Concurrency Conflicts
Concurrency problems: dirty read and lost update
T1 T2 r(balance) balance += 100 r(x) w(balance) w(x) w(x) r(balance) T1 T2 r(balance) r(balance) balance += 100 w(x) w(x) balance -= 100 w(balance) w(balance) abort balance-=100 w(balance)
Concurrency Conflicts
Concurrency problems: dirty read and lost update
T1 T2 r(balance) balance += 100 w(balance) r(balance) T1 T2 r(balance) r(balance) balance += 100 balance -= 100 w(balance) w(balance) abort balance-=100 w(balance)