• No results found

datalab Databases

N/A
N/A
Protected

Academic year: 2022

Share "datalab Databases"

Copied!
49
0
0

Loading.... (view fulltext now)

Full text

(1)

Databases

03. B. Relational algebra (ch 2.4)

2018. Fall Instructor: Joonho Kwon [email protected] Data Science Lab @ PNU

datalab

(2)

How does a SQL engine work ?

RDBMS Architecture

2

SQL Query

Relational Algebra (RA) Plan

Optimized

RA Plan Execution

Declarative query (from user)

Translate to

relational algebra expresson

Find logically equivalent- but more efficient- RA expression

Execute each operator of the optimized plan!

(3)

How does a SQL engine work ?

RDBMS Architecture

SQL Query

Relational Algebra (RA) Plan

Optimized

RA Plan Execution

Relational Algebra allows us to translate declarative (SQL) queries into precise and optimizable expressions!

(4)

Five basic operators:

Selection: 𝜎

Projection: Π

Cartesian Product: ×

Union: ∪

Difference: −

Derived or auxiliary operators:

Intersection, complement

Joins (natural, equi-join, theta join, semi-join)

Renaming: ρ

Division

Relational Algebra (RA)

4

We’ll look at these first!

And also at one example of a derived operator (natural join) and a special operator (renaming)

(5)

RDBMSs use multisets

However in relational algebra formalism we will consider sets!

Also: we will consider the named perspective, where every attribute must have a unique name

attribute order does not matter…

Keep in mind: RA operates on sets!

Now on to the basic RA operators…

(6)

Returns all tuples which satisfy a condition

Notation:

sc

(R)

Examples

sSalary > 40000 (Employee)

sname = “Smith” (Employee)

The condition c can be =, <,

, >,

, <>

1. Selection (𝜎)

6

SELECT *

FROM Students WHERE gpa > 3.5;

SQL:

𝜎RA:𝑔𝑝𝑎 >3.5(𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠)

Students(sid,sname,gpa)

(7)

Selection Example

sSalary > 40000 (Employee)

SSN Name Salary

1234545 John 200000

5423341 Smith 600000

4352342 Fred 500000

SSN Name Salary

5423341 Smith 600000

4352342 Fred 500000

Another example:

(8)

Eliminates columns, then removes duplicates

Notation:

PA1,…,An

(R)

Example: project social-

security number and names:

P SSN, Name (Employee)

Output schema: Answer(SSN, Name)

2. Projection (Π)

8

SELECT DISTINCT sname, gpa

FROM Students;

SQL:

ΠRA:𝑠𝑛𝑎𝑚𝑒,𝑔𝑝𝑎(𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠)

Students(sid,sname,gpa)

(9)

Projection Example

P Name,Salary (Employee)

SSN Name Salary

1234545 John 200000

5423341 John 600000

4352342 John 200000

Name Salary

John 200000

John 600000

Another example:

(10)

Note that RA Operators are Compositional!

10

SELECT DISTINCT sname,

gpa

FROM Students WHERE gpa > 3.5;

Students(sid,sname,gpa)

How do we represent this query in RA?

Π𝑠𝑛𝑎𝑚𝑒,𝑔𝑝𝑎(𝜎𝑔𝑝𝑎>3.5(𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠))

𝜎𝑔𝑝𝑎>3.5𝑠𝑛𝑎𝑚𝑒,𝑔𝑝𝑎( 𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠))

Are these logically equivalent?

(11)

Each tuple in R1 with each tuple in R2

Notation: R1

R2

Example:

Employee  Dependents

Rare in practice; mainly used to express joins

3. Cross-Product (×)

SELECT *

FROM Students, People;

SQL:

𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠 × 𝑃𝑒𝑜𝑝𝑙𝑒RA:

Students(sid,sname,gpa) People(ssn,pname,address)

(12)

Cross-Product Example

12

ssn pname address 1234545 John 216 Rosse 5423341 Bob 217 Rosse

sid sname gpa

001 John 3.4

002 Bob 1.3

𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠 × 𝑃𝑒𝑜𝑝𝑙𝑒

×

ssn pname address sid sname gpa

1234545 John 216 Rosse 001 John 3.4

5423341 Bob 217 Rosse 001 John 3.4

1234545 John 216 Rosse 002 Bob 1.3

5423341 Bob 216 Rosse 002 Bob 1.3

People Students

Another example:

(13)

Ordering of columns is unimportant as far as contents are concerned

So cross product is commutative

for any 𝑅 and 𝑆, 𝑅×𝑆 = 𝑆×𝑅 (up to the ordering of columns)

Note: a column ordering

ssn pname address sid sname gpa

1234545 John 216 Rosse 001 John 3.4

5423341 Bob 217 Rosse 001 John 3.4

1234545 John 216 Rosse 002 Bob 1.3

5423341 Bob 216 Rosse 002 Bob 1.3

sid sname gpa ssn pname address

001 John 3.4 1234545 John 216 Rosse

001 John 3.4 5423341 Bob 217 Rosse

002 Bob 1.3 1234545 John 216 Rosse

002 Bob 1.3 5423341 Bob 216 Rosse

=

(14)

Changes the schema, not the instance

A ‘special’ operator- neither basic nor derived

Notation:

r B1,…,Bn

(R)

Note: this is shorthand for the proper form (since

names, not order matters!):

r A1B1,…,AnBn (R)

4. Renaming (𝜌)

14

SELECT

sid AS studId, sname AS name, gpa AS gradePtAvg FROM Students;

SQL:

𝜌RA:𝑠𝑡𝑢𝑑𝐼𝑑,𝑛𝑎𝑚𝑒,𝑔𝑟𝑎𝑑𝑒𝑃𝑡𝐴𝑣𝑔(𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠)

Students(sid,sname,gpa)

We care about this operator because we are working in a named perspective

(15)

Renaming example

sid sname gpa

001 John 3.4

002 Bob 1.3

𝜌𝑠𝑡𝑢𝑑𝐼𝑑,𝑛𝑎𝑚𝑒,𝑔𝑟𝑎𝑑𝑒𝑃𝑡𝐴𝑣𝑔(𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠)

Students

studId name gradePtAvg

001 John 3.4

002 Bob 1.3

Students

Another example:

(16)

Notation: R

1

⋈ R

2

Joins R

1

and R

2

on equality of

all shared attributes

If R1 has attribute set A, and R2 has attribute set B, and they share attributes A⋂B = C, can also be written: R1 𝐶 R2

5. Natural Join (⋈) (1/3)

16

SELECT DISTINCT ssid, S.name, gpa, ssn, address

FROM

Students S, People P

WHERE S.name = P.name;

SQL:

𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠 ⋈ 𝑃𝑒𝑜𝑝𝑙𝑒RA:

Students(sid,name,gpa) People(ssn,name,address)

(17)

Our first example of a derived RA operator:

Meaning:

R1 ⋈ R2 = PA U B (sC=D(𝜌𝐶→𝐷(R1)  R2))

Where:

The rename 𝜌𝐶→𝐷 renames the shared attributes in one of the relations

The selection sC=D checks equality of the shared attributes

The projection PA U B eliminates the duplicate common attributes

5. Natural Join (⋈) (2/3)

SELECT DISTINCT ssid, S.name, gpa, ssn, address

FROM

Students S, People P

WHERE S.name = P.name;

SQL:

𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠 ⋈ 𝑃𝑒𝑜𝑝𝑙𝑒RA:

Students(sid,name,gpa) People(ssn,name,address)

(18)

Input: two tables 𝑅 and 𝑆

Notation: 𝑅 ⋈ 𝑆

Purpose: relate rows from two tables, and

Enforce equality between identically named columns

Eliminate one copy of identically named columns

Shorthand for 𝜋

𝐿

(𝑅 ⋈

𝑝

𝑆), where

𝑝 equates each pair of columns common to 𝑅 and 𝑆

𝐿 is the union of column names from 𝑅 and 𝑆 (with duplicate columns removed)

5. Natural Join (⋈) (3/3)

18

(19)

Natural Join example1

ssn P.name address 1234545 John 216 Rosse 5423341 Bob 217 Rosse

sid S.name gpa

001 John 3.4

002 Bob 1.3

𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠 ⋈ 𝑃𝑒𝑜𝑝𝑙𝑒

sid S.name gpa ssn address

001 John 3.4 1234545 216 Rosse

002 Bob 1.3 5423341 216 Rosse

People P Students S

Another example:

(20)

Natural Join example2 (1/2)

20

uid name age pop

142 Bart 10 0.9

123 Milhouse 10 0.2

857 Lisa 8 0.7

456 Ralph 8 0.3

gid name

abc Book Club

gov Student Government dps Dead Putting Society

User U Group G

uid gid 142 dps 123 gov 857 abc 857 gov 456 abc 456 gov

Member M

Ordering of rows doesn’t matter

(even though output is always in someorder)

(21)

Natural Join example2 (2/2)

21

𝑈 ⋈ 𝑀 = 𝜋?(𝑈 ⋈? 𝑀)

uid gid 123 gov 857 abc 857 gov

uid name age pop gid

123 Milhouse 10 0.2 gov

857 Lisa 8 0.7 abc

857 Lisa 8 0.7 gov

uid name age pop

123 Milhouse 10 0.2

857 Lisa 8 0.7

(22)

Given schemas R(A, B, C, D), S(A, C, E), what is the schema of R ⋈ S ?

Given R(A, B, C), S(D, E), what is R ⋈ S ?

Given R(A, B), S(A, B), what is R ⋈ S ?

Natural Join

22

(23)

Example: Converting SFW Query -> RA

SELECT DISTINCT gpa,

address

FROM Students S, People P

WHERE gpa > 3.5 AND sname = pname;

How do we represent this query in RA?

Π𝑔𝑝𝑎,𝑎𝑑𝑑𝑟𝑒𝑠𝑠(𝜎𝑔𝑝𝑎>3.5(𝑆 ⋈ 𝑃))

Students(sid,sname,gpa) People(ssn,sname,address)

(24)

Given relations R(A,B) and S(B,C):

Here, projection & selection commute:

𝜎𝐴=5𝐴(𝑅)) = Π𝐴(𝜎𝐴=5(𝑅))

What about here?

𝜎𝐴=5𝐵(𝑅)) ? = Π𝐵(𝜎𝐴=5(𝑅))

Logical Equivalence of RA Plans

24

We’ll look at this in more depth later in the lecture…

(25)

How does a SQL engine work ?

RDBMS Architecture

SQL Query

Relational Algebra (RA) Plan

Optimized

RA Plan Execution

We saw how we can transform declarative SQL queries into precise, compositional RA plans

(26)

How does a SQL engine work ?

RDBMS Architecture

26

SQL Query

Relational Algebra (RA) Plan

Optimized

RA Plan Execution

We’ll look at how to then optimize these plans later in this lecture

(27)

How is the RA “plan” executed?

RDBMS Architecture

SQL Query

Relational Algebra (RA) Plan

Optimized

RA Plan Execution

We already know how to execute all the basic operators!

(28)

Natural Join / Join:

We saw how to use memory & IO cost considerations to pick the correct algorithm to execute a join with (BNLJ, SMJ, HJ…)!

Selection:

We saw how to use indexes to aid selection

Can always fall back on scan / binary search as well

Projection:

The main operation here is finding distinct values of the project tuples;

we briefly discussed how to do this with e.g. hashing or sorting

RA Plan Execution

28

(29)

Five basic operators:

Selection: 𝜎

Projection: Π

Cartesian Product: ×

Union: ∪

Difference: −

Derived or auxiliary operators:

Intersection, complement

Joins (natural, equi-join, theta join, semi-join)

Renaming: ρ

Relational Algebra (RA)

We’ll look at these

And also at some of these derived operators

(30)

Input: two tables 𝑅 and 𝑆

Notation: R  S

𝑅 and 𝑆 must have identical schema

Output:

Has the same schema as 𝑅 and 𝑆

Contains all rows in 𝑅 and all rows in 𝑆 (with duplicate rows removed)

Example:

ActiveEmployees  RetiredEmployees

1. Union ()

30

R S

(31)

Input: two tables 𝑅 and 𝑆

Notation: R – S

𝑅 and 𝑆 must have identical schema

Output:

Has the same schema as 𝑅 and 𝑆

Contains all rows in 𝑅 that are not in 𝑆

Example:

AllEmployees -- RetiredEmployees

2. Difference (–)

R S

(32)

Derived operator: intersection

Input: two tables 𝑅 and 𝑆

Notation: 𝑅 ∩ 𝑆

𝑅 and 𝑆 must have identical schema

Shorthand for R  S = R – (R – S)

Also equivalent to 𝑆 − (𝑆 − 𝑅)

And to 𝑅 ⋈ 𝑆

Output:

Has the same schema as 𝑅 and 𝑆

Contains all rows that are in both 𝑅 and 𝑆

Example

UnionizedEmployees RetiredEmployees

3. What about Intersection () ?

32

R S

(33)

A join that involves a predicate

R1 ⋈

q

R2 =

s q

(R1

R2)

Here q can be any condition

4. Theta Join (⋈

q

)

SELECT * FROM

Students,People WHERE q;

SQL:

𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠 ⋈RA: 𝜃 𝑃𝑒𝑜𝑝𝑙𝑒

Students(sid,sname,gpa) People(ssn,pname,address)

Note that natural join is a theta join + a projection.

(34)

A theta join where

q

is an equality

R1 ⋈

A=B

R2 =

s A=B

(R1

R2)

Example:

Employee ⋈ SSN=SSN Dependents

5. Equi-join (⋈

A=B

)

34

SELECT * FROM

Students S, People P

WHERE sname = pname;

SQL:

RA:

𝑆 ⋈𝑠𝑛𝑎𝑚𝑒=𝑝𝑛𝑎𝑚𝑒 𝑃

Students(sid,sname,gpa) People(ssn,pname,address)

Most common join in practice!

(35)

R ⋉ S =

P A1,…,An

(R ⋈ S)

Where A

1

, …, A

n

are the attributes in R

Example:

Employee ⋉ Dependents

6. Semijoin (⋉)

SELECT DISTINCT sid,sname,gpa FROM

Students,People WHERE

sname = pname;

SQL:

RA:

𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠 ⋉ 𝑃𝑒𝑜𝑝𝑙𝑒

Students(sid,sname,gpa) People(ssn,pname,address)

(36)

Semijoins are often used to compute natural joins in distributed databases

Semijoins in Distributed Databases

36

SSN Name

. . . . . .

SSN Dname Age

. . . . . .

Employee

Dependents

network

Employee ⋈ ssn=ssn (s age>71 (Dependents))

T = P SSN s age>71 (Dependents) R = Employee T

Answer = R Dependents

Send less data to reduce network bandwidth!

(37)

RA Expressions Can Get Complex!

Person Purchase Person Product

sname=fred sname=gizmo P pid

P ssn

seller-ssn=ssn

pid=pid buyer-ssn=ssn

P name

(38)

Names of users in Lisa’s groups

An exercise

38

𝜎𝑛𝑎𝑚𝑒 = "𝐿𝑖𝑠𝑎"

User 1. Who’s Lisa?

2.Lisa’s group 𝜋𝑔𝑖𝑑

Member

𝜋𝑢𝑖𝑑

Member

3.Users in Lisa’s groups

𝜋𝑛𝑎𝑚𝑒

User

4.Their names

Writing a query bottom-up:

(39)

IDs of groups that Lisa doesn’t belong to

Another exercise

Writing a query top-dwon:

All group IDs IDs of Lisa’s groups

𝜋𝑔𝑖𝑑 Group

𝜎𝑛𝑎𝑚𝑒 = "𝐿𝑖𝑠𝑎"

User 𝜋𝑔𝑖𝑑

Member

(40)

Who are the most popular?

Who do NOT have the highest pop rating?

Whose pop is lower than somebody else’s?

A trickier exercise

40

𝜋𝑢𝑖𝑑

User

𝜌𝑈2(𝑈𝑠𝑒𝑟)

User 𝜋𝑈1.𝑢𝑖𝑑

𝑈1 𝑈1.𝑝𝑜𝑝<𝑈2.𝑝𝑜𝑝 𝑈2

𝜌𝑈1(𝑈𝑠𝑒𝑟)

A deeper question: User

When (and why) is “ needed?

(41)

If some old output rows may need to be removed

Then the operator is non-monotone

Otherwise the operator is monotone

That is, old output rows always remain “correct” when more rows are added to the input

Formally, for a monotone operator 𝑜𝑝:

Monotone operators (exercise 2.4.6)

41

RelOp Add more rows

to the input...

What happens to the output?

(42)

Selection: 𝜎

𝑝

(𝑅)

Projection: 𝜋

𝐿

(𝑅)

Cross product: 𝑅 × 𝑆

Join: 𝑅 ⋈

𝑝

𝑆

Natural join: 𝑅 ⋈ 𝑆

Union: 𝑅 ∪ 𝑆

Difference: 𝑅 − 𝑆

Intersection: 𝑅 ∩ 𝑆

Monotone

Monotone

Monotone

Monotone

Monotone

Monotone

Monotone w.r.t. 𝑅;

non-monotone w.r.t 𝑆

Monotone

Classification of relationaloperators

42

(43)

Composition of monotone operators produces a monotone query

Old output rows remain “correct” when more rows are added to the input

Is the “highest” query monotone?

No!

Current highest pop is 0.9

Add another row with pop 0.91

Old answer is invalidated

So it must use difference!

Why is “−” needed for “highest”?

(44)

Difference

The only non-monotone operator

Projection

The only operator that removes columns

Cross product

The only operator that adds columns

Union

The only operator that allows you to add rows?

A more rigorous argument?

Selection?

Homework problem

Why do we need core operator 𝑋?

44

(45)

Duplicate handling (“bag algebra”)

Grouping and aggregation

“Extension” (or “extended projection”) to allow new column values to be computed

All these will come up when we talk about SQL

But for now we will stick to standard relational algebra without these extensions

Extensions to relational algebra

(46)

Simple

A small set of core operators

Semantics are easy to grasp

Declarative?

Yes, compared with older languages like CODASYL

Though operators do look somewhat “procedural”

Complete?

With respect to what?

Why is RA a good query language?

46

(47)

Cannot compute “transitive closure”

Find all direct and indirect relatives of Fred

Cannot express in RA !!!

Need to write C program, use a graph engine, or modern SQL…

RA has Limitations ! (1/2)

Name1 Name2 Relationship

Fred Mary Father

Mary Joe Cousin

Mary Bill Spouse

Nancy Lou Sister

(48)

But why not?

Optimization becomes undecidable

Simplicity is empowering

Besides, you can always implement it at the

application level, and recursion is added to SQL nevertheless!

RA has Limitations ! (2/2)

48

(49)

Q&A

References

Related documents