Databases
03. B. Relational algebra (ch 2.4)
2018. Fall Instructor: Joonho Kwon [email protected] Data Science Lab @ PNU
datalab
How does a SQL engine work ?
RDBMS Architecture
2
SQL Query
Relational Algebra (RA) Plan
Optimized
RA Plan Execution
Declarative query (from user)
Translate to
relational algebra expresson
Find logically equivalent- but more efficient- RA expression
Execute each operator of the optimized plan!
How does a SQL engine work ?
RDBMS Architecture
SQL Query
Relational Algebra (RA) Plan
Optimized
RA Plan Execution
Relational Algebra allows us to translate declarative (SQL) queries into precise and optimizable expressions!
Five basic operators:
Selection: 𝜎
Projection: Π
Cartesian Product: ×
Union: ∪
Difference: −
Derived or auxiliary operators:
Intersection, complement
Joins (natural, equi-join, theta join, semi-join)
Renaming: ρ
Division
Relational Algebra (RA)
4
We’ll look at these first!
And also at one example of a derived operator (natural join) and a special operator (renaming)
RDBMSs use multisets
However in relational algebra formalism we will consider sets!
Also: we will consider the named perspective, where every attribute must have a unique name
attribute order does not matter…
Keep in mind: RA operates on sets!
Now on to the basic RA operators…
Returns all tuples which satisfy a condition
Notation:
sc(R)
Examples
sSalary > 40000 (Employee)
sname = “Smith” (Employee)
The condition c can be =, <,
, >,
, <>
1. Selection (𝜎)
6
SELECT *
FROM Students WHERE gpa > 3.5;
SQL:
𝜎RA:𝑔𝑝𝑎 >3.5(𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠)
Students(sid,sname,gpa)
Selection Example
sSalary > 40000 (Employee)
SSN Name Salary
1234545 John 200000
5423341 Smith 600000
4352342 Fred 500000
SSN Name Salary
5423341 Smith 600000
4352342 Fred 500000
Another example:
Eliminates columns, then removes duplicates
Notation:
PA1,…,An(R)
Example: project social-
security number and names:
P SSN, Name (Employee)
Output schema: Answer(SSN, Name)
2. Projection (Π)
8
SELECT DISTINCT sname, gpa
FROM Students;
SQL:
ΠRA:𝑠𝑛𝑎𝑚𝑒,𝑔𝑝𝑎(𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠)
Students(sid,sname,gpa)
Projection Example
P Name,Salary (Employee)
SSN Name Salary
1234545 John 200000
5423341 John 600000
4352342 John 200000
Name Salary
John 200000
John 600000
Another example:
Note that RA Operators are Compositional!
10
SELECT DISTINCT sname,
gpa
FROM Students WHERE gpa > 3.5;
Students(sid,sname,gpa)
How do we represent this query in RA?
Π𝑠𝑛𝑎𝑚𝑒,𝑔𝑝𝑎(𝜎𝑔𝑝𝑎>3.5(𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠))
𝜎𝑔𝑝𝑎>3.5(Π𝑠𝑛𝑎𝑚𝑒,𝑔𝑝𝑎( 𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠))
Are these logically equivalent?
Each tuple in R1 with each tuple in R2
Notation: R1
R2
Example:
Employee Dependents
Rare in practice; mainly used to express joins
3. Cross-Product (×)
SELECT *
FROM Students, People;
SQL:
𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠 × 𝑃𝑒𝑜𝑝𝑙𝑒RA:
Students(sid,sname,gpa) People(ssn,pname,address)
Cross-Product Example
12
ssn pname address 1234545 John 216 Rosse 5423341 Bob 217 Rosse
sid sname gpa
001 John 3.4
002 Bob 1.3
𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠 × 𝑃𝑒𝑜𝑝𝑙𝑒
×
ssn pname address sid sname gpa
1234545 John 216 Rosse 001 John 3.4
5423341 Bob 217 Rosse 001 John 3.4
1234545 John 216 Rosse 002 Bob 1.3
5423341 Bob 216 Rosse 002 Bob 1.3
People Students
Another example:
Ordering of columns is unimportant as far as contents are concerned
So cross product is commutative
for any 𝑅 and 𝑆, 𝑅×𝑆 = 𝑆×𝑅 (up to the ordering of columns)
Note: a column ordering
ssn pname address sid sname gpa
1234545 John 216 Rosse 001 John 3.4
5423341 Bob 217 Rosse 001 John 3.4
1234545 John 216 Rosse 002 Bob 1.3
5423341 Bob 216 Rosse 002 Bob 1.3
sid sname gpa ssn pname address
001 John 3.4 1234545 John 216 Rosse
001 John 3.4 5423341 Bob 217 Rosse
002 Bob 1.3 1234545 John 216 Rosse
002 Bob 1.3 5423341 Bob 216 Rosse
=
Changes the schema, not the instance
A ‘special’ operator- neither basic nor derived
Notation:
r B1,…,Bn(R)
Note: this is shorthand for the proper form (since
names, not order matters!):
r A1B1,…,AnBn (R)
4. Renaming (𝜌)
14
SELECT
sid AS studId, sname AS name, gpa AS gradePtAvg FROM Students;
SQL:
𝜌RA:𝑠𝑡𝑢𝑑𝐼𝑑,𝑛𝑎𝑚𝑒,𝑔𝑟𝑎𝑑𝑒𝑃𝑡𝐴𝑣𝑔(𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠)
Students(sid,sname,gpa)
We care about this operator because we are working in a named perspective
Renaming example
sid sname gpa
001 John 3.4
002 Bob 1.3
𝜌𝑠𝑡𝑢𝑑𝐼𝑑,𝑛𝑎𝑚𝑒,𝑔𝑟𝑎𝑑𝑒𝑃𝑡𝐴𝑣𝑔(𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠)
Students
studId name gradePtAvg
001 John 3.4
002 Bob 1.3
Students
Another example:
Notation: R
1⋈ R
2
Joins R
1and R
2on equality of
all shared attributes If R1 has attribute set A, and R2 has attribute set B, and they share attributes A⋂B = C, can also be written: R1 ⋈ 𝐶 R2
5. Natural Join (⋈) (1/3)
16
SELECT DISTINCT ssid, S.name, gpa, ssn, address
FROM
Students S, People P
WHERE S.name = P.name;
SQL:
𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠 ⋈ 𝑃𝑒𝑜𝑝𝑙𝑒RA:
Students(sid,name,gpa) People(ssn,name,address)
Our first example of a derived RA operator:
Meaning:
R1 ⋈ R2 = PA U B (sC=D(𝜌𝐶→𝐷(R1) R2))
Where:
The rename 𝜌𝐶→𝐷 renames the shared attributes in one of the relations
The selection sC=D checks equality of the shared attributes
The projection PA U B eliminates the duplicate common attributes
5. Natural Join (⋈) (2/3)
SELECT DISTINCT ssid, S.name, gpa, ssn, address
FROM
Students S, People P
WHERE S.name = P.name;
SQL:
𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠 ⋈ 𝑃𝑒𝑜𝑝𝑙𝑒RA:
Students(sid,name,gpa) People(ssn,name,address)
Input: two tables 𝑅 and 𝑆
Notation: 𝑅 ⋈ 𝑆
Purpose: relate rows from two tables, and
Enforce equality between identically named columns
Eliminate one copy of identically named columns
Shorthand for 𝜋
𝐿(𝑅 ⋈
𝑝𝑆), where
𝑝 equates each pair of columns common to 𝑅 and 𝑆
𝐿 is the union of column names from 𝑅 and 𝑆 (with duplicate columns removed)
5. Natural Join (⋈) (3/3)
18
Natural Join example1
ssn P.name address 1234545 John 216 Rosse 5423341 Bob 217 Rosse
sid S.name gpa
001 John 3.4
002 Bob 1.3
𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠 ⋈ 𝑃𝑒𝑜𝑝𝑙𝑒
⋈
sid S.name gpa ssn address
001 John 3.4 1234545 216 Rosse
002 Bob 1.3 5423341 216 Rosse
People P Students S
Another example:
Natural Join example2 (1/2)
20
uid name age pop
142 Bart 10 0.9
123 Milhouse 10 0.2
857 Lisa 8 0.7
456 Ralph 8 0.3
… … … …
gid name
abc Book Club
gov Student Government dps Dead Putting Society
… …
User U Group G
uid gid 142 dps 123 gov 857 abc 857 gov 456 abc 456 gov
… …
Member M
Ordering of rows doesn’t matter
(even though output is always in someorder)
Natural Join example2 (2/2)
21
𝑈 ⋈ 𝑀 = 𝜋?(𝑈 ⋈? 𝑀)
uid gid 123 gov 857 abc 857 gov
… …
uid name age pop gid
123 Milhouse 10 0.2 gov
857 Lisa 8 0.7 abc
857 Lisa 8 0.7 gov
… … … … …
uid name age pop
123 Milhouse 10 0.2
857 Lisa 8 0.7
… … … …
⋈
Given schemas R(A, B, C, D), S(A, C, E), what is the schema of R ⋈ S ?
Given R(A, B, C), S(D, E), what is R ⋈ S ?
Given R(A, B), S(A, B), what is R ⋈ S ?
Natural Join
22
Example: Converting SFW Query -> RA
SELECT DISTINCT gpa,
address
FROM Students S, People P
WHERE gpa > 3.5 AND sname = pname;
How do we represent this query in RA?
Π𝑔𝑝𝑎,𝑎𝑑𝑑𝑟𝑒𝑠𝑠(𝜎𝑔𝑝𝑎>3.5(𝑆 ⋈ 𝑃))
Students(sid,sname,gpa) People(ssn,sname,address)
Given relations R(A,B) and S(B,C):
Here, projection & selection commute:
𝜎𝐴=5(Π𝐴(𝑅)) = Π𝐴(𝜎𝐴=5(𝑅))
What about here?
𝜎𝐴=5(Π𝐵(𝑅)) ? = Π𝐵(𝜎𝐴=5(𝑅))
Logical Equivalence of RA Plans
24
We’ll look at this in more depth later in the lecture…
How does a SQL engine work ?
RDBMS Architecture
SQL Query
Relational Algebra (RA) Plan
Optimized
RA Plan Execution
We saw how we can transform declarative SQL queries into precise, compositional RA plans
How does a SQL engine work ?
RDBMS Architecture
26
SQL Query
Relational Algebra (RA) Plan
Optimized
RA Plan Execution
We’ll look at how to then optimize these plans later in this lecture
How is the RA “plan” executed?
RDBMS Architecture
SQL Query
Relational Algebra (RA) Plan
Optimized
RA Plan Execution
We already know how to execute all the basic operators!
Natural Join / Join:
We saw how to use memory & IO cost considerations to pick the correct algorithm to execute a join with (BNLJ, SMJ, HJ…)!
Selection:
We saw how to use indexes to aid selection
Can always fall back on scan / binary search as well
Projection:
The main operation here is finding distinct values of the project tuples;
we briefly discussed how to do this with e.g. hashing or sorting
RA Plan Execution
28
Five basic operators:
Selection: 𝜎
Projection: Π
Cartesian Product: ×
Union: ∪
Difference: −
Derived or auxiliary operators:
Intersection, complement
Joins (natural, equi-join, theta join, semi-join)
Renaming: ρ
Relational Algebra (RA)
We’ll look at these
And also at some of these derived operators
Input: two tables 𝑅 and 𝑆
Notation: R S
𝑅 and 𝑆 must have identical schema
Output:
Has the same schema as 𝑅 and 𝑆
Contains all rows in 𝑅 and all rows in 𝑆 (with duplicate rows removed)
Example:
ActiveEmployees RetiredEmployees
1. Union ( )
30
R S
Input: two tables 𝑅 and 𝑆
Notation: R – S
𝑅 and 𝑆 must have identical schema
Output:
Has the same schema as 𝑅 and 𝑆
Contains all rows in 𝑅 that are not in 𝑆
Example:
AllEmployees -- RetiredEmployees
2. Difference (–)
R S
Derived operator: intersection
Input: two tables 𝑅 and 𝑆
Notation: 𝑅 ∩ 𝑆
𝑅 and 𝑆 must have identical schema
Shorthand for R S = R – (R – S)
Also equivalent to 𝑆 − (𝑆 − 𝑅)
And to 𝑅 ⋈ 𝑆
Output:
Has the same schema as 𝑅 and 𝑆
Contains all rows that are in both 𝑅 and 𝑆
Example
UnionizedEmployees RetiredEmployees
3. What about Intersection ( ) ?
32
R S
A join that involves a predicate
R1 ⋈
qR2 =
s q(R1
R2)
Here q can be any condition
4. Theta Join (⋈
q)
SELECT * FROM
Students,People WHERE q;
SQL:
𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠 ⋈RA: 𝜃 𝑃𝑒𝑜𝑝𝑙𝑒
Students(sid,sname,gpa) People(ssn,pname,address)
Note that natural join is a theta join + a projection.
A theta join where
qis an equality
R1 ⋈
A=BR2 =
s A=B(R1
R2)
Example:
Employee ⋈ SSN=SSN Dependents
5. Equi-join (⋈
A=B)
34
SELECT * FROM
Students S, People P
WHERE sname = pname;
SQL:
RA:
𝑆 ⋈𝑠𝑛𝑎𝑚𝑒=𝑝𝑛𝑎𝑚𝑒 𝑃
Students(sid,sname,gpa) People(ssn,pname,address)
Most common join in practice!
R ⋉ S =
P A1,…,An(R ⋈ S)
Where A
1, …, A
nare the attributes in R
Example:
Employee ⋉ Dependents
6. Semijoin (⋉)
SELECT DISTINCT sid,sname,gpa FROM
Students,People WHERE
sname = pname;
SQL:
RA:
𝑆𝑡𝑢𝑑𝑒𝑛𝑡𝑠 ⋉ 𝑃𝑒𝑜𝑝𝑙𝑒
Students(sid,sname,gpa) People(ssn,pname,address)
Semijoins are often used to compute natural joins in distributed databases
Semijoins in Distributed Databases
36
SSN Name
. . . . . .
SSN Dname Age
. . . . . .
Employee
Dependents
network
Employee ⋈ ssn=ssn (s age>71 (Dependents))
T = P SSN s age>71 (Dependents) R = Employee ⋉ T
Answer = R ⋈ Dependents
Send less data to reduce network bandwidth!
RA Expressions Can Get Complex!
Person Purchase Person Product
sname=fred sname=gizmo P pid
P ssn
seller-ssn=ssn
pid=pid buyer-ssn=ssn
P name
Names of users in Lisa’s groups
An exercise
38
𝜎𝑛𝑎𝑚𝑒 = "𝐿𝑖𝑠𝑎"
User 1. Who’s Lisa?
2.Lisa’s group 𝜋𝑔𝑖𝑑
Member
⋈
𝜋𝑢𝑖𝑑
Member
⋈
3.Users in Lisa’s groups
𝜋𝑛𝑎𝑚𝑒
User
⋈
4.Their names
Writing a query bottom-up:
IDs of groups that Lisa doesn’t belong to
Another exercise
Writing a query top-dwon:
All group IDs − IDs of Lisa’s groups
𝜋𝑔𝑖𝑑 Group
𝜎𝑛𝑎𝑚𝑒 = "𝐿𝑖𝑠𝑎"
User 𝜋𝑔𝑖𝑑
Member
⋈
Who are the most popular?
Who do NOT have the highest pop rating?
Whose pop is lower than somebody else’s?
A trickier exercise
40
− 𝜋𝑢𝑖𝑑
User
𝜌𝑈2(𝑈𝑠𝑒𝑟)
User 𝜋𝑈1.𝑢𝑖𝑑
𝑈1 ⋈𝑈1.𝑝𝑜𝑝<𝑈2.𝑝𝑜𝑝 𝑈2
𝜌𝑈1(𝑈𝑠𝑒𝑟)
A deeper question: User
When (and why) is “−” needed?
If some old output rows may need to be removed
Then the operator is non-monotone
Otherwise the operator is monotone
That is, old output rows always remain “correct” when more rows are added to the input
Formally, for a monotone operator 𝑜𝑝:
′ ′ ′
Monotone operators (exercise 2.4.6)
41
RelOp Add more rows
to the input...
What happens to the output?
Selection: 𝜎
𝑝(𝑅)
Projection: 𝜋
𝐿(𝑅)
Cross product: 𝑅 × 𝑆
Join: 𝑅 ⋈
𝑝𝑆
Natural join: 𝑅 ⋈ 𝑆
Union: 𝑅 ∪ 𝑆
Difference: 𝑅 − 𝑆
Intersection: 𝑅 ∩ 𝑆
Monotone
Monotone
Monotone
Monotone
Monotone
Monotone
Monotone w.r.t. 𝑅;
non-monotone w.r.t 𝑆
Monotone
Classification of relationaloperators
42
Composition of monotone operators produces a monotone query
Old output rows remain “correct” when more rows are added to the input
Is the “highest” query monotone?
No!
Current highest pop is 0.9
Add another row with pop 0.91
Old answer is invalidated
So it must use difference!
Why is “−” needed for “highest”?
Difference
The only non-monotone operator
Projection
The only operator that removes columns
Cross product
The only operator that adds columns
Union
The only operator that allows you to add rows?
A more rigorous argument?
Selection?
Homework problem
Why do we need core operator 𝑋?
44
Duplicate handling (“bag algebra”)
Grouping and aggregation
“Extension” (or “extended projection”) to allow new column values to be computed
All these will come up when we talk about SQL
But for now we will stick to standard relational algebra without these extensions
Extensions to relational algebra
Simple
A small set of core operators
Semantics are easy to grasp
Declarative?
Yes, compared with older languages like CODASYL
Though operators do look somewhat “procedural”
Complete?
With respect to what?
Why is RA a good query language?
46
Cannot compute “transitive closure”
Find all direct and indirect relatives of Fred
Cannot express in RA !!!
Need to write C program, use a graph engine, or modern SQL…
RA has Limitations ! (1/2)
Name1 Name2 Relationship
Fred Mary Father
Mary Joe Cousin
Mary Bill Spouse
Nancy Lou Sister
But why not?
Optimization becomes undecidable
Simplicity is empowering
Besides, you can always implement it at the
application level, and recursion is added to SQL nevertheless!
RA has Limitations ! (2/2)
48