Databases
ch1. Introduction to Databases
2018. Fall Instructor: Joonho Kwon [email protected] Data Science Lab @ PNU
data
data science laboratoryThese slides are based on the slides prepared by Peter Bailis.
Turing Awards in Data Management
Charles Bachman, 1973 IDS and CODASYL
Ted Codd, 1981 Relational model
You could be next!!
Jim Gray, 1998
Transaction processing
Lectures: A note about format of notes
3
These are asides / notes (still need to know these in general!)
Main point of slide / key takeaway at bottom
Definitions in blue with concept being defined bold & underlined
Warnings- pay attention here!
Take note!!
Definition of DBMS
Data models & the relational data model
Schemas & data independence
Outline
Database
A large, integrated collection of data
Models a real-world enterprise
Entities (e.g., Students, Courses)
Relationships (e.g., Alice is enrolled in 145)
What is DBMS?
5
A Database Management System (DBMS) is a piece of software designed to store and
manage databases
Give examples of DBMSs
Oracle, IBM DB2, Microsoft SQL Server, Altibase, Cubrid
Open source: MySQL (Sun/Oracle), PostgreSQL, MariaDB
Open source library: SQLite
We will focus on relational DBMSs most quarter
Database Management System
Consider building a course management system (CMS):
Students
Courses
Professors
Who takes what
Who teaches what
A Motivating, Running Example
7
Entities
Relationships
A data model is a collection of concepts for describing data
The relational model of data is the most widely used model today
Main Concept: the relation- essentially, a table
A schema is a description of a particular
collection of data, using the given data model
E.g. every relation in a relational data model has a schema describing types, etc.
Data models
Logical Schema
Students(sid: string, name: string, gpa: float)
Courses(cid: string, cname: string, credits: int)
Enrolled(sid: string, cid: string, grade: string)
Modeling the CMS
9
sid Name Gpa
101 Bob 3.2
123 Mary 3.8
Students
cid cname credits
564 564-2 4
308 417 2
Courses
sid cid Grade
123 564 A
Enrolled
Relations
Logical Schema
Students(sid: string, name: string, gpa: float)
Courses(cid: string, cname: string, credits: int)
Enrolled(sid: string, cid: string, grade: string)
Modeling the CMS
sid Name Gpa
101 Bob 3.2
123 Mary 3.8
Students
cid cname credits
564 564-2 4
308 417 2
Courses
sid cid Grade
123 564 A
Corresponding keys
Physical Schema: describes data layout
Relations as unordered files
Some data in sorted order (index)
Logical Schema: Previous slide
External Schema: (Views)
Course_info(cid: string, enrollment: integer)
Derived from other tables
Other Schemata…
11
Applications Administrators
Concept: Applications do not need to worry about how the data is structured and stored
Data independence
Logical data independence:
protection from changes in the logical structure of the data
Physical data independence:
protection from physical layout changes
I.e. should not need to ask: can we add a new entity or attribute without rewriting the application?
I.e. should not need to ask: which disks are the data stored on? Is the data indexed?
Transactions
Concurrency & locking
Atomicity & logging
Summary
Overview of DBMS topics
13
Suppose that our CMS application serves 1000’s of users or more- what are some challenges?
Challenges with Many Users
DBMS allows user to write programs as if they were the only user
Disk/SSD access is slow, DBMS hide the latency by doing more CPU work concurrently
• Security: Different users, different roles
• Performance: Need to provide concurrent access
• Consistency: Concurrency can lead to update problems
We won’t look at too much in this course, but is extremely important
A key concept is the transaction (TXN): an atomic sequence of db actions (reads/writes)
Transactions
15
Atomicity: An action either completes entirely or not at all
Acct Balance a10 20,000 a20 15,000
Transfer $3k from a10 to a20:
1. Debit $3k from a10 2. Credit $3k to a20
• Crash before 1,
• After 1 but before 2,
• After 2.
Written naively, in which states is atomicity
preserved?
DB Always preserves atomicity!
Acct Balance a10 17,000 a20 18,000
A key concept is the transaction (TXN): an atomic sequence of db actions (reads/writes)
If a user cancels a TXN, it should be as if nothing happened!
Transactions leave the DB in a consistent state
Users may write integrity constraints, e.g., ‘each course is assigned to exactly one room’
Transactions
Atomicity: An action either completes entirely or not at all
Consistency: An action results in a state which conforms to all integrity constraints
The DBMS ensures that the execution of {T1,…,Tn} is equivalent to some serial execution
One way to accomplish this: Locking
Before reading or writing, transaction requires a lock from DBMS, holds until the end
Key Idea: If Ti wants to write to an item x and Tj wants to read x, then Ti, Tj conflict. Solution via locking:
• only one winner gets the lock
• loser is blocked (waits) until winner finishes
Challenge: Scheduling Concurrent Transactions
17
A set of TXNs is isolated if their effect is as if all were executed serially
What if Ti and Tj need X and Y, and Ti asks for X before Tj, and Tj asks for Y before Ti? -> Deadlock! One is
aborted…
All concurrency issues handled by the DBMS…
DBMS ensures atomicity even if a TXN crashes!
One way to accomplish this: Write-ahead logging (WAL)
Key Idea: Keep a log of all the writes done.
After a crash, the partially executed TXNs are undone using the log
Ensuring Atomicity & Durability
Write-ahead Logging (WAL):
Before any action is finalized, a corresponding log entry is forced to disk
We assume that the log is on
“stable” storage
Persistent storage of data
Logical data model; declarative queries and updates
→ physical data independence
Multi-user concurrent access
Safety from system failures
Performance, performance, performance
Massive amounts of data (terabytes~petabytes)
High throughput (thousands~millions transactions/hour)
High availability (≥ 99.999% uptime)
Standard DBMS features
19
Much of the OS may be bypassed for performance and safety
Standard DBMS architecture
Application
DBMS
OS
Disk(s)
Queries/modifications Answers/responses
File system interface
Storage system interface
DBMS Components
21
Query Optimization and Execution Relational Operators
Access Methods
Buffer Management Disk Space Management
•Makes efficient use of disk space
Think 300,000,000 accounts!
•Makes efficient use of RAM
Think 1,000,000 simultaneous requests!
•Provides generic ways to combine data
Do you want a list of customers and accounts or the total account balance of all customers?
•Figures out the best way to answer a question
There is always more than 1 way to skin a cat…!
•Provides efficient ways to extract data
Do you need 1 record or a bunch?
Database application
•Talks to DBMS to manage data for a specific task -> e.g. app to withdraw/deposit money or provide a history of the account
Disk(s)
End users and DBMS vendors
Reduces cost and makes money
DB application programmers
Can handle more users, faster, for cheaper, and with better reliability / security guarantees!
Database administrators (DBA)
Easier time of designing logical/physical schema, handling security/authorization, tuning, crash
A Well-Designed DBMS makes many people happy!
Must still understand DB internals
DB application developer
writes programs that query and modify data
DB designer
establishes schema
DB administrator
loads data, tunes system, keeps whole thing running
Data analyst
data mining, data integration
DBMS implementor
builds the DBMS
Research on new systems
The players
23
DBMS are used to maintain, query, and manage large datasets.
Provide concurrency, recovery from crashes, quick application development, integrity, and security
Key abstractions give data independence
DBMS R&D is one of the broadest, most exciting fields in CS. Fact!
Summary of DBMS
Q&A
25