Operating System Support for
Database Management
notes from Stonebraker’s paper that appeared in Computing Practices, 1981
Anastassia Ailamaki
http://www.cs.cmu.edu/~natassa
2
© 2001 Anastassia Ailamaki
The World According to the OS
Operating System
App1
processor memory disks network
App2
App3
3
© 2001 Anastassia Ailamaki
Today’s talk
What are these DBMSs anyway?
OS Issues for DB Systems
Conclusions/Discussion…
4
© 2001 Anastassia Ailamaki
What we see
Database Management System
Queries/ Answers
Storage Client apps
DBMS: the software that reads data & answers questions
Data
Banking DB Application
Design (schema):
CUSTOMER(NAME str, AGE int, ACCNT int)
ACCOUNT(ACCT_ID int, BALANCE real)
EMPLOYEE(NAME str, ADDRESS str, SALARY int)
Query:
What is the average balance of customers under 30?
Transaction:
Transfer $1,000 from checking C01 to savings S02
Components of a DBMS
Query Compiler
query
Execution Engine Logging/Recovery
LOCK TABLE Concurrency Control Storage Manager BUFFER POOL BUFFERS Buffer Manager Schema Manager Data Definition
DBMS: a set of cooperating software modules
Transaction Manager
7
© 2001 Anastassia Ailamaki
Interface with the OS
Query Compiler
query
Execution Engine Logging/Recovery
LOCK TABLE Concurrency Control Storage Manager BUFFER POOL BUFFERS Buffer Manager Schema Manager Data Definition
Crucial modules use OS services
Transaction Manager transaction
OS INTERFACE
OS
INTERFACE
8 © 2001 Anastassia AilamakiSimilarities with the OS
Query Compiler
query
Execution Engine Logging/Recovery
LOCK TABLE Concurrency Control Storage Manager BUFFER POOL BUFFERS Buffer Manager Schema Manager Data Definition
Almost every part has an OS counterpart
Transaction Manager transaction
US
ER
S
PR
OC
ES
SE
S
FI
LE
S
M
AI
N
M
EM
OR
Y
9 © 2001 Anastassia AilamakiPast and Present Situation
Operating System
Database Management System
processor memory disks network
10
© 2001 Anastassia Ailamaki
OS Issues for DB Systems
Buffer pool management
File System
Scheduling, processes, IPC
Concurrency/Recovery
Virtual Memory
Buffer Management
Typical Unix provisions:
All file I/O goes through main memory
LRU (or approximation) stack for replacement
Prefetch on sequential access
Transparent to clients (except for “force all”)
read X
Y
Y
main memory cacheX
1. Performance
Overhead: Can be terrible for each page read
System call
Core-to-core data move
read X
X
DB cacheX
X
Main memory13
© 2001 Anastassia Ailamaki
2. Replacement policy
Typical access patterns:
Sequential scan
What is the avg. balance of customers < 30?
Cyclic (looping) sequential scan
Which employees are also our customers?
Random accesses (once)
Random accesses (many times)
Same as above, with index
Which is the best replacement for each?
14
© 2001 Anastassia Ailamaki
Replacement policy (cont.)
Sequential scan
MRU (one page)
Cyclic (looping) sequential scan:
MRU (one page) or
“fix n+1” pages
Random accesses (once)
MRU
Random accesses (many times)
LRU
Need provision for DB hints (or manage own BP)
LRU is the worst here!!!
15
© 2001 Anastassia Ailamaki
3. Prefetch
DBMS knows what it wants next
It is not always sequential
More hints needed for good performance
Further issue: Prefetched pages might replace
needed ones (why???)
16
© 2001 Anastassia Ailamaki
4. Crash Recovery: Example
Transfer $1,000 from checking A to savings B
begin transaction write begin record
read balance A into X X = X – 1000
write X into balance A write update record
read balance B into Y Y = Y – 1000
write Y into balance B write update record
commit transaction write commit record
print receipt
Crash Recovery
Deferred Updates
Force intentions list to disk
Force commit flags (after intentions list!!!)
Do updates from intentions list
WAL (Write Ahead Loggind)
Force undo/redo records
Need facilities for
Selected force out
Ordering of physical writes
Buffer Management Summary
Performance
Replacement policy
Prefetching
Crash Recovery
…desired services done “not quite right”, therefore remain unused
19
© 2001 Anastassia Ailamaki
File System Issues
In current UNIX file systems
File = byte stream
Logical order little relation to physical order
Indirect blocks (trees) + directory trees
Consequences: + Small files cheap + Large files possible + Byte model for programmers
– Large files costly
– Many physical reads/logical – Loss of sequentiality – Byte model for DBMS – Too many trees!
20
© 2001 Anastassia Ailamaki
Preferred DBMS approach
Physical contiguity
OS-level B+ trees, hashing
Let DBMS know about blocks of file
Implement records at the low level
Provide higher-level services on top of this
What really happens today:
Extents
File I/O vs. raw I/O
21
© 2001 Anastassia Ailamaki
Scheduling, Processing, IPC
DBMS needs
Shared buffer pool
Shared lock table
Critical sections
Q: Does UNIX now have shared data segments?
Shared memory? 22 © 2001 Anastassia Ailamaki
Structure Alternatives
DBMS Process DBMS Process DBMS Processuser 1 user k user 1 user k
DISK
DISK
…
…
Process-per-user Server DBMSEvaluation
Process-per-user structure Expensive context-switchingPreemption at “bad places”
DBMS’s critical sections – convoy
DBMS server
Duplication of OS services
must do own multi-tasking
Messages cost several thousand instructions
Structure Alternatives (cont.)
DBMS Process
DBMS Process
user 1 user k user 1 user k
DISK
…
…
Server Pool Disk Server
…
DISK DISK Disk Process Disk Process DBMS Process25 © 2001 Anastassia Ailamaki
Evaluation (cont.)
Server pool Internal parallelism Avoid multitasking Similar to process-per-user Disk ServerTrades messages for task switches
May be more expensive
Still has queued-up requests to locked items
Multi-agent, multi-device mgr is used today
26
© 2001 Anastassia Ailamaki
DBMS – yet another wish list
Reduced message/task overheads
Sockets
No-preemption scheduling (can the OS do this?)
“fast-path” for context-switching among DBMS
procs
Threads, threads, threads!
(first appeared on IBM MVS in the 70’s - sigh)
27
© 2001 Anastassia Ailamaki
Recovery/CC issues
OS provides:
File-level locks – too coarse
Page-level 2PL – no special index CC possible
Problems OS-supported transactions
Transaction commit point
Duplicate functions due to buffer manager
Ordering Dependencies
Xtion result independent from execution order
28
© 2001 Anastassia Ailamaki
Virtual Memory
Why not map DBMS into virtual memory?
VM approach requires (warning: old data):
4 bytes overhead/VM page
100 MB file means 100 KB page table
If page table not resident, “two-touch” page access
Extent-based files system approach
1000 consecutive blocks represented in <addr, len> (versus 100KB above to store all addresses)
4 bytes overhead/file ctl blocks can stay in memory
Super-pages
Virtual Memory (cont.)
Bind chunks of file:
DBMS must keep track of binding
Bind/unbind in Xtion very expensive
Overhead comparable to file open
Plus, all the problems from buffering!
Conclusions
OSs have problems with DBMS wish lists:
Buffer management (policies, ordering, overhead)
File systems (abstraction, sequentiality, overhead)
Process issues (structure, task/msg overhead,
scheduling)
CC/Recovery (buffer pool problems)
31
© 2001 Anastassia Ailamaki
What the DBMS wants
Operating System
Database Management System
processor memory disks network
32
© 2001 Anastassia Ailamaki
What about modern DBMS/OSs?
“no-cache” file system option in DB2
NT:
“VirtualLock” API (override some buffer
policies)
“FlushViewOfFile” API (flush portions of file)
Physical contiguity
Unix FFS tries to place a file’s data blocks in
the same cylinder group
64-bit systems will allow to map some files in VM
33
© 2001 Anastassia Ailamaki
References
Reading
Stonebraker, M., Operating System Support for
Database Management, Communications of the ACM, 24(7), 1981 (and all its “references” section!
Gray, J., and Reuter, A., Transaction Processing:
Concepts and Techniques, Morgan Kaufmann,
1993.
CMU Courses
15-415, Database Applications
15-721, Database Management Systems
15-823, Advanced Topics in DB System Performance