645: Database Design and
Implementation
Yanlei Diao
University of Massachusetts Amherst
1/20/11 Yanlei Diao, University of Massachusetts Amherst
Outline
• Course topics and requirements
• Overview of databases and DBMS’s
1/20/11 Yanlei Diao, University of Massachusetts Amherst
Database Management Systems
• Fundamentals
– Data modeling – Relational design – Query language SQL – Database theory• Database implementation
– Storage, indexing– Query processing, query optimization – Transaction management
• Networked information systems
– XML, XQuery, XML query processing – Data stream systems
– Probabilistic databases
1/20/11 Yanlei Diao, University of Massachusetts Amherst
Prerequisites
• An undergraduate database or operating
systems course
• Or consent of the instructor
– Data structures and algorithms
– Computer systems
– Sufficient programming experience
1/20/11 Yanlei Diao, University of Massachusetts Amherst
Course Web Site
http://avid.cs.umass.edu/courses/645/s2011/
Or
Yanlei’s web page →
Teaching →
645 Spring 2011
1/20/11 Yanlei Diao, University of Massachusetts Amherst
Staff and Contact Information
• Instructor: Yanlei Diao
– Email: yanlei at cs.umass.edu
– Office hours: Tu Th 10:15-11:15 am, CS 232
• TA: Liping Peng
– Email: lppeng at cs.umass.edu
– Office hour: TBD
• Mailing list
– cs645 at edlab-mail.cs.umass.edu
– course related information, relevant to the whole class
• All available on the
home
page
1/20/11 Yanlei Diao, University of Massachusetts Amherst
Textbook
Database Management Systems
3
rd
Edition
Ramakrishnan and Gehrke
Lecture notes will be posted on the
schedule page before class.
- no laptop in class
1/20/11 Yanlei Diao, University of Massachusetts Amherst
Grading
• Homework: 35%
• Course Project: 30%
• Midterm: 30%
• Class participation: 5%
1/20/11 Yanlei Diao, University of Massachusetts AmherstHomework: 35%
• 5 assignments
– Written problem sets
– Programming exercises
• Posted on the
assignments
page
• Watch the due date!
• Submission: hardcopy before class on due date
• Policy on late submissions
1/20/11 Yanlei Diao, University of Massachusetts Amherst
Exam (30%)
• Midterm exam
– In-class, closed-book exam
– April 5, location TBD
• No final exam!
1/20/11 Yanlei Diao, University of Massachusetts Amherst
Project: 30%
• Groups of 2 or work individually
• Research-oriented problem
– Nobody has solved the problem
– Scientific value
• Milestones & deliverables: see the
projects
page
• Submission: via email
– Proposal, status report: before class on due date
– Final report: 5 pm on the due date
• In-class presentation
1/20/11 Yanlei Diao, University of Massachusetts Amherst
Participation (5%)
• Attend every class
1/20/11 Yanlei Diao, University of Massachusetts Amherst
Academic Honesty
• All submitted work must be your own!
– Although students are encouraged to study together, each student much produce his or her own solution to each homework. – Copying or using sections of someone else’s program or
assignment (even if it has been modified by you), or copying a solution from an external source, is not acceptable.
– The University guidelines for academic misconduct: http://www.umass.edu/dean_students/code_conduct/ acad_honest.htm
– The staff of CS 645 will be vigorous in enforcing them.
1/20/11 Yanlei Diao, University of Massachusetts Amherst
Outline
• Course topics and requirements
• Overview of databases and DBMS’s
1/20/11 Yanlei Diao, University of Massachusetts Amherst
Databases and DBMS’s
• A database is a large, integrated collection of data
• A database management system (DBMS) is a
software system designed to store and manage a
large amount of data
– Declarative interface to define data stored, add data,
update data, and query data
– Efficient querying
– Concurrent users
– Reliable storage and crash recovery
– Access control…
1/20/11 Yanlei Diao, University of Massachusetts Amherst
Early DBMS’s
• Early DBMS’s (1960’s) evolved from file systems
• Many small data items, many queries and updates
– Banking
– Airline reservations
• 1960s Navigational DBMS
– Tree-based or graph-based data model
– Manual navigation to find what you want
– No support for “search” (“search” ≠ “program”)
1/20/11 Yanlei Diao, University of Massachusetts Amherst
Relational DBMS
• Relational model (E.F. Codd, 1970)
– Data independence: hides details of physical storage
from users
– Declarative query language: say what you want, not
how to compute it
– Mathematical foundation: what queries mean,
possible implementations
• Query optimization (1970’s till now)
– Earliest: System R at IBM, INGRES at UC Berkeley
– Queries can be efficiently executed despite data
independence and declarative queries!
1/20/11 Yanlei Diao, University of Massachusetts Amherst
Commercial DBMS’s
System R
INGRES
Material in this slide based on wikipedia
Sybase
Informix
Postgres
MS SQL Server
IBM DB2
Oracle
MySQL
1/20/11 Yanlei Diao, University of Massachusetts Amherst
Recent Database Applications
• Data warehouses
– Large amounts of data over years, complex queries,
designed for analysis and reporting
– Sales data analysis, e.g., Walmart, Target, …
– Fraud analysis, e.g., credit card use, insurance
– Call record analysis, e.g., AT&T
– Schema design, data cleaning and loading, indexing,
aggregation, materialized views, data mining…
1/20/11 Yanlei Diao, University of Massachusetts Amherst
Recent Database Applications
• Electronic commerce
– E.g., amazon.com, ebay.com
– How do we integrate thousands of catalogs?
– How do we store data of sparse attributes
– How do we handle the scale?
• Social networking
– E.g., facebook.com, myspace.com, with 500 million
users at a popular site
– Real-time analysis of user behavior
– Database Security…
1/20/11 Yanlei Diao, University of Massachusetts Amherst
Recent Database Applications
• Large-scale sensing applications
– Temperature, light, humidity…
– Global Positioning System
– Radio Frequency Identification (RFID)
– Radar sensor networks
– Digital sky surveys
– How do we store, merge, and query sensing data?
– How do we support mission-critical applications?
T V
1/20/11 Yanlei Diao, University of Massachusetts Amherst
Recent Database Applications
• Bioinformatics
– Amino acid sequences: e.g., SWISS-PROT
– 3D molecular structure: e.g., PDB, CSD
– Data integration
– Pattern matching
– Approximate matching, ranking
– Automatic inference…
1/20/11 Yanlei Diao, University of Massachusetts Amherst
How does one build a database?
1/20/11 Yanlei Diao, University of Massachusetts Amherst
Example: The Internet Shop*
• DBDudes Inc.: a well-known database consulting
firm
• Barns and Nobble (B&N): a large bookstore
specializing in books on horse racing
• B&N decides to go online but needs help
• Step 0: DBDudes makes B&N agree to
– pay steep fees and
– schedule a lunch meeting for requirements analysis
* The example and all related material was taken from “Database Management Systems” Edition 3.1/20/11 Yanlei Diao, University of Massachusetts Amherst
Step 1: Requirements Analysis
• “I’d like my customers to be able to browse my catalog of
books and place orders online.”
– Books:
• For each book, B&N’s catalog contains its ISBN number, title, author, price, year of publication, …
– Customers:
• Most customers are regulars with names and addresses registered with B&N.
• New customers must first call and establish an account. – On the new website:
• Customers identify themselves before browsing and ordering. • Each order contains the ISBN of a book and a quantity. – Shipping:
• For each order, B&N ships all copies of a book together once they become available.
1/20/11 Yanlei Diao, University of Massachusetts Amherst
• A high level description of the data in terms of the
Entity-Relationship (ER)
model.
• Design review:
– What if a customer places two orders of the same book in one day? – What if a customer wants to have different books in the same order?
Step 2: Conceptual Design
cardnum ship_date
Books Orders Customers
isbn title
author qty_in_stock price
Year cid address
cname order_date
qty
ordernum
1/20/11 Yanlei Diao, University of Massachusetts Amherst
Step 3: Logical Design
• Mapping the ER diagram to the
relational model
• Access control: use
views
to restrict the access of certain
employees to customer sensitive information
CREATE TABLE Books
(isbn CHAR(10), title CHAR(80), author CHAR(80), qty_in_stock INTEGER, price REAL, year INTEGER, PRIMARY KEY(isbn)) CREATE TABLE Customers (cid INTEGER, cname CHAR(80), address CHAR(200), PRIMARY KEY(cid))
CREATE TABLE Orders (ordernum INTEGER, isbn CHAR(10), cid INTEGER, cardnum CHAR(16), qty INTEGER, order_date DATE, ship_date DATE, PRIMARY KEY(ordernum, isbn), FOREIGN KEY (isbn) REFERENCES Books, FOREIGN KEY (cid) REFERENCES Customers)
CREATE VIEW OrderInfo (isbn, cid, qty, order_date, ship_date) AS SELECT O.isbn, O.cid, O.qty,
O.order_date, O.ship_date FROM Orders O
1/20/11 Yanlei Diao, University of Massachusetts Amherst
Step 4: Schema Refinement
ordernum isbn cid cardnum qty order_date ship_date 120 0-07-11 123 40241160 2 Jan 3, 2006 Jan 6, 2006 120 1-12-23 123 40241160 1 Jan 3, 2006 Jan 11, 2006 120 0-07-24 123 40241160 3 Jan 3, 2006 Jan 26, 2006 120 ordernum 120 120 isbn 0-07-11 1-12-23 0-07-24 qty 2 1 3 Jan 26, 2006 ship_date Jan 6, 2006 Jan 11, 2006 ordernum 120 cid 123 cardnum 40241160 order_date Jan 3, 2006 Orders
Orders
Redundant Storage!
Orderlists1/20/11 Yanlei Diao, University of Massachusetts Amherst
Step 5: Internet Application Development
Database System
(MySQL, DB2…)
Application Server
(Apache Tomcat…)
Client Program
(Web Browser)
Presentation tier
Application logic tier
Data management tier
B&N Data:• Books • Customers (User login) • Orders • Orderlists B&N Business logic: • Home page • Login page • Search page • Cart page • Confirm page HTML, Javascript, Cookies PHP, JSP, Servlets, XSLT Relational, XML B&N Client: • User input • Display of output 1/20/11 Yanlei Diao, University of Massachusetts Amherst1/20/11 Yanlei Diao, University of Massachusetts Amherst
Example SQL Queries
SELECT isbn, title, author, price FROM Books
WHERE author = '<SearchString>' ORDER BY title
Search Page
SELECT cid, username, password FROM Customers
WHERE username = '<SpecifiedUsername>' Login Page
1/20/11 Yanlei Diao, University of Massachusetts Amherst
Step 6: Physical Design
• Auxiliary data structures,
indexes
, to speed up
searches, e.g., hash index, B+tree, R-tree
Books
Hash Indexon Books.isbn
Spectacular Bid title Legacies of the Turf Seattle Slew isbn 0-07-11 1-12-23 0-07-24 qty 10 0 3 Timothy Capps author Edward L. Bowen Dan Mearns price 29.95 24.95 16.95 year 2003 2000 2001 … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … H1 1 2 3 … … … N-1 isbn number 1/20/11 Yanlei Diao, University of Massachusetts Amherst
What is inside DBMS?
1/20/11 Yanlei Diao, University of Massachusetts Amherst
DBMS
DBMS Architecture
Query Parser Query Rewriter Query Optimizer Query Executor Query Processor Transactional Storage Manager Disk Space ManagerDB Access Methods
Buffer Manager
Lock Manager Log Manager
Disk Manager
1/20/11 Yanlei Diao, University of Massachusetts Amherst
Query Processor
Query Parser Query Rewriter Query Optimizer
Query Executor
SELECT C.cname, F.ordernum, F.order_date FROM Customers C, OrderInfo F WHERE C.cname = “John”
C.cid = F.cid SELECT C.cname, F.ordernum,
F.order_date FROM Customers C, OrderInfo F WHERE C.cname = “John”
C.cid = F.cid • Syntax checking • Internal representation • Handling views • Logical/semantic rewriting • Flattening subqueries
• Building a query execution plan • Efficient, if not optimal • Pull-based execution of a plan • Iterator model
SELECT C.cname, O.ordernum, O.order_date FROM Customers C, Orders O WHERE C.cname = “John”
C.cid = O.cid IndexScan Customers cname=“John” IndexScan Orders cid=cid C.cname, O.ordernum, O.order_date (Indexed Join) (On-the-fly) CREATE VIEW OrderInfo
(ordernum, cid, order_date) AS
SELECT O.ordernum, O.cid, O.order_date, FROM Orders O
1/20/11 Yanlei Diao, University of Massachusetts Amherst
Transactional Storage Manager
Access Methods
Buffer Manager
Lock Manager Log Manager
IndexScan Customers cname=“John” IndexScan Orders cid=cid C.cname, O.ordernum, O.order_date (Indexed Join) (On-the-fly) Replacement policy Concurrency: 2PL… Recovery: WAL File, B+tree, R-tree, Hash
1/20/11 Yanlei Diao, University of Massachusetts Amherst
Disk Manager
Disk Space Manager Buffer Manager
Allocate/Deallocate a page Read/Write a page Contiguous seq. of pages
Data
Indexes Catalog
Log Database
1/20/11 Yanlei Diao, University of Massachusetts Amherst
DBMS: Theory + Systems
Theory! Query Parser Query Rewriter Query Optimizer Query ExecutorDisk Space Manager DB Access Methods
Buffer Manager
Lock Manager Log Manager