Dr. Uwe Röhm School of Information Technologies
INFO3404/3504 - Database Systems II
Week 1:
Introduction to Database Systems II
Example: Facebook
INFO3404/3504 "Database Systems II" - 2012 (U. Röhm) 1-2
Some Facebook Statistics
INFO3404/3504 "Database Systems II" - 2012 (U. Röhm) 1-3 In September 2011, Facebook reached
over 750 million users worldwide. Over past two years:
7x growth in raw user data. Over Halloween weekend 2011: 1 billion photos were uploaded. Infrastructure:
- data centers with nx10,000 servers - several specialised data stores - sharded MySQL database for actual user database
http://en.wikipedia.org/wiki/Facebook_statistics/ http://www.socialbakers.com/facebook-statistics/
http://gigaom.com/cloud/facebook-shares-some-secrets-on-making-mysql-scale/
Wikipedia – Hardware and Software
!
Wikipedia: (
as of July 2012 - http://stats.wikimedia.org/EN/Sitemap.htm)
! 271 languages, millions of articles(English: 4 million articles with more than 12GB data)
! 21 million page views per hour (sum over all languages) 11.3 million article edits per month (all languages)
!
Software:
! January 2001:
! Wikipedia on UseModWiki (written in Perl) on Linux => articles in files ! In 2002:
! Development of own wiki platform: MediaWiki
! PHP-based with either MySQL or PostgreSQL (since v1.8) as backend ! As of 2011: MediaWiki v1.19.1
! External text search engine (Apache Lucene)
! 15 Jan 2011: 10th birthday
INFO3404 "Database Systems II" - 2011 (U. Röhm and H. Jung) 12-5
Sources: http://en.wikipedia.org/wiki/History_of_Wikipedia
12-5
Wikipedia Hardware
INFO3404/3504 "Database Systems II" - 2012 (U. Röhm) 1-6
!
Hardware:
! Clustered architecture, over 400 servers around the globe
! As of 2012: Two global locations
! Tampa, Florida
! Amsterdam, Netherlands
! (formerly also a third one in Korea) ! commodity x86_64 Linux servers
Source: http://meta.wikimedia.org/wiki/Wikimedia_servers http://wikitech.wikimedia.org/view/Server_roles
Example: News Websites
!
SPIEGEL online
(big German news paper)
! March 1994: 175,000 hits p.m.
! July 2006: 437,018,954 hits p.m.
! July 2011: 77million views per day
!
In Australia:
! smh.com.au:2.93 million visitors in April 2012 ! NineMSN:
1,558,776,000 page hits p.m. (May 2012)
(Source: Nielsen Online Ratings April 2012)
INFO3404/3504 "Database Systems II" - 2012 (U. Röhm) 1-7
Source: www.spiegel.de
Example: Scale of Today s Workloads
!
Ebay (in 2005):
!
ca. 5 billion SQL/day
!
More than 100 back-end databases
!
VISA
!
In 2002, claimed 8 minutes downtime
The Scale of Today s Data
!
It is common now to deal in units of petabytes (10
15)
!
Some examples from Wikipedia (July 2010)
! Yahoo! claimed record with 1 Petabyte Database – back in 2005 ! Google processes 24 petabytes/day
! Teradata database has capacity of 50 petabytes of compressed data
! World of Warcraft uses 1.3 petabytes of storage
! Large Hadron Collider (LHC) will generate 15 petabytes per year.
!
Manage and access large data
!
Go to seek.com.au and search for the keyword “data
architect”
!
This course will teach you to be a vendor-neutral “data
architect”
INFO3404/3504 "Database Systems II" - 2012 (U. Röhm) 1-9
Grand Theme of INFO3404/3504
!
How to efficiently deal with SCALE?
! Large collections of data (hundreds of gigabytes)
! both structured (tuples)
! and unstructured (text or (key,value) pairs)
! we are interested on cases where data does not fit into memory…. ! Shared access by large numbers of concurrent users (thousands)
! Availability – always ON
!
Questions:
! How to efficiently manage large amounts of data?
! How to efficiently find data in those collections?
! How to efficiently serve thousands of concurrent users?
! How to efficiently (and correct) execute concurrent updates and retrievals?
Key Principles
!
Data Independence
! Applications are decoupled from structure of data ! Logical data model is decoupled from physical model
!
Declarative interface
! Specify “what rather than how.”
! Separate “interface from implementation.”
!
Space can be reused but not time
! Speed-up lookups and joins!
Transparent concurrency:
Internal mechanisms to manage concurrency using
“transactions” or groups of statements.
INFO3404/3504 "Database Systems II" - 2012 (U. Röhm) 1-11
What is a DBMS?
!
A Database is a collection of data central to some
organisation or enterprise
! Essential to operation of enterprise
! State of database mirrors state of enterprise ! An improtant asset on its own
!
A Database Management System (DBMS) is a software
package that manages a database:
! Stores the database on some mass storage (+backup +recovery). ! Supports a high-level access language (e.g. SQL).
! Application describes database accesses using that language.
! DBMS interprets statements of language to perform requested
Levels of Abstraction
! Many views, single conceptual (logical) schema and physical schema.
! Views describe how users see the data.
! Conceptual schema defines logical
structure
! Physical schema describes the files and indexes used.
! In INFO3404/3504, we will look at
! the physical layer
! the translation between conceptual and physical schema, and
! cross-layer aspects of DBMS.
INFO3404/3504 "Database Systems II" - 2012 (U. Röhm) 1-1 3
Physical Schema
Conceptual Schema
View 1 View 2 View 3
Knowledge of the database internals is the pre-requisite for effective performance tuning.
Data Independence
!
Applications are insulated from how data is structured and
stored.
!
Logical data independence:
Protection from changes in logical structure of data.
!
Physical data independence:
Protection from changes in physical structure of data.
!
INFO3404:
! Which physical design choices do we have available?
! What are the advantage / disadvantages of each structure?
INFO3404/3504 "Database Systems II" - 2012 (U. Röhm) 1-15
The Nature of SQL
!
In the programming world new languages are constantly
being invented – Java, C#, Python, Ruby etc.
!
In the database world attempts to replace SQL with other
languages have consistently failed – SQL ends up being
extended.
!
Why is SQL so versatile? => The declarative nature of SQL
!
In modern parlance SQL was probably the first widely
accepted language to successfully decouple the interface
design from the implementation.
!
SQL syntax is “syntactic sugar for relational “calculus .
(“most practical thing is to have a good theory”)
INFO3404/3504 "Database Systems II" - 2012 (U. Röhm) 1-16
SQL Example
!
Find all students that are enrolled in INFO3404.
! SELECT s.sid, s.nameFROM students s, enrolled e
WHERE s.sid = e.sid AND e.cid = INFO3404
!
Given this formal description of what we want, how does a
DBMS efficiently find the information?
! Is there an index? Shall we use it? How to join the two tables?
!
Query Processing:
! Translation into internal representation ! Query optimization
Query Optimization
INFO3404/3504 "Database Systems II" - 2012 (U. Röhm) 1-18
Are they all doing the same? Which is more efficient? project sid,name
select cid= INFO3404
join
students enrolled
project sid,name
select cid= INFO3404 join
students enrolled project sid,name
select cid= INFO3404 join
students enrolled
project sid,name
Example: Perl Script vs SQL
#!perl use strict;
my $in = shift or die; # illumina sequence file
my $out = shift or die; # output file for binned sequences open (FH, "$in") or die;
my @lines = <FH>; close FH;
my %reads;
for my $line (@lines) { chomp $line;
my $read;
if ($line =~ /\:([A|T|C|G]+)\:/ || $line =~ /^([A|T|G|C]+)/) { $read = $1; if ($reads{$read}->{count}) { $reads{$read}->{count}++; } else { $reads{$read}->{count} = 1; } }} @lines = "";
open (OUT, ">$out") or warn; my $m =1; # for rank
for my $read (sort { $reads{$b}->{count} <=> $reads{$a}->{count} } keys %reads) {
print OUT ">$reads{$read}->{count}\-$m\n$read\n"; # count, rank $m++;
}
close OUT;INFO3404/3504 "Database Systems II" - 2012 (U. Röhm) 1-19
A relatively simple Perl script from a Bioinformatics project that processes a file
The Same in SQL
!
Basically a group-by/aggregation query
!
Tricky part is to determine the ‘rank’ of a result
INFO3404/3504 "Database Systems II" - 2012 (U. Röhm) 1-20
INSERT
INTO
UniqueRead
SELECT
ROW_NUMBER
()
OVER
(
ORDER
BY COUNT
(*)
DESC
),
COUNT
(*),
r_sequence
FROM
ShortReads
WHERE
r_e_id
=
1
AND
r_sg_id
=
2
AND
r_s_id
=
1
AND
CHARINDEX
('N',r_sequence)
= 0
GROUP
BY
r_sequence
Perl Script Performance
SQL Query Performance
INFO3404/3504 "Database Systems II" - 2012 (U. Röhm) 1-22
SQL => automatic
parallel query execution
(4 cores available)
Concurrency Control
!
Concurrent execution of user programs is essential for good
DBMS performance.
! Because disk accesses are frequent, and relatively slow, it is
important to keep the cpu humming by working on several user programs concurrently.
!
Interleaving actions of different user programs can lead to
inconsistency: e.g., check is cleared while account balance
is being computed.
!
DBMS ensures such problems don’t arise: users can
pretend they are using a single-user system.
!
How is this done?
!
What are the advantage / disadvantages of each approach?
Transactions
!
Key concept is
transaction,
which is an
atomic
sequence of
database actions (reads/writes).
!
Each transaction, executed completely, must leave the DB
in a
consistent state
if DB is consistent when the transaction
begins.
! Atomicity : either all statements of a transaction take effect or none.
! Consistency: If a database starts in a consistent state and each
individual transaction is consistent, then database ends in a consistent state after transactions are executed.
! Isolation: A Transaction executes as if it were the only one active in
the system.
! Durability: Transactions should survive system and media failure.
INFO3404/3504 "Database Systems II" - 2012 (U. Röhm) 1-24
Transaction Example
! T1 (John): 1. Read(x) from DB 2. If x >= 50 Then x := x -50 3. Dispense cash 4. Write(x) to DB ! T2 (Jane): 1. Read(x) from DB 2. Read(y) from DB 3. If x >= 100 Then x := x-100 4. y := y +100 5. Write(x) to DB 6. Write(y) to DB X=200 -> T1 -> T2 -> X=50 X=200 -> T1.1 -> T2.1 -> T2.2 -> T2.3 -> T2.5 -> T1.2 -> T1.4 -> X=150 Ups…Challenge: Multi-Core CPUs
!
For example, some latest ‘toy’:
! 4 x eight-core Opteron CPUs (2GHz) ! 128 GB RAM! 4 TB HDDs
! 2 x 250GB SDDs
! 2 x Gigabit Ethernet
!
Servers with 64 cores and more are available already
!
These systems can execute multiple transactions and
queries in parallel on different CPU cores
! But we have to synchronize them when they access shared data!
! The challenge: How to do this fast and correctly at the same time?
INFO3404/3504 "Database Systems II" - 2012 (U. Röhm) 1-26
Internal Structure of a DBMS
! A typical DBMS has a layered architecture ! This is one of several possible architectures; each system has its own variationsINFO3404/3504 "Database Systems II" - 2012 (U. Röhm) 1-2 7
Web Forms Application Front Ends SQL Interface
Parser Plan Executor
Optimizer Operator Evaluator
File and Access Methods
Buffer Manager
Disk Space Manager
Recovery Manager System Catalog Index Files Data Files DATABASE DBMS SQL Commands Query Evaluation Engine Transaction Manager Lock Manager Concurrency Control Source: Ramakrishnan,Gehrke, 2003."
All DBMS Have the Same Architecture?
INFO3404/3504 "Database Systems II" - 2012 (U. Röhm) 1-28
Excurse:
Operating System Process Model
!
Operating System Process
! Program execution unit! Own private address space
! Own resource handles (e.g. open files, network connections) ! Unit of OS scheduling
!
Operating System Thread
! Program execution unit without private address space
! Address space and OS resource handles are shared among all threads of the same multi-threaded process
! Scheduled by OS (‘kernel threads’)
!
Lightweight Thread Package
! an application-level construct that supports multiple threads within a
single OS process; scheduled by application-own scheduler ! Advantage: faster thread switching as done in user space
Mapping of DBMS Functionality
to OS Process Model
!
DBMS Client
! software component that communicates with a DBMS
! Either in-process (DBMS is just a linked library, eg. BerkeleyDB or SQLite)
! or via network or IPC to separate DBMS server
!
DBMS Worker Thread
! thread of execution in DBMS side to does work on behalf of a DBMS
client: executes SQL and returns result to client
! 1:1 mapping between DBMS client and one Worker Thread
!
DBMS Internal Threads
! Several housekeeping tasks that need independent execution ! Typically: decoupled writing of updated data and snapshot creation
!
Question: How to map DBMS Threads to OS Threads?
INFO3404/3504 "Database Systems II" - 2012 (U. Röhm) 1-30
Classification of DB Servers
INFO3404/3504 "Database Systems II" - 2012 (U. Röhm) 1-31 DB Server
Monolithic Server single DB process
thread per DBMS worker
Multiple Server multiple DB processes
process per DBMS worker
Symmetric per client one
DB process Asymmetric dynamic assignment of client to DB processes process pool
Monolithic Server
!
Exactly one server process for all clients (“One-to-many”)
! DBMS server process typically prioritised by operating system ! Server uses multi-threading! Own DBMS resource management
!
Examples: Sybase, MS SQL Server, Oracle on Windows
INFO3404/3504 "Database Systems II" - 2012 (U. Röhm) 1-32
Client i multi-threaded DBMS Server Client 1 CL CL Client k CL …
operating system 1 operating system n
network
Multiple Server
!
DBMS consists of several processes
! Scheduling done by the operating system!
Communication between
! server processes: via shared memory
! clients and servers: via operating system or network
!
Two variants:
! Symmetric - each client is mapped to exactly one server process:
static mapping with a certain number of n servers pre-generated " maximal degree of parallelism is n
! Asymmetric / process pool - a dispatcher connects a client to a
server process.
Again, a certain number of servers are instantiated beforehand, but degree of parallelism can be higher
Example: Asymmetric Multiple Server
INFO3404/3504 "Database Systems II" - 2012 (U. Röhm) 1-34
DBMS Server k AP Client i Server DBMS 1 Operating system 1 AP Client 1 CL CL AP Client m CL Operating system n network … … D ispa tch er shared mem
Examples: Sybase and Oracle
!
Sybase
: Monolithic Server
! single server process
! own lightweight thread-scheduling
! less shared memory
!
Oracle
: Multiple Server
! Configurable as either symmetric or asymmetric
! dedicated server
! shared server with dispatcher ! DB processes with different tasks:
! Recoverer, Process Monitor,System Monitor, Database Writer, Log Writer, Archiver, Checkpoint, Dispatcher, Lock
Example: Oracle Process Overview
INFO3404/3504 "Database Systems II" - 2012 (U. Röhm) 1-36
System Global Area LCKn RECO PMON SMON
Database
Buffer Cache Redo Log Buffer
User Offline Storage ARC0 LGWR CKPT D000 DBW0 Shared
Server Dedicated Server
User Control
Redo Log Data
Oracle Process Overview (cont d)
! User - user clients (JDBC programs, Oracle tools, etc.)
! LCKn - Lock Process (Parallel Server Option)
! RECO - Recoverer (deals with hanging distributed transactions)
! PMON - Process Monitor (cleans up after process failure)
! SMON - System Monitor (startup recovery, garbage collection)
! D000 - Dispatcher (responsible for communication with clients)
! Snnn - Share Server Processes
! DBWn - Database Writer (writes db buffer blocks back to disk)
! CKPT - Checkpoint
! LGWR - Logwriter (writes Redo information to the disk)
Information Retrieval (IR)
!
While DBMS focus on structured data (set of tuples),
Information Retrieval work with (unstructured) text
documents
!
Search and Rank Queries
!
Keyword Search Database Internals
!
Questions:
1. How should data be organized to answer the query efficiently.
2. In what order should the results be displayed.
!
Challenge: again Scale
! large document collections, large texts
! e.g. Wikipedia, USYD library, library of congress etc.
INFO3404/3504 "Database Systems II" - 2012 (U. Röhm) 1-38
IR Example
!
Find all information about DBMS Internals
!
SQL: SELECT *
FROM books
WHERE content like %DBMS%Internals% (title?)
ORDER BY ???
!
IR: keyword query DBMS internals
! Result: 1. DBMS ….DBMS Internals … DBMS … Internals …
2. DB Systems … DBMS … Internals of DBMS … 3. My Book …DBMS Internals…
…
IR versus DBMS
!
Seem like very different beasts:
!
Both support queries over large datasets, use indexing.
!
In practice, you currently have to choose between the two.
INFO3404/3504 "Database Systems II" - 2012 (U. Röhm) 1-40
IR DBMS
Imprecise Semantics Precise Semantics Keyword search SQL
Unstructured data format Structured data
Read-Mostly. Add docs occasionally Expect reasonable number of updates Page through top k results Generate full answer
Summary
!
DBMS used to maintain & query large datasets
!
Main Benefits:
! Program-Data Independence ! Controlled Data Redundancy ! Declarative Queries
! Transactions
!
DBMS internals not only of interest to DBAs,
but for every SW developer facing large-scale data
problems
!
Big Challenges: Internet-Scale and latest hardware
Next Week
!
Storage Layer
! Disks, Blocks and Pages ! Buffer Management
! Row and page structures
! Data Compression
! Row Store vs. Column Store
!
Readings:
! Ramakrishnan/Gehrke, Chapter 9
! Kifer/Bernstein/Lewis, Chapter 9
INFO3404/3504 "Database Systems II" - 2012 (U. Röhm) 1-42
Tasks for this week
!
INFO3404:
! Check for eLearning updates
!
INFO3504:
! Check for eLearning and Timetable updates
! Read Chapters 1 and 2 of
“Architecture of a Database System” by Joe Hellerstein, Michael Stonebraker and James Hamilton (available on course website) ! Download PostgreSQL 9.1 source code (latest stable version) from
postgresql.org and try to compile/make