IV Distributed Databases
Motivation & Introduction
• Motivation
• Expected Benefits
• Technical issues
• Types of distributed DBS
• 12 Rules of C. Date
• Parallel vs Distributed DBS
References
M.T. Özsu, P. Valduriez: Principles of Distributed Database Systems, 2nd edition, Prentice-Hall, 1999
E. Rahm: Mehrrechner-Datenbanksysteme, Addison-Wesley, 1994
G. Vossen, G. Weikum: Transactional Information Systems: Theory, Algorithms, and the Practice of Concurrency Control and Recovery, Morgan Kaufmann, 2001, ISBN 1558605088
J. Gray, A. Reuter: Transaction Processing - Concepts and Techniques, Morgan Kaufmann, San Mateo, 1993
P.A. Bernstein, V. Hadzilacos, N. Goodman: Concurrency Control and Recovery in Database Systems, Addison-Wesley, 1987 (pdf)
P.A. Bernstein, E. Newcomer: Principles of Transaction Processing, Morgan Kaufmann, San Mateo, 1997
Material used from B. Kemme (McGill), H. Garcia-Molina (Stanford), A. Zaslavsky et al. (Monash), G. Alonso (ETH)
hs / FUB dbsII-03-10DDBIntro-3
Motivation
Application: Data "naturally" distributed
  Companies with different branches
  Airlines
  Financial business
  University / faculties
  Any organization with a decentralized organizational structure
Technology: Network infrastructure, processors, RAM
Economy: Hardware cost
Software supporting distributed processing, e.g. RPC
=> Huge number of interconnected systems
Recent challenge: Web-based computing
=> E-commerce
Goals: Improvement of non-functional characteristics
Performance:
  The more computing power, the better
  Primary goal for parallel DBS, not necessarily for distributed DBS
Reliability:
  Substitute faulty components (hardware, software, and network) seamlessly
  Fault tolerance: the ability to hide failures from users
  Related to higher availability
  95.8 % too low? Definitely: that is 1 hour of downtime per day!
Scalability:
  Upscale / downscale your system incrementally
  Central components and algorithms are counterproductive => distributed algorithms
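The availability figure above can be checked with a quick calculation. This is only a minimal sketch; the function name `downtime_per_day_hours` is an illustrative choice, not something taken from the slides.

```python
def downtime_per_day_hours(availability: float) -> float:
    """Expected downtime in hours per day for an availability given in [0, 1]."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * 24.0


# Compare the slide's figure with two higher availability levels.
for a in (0.958, 0.99, 0.999):
    print(f"{a:.1%} available -> {downtime_per_day_hours(a):.2f} h down per day")
```

An availability of 95.8 % leaves 4.2 % of each day, which is roughly one hour, unavailable.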
The dark side of distribution
Systems often less reliable
  "You will never make a system of unreliable components more reliable by adding more unreliable components"
  However: hot standby
  But: data copies must be kept consistent, complex software, unreliable network.
Scalability
  DS inherently complex
  High development cost -> middleware efforts
  High administration cost
  => lack of flexibility
The dark side …
Performance
  Double resources do not guarantee double performance
  Network performance?
  Transfer time depends not only on bandwidth

Transfer of a 4 KB page
Distance    Latency   Bandwidth   Transfer time
100 m       0.5 µs    10 Mbps     5 ms
100 m       0.5 µs    100 Mbps    0.5 ms
1 km        5 µs      100 Mbps    0.5 ms
100 km      0.5 ms    100 Mbps    1 ms
1000 km     5 ms      100 Mbps    5.5 ms
10000 km    50 ms     1 Gbps      50 ms

  Distance > 100 km => signal propagation time dominates
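The relationship behind the table can be sketched as transfer time = propagation latency + page size / bandwidth. The following is a rough model only: the propagation speed of 2·10^8 m/s is an assumption (it matches the 0.5 µs per 100 m in the table), the function names are illustrative, and only payload bits are counted, so the computed values need not match the table's rounded figures exactly.

```python
PROPAGATION_SPEED_M_PER_S = 2.0e8  # assumption: 0.5 microseconds per 100 m
PAGE_BITS = 4 * 1024 * 8           # a 4 KB page, payload only (no protocol overhead)


def latency_s(distance_m: float) -> float:
    """Signal propagation time over the given distance."""
    return distance_m / PROPAGATION_SPEED_M_PER_S


def transfer_time_s(distance_m: float, bandwidth_bps: float, bits: int = PAGE_BITS) -> float:
    """Propagation latency plus the time to push all bits onto the link."""
    return latency_s(distance_m) + bits / bandwidth_bps


def latency_share(distance_m: float, bandwidth_bps: float) -> float:
    """Fraction of the total transfer time that is pure propagation latency."""
    return latency_s(distance_m) / transfer_time_s(distance_m, bandwidth_bps)


for km, bps in ((0.1, 10e6), (100, 100e6), (1000, 100e6), (10000, 1e9)):
    total_ms = transfer_time_s(km * 1000, bps) * 1000
    print(f"{km:>7} km: {total_ms:8.3f} ms, latency share {latency_share(km * 1000, bps):.0%}")
```

Under these assumptions, latency is negligible at 100 m and accounts for most of the transfer time at 1000 km and beyond, which is the point of the slide: more bandwidth does not help once propagation time dominates.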
What is a Distributed Database?
A distributed database (DDB) is a collection of multiple, logically interrelated databases distributed over a computer network.
A distributed database management system (D-DBMS) is the software that manages the DDB and provides an access mechanism that makes this distribution transparent to the users.
Distributed database system (DDBS) = DDB + D-DBMS
Def. by M.T. Özsu, P. Valduriez
Example (1)
Transparency of distribution: one logical DB
  UPDATE emp
  SET sal = sal*1.1
  WHERE EXISTS (SELECT 1 FROM ass, proj
                WHERE proj.dur > 12 AND emp.id = ass.eid AND proj.id = ass.pid)
Expl. by B. Kemme
[Figure: the sites Berlin, New York, and Munich connected by a network; data shown: Munich projects, Munich employees, Munich assignments; NY employees, all projects; Berlin employees, all assignments]
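From the user's point of view the update is written against one logical database, wherever the tuples are physically stored. A minimal sketch of that view, using an in-memory SQLite database as a stand-in for the single logical DB: the table names `emp`, `ass`, `proj` follow the slide's WHERE clause, the sample rows are invented, and the correlated `EXISTS` subquery is just one standard way to express the join condition inside an UPDATE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp  (id INTEGER PRIMARY KEY, sal REAL);
    CREATE TABLE proj (id INTEGER PRIMARY KEY, dur INTEGER);
    CREATE TABLE ass  (eid INTEGER, pid INTEGER);
    -- invented sample data: employee 1 works on a long project, employee 2 does not
    INSERT INTO emp  VALUES (1, 1000.0), (2, 2000.0);
    INSERT INTO proj VALUES (10, 24), (11, 6);
    INSERT INTO ass  VALUES (1, 10), (2, 11);
""")

# Raise the salary of every employee assigned to a project longer than 12 months.
conn.execute("""
    UPDATE emp
    SET sal = sal * 1.1
    WHERE EXISTS (SELECT 1
                  FROM ass JOIN proj ON proj.id = ass.pid
                  WHERE ass.eid = emp.id AND proj.dur > 12)
""")

salaries = dict(conn.execute("SELECT id, sal FROM emp"))
print(salaries)
```

In a distributed setting the same statement would have to touch employee, assignment, and project data held at different sites; hiding that is what distribution transparency means.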
Example (2)
Cooperation: autonomous DBs cooperating on particular tasks
  SELECT * FROM flights
  WHERE departure = 'Montreal' AND arrival = 'Munich'
  AND date = '12/9/2002' AND price < 800
[Figure: lufthansa.com, air-canada.com, and Travel-overland.com connected by a network]
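One way to picture such cooperation is a mediator that sends the same predicate to each autonomous source and merges the answers. The sketch below is purely illustrative: the site names come from the slide, but the `Flight` record, the catalogue contents, and the functions `query_source` and `global_query` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Flight:
    source: str
    departure: str
    arrival: str
    date: str
    price: float


# Invented local catalogues; each site is queried independently.
SOURCES = {
    "lufthansa.com": [Flight("lufthansa.com", "Montreal", "Munich", "12/9/2002", 790.0)],
    "air-canada.com": [Flight("air-canada.com", "Montreal", "Munich", "12/9/2002", 850.0),
                       Flight("air-canada.com", "Montreal", "Munich", "12/9/2002", 640.0)],
    "Travel-overland.com": [Flight("Travel-overland.com", "Toronto", "Munich", "12/9/2002", 500.0)],
}


def query_source(name: str, predicate: Callable[[Flight], bool]) -> List[Flight]:
    """Stands in for a remote call; each source evaluates the predicate locally."""
    return [f for f in SOURCES[name] if predicate(f)]


def global_query(predicate: Callable[[Flight], bool]) -> List[Flight]:
    """Ask every source, then merge and order the partial results."""
    results: List[Flight] = []
    for name in SOURCES:
        results.extend(query_source(name, predicate))
    return sorted(results, key=lambda f: f.price)


offers = global_query(lambda f: f.departure == "Montreal" and f.arrival == "Munich"
                      and f.date == "12/9/2002" and f.price < 800)
for offer in offers:
    print(offer.source, offer.price)
```

The sources stay autonomous: the mediator only sees what each one chooses to return.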
Example (3)
Autonomous, heterogeneous systems, logically identical data types
  UPDATE empl SET sal = sal*0.9 WHERE jobTitle = 'product manager'
Daimler / Stuttgart: only Stuttgart data, IBM DB2
Daimler / Bremen: only Bremen data, MySQL
Chrysler / Detroit: only Detroit data, Oracle 9i
Example (4)
Sophisticated Client / Server computing
[Figure: four clients connected to Application Server A and Application Server B; possible R/W conflict]
Classification criteria
Distribution
Physically independent systems
Peer-to-peer: data distribution and sharing
Client / Server: function distribution e.g. parsing in client
Heterogeneity
DBMS software
Database schema (Types) and languages (SQL variants)
Autonomy
No global control
Local DBS operations may not be influenced by global operations (e.g. of a global transaction)
Note: subsumes completely independent or
Classification cube
by M.T. Özsu, P. Valduriez
Distributed DB: looks like one DB
Federated: more autonomy but not independent (Expl. 3)
Multi DB: independent, cooperative (Expl. 2)
Scenarios and common problems
Not just one distributed database system, but indefinitely many
Understand common problems
  e.g. how to guarantee one state for replicated data from the user's point of view
Solve by developing distributed algorithms
  e.g. transaction commit
Any unsolvable problems?
  Example: Internet marriage (priest, bride, groom)
  Distributed transaction: YES or NO, this is the question
  All participants and communication unreliable
Main issue: Partial failure
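The YES-or-NO decision can be illustrated with two-phase commit, a well-known commit protocol that the slide does not name; it is used here only as an example of a distributed commit algorithm. The sketch is deliberately simplified: `Participant`, `Vote`, and `two_phase_commit` are hypothetical names, everything runs in one process, and coordinator failure, which is where the hard cases of partial failure arise, is ignored.

```python
from enum import Enum


class Vote(Enum):
    YES = "yes"
    NO = "no"


class Participant:
    """A hypothetical participant, e.g. bride, groom, or priest."""

    def __init__(self, name: str, votes_yes: bool = True, reachable: bool = True):
        self.name = name
        self.votes_yes = votes_yes
        self.reachable = reachable
        self.state = "active"

    def prepare(self) -> Vote:
        if not self.reachable:
            raise ConnectionError(f"{self.name} did not answer")
        if self.votes_yes:
            self.state = "prepared"
            return Vote.YES
        self.state = "aborted"
        return Vote.NO

    def finish(self, decision: str) -> None:
        self.state = "committed" if decision == "commit" else "aborted"


def two_phase_commit(participants: list) -> str:
    # Phase 1 (voting): a NO vote or a missing answer forces a global abort.
    decision = "commit"
    for p in participants:
        try:
            if p.prepare() is Vote.NO:
                decision = "abort"
        except ConnectionError:
            decision = "abort"
    # Phase 2 (decision): tell every reachable participant the outcome.
    for p in participants:
        if p.reachable:
            p.finish(decision)
    return decision


wedding = [Participant("bride"), Participant("groom"), Participant("priest")]
print(two_phase_commit(wedding))
```

Either all participants commit or none does; an unreachable participant turns the global outcome into an abort.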
12 + 1 rules for DDBS (C. Date)
Rule 0: A DDB looks like a central DB to users
Rule 1: Sites should be as independent as possible - local autonomy
Rule 2: There should not be a central master all sites are dependent on - no reliance on central site
Rule 3: Never a need for complete shutdown - continuous operation
Rule 4: Users should not need to know where data are stored - location transparency (independence)
Rule 5: If data are split (e.g. columns of one relation) and distributed over several sites, users should not be aware of it - fragmentation transparency
12 rules…
Rule 6: Users should not be aware of replicated data - replication independence
Rule 7: Efficient distributed query processing
Rule 8: Global concurrency control and recovery - distributed transaction management
Rule 9: Hardware independence
Rule 10: OS independence
Rule 11: Network independence
Rule 12: DBMS independence
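Rules 4 and 5 can be made concrete with a small sketch of a relation whose columns are split over two sites. Everything here is hypothetical: the relation `emp(id, name, sal)`, the fragment contents, and the function `read_emp`. The dictionaries stand in for remote sites.

```python
# Hypothetical vertical fragments of one relation emp(id, name, sal).
SITE_A = {1: {"name": "Ada"}, 2: {"name": "Bob"}}   # holds emp(id, name)
SITE_B = {1: {"sal": 1000.0}, 2: {"sal": 2000.0}}   # holds emp(id, sal)


def read_emp(emp_id: int) -> dict:
    """Rebuild one logical emp tuple by joining the fragments on the key.

    The caller names neither a site (location transparency) nor a
    fragment (fragmentation transparency).
    """
    row = {"id": emp_id}
    for fragment in (SITE_A, SITE_B):
        row.update(fragment[emp_id])
    return row


print(read_emp(1))
```

The user-facing call mentions only the logical relation; which site contributes which column is an internal detail.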
Parallel versus Distributed Databases
More similarities than differences
Similar to the Parallel / Distributed Processing distinction
Parallel DBS
  Not geographically distributed
  Goal: High performance
  Homogeneous software
  Fast interconnect
Distributed DBS
  Data geographically distributed
  Goal: Data sharing
  Disconnected operation possible -> autonomy
  Transparency
Parallel / distributed DBS
Query processing in parallel DBS
Distribute operators (sort, filter, …) and data over processors to make complex processing fast
  Join (R, S) {   // |R| >> |S|
    1. Split R into n-1 partitions Ri and assign to Mi/Pi;
       assign S to processor / memory Pn / Mn;
    2. Sort Ri and S;   // n sorts in parallel
    3. Join (n-1) + 1 streams
  }
e.g. join on a shared-disk MP system
[Figure: memories M1 … Mn, each with a processor P]
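The three steps of the join can be sketched as follows. This is one possible reading of the pseudocode, not a definitive implementation: step 3 is realised by merge-joining each sorted partition of R against the sorted S, and the names `merge_join` and `parallel_join` are illustrative. Threads are used only to show the structure; in CPython they do not give real CPU parallelism for this kind of work.

```python
from concurrent.futures import ThreadPoolExecutor


def merge_join(left, right):
    """Join two lists of (key, value) pairs that are both sorted by key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit every right tuple with this key for the current left tuple.
            k = j
            while k < len(right) and right[k][0] == lk:
                out.append((lk, left[i][1], right[k][1]))
                k += 1
            i += 1
    return out


def parallel_join(r, s, n):
    """Join R and S on the key using n workers (n >= 2), assuming |R| >> |S|."""
    workers = n - 1
    # Step 1: split R round-robin into n-1 partitions; S gets the n-th worker.
    partitions = [r[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=n) as pool:
        # Step 2: n sorts run concurrently.
        s_future = pool.submit(sorted, s)
        sorted_parts = list(pool.map(sorted, partitions))
        s_sorted = s_future.result()
        # Step 3: join each sorted R stream with the sorted S stream.
        joined = pool.map(lambda part: merge_join(part, s_sorted), sorted_parts)
        return [row for part in joined for row in part]


r = [(k, f"r{k}") for k in range(10)]
s = [(5, "s5"), (2, "s2"), (5, "s5b")]
print(sorted(parallel_join(r, s, n=4)))
```

Because the partitions of R are disjoint, the union of the partial joins equals the join of R and S.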
Parallel / distributed DBS
Distributed QP
Given a data distribution
Find a strategy to evaluate the query with minimal cost, in particular communication cost
Compute with minimal cost (time): R ⋈ S ⋈ T
|R| = 10000 records, |S| = 100000 records, |T| = 1000 records
[Figure: three sites holding R, S, and T; link distances 10000 km and 100 km]
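A very rough cost model shows why communication cost drives the choice of strategy. Only the cardinalities and the two distances come from the slide. The following are assumptions made for this sketch: the topology (which link is 10000 km and which is 100 km, and the R-T distance), the record size, the bandwidth, the propagation speed, and the restriction to "ship everything to one site" strategies. Sizes of intermediate join results and local processing cost are ignored.

```python
PROPAGATION_SPEED_M_PER_S = 2.0e8              # assumption
CARD = {"R": 10_000, "S": 100_000, "T": 1_000}  # cardinalities from the slide

# Assumed topology; each site is named after the relation it stores.
DIST_KM = {
    frozenset(("R", "S")): 10_000,
    frozenset(("S", "T")): 100,
    frozenset(("R", "T")): 10_000,
}


def ship_cost_s(records: int, distance_km: float,
                record_bytes: int = 100, bandwidth_bps: float = 100e6) -> float:
    """Time to ship a relation: propagation latency plus transmission time."""
    latency = distance_km * 1000 / PROPAGATION_SPEED_M_PER_S
    return latency + records * record_bytes * 8 / bandwidth_bps


def cost_all_to(site: str) -> float:
    """Cost of shipping every other relation to `site` and joining there."""
    return sum(ship_cost_s(CARD[rel], DIST_KM[frozenset((rel, site))])
               for rel in CARD if rel != site)


best = min(CARD, key=cost_all_to)
for site in CARD:
    print(f"join at site of {site}: {cost_all_to(site):.4f} s")
print("cheapest under these assumptions:", best)
```

Under these assumptions it is cheapest to leave the largest relation S where it is and ship the smaller ones to it. A real optimizer would also weigh intermediate result sizes and alternative join orders.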