IV Distributed Databases


Motivation & Introduction

• Motivation

• Expected Benefits

• Technical issues

• Types of distributed DBS

• 12 Rules of C. Date

• Parallel vs Distributed DBS

References

Özsu, M.T.; Valduriez, P.: Principles of Distributed Database Systems, 2nd edition, Prentice-Hall, 1999

Rahm, E.: Mehrrechner-Datenbanksysteme, Addison-Wesley, 1994

Weikum, G.; Vossen, G.: Transactional Information Systems: Theory, Algorithms, and the Practice of Concurrency Control and Recovery, Morgan Kaufmann, 2001, ISBN 1558605088

Gray, J.; Reuter, A.: Transaction Processing - Concepts and Techniques, Morgan Kaufmann Publishers, San Mateo, 1993

Bernstein, P.A.; Hadzilacos, V.; Goodman, N.: Concurrency Control and Recovery in Database Systems, Addison-Wesley, 1987

Bernstein, P.A.; Newcomer, E.: Principles of Transaction Processing, Morgan Kaufmann, San Mateo, 1997

Material used from B. Kemme (McGill), H. Garcia-Molina (Stanford), A. Zaslavsky et al. (Monash), G. Alonso (ETH)


hs / FUB dbsII-03-10DDBIntro-3

Motivation

Application: Data "naturally" distributed

• Companies with different branches
• Airlines
• Financial business
• University / faculties
• Any organization with a decentralized organizational structure

Technology: Network infrastructure, processors, RAM

Economy: Hardware cost

Software supporting distributed processing, e.g. RPC

-> Huge number of interconnected systems

Recent challenge: Web-based computing

-> E-commerce

Goals: Improvement of non-functional characteristics

Performance:

• The more computing power, the better
• Primary goal for parallel DBS, not necessarily for distributed DBS

Reliability:

• Substitute faulty components (hardware, software, ... and network) seamlessly
• Fault tolerance: the ability to hide failures from users
• Related to higher availability
• Is 95.8 % availability too low? Definitely: it means 1 hour of downtime per day!

Scalability:

• Upscale / downscale your system incrementally
• Central components and algorithms are counterproductive -> distributed algorithms
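The availability figure above follows from simple arithmetic. As a minimal sketch (the function names are illustrative and not part of the lecture), one hour of downtime per day corresponds to roughly 95.8 % availability:

```python
def availability(downtime_hours_per_day: float) -> float:
    """Fraction of time the system is up, given the average downtime per day."""
    return 1.0 - downtime_hours_per_day / 24.0


def downtime_minutes_per_day(avail: float) -> float:
    """Average downtime per day (in minutes) implied by an availability fraction."""
    return (1.0 - avail) * 24.0 * 60.0


print(f"{availability(1.0):.1%}")                      # prints 95.8%
print(f"{downtime_minutes_per_day(0.999):.2f} min")    # prints 1.44 min
```

Read the other way round, even "three nines" (99.9 %) still allows about a minute and a half of downtime every day.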


The dark side of distribution

Systems often less reliable

• "You will never make a system of unreliable components more reliable by adding more unreliable components"
• However: hot standby
• But: data copies must be kept consistent, complex software, unreliable network

Scalability

• DS inherently complex
• High development cost -> middleware efforts
• High administration cost

-> lack of flexibility

The dark side …

Performance

• Double resources do not guarantee double performance
• Network performance?
• Transfer time depends not only on bandwidth

Transfer of a 4 KB page:

Distance    Latency    Bandwidth    Transfer
100 m       0.5 µs     10 Mbps      5 ms
100 m       0.5 µs     100 Mbps     0.5 ms
1 km        5 µs       100 Mbps     0.5 ms
100 km      0.5 ms     100 Mbps     1 ms
1000 km     5 ms       100 Mbps     5.5 ms
10000 km    50 ms      1 Gbps       50 ms

• Distance > 100 km -> signal propagation time dominates
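The point about propagation time can be reproduced with a small model: transfer time is propagation delay plus the time to push the bits onto the link. The sketch below assumes a signal speed of 2·10^8 m/s, which matches the latency column above (100 m in 0.5 µs); all names are illustrative. The raw serialization times it computes come out somewhat lower than the transfer column (about 3.3 ms rather than 5 ms at 10 Mbps), which presumably includes overhead not modelled here.

```python
PAGE_BITS = 4 * 1024 * 8        # one 4 KB page
SIGNAL_SPEED_M_PER_S = 2.0e8    # assumption: roughly 2/3 of the speed of light


def propagation_s(distance_m: float) -> float:
    """Time for the signal to travel the distance (independent of bandwidth)."""
    return distance_m / SIGNAL_SPEED_M_PER_S


def serialization_s(bandwidth_bps: float, bits: int = PAGE_BITS) -> float:
    """Time to put all bits on the link (independent of distance)."""
    return bits / bandwidth_bps


def transfer_s(distance_m: float, bandwidth_bps: float, bits: int = PAGE_BITS) -> float:
    """Simplified one-way transfer time: propagation plus serialization."""
    return propagation_s(distance_m) + serialization_s(bandwidth_bps, bits)


for distance_m, bandwidth_bps in [(100, 10e6), (100e3, 100e6), (10_000e3, 1e9)]:
    prop = propagation_s(distance_m)
    ser = serialization_s(bandwidth_bps)
    print(f"{distance_m / 1000:>8.1f} km  propagation {prop * 1e3:8.4f} ms  "
          f"serialization {ser * 1e3:8.4f} ms  total {(prop + ser) * 1e3:8.4f} ms")
```

At 10000 km the propagation term is three orders of magnitude larger than the serialization term on a 1 Gbps link, so buying more bandwidth no longer helps.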


What is a Distributed Database?

A distributed database (DDB) is a collection of multiple, logically interrelated databases distributed over a computer network.

A distributed database management system (D-DBMS) is the software that manages the DDB and provides an access mechanism that makes this distribution transparent to the users.

Distributed database system (DDBS) = DDB + D-DBMS

Def. by P. Valduriez, T. Özsu

Example (1)

Transparency of distribution: one logical DB

UPDATE empl
SET sal = sal * 1.1
WHERE empl.id IN (SELECT ass.eid
                  FROM ass, proj
                  WHERE proj.id = ass.pid AND proj.dur > 12)

Expl. by B. Kemme

[Figure: three sites (Berlin, New York, Munich) connected by a network; the stored data are labelled Muc projects, Muc employees, Muc assignments, NY employees, All projects, Berlin employees, All assignments.]
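What "one logical DB" means for such an update can be sketched in a few lines. The model below is an assumption made for illustration only: the `Site` and `GlobalDB` classes, the sample records and their placement are invented, since the exact data placement of the original figure is not recoverable. The caller issues one logical operation; the global layer collects the project and assignment fragments from all sites and updates each employee record wherever it is stored.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    name: str
    employees: dict = field(default_factory=dict)     # employee id -> salary
    projects: dict = field(default_factory=dict)      # project id -> duration
    assignments: list = field(default_factory=list)   # (employee id, project id)


class GlobalDB:
    """Presents the fragments stored at several sites as one logical database."""

    def __init__(self, sites):
        self.sites = sites

    def raise_long_project_salaries(self, factor=1.1, min_duration=12):
        # Global view: union of the project and assignment fragments of all sites.
        long_projects = {pid for s in self.sites
                         for pid, dur in s.projects.items() if dur > min_duration}
        eligible = {eid for s in self.sites
                    for eid, pid in s.assignments if pid in long_projects}
        # Each employee record is updated at whichever site stores it.
        for s in self.sites:
            for eid in s.employees.keys() & eligible:
                s.employees[eid] *= factor


# Hypothetical placement of data on the three sites.
munich = Site("Munich", employees={1: 5000.0}, projects={10: 18}, assignments=[(1, 10)])
new_york = Site("New York", employees={2: 6000.0}, projects={20: 6}, assignments=[(2, 20)])
berlin = Site("Berlin", employees={3: 4000.0}, assignments=[(3, 10)])

db = GlobalDB([munich, new_york, berlin])
db.raise_long_project_salaries()   # the caller never names a site
```

The Berlin employee gets the raise although the qualifying project is stored in Munich, which is exactly the cross-site work the D-DBMS has to hide.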


Example (2)

Cooperation: autonomous DBs cooperating on particular tasks

SELECT * FROM flights
WHERE departure = 'Montreal' AND arrival = 'Munich'
AND date = '12/9/2002' AND price < 800   -- price in $

[Figure: the sites lufthansa.com, air-canada.com and travel-overland.com connected by a network.]
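One way to picture this kind of cooperation is a thin mediator that forwards the same question to every autonomous source and merges whatever comes back. The following is only a sketch under stated assumptions: the `Flight` record, the `search` function and the three stub sources are hypothetical, and the prices are made up.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Flight:
    carrier: str
    departure: str
    arrival: str
    date: str
    price: float


# A source is any callable that answers a flight query on its own terms.
Source = Callable[[str, str, str], Iterable[Flight]]


def search(sources: List[Source], departure: str, arrival: str,
           date: str, max_price: float) -> List[Flight]:
    """Ask every source, keep the offers below max_price, cheapest first."""
    offers: List[Flight] = []
    for ask in sources:
        try:
            answers = list(ask(departure, arrival, date))
        except ConnectionError:
            continue  # an autonomous source may be down; the others still answer
        offers.extend(f for f in answers if f.price < max_price)
    return sorted(offers, key=lambda f: f.price)


# Hypothetical stand-ins for the three sites.
def carrier_a(dep, arr, date):
    return [Flight("carrier A", dep, arr, date, 750.0)]


def carrier_b(dep, arr, date):
    return [Flight("carrier B", dep, arr, date, 820.0)]


def agency(dep, arr, date):
    raise ConnectionError("site unreachable")


result = search([carrier_a, carrier_b, agency], "Montreal", "Munich", "12/9/2002", 800)
```

Note that the mediator has no control over the sources: it cannot force them to answer, and it sees only what each one chooses to return.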

Example (3)

Autonomous, heterogeneous systems, logically identical data types

UPDATE empl
SET sal = sal * 0.9
WHERE jobTitle = 'product manager'

• Daimler / Stuttgart: only Stuttgart data, IBM DB2
• Daimler / Bremen: only Bremen data, MySQL
• Chrysler / Detroit: only Detroit data, Oracle 9i


Example (4)

Sophisticated Client / Server computing

[Figure: several clients connected to Application Server A and Application Server B; annotation: possible R/W conflict.]

Classification criteria

Distribution

• Physically independent systems
• Peer-to-peer: data distribution and sharing
• Client / Server: function distribution, e.g. parsing in client

Heterogeneity

• DBMS software
• Database schema (types) and languages (SQL variants)

Autonomy

• No global control
• Local DBS operations may not be influenced by global operations (e.g. of a global transaction)
• Note: subsumes completely independent or


Classification cube

by P. Valduriez, T. Özsu

• Distributed DB: looks like one DB
• Federated: more autonomy but not independent (Expl. 3)
• Multi DB: independent, cooperative (Expl. 2)

Scenarios and common problems

Not just one kind of distributed database system, but indefinitely many

Understand common problems

• e.g. how to guarantee one state for replicated data from the user's point of view

Solve by developing distributed algorithms

• e.g. transaction commit

Any unsolvable problems?

Example: Internet marriage

[Figure: priest, bride, groom]

• Distributed transaction: YES or NO, that is the question
• All participants and communication are unreliable
• Main issue: partial failure
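The yes/no decision above is what an atomic commit protocol has to reach. Below is a minimal sketch of a voting round in the style of two-phase commit. The protocol is not named on the slide, the class and function names are hypothetical, and coordinator failure and lost decision messages are deliberately ignored, which is precisely where the hard part of partial failure lies.

```python
from enum import Enum


class Vote(Enum):
    YES = "yes"
    NO = "no"


class Participant:
    def __init__(self, name, vote, reachable=True):
        self.name = name
        self._vote = vote
        self.reachable = reachable
        self.outcome = None          # set once the global decision arrives

    def prepare(self):
        """Phase 1: answer the coordinator's question, or time out."""
        if not self.reachable:
            raise TimeoutError(f"{self.name} did not answer")
        return self._vote

    def decide(self, outcome):
        """Phase 2: apply the global decision locally."""
        self.outcome = outcome


def run_commit(participants):
    """Coordinator: commit only if every participant votes YES."""
    try:
        votes = [p.prepare() for p in participants]
        decision = "commit" if all(v is Vote.YES for v in votes) else "abort"
    except TimeoutError:
        decision = "abort"           # a missing vote is treated as NO
    for p in participants:
        if p.reachable:
            p.decide(decision)
    return decision


bride = Participant("bride", Vote.YES)
groom = Participant("groom", Vote.YES, reachable=False)
print(run_commit([bride, groom]))    # prints abort
```

An unreachable groom leaves the coordinator no safe choice but to abort; what the sketch does not show is what the bride should do if the coordinator itself disappears after she has voted YES.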


12 +1 rules for DDBS (C. Date)

Rule 0: A DDB looks like a central DB to users

Rule 1: Sites should be as independent as possible (local autonomy)

Rule 2: There should not be a central master that all sites depend on (no reliance on a central site)

Rule 3: Never a need for complete shutdown (continuous operation)

Rule 4: Users should not need to know where data are stored (location transparency / independence)

Rule 5: If data are split (e.g. columns of one relation) and distributed over several sites, users should not be aware of it (fragmentation transparency)

Rule 6: Users should not be aware of replicated data (replication independence)

Rule 7: Efficient distributed query processing

Rule 8: Global concurrency control and recovery (distributed transaction management)

Rule 9: Hardware independence

Rule 10: OS independence

Rule 11: Network independence

Rule 12: DBMS independence
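Rule 5 can be made concrete with a tiny example of vertical fragmentation. The sketch assumes a relation whose columns are split over two sites, each fragment keeping the primary key; the `reconstruct` helper and the sample rows are hypothetical. Fragmentation transparency means the user only ever sees the rejoined rows.

```python
def reconstruct(*fragments):
    """Rejoin vertical fragments (dicts: primary key -> partial row) on the key."""
    shared_keys = set(fragments[0]).intersection(*fragments[1:])
    return {key: {col: val for frag in fragments for col, val in frag[key].items()}
            for key in sorted(shared_keys)}


# Hypothetical fragments of one employee relation, stored at two sites.
site_a = {1: {"name": "Ada"}, 2: {"name": "Bob"}}      # columns: id, name
site_b = {1: {"sal": 5000}, 2: {"sal": 4200}}          # columns: id, sal

empl = reconstruct(site_a, site_b)
print(empl[1])   # prints {'name': 'Ada', 'sal': 5000}
```

The join on the key is cheap here, but in a real DDBS it involves shipping one fragment across the network, which is why rules 5 and 7 belong together.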


Parallel versus Distributed Databases

More similarities than differences

Similar to the Parallel / Distributed Processing distinction

Parallel DBS

• Not geographically distributed
• Goal: high performance
• Homogeneous software
• Fast interconnect

Distributed DBS

• Data geographically distributed
• Goal: data sharing
• Disconnected operation possible -> autonomy
• Transparency

Parallel / distributed DBS

Query processing in parallel DBS

Distribute operators (sort, filter, ...) and data over processors to make complex processing fast

Join (R, S) {   // |R| >> |S|
  1. Split R into n-1 partitions R_i and assign them to M_i / P_i;
     assign S to processor / memory P_n / M_n;
  2. Sort R_i and S;   // n sorts in parallel
  3. Join (n-1) + 1 streams
}

e.g. join on a shared-disk MP system

[Figure: processors P with memories M_1 ... M_n]
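The three steps of the join can be sketched as runnable code. This is an illustration under assumptions, not the lecture's implementation: rows are tuples whose first field is the join key, the key is assumed to be unique in S, and a thread pool stands in for the n processors (in CPython, threads will not give a real speed-up for sorting; the structure is what matters here).

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def parallel_sort_merge_join(r, s, n=4, key=lambda row: row[0]):
    """Join R and S on key, sorting n-1 partitions of R and S in parallel."""
    if n < 2:
        raise ValueError("need at least one worker for R and one for S")
    # 1. Split R into n-1 partitions; S is the n-th unit of work.
    partitions = [r[i::n - 1] for i in range(n - 1)]
    # 2. Sort the n inputs in parallel.
    with ThreadPoolExecutor(max_workers=n) as pool:
        sorted_inputs = list(pool.map(lambda rows: sorted(rows, key=key),
                                      partitions + [s]))
    *sorted_partitions, sorted_s = sorted_inputs
    # 3. Merge the n-1 sorted R streams and join them with the S stream.
    merged_r = heapq.merge(*sorted_partitions, key=key)
    result = []
    s_iter = iter(sorted_s)
    s_row = next(s_iter, None)
    for r_row in merged_r:
        while s_row is not None and key(s_row) < key(r_row):
            s_row = next(s_iter, None)
        if s_row is None:
            break                      # S is exhausted, no more matches
        if key(s_row) == key(r_row):
            result.append((r_row, s_row))
    return result


r = [(3, "c"), (1, "a"), (2, "b"), (1, "a2")]
s = [(1, "x"), (3, "z")]
pairs = parallel_sort_merge_join(r, s, n=3)
```

Because both inputs arrive sorted, step 3 is a single linear pass; the expensive sorting is what gets spread over the processors.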


Parallel / distributed DBS

Distributed QP

Given a data distribution, find a strategy to evaluate the query with minimal cost, in particular communication cost

Compute with minimal cost (time): R ⋈ S ⋈ T

|R| = 10000 records

|S| = 100000 records

|T| = 1000 records

[Figure: the relations are stored at different sites; the links between the sites are labelled 10000 km and 100 km.]
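A first, very crude way to compare strategies is to count the records that have to cross the network. The sketch below assumes each relation lives at its own site and that whole relations are shipped to one site where the join is computed; it ignores link distances, intermediate result sizes and semi-join techniques, all of which a real optimizer would take into account. The names are illustrative.

```python
SIZES = {"R": 10_000, "S": 100_000, "T": 1_000}   # records per relation


def shipped_records(result_site: str, sizes=SIZES) -> int:
    """Records sent over the network if all relations are shipped to one site."""
    return sum(count for rel, count in sizes.items() if rel != result_site)


for site in SIZES:
    print(f"join at the site of {site}: ship {shipped_records(site)} records")

best = min(SIZES, key=shipped_records)
print(best)   # prints S
```

Under these assumptions the join should run where the largest relation already is, so that only the 11000 records of R and T travel.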
