COMP 302 Valentina Tamma
Distributed Databases
Connolly & Begg. Chapter 22. Third edition
COMP 302 Valentina Tamma
Distributed Databases – Basic Concepts
• Concepts.
• Advantages and disadvantages of distributed
databases.
• Functions and architecture for a DDBMS.
• Distributed database design.
• Levels of transparency.
• Comparison criteria for Distributed DBMSs.
COMP 302 Valentina Tamma
Why distributed databases?
Some initial motivations:
• The development of computer networks promotes decentralization.
• In a company, the database organization might reflect the organizational structure, which is distributed into units. Each unit maintains its own database.
• Sharing of data can be achieved by developing a distributed database system which:
• makes data accessible by all units
• stores data close to where it is most frequently used.
COMP 302 Valentina Tamma3
Concepts
Distributed Database
A logically interrelated collection of shared data (and a description of this data), physically distributed over a computer network.
Distributed DBMS (DDBMS)
Software system that permits the management of the distributed database and makes the distribution transparent to users.
COMP 302 Valentina Tamma
An example of DDBMS
COMP 302 Valentina Tamma4
DDBMS - characteristics
• Collection of logically-related shared data.
• Data split into fragments.
• Fragments may be replicated.
• Fragments/replicas allocated to sites.
• Sites linked by a communications network.
• Data at each site is under control of a DBMS.
• DBMSs handle local applications autonomously.
• Each DBMS participates in at least one global
application.
COMP 302 Valentina Tamma6
These are not DDBMSs
Distributed Processing A centralized database that can be accessed over a computer network.
COMP 302 Valentina Tamma7
These are not DDBMSs
Parallel DBMS
A DBMS running across multiple processors and
disks designed to execute operations in parallel,
whenever possible, to improve performance.
• Based on premise that single processor systems can no longer meet requirements for cost-effective scalability, reliability, and performance.
• Parallel DBMSs link multiple, smaller machines to achieve same throughput as single, larger machine, with greater scalability and reliability.
COMP 302 Valentina Tamma
Parallel DBMS
Main architectures for parallel DBMSs are:
• Shared memory, • Shared disk, • Shared nothing.
COMP 302 Valentina Tamma
Parallel DBMS
(a) shared memory (b) shared disk (c) shared nothingCOMP 302 Valentina Tamma10
Advantages of DDBMSs
• Reflects Organizational Structure
• Improved Sharing and Local Autonomy
• Improved Availability
A failure does not make the entire system inoperable • Improved Reliability
Data may be replicated
• Improved Performance
Data are local to the site of “greatest demand”
• Economics
Many small computers cost less than a big one!
• Modular Growth
easy to add new modules
COMP 302 Valentina Tamma11
Disadvantages of DDBMSs
• Complexity
• Cost
Especially in system management • Security
network must be made secure • Integrity Control More Difficult
• Lack of Standards
• Lack of Experience
• Database Design More Complex
due to fragmentation, allocation of fragments to a specific site, …..
COMP 302 Valentina Tamma12
Types of DDBMS
Homogeneous DDBMS
• All sites use same DBMS product (eg.Oracle) • Fairly easy to design and manage.
Heterogeneous DDBMS
• Sites may run different DBMS products (eg. Oracle and Ingress)
• Possibly different underlying data models (eg. relational DB and OO database)
• Occurs when sites have implemented their own databases and integration is considered later.
• We won’t consider heterogeneous DDBMSs here.
COMP 302 Valentina Tamma
Multidatabase System (MDBS)
• DDBMS in which each site maintains completeautonomy.
• DBMS that resides transparently on top of existing database and file systems and presents a single database to its users.
• Allows users to access and share data without requiring physical database integration.
• Unfederated MDBS (no local users) and federated MDBS.
COMP 302 Valentina Tamma
Overview of Networking
• Network - Interconnected collection of autonomous computers, capable of exchanging information. • Local Area Network (LAN) intended for connecting
computers at same site.
• Wide Area Network (WAN) used when computers or LANs need to be connected over long distances.
• WAN relatively slow and less reliable than LANs. DDBMS using LAN provides much faster response time than one using WAN.
COMP 302 Valentina Tamma
COMP 302 Valentina Tamma
Functions of a DDBMS
• Expect DDBMS to have at least the functionality of
a DBMS (see
Connolly & Begg. Chapter 2. Third edition)• Also to have following functionality:
• Extended communication services. • Extended Data Dictionary. • Distributed query processing. • Extended concurrency control. • Extended recovery services. • Extended security control.
COMP 302 Valentina Tamma
Reference Architecture for DDBMS
• Due to diversity, no accepted architecture equivalent toANSI/SPARC 3-level architecture for DBMSs. • A possible reference architecture consists of:
• Set of global external schemas. • Global conceptual schema (GCS).
• Fragmentation schema and allocation schema.
• Set of schemas for each local DBMS conforming to 3-level
ANSI/SPARC .
• Some levels may be missing, depending on levels of transparency supported.
COMP 302 Valentina Tamma
Reference Architecture for DDBMS
COMP 302 Valentina Tamma
Reference Architecture for DDBMS
•
Global Conceptual Schema
is the logical
description of the DB as if it were not distributed. It
contains definitions of entities, relationships,
constraints, security, and integrity information.
• Fragmentation and Allocation Schemas
describe
how data are logically partitioned, and where they
are located, taking replication into account.
• Local Schemas
are the logical descriptions of the
COMP 302 Valentina Tamma
Components of a DDBMS
COMP 302 Valentina Tamma
Distributed Databases
Issues in Distributed Database Design
COMP 302 Valentina Tamma27
Issues in Distributed Database Design
Three key issues we have to consider:
• Data Allocation: where are data placed? Data should be stored at site with "optimal" distribution.
• Fragmentation: relation may be divided into a number of sub-relations (called fragments) , which are stored in different sites.
• Replication: copy of fragment may be maintained at several sites.
COMP 302 Valentina Tamma
Issues in Distributed Database Design
• Definition and allocation of fragments carried out
strategically to achieve:
• Locality of Reference
• Improved Reliability and Availability • Improved Performance
• Balanced Storage Capacities and Costs • Minimal Communication Costs.
• Involves analysing most important transactions,
based on quantitative/qualitative information.
COMP 302 Valentina Tamma
Fragmentation
• Quantitative information may include:
• frequency with which a transaction is run;
• site from which a transaction is run;
• performance criteria for transactions.
• Qualitative information may include transactions that are executed such as:
• type of access (read or write); • predicates of read operations.
COMP 302 Valentina Tamma30
Data Allocation
• Four strategies regarding placement of data:
• Centralized
• Partitioned (or Fragmented)
• Complete Replication
• Selective Replication
COMP 302 Valentina Tamma31
Data Allocation
• Centralized: Consists of single database stored at one site with users distributed across the network.
(This is not a DDB but distributed processing!!)
• Partitioned: Database partitioned into disjoint fragments, each fragment assigned to one site.
• Complete Replication: Consists of maintaining complete copy of database at each site.
• Selective Replication:Combination of partitioning, replication, and centralization.
COMP 302 Valentina Tamma28
Fragmentation
A relation R is divided into fragments r
1, r
2, …r
n,
which contain enough information to allow
reconstruction of R
Example:
We have a relation Sells(pub, address,price,type) Type is “bitter” or “lager”.
We can split Sells into twp dfferent fragments: • SellsBitter= σtype = “bitter”(Sells)
COMP 302 Valentina Tamma
Comparison of Strategies for Data Distribution
COMP 302 Valentina Tamma
Why Fragment?
• Usage• Applications work with views rather than entire relations.
• Efficiency
• Data is stored close to where it is most frequently used.
• Data that is not needed by local applications is not stored
• Parallelism
• With fragments as unit of distribution, transaction can be divided into several sub-queries that operate on fragments.
• Security
• Data not required by local applications is not stored and so not available to unauthorized users.
COMP 302 Valentina Tamma
Why Fragment?
Two main disadvantages:
•
Performance
•Integrity.
COMP 302 Valentina Tamma
Types of Fragmentation
• Four types of fragmentation:
• Horizontal, • Vertical, • Mixed, • Derived.
• Other possibility is no fragmentation:
• If relation is small and not updated frequently, may be better not to fragment relation.
COMP 302 Valentina Tamma
Two types of fragmentation:
• Horizontal
• Vertical
Horizontal and Vertical Fragmentation
COMP 302 Valentina Tamma
Mixed Fragmentation
© Pearson Education Limited 1995, 2005
COMP 302 Valentina Tamma43
Horizontal Fragmentation
• Each fragment consists of a subset of the tuples
of a relation R.
• Defined using Selection operation of relational algebra: • σp(R)
• Example:
• Relation: Sells(pub, address,price,type) • Fragments:
» SellsBitter= σtype = “bitter”(Sells)
» SellsLager= σtype = “lager”(Sells)
COMP 302 Valentina Tamma
Horizontal Fragmentation
• This strategy is determined by looking at predicates used by transactions.
• Involves finding set of minimal (complete and relevant) predicates.
• Set of predicates is complete, if and only if, any two tuples in same fragment are referenced with same probability by any application.
• Predicate is relevant if there is at least one application that accesses fragments differently.
COMP 302 Valentina Tamma45
Vertical Fragmentation
• Each fragment consists of a subset of attributes of a relation R.
• Defined using projection operation of relational algebra:
• Πa1,…an(R)
• Determined by establishing affinity of one attribute to another. • Example:
• Relation: Bars(name,address,licence,employees,owner) • Fragments:
» Πname,address,licence (Bars
)
» Πname,address,employees,owner(Bars
)
COMP 302 Valentina Tamma46
Mixed Fragmentation
• We can also mix horizontal and vertical fragmentation. • We obtain a fragment that consist of an horizontal
fragment that is vertically fragmented, or a vertical fragment that is horizontally fragmented.
• Defined using Selection and Projection operations of relational algebra.
σp(Πa1,…an(R))
Πa1,…an(σp(R))
COMP 302 Valentina Tamma
Example - Mixed Fragmentation
S
1= ∏
staffNo, position, sex, DOB, salary(Staff)
S
2= ∏
staffNo, fName, lName, branchNo(Staff)
S
21= σ
branchNo=‘B003’(S
2)
S
22= σ
branchNo=‘B005’(S
2)
S
23= σ
branchNo=‘B007’(S
2)
COMP 302 Valentina Tamma
Derived Horizontal Fragmentation
• A horizontal fragment that is based on
horizontal fragmentation of a parent relation.
• Ensures that fragments that are frequently
joined together are at same site.
• Defined using Semijoin operation of relational
algebra:
COMP 302 Valentina Tamma
Example - Derived Horizontal Fragmentation
S3 = σ branchNo=‘B003’(Staff)S4 = σ branchNo=‘B005’(Staff)
S5 = σ branchNo=‘B007’(Staff)
Could use derived fragmentation for Property: Pi = PropertyForRent >branchNo Si, 3 ≤ i ≤ 5
COMP 302 Valentina Tamma
Derived Horizontal Fragmentation
• If relation contains more than one foreign key, need toselect one as parent.
• Choice can be based on fragmentation used most frequently or fragmentation with better join characteristics.
COMP 302 Valentina Tamma37
How can we define fragments correctly?
In defining fragments we have to be very careful.
Three correctness rules:
• Completeness
• Reconstruction
• Disjointness.
COMP 302 Valentina Tamma38
Correctness of Fragmentation
Completeness
: If relation R is decomposed into
fragments r
1, r
2, …r
n, each data item that can be
found in R must appear in at least one fragment.
This ensures no loss of data during fragmentation.
COMP 302 Valentina Tamma
Correctness of Fragmentation
Recostruction: we must be able to reconstruct the entire R from fragments.
For horizontal fragmentation is union operation. R = r1 ∪ r2 ∪ … ∪rn,
For vertical fragmentation is natural join operation. R = r1 >< r2 >< … ><rn,
To ensure reconstruction we have to include primary key attributes in all fragments.
COMP 302 Valentina Tamma39
Correctness of Fragmentation
Disjointness: if data item x appears in fragment ri, then it should not appear in any other fragment.
Exception: vertical fragmentation, where primary key attributes must be repeated to allow reconstruction. For horizontal fragmentation, data item is a tuple For vertical fragmentation, data item is an attribute.
COMP 302 Valentina Tamma43
Correctness of Horizontal Fragmentation
Relation: Sells(pub, address,price,type) type={Bitter, Lager} Fragments:
• SellsBitter= σtype = “bitter”(Sells)
• SellsLager= σtype = “lager”(Sells)
Correctness rules
• Completeness: Each tuple in the relation appears either in SellsBitter, or in SellsLager
• Reconstruction: The Sells relation can be reconstructed from the fragments
Sells = SellsBitter ∪ SellsLager
• Disjointness: The two fragments are disjoint, there can be no beer that is both “Lager” and “Bitter”
COMP 302 Valentina Tamma43
Correctness of Vertical Fragmentation
Relation: Bars(name,address,licence,employees,owner) Fragments:
• r1 =Πname,address,licence (Bars)
• r2 = Πname,address,employees,owner(Bars)
Correctness rules
• Completeness: Each attribute in the Bars relation appears either in r1 or in r2
• Reconstruction: The Bars relation can be reconstructed from the fragments
Bars = r1 >< r2
• Disjointness: The two fragments are disjoint, except for the primary key, name, which is necessary for reconstruction
COMP 302 Valentina Tamma
Distributed Databases
Transparency in Distributed databases
COMP 302 Valentina Tamma
Transparencies in a DDBMS
• Distribution Transparency
• Transaction Transparency
• Performance Transparency
• DBMS Transparency
COMP 302 Valentina Tamma51
Distribution Transparency
The user has to perceive the DDB as a single,
logical entity
• Fragmentation Transparency: the user does not need to know that data is fragmented
• Location Transparency: the user does not need to know the location of data items
• Replication Transparency: the user is unaware of relication of data.
• Naming transparency: items in a database must have a unique name, but users don’t need to worry about it.
COMP 302 Valentina Tamma54
Naming Transparency
• Each item in a DDB must have a unique name.
• DDBMS must ensure that no two sites create a
database object with same name.
Solution 1: create central name server.
Disadvantages:
• loss of some local autonomy; • central site may become a bottleneck;
• low availability; if the central site fails, remaining sites cannot create any new objects.
COMP 302 Valentina Tamma55
Naming Transparency
Solution 2: prefix object with identifier of site that
created it.
Example: Beer created at site S1 might be named
S1.Beer.
Disadvantage: loss of distribution transparency.
COMP 302 Valentina Tamma56
Naming Transparency
Solution 3: use aliases for each database object.
Example: S1.Beer might be known as local_Beer
by user at site S1.
The DDBMS has task of mapping an alias to
appropriate database object.
COMP 302 Valentina Tamma57
Transaction Transparency
• Ensures that all distributed transactions maintain distributed database’s integrity and consistency.
• Distributed transaction accesses data stored at more than one location.
• Each transaction is divided into number of sub-transactions, one for each site that has to be accessed. • DDBMS must ensure the indivisibility of both the global
transaction and each sub-transactions.
• Must ensure both concurrency transparency, and failure transparency
COMP 302 Valentina Tamma58
Example - Distributed Transaction
Relation: Sells(pub, beer,price,type) Fragments:
• SellsBitter= σtype = “bitter”(Sells)
• SellsLager= σtype = “lager”(Sells)
The two fragments are at two different sites.
Transaction T prints out the names of all pubs in the relation sells. This transaction is split into two sub-transactions, one for each fragment.
COMP 302 Valentina Tamma
Example - Distributed Transaction
T prints out names of all staff, using schema
defined above as S
1, S
2, S
21, S
22, and S
23.
Define three subtransactions T
S3, T
S5, and T
S7to represent agents at sites 3, 5, and 7.
COMP 302 Valentina Tamma59
Concurrency Transparency
• All transactions must execute independently and be logically consistent with results obtained if transactions executed one at a time, in some arbitrary serial order. • Same fundamental principles as for centralised DBMS. • DDBMS must ensure both global and local transactions do
not interfere with each other.
• Similarly, DDBMS must ensure consistency of all sub-transactions of global transaction.
• Techniques for concurrency control. Usually different from the ones for DBMS.
COMP 302 Valentina Tamma
Concurrency Transparency
• Replication makes concurrency more complex.
• If a copy of a replicated data item is updated,
update must be propagated to all copies.
• Could propagate changes as part of original
transaction, making it an atomic operation.
• However, if one site holding copy is not reachable,
then transaction is delayed until site is reachable.
COMP 302 Valentina Tamma
Concurrency Transparency
• Could limit update propagation to only those sites
currently available. Remaining sites updated when
they become available again.
• Could allow updates to copies to happen
asynchronously, sometime after the original
update. Delay in regaining consistency may range
from a few seconds to several hours.
COMP 302 Valentina Tamma
Failure Transparency
• DDBMS must ensure atomicity and durability of
global transaction.
• Means ensuring that sub-transactions of global
transaction either all commit or all abort.
• Thus, DDBMS must synchronize global
transaction to ensure that all sub-transactions
have completed successfully before recording a
final COMMIT for global transaction.
• Must do this in presence of site and network
failures.
COMP 302 Valentina Tamma63
Performance Transparency
DDBMS must perform as if it were a
centralized DBMS:
•
DDBMS should not suffer any performance
degradation due to distributed architecture.
•
DDBMS should determine most cost-effective
strategy to execute a request.
COMP 302 Valentina Tamma64
Performance Transparency
• Distributed Query Processor (DQP) maps data
request into ordered sequence of operations on
local databases.
• It must consider fragmentation, replication, and
allocation schemas.
• DQP has to decide:
• which fragment to access; • which copy of a fragment to use; • which location to use.
COMP 302 Valentina Tamma65
Performance Transparency
• DQP produces execution strategy optimised
with respect to some cost function.
• Typically, costs associated with a distributed
request include:
• I/O cost; • CPU cost;
COMP 302 Valentina Tamma66
Performance Transparency - Example
Property(Pno, City) 10000 records in London Renter(Rno,Max_Price) 100000 records in Glasgow Viewing(Pno, Rno) 1000000 records in LondonSELECT p.pno
FROM property p INNER JOIN
(renter r INNER JOIN viewing v ON r.rno = v.rno) ON p.pno = v.pno
WHERE p.city=‘Aberdeen’ AND r.max_price > 200000;
COMP 302 Valentina Tamma67
Performance Transparency - Example
Assume:
• Each tuple in each relation is 100 characters long. • 10 renters with maximum price greater than
£200,000.
• 100 000 viewings for properties in Aberdeen. • Computation time negligible compared to
communication time.
COMP 302 Valentina Tamma68
Performance Transparency - Example
COMP 302 Valentina Tamma69
Date’s 12 Rules for a DDBMS
0. Fundamental PrincipleTo the user, a distributed system should look exactly like a non-distributed system
.
1. Local Autonomy
2. No Reliance on a Central Site 3. Continuous Operation 4. Location Independence 5. Fragmentation Independence 6. Replication Independence
COMP 302 Valentina Tamma69
Date’s 12 Rules for a DDBMS
0. Fundamental Principle
To the user, a distributed system should look exactly like a non-distributed system.
7. Distributed Query Processing 8. Distributed Transaction Processing 9. Hardware Independence 10. Operating System Independence 11. Network Independence 12. Database Independence Note: last four rules are ideal!
COMP 302 Valentina Tamma4
Distributed Transaction Management
• DDBMS must ensure:
• synchronization of sub-transactions with other local transactions executing concurrently at a site; • synchronization of sub-transactions with global
transactions running simultaneously at same or different sites.
• Global transaction manager (transaction
coordinator) at each site, to coordinate global and
local transactions initiated at that site.
COMP 302 Valentina Tamma