BBM467 Data Intensive ApplicaAons

(1)

Hace7epe Üniversitesi

Bilgisayar Mühendisliği Bölümü

BBM467

Data Intensive ApplicaAons

Dr. Fuat Akal

(2)

FoundaAons of Data[base] Clusters

• Database Clusters

• Hardware Architectures

• Data Design Schemes

• ReplicaAon Schemes

• Query Parallelism

• Logical Cluster OrganizaAon

(3)

Database Clusters

• A cluster of computers can be thought as a single

compuAng resource.

– 

It uAlizes mulAple machines to provide a more powerful

compuAng environment through a single system image.

• There are two types clusters

– 

high availability clusters (HA)

(4)

Hardware Architectures: Shared Memory

•  All processors have access to the main memory and the disk, respecAvely.

•  The processors are Aghtly coupled inside the

same box and interconnected with a special switch. •  The interprocess communicaAon is done by using

a shared memory.

•  The shared-‐memory approach presents simplicity and allows for load balancing as well as

inter-‐query parallelism which comes for free. •  However, it is too expensive since it requires a

special interconnect among the processors.

•  Its performance and scalability are limited with the available memory and communicaAon bandwidths.

P P P

(5)

Hardware Architectures: Shared Disk

•  In the shared-‐disk approach, all processors have their own memory, but they share disks.

•  The interprocess communicaAon occurs over a common high-‐speed bus.

•  Provides high availability. All data is sAll accessible even when a node fails.

•  Since each node has its own data cache, cache coherency must be maintained, e.g. by means of a lock manager, which results in reduced performance. •  Shared-‐disk systems have limited scalability due to

bandwidth of the high-‐speed bus and potenAal bo7lenecks of shared hardware.

D D D

M M

(6)

Hardware Architectures: Shared Nothing

•  In a shared-‐nothing architecture, each node is a complete stand-‐alone computer with its own memory and disk.

•  The nodes are connected via switch or LAN. But, they do not share anything.

•  The main advantages of such systems are very good scalability and high availability.

•  However, the management of data is complicated and the programming with this model is harder due to importance of data parAAoning and allocaAon.

D D M M P P D M P

(7)

ParAAoning Schemes

•  Ver$cal Par$$oning: VerAcal parAAoning divides the columns of a table into separate tables.

–  VerAcal parAAoning makes projecAons and joins easier and helps opAmizing access to the cache by reducing size of the tuples. However, access to the whole table may be required anyway, when execuAng queries.

•  Horizontal Par$$oning: Horizontal parAAoning divides a table along its tuples. Its basic

advantage is to allow parallel scans or projects.

–  The hash par55oning is based on a hash funcAon that distributes the tuples according to a hashing key.

•  useful for parallel exact match queries and hash-‐join operaAons.

•  not appropriate for range queries and operaAons on other than parAAoning keys. –  The range par55oning is made based on value intervals of parAAoning keys.

•  uAlizes evaluaAons of range queries.

•  the performance of the range parAAoning depends on the interval size.

–  The round robin parAAoning technique distributes the tuples on each of the parAAons. This approach is also called striping. The number of logically con-‐secuAve tuples forms a striping unit.

•  The relaAve size of the striping unit directly aﬀects the performance.

•  Small striping units result in more I/O parallelism for scans and long range queries. •  Larger striping units, on the other hand, may cause latency to complete scans.

(8)

ParAAoning Schemes

A B C 1 2 3 4 5 6 7 8 9 10 A B 1 2 3 4 5 6 7 8 9 10 A C 1 2 3 4 5 6 7 8 9 10 A B C 1 2 3 4 5 A B C 6 7 8 9 10 A B C 1 4 5 A B C 7 8 10 A B C 2 3 6 9 A B C 1 4 7 10 A B C 2 5 8 A B C 3 6 9 Original Table a) Vertical Partitioning b) Hash Partitioning c) Range Partitioning d) Round-Robin Partitioning

(9)

Virtual ParAAoning

• Virtual parAAoning, also called query parAAoning,

assumes that all tables are fully replicated on each

cluster node.

• In this approach, a query is decomposed into

subqueries which access small pieces of data by

appending range predicates to the where clause of

that query.

• Each subquery then deals with only a small part of

the data.

(10)

Virtual ParAAoning (Example)

LineItem LineItem

node A node B

SELECT Sum(L_ExtendedPrice*L_Discount) AS Revenue

FROM LineItem

WHERE L_Discount BETWEEN 0.03 AND 0.05

AND L_OrderKey BETWEEN 0 AND 3000000

FROM LineItem

AND L_OrderKey BETWEEN 300001 AND 6000000

FROM LineItem

original query

subquery2 subquery1

(11)

ReplicaAon Schemes

• Full Replica$on: Tables are duplicated on each cluster node.

That is, each node holds an exact copy of the original

database.

• Par$al Replica$on: ParAal replicaAon means that only parts

of original database are replicated on the diﬀerent cluster

nodes.

• Mixed Replica$on: Both full and parAal replicaAon at the

(12)

ReplicaAon Schemes

a) Full Replica$on b) Par$al Replica$on

c) Mixed Replica$on Original Database

(13)

Global Database Scheme

Node 1 Node 2 Node 3 Node 4 Node 5

Node Group 1 NG 2 NG 3 Database Cluster Co-existing Design Schemes 1 2 3

Mixed Data Design

-‐ Organize as node groups (NG) -‐ Freely design every NG

(14)

Query Parallelism in a Cluster

• inter-‐query parallelism: The capability of the

database management system to accept queries

from mulAple users simultaneously. Each query is

executed independently of the others.

• intra-‐query parallelism: Achieved by decomposing

queries into subqueries and evaluaAng them

simultaneously.

(15)

Data Q₁ Q₂ Database (Partition) Data Q₃ Database Partition

b) intra-query & intra-partition a) inter-query

Data

Q₄

Database Partition

c) intra-query & inter-partition

Data

Q₅

c) intra-query & intra-partition & inter-partition

Data

(16)

Logical Cluster OrganizaAon

•  Flat Cluster Architecture: Allows any cluster node to be accessible by

clients.

–  Forms a federated database of disAnct databases running on independent servers.

•  Connected by a LAN, no resource sharing, such as disks.

–  Provides high availability and simple design.

•  ReplicaAon is diﬃcult to implement with this model.

•  Middleware Based Cluster Architecture: A client can only interact with

the cluster through a coordinaAon middleware.

–  The middleware is responsible for scheduling and rouAng of the clients requests. –  The middleware has the knowledge about underlying cluster.

•  It can be used to ensure correct execuAons of concurrent updates and reads.

•  It also allows to improve overall throughput by choosing be7er components, e.g. with less load to perform client requests.

–  It is subject to single point of failure.

•  If the middleware fails, the cluster will become useless.

(17)

Logical Cluster OrganizaAon

Coordination Middleware Clients Database Cluster

(18)

ReplicaAon Management

• ReplicaAon is an essenAal technique to improve availability

and scalability by fully or parAally duplicaAng data objects

among the nodes of a distributed system.

• ReplicaAon management is responsible for the maintenance

of replicas and ensures consistency of mulAple copies of the

same data object residing on diﬀerent nodes.

• That is, replicaAon management is not simply copying data

objects onto diﬀerent nodes of a distributed system.

(19)

SynchronizaAon of Updates

• There are two possibiliAes for the locaAon of updates:

–  Updates can either be centralized on one primary copy

–  Or, be distributed on (a subset of) all replicas (update everywhere).

• SynchronizaAon of updates can be done in two ways: eager

and lazy

a) Primary Copy b) Update Everywhere

: update

: updatable object : propagation : read-only object

(20)

SynchronizaAon of Updates

• Eager (or synchronous) replicaAon.

–  All copies of an object are synchronized within the same database

transacAon.

–  Allows early detecAon of conﬂicts and presents a simple soluAon to

provide consistency.

–  Has drawbacks regarding performance and due to the high

communicaAon overhead among the replicas and the high probability of deadlocks.

• Lazy (or asynchronous) replicaAon.

–  Replica maintenance is decoupled from the original database transacAon.

–  The transacAons keeping the replicas up-‐to-‐date and consistent run as

separate and independent database transacAons aler the original transacAon has commi7ed.

–  Compared to eager replicaAon approaches, lazy approaches require

(21)

(22)

(23)

Lazy Primary Copy ReplicaAon with

Immediate Updates

(24)

Lazy Primary Copy ReplicaAon with

Deferred Updates

(25)

BBM467 Data Intensive ApplicaAons

BBM467

Data Intensive ApplicaAons

Dr. Fuat Akal

FoundaAons of Data[base] Clusters

•

Database Clusters

•

Hardware Architectures

•

Data Design Schemes

•

ReplicaAon Schemes

•

Query Parallelism

•

Logical Cluster OrganizaAon

Database Clusters

•

A cluster of computers can be thought as a single

compuAng resource.

–

It uAlizes mulAple machines to provide a more powerful

compuAng environment through a single system image.

•

There are two types clusters

–

high availability clusters (HA)

Hardware Architectures: Shared Memory

Hardware Architectures: Shared Disk

Hardware Architectures: Shared Nothing

ParAAoning Schemes

ParAAoning Schemes

Virtual ParAAoning

•

Virtual parAAoning, also called query parAAoning,

assumes that all tables are fully replicated on each

cluster node.

•

In this approach, a query is decomposed into

subqueries which access small pieces of data by

appending range predicates to the where clause of

that query.

•

Each subquery then deals with only a small part of

the data.

Virtual ParAAoning (Example)

ReplicaAon Schemes

•

Full Replica$on: Tables are duplicated on each cluster node.

That is, each node holds an exact copy of the original

database.

•

Par$al Replica$on: ParAal replicaAon means that only parts

of original database are replicated on the diﬀerent cluster

nodes.

•

Mixed Replica$on: Both full and parAal replicaAon at the

ReplicaAon Schemes

Query Parallelism in a Cluster

•

inter-­‐query parallelism: The capability of the

database management system to accept queries

from mulAple users simultaneously. Each query is

executed independently of the others.

•

intra-­‐query parallelism: Achieved by decomposing

queries into subqueries and evaluaAng them

simultaneously.

Logical Cluster OrganizaAon

Logical Cluster OrganizaAon

ReplicaAon Management

•

ReplicaAon is an essenAal technique to improve availability

and scalability by fully or parAally duplicaAng data objects

among the nodes of a distributed system.

•

ReplicaAon management is responsible for the maintenance

of replicas and ensures consistency of mulAple copies of the

same data object residing on diﬀerent nodes.

• 

• 

• 

• 

• 

• 

• 

– 

• 

– 

• 

• 

• 

• 

• 

• 

• 

inter-‐query parallelism: The capability of the

• 

intra-‐query parallelism: Achieved by decomposing

• 

• 

• 

• 

• 

• 

•