BBM467 Data Intensive Applications

Academic year: 2021

(1)

Hacettepe University

Department of Computer Engineering

BBM467

Data Intensive Applications

Dr. Fuat Akal

(2)

Foundations of Data[base] Clusters

• Database Clusters
• Hardware Architectures
• Data Design Schemes
• Replication Schemes
• Query Parallelism
• Logical Cluster Organization

(3)

Database Clusters

• A cluster of computers can be thought of as a single computing resource.
  – It utilizes multiple machines to provide a more powerful computing environment through a single system image.
• There are two types of clusters:
  – high availability clusters (HA)

(4)

Hardware Architectures: Shared Memory

• All processors have access to the main memory and the disks.
• The processors are tightly coupled inside the same box and interconnected with a special switch.
• Interprocess communication is done through shared memory.
• The shared-memory approach is simple and allows for load balancing; inter-query parallelism comes for free.
• However, it is expensive, since it requires a special interconnect among the processors.
• Its performance and scalability are limited by the available memory and communication bandwidth.

(5)

Hardware Architectures: Shared Disk

• In the shared-disk approach, all processors have their own memory, but they share the disks.
• Interprocess communication occurs over a common high-speed bus.
• It provides high availability: all data is still accessible even when a node fails.
• Since each node has its own data cache, cache coherency must be maintained, e.g. by means of a lock manager, which reduces performance.
• Shared-disk systems have limited scalability due to the bandwidth of the high-speed bus and potential bottlenecks in the shared hardware.

(6)

Hardware Architectures: Shared Nothing

• In a shared-nothing architecture, each node is a complete stand-alone computer with its own memory and disk.
• The nodes are connected via a switch or LAN, but they do not share anything.
• The main advantages of such systems are very good scalability and high availability.
• However, data management is complicated and programming with this model is harder, because data partitioning and allocation become critical.

(7)

Partitioning Schemes

• Vertical Partitioning: Vertical partitioning divides the columns of a table into separate tables.
  – Vertical partitioning makes projections and joins easier and helps optimize cache access by reducing the size of the tuples. However, executing a query may still require access to the whole table.
• Horizontal Partitioning: Horizontal partitioning divides a table along its tuples. Its basic advantage is that it allows parallel scans or projections.
  – Hash partitioning is based on a hash function that distributes the tuples according to a hashing key.
    • Useful for parallel exact-match queries and hash-join operations.
    • Not appropriate for range queries or for operations on attributes other than the partitioning key.
  – Range partitioning is based on value intervals of the partitioning key.
    • Facilitates the evaluation of range queries.
    • Its performance depends on the interval size.
  – Round-robin partitioning distributes the tuples across the partitions in turn. This approach is also called striping. The number of logically consecutive tuples placed together forms a striping unit.
    • The relative size of the striping unit directly affects performance.
    • Small striping units result in more I/O parallelism for scans and long range queries.
    • Larger striping units, on the other hand, may add latency to complete scans.
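The three horizontal schemes above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: tuples are plain dictionaries, the "hash function" is a simple modulo on an integer key, and the function names and the `stripe` parameter are hypothetical, not taken from the slides.

```python
import bisect


def hash_partition(tuples, key, n):
    """Place each tuple in partition (key value mod n); a toy hash on integer keys."""
    parts = [[] for _ in range(n)]
    for t in tuples:
        parts[t[key] % n].append(t)
    return parts


def range_partition(tuples, key, upper_bounds):
    """upper_bounds are sorted, inclusive interval ends; the last partition takes the rest."""
    parts = [[] for _ in range(len(upper_bounds) + 1)]
    for t in tuples:
        parts[bisect.bisect_left(upper_bounds, t[key])].append(t)
    return parts


def round_robin_partition(tuples, n, stripe=1):
    """Deal out striping units of `stripe` consecutive tuples to the n partitions in turn."""
    parts = [[] for _ in range(n)]
    for i, t in enumerate(tuples):
        parts[(i // stripe) % n].append(t)
    return parts


rows = [{"A": i} for i in range(1, 11)]  # ten tuples with keys 1..10


def keys(parts):
    """Helper for inspection: show only the key values in each partition."""
    return [[t["A"] for t in p] for p in parts]


by_hash = keys(hash_partition(rows, "A", 3))
by_range = keys(range_partition(rows, "A", [5]))
by_round_robin = keys(round_robin_partition(rows, 3))
```

With a range boundary at 5, the range scheme yields the two partitions 1 to 5 and 6 to 10, so a range query on the key touches only the partitions whose intervals overlap it; the hash scheme spreads neighbouring keys over different partitions, which is why it suits exact-match lookups rather than range queries.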

(8)

Partitioning Schemes

[Figure: an original table with columns A, B, C and tuples 1 to 10, partitioned in four ways]
a) Vertical Partitioning: columns (A, B) and columns (A, C), each holding tuples 1 to 10
b) Hash Partitioning: tuples {1, 4, 5}, {7, 8, 10}, {2, 3, 6, 9}
c) Range Partitioning: tuples {1, 2, 3, 4, 5}, {6, 7, 8, 9, 10}
d) Round-Robin Partitioning: tuples {1, 4, 7, 10}, {2, 5, 8}, {3, 6, 9}
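Vertical partitioning can be sketched in the same spirit. The example below assumes that column A acts as a key and is repeated in every fragment so that whole tuples can be rebuilt by a join; the helper names `vertical_partition` and `reconstruct` are hypothetical.

```python
def vertical_partition(rows, key, column_groups):
    """Return one fragment per column group; every fragment also keeps the key column."""
    return [
        [{key: r[key], **{c: r[c] for c in cols}} for r in rows]
        for cols in column_groups
    ]


def reconstruct(fragments, key):
    """Join the fragments on the key to recover the whole tuples."""
    merged = {}
    for fragment in fragments:
        for r in fragment:
            merged.setdefault(r[key], {}).update(r)
    return [merged[k] for k in sorted(merged)]


rows = [{"A": i, "B": f"b{i}", "C": f"c{i}"} for i in range(1, 11)]
frag_ab, frag_ac = vertical_partition(rows, "A", [["B"], ["C"]])

# A projection on B only needs the narrower (A, B) fragment.
projection_b = [r["B"] for r in frag_ab]

# A query that needs all columns has to join the fragments again.
whole = reconstruct([frag_ab, frag_ac], "A")
```

The join in `reconstruct` is the cost hinted at above: queries that touch every column may need access to the whole table anyway.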

(9)

Virtual Partitioning

• Virtual partitioning, also called query partitioning, assumes that all tables are fully replicated on each cluster node.
• In this approach, a query is decomposed into subqueries which access small pieces of data by appending range predicates to the WHERE clause of that query.
• Each subquery then deals with only a small part of the data.

(10)

Virtual Partitioning (Example)

[Figure: the LineItem table is replicated on node A and node B; each node evaluates one subquery]

Original query:

SELECT Sum(L_ExtendedPrice * L_Discount) AS Revenue
FROM LineItem
WHERE L_Discount BETWEEN 0.03 AND 0.05

Subquery 1:

SELECT Sum(L_ExtendedPrice * L_Discount) AS Revenue
FROM LineItem
WHERE L_Discount BETWEEN 0.03 AND 0.05
AND L_OrderKey BETWEEN 0 AND 3000000

Subquery 2:

SELECT Sum(L_ExtendedPrice * L_Discount) AS Revenue
FROM LineItem
WHERE L_Discount BETWEEN 0.03 AND 0.05
AND L_OrderKey BETWEEN 3000001 AND 6000000
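The example can be tried out with Python's built-in `sqlite3` module. The sketch below is an assumption-laden toy: two in-memory SQLite databases stand in for the fully replicated nodes, the ten sample rows are made up, and `split_ranges` and `make_replica` are hypothetical helpers. It shows that adding non-overlapping range predicates on L_OrderKey and summing the partial results gives the same answer as the original query.

```python
import sqlite3


def split_ranges(lo, hi, n):
    """Split the inclusive key range [lo, hi] into n contiguous, non-overlapping ranges."""
    size = (hi - lo + 1) // n
    ranges, start = [], lo
    for i in range(n):
        end = hi if i == n - 1 else start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


QUERY = (
    "SELECT SUM(L_ExtendedPrice * L_Discount) AS Revenue "
    "FROM LineItem WHERE L_Discount BETWEEN 0.03 AND 0.05"
)
# Virtual partitioning: the same query plus a range predicate on the key.
SUBQUERY = QUERY + " AND L_OrderKey BETWEEN ? AND ?"


def make_replica(rows):
    """Create one in-memory database holding a full copy of LineItem."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE LineItem "
        "(L_OrderKey INTEGER, L_ExtendedPrice REAL, L_Discount REAL)"
    )
    conn.executemany("INSERT INTO LineItem VALUES (?, ?, ?)", rows)
    return conn


# Made-up sample data: odd keys get a 4% discount, even keys 10%.
rows = [(k, 100.0 * k, 0.04 if k % 2 else 0.10) for k in range(1, 11)]
replicas = [make_replica(rows), make_replica(rows)]  # full replication on two nodes

partials = []
for conn, (lo, hi) in zip(replicas, split_ranges(1, 10, len(replicas))):
    value = conn.execute(SUBQUERY, (lo, hi)).fetchone()[0]
    partials.append(value or 0.0)  # SUM over no rows yields NULL

revenue = sum(partials)                              # combined result
expected = replicas[0].execute(QUERY).fetchone()[0]  # original query on one node
```

The combination step depends on the aggregate: partial sums can simply be added, whereas an average, for instance, would need partial sums and counts.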

(11)

Replication Schemes

• Full Replication: Tables are duplicated on each cluster node. That is, each node holds an exact copy of the original database.
• Partial Replication: Only parts of the original database are replicated on the different cluster nodes.
• Mixed Replication: Both full and partial replication at the

(12)

Replication Schemes

[Figure: the original database and its placement under a) Full Replication, b) Partial Replication, c) Mixed Replication]

(13)

Mixed Data Design

- Organize the cluster as node groups (NG)
- Freely design every NG

[Figure: a database cluster with a global database scheme; Node 1 to Node 5 are organized into Node Group 1, NG 2 and NG 3, with co-existing design schemes 1, 2 and 3]

(14)

Query Parallelism in a Cluster

• Inter-query parallelism: The capability of the database management system to accept queries from multiple users simultaneously. Each query is executed independently of the others.
• Intra-query parallelism: Achieved by decomposing queries into subqueries and evaluating them simultaneously.
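The difference between the two forms can be sketched with Python's `concurrent.futures`. In this toy example, threads stand in for cluster nodes, the two lists are hypothetical horizontal partitions, and the "queries" are simple aggregate functions; none of these names come from the slides.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical horizontal partitions of one table (key values only).
partitions = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
all_values = [v for part in partitions for v in part]


def total(values):
    """A toy aggregate query: sum of the values."""
    return sum(values)


def count_even(values):
    """A second, unrelated toy query: number of even values."""
    return sum(1 for v in values if v % 2 == 0)


with ThreadPoolExecutor(max_workers=4) as pool:
    # Inter-query parallelism: two independent queries run side by side.
    q1 = pool.submit(total, all_values)
    q2 = pool.submit(count_even, all_values)
    inter_results = (q1.result(), q2.result())

    # Intra-query parallelism: one query is decomposed into one subquery
    # per partition, and the partial results are combined afterwards.
    intra_result = sum(pool.map(total, partitions))
```

Inter-query parallelism improves throughput across users, while intra-query parallelism shortens the response time of a single query, at the price of a final step that merges the partial results.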

(15)

[Figure: queries Q1 to Q5 running against database partitions, illustrating a) inter-query, b) intra-query & intra-partition, c) intra-query & inter-partition, d) intra-query & intra-partition & inter-partition parallelism]

(16)

Logical Cluster Organization

• Flat Cluster Architecture: Any cluster node can be accessed by clients.
  – Forms a federated database of distinct databases running on independent servers.
    • Connected by a LAN, with no resource sharing (such as disks).
  – Provides high availability and a simple design.
    • Replication is difficult to implement with this model.
• Middleware-Based Cluster Architecture: A client can only interact with the cluster through a coordination middleware.
  – The middleware is responsible for scheduling and routing the clients' requests.
  – The middleware has knowledge of the underlying cluster.
    • It can be used to ensure correct execution of concurrent updates and reads.
    • It can also improve overall throughput by choosing better components, e.g. less loaded ones, to perform client requests.
  – It is a single point of failure.
    • If the middleware fails, the cluster becomes unusable.
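A coordination middleware of this kind can be sketched as follows. The sketch assumes full replication, a toy load metric (requests served so far) and a write-all policy for updates; the `Node` and `Middleware` classes are hypothetical and only illustrate routing, not a real scheduler.

```python
class Node:
    """A cluster node holding a full copy of the data (assumption)."""

    def __init__(self, name):
        self.name = name
        self.load = 0   # requests served so far; a toy load metric
        self.data = {}


class Middleware:
    """Single entry point: clients never talk to the nodes directly."""

    def __init__(self, nodes):
        self.nodes = nodes  # knowledge about the underlying cluster

    def read(self, key):
        # Route the read to the currently least loaded node.
        node = min(self.nodes, key=lambda n: n.load)
        node.load += 1
        return node.name, node.data.get(key)

    def update(self, key, value):
        # Apply the update on every node so that later reads agree.
        for node in self.nodes:
            node.data[key] = value
            node.load += 1


mw = Middleware([Node("n1"), Node("n2")])
mw.update("x", 1)
first = mw.read("x")    # both nodes are equally loaded, so the first one is chosen
second = mw.read("x")   # n1 is now busier, so the read goes to n2
```

Because every request passes through the single `Middleware` object, losing it makes all nodes unreachable, which is the single point of failure noted above.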

(17)

Logical Cluster Organization

[Figure: clients access the database cluster through the coordination middleware]

(18)

Replication Management

• Replication is an essential technique to improve availability and scalability by fully or partially duplicating data objects among the nodes of a distributed system.
• Replication management is responsible for the maintenance of replicas and ensures consistency of multiple copies of the same data object residing on different nodes.
• That is, replication management is not simply copying data objects onto different nodes of a distributed system.

(19)

Synchronization of Updates

• There are two possibilities for the location of updates:
  – Updates can be centralized on one primary copy.
  – Or they can be distributed over all replicas, or a subset of them (update everywhere).
• Synchronization of updates can be done in two ways: eager and lazy.

[Figure: a) Primary Copy, b) Update Everywhere. Legend: update, propagation, updatable object, read-only object]
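A primary copy scheme with both synchronization modes can be sketched as below. This is a simplification under stated assumptions: dictionaries stand in for databases, "eager" means the read-only replicas are updated before the write returns (a stand-in for updating them within the same transaction), and "lazy" means the updates are queued and propagated later. The `PrimaryCopy` class and its methods are hypothetical.

```python
class PrimaryCopy:
    """All updates go to one primary copy; the replicas are read-only."""

    def __init__(self, n_replicas, eager):
        self.primary = {}
        self.replicas = [{} for _ in range(n_replicas)]
        self.eager = eager
        self.pending = []   # updates not yet propagated (lazy mode only)

    def write(self, key, value):
        self.primary[key] = value
        if self.eager:
            # Eager: synchronize every copy as part of the write itself.
            for replica in self.replicas:
                replica[key] = value
        else:
            # Lazy: remember the update and propagate it separately.
            self.pending.append((key, value))

    def propagate(self):
        """Run the deferred replica maintenance (lazy mode)."""
        for key, value in self.pending:
            for replica in self.replicas:
                replica[key] = value
        self.pending.clear()


eager = PrimaryCopy(n_replicas=2, eager=True)
eager.write("x", 1)
eager_view = [r.get("x") for r in eager.replicas]    # replicas already up to date

lazy = PrimaryCopy(n_replicas=2, eager=False)
lazy.write("x", 1)
stale_view = [r.get("x") for r in lazy.replicas]     # replicas lag behind the primary
lazy.propagate()
fresh_view = [r.get("x") for r in lazy.replicas]     # replicas caught up
```

The lazy variant returns from `write` sooner but exposes a window in which reads from a replica see stale data; the eager variant closes that window at the cost of touching every copy on each write.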

(20)

Synchronization of Updates

• Eager (or synchronous) replication
  – All copies of an object are synchronized within the same database transaction.
  – Allows early detection of conflicts and offers a simple way to provide consistency.
  – Has performance drawbacks due to the high communication overhead among the replicas and the high probability of deadlocks.
• Lazy (or asynchronous) replication
  – Replica maintenance is decoupled from the original database transaction.
  – The transactions that keep the replicas up to date and consistent run as separate, independent database transactions after the original transaction has committed.
  – Compared to eager replication approaches, lazy approaches require

(21)
(22)
(23)

Lazy Primary Copy Replication with Immediate Updates

(24)

Lazy Primary Copy Replication with Deferred Updates

(25)
