(1)

Copyright © 2014 Oracle and/or its affiliates. All rights reserved.

Using RDBMS, NoSQL or Hadoop?

Jean-Pierre Dijcks

Big Data Product Management

Server Technologies

 

(2)
(3)

[Diagram: HDFS ingest. Data arrives as 256 MB "chunks", written synchronously to Node 1, Node 2, and Node 3.]

(4)

[Diagram: the HDFS ingest flow above, plus NoSQL ingest. <key><value> "records" are written to a Master Node and propagated to Replica Node 1 and Replica Node 2 at times t, t + 1, t + 2 (eventually consistent).]

(5)

[Diagram: the HDFS and NoSQL ingest flows above, plus RDBMS ingest. "Transactions" are parsed/validated on compute, then written to optimized DBF storage via ASM and accessed with SQL.]

(6)

High-level Comparison

Ingest            HDFS          NoSQL                  RDBMS
Data Type         Chunk         Record                 Transaction
Write Type        Synchronous   Eventually Consistent  ACID Compliant
Data Preparation  No Parsing    No Parsing             Parsing and Validation

(7)

Disaster Recovery: High Availability Across Geography

(8)

Disaster Recovery: Hadoop

[Diagram: ingest of 256 MB "chunks", synchronous between Node 1, Node 2, and Node 3 in Data Center 1; a second cluster (Node 1, Node 2, Node 3) in Data Center 2.]

MUST duplicate to a separate cluster

(9)

Disaster Recovery: RDBMS

[Diagram: "transactions" are parsed/validated on compute, written to optimized DBF storage via ASM, accessed with SQL; replication to a second site via Data Guard or GoldenGate.]

(10)

Disaster Recovery: NoSQL Option

[Diagram: records ingested to a Master Node in Data Center 1 or Region 1; <key><value> pairs propagate to Replica Node 1 and Replica Node 2 in Data Center 2 or Region 2 at times t, t + 1, t + 2 (eventually consistent).]
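The eventually consistent replication sketched above can be illustrated with a toy model (the class and method names are hypothetical; real NoSQL stores ship replication asynchronously over the network):

```python
# Toy model of eventually consistent replication: the master applies a
# write immediately; each replica applies it one "tick" later than the
# node before it (t, t + 1, t + 2), so reads from a replica can be stale.

class Node:
    def __init__(self, name):
        self.name = name
        self.store = {}        # key -> value, this node's current view

class EventuallyConsistentGroup:
    def __init__(self, master, replicas):
        self.master = master
        self.replicas = replicas
        self.pending = []      # (apply_at_tick, replica, key, value)
        self.tick = 0

    def put(self, key, value):
        # Master applies at time t; replica i applies at t + i + 1.
        self.master.store[key] = value
        for i, replica in enumerate(self.replicas):
            self.pending.append((self.tick + i + 1, replica, key, value))

    def advance(self):
        # One clock tick: deliver all replication messages that are due.
        self.tick += 1
        due = [p for p in self.pending if p[0] <= self.tick]
        self.pending = [p for p in self.pending if p[0] > self.tick]
        for _, replica, key, value in due:
            replica.store[key] = value

group = EventuallyConsistentGroup(Node("master"),
                                  [Node("replica1"), Node("replica2")])
group.put("k", "v")
print(group.master.store.get("k"))        # v
print(group.replicas[1].store.get("k"))   # None (stale read at time t)
group.advance()
group.advance()
print(group.replicas[1].store.get("k"))   # v (caught up by t + 2)
```

The stale read in the middle is exactly the "may not be consistent (yet)" case discussed on the NoSQL query slide below.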

(11)

Note on HBase

Don't confuse HBase with NoSQL in terms of geographical distribution options

HBase is based on HDFS storage

Consequently, HBase cannot span data centers as we see in other NoSQL systems

(12)

Big Data Appliance includes Cloudera BDR

Oracle packages this on BDA as a Cloudera option called BDR (Backup and Disaster Recovery)

BDR = DistCp, a distributed copy method leveraging parallel compute (MapReduce-like)

BDR is NOT as proven as Data Guard or GoldenGate

Most importantly, think of BDR as a batch process, not a trickle
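The batch character of DistCp-style replication can be sketched with a toy incremental copy (the helper name is hypothetical; real DistCp runs the copy as parallel MapReduce tasks over HDFS):

```python
# Toy DistCp-style batch replication: on each scheduled run, copy every
# file that changed since the last run. Changes made between runs are
# invisible at the destination until the next batch fires, unlike
# trickle (per-record) replication.
import os
import shutil
import tempfile

def batch_sync(src_dir, dst_dir, last_run_mtime):
    """Copy files modified after last_run_mtime; return newest mtime seen."""
    os.makedirs(dst_dir, exist_ok=True)
    newest = last_run_mtime
    for name in os.listdir(src_dir):
        src = os.path.join(src_dir, name)
        mtime = os.path.getmtime(src)
        if mtime > last_run_mtime:
            shutil.copy2(src, os.path.join(dst_dir, name))
        newest = max(newest, mtime)
    return newest

src = tempfile.mkdtemp()
dst = tempfile.mkdtemp()
with open(os.path.join(src, "part-00000"), "w") as f:
    f.write("chunk of log data")
# Nothing reaches the destination until the batch actually runs.
checkpoint = batch_sync(src, dst, last_run_mtime=0.0)
print(sorted(os.listdir(dst)))  # ['part-00000']
```

The checkpoint returned by one run feeds the next, so anything written between runs waits for the following batch, which is the RPO gap the slide warns about.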

(13)

High-level Comparison

Ingest            HDFS            NoSQL                  RDBMS
Data Type         Chunk           Record                 Transaction
Write Type        Synchronous     Eventually Consistent  ACID Compliant
Data Preparation  No Parsing      No Parsing             Parsing and Validation

DR
DR Type           Second Cluster  Node Replica           Second RDBMS
DR Unit           File            Record                 Transaction
DR Timing         Batch           Record                 Transaction

(14)
(15)

Data Sets and Analytical SQL Queries: Hadoop

[Diagram: a JSON "chunk" lands as a 256 MB block, with no parse on INSERT]

INSERT
•  Data is NOT parsed on ingest; for example, a JSON document is not parsed. Original files are loaded and broken into chunks
•  Does not like small files

UPDATE
•  NO update allowed (append only)
•  Expensive to update a few KB by replacing a 256 MB chunk (as part of a file replacement)

SELECT
•  Scan ALL data even for a single-"row" answer
•  SQL access path is a "full table scan"
•  Hive optimizes this with:
   •  Metadata (partition pruning)
   •  MapReduce in parallel to achieve scale
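The full-scan cost and the partition-pruning optimization above can be shown with a minimal sketch (the data layout and function names are hypothetical; Hive actually keeps partitions as HDFS directories such as dt=2014-01-01 and prunes them via table metadata):

```python
# Minimal sketch of full scan vs. partition pruning. Without metadata,
# answering a query for one day means reading every row; with partitions
# keyed by date, only the matching partition is touched.

# Hypothetical data set: partition key (dt) -> rows in that partition.
partitions = {
    "2014-01-01": [{"user": "a", "clicks": 3}, {"user": "b", "clicks": 1}],
    "2014-01-02": [{"user": "a", "clicks": 7}],
    "2014-01-03": [{"user": "c", "clicks": 2}, {"user": "d", "clicks": 5}],
}

def full_scan(partitions, dt):
    """No metadata: read every row, then filter (HDFS-style access)."""
    rows_read = 0
    result = []
    for part_dt, rows in partitions.items():
        for row in rows:
            rows_read += 1
            if part_dt == dt:
                result.append(row)
    return result, rows_read

def pruned_scan(partitions, dt):
    """Partition pruning: metadata eliminates non-matching partitions."""
    rows = partitions.get(dt, [])
    return list(rows), len(rows)

hits, read = full_scan(partitions, "2014-01-02")
print(read)   # 5 rows touched for a 1-row answer
hits, read = pruned_scan(partitions, "2014-01-02")
print(read)   # 1 row touched
```

Pruning narrows which partitions are scanned, but within a partition the access path is still a scan, which is why the slide calls it an optimization rather than an index.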

(16)

Data Sets and Analytical SQL Queries: NoSQL

[Diagram: a JSON document stored as <key><value>; retrieval is an index lookup via the API: Get, Put, Mget, Mput]

PUT (Insert)
•  Data is typically not parsed; for example, a JSON document is loaded as a value with a key

GET (Select)
•  Data retrieval through "index lookup" based on KEY
•  Only access path is an index (primary or secondary)
•  If retrieving from a replica, data may not be consistent (yet)
•  Generally do not issue SQL over NoSQL
•  Not used for large analysis; instead used to receive and provide single records or small sets based on the same key
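The get/put/mget/mput API named above can be sketched as a minimal in-memory store (a hypothetical class; real NoSQL databases add replication, persistence, and key distribution on top of this shape):

```python
# Minimal sketch of a key-value store with the get/put/mget/mput API
# style from the slide. The store treats each value as opaque bytes:
# a JSON document goes in unparsed and comes back unparsed.
import json

class KVStore:
    def __init__(self):
        self._data = {}    # the primary "index": key -> opaque value

    def put(self, key, value):
        # Stored as-is (e.g., a serialized JSON document); never parsed.
        self._data[key] = value

    def get(self, key):
        # Retrieval is a single index lookup on the key.
        return self._data.get(key)

    def mput(self, pairs):
        # Multi-put: a batch of independent single-key writes.
        for key, value in pairs.items():
            self.put(key, value)

    def mget(self, keys):
        # Multi-get: a batch of independent single-key lookups.
        return {key: self.get(key) for key in keys}

store = KVStore()
store.put("user:42", json.dumps({"name": "Ada", "plan": "pro"}))
doc = json.loads(store.get("user:42"))   # the CLIENT parses, not the store
print(doc["name"])  # Ada
store.mput({"user:43": "{}", "user:44": "{}"})
print(sorted(store.mget(["user:43", "user:44"])))  # ['user:43', 'user:44']
```

Note there is no operation that inspects values, which is exactly why the slide says this model cannot serve analytical SQL: every question must be phrased as key lookups.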

(17)

Data Sets and Analytical SQL Queries: RDBMS

[Diagram: data lands in a TABLE with an INDEX]

INSERT
•  Parse data and optimize for retrieval
•  Adhere to transaction consistency

UPDATE
•  Enables individual records to be updated
   •  With read consistency
   •  Row-level locking etc.

SELECT
•  Read consistency guaranteed
•  SQL optimization:
   •  Choose the best access path based on the question and database statistics
   •  Optimized joins, spills to disk etc.
•  Supports very complex questions

Data is parsed; metadata is available on write; auxiliary structures enable a myriad of access paths
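These properties can be demonstrated in miniature with SQLite standing in for any RDBMS (the table and index names are made up for the example): data is validated on insert, an index provides a direct access path, and an update runs inside a transaction.

```python
# RDBMS behavior in miniature: constraints are checked at write time,
# the optimizer can choose an index access path over a full scan, and
# a single record is updated transactionally.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer TEXT NOT NULL,
    amount REAL NOT NULL CHECK (amount >= 0))""")
conn.execute("CREATE INDEX idx_customer ON orders(customer)")

# Ingest: parsed into typed columns, constraints checked on write.
conn.execute("INSERT INTO orders VALUES (1, 'acme', 250.0)")
try:
    conn.execute("INSERT INTO orders VALUES (2, 'acme', -5.0)")
except sqlite3.IntegrityError:
    print("rejected on ingest")      # validation happens at write time

# Select: the optimizer picks the index rather than a full table scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer = 'acme'"
).fetchall()
print(any("idx_customer" in row[-1] for row in plan))  # True

# Update a single record inside a transaction (commits on exit).
with conn:
    conn.execute("UPDATE orders SET amount = 300.0 WHERE id = 1")
print(conn.execute("SELECT amount FROM orders WHERE id = 1").fetchone()[0])
```

The EXPLAIN QUERY PLAN line is the point of the slide: because parsing and indexing were paid on ingest, the engine has multiple access paths to choose from at query time.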

(18)

High-level Comparison

Ingest                    HDFS                   NoSQL                      RDBMS
Data Type                 Chunk                  Record                     Transaction
Write Type                Synchronous            Eventually Consistent      ACID Compliant
Data Preparation          No Parsing             No Parsing                 Parsing and Validation

DR
DR Type                   Second Cluster         Node Replica               Second RDBMS
DR Unit                   File                   Record                     Transaction
DR Timing                 Batch                  Record                     Transaction

Access
Complex Analytics?        Yes                    No                         Yes
Query Speed               Slow                   Fast for simple questions  Fast
# of Data Access Methods  One (full table scan)  One (index lookup)         Many (Optimized)

(19)

Performance Aspects

Ingest and Query

(20)

A simple set of criteria

[Radar chart comparing RDBMS, NoSQL DB, and Hadoop on a 0 to 5 scale across: Concurrency, Complex Query Response Times, Single Record Read/Write Performance, Bulk Write Performance, Privileged User Security, General User Security, Governance Tools, System per TB Cost, Backup per TB Cost, Skills Acquisition Cost]

(21)

Ingest Performance Considerations

HDFS
•  No parsing
•  Fastest for ingesting large data sets, like log files
•  Bulk write of sets of data, not individual records
•  Slower write on a record-by-record basis than NoSQL

NoSQL
•  No parsing
•  Fastest for writing individual records to the master
•  Direct writes on a per-record basis
•  Best for millisecond writes

RDBMS
•  RDBMS does work on the data (parsing) on ingest
•  Ingest speed is slower than NoSQL on a record-by-record basis due to parsing
•  Ingest speed is slower than Hadoop on a file-by-file basis due to parsing
•  Benefits of RDBMS parsing become evident in query performance…

(22)

Query Performance Considerations

HDFS
•  Requires parsing
•  Slowest on query time
   –  Due to full table scans on top of parsing
   –  No query caching
•  Able to run complex queries
   –  Requires use of MR or Spark programs
   –  SQL immature (SQL-92)

NoSQL
•  No parsing
•  Fastest on query time for "get"
•  Consistently fast is the goal
•  No syntax available to run complex queries

RDBMS
•  Fastest SQL query times because of the parsing work done on ingest
   –  Completely optimized storage formats for I/O
   –  Oracle can pick out specific records using metadata on files
   –  Least amount of I/O to retrieve records
   –  Advanced caching
•  RDBMS can run more complex SQL queries than NoSQL or Hadoop

(23)

SQL on Hadoop vs RDBMS

Hadoop
Pay on QUERY for GAINS on INGEST
Full table scan
Shorten the full table scan by adding more nodes to scan through data
The problem remains: data isn't parsed
All raw data must be read for EACH and EVERY read

RDBMS
Pay on INGEST (Parsing) for GAINS on QUERY
Optimized query path
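The pay-on-query versus pay-on-ingest trade-off can be made concrete with a toy example (the function names are hypothetical): storing raw JSON lines defers all parsing to query time, while parsing into a column at ingest makes each subsequent query cheap.

```python
# Toy illustration of "pay on query" vs. "pay on ingest". The
# Hadoop-style store appends raw JSON lines and re-parses ALL of them
# on every query; the RDBMS-style store parses (and validates) once at
# ingest into a column, so queries skip parsing entirely.
import json

raw_lines = [json.dumps({"user": f"u{i}", "amount": i}) for i in range(1000)]

# Hadoop-style: ingest is a cheap append of unparsed bytes.
hadoop_store = list(raw_lines)

def hadoop_query(store):
    # Every query re-parses every record (full scan over raw data).
    return sum(json.loads(line)["amount"] for line in store)

# RDBMS-style: parsing and validation are paid once, at ingest.
rdbms_amounts = []
for line in raw_lines:
    record = json.loads(line)            # parse on write
    assert record["amount"] >= 0         # validate on write
    rdbms_amounts.append(record["amount"])

def rdbms_query(column):
    # The query touches only pre-parsed values.
    return sum(column)

print(hadoop_query(hadoop_store) == rdbms_query(rdbms_amounts))  # True
```

Both paths return the same answer; the difference is only where the parsing cost lands, which is the entire argument of this slide.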

(24)

Benefits Compared

Hadoop
•  "Schema on Read" flexibility
•  Write file on disk without error checking
•  Programmatically stitch disparate data together
•  Affordable scale

NoSQL
•  Low latency for simple actions
•  HOWEVER no API to run analytical queries

RDBMS
•  High performance for lots of workloads without the end user knowing
•  Parsed data
•  Multiple access paths
•  Advanced query caching
•  In-memory formats
•  Consistency: exact same result at the exact same time
•  Traceability of transactions
•  Applications become simpler

(25)
(26)

Concurrency Comparisons

HDFS
•  Most Hadoop systems have a small number of users running a large number of jobs
   –  Batch or very frequent micro-batch calculations
•  Customers report Impala struggles with concurrency (much like Netezza, or worse)
•  Immature resource management; not all solutions work nicely together

NoSQL
•  Very high concurrency
•  Geographically distributed
•  Must deal with consistency issues
•  Great reader farm for publishing data to, for example, web apps
•  Load balancing built in

RDBMS
•  Hundreds of users, hundreds of queries running at the same time
•  Queries are balanced over the system
•  Resource Management able to control the whole system
•  Proven in many large …

(27)
(28)

Cost

Customers want to reduce cost by moving from RDBMS to Hadoop

Hadoop delivers the lowest cost per TB

But which workloads can move from RDBMS to Hadoop?

Dealing with transactions? Stay in RDBMS

Running advanced SQL queries? Stay in RDBMS

Large numbers of concurrent users? Stay in RDBMS

Moving a DW to Hadoop incurs a high performance penalty because of full table scans

(29)

Can SQL on Hadoop solve the problems without sacrifice?

(30)

No. 1½ Attempts

Impala
•  Solution to speed up Hive and MR
•  Wrote its own optimizer and query engine
•  Large speedup, but at the cost of:
   –  No MR, so loses scalability
   –  Less SQL capability, so loses BI transparency
   –  System cost increases to run Impala (needs additional memory), so loses some cost advantage

Impala with Parquet
•  Convert files into columnar data blocks (Parquet file format)
•  Speedup because of columnar formats etc.
•  But at a cost:
   –  Must run ETL to Parquet (comparable to ingest into an RDBMS)
   –  Lose schema-on-read flexibility
   –  Double the data
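The columnar speedup Parquet provides, and its ETL and duplication costs, can be sketched in miniature (a toy layout, not the actual Parquet format; the function names are hypothetical):

```python
# Toy row vs. columnar layout. The "ETL to Parquet" step transposes row
# records into per-column arrays, paying time and a second copy of the
# data; a query over one column then touches only that column's values
# instead of every field of every row.

rows = [{"user": f"u{i}", "country": "NL", "amount": float(i)}
        for i in range(100)]

def to_columnar(rows):
    """The ETL pass: transpose row records into column arrays."""
    columns = {name: [] for name in rows[0]}
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return columns

def row_sum_amount(rows):
    # Row layout: every field of every row is materialized to read one.
    values_touched = sum(len(row) for row in rows)
    return sum(row["amount"] for row in rows), values_touched

def columnar_sum_amount(columns):
    # Columnar layout: only the "amount" column is read.
    col = columns["amount"]
    return sum(col), len(col)

columns = to_columnar(rows)   # extra ETL pass, and a duplicate data copy
total_r, touched_r = row_sum_amount(rows)
total_c, touched_c = columnar_sum_amount(columns)
print(total_r == total_c)    # True: same answer either way
print(touched_r, touched_c)  # 300 100
```

Same answer, a third of the values touched, but only after an ingest-like conversion step, which is why the slide counts Impala with Parquet as only "half" an attempt to escape the trade-off.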

(31)

Conclusion: Choose the Right Tool for the Right Job

(32)
