Beyond Hadoop with Apache Spark and BDAS

(1) Beyond Hadoop with Apache Spark and BDAS. Khanderao Kand, Principal Technologist, Guavus. GITPRO World 2014, Palo Alto, CA, 12 April 2014. Credit: some statistics and content come from publicly shared spark.apache.org and AmpLabs presentations.

(2) Apache Spark
Apache Spark™ is a fast and general engine for large-scale data processing.
• Fast
• Easy to use
• Supports a mix of interaction models

(3) Big Data Systems Today
Different models, different specialized systems: batch (MapReduce, Tez), interactive (Dremel, Impala, Drill), streaming (Storm, S4, Samza), and graph (Pregel, Giraph, GraphLab). Each covers only one of the iterative, interactive, and streaming workloads.

(4) Apache Spark: One Framework
Three programming models in one unified framework: Batch, Streaming, and Interactive.

(5) What is Spark?
• Open source from AMPLab, UC Berkeley
• Fast, interactive big data processing engine
• Not a modified version of Hadoop
• Compatible with Hadoop's storage APIs: can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.
• A separate, fast, MapReduce-like engine: in-memory data storage for very fast iterative queries, general execution graphs, and powerful optimizations; up to 40x faster than Hadoop for iterative cases
• Faster data sharing across parallel jobs, as required by multi-stage and interactive apps
• Efficient fault-tolerance model
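To make the Hadoop-storage compatibility concrete, here is a minimal sketch of reading from and writing back to HDFS through the standard Spark API; the paths, app name, and the ERROR filter are illustrative and not from the original deck.

    import org.apache.spark.{SparkConf, SparkContext}

    object HdfsReadWrite {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("HdfsReadWrite"))

        // Any Hadoop-supported URI works here (hdfs://, file://, HBase via input formats, ...)
        val lines  = sc.textFile("hdfs://namenode:8020/data/input.txt")
        val errors = lines.filter(_.contains("ERROR"))                // lazy transformation
        errors.saveAsTextFile("hdfs://namenode:8020/data/errors")     // action: triggers the job

        sc.stop()
      }
    }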

(6) Spark History
• AMPLab, UC Berkeley
• Spark started as a research project in 2009
• Open sourced in 2010
• Growing community since
• Entered the Apache Incubator in June 2013; 1.0 is now coming out
• Commercial backing: Databricks
• Enhancements by current stacks: Cloudera, MapR
• Additional contributions: Yahoo, Conviva, Ooyala, ...

(7) Contributions
• AMPLab / Databricks
• YARN support (Yahoo!)
• Columnar compression in Shark (Yahoo!)
• Fair scheduling (Intel)
• Metrics reporting (Intel, Quantifind)
• New RDD operators (Bizo, ClearStory)
• Scala 2.10 support (Imaginea)

(8) How does data sharing work in Hadoop MapReduce jobs?
Every iteration goes through the file system: HDFS read -> iter. 1 -> HDFS write -> HDFS read -> iter. 2 -> HDFS write -> ...
Result: Hadoop MapReduce is slow because of:
1. Replication
2. Serialization
3. Disk I/O

(9) Spark is faster due to in-memory data sharing across jobs
The input is read once and kept in distributed memory; iter. 1, iter. 2, ... and query 1, query 2, query 3, ... all work against that in-memory data after one-time processing.
In some cases, this is 10-100x faster than going through the network and disk.
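A minimal sketch of this data-sharing idea in the Spark shell: the data set is loaded once, cached in distributed memory, and several independent queries then reuse it without going back to disk. The path, record layout, and queries are hypothetical.

    val logs = sc.textFile("hdfs://namenode:8020/data/weblogs").cache()   // kept in memory after first use

    val notFound  = logs.filter(_.contains(" 404 ")).count()    // first action materializes the cache
    val serverErr = logs.filter(_.contains(" 500 ")).count()    // served from memory
    val clients   = logs.map(_.split(" ")(0)).distinct().count()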

(10) Spark is Faster and Easier
• Extremely fast scheduling: milliseconds to seconds in Spark vs. seconds to minutes to hours in Hadoop MR
• Many more useful primitives: higher-level APIs beyond map-reduce-filter, broadcast and accumulator variables, and the ability to combine programming models

(11) Result: Spark vs. Hadoop Performance
PageRank iteration time: Hadoop 171 s, basic Spark 72 s, Spark with controlled partitioning 23 s.

(12) Apache Spark
• Unifies batch, streaming, and interactive computation
• Easy to build sophisticated applications: supports iterative, graph-parallel algorithms; powerful APIs in Scala, Python, and Java
The BDAS stack: Spark Core (RDDs, MapReduce) runs on Mesos / YARN over HDFS, with Shark SQL (interactive), Spark Streaming, GraphX (graph), MLlib (machine learning), and BlinkDB layered on top.

(13) RDD
• Read-only, parallelized, fault-tolerant data set
• Created from: files, transformations, parallelizing a collection, or another RDD
• Spark tracks the lineage, which is used to recreate an RDD upon failure
• Lazily realized: nothing is computed until an action runs
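A short sketch of the main ways an RDD is created, matching the list above; the file path and specific transformations are made up for illustration.

    val fromFile       = sc.textFile("hdfs://namenode:8020/data/events.log")   // from files
    val fromCollection = sc.parallelize(1 to 1000)                             // parallelizing a collection
    val fromOtherRdd   = fromFile.filter(_.nonEmpty)                           // from another RDD, via a transformation

    // Lazy realization: nothing runs until an action such as count() is called.
    val total = fromOtherRdd.count()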

(14) RDD Fault Tolerance
RDDs track lineage: the transformations used to build them are recorded so that lost data can be recomputed.
E.g. (Python API):
messages = textFile(...).filter(lambda s: "Spark" in s).map(lambda s: s.split('\t')[2])
Lineage: HadoopRDD (path = hdfs://...) -> FilteredRDD (func = contains(...)) -> MappedRDD (func = split(...))

(15) Richer Programming Constructs
• Not just map, reduce, and filter
• Also: flatMap, count, groupByKey, reduceByKey, sortByKey, join, and more
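A small sketch of a few of these primitives on pair RDDs; the sample data is invented.

    val orders = sc.parallelize(Seq(("alice", 30.0), ("bob", 20.0), ("alice", 12.5)))
    val cities = sc.parallelize(Seq(("alice", "Palo Alto"), ("bob", "Berkeley")))

    val spendPerUser = orders.reduceByKey(_ + _)        // aggregate values by key
    val sorted       = spendPerUser.sortByKey()         // sort by key
    val joined       = spendPerUser.join(cities)        // (user, (totalSpend, city))
    joined.collect().foreach(println)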

(16) Shared Variables
• Two types: broadcast variables and accumulators
• Broadcast: read-only, lookup-style shared dataset; can be large (tens to hundreds of GB)
• Accumulators: additive variables, like counters in MapReduce
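A sketch of both shared-variable types, assuming the Spark 0.9/1.x API (sc.broadcast and sc.accumulator); the lookup table and the input file are hypothetical.

    val countryByCode = sc.broadcast(Map("us" -> "United States", "in" -> "India"))   // read-only lookup, shipped once per node
    val badRecords    = sc.accumulator(0)                                             // additive counter, like Hadoop counters

    val resolved = sc.textFile("hdfs://namenode:8020/data/visits.csv").map { line =>
      val code = line.split(",")(1)
      countryByCode.value.getOrElse(code, { badRecords += 1; "unknown" })
    }
    resolved.count()
    println("bad records: " + badRecords.value)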

(17) Scala Shell: Interactive Big Data Programming
scala> val file = sc.textFile("mytext.data")
scala> val words = file.flatMap(line => line.split(" "))
scala> val counts = words.map(word => (word, 1)).reduceByKey(_ + _)

(18) Spark Streaming
Micro-batches: a series of very small, deterministic batch jobs.
• The live data stream is chopped into small (seconds-long) batches
• Each micro-batch is treated as an RDD, with the same RDD programming model and operations
• Results are returned as batches of processed results
• Recovery is based on lineage and checkpointing
Flow: live data stream -> Spark Streaming -> batches of X seconds -> Spark -> processed results
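A minimal sketch of the micro-batch model: a StreamingContext with a 2-second batch interval, a TCP-socket source, and RDD-style operations applied to each batch. The host, port, and batch size are illustrative.

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc   = new StreamingContext(sc, Seconds(2))        // each micro-batch covers 2 seconds of the live stream
    val lines = ssc.socketTextStream("localhost", 9999)     // DStream: a sequence of RDDs, one per batch

    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()                                          // results come back batch by batch

    ssc.start()
    ssc.awaitTermination()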

(19) Comparison with Storm and S4
Higher throughput than Storm:
• Spark Streaming: 670k records/second/node
• Storm: 115k records/second/node
• Apache S4: 7.5k records/second/node
Per-node throughput charts (Grep and WordCount, at 100- and 1000-byte record sizes) show Spark ahead of Storm in both. (Ref: AmpLabs, Tathagata Das)

(20) DStream Data Sources
• Many sources out of the box: HDFS, Kafka, Flume, Twitter, TCP sockets, Akka actors, ZeroMQ
• Extensible
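A sketch of a few of these sources; the Kafka and Flume receivers ship in the separate spark-streaming-kafka and spark-streaming-flume modules, and the hosts, ports, topics, and consumer group below are hypothetical.

    import org.apache.spark.streaming.kafka.KafkaUtils
    import org.apache.spark.streaming.flume.FlumeUtils

    val fromHdfs   = ssc.textFileStream("hdfs://namenode:8020/incoming")        // new files appearing in a directory
    val fromSocket = ssc.socketTextStream("collector-host", 9999)               // raw TCP socket
    val fromKafka  = KafkaUtils.createStream(ssc, "zk-host:2181", "my-group",   // Kafka topics via ZooKeeper
                                             Map("events" -> 1))
    val fromFlume  = FlumeUtils.createStream(ssc, "localhost", 4141)            // Flume agent sink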

(21) Transformations
Build new streams from existing streams:
• RDD-like operations: map, flatMap, filter, count, reduce, groupByKey, reduceByKey, sortByKey, join, etc.
• New window and stateful operations: window, countByWindow, reduceByWindow, countByValueAndWindow, reduceByKeyAndWindow, updateStateByKey, etc.
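For example, a windowed word count (30-second window, evaluated every 10 seconds) could look roughly like the sketch below, reusing a lines DStream such as the one in the earlier streaming sketch; the window sizes are illustrative.

    import org.apache.spark.streaming.Seconds

    val words = lines.flatMap(_.split(" ")).map((_, 1))

    val windowedCounts = words.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b,   // combine counts within the window
      Seconds(30),                 // window length
      Seconds(10))                 // slide interval

    windowedCounts.print()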

(22) Sample Program: find the top 10 trending hashtags over the last 10 minutes

// Create the stream of tweets
val tweets = ssc.twitterStream(<username>, <password>)

// Count the tags over a 10-minute window
val tagCounts = tweets.flatMap(status => getTags(status))
                      .countByValueAndWindow(Minutes(10), Seconds(1))

// Sort the tags by counts
val sortedTags = tagCounts.map { case (tag, count) => (count, tag) }
                          .transform(_.sortByKey(false))

// Show the top 10 tags
sortedTags.foreach(showTopTags(10) _)

(23) Storm + Fusion Convergence: the Twitter model.

(24) Shark
• High-speed in-memory analytics over Hadoop (HDFS) and Hive data
• Common HiveQL provides seamless federation between Hive and Shark
• Sits on top of existing Hive warehouse data
• Direct query access from UIs such as Tableau
• Direct ODBC/JDBC from desktop BI tools

(25) Shark
• Analytic query engine compatible with Hive: HiveQL, UDFs, SerDes, scripts, types
• Makes Hive queries run much faster: builds on top of Spark, caches data, and applies optimizations

(26) Shark Internals
• A faster execution engine than Hadoop MapReduce
• Optimized storage format: columnar memory store
• Various other optimizations: fully distributed sort, data co-partitioning, partition pruning

(27) Shark Performance Comparison.

(28) Performance
Charts comparing the stack against specialized systems: SQL response time for Shark (in-memory and on-disk) vs. Impala (in-memory and on-disk) and Hive; streaming throughput (MB/s/node) for Spark Streaming vs. Storm; graph workload response time for GraphX vs. Giraph and Hadoop.

(29) MLBase Inventory
• Classification: NaiveBayes, SVM, SVMWithSGD
• Clustering: KMeans
• Regression: Linear, Lasso, RidgeRegressionWithSGD
• Optimization: GradientDescent, LogisticGradient
• Recommendation: collaborative filtering / ALS
• Others: matrix factorization
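As one example from this inventory, k-means clustering with MLlib (Spark 1.x API) might look like the sketch below; the input path, number of clusters, and iteration count are illustrative.

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val points = sc.textFile("hdfs://namenode:8020/data/points.csv")
      .map(line => Vectors.dense(line.split(",").map(_.toDouble)))   // each line -> dense feature vector

    val model = KMeans.train(points, 3, 20)                          // k = 3 clusters, up to 20 iterations
    println("cluster centers: " + model.clusterCenters.mkString(", "))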

(30)
