(1)

1  

ESS event: Big Data in Official Statistics

                                                         


Antonino Virgillito, Istat

(2)

2  

About me

• Head of Unit “Web and BI Technologies”, IT Directorate of Istat
• Project manager and technical coordinator of
  – Web technologies, mobile applications, BI & ETL
  – Highlights: online Census questionnaires, Consumer Price Survey on-field data collection architecture
• PhD in Computer Engineering
  – Field of research: large-scale distributed systems
• Trainer on Hadoop-MapReduce

(3)

3  

Abstract

• The hype surrounding Big Data technologies hides the complexity of their adoption in NSIs
  – Strong IT know-how is required for configuration and management
  – They should be accessible through common statistical software
• A reasoned overview of the most popular Big Data technologies, with a focus on their usage in NSIs

(4)

4  

Outline

• Motivations for Big Data technologies
• Overview of Big Data tools
• Adopting Big Data technologies in NSIs

(5)

5  

MOTIVATIONS FOR BIG DATA TECHNOLOGIES

(6)

6  

What does Big mean?

Background

(7)

7  

Background

Big means:

• Size: Tera-, Peta-… and growing
• Processing: a complex statistical method can become intractable even with data sets of “reasonable” size
• Quality: Big Data is often loosely structured and highly noisy

(8)

8  

Big Size: Tools and Techniques

• Real Big Data begins where your usual tools fail…

Distributed file systems
• Clusters of commodity hardware that can scale to indefinite size simply by adding new nodes at runtime
• Overcome physical limitations
• Should be managed by a middleware platform (Hadoop HDFS); see the sketch below
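To make this concrete, here is a minimal sketch (not from the original slides) of how the middleware layer is used from client code: Hadoop's Java FileSystem API copies a local file into HDFS and lists a directory. The NameNode address and the /data path are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address; normally picked up from core-site.xml.
            conf.set("fs.defaultFS", "hdfs://namenode:8020");

            try (FileSystem fs = FileSystem.get(conf)) {
                // Copy a local file into the distributed file system.
                fs.copyFromLocalFile(new Path("/tmp/prices.csv"),
                                     new Path("/data/prices.csv"));

                // List files under /data and their sizes (HDFS hides the block layout).
                for (FileStatus status : fs.listStatus(new Path("/data"))) {
                    System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
                }
            }
        }
    }

The client never sees which nodes hold the blocks: adding machines to the cluster changes capacity, not the code.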

(9)

9  

Big Processing: Tools and Techniques

MapReduce
• Programming paradigm that enables programs to be executed in parallel on a cluster
• Not tied to a programming language: interfaces exist for all common languages and tools

(10)

10  

Big Quality: Tools and Techniques

• Pre-processing for cleaning and organizing data
• Big Data is often unstructured, but the converse is not true

(11)

11  

Technical Challenges

Handling Big Data necessarily requires relying on complex distributed technologies.

If you want to get something out of real Big Data, you have to deal with this complexity.

(12)

12  

Perspectives

Colleen, the Statistical Analyst: “Ok, but I want to use my tools and methods. I don’t want to touch this distributed stuff.”

Moss, the IT Guy: “I can set up the infrastructure and the data and help you with the tools.”

“Deal!”

Moss (to himself): “I don’t want to write programs for every analysis she makes.”

(13)

13  

BIG DATA TOOLS OVERVIEW

(14)

14  

Big Data IT Tools Proliferation

(15)

15  

Our focus: IT Tools for Statistical Analysis of Big Data

• What are the basic tools?
• What is the best tool for the job?
• How do these tools integrate with the common elements of an IT architecture?

(16)

16  

Big Data IT Tools: the Common Denominator

(17)

17  

Distributed Storage and Processing: Hadoop

• Distributed storage platform
• De-facto standard for Big Data processing
• Open source project supported and/or adopted by most major vendors
• Virtually unlimited scalability
  – storage, memory, processing power

(18)

18  

Hadoop Principle

“I’m one big data set”

• Hadoop is basically a middleware platform that manages a cluster of machines
• Its core component is a distributed file system (HDFS)
• Files in HDFS are split into blocks that are scattered over the cluster
• The cluster can grow indefinitely simply by adding new nodes

(19)

19  

The MapReduce Paradigm

• Parallel processing paradigm
• The programmer is unaware of parallelism
• Programs are structured into a two-phase execution (see the sketch below)
  – Map: data elements are classified into categories
  – Reduce: an algorithm is applied to all the elements of the same category

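As an illustration of the two phases, below is a minimal Java sketch against Hadoop's MapReduce API, assuming a CSV input whose first field is the category: the mapper classifies each record into a category, and the reducer applies an algorithm (here a simple count) to all elements of the same category. Class names and the input layout are hypothetical.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: classify each data element (a CSV line) into a category,
    // assumed here to be the first field of the record.
    class CategoryCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String category = line.toString().split(",")[0];
            context.write(new Text(category), ONE);
        }
    }

    // Reduce phase: apply an algorithm (here, a count) to all the elements
    // that were classified under the same category.
    class CategoryCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text category, Iterable<LongWritable> counts, Context context)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable c : counts) {
                total += c.get();
            }
            context.write(category, new LongWritable(total));
        }
    }

The framework distributes the map calls over the cluster and groups the intermediate key:value pairs by category before calling reduce, so the programmer never handles the parallelism explicitly.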

(20)

20  

MapReduce and Hadoop

MapReduce is logically placed on top of HDFS in the Hadoop stack.

(21)

21  

MapReduce and Hadoop

• MapReduce works on (big) files loaded on HDFS
• Each node in the cluster executes the MapReduce program in parallel, applying the map and reduce phases to the blocks it stores
• Output is written to HDFS

Scalability principle: perform the computation where the data is (see the job driver sketch below).
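As a complement to the mapper and reducer sketched earlier, the following hedged example shows a driver class that configures and submits such a job to the cluster; the class names come from the previous sketch and the HDFS input and output paths are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CategoryCountJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "category-count");
            job.setJarByClass(CategoryCountJob.class);
            job.setMapperClass(CategoryCountMapper.class);
            // The sum is associative, so the reducer can also run as a combiner on each node.
            job.setCombinerClass(CategoryCountReducer.class);
            job.setReducerClass(CategoryCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            // Hypothetical HDFS locations for input data and results.
            FileInputFormat.addInputPath(job, new Path("/data/prices.csv"));
            FileOutputFormat.setOutputPath(job, new Path("/output/category-counts"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Hadoop schedules the map tasks on the nodes that hold the corresponding HDFS blocks, which is the “perform the computation where the data is” principle in practice.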

(22)

22  

MapReduce Applications

• Naturally targeted at counts and aggregations
  – 1-line aggregation algorithm
• Collecting & combining
  – It all began there… inverted index computation at Google
• Machine learning, cross-correlation
• Graph analysis
  – “People you may know”
  – Geographical data: in Google Maps, finding the nearest feature to a given address or location
• Pre-processing of unstructured data
• Can also handle binary files
  – The NYT converted 4 TB of scanned articles into 1.5 TB of PDFs

(23)

23  

Data Analysis with Hadoop

Colleen, the Statistical Analyst, and Moss, the IT Guy:

“I finally loaded those elephant-size data sets into Hadoop!”
“Cool! Now how can I analyze them?”
“It’s simple! Write a MapReduce program in Java!”
“No.”
“Ok, I’ll do that for you.”
“No.”

MapReduce programs can be written in various programming languages. Several tools are also available that translate high-level analysis languages into MapReduce programs.

(24)

24  

Tools for Data Analysis with Hadoop

High-level languages for data manipulation (Pig, Hive) and statistical software sit on top of the Hadoop stack (MapReduce over HDFS); a Hive example follows below.
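One common way to reach these high-level tools from ordinary client code is Hive's standard JDBC interface, which translates an SQL-like query into MapReduce work on the cluster. The sketch below is only illustrative and not from the slides: the HiveServer2 address, the credentials and the prices table are assumptions, and the Hive JDBC driver must be on the classpath.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryExample {
        public static void main(String[] args) throws Exception {
            // Explicitly register the Hive JDBC driver (often optional with JDBC 4 auto-loading).
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            // Hypothetical HiveServer2 endpoint and credentials.
            String url = "jdbc:hive2://hiveserver:10000/default";

            try (Connection conn = DriverManager.getConnection(url, "analyst", "");
                 Statement stmt = conn.createStatement();
                 // The aggregation is translated by Hive into MapReduce jobs on Hadoop.
                 ResultSet rs = stmt.executeQuery(
                         "SELECT category, COUNT(*) AS n FROM prices GROUP BY category")) {
                while (rs.next()) {
                    System.out.println(rs.getString("category") + "\t" + rs.getLong("n"));
                }
            }
        }
    }

The same SQL-over-Hadoop interface is what commercial statistical packages typically plug into, as noted on the next slide.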

(25)

25  

Using Hadoop from Statistical Software

• R
  – packages rhdfs, rmr
  – Issue HDFS commands and write MapReduce jobs
• SAS
  – SAS In-Memory Statistics
  – SAS/ACCESS
    • Makes data stored in Hadoop appear as native SAS datasets
    • Uses the Hive interface
• SPSS
  – Transparent integration with Hadoop data

(26)

26  

Apache Pig

• Tool for querying data on Hadoop clusters
• Widely used in the Hadoop world
  – Yahoo! estimates that 50% of the Hadoop workload on their 100,000-CPU clusters is generated by Pig scripts
• Allows writing data manipulation scripts in a high-level language called Pig Latin
  – Interpreted language: scripts are translated into MapReduce jobs
• Mainly targeted at joins and aggregations

(27)

27  

Pig Example

Real example of a Pig script used at Twitter

The Java equivalent…

(28)

28  

Hadoop-MapReduce Limitations

• Not usable in transactional applications
• Not suited to real-time analysis
  – MapReduce jobs run in batch mode
• HDFS is an append-only file system
  – Can insert and delete, but cannot update
• MapReduce jobs run in batch mode
  – You cannot expect low response latency
  – Not suited for interactive, real-time operations and/or random-access reads/writes

(29)

29  

NoSQL databases

• Distributed storage platforms that allow for lower-latency processing
• NoSQL: Not Only SQL
• Non-relational data models that trade transactional consistency for query efficiency and support semi-structured data
  – No joins, no transactions, no indexes

(30)

30  

NoSQL Databases

Popular choices: HBase and Cassandra

• Use a column-oriented model
  – Data organized in families of key:value pairs
  – Variable schema, where each row is slightly different
  – Optimized for sparse data
• Can be accessed from R
• HBase is based on Hadoop and stores its data on HDFS; Cassandra is a fully distributed platform not based on Hadoop (see the sketch below)
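As a concrete illustration of the column-oriented, key:value model, here is a hedged sketch that writes and reads a single cell through the HBase Java client API; the table name, column family and row key are made up for the example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            // Cluster location is normally read from hbase-site.xml on the classpath.
            Configuration conf = HBaseConfiguration.create();

            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("prices"))) {

                // Rows are sparse families of key:value pairs; each row may carry
                // different columns inside the "obs" column family.
                Put put = new Put(Bytes.toBytes("shop42#2014-06"));
                put.addColumn(Bytes.toBytes("obs"), Bytes.toBytes("price"), Bytes.toBytes("1.99"));
                table.put(put);

                // Random-access read by row key: no joins, no transactions, no SQL.
                Result row = table.get(new Get(Bytes.toBytes("shop42#2014-06")));
                byte[] value = row.getValue(Bytes.toBytes("obs"), Bytes.toBytes("price"));
                System.out.println(Bytes.toString(value));
            }
        }
    }

The same data can also be reached from R, as noted above.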

(31)

31  

Big Data Tools in the IT Architecture

• Hadoop is not a DB/DW replacement: it sits alongside traditional data technologies in a modern IT architecture
• The outcome of Big Data processing can be stored in a traditional DB/DW
• Modern (visual) analytics tools can integrate both kinds of data sources

(32)

32  

Augmented IT Architecture

• Multi-structured big data enters Hadoop, which performs the initial processing and cleaning
• NoSQL stores keep multi-structured historical data online and accessible
• Analysis results flow into the traditional DB/DW layer, which feeds BI
• Analysis tools (statistical software, visual analytics) integrate both kinds of data sources

(33)

33  

ADOPTING BIG DATA TECHNOLOGIES

(34)

34  

   

Hadoop Deployment Options

• In-house: maximum control of configuration and costs; high complexity
• Cloud: pay-per-use billing model; cuts hardware and software costs and eliminates the management burden; privacy issues!
• Appliance: easy, but costly

(35)

35  

IT Skills for Big Data Tools

• System manager: sets up and manages the physical infrastructure
• Data Integrator: develops ETL procedures to move data to/from HDFS and NoSQL DBs
• Data Engineer: designs the IT architecture for collecting and processing data; designs and writes MapReduce jobs or Pig scripts
• Data analyst: uses statistical tools and VA
• Data scientist: derives new insights by applying statistical analysis methods to different, heterogeneous, possibly big, data sources; has strong IT foundations and can develop her algorithms using both statistical tools and Hadoop

Skills involved: SQL, R, SAS, SPSS, BI and Visual Analytics, Excel, Linux, MapReduce, ETL, Pig, Java

(36)

36  

Suggestions for ESS

• Training on data science for statisticians and Big Data engineering for IT staff
• Implementation of standard methods and tools in a Hadoop-compliant version
• Eurostat establishing repositories of Big Data and allowing NSIs to access them
• Set-up of a “statistical cloud”: a Hadoop cluster shared by NSIs
• Possible agreements with providers of IT solutions (Google, etc.)

(37)

37  

Conclusions

• Big Data tools make sense when you really have serious size issues to deal with
  – Not much use for a 2-node Hadoop cluster
• No value in jumping on the Big Data bandwagon for its own sake
  – High costs
  – You can still be a “data scientist”...
• However, Big Data engineering provides new opportunities
  – Collect more data
  – “Ask bigger questions”

(38)

38  

Questions  
