ESS event:
Big Data in Official Statistics
Antonino Virgillito, Istat
About me
• Head of Unit “Web and BI Technologies”, IT Directorate of Istat
• Project manager and technical coordinator of
– Web technologies, mobile applications, BI & ETL
– Highlights: online Census questionnaires, Consumer Price Survey on-field data collection architecture
• PhD in Computer Engineering
– Field of research: large-scale distributed systems
• Trainer on Hadoop-MapReduce
Abstract
• The hype surrounding Big Data technologies hides the complexity of their adoption in NSIs
– Strong IT know-how required for configuration and management
– Should be accessed through common statistical software
• Reasoned overview of the most popular Big Data technologies, with focus on their usage in NSIs
Outline
• Motivations for Big Data technologies
• Overview of Big Data tools
• Adopting Big Data technologies in NSIs
MOTIVATIONS FOR BIG DATA
TECHNOLOGIES
What does
Big mean?
Background
Big
Size
Tera-, Peta- … and growing
Background
Processing
A complex statistical method can become intractable even with data sets of “reasonable” size
Quality
Big Data is often loosely
structured and highly noisy
Big
Size
• Real Big Data begins where your usual tools fail…
Distributed file systems
• Clusters of commodity hardware that can scale to indefinite size simply by adding new nodes at runtime
• Overcome physical limitations
• Should be managed by a middleware platform (Hadoop HDFS)
Tools and Techniques
Big
Processing
Tools and Techniques
MapReduce
• Programming paradigm that enables programs to be executed in parallel on a cluster
• Not tied to a programming language,
interfaces exist for all common languages and tools
Big Quality
• Pre-processing for cleaning and organizing data
• Big Data is often unstructured, but the converse is not always true
Tools and Techniques
Technical Challenges
Handling Big Data necessarily requires relying on complex distributed technologies
If you want to get something from real big data, you have to deal with this complexity
Perspectives
Colleen, the Statistical Analyst: “Ok, but I want to use my tools and methods. I don’t want to touch this distributed stuff.”
Moss, the IT Guy: “I can set up the infrastructure and the data and help you with the tools.”
Colleen: “Deal.”
Moss (to himself): “I don’t want to write programs for every analysis she makes.”
BIG DATA TOOLS OVERVIEW
Big Data IT Tools Proliferation
Our focus: IT Tools for
Statistical Analysis of Big Data
• What are the basic tools?
• What is the best tool for the job?
• How do these tools integrate with common elements of an IT architecture?
Big Data IT Tools:
the Common Denominator
Distributed Storage and Processing:
Hadoop
Distributed storage platform
De-facto standard for Big Data processing
Open source project supported and/or adopted by most major vendors
Virtually unlimited scalability
– storage, memory, processing power
Hadoop
Hadoop Principle
I’m one big data set
Hadoop is basically a middleware platform that manages a cluster of machines
The core component is a distributed file system (HDFS)
HDFS
Files in HDFS are split into blocks that are scattered over the cluster
The cluster can grow indefinitely simply by adding new nodes
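As a rough illustration of the splitting idea, here is a toy Python sketch — not real HDFS code; the tiny block size, round-robin placement, and node names are invented for illustration (real HDFS uses large blocks, e.g. 128 MB, and replicates each block on several nodes):

```python
# Toy illustration of HDFS-style block splitting (NOT the real HDFS API).
BLOCK_SIZE = 4  # bytes, absurdly small, for illustration only

def split_into_blocks(data, nodes, block_size=BLOCK_SIZE):
    """Split `data` into fixed-size blocks and scatter them over `nodes`."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    placement = {node: [] for node in nodes}
    for idx, block in enumerate(blocks):
        # Simple round-robin placement: block idx goes to node idx mod N
        placement[nodes[idx % len(nodes)]].append((idx, block))
    return placement

layout = split_into_blocks(b"big data files", ["node1", "node2", "node3"])
# Growing the cluster is just a longer node list: the splitting logic
# is unchanged, which is the essence of scaling by adding nodes.
```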
The MapReduce Paradigm
Parallel processing paradigm
Programmer is unaware of parallelism
Programs are structured into a two-phase execution
Map
Data elements are classified into
categories
Reduce
An algorithm is applied to all the elements of the same category
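The two phases can be sketched in plain Python — a toy single-machine simulation, not Hadoop code: the map phase classifies each element under a category key, and the reduce phase aggregates all elements of the same category. Here, the classic word count:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    # Map: emit a (category, value) pair for each data element
    for line in records:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Group pairs by category, then apply an algorithm (here: sum)
    # to all the elements of the same category
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (key, sum(v for _, v in group))

counts = dict(reduce_phase(map_phase(["big data big", "data tools"])))
# counts == {"big": 2, "data": 2, "tools": 1}
```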
MapReduce and Hadoop
MapReduce is logically placed on top of HDFS
MapReduce and Hadoop
MR works on (big) files loaded on HDFS
Each node in the cluster executes the MR program in parallel, applying the map and reduce phases to the blocks it stores
Output is written on HDFS
Scalability principle:
Perform the computation where the data is
MapReduce Applications
• Naturally targeted at counts and aggregations
– 1-line aggregation algorithm
• Collecting & combining
– It all began there… inverted index computation at Google
• Machine learning, cross-correlation
• Graph analysis
– “People you may know”
– Geographical data: in Google Maps, finding the nearest feature to a given address or location
• Pre-processing of unstructured data
• Can also handle binary files
– The NYT converted 4 TB of scanned articles into 1.5 TB of PDFs
Data Analysis with Hadoop
Colleen, the Statistical Analyst: “I finally loaded those elephant-size data sets into Hadoop! Cool! Now how can I analyze them?”
Moss, the IT Guy: “It’s simple! Write a MapReduce program in Java!”
Colleen: “No.”
Moss: “Ok, I’ll do that for you.”
Colleen: “No.”
MapReduce programs can be written in various programming languages
Several tools are also available that translate high-level analysis languages into MapReduce programs
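One language-agnostic route is Hadoop Streaming, which runs any executable that reads key–value lines from stdin and writes them to stdout. The sketch below shows a streaming-style word-count mapper and reducer in Python, written as pure functions over lists so they can be tried outside Hadoop; the sort that Hadoop performs between the two phases is simulated with `sorted()`:

```python
def mapper(lines):
    """Streaming-style map: emit one 'word<TAB>1' record per word."""
    out = []
    for line in lines:
        for word in line.split():
            out.append(f"{word}\t1")
    return out

def reducer(sorted_lines):
    """Streaming-style reduce: sum counts of consecutive equal keys.
    Hadoop guarantees the reducer input arrives sorted by key."""
    out, current, total = [], None, 0
    for line in sorted_lines:
        word, count = line.rsplit("\t", 1)
        if word != current and current is not None:
            out.append(f"{current}\t{total}")
            total = 0
        current = word
        total += int(count)
    if current is not None:
        out.append(f"{current}\t{total}")
    return out

result = reducer(sorted(mapper(["big data big", "data tools"])))
# result == ["big\t2", "data\t2", "tools\t1"]
```

In a real streaming job the same logic would read stdin and print to stdout, and Hadoop would handle the splitting, sorting, and parallel execution.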
Tools for Data Analysis with Hadoop
[Diagram: Pig and Hive, high-level languages for data manipulation, sit on top of Hadoop (HDFS + MapReduce); statistical software accesses Hadoop through them]
Using Hadoop from Statistical Software
• R
– packages rhdfs, rmr
– Issue HDFS commands and write MapReduce jobs
• SAS
– SAS In-Memory Statistics
– SAS/ACCESS
• Makes data stored in Hadoop appear as native SAS datasets
• Uses Hive interface
• SPSS
– Transparent integration with Hadoop data
Apache Pig
• Tool for querying data on Hadoop clusters
• Widely used in the Hadoop world
– Yahoo! estimates that 50% of the Hadoop workload on their 100,000-CPU clusters is generated by Pig scripts
• Allows writing data manipulation scripts in a high-level language called Pig Latin
– Interpreted language: scripts are translated into MapReduce jobs
• Mainly targeted at joins and aggregations
Pig Example
[Figure: a real Pig script used at Twitter, followed by its much longer Java equivalent]
Hadoop-MapReduce Limitations
• Not usable in transactional applications
• HDFS is an append-only file system
– Can insert and delete, but cannot update
• MapReduce jobs run in batch mode
– Not suited to real-time analysis: you cannot expect low response latency
– Not suited for interactive operations and/or random-access reads/writes
NoSQL databases
• Distributed storage platforms that allow for lower-latency processing
• NoSQL: Not Only SQL
• Non-relational data models that trade transactional consistency for query efficiency and support semi-structured data
– No joins, no transactions, no indexes
NoSQL Databases
Popular choices: HBase and Cassandra
Use a column-oriented model
– Data organized in families of key:value pairs
– Variable schema where each row is slightly different
– Optimized for sparse data
Can be accessed from R
HBase is based on Hadoop (it sits on HDFS and MapReduce); Cassandra is a fully distributed platform not based on Hadoop
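To make the column-oriented model concrete, here is a toy Python sketch — not a real HBase or Cassandra API; the row keys, family names, and columns are invented — of rows holding families of key:value pairs with a variable, sparse-friendly schema:

```python
# Toy column-family store: row key -> column family -> {column: value}.
# Rows need not share the same columns (variable schema); absent cells
# simply take no space, which is what "optimized for sparse data" means.
table = {
    "row1": {"info": {"name": "Alice", "city": "Rome"},
             "metrics": {"visits": "10"}},
    "row2": {"info": {"name": "Bob"}},   # no city, no metrics family
}

def get(table, row, family, column, default=None):
    """Look up a single cell; missing rows/families/columns are not errors."""
    return table.get(row, {}).get(family, {}).get(column, default)

# get(table, "row1", "info", "city")  -> "Rome"
# get(table, "row2", "info", "city")  -> None (sparse: the cell is absent)
```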
Big Data Tools in the IT Architecture
• Hadoop is not a DB/DW replacement, but sits beside traditional data technologies in a modern IT architecture
• The outcome of Big Data processing can be stored in a traditional DB/DW
• Modern (visual) analytics tools can integrate both kinds of data sources
Augmented IT Architecture
[Diagram: multi-structured big data enters Hadoop for initial processing and cleaning; a NoSQL store keeps multi-structured historical data online and accessible; analysis results flow into the DB/DW; analysis tools (statistical software, visual analytics, BI) integrate both kinds of sources]
ADOPTING BIG DATA
TECHNOLOGIES
Hadoop Deployment Options
In-house: maximum control of configuration and costs; high complexity
Cloud: pay-per-use billing model; cuts hardware and software costs and eliminates the management burden; privacy issues!
Appliance: easy, but costly
IT Skills for Big Data Tools
System manager: sets up and manages the physical infrastructure
Data engineer: designs the IT architecture for collecting and processing; designs and develops MR jobs or Pig scripts
Data integrator: develops ETL procedures to move data to/from HDFS and NoSQL DBs
Data scientist: derives new insights by applying statistical analysis methods on different, heterogeneous, possibly big, data sources; has strong IT foundations and can develop her algorithms using both statistical tools and Hadoop
Data analyst: uses statistical tools and VA
Typical tools across these roles: Linux, SQL, MapReduce, Pig, Java, ETL, R, SAS, SPSS, BI and Visual Analytics, Excel
Suggestions for ESS
Training on data science for statisticians and Big Data engineering for IT staff
Implementation of standard methods and tools in a Hadoop-compliant version
Eurostat establishing repositories of Big Data and allowing NSIs to access them
Set-up of a “statistical cloud”: a Hadoop cluster shared by NSIs
Possible agreements with providers of IT solutions (Google, etc.)