ESS event:
Big Data in Official Statistics
Antonino Virgillito, Istat
About me
• Head of Unit “Web and BI Technologies”, IT Directorate of Istat
• Project manager and technical coordinator of
– Web technologies, mobile applications, BI & ETL
– Highlights: online Census questionnaires, Consumer Price Survey on-field data collection architecture
• PhD in Computer Engineering
– Field of research: large-scale distributed systems
• Trainer on Hadoop-MapReduce
Abstract
• The hype surrounding Big Data technologies hides the complexity of their adoption in NSIs
– Strong IT know-how required for configuration and management
– Should be accessed through common statistical software
• Reasoned overview of the most popular Big Data technologies, with focus on their usage in NSIs
Outline
• Motivations for Big Data technologies
• Overview of Big Data tools
• Adopting Big Data technologies in NSIs
MOTIVATIONS FOR BIG DATA
TECHNOLOGIES
What does
Big mean?
Background
Big
Size
Tera-, Peta- … and growing
Background
Processing
A complex statistical method can become intractable even with data sets of “reasonable” size
Quality
Big Data is often loosely
structured and highly noisy
Big
Size
• Real Big Data begins where your usual tools fail…
Distributed file systems
• Clusters of commodity hardware that can scale to indefinite size simply by adding new nodes at runtime
• Overcome physical limitations
• Should be managed by a middleware platform (Hadoop HDFS)
Tools and Techniques
Big
Processing
Tools and Techniques
MapReduce
• Programming paradigm that enables programs to be executed in parallel on a cluster
• Not tied to a programming language,
interfaces exist for all common languages and tools
Big Quality
• Pre-processing for cleaning and organizing data
• Big Data is often unstructured, but the converse is not always true
Tools and Techniques
Technical Challenges
Handling Big Data necessarily requires relying on complex distributed technologies
If you want to get something from real big data, you have to deal with this complexity
Perspectives
Colleen, the Statistical Analyst: “Ok, but I want to use my tools and methods. I don’t want to touch this distributed stuff.”
Moss, the IT Guy: “I can set up the infrastructure and the data and help you with the tools.”
Colleen: “Deal.”
Moss (to himself): “I don’t want to write programs for every analysis she makes.”
BIG DATA TOOLS OVERVIEW
Big Data IT Tools Proliferation
Our focus: IT Tools for
Statistical Analysis of Big Data
• What are the basic tools?
• What is the best tool for the job?
• How do these tools integrate with common elements of an IT architecture?
Big Data IT Tools:
the Common Denominator
Distributed Storage and Processing:
Hadoop
Distributed storage platform
De-facto standard for Big Data processing
Open source project supported and/or adopted by most major vendors
Virtually unlimited scalability
– storage, memory, processing power
Hadoop
Hadoop Principle
I’m one big data set
Hadoop is basically a middleware platform that manages a cluster of machines
The core component is a distributed file system (HDFS)
HDFS
Files in HDFS are split into blocks that are scattered over the cluster
The cluster can grow indefinitely simply by adding new nodes
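As a rough illustration of the splitting idea, here is a toy Python sketch — not real HDFS code; the tiny block size, round-robin placement, and node names are invented for illustration (real HDFS uses large blocks, e.g. 128 MB, and replicates each block on several nodes):

```python
# Toy illustration of HDFS-style block splitting (NOT the real HDFS API).
BLOCK_SIZE = 4  # bytes, absurdly small, for illustration only

def split_into_blocks(data, nodes, block_size=BLOCK_SIZE):
    """Split `data` into fixed-size blocks and scatter them over `nodes`."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    placement = {node: [] for node in nodes}
    for idx, block in enumerate(blocks):
        # Simple round-robin placement: block idx goes to node idx mod N
        placement[nodes[idx % len(nodes)]].append((idx, block))
    return placement

layout = split_into_blocks(b"big data files", ["node1", "node2", "node3"])
# Growing the cluster is just a longer node list: the splitting logic
# is unchanged, which is the essence of scaling by adding nodes.
```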
The MapReduce Paradigm
Parallel processing paradigm
Programmer is unaware of parallelism
Programs are structured into a two-phase execution
Map
Data elements are classified into
categories
Reduce
An algorithm is applied to all the elements of the same category
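The two phases can be sketched in plain Python — a toy single-machine simulation, not Hadoop code: the map phase classifies each element under a category key, and the reduce phase aggregates all elements of the same category. Here, the classic word count:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    # Map: emit a (category, value) pair for each data element
    for line in records:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Group pairs by category, then apply an algorithm (here: sum)
    # to all the elements of the same category
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (key, sum(v for _, v in group))

counts = dict(reduce_phase(map_phase(["big data big", "data tools"])))
# counts == {"big": 2, "data": 2, "tools": 1}
```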
MapReduce and Hadoop
MapReduce is logically placed on top of HDFS
MapReduce and Hadoop
MR works on (big) files loaded on HDFS
Each node in the cluster executes the MR program in parallel, applying the map and reduce phases to the blocks it stores
Output is written on HDFS
Scalability principle:
Perform the computation where the data is
MapReduce Applications
• Naturally targeted at counts and aggregations
– 1-line aggregation algorithm
• Collecting & combining
– It all began there… inverted index computation at Google
• Machine learning, cross-correlation
• Graph analysis
– “People you may know”
– Geographical data: in Google Maps, finding the nearest feature to a given address or location
• Pre-processing of unstructured data
• Can also handle binary files
– The NYT converted 4 TB of scanned articles into 1.5 TB of PDFs
Data Analysis with Hadoop
Colleen, the Statistical Analyst: “I finally loaded those elephant-size data sets into Hadoop! Cool! Now how can I analyze them?”
Moss, the IT Guy: “It’s simple! Write a MapReduce program in Java!”
Colleen: “No.”
Moss: “Ok, I’ll do that for you.”
Colleen: “No.”
MapReduce programs can be written in various programming languages
Several tools are also available that translate high-level analysis languages into MapReduce programs
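One language-agnostic route is Hadoop Streaming, which runs any executable that reads key–value lines from stdin and writes them to stdout. The sketch below shows a streaming-style word-count mapper and reducer in Python, written as pure functions over lists so they can be tried outside Hadoop; the sort that Hadoop performs between the two phases is simulated with `sorted()`:

```python
def mapper(lines):
    """Streaming-style map: emit one 'word<TAB>1' record per word."""
    out = []
    for line in lines:
        for word in line.split():
            out.append(f"{word}\t1")
    return out

def reducer(sorted_lines):
    """Streaming-style reduce: sum counts of consecutive equal keys.
    Hadoop guarantees the reducer input arrives sorted by key."""
    out, current, total = [], None, 0
    for line in sorted_lines:
        word, count = line.rsplit("\t", 1)
        if word != current and current is not None:
            out.append(f"{current}\t{total}")
            total = 0
        current = word
        total += int(count)
    if current is not None:
        out.append(f"{current}\t{total}")
    return out

result = reducer(sorted(mapper(["big data big", "data tools"])))
# result == ["big\t2", "data\t2", "tools\t1"]
```

In a real streaming job the same logic would read stdin and print to stdout, and Hadoop would handle the splitting, sorting, and parallel execution.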
Tools for Data Analysis with Hadoop
[Diagram: Pig and Hive, high-level languages for data manipulation, sit on top of Hadoop (HDFS + MapReduce); statistical software accesses Hadoop through them]
Using Hadoop from Statistical Software
• R
– packages rhdfs, rmr
– Issue HDFS commands and write MapReduce jobs
• SAS
– SAS In-Memory Statistics
– SAS/ACCESS
• Makes data stored in Hadoop appear as native SAS datasets
• Uses Hive interface
• SPSS
– Transparent integration with Hadoop data
Apache Pig
• Tool for querying data on Hadoop clusters
• Widely used in the Hadoop world
– Yahoo! estimates that 50% of the Hadoop workload on their 100,000-CPU clusters is generated by Pig scripts
• Allows writing data manipulation scripts in a high-level language called Pig Latin
– Interpreted language: scripts are translated into MapReduce jobs
• Mainly targeted at joins and aggregations
Pig Example
[Figure: a real Pig script used at Twitter, followed by its much longer Java equivalent]
Hadoop-MapReduce Limitations
• Not usable in transactional applications
• HDFS is an append-only file system
– Can insert and delete, but cannot update
• MapReduce jobs run in batch mode
– Not suited to real-time analysis: you cannot expect low response latency
– Not suited for interactive operations and/or random-access reads/writes
NoSQL databases
• Distributed storage platforms that allow for lower-latency processing
• NoSQL: Not Only SQL
• Non-relational data models that trade transactional consistency for query efficiency and support semi-structured data
– No joins, no transactions, no indexes
NoSQL Databases
Popular choices: HBase and Cassandra
Use a column-oriented model
– Data organized in families of key:value pairs
– Variable schema where each row is slightly different
– Optimized for sparse data
Can be accessed from R
HBase is based on Hadoop (it sits on HDFS and MapReduce); Cassandra is a fully distributed platform not based on Hadoop
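To make the column-oriented model concrete, here is a toy Python sketch — not a real HBase or Cassandra API; the row keys, family names, and columns are invented — of rows holding families of key:value pairs with a variable, sparse-friendly schema:

```python
# Toy column-family store: row key -> column family -> {column: value}.
# Rows need not share the same columns (variable schema); absent cells
# simply take no space, which is what "optimized for sparse data" means.
table = {
    "row1": {"info": {"name": "Alice", "city": "Rome"},
             "metrics": {"visits": "10"}},
    "row2": {"info": {"name": "Bob"}},   # no city, no metrics family
}

def get(table, row, family, column, default=None):
    """Look up a single cell; missing rows/families/columns are not errors."""
    return table.get(row, {}).get(family, {}).get(column, default)

# get(table, "row1", "info", "city")  -> "Rome"
# get(table, "row2", "info", "city")  -> None (sparse: the cell is absent)
```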
Big Data Tools in the IT Architecture
• Hadoop is not a DB/DW replacement, but sits beside traditional data technologies in a modern IT architecture
• The outcome of Big Data processing can be stored in a traditional DB/DW
• Modern (visual) analytics tools can integrate both kinds of data sources
Augmented IT Architecture
[Diagram: multi-structured big data enters Hadoop for initial processing and cleaning; a NoSQL store keeps multi-structured historical data online and accessible; analysis results flow into the DB/DW; analysis tools (statistical software, visual analytics, BI) integrate both kinds of sources]
ADOPTING BIG DATA
TECHNOLOGIES
Hadoop Deployment Options
In-house: maximum control of configuration and costs; high complexity
Cloud: pay-per-use billing model; cuts hardware and software costs and eliminates the management burden; privacy issues!
Appliance: easy, but costly
IT Skills for Big Data Tools
System manager: sets up and manages the physical infrastructure
Data engineer: designs the IT architecture for collecting and processing; designs and develops MR jobs or Pig scripts
Data integrator: develops ETL procedures to move data to/from HDFS and NoSQL DBs
Data scientist: derives new insights by applying statistical analysis methods on different, heterogeneous, possibly big, data sources; has strong IT foundations and can develop her algorithms using both statistical tools and Hadoop
Data analyst: uses statistical tools and VA
Typical tools across these roles: Linux, SQL, MapReduce, Pig, Java, ETL, R, SAS, SPSS, BI and Visual Analytics, Excel
Suggestions for ESS
Training on data science for statisticians and Big Data engineering for IT staff
Implementation of standard methods and tools in a Hadoop-compliant version
Eurostat establishing repositories of Big Data and allowing NSIs to access them
Set-up of a “statistical cloud”: a Hadoop cluster shared by NSIs
Possible agreements with providers of IT solutions (Google, etc.)