Programme of the ESTP training course on
BIG DATA
EFFECTIVE PROCESSING AND ANALYSIS OF VERY LARGE AND UNSTRUCTURED DATA FOR OFFICIAL STATISTICS
Rome, 5 – 9 May 2014
Istat – Piazza Indipendenza 4, Room Vanoni
A laboratory approach to managing very large datasets, which are emerging as primary sources feeding
the most up-to-date statistical processes. Students will be introduced to the appropriate use of technology
for managing the ETL processes that result from collecting and loading data from large structured and
unstructured data sources. The course also provides a collection of methods and techniques to integrate the
sources, to compare the archives against reference metadata sets, and to discover and, where possible, resolve
source anomalies. Attendees will be introduced to the theoretical fundamentals that underlie each
methodology presented and will finally be guided through a real implementation using innovative techniques
and algorithms.
Day 1, 5 May 2014
Old and new data manipulation paradigms
Time Duration Subjects Lecturers
9.00 - 9.15 15’ Opening Guido Rotondi
9.15 – 9.45 30’ Too big to ignore: a matter of balance. Evolution in data
management; scenario. Guido Rotondi
9.45 – 10.15 30’ The need for alternative computing paradigms. Antonino Virgillito
10.15 – 11.00 45’ Classification of data sources. Francesco Bosio
11.00 - 11.15 15’ Coffee break
11.15 – 11.45 30’ The Internet of Things. Monica Scannapieco
11.45 – 12.30 45’ Case study: synthesising a Big Data driven framework. Diego Zardetto
12.30 – 13.00 30’ Sharing experiences, expectations and critical aspects. Giulio Barcaroli
13.00 – 13.30 30’ International activities on Big Data in Official Statistics. Carlo Vaccari
13.30 – 14.30 60’ Lunch time
15.00 – 15.30 30’ XML-enabled databases. Non-relational databases. Guido Rotondi
15.30 - 15.45 15’ Coffee break
15.45 – 16.15 30’ Handling XML sources. Non structured XML Tables. Francesco Bosio
16.15 – 16.45 30’ Dealing with XSD schemas. Structured XML Tables. Francesco Bosio
16.45 – 17.15 30’ Merging XML data in the business process: the Resource
Description Framework. Monica Scannapieco
Day 2, 6 May 2014
A roadmap toward Big Data
Time Duration Subjects Lecturers
9.00 - 9.15 15’ Opening Guido Rotondi
9.15 – 10.00 45’ The Map Reduce programming model. Antonino Virgillito
10.00 – 11.00 60’ The World of Hadoop. Antonino Virgillito
11.00 - 11.15 15’ Coffee break
11.15 – 12.15 60’ NoSQL databases. Monica Scannapieco
12.15 – 12.45 30’ Robust concurrent computing architectures and the Byzantine agreement problem. Single Point Of Control. Single Point Of Failure. Guido Rotondi
12.45 – 13.30 45’ Using Big Data technologies (part one): massive computing. Antonino Virgillito
13.30 – 14.30 60’ Lunch time
14.30 – 15.30 60’ Using Big Data technologies (part two): dealing with
unstructured data examples and applications. Monica Scannapieco
15.30 - 15.45 15’ Coffee break
15.45 – 16.30 45’ Implementing the Map Reduce programming model on a
parallel enabled database: aggregating functions. Guido Rotondi
16.30 – 17.15 45’ Profiling the Map Reduce model on a real enterprise infrastructure. Implementing and evaluating simple Map Reduce algorithms. Francesco Bosio
Day 3, 7 May 2014
Big Data in Official Statistics
Time Duration Subjects Lecturers
9.00 - 9.15 15’ Opening Guido Rotondi
9.15 – 10.00 45’ Introduction to Big Data in Official Statistics.
The concept of Big Data; overview of Big Data sources. Antonino Virgillito
10.00 – 11.00 60’ Methodological issues in using Big Data for Official
Statistics. Giulio Barcaroli
11.00 - 11.15 15’ Coffee break
11.15 – 12.15 60’ IT Issues in using Big Data for Official Statistics. Monica Scannapieco
12.15 – 13.30 75’ Using mobile phones for analysing mobility of city users. Antonino Virgillito
13.30 – 14.30 60’ Lunch time
14.30 – 15.30 60’ Improving Labour Force Survey estimates by the effective
usage of Google Trends. Monica Scannapieco
15.30 - 15.45 15’ Coffee break
15.45 – 16.45 60’ Internet as a data source: web scraping and text mining for
estimating ICT usage by enterprises and public institutions. Monica Scannapieco
16.45 – 17.15 30’ Privacy, Security and Safety: Recipes for securing data,
recipes for disclosure control, trusted computing. Guido Rotondi
Day 4, 8 May 2014
Improving data availability and processing efficiency
Time Duration Subjects Lecturers
9.00 - 9.15 15’ Opening Guido Rotondi
9.15 – 10.00 45’ Data location and partitioning. Indexing. Problem splitting.
Actor systems. Storage virtualisation. Guido Rotondi
10.00 – 11.00 60’ Examples of improving data location and partitioning.
Effective usage of indexes. Francesco Bosio
11.00 - 11.15 15’ Coffee break
11.15 – 12.15 60’ Improving database (serial) operations. Code profiling. Bulk operations. Pipelined functions. Sustained data streaming. Partition swapping. Guido Rotondi
12.15 – 13.00 45’ External tables in performing fast bulk operations. Application of a pipelined function to an ETL process. Managing changes of a big micro data set. Francesco Bosio
13.00 – 13.30 30’ Quasi real time analytics. Diego Zardetto
13.30 – 14.30 60’ Lunch time
14.30 – 15.30 60’ Fundamentals of parallel computing. Definitions, metrics, workload, critical aspects. Distributed vs Symmetric Multi Processing. Guido Rotondi
15.30 - 15.45 15’ Coffee break
15.45 – 16.30 45’ Parallel database operations. Scheduled concurrent tasks. Parallel enabled pipelined functions. Parallel queries. Embedded relational objects, aggregating functions. Guido Rotondi
16.30 – 17.15 45’ Self-made parallelism vs controlled tasks. Benefits of parallel data streaming. Multipath data querying. Embedded relational objects. Design of central aggregating functions. Francesco Bosio
Day 5, 9 May 2014
The analysis of massive datasets
Time Duration Subjects Lecturers
9.00 - 9.15 15’ Opening Guido Rotondi
9.15 – 10.15 60’ Geometric interpretation of data structures and the
introduction of regular languages and expressions. Guido Rotondi
10.15 – 11.00 45’ Getting involved with regular expressions. Francesco Bosio
11.00 - 11.15 15’ Coffee break
11.15 – 12.00 45’ Mapping techniques for studying anomalies in structured
data: Probabilistic ranking of event patterns. Guido Rotondi
12.00 – 12.45 45’ Stochastic characterisation of unstructured data sets. Guido Rotondi
12.45 – 13.30 45’ Characteristics of a Big Data Analysis Framework: a
distributed approach. Guido Rotondi
13.30 – 14.30 60’ Lunch time
14.30 – 15.30 60’ Inference techniques used for Official Statistics (Part 1). Diego Zardetto
15.30 - 15.45 15’ Coffee break
15.45 – 16.45 60’ Inference techniques used for Official Statistics (Part 2). Diego Zardetto
16.45 – 17.00 15’ Where can we go from here? Golden rules. Guido Rotondi
17.00 – 17.30 30’ Final remarks. Guido Rotondi, Giulio Barcaroli, Francesco Bosio, Monica Scannapieco, Antonino Virgillito