Programme of the ESTP training course on

BIG DATA

EFFECTIVE PROCESSING AND ANALYSIS OF VERY LARGE AND UNSTRUCTURED DATA FOR OFFICIAL STATISTICS

Rome, 5 – 9 May 2014

Istat – Piazza Indipendenza 4, Room Vanoni

The course takes a laboratory approach to managing very large datasets, which are emerging as primary sources feeding the most up-to-date statistical processes. Students will be introduced to the appropriate use of technology for managing the ETL processes that result from collecting and loading data from large structured and unstructured data sources. The course also provides a collection of methods and techniques to integrate the sources, to compare the archives against reference metadata sets, and to discover and eventually resolve source anomalies. Attendees will be introduced to the theoretical fundamentals underlying each presented methodology and will finally be brought to a real implementation using innovative techniques and algorithms.

Day 1, 5 May 2014

Old and new data manipulation paradigms

Time Duration Subjects Lecturers

9.00 - 9.15 15’ Opening Guido Rotondi

9.15 – 9.45 30’ Too big to ignore: a matter of balance. Evolution in data management; scenario. Guido Rotondi

9.45 – 10.15 30’ The need for alternative computing paradigms. Antonino Virgillito

10.15 – 11.00 45’ Classification of data sources. Francesco Bosio

11.00 - 11.15 15’ Coffee break

11.15 – 11.45 30’ The Internet of Things. Monica Scannapieco

11.45 – 12.30 45’ Case study: synthesising a Big Data driven framework. Diego Zardetto

12.30 – 13.00 30’ Sharing experiences, expectations and critical aspects. Giulio Barcaroli

13.00 – 13.30 30’ International activities on Big Data in Official Statistics. Carlo Vaccari

13.30 – 14.30 60’ Lunch time

15.00 – 15.30 30’ XML enabled databases. Non relational databases. Guido Rotondi

15.30 - 15.45 15’ Coffee break

15.45 – 16.15 30’ Handling XML sources. Non structured XML Tables. Francesco Bosio

16.15 – 16.45 30’ Dealing with XSD schemas. Structured XML Tables. Francesco Bosio

16.45 – 17.15 30’ Merging XML data in the business process: the Resource Description Framework. Monica Scannapieco

Day 2, 6 May 2014

A roadmap toward Big Data

9.15 – 10.00 45’ The Map Reduce programming model. Antonino Virgillito

10.00 – 11.00 60’ The World of Hadoop. Antonino Virgillito

11.00 - 11.15 15’ Coffee break

11.15 – 12.15 60’ NoSQL databases. Monica Scannapieco

12.15 – 12.45 30’ Robust concurrent computing architectures and the Byzantine agreement problem. Single Point Of Control. Single Point Of Failure. Guido Rotondi

12.45 – 13.30 45’ Using Big Data technologies (part one): massive computing. Antonino Virgillito

13.30 – 14.30 60’ Lunch time

14.30 – 15.30 60’ Using Big Data technologies (part two): dealing with unstructured data, examples and applications. Monica Scannapieco

15.30 - 15.45 15’ Coffee break

15.45 – 16.30 45’ Implementing the Map Reduce programming model on a parallel enabled database: aggregating functions. Guido Rotondi

16.30 – 17.15 45’ Profiling the Map Reduce model on a real enterprise infrastructure. Implementing and evaluating simple Map Reduce algorithms. Francesco Bosio

Day 3, 7 May 2014

Big Data in Official Statistics

9.15 – 10.00 45’ Introduction to Big Data in Official Statistics. The concept of Big Data; overview of Big Data sources. Antonino Virgillito

10.00 – 11.00 60’ Methodological issues in using Big Data for Official Statistics. Giulio Barcaroli

11.00 - 11.15 15’ Coffee break

11.15 – 12.15 60’ IT Issues in using Big Data for Official Statistics. Monica Scannapieco

12.15 – 13.30 75’ Using mobile phones for analyzing mobility of city users. Antonino Virgillito

13.30 – 14.30 60’ Lunch time

14.30 – 15.30 60’ Improving Labor Force Survey estimates by the effective usage of Google Trends. Monica Scannapieco

15.30 - 15.45 15’ Coffee break

15.45 – 16.45 60’ Internet as a data source: web scraping and text mining for estimating ICT usage by enterprises and public institutions. Monica Scannapieco

16.45 – 17.15 30’ Privacy, Security and Safety: recipes for securing data, recipes for disclosure control, trusted computing. Guido Rotondi

Day 4, 8 May 2014

Improving data availability and processing efficiency

9.15 – 10.00 45’ Data location and partitioning. Indexing. Problem splitting. Actor systems. Storage virtualisation. Guido Rotondi

10.00 – 11.00 60’ Examples of improving data location and partitioning. Effective usage of indexes. Francesco Bosio

11.00 - 11.15 15’ Coffee break

11.15 – 12.15 60’ Improving database (serial) operations. Code profiling. Bulk operations. Pipelined functions. Sustained data streaming. Partition swapping. Guido Rotondi

12.15 – 13.00 45’ External tables in performing fast bulk operations. Application of a pipelined function to an ETL process. Managing changes of a big micro data set. Francesco Bosio

13.00 – 13.30 30’ Quasi real time analytics. Diego Zardetto

13.30 – 14.30 60’ Lunch time

14.30 – 15.30 60’ Fundamentals of parallel computing. Definitions, metrics, workload, critical aspects. Distributed vs Symmetric Multi Processing. Guido Rotondi

15.30 - 15.45 15’ Coffee break

15.45 – 16.30 45’ Parallel database operations. Scheduled concurrent tasks. Parallel enabled pipelined functions. Parallel queries. Embedded relational objects, aggregating functions. Guido Rotondi

16.30 – 17.15 45’ Self-made parallelism vs controlled tasks. Benefits of parallel data streaming. Multipath data querying. Embedded relational objects. Design of central aggregating functions. Francesco Bosio

Day 5, 9 May 2014

The analysis of massive datasets

9.15 – 10.15 60’ Geometric interpretation of data structures and the introduction of regular languages and expressions. Guido Rotondi

10.15 – 11.00 45’ Getting involved with regular expressions. Francesco Bosio

11.00 - 11.15 15’ Coffee break

11.15 – 12.00 45’ Mapping techniques for studying anomalies in structured data: probabilistic ranking of event patterns. Guido Rotondi

12.00 – 12.45 45’ Stochastic characterisation of unstructured data sets. Guido Rotondi

12.45 – 13.30 45’ Characteristics of a Big Data Analysis Framework: a distributed approach. Guido Rotondi

13.30 – 14.30 60’ Lunch time

14.30 – 15.30 60’ Inference techniques used for Official Statistics (Part 1). Diego Zardetto

15.30 - 15.45 15’ Coffee break

15.45 – 16.45 60’ Inference techniques used for Official Statistics (Part 2). Diego Zardetto

16.45 – 17.00 15’ Where can we go from here? Golden rules. Guido Rotondi

17.00 – 17.30 30’ Final remarks. Guido Rotondi, Giulio Barcaroli, Francesco Bosio, Monica Scannapieco, Antonino Virgillito