Summary - PiCo: A Domain-Specific Language for Data Analytics Pipelines

can be used for the data serialization step, e.g., Boost.Serialize [93] or Google Protobuf [94].

FastFlow has also been provided with a minimal message passing layer implemented on top of InfiniBand RDMA features [112]. The proposed RDMA-based communication channel implementation achieves performance comparable to that of highly optimised MPI/InfiniBand implementations. The results obtained demonstrate that the RDMA-based library can improve application performance by more than 30%, with a consistent reduction of CPU time utilization with respect to the original TCP/IP implementation.

2.5 Summary

In this chapter we provided a review of the most common parallel computing platforms and of the programming models for multicores and clusters of multicores. From a hardware perspective, we presented multicore processors, manycore processors such as accelerators, and distributed systems. From a high-level programming model perspective, we listed the four main classes of parallelism (task, data, stream and dataflow parallelism), followed by some insights about the Dataflow model, a model of computation that describes a program as a set of concurrent processes and is used extensively in this thesis. We also provided a review of low-level approaches to parallel programming, such as the POSIX programming model, MPI, TBB and OpenMP. In particular, we reviewed programming models for the described platforms that exploit high-level skeleton-based approaches, focusing on the FastFlow library, which we use in this work to implement the PiCo runtime.


Chapter 3

Overview of Big Data Analytics Tools

Big Data is a term used to identify data sets that are so large and/or complex (e.g., unstructured) that traditional data processing applications are inadequate to process them. In this chapter we describe what Big Data is, starting from its “formal” definition. We then continue with a survey of the state of the art of Big Data Analytics tools.

3.1 A Definition for Big Data

Big Data has been defined through the “3Vs” model, an informal definition proposed by Beyer and Laney [32,90] that has been widely accepted by the community:

“Big data is high-Volume, high-Velocity and/or high-Variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.”

More in detail:

• Volume: The amount of data that is generated and stored. The size of the data determines whether it can be considered big data or not

• Velocity: The speed at which the data is generated and processed

• Variety: The type and nature of the data that, when collected, are typically unstructured

The “3Vs” model can be extended by adding two more V-features: 1) Variability, since data are not only unstructured but can also be of different types (e.g., text, images) or even inconsistent, and 2) Veracity, in the sense that the quality and accuracy of the data may vary.

In 2013, IBM tried to quantify the amount of data being produced: “Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone.” [75]

There is no clear-cut definition of how “big” data must be to be considered Big Data. Data is said to be Big when it becomes difficult to store, analyze and search using traditional database systems, because it is large, unstructured and hard to organize. Consider, for instance, a large organization like Facebook: with more than 950 million users, as of 2012 it pulled in 500 Terabytes per day into a 100 Petabyte warehouse and ran 70 000 queries per day on this data.¹ Here are some of the statistics provided by the company:

• 2.5 billion content items shared per day

• 2.7 billion Likes per day

• 300 million photos uploaded per day

• 100+ Petabytes of disk space in one of their largest Hadoop (HDFS) clusters

• 105 Terabytes of data scanned every 30 minutes

• 70 000 queries executed on these databases per day

• 500+ Terabytes of new data stored into the databases every day
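To put these figures in perspective, the following minimal sketch converts them into average sustained rates. It is only a back-of-the-envelope calculation: it assumes decimal units (1 Terabyte = 10^12 bytes), which the source does not specify, and it treats the "+" figures as exact lower bounds. The helper name `rate_gb_per_s` is introduced here for illustration only.

```python
# Back-of-the-envelope rates derived from the statistics listed above.
# Assumption: decimal units (1 TB = 10**12 bytes); the source does not say.

TB = 10**12
PB = 10**15
SECONDS_PER_DAY = 24 * 60 * 60


def rate_gb_per_s(num_bytes, seconds):
    """Average rate in GB/s for num_bytes handled over the given time span."""
    return num_bytes / seconds / 10**9


# 500 TB of new data stored per day.
ingest_gb_s = rate_gb_per_s(500 * TB, SECONDS_PER_DAY)

# 105 TB of data scanned every 30 minutes.
scan_gb_s = rate_gb_per_s(105 * TB, 30 * 60)

# 70 000 queries executed per day.
queries_per_s = 70_000 / SECONDS_PER_DAY

# Days of ingest, at the quoted daily rate, that add up to 100 PB.
# Illustrative only: ignores replication, compression and data deletion.
days_to_100_pb = (100 * PB) / (500 * TB)

print(f"average ingest rate: {ingest_gb_s:.2f} GB/s")
print(f"average scan rate:   {scan_gb_s:.2f} GB/s")
print(f"average query rate:  {queries_per_s:.2f} queries/s")
print(f"days of ingest equal to 100 PB: {days_to_100_pb:.0f}")
```

Under these assumptions, the scan rate is roughly ten times the ingest rate, which suggests that the workload is dominated by repeated analysis of stored data rather than by data acquisition.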

Of course, Facebook is not the only entity producing Big Data. There is a large set of use cases in which Big Data analysis can produce knowledge. For instance, the IoT (Internet of Things) is the interconnection of uniquely identifiable devices connected to the Internet and to each other. Those devices are referred to as “connected” or “smart” devices. Big Data and IoT are closely connected, since billions of devices produce massive amounts of data, and companies can benefit from analyzing all that information in order to produce new knowledge. Another use case comes from healthcare and genomics, where the amount of data produced by sequencing, mapping and analyzing genomes takes those areas into the realm of Big Data. Big Data can improve operational efficiency, help predict and plan responses to disease epidemics, improve the quality of monitoring of clinical trials, and optimize healthcare spending at all levels, from patients to hospital systems to governments. Genome sequencing, in particular, is expected to be the future of healthcare [33]. Big Data can also be not that big: companies smaller than Facebook or Twitter may be interested in analyzing their data, as may institutions, public offices, banks, etc.

Extracting knowledge from Big Data is also related to programmability, that is, how to easily write programs and algorithms to analyze data, and to performance issues, such as scalability when running analyses and queries on multicores or clusters of multicores. For this reason, starting from the Google MapReduce paper [61], a constantly increasing number of frameworks for Big Data processing has been implemented. In the next section, we provide an overview of such frameworks, starting from a summary of data management.
