Platforms - Big Data mining and machine learning techniques applied to real world scenarios

High performance hardware platforms are becoming increasingly important for the analysis of big data, and choosing the right hardware/software platforms is crucial if the user’s requirements have to be satisfied in a reasonable amount of time [SR15]. Since many platforms exist with different characteristics, the choice of the most suitable platform requires an in-depth knowledge of all viable alternatives.

The essential feature every platform incorporates is scalability, namely the capability to grow and manage an increased demand in a reasonable amount of time. The big data platforms can be divided according to the following two types of scaling:

Horizontal scaling (scale out): the workload is distributed across many independent machines to improve the processing capability. Multiple instances of the operating system are typically running on separate machines.

Vertical scaling (scale up): involves installing new hardware typically within a single server, e.g. more processors, more memory and faster hardware. It usually involves a single instance of an operating system.

Horizontal scaling increases the performance in small steps, reducing the financial investment required to upgrade. Moreover, the system can scale out as much as needed, without any theoretical limitation. On the other hand, software has to handle the data distribution and has to deal with parallelism. Few software frameworks are available that can take advantage of horizontal scaling.

Vertical scaling is very handy to set up, because managing and installing hardware within a single machine is straightforward. Differently from horizontal scaling, most of the software can easily take advantage of vertical scaling. However, a substantial financial investment is required to scale up a machine. To handle future workloads, the user typically invests more than what is required for the current processing needs. This is due to the limited space and the number of expansion slots available in a single machine, which limit vertical scaling.

6.2.1 Horizontal scaling platforms

The most known platforms based on horizontal scaling are peer-to-peer networks [MKL+02, SW05], Apache Hadoop [Had09] and Spark [ZCF+10].

CHAPTER 6. BIG DATA ANALYTICS 43 Peer-to-peer networks

Peer-to-peer networks are a decentralized and distributed network architecture where the nodes in the network (i.e. peers) serve and consume resources, typically through message passing. While broadcasting messages is cheap, the aggregation of data and results is much more expensive.

The standard communication paradigm of peer-to-peer networks is Message Passing Interface (MPI), which has several desirable features: the ability to preserve the state of the computation; the hierarchical master/slave paradigm to support dynamic resource allocation where the slaves have large amounts of data to process; the barrier mechanism to synchronize processes at a certain point of the computation. On the other hand, MPI does not include a fault tolerance mechanism, so that a single node failure in a peer-to-peer network can cause the entire system to shut down. Since the enforcement of fault tolerance is completely transferred to the user, MPI is no longer widely used.

Hadoop

Hadoop is an open source framework for storing and processing large datasets using clusters of commodity hardware. Hadoop can scale up to thousands of nodes, and it is also fault tolerant, differently from MPI and peer-to-peer networks. The Hadoop platform contains various components (Figure 6.1), whose most important are a distributed file system (HDFS) [B+08] and a resource management layer (YARN) [VMD+13]. The former is used to store data across cluster of machines while providing high availability and fault tolerance, whereas the latter schedules the jobs across the cluster.

Figure 6.1: Hadoop Stack showing different components. Source: [SR15].

The programming model used in Hadoop is MapReduce [DG08], according to which a task is broken into two parts, namely mappers and reducers. The former read the data from HDFS, process them, and generate intermediate results, which are then aggregated by the latter to generate the final output to be written to HDFS.

Some wrappers have been developed for MapReduce to foster the source code development, including Apache Pig by Yahoo! [ORS+08] and Hive by Facebook [TSJ+09]. One of the major

44 CHAPTER 6. BIG DATA ANALYTICS

drawbacks of MapReduce is its inefficiency in running iterative algorithms, because MapReduce is not designed for iterative processes. A few approaches have been developed to circumvent this limitation [BHBE10, EPF08, PR12], which however introduce additional levels of complexity in the source code.

Spark

Spark is a next generation paradigm for big data processing developed by researchers at the University of California in Berkeley. Spark is able to perform in-memory computations, so as to allow data to be cached in memory, eliminating the Hadoop’s disk overhead limitation for iterative tasks. Generally, it is up to 100x faster than Hadoop MapReduce when data fit in memory, and up to 10x faster when data reside on disk. Spark can run on Hadoop Yarn manager, use HDFS, and support Java, Scala and Python, making it a general and versatile engine for large-scale data processing that may run on different systems.

6.2.2 Vertical scaling platforms

The most famous vertical scaling platforms include High Performance Computing Clusters (HPC) [B+99], Field Programmable Gate Arrays (FPGA) [BFRV12], Multicore processors [BBL11] and Graphics Processing Unit (GPU) [OHL+08].

HPC clusters

HPC clusters, sometimes referred as supercomputers, are machines with thousands of cores. Depending on requirements, they can have a different kind of disk organization, cache, communication mechanism, and so on. HPC clusters typically use MPI as the communication scheme, but fault tolerance is not a big deal here due to the top quality hardware employed, which makes failures extremely rare. Such top quality hardware is optimized for speed and throughput; as a consequence, the initial cost of deployment is very high. HPC clusters are not as scalable as Hadoop or Spark clusters, because the cost of scaling up such a system is much higher, but they are still able to process terabytes of data.

FPGAs

FPGAs are specialized hardware units that are custom-built for specific applications, and can be highly optimized for speed, e.g. orders of magnitude faster than other platforms. They are programmed using Hardware Descriptive Language (HDL) [TM08] at a higher development cost because of the customized hardware. In spite of being useful for a variety of real world applications, in particular network security [CCS11], the speed of multicore processors is recently reaching closer to that of FPGAs.

CHAPTER 6. BIG DATA ANALYTICS 45 Multicore processors

Multicore processors refer to one machine having dozens of processing cores, which typically have shared memory but only one disk. Recently, CPUs have gained internal parallelism, the number of cores per chip and the number of operations that a core can perform have significantly increased, and multiple CPUs are allowed within a single machine, supporting algorithm acceleration. The parallelism in CPUs is mainly achieved through multithreading, where a task is broken down into threads, which are executed in parallel on different CPU cores, sharing the same memory. A high-level support to multithreading is provided by most of the programming languages through libraries.

GPUs

Graphics Processing Unit (GPU) is a specialized hardware designed to accelerate the creation of images in a frame buffer intended for display output. Although GPUs were originally used just for video and image editing, general-purpose computing on graphics processing units (GPGPU) [ND10] has recently arisen.

A GPU has large number of processing cores, e.g. 3,000+, and has its own high throughput DDR5 memory, which is many times faster than a DDR3 memory. A GPU usually has two levels of parallelism: it contains multiprocessors (MPs) and within each multiprocessor there are several streaming processors (SPs). A GPU-based program breaks down into threads that execute on SPs, then these threads are grouped to form blocks that run on a MP. Each thread within a block can only communicate and synchronize with other threads in the same block, accessing a small portion of shared cache memory and a larger portion of global main memory. Since inter-block communication is not allowed, any task needs to be broken into blocks that can run independently of others.

Nvidia has recently launched Tesla series of GPUs for high performance computing, and the CUDA framework to make GPU programming accessible to software programmers. Con- sequently, GPUs have been used in the development of faster machine learning algorithms based on the CUDA framework. Beyond the advantages, GPUs also have drawbacks. First, they have a limited memory, not suitable to handle terabytes of data; when data exceed the size of memory, the disk access becomes a bottleneck. Second, few algorithms are easily portable to GPUs, because of the way in which the task is required to be broken into blocks.

CPUs vs GPUs

The processing power of GPUs is much higher than CPUs (i.e. 3,000+ processing cores with 1,000Tflops of processing power for a GPU, compared with tens of processing cores with 10Gflops of processing power for a CPU). The main drawbacks of CPUs are their limited number of processing cores and their dependence on the system memory for data access, which is the major bottleneck. GPU avoids this by using a DDR5 memory, much faster than a DDR3 memory

46 CHAPTER 6. BIG DATA ANALYTICS

used in a system. Moreover, in order to accelerate the data access, GPU uses a high speed cache for each multiprocessor.

In document Big Data mining and machine learning techniques applied to real world scenarios (Page 54-58)