From an application perspective, Big Data technologies fall into two categories. The first comprises systems with operational capabilities for real-time, interactive workloads where data is primarily captured and stored, for example, NoSQL databases. The second comprises systems that provide analytical capabilities for retrospective, complex analysis that may touch most or all of the data, for instance, massively parallel processing (MPP) database systems and MapReduce. In practice, both are often combined. That is why this chapter divides Big Data technologies into three categories: how Big Data is stored, how Big Data is processed in a parallel and distributed way, and how Big Data is analyzed.
There are many proposed reference architectures for Big Data. There is no one-size-fits-all solution and every architecture has its advantages and disadvantages. A high-level layer view is shown in Figure 6.2. It does not provide a conclusive list and serves as a scaffold to exemplify where the components described in this chapter might fit into an overall architecture. Some layers such as the management, monitoring, and security layer are not shown.
The data layer is where the data is processed using frameworks such as Hadoop, and data is stored using NoSQL databases such as MongoDB.
The integration layer is where the data is preprocessed, filtered for relevance, and transferred. For instance, Sqoop is designed to transfer bulk data, and Flume is designed for collecting, aggregating, and moving large amounts of log data. Both Sqoop and Flume are projects of the Apache Software Foundation.
Once the data has been prepared, it is ready to be looked at by the analytic layer. The data is visualized and analyzed for patterns using dashboards with slice-and-dice and search functionality. Tools such as Pentaho or QlikView provide this functionality. Big Data analysis is typically exploratory and iterative by nature.
[Figure 6.2: Big Data analytics stack. Labels recoverable from the figure: Hardware infrastructure (cluster, cloud); Data (data processing, data storage; Hadoop, NoSQL); Integration (data modeling, data preparation); Applications (business intelligence, visualization, analytics).]
Sometimes a predictive layer is added. Historic and real-time data are used to make predictions and identify risks and opportunities using machine learning techniques and business rules. In this chapter, it is considered part of the analytic layer.
This chapter describes the techniques and technologies of the data and analytics layers. The hardware and integration layers are beyond the scope of this chapter.
NoSQL Databases
NoSQL databases are a form of non-relational database management system. They are often schema-less, avoid joins, and scale easily. The term NoSQL is misleading since it implies that NoSQL databases do not support SQL. This is not necessarily true, since some do support SQL-like query languages. NoSQL is thus sometimes read as “Not only SQL.”
NoSQL databases are usually classified by their data model. The most widespread types, with examples, are
• Key-value store, for example, Dynamo, Riak, Scalaris, Voldemort
• Graph, for example, Neo4J, FlockDB, AllegroGraph
• BigTable, for example, Accumulo, HBase, Hypertable
• Document, for example, CouchDB, ElasticSearch, MongoDB
Data Model
Key-value Store
Key-value databases are based on Amazon’s Dynamo paper [7]. Voldemort, for instance, is a key-value database developed at LinkedIn. These databases store data as simple key-value pairs, and a value is retrieved by its key (Figure 6.3).
Key-value stores are one of the simplest forms of databases. They are often used as high performance in-process databases. For instance, a shopping cart at Amazon is stored in a key-value store.
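The access pattern of a key-value store can be illustrated with a minimal in-memory sketch. The class `KeyValueStore`, the key format `cart:session-42`, and the cart contents are hypothetical names chosen for illustration; this is not the API of Dynamo, Voldemort, or any other product named above.

```python
class KeyValueStore:
    """Toy in-memory key-value store: values are opaque and reachable only by key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Store or overwrite the value under the given key.
        self._data[key] = value

    def get(self, key, default=None):
        # Retrieval works only when the key is known; there is no query by value.
        return self._data.get(key, default)

    def delete(self, key):
        # Removing a missing key is a no-op.
        self._data.pop(key, None)


store = KeyValueStore()
# Hypothetical shopping cart keyed by a made-up session identifier.
store.put("cart:session-42", {"book-123": 1, "pen-7": 3})
cart = store.get("cart:session-42")
missing = store.get("cart:unknown")
```

The store never inspects the value, which is what keeps the model simple and fast, at the price of having no way to search by content.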
Graph
Graph databases are inspired by mathematical graph theory and model data and their associations. They contain nodes, the properties of those nodes, and the relationships between nodes, called edges. A node can hold from one to millions of values, called properties. Figure 6.4 shows nodes and their relations.
Data Process and Analysis Technologies of Big Data ◾ 107
Nodes are organized with explicit relationships. The graph is queried with a traversal: an algorithm walks the graph from starting nodes to related nodes to answer questions of the form “which of my friends work for the same company as I do?”
Graph databases are optimized for highly connected data.
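A traversal of the kind described above can be sketched over a small edge list. The people, the relationship names `FRIEND_OF` and `WORKS_FOR`, and the helper functions are all made up for illustration; a real graph database would expose its own query language or traversal API and would index edges rather than scan them.

```python
# Edges as (source, relationship, target) triples; all names are illustrative.
edges = [
    ("Peter", "FRIEND_OF", "Anna"),
    ("Peter", "FRIEND_OF", "Ben"),
    ("Peter", "FRIEND_OF", "Carla"),
    ("Peter", "WORKS_FOR", "NASA"),
    ("Anna", "WORKS_FOR", "NASA"),
    ("Ben", "WORKS_FOR", "ESA"),
    ("Carla", "WORKS_FOR", "NASA"),
]


def neighbors(node, relationship):
    """Follow all outgoing edges of one relationship type from a node."""
    return [t for s, r, t in edges if s == node and r == relationship]


def friends_at_same_company(person):
    """Start at `person`, hop to friends, keep those sharing an employer."""
    employers = set(neighbors(person, "WORKS_FOR"))
    result = []
    for friend in neighbors(person, "FRIEND_OF"):
        if employers & set(neighbors(friend, "WORKS_FOR")):
            result.append(friend)
    return result


colleagues = friends_at_same_company("Peter")
```

The query is answered purely by following relationships from a starting node, without the join over whole tables that a relational schema would need.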
BigTable
BigTable-style, or column-family, databases are based on Google’s BigTable paper [8]. BigTable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers [8]. A table in BigTable is a multidimensional map. In a cluster environment, it is distributed over the nodes. Data is organized in three dimensions: rows, columns, and timestamps. The basic storage unit is a cell, referenced by row key, column key, and timestamp. A cell can hold multiple timestamped versions of its data. Figure 6.5 shows a sample column family table.
BigTable can be regarded as a database with dynamic schemas.
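The three-dimensional addressing (row key, column key, timestamp) can be sketched as a nested map. The class `SparseTable`, the column names, and the integer timestamps are assumptions made for this illustration; the sketch ignores distribution, persistence, and everything else a real BigTable-style system provides.

```python
from collections import defaultdict


class SparseTable:
    """Toy multidimensional map: (row key, column key) -> {timestamp: value}."""

    def __init__(self):
        self._cells = defaultdict(dict)

    def put(self, row, column, timestamp, value):
        # Each write adds a new version instead of overwriting older ones.
        self._cells[(row, column)][timestamp] = value

    def get(self, row, column, timestamp=None):
        versions = self._cells.get((row, column), {})
        if not versions:
            return None
        if timestamp is None:
            # Default: return the most recent version.
            return versions[max(versions)]
        # Otherwise: return the newest version written at or before `timestamp`.
        eligible = [t for t in versions if t <= timestamp]
        return versions[max(eligible)] if eligible else None


table = SparseTable()
table.put("user1", "contact:email", 1, "old@example.com")
table.put("user1", "contact:email", 5, "new@example.com")
# Rows need not share columns, which is what makes the schema dynamic.
table.put("user2", "profile:city", 2, "Zurich")

latest = table.get("user1", "contact:email")
as_of_3 = table.get("user1", "contact:email", timestamp=3)
absent = table.get("user2", "contact:email")
```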
Document Store
Document store databases such as MongoDB or CouchDB store data as whole documents. A document can take almost any form, and most document-oriented databases accept different formats and encapsulate them in an internal format. XML and JSON are popular formats. A sample JSON document is shown in Figure 6.6.
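FIGURE 6.4 Graph database. (Labels recoverable from the figure: a node with the properties Name: Peter and Occupation: Astronaut; the nodes Facebook, Blogger, and NASA; the relationships Member of, Works for, and Contributes to.)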
The structure in a document database is usually flexible; a second JSON document in the same database might have more or fewer entries. A document can be processed directly by the application; no client-side processing is required.
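The flexible structure can be made concrete with two JSON documents that do not share the same fields. The documents, the field names, and the `find` helper are invented for illustration and do not reflect the query API of MongoDB, CouchDB, or any other product.

```python
import json

# Two documents in the same collection; the second has an extra field.
documents = [
    json.loads('{"name": "Peter", "occupation": "Astronaut"}'),
    json.loads(
        '{"name": "Maria", "occupation": "Engineer", '
        '"languages": ["Python", "Java"]}'
    ),
]


def find(collection, **criteria):
    """Return documents whose fields match all criteria; missing fields never match."""
    return [
        doc for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


astronauts = find(documents, occupation="Astronaut")
# The application uses a field only where it exists.
with_languages = [doc["name"] for doc in documents if "languages" in doc]
```

Because each document carries its own structure, adding a field to one document requires no change to the others.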
Other Characteristics
NoSQL databases have a simpler design than RDBMSs, which makes them faster for certain operations. Accessing NoSQL databases is also easier since no complex SQL queries have to be submitted. They often support their own, simpler query language and offer application programming interfaces (APIs) for different programming languages such as Java or Python.
NoSQL databases are usually distributed across multiple nodes for high availability and scale horizontally. When one node fails, another node takes over without interruption or data loss. Most NoSQL databases do not support full ACID (Atomicity, Consistency, Isolation, Durability) transactions. They sacrifice consistency in favor of high availability and partition tolerance.
Hadoop
Hadoop MapReduce has become one of the most popular tools for data processing [9]. Hadoop is an open-source parallel distributed programming framework based on the popular MapReduce framework from Google [5]. Hadoop MapReduce has evolved into an important industry standard for massively parallel data processing and has become widely adopted for a variety of use cases [10]. However, there are alternatives such as Apache Spark, HPCC Systems, or Apache Flink.
Hadoop, originally developed by Yahoo! [11], is used by Facebook [12], Amazon, eBay, Yahoo! [13], and IBM [14] to name a few.
Hadoop is normally installed on a cluster of computers [9]. Hadoop provides a distributed file system and a framework for the analysis and transformation of very large data sets using the MapReduce paradigm [11]. The distributed file system, the HDFS, is at the storage level, MapReduce at the execution level.
MapReduce and the HDFS are the core components of Hadoop. Other components are Hadoop Common, the common utilities that support the other Hadoop modules and Hadoop YARN for job scheduling and cluster resource management [15].
With the release of version 2 of Hadoop, the workload management has been decoupled from MapReduce and Yet Another Resource Negotiator (YARN) has taken over this functionality. Figure 6.7 shows the two Hadoop versions.
MapReduce
MapReduce is based on a paper by Google [16]. MapReduce uses a “divide and conquer” strategy. A Hadoop job consists of two steps, a map step and a reduce step. Optionally, there can be additional steps before and between the map and reduce steps. The map step reads in data in the form of key/value pairs, processes it, and emits a set of zero or more key/value pairs. For example, in a text mining task, the parsing and preprocessing are done in the map step. The output is passed to the reduce step, where a number of reducers compute an aggregated, reduced set of data, such as word counts and frequencies, again as key/value pairs. Figure 6.8 shows the Hadoop processing steps.
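The map, shuffle, and reduce steps can be sketched in a single process using the word count example. This is a minimal illustration of the programming model, not Hadoop's API: the function names `map_step`, `shuffle`, `reduce_step`, and `run_job` are hypothetical, and a real job would run the mappers and reducers in parallel on different nodes.

```python
from collections import defaultdict


def map_step(_key, line):
    """Map: read one (offset, line) pair and emit zero or more (word, 1) pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_step(key, values):
    """Reduce: aggregate the list of values for one unique key."""
    return key, sum(values)


def run_job(lines):
    mapped = [
        pair
        for offset, line in enumerate(lines)
        for pair in map_step(offset, line)
    ]
    grouped = shuffle(mapped)
    return dict(reduce_step(key, values) for key, values in grouped.items())


counts = run_job(["big data is big", "data is stored"])
```

Since each call to `map_step` depends only on its own input line and each call to `reduce_step` only on the values of one key, both steps can be distributed across machines without coordination beyond the shuffle.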
FIGURE 6.7 Hadoop V1 versus Hadoop V2. (Hadoop V1: MapReduce handles both resource management and data processing on top of HDFS, the distributed redundant storage. Hadoop V2: MapReduce handles data processing and YARN handles resource management, on top of HDFS.)
[Figure 6.8: Hadoop processing steps. Input data is split into multiple smaller data blocks; in the map step, mappers emit lists of <key, value> pairs; the shuffle produces <unique key, list of values> pairs; in the reduce step, reducers emit lists of <unique key, value> pairs as output.]
YARN
Hadoop has undergone a major rework from version 1 to version 2. Resource management has been separated out into YARN. YARN, also called MapReduce 2.0, consists of a global resource manager and a per-node slave, the node manager. The resource manager arbitrates resources among all applications, whereas the per-application Application Master (App Msr) negotiates resources from the resource manager and works with the node managers to execute and monitor tasks. The resource manager consists of an applications manager and a scheduler. The scheduler is responsible for allocating resources to the running applications, subject to constraints such as capacities and queues. The applications manager accepts job submissions, negotiates the first container for executing the Application Master, and restarts it in case of failure.
The node manager is a per-machine agent. It is responsible for containers: it monitors their resource usage (CPU, memory, and disk space) and reports it to the resource manager and scheduler. Figure 6.9 shows the interactions of the YARN components.
Hadoop Distributed File System
HDFS is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications [11]. HDFS stores data across the cluster so that it appears as one contiguous storage volume, and it provides redundancy and fault tolerance. It has been optimized for many concurrent, sequential reads of large files. HDFS supports only a single writer per file, and efficient random access is not really possible. There are replacement file systems that support full random-access read/write functionality, such as MapR Direct Access NFS [17] or IBM’s General Parallel File System (GPFS), that can be used with Hadoop instead [18].
FIGURE 6.9 YARN. (Adapted from Thusoo, A. et al., Data Warehousing and Analytics Infrastructure at Facebook, in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, Indianapolis, Indiana, USA, 2010, pp. 1013–1020.) Components shown: clients, the resource manager (master), and node managers (slaves) hosting containers and Application Masters (App Msr); interactions shown: job submission, node status, resource request, and MapReduce status.
While the interface to HDFS is patterned after the UNIX file system, faithfulness to standards was sacrificed in favor of improved performance for the applications at hand [11]. Application data and file system metadata are stored separately. Application data is stored on DataNodes; metadata is stored on the NameNode. All nodes are connected and use TCP-based protocols to communicate with each other.
HDFS breaks files up into blocks and stores them on different file system nodes. Since the blocks are replicated across different nodes, failure of one node is a minor issue. HDFS then creates another copy of each affected block to re-establish the desired number of replicas.
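The splitting, replication, and re-replication described above can be sketched as follows. The tiny block size, the node names, and the round-robin placement are simplifications assumed for this illustration; they are not HDFS's actual block size or placement policy, and the function names are hypothetical.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a file into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication):
    """Assumed round-robin policy: each block goes to `replication` distinct nodes."""
    return {
        block_id: {nodes[(block_id + r) % len(nodes)] for r in range(replication)}
        for block_id in range(num_blocks)
    }


def handle_node_failure(placement, failed, nodes, replication):
    """Drop the failed node and copy under-replicated blocks to live nodes."""
    live = [node for node in nodes if node != failed]
    for holders in placement.values():
        holders.discard(failed)
        for candidate in live:
            if len(holders) >= replication:
                break
            holders.add(candidate)  # no effect if the node already holds the block
    return placement


data = b"x" * 10
blocks = split_into_blocks(data, block_size=4)
nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(len(blocks), nodes, replication=3)
placement = handle_node_failure(placement, "node2", nodes, replication=3)
```

After the simulated failure, every block is again held by three live nodes, which is why the loss of a single node does not lose data.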
The Hadoop ecosystem provides supporting analysis technologies that include the following:
• Apache Hive, a simple SQL-like interface
• Apache Pig, a high-level scripting language
• Apache Mahout, a data mining framework
The Apache ecosystem is shown in Figure 6.10. It is not a comprehensive list of Apache projects.
Hadoop is a complex system. It is difficult to program and needs skilled developers and data scientists to be properly used.
Big Data Analytics
There is no clear distinction between Big Data and data mining in the literature. Data mining is the analytic process of exploring data to detect patterns and relationships and then applying the detected patterns to new data. Sometimes data mining is seen as focusing on the predictive rather than the descriptive side. However, data mining techniques such as association rule mining and sequential pattern mining also aim to understand relationships between data. Here, data mining is considered as an analytic step of Big Data.
Data Mining
Data mining is also called knowledge discovery in databases (KDD) [19]. Data mining is defined as the process of discovering patterns in data [20]. The data sources include databases, the Web, texts, multimedia content, and machine-generated data such as sensor data. Examples of patterns are correlations between genomes and diseases to predict possible illnesses or seismic activities to detect new oil sources. Data mining is a multidisciplinary field in the areas of statistics, databases, machine learning, artificial intelligence, information retrieval, and visualization.
[Figure 6.10: Apache ecosystem. Components shown: HDFS (Hadoop distributed file system), YARN (MRv2, distributed processing framework), ZooKeeper (coordination), Flume (log collector), Sqoop (data exchange), Oozie (workflow), Pig (scripting), Mahout (machine learning), R Connector (statistics), Hive, HBase (columnar store), Storm (streaming).]
Data mining goes through several steps. They are divided into a data conditioning or data preprocessing phase and a data analysis phase. The data conditioning phase consists of a data collection and filtering step and a step where the predictor variables are elaborated. The analysis phase comprises a model selection step and a step where the performance of the model is evaluated. The data preprocessing techniques depend on the data that is being mined, whether it is text data, financial or sensor data and what it is being mined for. Text can be mined for instance for opinions or for spam to filter out junk mail. Sensor data can be mined for possible malfunctions or fatigue of parts that need to be replaced. Figure 6.11 shows the data mining process.
The data mining process usually starts with understanding the application domain and selecting suitable data sources for data collection and the target data.
The collected data is usually not in a suitable form for data mining. It has to go through a cleaning process. In the preprocessing step, irrelevant data such as stop words, noise, and abnormalities are removed and the attributes and features are selected. For instance, in opinion mining, features can be frequencies of sentiment words, that is, how often a word such as “great” or “excellent” appears in a text. Preparing input for a data mining investigation usually consumes the bulk of the effort invested in the entire data mining process [20]. Data preprocessing is highly domain specific and beyond the scope of this chapter.
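The opinion mining example can be sketched as a small preprocessing routine that removes stop words and counts sentiment words. The stop-word list and the sentiment lexicon below are tiny, hypothetical stand-ins; real preprocessing is domain specific and would use much larger resources.

```python
import re
from collections import Counter

# Tiny illustrative word lists, not a real stop-word list or sentiment lexicon.
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "it", "this", "of"}
SENTIMENT_WORDS = {"great", "excellent", "poor", "terrible"}


def preprocess(text):
    """Lowercase, tokenize, and remove stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def sentiment_features(text):
    """Feature vector: frequency of each sentiment word in the cleaned text."""
    counts = Counter(preprocess(text))
    return {word: counts[word] for word in sorted(SENTIMENT_WORDS)}


review = "The plot was great and the acting is excellent. Great film!"
tokens = preprocess(review)
features = sentiment_features(review)
```

The resulting dictionary is the kind of attribute vector that a data mining algorithm would receive as input in the next step.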
[Figure 6.11: Data mining process.
Data collection: selection of data sources; historic or real-time data collection.
Preprocessing: relevance filtering; attribute or feature selection.
Data mining: selection of data mining algorithm; applying data to the selected model.
Postprocessing: selection of useful patterns; visualization and reporting.]
In the data mining step, the preprocessed data is fed into a data mining algorithm. Usually, several algorithms are tested and the best performing one is selected. The data mining step produces patterns or knowledge.
Postprocessing involves identifying the useful patterns for the application and visualizing the patterns for reporting and decision making.
There are many data mining techniques. Some of the more common ones are supervised and unsupervised learning, association rule mining, and sequential pattern mining. These techniques will be covered in the next chapters.
Supervised Learning
Analogous to human learning, machine learning can be defined as gaining knowledge from past experiences to perform better in future tasks. Since computers do not have experiences like humans, machine learning learns from data collected in the past. A typical application of machine learning is spam filtering. Spam filters classify emails into legitimate emails and unsolicited bulk emails, called “spam” or “junk mail.” They have to adapt, that is, learn, to detect new forms of spam. Machine learning is a subfield of artificial intelligence. One form of machine learning is supervised learning.
Supervised learning is also called classification or inductive learning [19]. Supervised learning methods are used when the class label is known. For instance, text documents can be classified into different topics: politics, sciences, or arts. Here, the class is the topic and the class labels are the values “politics,” “sciences,” and “arts.” The classifier uses attributes as input. Attributes can be frequencies of words when classifying text documents or the age, income, and wealth of a person when classifying loan applications into “approved” or “rejected.” Supervised learning is often used in predictive analysis, “namely, learning a target function that can be used to predict the values of a discrete class attribute” [19].
Given a data set $D$ whose records are described by a set of attributes $A = \{A_1, A_2, \ldots, A_{|A|}\}$, where $|A|$ denotes the number of attributes, and a target attribute $C$, called the class attribute, with a set of discrete values $C = \{c_1, c_2, \ldots, c_{|C|}\}$, where $|C|$ is the number of classes and $|C| \geq 2$, the objective is to find a function $f$ that relates the values of the attributes in $A$ to the classes in $C$. $f$ is the classification/prediction function, the model, or simply the classifier. The function can be of any form, for instance a Bayesian model, a decision tree, or a neural network. It is called supervised learning because the class labels $C$ are provided in the data. In unsupervised learning, the classes are unknown and the learning algorithm needs to generate the classes.
For supervised learning methods, data is typically divided into training and test data. The training data is used to build the model. The model is trained using a training algorithm. Learning is the process of building the model. The test data is used to evaluate the trained model’s accuracy. To assess the model, the test data usually has class labels too.
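The division into training and test data can be sketched with a deliberately simple classifier. A nearest-centroid rule is swapped in here for brevity in place of the Bayesian models, decision trees, or neural networks mentioned in the text. The loan data is made up, and accuracy is computed as the fraction of test examples classified correctly, which is the usual convention.

```python
def train(examples):
    """Nearest-centroid training: average the attribute vectors of each class."""
    sums, counts = {}, {}
    for attributes, label in examples:
        acc = sums.setdefault(label, [0.0] * len(attributes))
        for i, value in enumerate(attributes):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def classify(model, attributes):
    """The classifier f: assign the label of the closest class centroid."""
    def squared_distance(centroid):
        return sum((a - c) ** 2 for a, c in zip(attributes, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))


def accuracy(model, examples):
    """Fraction of labeled examples for which the predicted label is correct."""
    correct = sum(1 for attrs, label in examples if classify(model, attrs) == label)
    return correct / len(examples)


# Made-up loan applications: (income, wealth) in thousands, with class labels.
training_data = [
    ((20, 5), "rejected"), ((25, 8), "rejected"), ((22, 4), "rejected"),
    ((80, 60), "approved"), ((90, 75), "approved"), ((85, 50), "approved"),
]
test_data = [((30, 10), "rejected"), ((95, 70), "approved"), ((28, 12), "rejected")]

model = train(training_data)
test_accuracy = accuracy(model, test_data)
prediction = classify(model, (70, 40))  # unseen application without a label
```

The test examples are kept out of `train` so that the measured accuracy reflects how the classifier behaves on data it has not seen.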
Usually the training goes through several iterations until the classification accuracy converges. The resulting classifier is used to assign class labels to unseen data where the class label is unknown. The accuracy is defined as