
6.3.4 Choosing a style of algorithm

Table 6.1 summarizes this section in terms of recomputation and incremental algorithms.

The key takeaway is that you must always have recomputation versions of your algorithms. This is the only way to ensure human-fault tolerance for your system, and human-fault tolerance is a non-negotiable requirement for a robust system.

Additionally, you have the option to add incremental versions of your algorithms to make them more resource-efficient.

For the remainder of this chapter, we’ll focus solely on recomputation algorithms, though in chapter 18 we’ll come back to the topic of incrementalizing the batch layer.

6.4 Scalability in the batch layer

The word scalability gets thrown around a lot, so let’s carefully define what it means in a data systems context. Scalability is the ability of a system to maintain performance under increased load by adding more resources. Load in a Big Data context is a combination of the total amount of data you have, how much new data you receive every day, how many requests per second your application serves, and so forth.

More important than a system being scalable is a system being linearly scalable. A linearly scalable system can maintain performance under increased load by adding resources in proportion to the increased load. A nonlinearly scalable system, despite being “scalable,” isn’t particularly useful. Suppose the number of machines you need has a quadratic relationship to the load on your system, as in figure 6.8. The costs of running your system would rise dramatically over time: increasing your load tenfold would increase your costs a hundredfold. Such a system isn’t feasible from a cost perspective.

Table 6.1 Comparing recomputation and incremental algorithms

Criterion | Recomputation algorithms | Incremental algorithms
Performance | Requires computational effort to process the entire master dataset | Requires less computational resources but may generate much larger batch views
Human-fault tolerance | Extremely tolerant of human errors because the batch views are continually rebuilt | Doesn’t facilitate repairing errors in the batch views; repairs are ad hoc and may require estimates
Generality | Complexity of the algorithm is addressed during precomputation, resulting in simple batch views and low-latency, on-the-fly processing | Requires special tailoring; may shift complexity to on-the-fly query processing
Conclusion | Essential to supporting a robust data-processing system | Can increase the efficiency of your system, but only as a supplement to recomputation algorithms

Figure 6.8 Nonlinear scalability (axes: load versus number of machines needed)

When a system is linearly scalable, costs rise in proportion to the load. This is a critically important property of a data system.
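To make the cost difference concrete, the following minimal sketch compares the two growth patterns. The function names and the baseline of 10 machines are hypothetical and chosen only for illustration. The quadratic model simply mirrors the relationship described for figure 6.8.

```python
def machines_linear(base_machines, load_factor):
    """Machines needed when resources grow in proportion to the load."""
    return base_machines * load_factor


def machines_quadratic(base_machines, load_factor):
    """Machines needed when resources grow with the square of the load."""
    return base_machines * load_factor ** 2


# Hypothetical baseline: 10 machines handle the current load.
base = 10
for factor in (1, 2, 10):
    print(factor, machines_linear(base, factor), machines_quadratic(base, factor))
```

Under these assumptions, a tenfold load needs 100 machines in the linear case and 1,000 in the quadratic case, which is the hundredfold cost increase mentioned earlier.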

We delved into this discussion about scalability to set the scene for introducing MapReduce, a distributed computing paradigm that can be used to implement a batch layer. As we cover the details of its workings, keep in mind that it’s linearly scalable: should the size of your master dataset double, then twice the number of servers will be able to build the batch views with the same latency.

6.5 MapReduce: a paradigm for Big Data computing

MapReduce is a distributed computing paradigm originally pioneered by Google that provides primitives for scalable and fault-tolerant batch computation. With MapReduce, you write your computations in terms of map and reduce functions that manipulate key/value pairs. These primitives are expressive enough to implement nearly any function, and the MapReduce framework executes those functions over the master dataset in a distributed and robust manner. Such properties make MapReduce an excellent paradigm for the precomputation needed in the batch layer, but it’s also a low-level abstraction where expressing computations can be a large amount of work.

The canonical MapReduce example is word count. Word count takes a dataset of text and determines the number of times each word appears throughout the text. The map function in MapReduce executes once per line of text and emits any number of key/value pairs. For word count, the map function emits a key/value pair for every word in the text, setting the key to the word and the value to 1:

function word_count_map(sentence) {
  for(word in sentence.split(" ")) {
    emit(word, 1)
  }
}

What scalability doesn’t mean...

Counterintuitively, a scalable system doesn’t necessarily have the ability to increase performance by adding more machines. For an example of this, suppose you have a website that serves a static HTML page. Let’s say that every web server you have can serve 1,000 requests/sec within a latency requirement of 100 milliseconds. You won’t be able to lower the latency of serving the web page by adding more machines; an individual request is not parallelizable and must be satisfied by a single machine. But you can scale your website to increased requests per second by adding more web servers to spread the load of serving the HTML.

More practically, with algorithms that are parallelizable, you might be able to increase performance by adding more machines, but the improvements will diminish the more machines you add. This is because of the increased overhead and communication costs associated with having more machines.

MapReduce then arranges the output from the map functions so that all values from the same key are grouped together.

The reduce function then takes the full list of values sharing the same key and emits new key/value pairs as the final output. In word count, the input is a list of 1 values for each word, and the reducer simply sums the values to compute the count for that word:

function word_count_reduce(word, values) {
  sum = 0
  for(val in values) {
    sum += val
  }
  emit(word, sum)
}

There’s a lot happening under the hood to run a program like word count across a cluster of machines, but the MapReduce framework handles most of the details for you. The intent is for you to focus on what needs to be computed without worrying about the details of how it’s computed.
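As a rough illustration of the whole flow, here is a minimal single-process sketch of word count in Python. It is not how a MapReduce framework is implemented. The `run_local` helper is a hypothetical stand-in for the framework that groups values by key in memory, and the map and reduce functions are Python renderings of the pseudocode above.

```python
from collections import defaultdict


def word_count_map(sentence):
    """Emit a (word, 1) pair for every word in one line of text."""
    for word in sentence.split():
        yield (word, 1)


def word_count_reduce(word, values):
    """Sum the list of 1 values collected for a single word."""
    yield (word, sum(values))


def run_local(lines, map_fn, reduce_fn):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            grouped[key].append(value)
    output = {}
    for key, values in grouped.items():
        for out_key, out_value in reduce_fn(key, values):
            output[out_key] = out_value
    return output


counts = run_local(["to be or not to be", "brevity is the soul of wit"],
                   word_count_map, word_count_reduce)
print(counts["to"], counts["be"], counts["wit"])  # prints: 2 2 1
```

The map and reduce functions never refer to machines, files, or threads, which is the separation of "what" from "how" described above.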

6.5.1 Scalability

MapReduce is such a powerful paradigm because programs written in terms of MapReduce are inherently scalable. A program that runs on 10 gigabytes of data will also run on 10 petabytes of data. MapReduce automatically parallelizes the computation across a cluster of machines regardless of input size. All the details of concurrency, transferring data between machines, and execution planning are abstracted away for you by the framework.

Let’s walk through how a program like word count executes on a MapReduce cluster. The input to your MapReduce program is stored within a distributed filesystem such as the Hadoop Distributed File System (HDFS) you encountered in the last chapter. Before processing the data, the program first determines which machines in your cluster host the blocks containing the input—see figure 6.9.

[Figure 6.9: a data file, input.txt, stored in a distributed filesystem spanning servers 1 through 6, with its file block locations listed by server. Before a MapReduce program begins processing data, it first determines the block locations within the distributed filesystem.]

After determining the locations of the input, MapReduce launches a number of map tasks proportional to the input data size. Each of these tasks is assigned a subset of the input and executes your map function on that data. Because the code is typically far smaller than the data, MapReduce attempts to assign tasks to servers that host the data to be processed. As shown in figure 6.10, moving the code to the data avoids the need to transfer all that data across the network.
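The idea of moving the code to the data can be sketched as a small assignment routine. This is a simplified illustration, not the scheduling logic of any real framework. The `assign_map_tasks` function, the block ids, and the tie-breaking rule (fewest tasks so far, then lowest server number) are all assumptions made for the example.

```python
def assign_map_tasks(block_locations):
    """Assign each input block to one of the servers that already stores it.

    block_locations maps a block id to the list of servers holding a copy.
    Among those servers, pick the one with the fewest tasks assigned so far.
    """
    tasks_per_server = {}
    assignment = {}
    for block, servers in sorted(block_locations.items()):
        chosen = min(servers, key=lambda s: (tasks_per_server.get(s, 0), s))
        assignment[block] = chosen
        tasks_per_server[chosen] = tasks_per_server.get(chosen, 0) + 1
    return assignment


# Hypothetical layout: block id -> servers holding a copy of that block.
locations = {"block-0": [1, 3], "block-1": [2, 3], "block-2": [1, 2]}
print(assign_map_tasks(locations))
# prints: {'block-0': 1, 'block-1': 2, 'block-2': 1}
```

Every task in this sketch runs on a server that already holds its block, so only the map code has to travel across the network. A real scheduler weighs more factors than this sketch does.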

Like map tasks, there are also reduce tasks spread across the cluster. Each of these tasks is responsible for computing the reduce function for a subset of keys generated by the map tasks. Because the reduce function requires all values associated with a given key, a reduce task can’t begin until all map tasks are complete.

Once the map tasks finish executing, each emitted key/value pair is sent to the reduce task responsible for processing that key. Therefore, each map task distributes its output among all the reducer tasks. This transfer of the intermediate key/value pairs is called shuffling and is illustrated in figure 6.11.
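A minimal sketch of the shuffle follows, assuming keys are routed by hashing them modulo the number of reduce tasks. That routing rule is a common choice, but it is an assumption here and not something this section specifies. The `partition` and `shuffle` names are hypothetical.

```python
import zlib


def partition(key, num_reducers):
    """Map a key to a reduce task index using a stable hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Route every (key, value) pair from every map task to its reduce task."""
    reducer_inputs = [[] for _ in range(num_reducers)]
    for pairs in map_outputs:
        for key, value in pairs:
            reducer_inputs[partition(key, num_reducers)].append((key, value))
    return reducer_inputs


# Output of two hypothetical map tasks.
map_outputs = [
    [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)],
    [("the", 1), ("soul", 1), ("of", 1), ("wit", 1), ("to", 1)],
]
for index, pairs in enumerate(shuffle(map_outputs, 2)):
    print(index, pairs)
```

The sketch uses `zlib.crc32` instead of Python's built-in `hash` because string hashes are randomized per process by default, and map tasks running in separate processes must all agree on where a given key goes.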

[Figure 6.10 MapReduce promotes data locality, running tasks on the servers that host the input data. Code is sent to the servers hosting the input files to limit network traffic across the cluster. The map tasks on servers 1 and 3 generate intermediate key/value pairs, such as <to,1>, <be,1>, <or,1>, <not,1> and <brevity,1>, <is,1>, <the,1>, <soul,1>, that will be redirected to reduce tasks.]

[Figure 6.11: During the shuffle phase, all of the key/value pairs generated by the map tasks are distributed among the reduce tasks. In this process, all of the pairs with the same key are sent to the same reducer.]

Once a reduce task receives all of the key/value pairs from every map task, it sorts the key/value pairs by key. This has the effect of organizing all the values for any given key to be together. The reduce function is then called for each key and its group of values, as demonstrated in figure 6.12.
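The sort-then-group step inside a single reduce task can be sketched as follows. This is an in-memory illustration only. The `run_reduce_task` name is hypothetical, and a real framework would not hold all of the pairs in a Python list.

```python
from itertools import groupby
from operator import itemgetter


def run_reduce_task(pairs, reduce_fn):
    """Sort the received pairs by key, then call reduce_fn once per key."""
    results = []
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        values = [value for _, value in group]
        results.extend(reduce_fn(key, values))
    return results


def word_count_reduce(word, values):
    """Sum the counts for one word, as in the word-count reducer."""
    return [(word, sum(values))]


received = [("to", 1), ("be", 1), ("or", 1), ("to", 1), ("be", 1)]
print(run_reduce_task(received, word_count_reduce))
# prints: [('be', 2), ('or', 1), ('to', 2)]
```

Sorting first matters because `groupby` only merges adjacent items with equal keys. Once the pairs are sorted, all values for a key sit next to each other and can be handed to the reduce function in one call.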

As you can see, there are many moving parts to a MapReduce program. The important takeaways from this overview are the following:

■ MapReduce programs execute in a fully distributed fashion with no central point of contention.

■ MapReduce is scalable: the map and reduce functions you provide are executed in parallel across the cluster.

■ The challenges of concurrency and assigning tasks to machines are handled for you.