• No results found

Data storage on the batch layer: Illustration

5.1 Using the Hadoop Distributed File System

6.5.2 Fault-tolerance

Distributed systems are notoriously testy. Network partitions, server crashes, and disk failures are relatively rare for a single server, but the likelihood of something going wrong greatly increases when coordinating computation over a large cluster of machines. Thankfully, in addition to being easily parallelizable and inherently scal- able, MapReduce computations are also fault tolerant.

A program can fail for a variety of reasons: a hard disk can reach capacity, the process can exceed available memory, or the hardware can break down. MapReduce watches for these errors and automatically retries that portion of the computation on another node. An entire application (commonly called a job) will fail only if a task fails more than a configured number of times—typically four. The idea is that a single failure may arise from a server issue, but a repeated failure is likely a problem with your code.

Because tasks can be retried, MapReduce requires that your map and reduce func- tions be deterministic. This means that given the same inputs, your functions must always produce the same outputs. It’s a relatively light constraint but important for MapReduce to work correctly. An example of a non-deterministic function is one that generates random numbers. If you want to use random numbers in a MapReduce job, you need to make sure to explicitly seed the random number generator so that it always produces the same outputs.

<to, 1> <and, 1> <from, 1> <to, 1> <here, 1> <from, 1> <and, 1> ... <and, 1> <and, 1> <from, 1> <from, 1> <here, 1> <to, 1> <to, 1> ... <and, 2> <from, 2> <here, 1> <to, 2> ... Sort Reduce

Figure 6.12 A reduce task sorts the incoming data by key, and then performs the re- duce function on the resulting groups of values.

6.5.3 Generality of MapReduce

It’s not immediately obvious, but the computational model supported by MapReduce is expressive enough to compute almost any functions on your data. To illustrate this, let’s look at how you could use MapReduce to implement the batch view functions for the queries introduced at the beginning of this chapter.

IMPLEMENTING NUMBER OF PAGEVIEWS OVER TIME

The following MapReduce code produces a batch view for pageviews over time:

function map(record) {

key = [record.url, toHour(record.timestamp)] emit(key, 1)

}

function reduce(key, vals) {

emit(new HourPageviews(key[0], key[1], sum(vals))) }

This code is very similar to the word count code, but the key emitted from the mapper is a struct containing the URL and the hour of the pageview. The output of the reducer is the desired batch view containing a mapping from [url,hour] to the num- ber of pageviews for that hour.

IMPLEMENTING GENDER INFERENCE

The following MapReduce code infers the gender of supplied names:

function map(record) {

emit(record.userid, normalizeName(record.name)) }

function reduce(userid, vals) { allNames = new Set()

for(normalizedName in vals) { allNames.add(normalizedName) } maleProbSum = 0.0 for(name in allNames) { maleProbSum += maleProbabilityOfName(name) }

maleProb = maleProbSum / allNames.size() if(maleProb > 0.5) {

gender = "male" } else {

gender = "female" }

emit(new InferredGender(userid, gender)) }

Gender inference is similarly straightforward. The map function performs the name semantic normalization, and the reduce function computes the predicted gender for each user.

Semantic normalization occurs during the mapping stage. A set is used to remove any potential duplicates. Averages the probabilities of being male. Returns the most

IMPLEMENTING INFLUENCE SCORE

The influence-score precomputation is more complex than the previous two exam- ples and requires two MapReduce jobs to be chained together to implement the logic. The idea is that the output of the first MapReduce job is fed as the input to the second MapReduce job. The code is as follows:

function map1(record) {

emit(record.responderId, record.sourceId) }

function reduce1(userid, sourceIds) { influence = new Map(default=0) for(sourceId in sourceIds) { influence[sourceId] += 1 } emit(topKey(influence)) } function map2(record) { emit(record, 1) }

function reduce2(influencer, vals) {

emit(new InfluenceScore(influencer, sum(vals))) }

It’s typical for computations to require multiple MapReduce jobs—that just means multiple levels of grouping were required. Here the first job requires grouping all reactions for each user to determine that user’s top influencer. The second job then groups the records by top influencer to determine the influence scores.

Take a step back and look at what MapReduce is doing at a fundamental level:

It arbitrarily partitions your data through the key you emit in the map phase.

Arbitrary partitioning lets you connect your data together for later processing while still processing everything in parallel.

It arbitrarily transforms your data through the code you provide in the map and

reduce phases.

It’s hard to envision anything more general that could still be a scalable, distributed system.

The first job determines the top influencer for each user.

The top influencer data is then used to determine the number of people each user influences.

MapReduce vs. Spark

Spark is a relatively new computation system that has gained a lot of attention. Spark’s computation model is “resilient distributed datasets.” Spark isn’t any more general or scalable than MapReduce, but its model allows it to have much higher per- formance for algorithms that have to repeatedly iterate over the same dataset (because Spark is able to cache that data in memory rather than read it from disk every time). Many machine-learning algorithms iterate over the same data repeatedly, making Spark particularly well suited for that use case.