Data storage on the batch layer: Illustration
5.1 Using the Hadoop Distributed File System
6.1.3 Influence score
The final example operates over a Twitter-inspired dataset containing reaction records. Each reaction record contains sourceId and responderId fields, indicating that
responderId retweeted or replied to sourceId’s post.
The query determines an influencer score for each person in the social network. The score is computed in two steps. First, the top influencer for each person is selected based on the number of reactions the influencer caused in that person. Then, someone’s influence score is set to the number of people for which he or she was the top influencer.
The algorithm to determine a user’s influence score is as follows:
function influence_score(masterDataset, personId) { influence = new Map()
for(record in masterDataset) {
curr = influence.get(record.responderId) || new Map(default=0) curr[record.sourceId] += 1
Normalizes all names associated with the person
Averages each name’s probability of being male
Returns the gender with the highest likelihood
Computes amount of influence between all pairs of people
influence.set(record.sourceId, curr) } score = 0 for(entry in influence) { if(topKey(entry.value) == personId) { score += 1 } } return score }
In this code, the topKey function is mocked because it’s straightforward to imple- ment. Otherwise, the algorithm simply counts the number of reactions between each pair of people and then counts the number of people for whom the queried user is the top influencer.
6.2
Computing on the batch layer
Let’s take a step back and review how the Lambda Architecture works at a high level. When processing queries, each layer in the Lambda Architecture has a key, comple- mentary role, as shown in figure 6.1.
Counts the number of people for whom personId is the top influencer
Batch layer Serving layer Speed layer Realtime view Batch view Master dataset New data: 011010010... Query: “How many...?”
The serving layer serves the precomputed results with low-latency reads. The speed layer fills the
latency gap by querying recently obtained data.
c
b
d
Realtime view Realtime view Batch view Batch viewThe batch layer precomputes functions over the master dataset. Processing the entire dataset introduces high latency.
The batch layer runs functions over the master dataset to precompute intermediate data called batch views. The batch views are loaded by the serving layer, which indexes them to allow rapid access to that data. The speed layer compensates for the high latency of the batch layer by providing low-latency updates using data that has yet to be precomputed into a batch view. Queries are then satisfied by processing data from the serving layer views and the speed layer views, and merging the results.
A linchpin of the architecture is that for any query, it’s possible to precompute the data in the batch layer to expedite its processing by the serving layer. These precompu- tations over the master dataset take time, but you should view the high latency of the batch layer as an opportunity to do deep analyses of the data and connect diverse pieces of data together. Remember, low-latency query serving is achieved through other parts of the Lambda Architecture.
A naive strategy for computing on the batch layer would be to precompute all possible queries and cache the results in the serving layer. Such an approach is illustrated in figure 6.2.
Unfortunately you can’t always pre- compute everything. Consider the pageviews-over-time query as an exam- ple. If you wanted to precompute every potential query, you’d need to deter-
mine the answer for every possible range of hours for every URL. But the number of ranges of hours within a given time frame can be huge. In a one-year period, there are approximately 380 million distinct hour ranges. To precompute the query, you’d need to precompute and index 380 million values for every URL. This is obviously infeasible and an unworkable solution.
Instead, you can precompute intermediate results and then use these results to complete queries on the fly, as shown in figure 6.3.
Master dataset Function Batch view Function Function Query results Function Batch view Batch view
Precomputation Low-latency query
Figure 6.3 Splitting a query into pre- computation and on-the-fly compo- nents Master dataset Query results Function
Figure 6.2 Precomputing a query by running a function on the master dataset directly
For the pageviews-over-time query, you can precompute the number of pageviews for every hour for each URL. This is illustrated in figure 6.4.
To complete a query, you retrieve from the index the number of pageviews for every hour in the range, and sum the results. For a single year, you only need to pre- compute and index 8,760 values per URL (365 days, 24 hours per day). This is cer- tainly a more manageable number.
6.3
Recomputation algorithms vs. incremental algorithms
Because your master dataset is continually growing, you must have a strategy for updating your batch views when new data becomes available. You could choose a
recomputation algorithm, throwing away the old batch views and recomputing func- tions over the entire master dataset. Alternatively, an incremental algorithm will update the views directly when new data arrives.
As a basic example, consider a batch view containing the total number of records in your master dataset. A recomputation algorithm would update the count by first appending the new data to the master dataset and then counting all the records from scratch. This strategy is shown in figure 6.5.
2012/12/08 21:00 101 foo.com/blog 657 2012/12/08 20:00 foo.com/blog 1098 2012/12/08 19:00 foo.com/blog 413 2012/12/08 18:00 foo.com/blog 762 foo.com/blog 2012/12/08 17:00 foo.com/blog 2012/12/08 16:00 987 foo.com/blog 2012/12/08 15:00 876 # Pageviews Hour URL Function: sum Results: 2930
Figure 6.4 Computing the number of pageviews by querying an indexed batch view
Master dataset New data Merged dataset Recomputed view: 20,612,788 records Count
Figure 6.5 A recomputing algorithm to update the number of records in the master data- set. New data is appended to the master dataset, and then all records are counted.
An incremental algorithm, on the other hand, would count the number of new data records and add it to the existing count, as demonstrated in figure 6.6.
You might be wondering why you would ever use a recomputation algorithm when you can use a vastly more efficient incremental algorithm instead. But efficiency is not the only factor to be considered. The key trade-offs between the two approaches are performance, human-fault tolerance, and the generality of the algorithm. We’ll dis- cuss both types of algorithms in regard to each of these issues. You’ll discover that although incremental approaches can provide additional efficiency, you must also have recomputation versions of your algorithms.