7.4 A case study
7.4.1 MapReduce use cases
In our case study, we initially considered five different MapReduce programs that are presented as typical use cases in Google’s original paper on MapRe- duce [DG04]. Two of these use cases, namely computing word histograms and revers web-link graphs, have already been introduced in Section 7.1.1. However, we will revisit these use cases to explain how the output data is represented in the HBase data model, and introduce the remaining use cases in greater detail in the following.
Word histogram The MapReduce program for computing word histograms has been presented in Section 7.1.1. A word histogram is essentially a list of words, each with an occurrence count. There are multiple ways to represent word histograms in the HBase data model. However, it seems essential that the frequency of any given word can be retrieved efficiently. Hence, the word itself should act as row key and the occurrence count is stored in some fixed column family under some fixed qualifier.
Count of URL access frequency The MapReduce program processes logs of web page requests and computes the access frequency per URL. The computa- tion is very similar to the computation of word histograms. The MapReduce program differs only in the Map function that needs to parse URLs from web logs instead of words from web pages. Similarly, the representation of the result data in the HBase data model resembles that of word histograms.
Reverse web-link graph The MapReduce program for computing reverse web- link graphs has been presented in Section 7.1.1. The resulting reverse web-link graph is represented as a list of target URLs, each with a set of associated source URLs. It seems natural to use the target URL as row key in the HBase data model. In this way, the referencing web pages can be efficiently retrieved for any given URL. The set of referencing web pages could be encoded as a single string using some separator character. However, is seems favorable to exploit HBase column qualifiers to impose more structure. Recall that column qualifiers can be dynamically created and dropped. Each referencing URL can thus be stored as a column qualifier within some fixed column family. Note that the actual column value can be left empty. Alternatively, it may be used to store an occurrence count for each target URL. We will elaborate more on this aspect in Section 7.4.4.
Term-vector per host A term vector summarizes the (most important) words that occur in a set of documents stored at the same host. Note that “term vector” is just another term for word histogram. That is, the MapReduce program essentially computes word histograms for multiple sets of documents at the same time. To do so, the Map function computes a term vector for each input document and emits<hostname, term vector> intermediate key-value pair. Note that the intermediate value is complex structured. The Reduce function consumes all per-document term vectors for a given host and adds them together.
For the representation of per-host term vectors in the HBase data model, it seems natural to use the host name as row key. Words within a given term vector may be stored as column qualifiers (within a fixed column family) and the per-word frequency count as column value.
Inverted index Inverted indexes are popular in information retrieval and full text searching and essentially provide a mapping from words to documents that contain those words. To compute inverted indexes using MapReduce, a Map function that parses each document and emits<word, URL > intermediate key- value pairs is used. The Reduce function collects all URLs associated with a given word. Note that computing inverted indexes is quite similar to computing reverse web-link graphs and that the output data is equally structured. For this reason, it can be mapped into the HBase data model in much the same way, i.e. words act as row keys and URLs as column qualifiers.
Two more MapReduce use cases are mentioned in [DG04], namely distributed grep and distributed sort. We will not consider these use cases here, because distributed grep is an embarrassingly parallel problem and incremental main-
tenance is trivial. Distributed sort is not interesting when HBase is used at the storage layer, because HBase data is always stored in sort order.
All MapReduce use cases we consider process large collections of web-related documents. These documents will change over time (as the web is being crawled, for instance). Hence, the result of any computation will become more and more stale. To obtain recent results, the MapReduce program can sim- ply be re-executed. However, typically only a fraction of base documents did actually change, making this approach rather inefficient. Incremental recom- putation approaches, in contrast, avoid re-processing unchanged base data and are thus often preferable.
To perform incremental recomputations, dedicated MapReduce programs may be derived from the original programs in a similar way that incremental SQL/RA expressions are derived from view definitions for view maintenance. We will refer to such programs as incremental MapReduce programs. The latter problem has been extensively studied in the database research community as discussed in Section 2.3.3. Since all MapReduce use cases included in our case study perform aggregation-like tasks, work on the maintenance of aggregate views proved most relevant. We found that strictly algebraic approaches (such as [GL95, Qua96]) could not easily be transferred into the MapReduce envi- ronment. However, we found the so-called summary delta algorithm proposed and refined in [MQM97, LYC+00, GM06] to be a better fit for the MapRe-
duce programming model. The summary delta algorithm will be discussed in Section 7.4.3.