• No results found

An example batch layer: Architecture and algorithms

8.1.2 Batch views

Next we’ll review the batch views needed to satisfy each query. The key to each batch view is striking a balance between the size of the precomputed views and the amount of on-the-fly computation required at query time.

PAGEVIEWS OVER TIME

You want to be able to retrieve the number of pageviews for a URL for any time range down to the granularity of an hour. As mentioned in chapter 4, precomputing the pageview counts for every possible time range is infeasible, as that would require an unmanageable 380 million precomputed values for every URL for each year covered by the dataset. Instead, you can precompute a smaller number and require more com- putation to be done at query time.

The simplest approach is to precompute the number of pageviews for each URL

for every hour bucket. This would result in a batch view that looks like figure 8.2. To resolve a query, you retrieve the value for every hour bucket in the time range, and sum the values together.

Pageview

Pageview Pageview Pageview Pageview

Equiv Equiv Equiv Person (Cookie): KLMNO Person (Cookie): ABCDE Person (Cookie): FGHIJ Person (UserID): 123 Person (UserID): 200 Person (UserID): 87 Page: http://foo.com/about Page: http://foo.com/blog

A person may acess the same URL using multiple user identifiers.

Users with only one identifier will not have any equiv edges.

c

b

Figure 8.1 Examples of different pageviews for the same person being captured using different identifiers

URL foo.com/blog foo.com/blog foo.com/blog foo.com/blog foo.com/blog foo.com/blog foo.com/blog 2012/12/08 15:00 2012/12/08 16:00 2012/12/08 17:00 2012/12/08 18:00 2012/12/08 19:00 2012/12/08 20:00 2012/12/08 21:00 876 987 762 413 1098 657 101 Hour # Pageviews Function: sum Results: 2930

But there’s a problem with this approach—the query becomes slower as you increase the size of the time range. Finding the number of pageviews for a one-year time period requires approximately 8,760 values to be retrieved from the batch view and added together. Since many of those values are going to be served from disk, this can cause the latency of queries with large ranges to be substantially higher than queries with small ranges.

Fortunately, the solution is simple. Instead of precomputing values only using an hourly granularity, you can also precompute at coarser granularities such as 1-day, 7- day (1-week), and 28-day (1-month) intervals. An example best demonstrates how this improves latency.

Suppose you want to compute the number of pageviews from March 3 at 3 a.m. through September 17 at 8 a.m. If you only used hourly values, this query would require retrieving and summing the values for 4,805 hour buckets. Alternatively, using coarser granularities can substantially reduce the number of retrieved values. The idea is to retrieve values for each month between March 3 and September 17, and then add or subtract values for more refined intervals to get the desired range. This idea is illustrated in figure 8.3.

For this query, only 26 values need to be retrieved—almost a 200x improvement! You may wonder how expensive it is to precompute values for the 1-day, 7-day, and 28-day intervals in addition to the hourly buckets. Astonishingly, there is hardly any addi- tional cost. Figure 8.4 shows how many time buckets are needed for each granularity for a one-year period.

Month

Week

Day

Query range

Bucketing values at different granularities reduces the number of retrievals for queries involving large time ranges.

The strategy is to retrieve the values for the large buckets; then add (horizontal cross-hatching) or subtract (diagonal cross-hatching) values for smaller buckets to cover the desired query range.

Figure 8.3 Optimizing pageviews over large query ranges using coarser granularities

Granularity hourly daily weekly monthly 8760 ~ 365 ~ 52 ~ 13

Number of buckets in 1 year

Figure 8.4 Number of buckets in a one-year period for each granularity

Adding up the numbers, the 1-day, 7-day, and 28-day buckets require an additional 430 values to be precomputed for every URL for a one-year period. That’s only a 5% increase in precomputation for a 200x reduction in the query-time work for large ranges—a more than acceptable trade-off.

UNIQUE VISITORS OVER TIME

The next query type determines the number of unique visitors for a specified time interval. This seems like it should be similar to pageviews over time, but there is one key difference: unique counts are not additive. Whereas you can get the total number of pageviews for a two-hour period by adding the values for the individual hours together, you can’t do the same for this query type. This is because a unique count represents the size of a set of elements, and there may be overlap between the sets for each hour. If you simply added the counts for the two hours together, you’d double- count the people who visited the URL in both time intervals.

The only way to compute the number of uniques with perfect accuracy over any time range is to compute the unique count on the fly. This requires random access to the set of visitors for each URL for each hour time bucket. This is doable but expen- sive, as essentially your entire master dataset must be indexed. Alternatively, you can use an approximation algorithm that sacrifices some accuracy to vastly decrease the amount of data to be indexed in the batch view. An example of an approximation algorithm for distinct counting is the HyperLogLog algorithm. For every URL and hour bucket, HyperLogLog only requires information on the order of 1 KB to esti- mate set cardinalities of up to one billion with a maximum 2% error rate.1

Although it’s an intriguing algorithm, we want to avoid becoming sidetracked with the details of HyperLogLog. Instead, let’s treat it as a black box and focus on its interface:

interface HyperLogLog { long size();

void add(Object o);

HyperLogLog merge(HyperLogLog... otherSets); }

Each HyperLogLog object represents a set of elements and supports adding new ele- ments to the set, merging with other HyperLogLog sets, and retrieving the size of the set. Using HyperLogLog makes the uniques-over-time query very similar to the pageviews-over-time query. The key differences are that a relatively larger value is com- puted for each URL and time bucket, and the HyperLogLog merge function is used to combine time buckets instead of summing counts together. As with pageviews over time, HyperLogLog sets for 1-day, 7-day, and 28-day granularities are created to reduce the amount of work to be done at query time.

1 The HyperLogLog algorithm is described in a research paper titled “HyperLogLog: the analysis of a near- optimal cardinality estimation algorithm” by Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier, available at http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf.

BOUNCE-RATE ANALYSIS

The final query type is to determine the bounce rate for every domain. The batch view for this query is simple: a map from each domain to the number of bounced visits and the total number of visits. The bounce rate is simply the ratio of these two values.

The key to precomputing these values is defining what exactly constitutes a visit. We’ll define two pageviews as being part of the same visit if they are from the same user to the same domain and are separated by less than half an hour. A visit is consid- ered a bounce if it only contains one pageview.

8.2

Workflow overview

Now that the specific requirements for the batch views are understood, we can define the batch layer workflow at a high level. The basis of the workflow is illustrated in fig- ure 8.5.

At the start of the batch layer workflow is a single folder on the distributed filesys- tem that contains the master dataset. The first step is simply to take any new data that has accumulated since the last time the batch layer ran and append it to the master dataset.

The next two steps normalize the data in preparation for computing the batch views. The first normalization step accounts for the fact that different URLs can refer

Append new data

URL normalization User ID normalization Remove duplicate pageview events Pageviews-over-time batch view Unique-visitors-over-

time batch view Bounce-rate batch view

Incorporate newly arrived data into the master dataset.

Different URLs may refer to the same web resource.

A single user may have multiple identifiers.

Batch views depend on processing distinct events.

After data is prepared, batch views can be computed in parallel.

Collect new data:

011011011... b

c d e

f

to the same resource. For example, the distinct URLs www.mysite.com/blog/1?utm=1

and http://mysite.com/blog/1 refer to the same location. This first normalization step transforms all URLs to a standard format so that future computations can cor- rectly aggregate the data.

The second normalization step is needed because data for the same person may exist under different user identifiers. In order to support queries about visits and visi- tors, you must select a single identifier for each person. This latter normalization step processes the equiv graph to accomplish this task. Since the batch views only make use of the pageviews data, only the pageview edges will be converted to use these selected user IDs.

The next step deduplicates the pageview events. Recall from chapter 2 the advan- tages of having your data units contain enough information to make them uniquely identifiable. In problematic scenarios (such as network partitioning), it’s common to register the same pageview multiple times to ensure that the event is recorded. De- duplicating the pageviews is necessary to compute the batch views, as they depend on the distinct events in the dataset.

The final step is to use the normalized data to compute the batch views described in the previous section. Note that this workflow is a pure recomputation workflow— every time new data is added, the batch views are recomputed from scratch. In a later chapter, you’ll learn that in many cases you can incrementalize the batch layer such that recomputing using the entire master dataset is not always required. But it’s abso- lutely essential to have the pure recomputation workflow defined, because you need to recompute from scratch in case the views become corrupted.

Let’s now look at the design for each step in more detail. We’ll focus on architec- ture and algorithms, showing pipe diagrams for every data-transformation step, and we’ll leave the nitty-gritty code details for next chapter.

8.3

Ingesting new data

One approach you might take to add data to the master dataset is to insert new files into the master dataset folder as the new data comes in. But there’s a problem with this approach. Suppose your batch workflow needs to run multiple computations over the master dataset, such as to compute multiple views. The computations may start at differ- ent times, which means every view will be representative of the master dataset at different times. While this is not necessarily a deal-breaker, we think it’s much simpler to reason about your views when you know they’re all based on the exact same master dataset.

One simple way to deal with this problem is to have new data written into a new- data/ folder. Then the first step of the batch workflow is to move whatever data was in new-data/ into the master dataset folder (potentially vertically partitioning it in the process). Once the data has been moved, the corresponding files in new-data/ are deleted. The result is that the batch workflow has full control over when data is added to the master dataset and can ensure that every batch view is based on the exact same master dataset.