The Caching Problem - SRC RR 177 pdf

To understand the problem introduced by dynamic dependencies, consider the prototypical case of caching a compilation. Assume that a compilation is invoked by a call to the following function:

Here, filename is the name of the file to compile, options is a binding denoting the command-line options to be passed to the compiler, and “.” is the current environment.1

The current environment “.” is a binding value that includes a representation of a file system directory tree. This directory tree contains all the files necessary to perform the compilation. As described in Chapter 5, the Vesta language makes it easy to construct and extend bindings, so a custom file system can be constructed quite cheaply for each build.

Figure 6.1 shows a sample environment. Two paths in “.” are special: root and root/.WD. As described in Chapter 3, external tools such as compilers and linkers are run in an encapsulated environment in which all references to files are trapped by the repository and serviced by the evaluator. To service a file request from the repository, the evaluator looks up the file in the current environment: absolute pathnames are looked up in the root subtree, while relative pathnames are looked up in the root/.WD subtree.

root
    .WD
        defs.h
        hello.c
    usr
        include
            stdio.h
        lib
            libc.a

Figure 6.1: The file system directory trees of a sample environment.
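The lookup rule for trapped file references can be sketched in a few lines. This is a minimal illustration, not the evaluator's code: the nested-dictionary model of “.”, the placeholder file contents, and the name `resolve` are all assumptions made for the example.

```python
# Hypothetical model: the environment "." as nested dicts, with leaves
# standing in for file contents. The layout follows the sample
# environment of Figure 6.1.
ENV = {
    "root": {
        ".WD": {"defs.h": "<defs.h contents>", "hello.c": "<hello.c contents>"},
        "usr": {
            "include": {"stdio.h": "<stdio.h contents>"},
            "lib": {"libc.a": "<libc.a contents>"},
        },
    }
}


def resolve(env, pathname):
    """Look up a trapped file reference; return None if nothing is bound."""
    components = [c for c in pathname.split("/") if c]
    if pathname.startswith("/"):
        prefix = ["root"]          # absolute: search the root subtree
    else:
        prefix = ["root", ".WD"]   # relative: search the root/.WD subtree
    node = env
    for name in prefix + components:
        if not isinstance(node, dict) or name not in node:
            return None
        node = node[name]
    return node
```

Under these assumptions, `resolve(ENV, "hello.c")` finds the file under root/.WD, while `resolve(ENV, "/usr/include/stdio.h")` is served from the root subtree.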

Now consider the following invocation of the compile function:

compile("hello.c", [debug = "-g"], .);

Let “.” be the binding shown in Figure 6.1. When invoked, this function returns the singleton binding that maps the name “hello.o” to the derived object file produced by compiling the file bound to ./root/.WD/hello.c.

To effectively cache this call, the evaluator will compute dynamic, fine-grained dependencies. This function invocation might be found to depend on the definition of the compile function itself, the values of the first two arguments, and the following parts of the environment:

./root/usr/lib/cmplrs/cc
./root/.WD/hello.c
./root/usr/include/stdio.h

1 As described in Section 5.2.1, the final “.” parameter can be omitted, in which case it is passed

Notice that even the reference to the compiler (cc) is trapped and recorded as a dependency. The cache entry for the compilation combines all these values into a cache key, and associates that key with the resulting value.
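To make the idea of trapping concrete, here is a small sketch of how references to the environment could be recorded as they happen. It is an illustration under stated assumptions: `TracingEnv` and `fake_compile` are hypothetical names, the environment is flattened to a path-to-contents dictionary, and the toy “compiler” only scans for `#include <...>` lines.

```python
class TracingEnv:
    """Hypothetical stand-in for an environment whose lookups are trapped."""

    def __init__(self, files):
        self._files = files   # path -> file contents (str)
        self.deps = []        # paths, in order of first reference

    def read(self, path):
        if path not in self.deps:
            self.deps.append(path)   # record the dependency, even on a miss
        return self._files.get(path)


def fake_compile(filename, env):
    """Toy 'compiler': touches the compiler binary, the source, and its includes."""
    env.read("./root/usr/lib/cmplrs/cc")
    source = env.read("./root/.WD/" + filename) or ""
    for line in source.splitlines():
        if line.startswith("#include <") and line.endswith(">"):
            header = line[len("#include <"):-1]
            env.read("./root/usr/include/" + header)
    return {filename[:-2] + ".o": "<object code>"}


files = {
    "./root/usr/lib/cmplrs/cc": "<compiler binary>",
    "./root/.WD/hello.c": "#include <stdio.h>\nint main() { return 0; }",
    "./root/.WD/defs.h": "<defs.h contents>",
    "./root/usr/include/stdio.h": "<stdio.h contents>",
}
env = TracingEnv(files)
result = fake_compile("hello.c", env)
```

After the call, `env.deps` holds exactly the three paths listed above. The file defs.h is present in the environment but was never referenced, so it is not a dependency of this evaluation.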

It is worth noting here what we mean by “combining” values together to form a cache key. Since the values used to form the cache key can be quite large, the evaluator and function cache use fingerprints of the values instead of the values themselves. A fingerprint is a small, fixed-size hash of an arbitrary byte sequence [10, 42]. Fingerprints come with a mathematical guarantee bounding the probability of a collision; by choosing long enough fingerprints, the probability of a collision can be made vanishingly small.2 As a result, fingerprints can be used as a basis for equality tests, since we can safely assume that FP(a) = FP(b) ⇒ a = b. Two operations supported on fingerprints are extending a fingerprint by more bytes and extending a fingerprint by another fingerprint. In the latter case, we write fp₁ ⊕ fp₂ to denote the result of extending fp₁ by fp₂. The ⊕ operation is non-commutative. Fingerprints have the advantage that they are small and can be easily combined without forfeiting their probabilistic guarantee. Storing just the fingerprints of values contributing to the cache key is sufficient because the only operations the function cache needs to perform on such values are combining them and comparing them for equality, both of which can be done just as well on fingerprints as on the full values. Result values, of course, must be stored in full so that they can be supplied to the evaluator in the event of a cache hit.
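The two fingerprint operations can be imitated with an ordinary hash function. The sketch below uses BLAKE2b from Python's standard library purely as a stand-in: it is not the fingerprinting scheme of [10, 42] and does not carry the mathematical collision bound described above. The function names are hypothetical, and the 16-byte size mirrors the 128-bit fingerprints mentioned in the footnote.

```python
import hashlib

FP_BYTES = 16  # 128-bit fingerprints


def fingerprint(data: bytes) -> bytes:
    """Fixed-size fingerprint of an arbitrary byte sequence (stand-in hash)."""
    return hashlib.blake2b(data, digest_size=FP_BYTES).digest()


def extend_with_bytes(fp: bytes, more: bytes) -> bytes:
    """Extend an existing fingerprint by more bytes."""
    return hashlib.blake2b(fp + more, digest_size=FP_BYTES).digest()


def extend_with_fp(fp1: bytes, fp2: bytes) -> bytes:
    """Extend fp1 by another fingerprint fp2; the order of arguments matters."""
    return extend_with_bytes(fp1, fp2)
```

In this sketch, extending `fingerprint(b"a")` by `fingerprint(b"b")` gives a different result from extending in the opposite order, which is the non-commutativity noted above, and every result stays 16 bytes long however many values are combined.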

We now come to the main caching problem in Vesta: how can a cache lookup be performed when the full set of dependencies is not known until after the function has been evaluated? Given the use of dynamic dependencies, it would seem that when the evaluator reaches a new function call site, it must evaluate the function before it has the key needed to look the call up in the cache! Obviously the cache would then be useless. Alternatively, the evaluator could search linearly through the entire cache, examining each entry’s dependencies and checking whether their values at the call site match those recorded in the entry. A cache designed this way would be too slow to be useful.

The solution to this chicken-and-egg problem lies in separating the cache key into two parts, primary and secondary. Lookup then becomes a two-step process. First, the evaluator computes the primary key in a fixed way solely from information available at the call site. A cache lookup using this primary key yields a small number of candidate cache entries. Next, the secondary key of each candidate entry is checked to see if it matches the values at the call site.

2 For safety, Vesta uses 128-bit fingerprints. Using the numbers in Section 3.3 and other conservative estimates, we compute that the probability of a collision occurring over the expected lifetime of the Vesta system is much less than 2!

In more detail, we divide the dependencies into two groups: those that are known statically at the function call site, which we call primary dependencies, and those that are only known dynamically when the function is evaluated, which we call secondary dependencies.3 Since primary dependencies are known at the call site, they can be used to compute the primary key and reduce the number of cache entries that must be considered during a cache lookup operation.

The primary key (PK) is formed by combining the fingerprints of the primary dependency values. Each secondary dependency consists of a name and the fingerprint of the corresponding value. Together, these secondary dependency names and fingerprints form the cache entry’s secondary key (SK). Overall, then, a cache entry is a triple of the following form:

⟨primary-key, secondary-key, result-value⟩

Here, the primary key is a single fingerprint, the secondary key is a set of (name, fingerprint) pairs, and the result value is the function’s full result value, suitable for use by the evaluator in the event of a cache hit.

Figure 6.2 shows the primary and secondary keys computed for the example compilation above. First, the fingerprint Q of the compile function itself (a closure value) and the fingerprints R and S of the first two function arguments are computed. These fingerprints are then combined to form a new fingerprint A, the primary key. When the function is evaluated, references to “.” are trapped and the names and fingerprints B, C, and D of the corresponding values are recorded as secondary dependencies. The cache entry formed for this function evaluation is a triple consisting of the primary key, secondary dependency names and fingerprints, and the result value of the evaluation.

It is common for multiple cache entries to have the same primary key. In particular, this occurs whenever a source file is edited and recompiled. Figure 6.3 shows an example. In that figure, the two columns of fingerprints on the right denote two different cache entries. Both cache entries correspond to the compilation of a file named “hello.c” with the same compilation switches, so both entries have the same primary key A. However, between the two compilations, the source file “hello.c” has been edited, so the fingerprint for the corresponding secondary dependency has changed from C to E. Since the secondary dependencies for the two evaluations are different, two different entries are stored in the cache.
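The two-step lookup, and the situation in which an edit leaves two entries under one primary key, can be sketched as follows. This is a simplified model rather than Vesta's function cache: the class `FunctionCache`, its method names, and the use of single letters as stand-ins for fingerprints are assumptions made for illustration.

```python
from collections import defaultdict


class FunctionCache:
    """Toy cache: primary key -> list of (secondary key, result value)."""

    def __init__(self):
        self._entries = defaultdict(list)

    def add(self, pk, sk, result):
        # sk maps secondary dependency names to fingerprints.
        self._entries[pk].append((dict(sk), result))

    def lookup(self, pk, fingerprint_at_site):
        """Step 1: select candidates by primary key.
        Step 2: return the first candidate whose secondary key matches
        the fingerprints at the call site, or None on a miss."""
        for sk, result in self._entries.get(pk, []):
            if all(fingerprint_at_site(name) == fp for name, fp in sk.items()):
                return result
        return None


CC = "./root/usr/lib/cmplrs/cc"
SRC = "./root/.WD/hello.c"
HDR = "./root/usr/include/stdio.h"

cache = FunctionCache()
cache.add("A", {CC: "B", SRC: "C", HDR: "D"}, "hello.o before the edit")
cache.add("A", {CC: "B", SRC: "E", HDR: "D"}, "hello.o after the edit")

# Call site whose hello.c currently has fingerprint E.
site = {CC: "B", SRC: "E", HDR: "D"}
hit = cache.lookup("A", site.get)

# Call site whose hello.c has a fingerprint not seen before.
site_new = {CC: "B", SRC: "Z", HDR: "D"}
miss = cache.lookup("A", site_new.get)
```

Here `hit` is the result stored by the second entry, and `miss` is `None`, so the evaluator would have to run the function and add a third entry under the same primary key. Only the names recorded in a candidate's secondary key are examined at the call site, so the cost of step 2 depends on the number of candidates, not on the size of the environment.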

3 As described in Section 7.4.1, the Vesta evaluator uses heuristics and user-supplied pragmas to

compile("hello.c", [debug = "-g"], .);

Primary key (PK): A, formed by combining
    Q    the compile function (closure)
    R    "hello.c"
    S    [debug = "-g"]

Secondary key (SK):
    ./root/usr/lib/cmplrs/cc        B
    ./root/.WD/hello.c              C
    ./root/usr/include/stdio.h      D

Figure 6.2: The primary key (PK) and secondary key (SK) of a single cache entry.

Entries                             0    1
Primary Key (PK)                    A    A
SK  ./root/usr/lib/cmplrs/cc        B    B
SK  ./root/.WD/hello.c              C    E
SK  ./root/usr/include/stdio.h      D    D

Figure 6.3: Two cache entries with the same primary key (PK).

An important consequence of using dynamic fine-grained dependencies is that the set of secondary dependency names may differ from one cache entry to the next, even among entries with the same primary key. An example is shown in Figure 6.4, where cache entries 1 and 2 have different sets of secondary dependency names because the file “hello.c” was edited to include a file named “defs.h” instead of “stdio.h”. The difference between entries 2 and 3 is that the file “defs.h” was changed.

Entries                             0    1    2    3    X
Primary Key (PK)                    A    A    A    A    A
SK  ./root/usr/lib/cmplrs/cc        B    B    B    B    B
SK  ./root/.WD/hello.c              C    E    F    F    F
SK  ./root/usr/include/stdio.h      D    D    -    -    J
SK  ./root/.WD/defs.h               -    -    G    H    H

A dash marks a name that is not among the entry's secondary dependencies.

Figure 6.4: Multiple cache entries with the same primary key (PK), but with differing sets of secondary dependency names.

Figure 6.4 also demonstrates an important property of the Vesta caching strategy. For Vesta caching to be correct, all user-defined functions and external tools must be functional and deterministic: the same (possibly dynamic) inputs should always produce the same output.4 As a result, cache entries like the one labeled “X” in the figure should not be created. The fingerprints of cache entry X agree with those of cache entry 3 for those secondary dependencies shared by the two entries, but entry X has an additional secondary dependency on the file “stdio.h”. Functional tools like compilers never produce such entries, and neither do user-defined functions written in the Vesta language.

4 This requirement is stated more strictly than is actually necessary. For example, some compilers
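The relationship between entries 3 and X can be expressed as a small predicate. This is only an illustration of the pattern described above, not a check that the source says Vesta performs. The function name is hypothetical, and the fingerprint letters follow a reading of Figure 6.4.

```python
def adds_dependencies_without_cause(sk_existing, sk_candidate):
    """True if sk_candidate has every name in sk_existing with the same
    fingerprint, plus at least one additional name. A deterministic
    function that saw the same values would not have read anything more."""
    strictly_more_names = sk_existing.keys() < sk_candidate.keys()
    return strictly_more_names and all(
        sk_candidate[name] == fp for name, fp in sk_existing.items()
    )


CC = "./root/usr/lib/cmplrs/cc"
SRC = "./root/.WD/hello.c"
STDIO = "./root/usr/include/stdio.h"
DEFS = "./root/.WD/defs.h"

entry_2 = {CC: "B", SRC: "F", DEFS: "G"}
entry_3 = {CC: "B", SRC: "F", DEFS: "H"}
entry_x = {CC: "B", SRC: "F", STDIO: "J", DEFS: "H"}

x_vs_3 = adds_dependencies_without_cause(entry_3, entry_x)
x_vs_2 = adds_dependencies_without_cause(entry_2, entry_x)
e3_vs_2 = adds_dependencies_without_cause(entry_2, entry_3)
```

With these values, only the pair (3, X) is flagged. Entries 2 and X disagree on defs.h, and entries 2 and 3 have the same names with different fingerprints, so both of those pairs are consistent with a deterministic tool.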
