In this section we motivate cache-transfer style (CTS). Sec. 17.2.1 summarizes a reformulation of ILC, so we recommend it also to readers familiar with Cai et al. [2014]. In Sec. 17.2.2 we consider a minimal first-order example, namely an average function. We incrementalize it using ILC, explain why the result is inefficient, and show that remembering results via cache-transfer style enables efficient incrementalization with asymptotic speedups. In Sec. 17.2.3 we consider a higher-order example that requires not just cache-transfer style but also efficient support for both nil and non-nil function changes, together with the ability to detect nil changes.

Chapter 17. Cache-transfer-style conversion 137

### 17.2.1 Background

ILC considers simply-typed programs, and assumes that base types and primitives come with support for incrementalization.

The ILC framework describes changes as first-class values, and types them using dependent types. To each type A we associate a type ∆A of changes for A, and an update operator ⊕ :: A → ∆A → A, that updates an initial value with a change to compute an updated value. We also consider changes for evaluation environments, which contain changes for each environment entry.

A change da :: ∆A can be valid from a1 :: A to a2 :: A, and we then write da ▷ a1 ↪ a2 :: A. Then a1 is the source or initial value for da, and a2 is the destination or updated value for da. From da ▷ a1 ↪ a2 :: A it follows that a2 coincides with a1 ⊕ da, but validity imposes additional invariants that are useful during incrementalization. A change can be valid for more than one source, but a change da and a source a1 uniquely determine the destination a2 = a1 ⊕ da. To avoid ambiguity, we always consider a change together with a specific source.
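As a minimal sketch of these definitions, one could render change types and ⊕ as a Haskell type class. The names `ChangeStruct`, `Delta`, `oplus` and `destination` are hypothetical, and representing integer changes as additive differences is an assumption made only for illustration.

```haskell
{-# LANGUAGE TypeFamilies #-}

-- Hypothetical class: the text only fixes the notation ∆A and ⊕.
class ChangeStruct a where
  type Delta a
  oplus :: a -> Delta a -> a

-- Assumption: an integer change is an additive difference.
instance ChangeStruct Integer where
  type Delta Integer = Integer
  oplus a da = a + da

-- Pairs change componentwise.
instance (ChangeStruct a, ChangeStruct b) => ChangeStruct (a, b) where
  type Delta (a, b) = (Delta a, Delta b)
  oplus (a, b) (da, db) = (a `oplus` da, b `oplus` db)

-- A source a1 and a change da determine the destination a1 ⊕ da.
destination :: Integer -> Integer -> Integer
destination a1 da = a1 `oplus` da
```

Under this encoding the same change, say 4, is valid from source 3 (with destination 7) and from source 0 (with destination 4), which illustrates why a change is always considered together with a specific source.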

Each type comes with its definition of validity: validity is a ternary logical relation. For function types A → B, we define ∆(A → B) = A → ∆A → ∆B, and say that a function change df :: ∆(A → B) is valid from f1 :: A → B to f2 :: A → B (that is, df ▷ f1 ↪ f2 :: A → B) if and only if df maps valid input changes to valid output changes; by that, we mean that if da ▷ a1 ↪ a2 :: A, then df a1 da ▷ f1 a1 ↪ f2 a2 :: B. Source and destination of df a1 da, that is, f1 a1 and f2 a2, are the results of applying two different functions, namely f1 and f2.
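The validity condition for function changes can be checked pointwise on a small example. The sketch below assumes integer changes are additive (so ⊕ is `+` and ⊖ is `-`); the functions `f1`, `f2`, `df` and `validAt` are hypothetical and chosen only to illustrate the definition.

```haskell
-- Assumption: Integer changes are additive differences.
type DInteger = Integer

-- Two different functions: the source and destination of the change.
f1, f2 :: Integer -> Integer
f1 x = 2 * x
f2 x = 2 * x + 1

-- A function change from f1 to f2, of type A -> ∆A -> ∆B.
-- It is computed naively, as the difference of the two results.
df :: Integer -> DInteger -> DInteger
df a1 da = f2 (a1 + da) - f1 a1

-- Validity at one point: df a1 da must go from f1 a1 to f2 a2,
-- where a2 = a1 ⊕ da.
validAt :: Integer -> DInteger -> Bool
validAt a1 da = f1 a1 + df a1 da == f2 (a1 + da)
```

Note that `df` accounts both for the change to the argument and for the change from `f1` to `f2`; even with `da = 0` it returns a non-zero output change.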

ILC expresses incremental programs as derivatives. Generalizing previous usage, we simply say derivative for all terms produced by differentiation. If dE is a valid environment change from E1 to E2, and term t is well-typed and can be evaluated against environments E1, E2, then term Dι⟦ t ⟧, the derivative of t, evaluated against dE, produces a change from v1 to v2, where v1 is the value of t in environment E1, and v2 is the value of t in environment E2. This correctness property follows immediately from the fundamental property for the logical relation of validity and can be proven accordingly; we give a step-indexed variant of this proof in Sec. 17.3.3. If t is a function and dE is a nil change (that is, its source E1 and destination E2 coincide), then Dι⟦ t ⟧ produces a nil function change and is also a derivative according to Cai et al. [2014].

To support incrementalization, one must define change types and validity for each base type, and a correct derivative for each primitive. Functions written in terms of primitives can be differentiated automatically. As in all approaches to incrementalization (see Sec. 17.6), one cannot efficiently incrementalize an arbitrary program: ILC limits the effort to base types and primitives.

### 17.2.2 A first-order motivating example: computing averages

Suppose we need to compute the average y of a bag of numbers xs :: Bag Z, and that whenever we receive a change dxs :: ∆(Bag Z) to this bag we need to compute the change dy to the average y.

In fact, we expect multiple consecutive updates to our bag. Hence, we say we have an initial bag xs1 and compute its average y1 as y1 = avg xs1, and then consecutive changes dxs1, dxs2, .... Change dxs1 goes from xs1 to xs2, dxs2 goes from xs2 to xs3, and so on. We need to compute y2 = avg xs2, y3 = avg xs3, but more quickly than we would by calling avg again.

We can compute the average through the following function (that we present in Haskell):

    avg xs =
      let s = sum xs
          n = length xs
          r = div s n
      in r

We write this function in A’-normal form (A’NF), a small variant of A-normal form (ANF) [Sabry and Felleisen, 1993] that we introduce. In A’NF, programs bind the result of each function call to a variable, as avg does, instead of using it directly; unlike plain ANF, A’NF also binds the result of tail calls, such as div s n in avg. A’NF simplifies conversion to cache-transfer style.

We can efficiently incrementalize both sum and length by generating their derivatives dsum and dlength via ILC, assuming a language plugin for bags that supports folds.
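To make this concrete, here is a hand-written sketch of what such derivatives could look like. It does not reproduce the output of an ILC plugin: the bag representation (a list of element and multiplicity pairs), the change representation (a bag whose multiplicities may be negative), and the names `sumBag`, `lengthBag` and `oplusBag` are all assumptions made for illustration.

```haskell
-- Assumption: a bag is a list of (element, multiplicity) pairs, and a
-- bag change is a bag whose multiplicities may be negative (removals).
type Bag = [(Integer, Integer)]
type DBag = Bag

sumBag :: Bag -> Integer
sumBag xs = sum [x * m | (x, m) <- xs]

lengthBag :: Bag -> Integer
lengthBag xs = sum [m | (_, m) <- xs]

-- Both derivatives ignore the base input xs and only traverse the
-- change dxs, so their cost depends on the size of the change.
dsum :: Bag -> DBag -> Integer
dsum _xs dxs = sumBag dxs

dlength :: Bag -> DBag -> Integer
dlength _xs dxs = lengthBag dxs

-- Updating a bag merges the change into it; sum and length do not
-- need the result to be normalized.
oplusBag :: Bag -> DBag -> Bag
oplusBag xs dxs = xs ++ dxs
```

For example, with `xs = [(1,1),(2,1),(3,1)]` and `dxs = [(10,1),(1,-1)]`, `dsum xs dxs` is 9 and `sumBag xs + dsum xs dxs` agrees with `sumBag (oplusBag xs dxs)`.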

But division is more challenging. To be sure, we can write the following derivative:

    ddiv a1 da b1 db =
      let a2 = a1 ⊕ da
          b2 = b1 ⊕ db
      in div a2 b2 ⊖ div a1 b1

Function ddiv computes the difference between the updated and the original result without any special optimization, but still takes O(1) for machine integers. But unlike other derivatives, ddiv uses its base inputs a1 and b1, that is, it is not self-maintainable [Cai et al., 2014].

Because ddiv is not self-maintainable, a derivative calling it will not be efficient. To wit, let us look at davg, the derivative of avg:

    davg xs dxs =
      let s  = sum xs
          ds = dsum xs dxs
          n  = length xs
          dn = dlength xs dxs
          r  = div s n
          dr = ddiv s ds n dn
      in dr

This function recomputes s, n and r just like in avg, but r is not used, so its recomputation can easily be removed by later optimizations. On the other hand, derivative ddiv will use the values of its base inputs a1 and b1, so derivative davg will need to recompute s and n and save no time over recomputation! If ddiv were instead a self-maintainable derivative, the computations of s and n would also be unneeded and could be optimized away. Cai et al. leave efficient support for non-self-maintainable derivatives for future work.

To avoid recomputation, we must simply remember intermediate results as needed. Executing davg xs1 dxs1 will compute exactly the same s and n as executing avg xs1, so we must just save and reuse them, without needing to use memoization proper. Hence, we CTS-convert each function f to a CTS function fC and a CTS derivative dfC: CTS function fC produces, together with its final result, a cache that the caller must pass to CTS derivative dfC. A cache is just a tuple of values containing information from subcalls: either inputs (as we explain in a bit), intermediate results, or subcaches, that is, caches returned from further function calls. In fact, typically only primitive functions like div need to remember actual results; automatically transformed functions only need to remember subcaches or inputs.

CTS conversion is simplified by first converting to A’NF, where all results of subcomputations are bound to variables: we just collect all caches in scope and return them.

As an additional step, we avoid always passing base inputs to derivatives by defining ∆(A → B) = ∆A → ∆B. Instead of always passing a base input and possibly not using it, we can simply assume that primitives whose derivative needs a base input store the input in the cache.

To make the translation uniform, we stipulate that all functions in the program are transformed to CTS, using a (potentially empty) cache. Since the type of the cache for a function f :: A → B depends on the implementation of f, we introduce for each function f a type for its cache FC, so that CTS function fC has type A → (B, FC) and CTS derivative dfC has type ∆A → FC → (∆B, FC).

The definition of FC is only needed inside fC and dfC, and it can be hidden from other functions to keep implementation details abstract in transformed code; because of limitations of Haskell modules, we can only hide such definitions from functions in other modules.

Since functions of the same type translate to functions of different types, the translation does not preserve well-typedness in a higher-order language in general, but it works well in our case studies (Sec. 17.4); Sec. 17.4.1 shows how to map such functions. We return to this point briefly in Sec. 17.5.1.

CTS-converting our example produces the following code:

    data AvgC = AvgC SumC LengthC DivC

    avgC :: Bag Z → (Z, AvgC)
    avgC xs =
      let (s, cs1) = sumC xs
          (n, cn1) = lengthC xs
          (r, cr1) = s `divC` n
      in (r, AvgC cs1 cn1 cr1)

    davgC :: ∆(Bag Z) → AvgC → (∆Z, AvgC)
    davgC dxs (AvgC cs1 cn1 cr1) =
      let (ds, cs2) = dsumC dxs cs1
          (dn, cn2) = dlengthC dxs cn1
          (dr, cr2) = ddivC ds dn cr1
      in (dr, AvgC cs2 cn2 cr2)

In the above program, sumC and lengthC are self-maintainable, that is they need no base inputs and can be transformed to use empty caches. On the other hand, ddiv is not self-maintainable, so we transform it to remember and use its base inputs.

    divC a b = (a `div` b, (a, b))

    ddivC da db (a1, b1) =
      let a2 = a1 ⊕ da
          b2 = b1 ⊕ db
      in (div a2 b2 ⊖ div a1 b1, (a2, b2))

Crucially, ddivC must return an updated cache to ensure correct incrementalization, so that application of further changes works correctly. In general, if (y, c1) = fC x and (dy, c2) = dfC dx c1, then (y ⊕ dy, c2) must equal the result of the base function fC applied to the new input x ⊕ dx, that is, (y ⊕ dy, c2) = fC (x ⊕ dx).
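The cache invariant can be checked on a self-contained, executable rendering of the average example. This is only a sketch under stated assumptions: bags are lists of (element, multiplicity) pairs, bag changes are bags with possibly negative multiplicities, integer changes are additive, the unit type stands for an empty cache, and `oplusBag` and `cacheInvariant` are hypothetical helpers.

```haskell
-- Assumed representations (not fixed by the text).
type Bag = [(Integer, Integer)]
type DBag = Bag

oplusBag :: Bag -> DBag -> Bag
oplusBag xs dxs = xs ++ dxs

-- Self-maintainable primitives: their caches are empty.
sumC :: Bag -> (Integer, ())
sumC xs = (sum [x * m | (x, m) <- xs], ())

dsumC :: DBag -> () -> (Integer, ())
dsumC dxs () = (fst (sumC dxs), ())

lengthC :: Bag -> (Integer, ())
lengthC xs = (sum [m | (_, m) <- xs], ())

dlengthC :: DBag -> () -> (Integer, ())
dlengthC dxs () = (fst (lengthC dxs), ())

-- Division remembers its base inputs in the cache.
divC :: Integer -> Integer -> (Integer, (Integer, Integer))
divC a b = (a `div` b, (a, b))

ddivC :: Integer -> Integer -> (Integer, Integer)
      -> (Integer, (Integer, Integer))
ddivC da db (a1, b1) =
  let a2 = a1 + da
      b2 = b1 + db
  in (div a2 b2 - div a1 b1, (a2, b2))

data AvgC = AvgC () () (Integer, Integer) deriving (Eq, Show)

avgC :: Bag -> (Integer, AvgC)
avgC xs =
  let (s, cs1) = sumC xs
      (n, cn1) = lengthC xs
      (r, cr1) = s `divC` n
  in (r, AvgC cs1 cn1 cr1)

davgC :: DBag -> AvgC -> (Integer, AvgC)
davgC dxs (AvgC cs1 cn1 cr1) =
  let (ds, cs2) = dsumC dxs cs1
      (dn, cn2) = dlengthC dxs cn1
      (dr, cr2) = ddivC ds dn cr1
  in (dr, AvgC cs2 cn2 cr2)

-- The invariant from the text: (y ⊕ dy, c2) == fC (x ⊕ dx).
cacheInvariant :: Bag -> DBag -> Bool
cacheInvariant xs dxs =
  let (y, c1) = avgC xs
      (dy, c2) = davgC dxs c1
  in (y + dy, c2) == avgC (xs `oplusBag` dxs)
```

For instance, starting from the bag containing 1, 2 and 3 and inserting 10, `davgC` only traverses the change and reads the cached pair (6, 3), yet the updated result and cache agree with rerunning `avgC` from scratch. As with the original avg, the sketch assumes bags stay non-empty, since `div` fails on a zero length.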

Finally, to use these functions, we need to adapt the caller. Let’s first show how to deal with a sequence of changes: imagine that dxs1 is a valid change for xs, and that dxs2 is a valid change for xs ⊕ dxs1. To update the average for both changes, we can then call avgC and davgC as follows:

    -- A simple example caller with two consecutive changes
    avgDAvgC :: Bag Z → ∆(Bag Z) → ∆(Bag Z) → (Z, ∆Z, ∆Z, AvgC)
    avgDAvgC xs dxs1 dxs2 =
      let (res1, cache1) = avgC xs
          (dres1, cache2) = davgC dxs1 cache1
          (dres2, cache3) = davgC dxs2 cache2
      in (res1, dres1, dres2, cache3)

Incrementalization guarantees that the produced changes update the output correctly in response to the input changes: that is, we have res1 ⊕ dres1 = avg (xs ⊕ dxs1) and res1 ⊕ dres1 ⊕ dres2 = avg (xs ⊕ dxs1 ⊕ dxs2). We also return the last cache to allow further updates to be processed.

Alternatively, we can try writing a caller that gets an initial input and a (lazy) list of changes, does incremental computation, and prints updated outputs:

    -- Process changes one at a time, printing each updated output
    processChangeList [] _ _ = return ()  -- no changes left
    processChangeList (dxsN : dxss) yN cacheN = do
      let (dy, cacheN′) = davgC dxsN cacheN
          yN′ = yN ⊕ dy
      print yN′
      processChangeList dxss yN′ cacheN′

    -- Example caller with multiple consecutive changes
    someCaller xs1 dxss = do
      let (y1, cache1) = avgC xs1
      processChangeList dxss y1 cache1
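The same cache-threading pattern can also be written as a pure function over any CTS function and derivative pair. The sketch below is an illustration, not part of the transformation: `runCTS` is a hypothetical helper, and the toy `sumC`/`dsumC` pair assumes that a list change is simply the list of appended elements.

```haskell
import Data.List (mapAccumL)

-- Hypothetical helper: run a CTS function once, then feed each change
-- to the CTS derivative, threading the cache from step to step.
-- It returns the initial output followed by each updated output.
runCTS
  :: (a -> (b, c))          -- CTS function fC
  -> (da -> c -> (db, c))   -- CTS derivative dfC
  -> (b -> db -> b)         -- update operator ⊕ on outputs
  -> a -> [da] -> [b]
runCTS fC dfC oplus x dxs =
  let (y1, cache1) = fC x
      step (y, cache) dx =
        let (dy, cache') = dfC dx cache
            y' = y `oplus` dy
        in ((y', cache'), y')
      (_, ys) = mapAccumL step (y1, cache1) dxs
  in y1 : ys

-- A toy CTS pair for illustration: summing a list, with an empty cache.
sumC :: [Integer] -> (Integer, ())
sumC xs = (sum xs, ())

-- Assumption: a list change is the list of appended elements.
dsumC :: [Integer] -> () -> (Integer, ())
dsumC dxs () = (sum dxs, ())
```

For example, `runCTS sumC dsumC (+) [1,2,3] [[4],[5,6]]` evaluates to `[6,10,21]`: the base input is traversed once, and each later step only looks at the change and the cache.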

More generally, we produce both an augmented base function and a derivative, where the augmented base function communicates with the derivative by returning a cache. The contents of this cache are determined statically, and can be accessed by tuple projections without dynamic lookups. In the rest of the paper, we use the above idea to develop a correct transformation that allows incrementalizing programs using cache-transfer style.

We’ll return to this example in Sec. 17.4.1.

### 17.2.3 A higher-order motivating example: nested loops

Next, we consider CTS differentiation on a minimal higher-order example. To incrementalize this example, we enable detecting nil function changes at runtime by representing function changes as closures that can be inspected by incremental programs. We’ll return to this example in Sec. 17.4.2.

We take an example of nested loops over sequences, implemented in terms of standard Haskell functions map and concat. For simplicity, we compute the Cartesian product of inputs:

    cartesianProduct :: Sequence a → Sequence b → Sequence (a, b)
    cartesianProduct xs ys =
      concatMap (λx → map (λy → (x, y)) ys) xs

    concatMap f xs = concat (map f xs)

Computing cartesianProduct xs ys loops over each element x from sequence xs and y from sequence ys, and produces a list of pairs (x, y), taking quadratic time O(n²) (we assume for simplicity that |xs| and |ys| are both O(n)). Adding a fresh element to either xs or ys generates an output change containing Θ(n) fresh pairs: hence derivative dcartesianProduct must take at least linear time. Thanks to specialized derivatives dmap and dconcat for primitives map and concat, dcartesianProduct has asymptotically optimal time complexity. To achieve this complexity, dmap f df must detect when df is a nil function change and avoid applying it to unchanged input elements.
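The linear lower bound can be seen on a deliberately restricted case. The sketch below uses plain Haskell lists instead of the Sequence type, and handles only one kind of change: appending elements to xs while ys stays unchanged. The function `dcartesianProductAppendXs` is hypothetical and is not the derivative obtained from dmap and dconcat.

```haskell
cartesianProduct :: [a] -> [b] -> [(a, b)]
cartesianProduct xs ys = concatMap (\x -> map (\y -> (x, y)) ys) xs

-- Assumption: the only change considered is appending the elements
-- dxs to xs, with a nil change for ys. The output change is then the
-- list of pairs to append to the old output.
dcartesianProductAppendXs :: [a] -> [b] -> [(a, b)]
dcartesianProductAppendXs dxs ys = cartesianProduct dxs ys
```

Appending a single element to xs yields exactly |ys| new pairs, so the output change alone has linear size, and `cartesianProduct (xs ++ dxs) ys` equals `cartesianProduct xs ys ++ dcartesianProductAppendXs dxs ys`.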

To simplify the transformations we describe, we λ-lift programs before differentiating and transforming them.

    cartesianProduct xs ys = concatMap (mapPair ys) xs
    mapPair ys = λx → map (pair x) ys
    pair x = λy → (x, y)
    concatMap f xs =
      let yss = map f xs
      in concat yss

Suppose we add an element to either xs or ys. If change dys adds one element to ys, then dmapPair ys dys, the argument to dconcatMap, is a non-nil function change taking constant time, so dconcatMap must apply it to each element of xs ⊕ dxs.

Suppose next that change dxs adds one element to xs and dys is a nil change for ys. Then dmapPair ys dys is a nil function change, and we must detect this dynamically. If a function change df :: ∆(A → B) is represented as a function, and A is infinite, one cannot detect dynamically that it is a nil change. To enable runtime nil change detection, we apply closure conversion on function changes: a function change df, represented as a closure, is nil for f only if all environment changes it contains are also nil, and if the contained function is a derivative of f’s function.
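A rough sketch of why closures help: once a function change is data rather than an opaque function, an incremental program can inspect its environment change. Everything below is hypothetical and much simpler than a full closure conversion: closures are defunctionalized into single-constructor data types, list changes are lists of appended elements, and `dmapUnchanged` covers only the case where the mapped list itself is unchanged.

```haskell
-- Hypothetical closure for mapPair ys: a code tag plus the captured
-- environment ys.
data Fun = MapPair [Int]

-- Hypothetical closure for dmapPair ys dys: it carries the environment
-- change. Assumption: a list change is the list of appended elements.
data DFun = DMapPair [Int]

applyFun :: Fun -> Int -> [(Int, Int)]
applyFun (MapPair ys) x = map (\y -> (x, y)) ys

-- Nil detection inspects the environment change instead of calling df.
isNil :: DFun -> Bool
isNil (DMapPair dys) = null dys

-- Output change of f x for an unchanged x: the pairs to append.
applyDFun :: DFun -> Int -> [(Int, Int)]
applyDFun (DMapPair dys) x = map (\y -> (x, y)) dys

-- Sketch of dmap when xs is unchanged: skip all per-element work if df
-- is nil. Nothing stands for "no element of the output changed".
dmapUnchanged :: DFun -> [Int] -> Maybe [[(Int, Int)]]
dmapUnchanged df xs
  | isNil df  = Nothing
  | otherwise = Just (map (applyDFun df) xs)
```

With a nil change such as `DMapPair []`, `dmapUnchanged` returns without touching any element of xs, whereas a function change represented as an ordinary Haskell function would have to be applied to every element to learn the same thing.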