16.2 Predicting nil changes

16.2.1 Self-maintainability

In databases, a self-maintainable view [Gupta and Mumick, 1999] is a function that can update its result from input changes alone, without looking at the actual input. By analogy, we call a derivative self-maintainable if it uses no base parameters, only their changes. Self-maintainable derivatives describe efficient incremental computations: since they do not use their base input, their running time need not depend on the input size.

Example 16.2.1

D ⟦ merge ⟧ = λx dx y dy → merge dx dy is self-maintainable with the change structure on Bag S described in Example 10.3.1, because it does not use the base inputs x and y. Other derivatives are self-maintainable only in certain contexts. The derivative of element-wise function application (map f xs) ignores the original value of the bag xs if the changes to f are always nil, because the underlying primitive foldBag is self-maintainable in this case (as discussed in the next section). We take advantage of this by implementing a specialized derivative for foldBag.

Similarly, we have seen in Sec. 10.6 that, without optimizations, dgrandTotal needlessly recomputes merge xs ys. However, that result is a base input to fold′. In the next section, we replace fold′ by a self-maintainable derivative (based again on foldBag) and avoid this recomputation. □

To conservatively predict whether a derivative is going to be self-maintainable (and thus efficient), one can inspect whether the program restricts itself to (conditionally) self-maintainable primitives, like merge (always) or map f (only if df is nil, which is guaranteed when f is a closed term).
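To make the always-self-maintainable case concrete, here is a minimal Scala sketch (ours, not part of the plugin), written against the Bag primitives of Fig. 16.2 in the next section; as in Example 10.3.1, bag changes are themselves bags.

// Sketch: merge's derivative from Example 16.2.1, rendered in Scala.
// It never inspects the base bags xs and ys, so its running time depends
// only on the size of the changes dxs and dys: it is self-maintainable.
def mergeDerivative[A](xs: Bag[A], dxs: Bag[A], ys: Bag[A], dys: Bag[A]): Bag[A] =
  groupOnBags[A].merge(dxs, dys)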

To avoid recomputing base arguments for self-maintainable derivatives (which never need them), we currently employ lazy evaluation. Since we could use standard techniques for dead-code elimination [Appel and Jim, 1997] instead, laziness is not central to our approach.

A significant restriction is that derivatives that are not self-maintainable require their base arguments, which can be expensive to compute. Since those arguments are also computed while running the base program, one could reuse the previously computed values through memoization or extensions of static caching (as discussed in Sec. 19.2.3). We leave implementing these optimizations for future work. As a consequence, our current implementation delivers good results only if most derivatives are self-maintainable.

16.3 Case study

We perform a case study on a nontrivial realistic program to demonstrate that ILC can speed it up. We take the MapReduce-based skeleton of the word-count example [Lämmel, 2007]. We define a suitable differentiation plugin, adapt the program to use it, and show that incremental computation is faster than recomputation. We designed and implemented the differentiation plugin following the requirements of the corresponding proof plugin, even though we did not formalize the proof plugin (e.g. in Agda). For lack of space, we focus on the base types that are crucial for our example and its performance, that is, collections. The plugin also implements tuples, tagged unions, Booleans and integers with the usual introduction and elimination forms, with a few optimizations for their derivatives.

wordcount takes a map from document IDs to documents and produces a map from words appearing in the input to the count of their appearances, that is, a histogram:

wordcount: Map ID Document → Map Word Int

For simplicity, instead of modeling strings, we model documents as bags of words and document IDs as integers. Hence, what we implement is:

histogram: Map Int (Bag a) → Map a Int

We model words by integers (a = Int), but treat them parametrically. Other than that, we adapt Lämmel's code directly to our language. Figure 16.1 shows the λ-term histogram.
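For concreteness, the following Scala sketch computes the same histogram; it is ours, not the actual λ-term of Fig. 16.1, and it is written directly against the primitives of Fig. 16.2 below. The name additiveGroupOnInts is one we introduce for the group (Z, +, −, 0).

// Sketch only: histogram expressed with the Fig. 16.2 primitives. Each document
// (a bag of words) is folded into a word-to-count map, and the per-document maps
// are then merged with the group on maps of integers.
val additiveGroupOnInts: Group[Int] = new Group[Int] {
  def merge(a: Int, b: Int) = a + b
  def inverse(a: Int) = -a
  def zero = 0
}

def histogram[A](docs: Map[Int, Bag[A]]): Map[A, Int] = {
  val countGroup = groupOnMaps[A, Int](additiveGroupOnInts)
  foldMap(groupOnBags[A], countGroup)(
    (_: Int, doc: Bag[A]) =>
      foldBag(countGroup, (word: A) => collection.immutable.HashMap(word -> 1), doc),
    docs)
}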

Figure 16.2 shows a simplified Scala implementation of the primitives used in Fig. 16.1. As bag primitives, we provide constructors and a fold operation, following Gluche et al. [1997]. The constructors for bags are ∅ (constructing the empty bag), singleton (constructing a bag with one element), merge (constructing the merge of two bags) and negate (negate b constructs a bag with the same elements as b but negated multiplicities); all but singleton represent abelian group operations. Unlike for usual ADT constructors, the same bag can be constructed in different ways, which are equivalent by the equations defining abelian groups; for instance, since merge is commutative, merge x y = merge y x. Folding on a bag will represent the bag through constructors in an arbitrary way, and then replace constructors with arguments; to ensure a well-defined result, the arguments of fold should respect the same equations, that is, they should form an abelian group; for instance, the binary operator should be commutative. Hence, the fold operator foldBag can be defined to take a function (corresponding to singleton) and an abelian group (for the other constructors).


// Abelian groups
abstract class Group[A] {
  def merge(value1: A, value2: A): A
  def inverse(value: A): A
  def zero: A
}

// Bags
type Bag[A] = collection.immutable.HashMap[A, Int]

def groupOnBags[A] = new Group[Bag[A]] {
  def merge(bag1: Bag[A], bag2: Bag[A]) = ...
  def inverse(bag: Bag[A]) = bag.map({ case (value, count) ⇒ (value, -count) })
  def zero = collection.immutable.HashMap()
}

def foldBag[A, B](group: Group[B], f: A ⇒ B, bag: Bag[A]): B =
  bag.flatMap({
    case (x, c) if c >= 0 ⇒ Seq.fill(c)(f(x))
    case (x, c) if c < 0  ⇒ Seq.fill(-c)(group.inverse(f(x)))
  }).fold(group.zero)(group.merge)

// Maps
type Map[K, A] = collection.immutable.HashMap[K, A]

def groupOnMaps[K, A](group: Group[A]) = new Group[Map[K, A]] {
  def merge(dict1: Map[K, A], dict2: Map[K, A]): Map[K, A] =
    dict1.merged(dict2)({
      case ((k, v1), (_, v2)) ⇒ (k, group.merge(v1, v2))
    }).filter({
      case (k, v) ⇒ v != group.zero
    })
  def inverse(dict: Map[K, A]): Map[K, A] = dict.map({
    case (k, v) ⇒ (k, group.inverse(v))
  })
  def zero = collection.immutable.HashMap()
}

// The general map fold
def foldMapGen[K, A, B](zero: B, merge: (B, B) ⇒ B)
    (f: (K, A) ⇒ B, dict: Map[K, A]): B =
  dict.map(Function.tupled(f)).fold(zero)(merge)

// By using foldMap instead of foldMapGen, the user promises that
// f k is a homomorphism from groupA to groupB for each k: K.
def foldMap[K, A, B](groupA: Group[A], groupB: Group[B])
    (f: (K, A) ⇒ B, dict: Map[K, A]): B =
  foldMapGen(groupB.zero, groupB.merge)(f, dict)

Figure 16.2: A Scala implementation of primitives for bags and maps. In the code, ⊞, ⊟ and e are called merge, inverse and zero, respectively. We omit other, relatively standard, primitives.

foldBag is then defined by the following equations:

foldBag : Group τ → (σ → τ) → Bag σ → τ
foldBag д@(_, ⊞, ⊟, e) f ∅              = e
foldBag д@(_, ⊞, ⊟, e) f (merge b1 b2)  = foldBag д f b1 ⊞ foldBag д f b2
foldBag д@(_, ⊞, ⊟, e) f (negate b)     = ⊟ (foldBag д f b)
foldBag д@(_, ⊞, ⊟, e) f (singleton v)  = f v

If д is a group, these equations specify foldBag д precisely [Gluche et al., 1997]. Moreover, the first three equations mean that foldBag д f is the abelian group homomorphism from the abelian group on bags to the group д (because those equations coincide with the definition of homomorphism). Figure 16.2 shows an implementation of foldBag as specified above. Moreover, all functions that deconstruct a bag can be expressed in terms of foldBag with suitable arguments. For instance, we can sum the elements of a bag of integers with foldBag gZ (λx → x), where gZ is the abelian group induced on integers by addition, (Z, +, −, 0). Users of foldBag can define different abelian groups to specify different operations (for instance, to multiply floating-point numbers).
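As a small usage sketch (ours) against the Fig. 16.2 primitives, with gZ written out explicitly:

// Sketch: summing the elements of a bag of integers with foldBag.
val gZ: Group[Int] = new Group[Int] {   // the additive group (Z, +, −, 0)
  def merge(a: Int, b: Int) = a + b
  def inverse(a: Int) = -a
  def zero = 0
}

val bag: Bag[Int] = collection.immutable.HashMap(1 -> 2, 5 -> 1)  // the bag {1, 1, 5}
val total: Int = foldBag(gZ, (x: Int) => x, bag)                  // evaluates to 7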

If д and f do not change, foldBag д f has a self-maintainable derivative. By the equations above,

foldBag д f (b ⊕ db)
  = foldBag д f (merge b db)
  = foldBag д f b ⊞ foldBag д f db
  = foldBag д f b ⊕ GroupChange д (foldBag д f db)

We will describe the GroupChange change constructor in a moment. Before that, we note that as a consequence, the derivative of foldBag д f is

λb db → GroupChange д (foldBag д f db),

and we can see it does not use b: as desired, it is self-maintainable. Additional restrictions are required to make foldMap's derivative self-maintainable; those restrictions motivate the precondition on mapReduce in Fig. 16.1. foldMapGen has the same implementation, but without those restrictions; as a consequence, its derivative is not self-maintainable, but it is more generally applicable. Lack of space prevents us from giving more details.

To define GroupChange, we need a suitable erased change structure on τ, such that ⊕ will be equivalent to ⊞. Since there might be multiple groups on τ, we allow changes to specify a group, and have ⊕ delegate to ⊞ (which we extract by pattern-matching on the group):

∆τ = Replace τ | GroupChange (AbelianGroup τ) τ

v ⊕ (Replace u)                          = u
v ⊕ (GroupChange (⊞, inverse, zero) dv)  = v ⊞ dv
v ⊖ u                                    = Replace v

That is, a change between two values is either simply the new value (which replaces the old one, triggering recomputation), or their difference (computed with abelian group operations, like in the change structures for groups from Sec. 13.1.1). The operator ⊖ does not know which group to use, so it does not take advantage of the group structure. However, foldBag is now able to generate a group change.
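A possible Scala encoding of this change structure, as the plugin might represent it (a sketch of ours; the names mirror the constructors above):

// Sketch: the erased change structure for a type T.
sealed trait Change[T]
case class Replace[T](newValue: T) extends Change[T]
case class GroupChange[T](group: Group[T], delta: T) extends Change[T]

// v ⊕ dv
def oplus[T](v: T, dv: Change[T]): T = dv match {
  case Replace(u)             => u                   // replacement: forget v, recompute downstream
  case GroupChange(group, dt) => group.merge(v, dt)  // apply the group-based difference
}

// v ⊖ u: without knowing which group to use, fall back to replacement.
def ominus[T](v: T, u: T): Change[T] = Replace(v)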

We rewrite grandTotal in terms of foldBag to take advantage of group-based changes.

id = λx → x
G+ = (Z, +, −, 0)
grandTotal = λxs → λys → foldBag G+ id (merge xs ys)
D ⟦ grandTotal ⟧ =
  λxs → λdxs → λys → λdys →
    foldBag′ G+ G+′ id id′
      (merge xs ys)
      (merge′ xs dxs ys dys)

It is now possible to write down the derivative of foldBag.

foldBag′ = D ⟦ foldBag ⟧ =
  λG → λdG → λf → λdf → λzs → λdzs → GroupChange G (foldBag G f dzs)
    (if static analysis detects that dG and df are nil changes)

We know from Sec. 10.6 that

merge′ = λu → λdu → λv → λdv → merge du dv.

Inlining foldBag′ and merge′ gives us a more readable term that is β-equivalent to the derivative of grandTotal:

D ⟦ grandTotal ⟧ =
  λxs → λdxs → λys → λdys → foldBag G+ id (merge dxs dys).
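In the Scala encoding of changes sketched earlier, foldBag's self-maintainable derivative could be written roughly as follows; this is again a sketch of ours, applicable only when the changes to the group and to f are nil, and the generated code differs.

// Sketch: the derivative of foldBag for the nil-change case. It ignores the base
// bag zs and folds only over the change dzs, wrapping the result as a group change.
def foldBagDerivative[A, B](group: Group[B], f: A => B)(zs: Bag[A], dzs: Bag[A]): Change[B] =
  GroupChange(group, foldBag(group, f, dzs))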

16.4 Benchmarking setup

We run object-language programs by generating corresponding Scala code. To ensure rigorous benchmarking [Georges et al., 2007], we use the ScalaMeter benchmarking library. To show that the performance difference from the baseline is statistically significant, we show 99% confidence intervals in graphs.

We verify Eq. (10.1) experimentally by checking that the two sides of the equation always evaluate to the same value.

We ran benchmarks on an 8-core Intel Core i7-2600 CPU running at 3.4 GHz, with 8 GB of RAM, running Scientific Linux release 6.4. While the benchmark code is single-threaded, the JVM offloads garbage collection to other cores. We use the preinstalled OpenJDK 1.7.0_25 and Scala 2.10.2.

Input generation. Inputs are randomly generated to resemble English words across all webpages on the internet: the vocabulary size and the average length of a webpage stay roughly constant, while the number of webpages grows day by day. To generate a size-n input of type Map Int (Bag Int), we generate n random numbers between 1 and 1000 and distribute them randomly into n/1000 bags. Changes are randomly generated to resemble edits: a change has 50% probability of deleting a random existing number, and 50% probability of inserting a random number at a random location.
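A sketch of this input and change generation, using the type aliases of Fig. 16.2 (the function names and the exact representation of a change are ours; the real generator differs in details):

import scala.util.Random

// Sketch: build a size-n input of type Map[Int, Bag[Int]] by drawing n words
// (integers in 1..1000) and scattering them over roughly n/1000 documents.
def generateInput(n: Int, rng: Random): Map[Int, Bag[Int]] = {
  val numDocs = math.max(1, n / 1000)
  val empty: Map[Int, Bag[Int]] = collection.immutable.HashMap()
  (1 to n).foldLeft(empty) { (docs, _) =>
    val doc  = rng.nextInt(numDocs)      // random document ID
    val word = rng.nextInt(1000) + 1     // random word in 1..1000
    val bag  = docs.getOrElse(doc, collection.immutable.HashMap.empty[Int, Int])
    docs + (doc -> (bag + (word -> (bag.getOrElse(word, 0) + 1))))
  }
}

// Sketch: an edit-like change, summarized here as (documentId, word, ±1); with
// probability 1/2 we delete an existing occurrence, otherwise we insert a new one.
def generateChange(docs: Map[Int, Bag[Int]], rng: Random): (Int, Int, Int) =
  if (rng.nextBoolean() && docs.nonEmpty) {
    val (doc, bag) = docs.toSeq(rng.nextInt(docs.size))
    val (word, _)  = bag.toSeq(rng.nextInt(bag.size))
    (doc, word, -1)
  } else {
    (rng.nextInt(math.max(1, docs.size)), rng.nextInt(1000) + 1, +1)
  }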

Experimental units. Thanks to Eq. (10.1), both recomputation f (a ⊕ da) and incremental computation f a ⊕ D ⟦ f ⟧ a da produce the same result. Both programs are written in our object language. To show that derivatives are faster, we compare these two computations. For the incremental computation, we measure the aggregate running time for running the derivative on the change and for updating the original output with the result of the derivative.
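Schematically, the two experimental units look as follows; this is a simplification of ours, since the real harness uses ScalaMeter rather than manual timing, and dhistogram, oplusIn and oplusOut merely stand for the generated derivative and the ⊕ operators on the input and output types.

// Sketch of what each experimental unit measures. The names dhistogram, oplusIn
// and oplusOut are placeholders for generated code, not actual identifiers.
def time[T](body: => T): (T, Long) = {
  val start = System.nanoTime()
  val result = body
  (result, System.nanoTime() - start)
}

// Recomputation: rerun histogram from scratch on the updated input.
//   val (fromScratch, tRecompute) = time { histogram(oplusIn(docs, ddocs)) }

// Incremental computation: run the derivative on the change, then update the old
// output; both steps are included in the measured time.
//   val (updated, tIncremental) = time { oplusOut(oldOutput, dhistogram(docs, ddocs)) }

// By Eq. (10.1), fromScratch == updated, which we also check experimentally.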

16.5 Benchmark results

Our results show (Fig. 16.3) that our program reacts to input changes in essentially constant time, as expected, hence orders of magnitude faster than recomputation. Constant factors are small enough that the speedup is apparent on realistic input sizes.

We present our results in Fig. 16.3. As expected, the runtime of incremental computation is essentially constant in the size of the input, while the runtime of recomputation is linear in the input size. Hence, on our biggest inputs incremental computation is over 10⁴ times faster.

Derivative time is in fact slightly irregular for the first few inputs, but this irregularity decreases slowly with increasing warmup cycles. In the end, for derivatives we use 10⁴ warmup cycles. With fewer warmup cycles, the running time for derivatives decreases significantly during execution, going from 2.6 ms for n = 1000 to 0.2 ms for n = 512000. Hence, we believe extended warmup is appropriate, and the changes do not affect our general conclusions. Considering confidence intervals, in our experiments the running time for derivatives varies between 0.139 ms and 0.378 ms.

In our current implementation, the code of the generated derivatives can become quite large. For the histogram example (which is around 1 KB of code), a pretty-print of its derivative is around 40 KB of code. The function application case in Fig. 12.1c can lead to quadratic growth in the worst case. More importantly, we realized that our home-grown transformation system in some cases performs overly aggressive inlining, especially for derivatives, even though this is not required for incrementalization, and we believe this explains a significant part of the problem. Indeed, code blowup problems do not currently appear in later experiments (see Sec. 17.4).

16.6 Chapter conclusion

Our results show that the incrementalized program runs in essentially constant time and hence orders of magnitude faster than the alternative of recomputation from scratch.

An important lesson from the evaluation is that, as anticipated in Sec. 16.2.1, to achieve good performance our current implementation requires some form of dead-code elimination, such as laziness.

[Figure 16.3 appears here: a log–log plot of runtime (ms) against input size (1k to 4096k), with one curve for Incremental and one for Recomputation.]

Figure 16.3: Performance results in log–log scale, with input size on the x-axis and runtime in ms on the y-axis. Confidence intervals are shown by the whiskers; most whiskers are too small to be visible.

Chapter 17

Cache-transfer-style conversion

17.1 Introduction

Incremental computation is often desirable: after computing an output from some input, we often need to produce new outputs corresponding to changed inputs. One can simply rerun the same base program on the new input; but instead, incremental computation transforms the input change to an output change. This can be desirable because it is more efficient.

Incremental computation could also be desirable because the changes themselves are of interest: Imagine a typechecker explaining how some change to input source propagates to changes to the typechecking result. More generally, imagine a program explaining how some change to its input propagates through a computation into changes to the output.

ILC (Incremental λ-Calculus) [Cai et al., 2014] is a recent approach to incremental computation for higher-order languages. ILC represents changes from an old value v1 to an updated value v2 as a first-class value, written dv. ILC also statically transforms base programs to incremental programs, or derivatives: derivatives produce output changes from changes to all their inputs. Since functions are first-class values, ILC introduces a notion of function changes.

However, as mentioned by Cai et al. and as we explain below, ILC as currently defined does not allow reusing intermediate results created by the base computation during the incremental computation. That restricts ILC to supporting efficiently only self-maintainable computations, a rather restrictive class: for instance, mapping self-maintainable functions on a sequence is self-maintainable, but dividing numbers isn't! In this paper, we remove this limitation.

To remember intermediate results, many incrementalization approaches rely on forms of memoization: one uses hashtables to memoize function results, or dynamic dependence graphs [Acar, 2005] to remember the computation trace. However, such data structures often remember results that might not be reused; moreover, the data structures themselves (as opposed to their values) occupy additional memory, looking up intermediate results has a cost in time, and typical general-purpose optimizers cannot predict results from memoization lookups. Instead, ILC aims to produce purely functional programs that are suitable for further optimizations.

We eschew memoization: instead, we transform programs to cache-transfer style (CTS), following ideas from Liu and Teitelbaum [1995]. CTS functions output caches of intermediate results along with their primary results. Caches are just nested tuples whose structure is derived from code, and accessing them does not involve looking up keys depending on inputs. We also extend differentiation to produce CTS derivatives, which can extract from caches any intermediate results they need. This approach was inspired and pioneered by Liu and Teitelbaum for untyped first-order functional languages, but we integrate it with ILC and extend it to higher-order typed languages.
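As a tiny preview (a toy sketch of ours, not the transformation's actual output), consider a function that is not self-maintainable because it divides by the input's size:

// Sketch of cache-transfer style on a toy average function. The CTS version returns,
// besides its result, a cache of intermediate results (the sum and the count). The
// CTS derivative receives that cache, so it can react to a change without
// re-traversing the base sequence. For simplicity, the input change is modeled as a
// sequence of added elements and the output change as the new value.
def averageCTS(xs: Seq[Double]): (Double, (Double, Int)) = {
  val s = xs.sum
  val n = xs.length
  (s / n, (s, n))
}

def daverageCTS(added: Seq[Double], cache: (Double, Int)): (Double, (Double, Int)) = {
  val (s, n) = cache
  val s2 = s + added.sum       // update the cached sum from the change alone
  val n2 = n + added.length    // update the cached count
  (s2 / n2, (s2, n2))          // new result, plus the updated cache for the next change
}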

While CTS functions still produce additional intermediate data structures, produced programs can be subject to further optimizations. We believe static analysis of a CTS function and its CTS derivative can identify and remove unneeded state (similar to what has been done by Liu and Teitelbaum), as we discuss in Sec. 17.5.6. We leave a more careful analysis to future work.

We prove most of our formalization correct in Coq. To support non-simply-typed programs, all our proofs are for untyped λ-calculus, while previous ILC correctness proofs were restricted to simply-typed λ-calculus. Unlike previous ILC proofs, we simply define which changes are valid via a logical relation, then show the fundamental property for this logical relation (see Sec. 17.2.1). To extend this proof to untyped λ-calculus, we switch to step-indexed logical relations.

To support differentiation on our case studies, we also represent function changes as closures that can be inspected, to support manipulating them more efficiently and detecting at runtime when a function change is nil and hence need not be applied. To show this representation is correct, we also use closures in our mechanized proof.

Unlike plain ILC, typing programs in CTS is challenging, because the shape of a function's caches depends on the function's implementation. Our case studies show how to non-trivially embed the resulting programs in typed languages, at least for those examples, while our proofs support an untyped target language.

In sum, we present the following contributions:

• via examples, we motivate extending ILC to remember intermediate results (Sec. 17.2);

• we give a novel proof of correctness for ILC for untyped λ-calculus, based on step-indexed logical relations (Sec. 17.3.3);

• building on top of ILC-style differentiation, we show how to transform untyped higher-order programs to cache-transfer style (CTS) (Sec. 17.3.5);

• we show that programs and derivatives in cache-transfer style correctly simulate their non-CTS variants (Sec. 17.3.6);

• we mechanize most of our proofs in Coq;

• we perform performance case studies (in Sec. 17.4), applying (by hand) an extension of this technique to Haskell programs, and incrementalize efficiently also programs that do not admit self-maintainable derivatives.

The rest of the paper is organized as follows. Sec. 17.2 summarizes ILC and motivates the extension to cache-transfer style. Sec. 17.3 presents our formalization and proofs. Sec. 17.4 presents our case studies and benchmarks. Sec. 17.5 discusses limitations and future work. Sec. 17.6 discusses related work and Sec. 17.7 concludes.
