Speculative Applications of Macroscopic Techniques

8.2 Program Performance-Related Macroscopic Applications

Program performance is the primary focus of this thesis, but we still have not been able to explore all possible applications of macroscopic techniques for improved program performance.

8.2.1 Automatic Instance Interleaving

Instance Interleaving is a technique which arranges for the fields of multiple instances of structures in a program to be interleaved with each other [136]. For example, consider a recursive data structure consisting of nodes with fields F1,F2,F3,F4. With a standard memory organization, four instances (A,B,C,D) of this node type would be laid out in memory as:

AF1,AF2,AF3,AF4, BF1,BF2,BF3,BF4, CF1,CF2,CF3,CF4, DF1,DF2,DF3,DF4

When instance interleaving is used, assuming that the fields of this structure are all the same size and that four fields fit on a cache line, memory would be organized like this instead:

AF1,BF1,CF1,DF1, AF2,BF2,CF2,DF2, AF3,BF3,CF3,DF3, AF4,BF4,CF4,DF4

The advantage of this layout is that it packs identical fields together onto a cache line. Consider a traversal of this data structure that accesses fields F1 and F2, but not F3 or F4. With the standard layout, each structure instance occupies an entire cache line, so traversing these four instances touches four cache lines, and only half of the information in each cache line is actually used. After instance interleaving, only two cache lines are accessed, reducing the cache footprint of the traversal.
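The cache-line arithmetic above can be checked with a small sketch. It is illustrative only: the 4-byte field size, the 16-byte cache line (chosen so that four fields fit on a line, as assumed in the text), and all names are assumptions, not part of the implementations in [136] or [106].

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

// Assumed sizes for illustration: 4-byte fields, four fields per cache line.
constexpr std::size_t kFieldSize = sizeof(std::int32_t);
constexpr std::size_t kLineSize = 4 * kFieldSize;
constexpr std::size_t kNumFields = 4;     // F1..F4
constexpr std::size_t kNumInstances = 4;  // A..D

// Byte offset of field f (0-based) of instance i in the standard layout:
// AF1,AF2,AF3,AF4, BF1,...
constexpr std::size_t standardOffset(std::size_t i, std::size_t f) {
  return (i * kNumFields + f) * kFieldSize;
}

// Byte offset of the same field in the interleaved layout:
// AF1,BF1,CF1,DF1, AF2,...
constexpr std::size_t interleavedOffset(std::size_t i, std::size_t f) {
  return (f * kNumInstances + i) * kFieldSize;
}

// Number of distinct cache lines touched by a traversal that reads the
// first fieldsRead fields of every instance.
template <typename OffsetFn>
std::size_t linesTouched(OffsetFn offset, std::size_t fieldsRead) {
  std::set<std::size_t> lines;
  for (std::size_t i = 0; i < kNumInstances; ++i)
    for (std::size_t f = 0; f < fieldsRead; ++f)
      lines.insert(offset(i, f) / kLineSize);
  return lines.size();
}
```

Under these assumptions, `linesTouched(standardOffset, 2)` yields four lines and `linesTouched(interleavedOffset, 2)` yields two, matching the counts in the text.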

Instance interleaving is a powerful technique, first proposed by Truong et al. in [136] and partially automated in [106]. These works show that instance interleaving can have a large positive performance impact but is difficult to implement. In particular, it requires a special allocation library and a way to get the compiler to lay out the fields of a structure in this unusual way. The implementation in [106] is limited in several ways: the authors evaluate the transformation only on very small programs, assume (but do not check) that their C programs are type-safe, perform the transformation “per type” instead of per data structure instance, do not check for memory that escapes the program, etc.

Implementing instance interleaving as a Macroscopic transformation would improve upon this in several ways, requiring implementation techniques that are very similar to the pointer compression algorithm described in Chapter 7. In particular, a macroscopic implementation could directly solve the problems with the algorithm presented in [106], making this suitable for use in a production compiler by using the following properties:

• Macroscopic analysis identifies memory that is accessed in a type-safe way.

• Macroscopic analysis identifies type-homogeneous recursive data structures.

• Macroscopic analysis identifies memory objects that escape outside of the scope of analysis (e.g., those that are passed to external functions).

• Macroscopic techniques give full control over the allocation runtime library with which the program allocates and frees memory.

• Macroscopic techniques would transform one data structure instance at a time, independently of the others. This would allow different instances to have different fields co-located when profitable.

• Macroscopic techniques identify tricky cases that the algorithm must handle, such as allocation of arrays of nodes.

We believe that this aggressive application would have a large performance impact on many different programs and be reasonably straightforward to implement.

8.2.2 Automatic Use of Superpages for Improved TLB Effectiveness

TLB misses can be a significant factor that limits the performance of programs with large memory footprints. To combat this problem, architecture support for superpages has become commonplace.

Superpages improve TLB “reach” by enhancing the TLB to support entries for two or more page sizes: the first is a standard size (e.g., 4K bytes) and the second is a power of two that is often much larger (e.g., 1M or 16M bytes).

Using superpages improves TLB performance by reducing the number of entries required to cover an address range. Because of this, operating system support for automatically inferring when superpages are beneficial has been investigated (e.g., [111]), focusing on how and when to promote normal pages to superpages and when to demote them to normal pages again. However, use of superpages is not always profitable [132, 23]. In particular, superpages add complexity to the operating system, make swapping more expensive, and can affect working set sizes.

Macroscopic analysis, and pool allocation in particular, can be used to identify and increase the number of cases in which superpage promotion is cost-effective. For example, a simple approach would enhance the pool runtime library (described in Section 5.1.1) to allocate superpage memory when allocating slabs that are larger than the superpage. This approach (or more aggressive ones) could increase the number of situations where use of superpages for recursive data structures is profitable, taking advantage of the data structure defragmentation properties provided by pool allocation (discussed in Section 6.3.7).
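The slab-size test in this simple approach might be sketched as follows. This is not the pool runtime library of Section 5.1.1: the names `SlabRequest` and `planSlab` are hypothetical, the page sizes are assumed values taken from the examples above, and a slab of exactly one superpage is treated as eligible (an assumption). How superpage-backed memory is actually obtained is specific to the operating system and is left out.

```cpp
#include <cstddef>

// Assumed page sizes for illustration (4K standard page, 1M superpage).
constexpr std::size_t kPageSize = 4 * 1024;
constexpr std::size_t kSuperpageSize = 1024 * 1024;

// Hypothetical description of what the pool runtime would ask the OS for.
struct SlabRequest {
  std::size_t Bytes;   // request size, rounded to the chosen page size
  bool UseSuperpages;  // true if the slab should be superpage-backed
};

// Round n up to the next multiple of align.
constexpr std::size_t roundUp(std::size_t n, std::size_t align) {
  return (n + align - 1) / align * align;
}

// Slabs that span at least one superpage are rounded up to whole
// superpages; smaller slabs keep using standard pages.
constexpr SlabRequest planSlab(std::size_t slabBytes) {
  if (slabBytes >= kSuperpageSize)
    return {roundUp(slabBytes, kSuperpageSize), true};
  return {roundUp(slabBytes, kPageSize), false};
}
```

Keeping the decision separate from the OS call makes the policy easy to vary, for instance to require that a slab be several superpages large before promotion.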

8.2.3 New Approaches for Prefetching

Prefetching for programs that use dense arrays is a well understood problem [21, 100], but prefetching for pointer-chasing traversals of recursive data structures is much harder. For example, consider Figure 8.1, a function that computes the length of a linked list.

    struct list { int X; list *Next; };

    unsigned length(list *L) {
      unsigned Length = 0;
      for (; L; L = L->Next)
        ++Length;
      return Length;
    }

Figure 8.1: Linked-list pointer-chasing example

The problem in this case, and many other tight pointer-chasing loops, is that there is not enough work to overlap with the prefetch. Even if the prefetch for the ’next’ dereference is started immediately after the previous load completes, the prefetch will not have enough time to bring the memory into cache, unless it is already there to begin with. The only general-purpose prior solution to this problem is a technique known as history-pointer prefetching [94] (also known as jump-pointer prefetching [112]).

Compressed History-Pointer Prefetching

History-pointer prefetching is one successful approach for overcoming the latency of pointer-chasing loops. It adds pointers to the data structure that point several nodes ahead in the traversal. Having a pointer to the node that will be needed N steps ahead in the traversal lets the prefetching code fetch N nodes ahead, which allows it to overcome almost arbitrary memory latency (assuming that these links are accurate). The primary disadvantage of history-pointer prefetching is that it reduces the effectiveness of the cache by increasing the size of the list nodes. This effect is particularly bad on 64-bit systems.
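A minimal sketch of the idea, applied to the list of Figure 8.1, is shown below. It simplifies the schemes of [94, 112]: the extra pointers are installed in a separate pass rather than maintained from traversal history, and the names `JNode`, `installJumpPointers`, and `lengthWithPrefetch` are hypothetical. The prefetch itself uses the GCC/Clang builtin `__builtin_prefetch` and compiles to nothing on other compilers.

```cpp
#include <cstddef>

// List node extended with a history (jump) pointer.
struct JNode {
  int X;
  JNode *Next;
  JNode *Jump;  // node Distance steps ahead in the traversal, or nullptr
};

// Issue a prefetch hint where the compiler supports it.
inline void prefetch(const void *P) {
#if defined(__GNUC__)
  __builtin_prefetch(P);
#else
  (void)P;
#endif
}

// Set each node's Jump pointer to the node Distance steps further along.
void installJumpPointers(JNode *Head, std::size_t Distance) {
  JNode *Lead = Head;
  for (std::size_t I = 0; I < Distance && Lead; ++I)
    Lead = Lead->Next;
  for (JNode *Trail = Head; Trail; Trail = Trail->Next) {
    Trail->Jump = Lead;
    if (Lead)
      Lead = Lead->Next;
  }
}

// Length computation that prefetches Distance nodes ahead on every step.
unsigned lengthWithPrefetch(const JNode *L) {
  unsigned Length = 0;
  for (; L; L = L->Next) {
    if (L->Jump)
      prefetch(L->Jump);
    ++Length;
  }
  return Length;
}
```

On a typical 64-bit target the `Jump` field grows each node from 16 to 24 bytes, which is the cache cost described above.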

Note that the inefficiency introduced by history-pointers is precisely the overhead that pointer-compression is designed to eliminate: it adds intra-data-structure pointers. For this reason, using pointer compression to compress the original and history-pointers in a data structure seems extremely powerful: it has the prefetching power of history-pointer prefetching, but without the overhead of increasing the size of the nodes. Also, as with pointer compression, data structure analysis exposes information about when it is safe to modify the layout of a particular data structure, which is a prerequisite to performing automatic history-pointer prefetching for programs written in languages like C.
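The size argument can be made concrete with a small sketch. It is only an illustration of the combination, not the pointer compression transformation of Chapter 7: 32-bit indices into a `std::vector` stand in for pool-relative compressed pointers, and `WideNode`, `CompactNode`, `kNull`, and `compactLength` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Uncompressed node with a history pointer: 24 bytes on a typical 64-bit target.
struct WideNode {
  std::int32_t X;
  WideNode *Next;
  WideNode *Jump;
};

// Compressed node: both links are 32-bit indices into the pool (12 bytes).
constexpr std::uint32_t kNull = 0xFFFFFFFFu;  // assumed encoding of a null link
struct CompactNode {
  std::int32_t X;
  std::uint32_t Next;
  std::uint32_t Jump;
};

// Issue a prefetch hint where the compiler supports it.
inline void prefetch(const void *P) {
#if defined(__GNUC__)
  __builtin_prefetch(P);
#else
  (void)P;
#endif
}

// List length over compressed nodes, prefetching through the compressed
// history pointer on every step.
unsigned compactLength(const std::vector<CompactNode> &Pool, std::uint32_t Head) {
  unsigned Length = 0;
  for (std::uint32_t I = Head; I != kNull; I = Pool[I].Next) {
    if (Pool[I].Jump != kNull)
      prefetch(&Pool[Pool[I].Jump]);
    ++Length;
  }
  return Length;
}
```

Under these assumptions, the compressed node with a history pointer is smaller than the original 16-byte node of Figure 8.1 on a 64-bit target, even though it carries an extra link.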

Pool-order prefetching

With standard heap allocation of data structure nodes, the individual nodes can be fragmented throughout memory. Automatic Pool Allocation inherently improves this situation by grouping the nodes together in memory, which has a positive effect on locality (improving effective cache line density and TLB usage). Additionally, we find that the allocation order and common traversal patterns of data structures are strongly correlated.

All of these observations lead us to believe that simple stride prefetching of data structure nodes in a pool might be an effective way to improve the performance of pointer-chasing codes. Stride prefetching is very simple and has the advantage (like history-pointer prefetching) that it can prefetch as many nodes ahead as needed to cover the latency of memory accesses. Implementing this technique and experimenting with it could provide valuable insight into the locality gains that pool allocation can provide, especially because many processors now have hardware stride prefetchers.
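A software version of pool-order stride prefetching might look like the sketch below. It assumes that every node of the list lives in one contiguous pool `[PoolBegin, PoolEnd)` of equally sized nodes; `PNode`, `lengthStridePrefetch`, and the `Stride` parameter are hypothetical names, and the prefetch uses the GCC/Clang builtin `__builtin_prefetch`.

```cpp
#include <cstddef>

struct PNode {
  int X;
  PNode *Next;
};

// Issue a prefetch hint where the compiler supports it.
inline void prefetch(const void *P) {
#if defined(__GNUC__)
  __builtin_prefetch(P);
#else
  (void)P;
#endif
}

// List length with stride prefetching: on each step, prefetch the node that
// sits Stride slots later in the pool, on the expectation that allocation
// order and traversal order are correlated. No extra pointers are stored.
unsigned lengthStridePrefetch(const PNode *L, const PNode *PoolBegin,
                              const PNode *PoolEnd, std::size_t Stride) {
  unsigned Length = 0;
  for (; L; L = L->Next) {
    // Only prefetch addresses that stay inside the pool.
    if (L >= PoolBegin && L < PoolEnd &&
        static_cast<std::size_t>(PoolEnd - L) > Stride)
      prefetch(L + Stride);
    ++Length;
  }
  return Length;
}
```

Unlike history-pointer prefetching, this adds no per-node storage; its accuracy depends entirely on how closely the traversal follows allocation order.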

8.2.4 Data Structure Traversal-Order Node Relocation

A common usage pattern for data structures is a construction/mutation phase, followed by a traversal phase, followed by a destruction phase. As an example, consider a program that populates a balanced binary tree and then spends a lot of time querying it. While the tree is being built, its nodes must be reordered to maintain the balancing properties, so the common traversal orders are unstable. Once the program enters its query phase, however, it queries the tree with very similar traversal patterns.

For programs with distinct phase behavior like this, it is sometimes effective for the compiler to insert code into the program that reorders the nodes of the data structure in the expected hot traversal order. Others have observed this effect and implemented it in garbage collected systems