Cache Locality Optimization

2.5 Object Locality

2.5.1 Cache Locality Optimization

Java objects are generally small [Bacon et al., 2002, Chilimbi et al., 1999a, Chilimbi and Larus, 1998], so a single cache line (usually 64 bytes) can accommodate multiple objects. This property has motivated researchers to explore techniques that improve the temporal and spatial locality of objects at the hardware cache level. Frequently accessed objects are natural candidates to share a cache line. The literature goes further and examines which fields of an object are accessed more often than others. An object’s fields can be reorganized so that “hot” fields are placed next to each other and therefore fall in the same cache line. A further step places hot fields of multiple, different objects in a single cache line.

The main challenge in cache locality optimization is identifying the hot fields of an object so that they can be co-located in the same cache line. This section reviews a number of cache locality optimization approaches. Common to these approaches is that they profile the program at runtime to record data access information and compute the hot fields.

Chilimbi and Larus [1998] develop a graph-based technique to identify hot objects and create cache-conscious data structures. During program execution, a data profiling system records the object’s base address for each load operation and enters it in an object access buffer. The garbage collector uses the data profile buffers to construct a weighted undirected object affinity graph. Nodes in the graph encode objects and edges encode temporal affinity. At collection time, the collector uses the affinity graph to lay out objects with high temporal affinity next to each other. The result is high spatial and temporal cache locality: objects that are accessed together reside close to each other, so a cache line fetched for one of them is likely to be reused soon.

Nonetheless, the average overhead of runtime profiling is up to 6% of a Cecil program’s execution time. Since the object affinity graph is constructed at every collection, they apply this technique to the old generation only, because minor collections are triggered more often and would otherwise incur significant overhead.
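
The construction of an affinity graph from an access buffer, and its use for layout, can be illustrated with a minimal Java sketch. It is not the implementation of Chilimbi and Larus [1998]: the class `AffinityGraphSketch`, the use of integer object ids in place of base addresses, the sliding-window definition of temporal affinity, and the greedy placement order are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch: integer ids stand in for the object base addresses
// that the profiler would record in the object access buffer.
final class AffinityGraphSketch {
    // Edge weights of the undirected graph, keyed by an unordered id pair.
    private final Map<Long, Integer> weights = new HashMap<>();
    private final TreeSet<Integer> nodes = new TreeSet<>();

    private static long key(int a, int b) {
        int lo = Math.min(a, b);
        int hi = Math.max(a, b);
        return ((long) lo << 32) | (hi & 0xFFFFFFFFL);
    }

    // Assumed affinity definition: two distinct objects accessed within
    // `window` buffer entries of each other gain one unit of edge weight.
    void addProfile(int[] accessBuffer, int window) {
        for (int i = 0; i < accessBuffer.length; i++) {
            nodes.add(accessBuffer[i]);
            for (int j = Math.max(0, i - window); j < i; j++) {
                if (accessBuffer[j] != accessBuffer[i]) {
                    weights.merge(key(accessBuffer[i], accessBuffer[j]), 1, Integer::sum);
                }
            }
        }
    }

    int weight(int a, int b) {
        return weights.getOrDefault(key(a, b), 0);
    }

    // Greedy placement (an assumption): repeatedly place the unplaced object
    // with the strongest affinity to the most recently placed one.
    // Ties are broken in favour of the smallest id.
    List<Integer> layoutOrder(int start) {
        List<Integer> order = new ArrayList<>();
        TreeSet<Integer> unplaced = new TreeSet<>(nodes);
        unplaced.remove(start);
        order.add(start);
        int current = start;
        while (!unplaced.isEmpty()) {
            int best = unplaced.first();
            for (int candidate : unplaced) {
                if (weight(current, candidate) > weight(current, best)) {
                    best = candidate;
                }
            }
            unplaced.remove(best);
            order.add(best);
            current = best;
        }
        return order;
    }
}
```

For the access buffer 1, 3, 1, 3, 2, 4, 2, 4 with a window of one entry, objects 1 and 3 and objects 2 and 4 acquire the heaviest edges, so the greedy order starting from object 1 places each pair adjacently.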

Calder et al. [1998] implement a different cache-conscious data placement technique. They use a compiler-directed mechanism that assigns addresses to global variables, stacks, heap objects, and constants to reduce data cache misses. Their technique requires a training run to gather data access information. The collected information is fed to the compiler, which maps local and global variables and constants to the proposed new virtual addresses. For heap objects, they modify the memory allocator to assign the new addresses at runtime. The results show substantial locality improvement for global variables and stack objects; however, heap objects obtain only an insignificant performance improvement.

Novark et al. [2006] present an approach that enables programmers to change and control the object layout at collection time. Programmers annotate the code and provide a custom object layout for a class, which works as a hint to the runtime system to arrange objects in memory. At garbage collection time, the collector invokes the custom object layout methods to place objects into contiguous memory. Results show that a custom object layout reduces cache misses by 50%.

For non-garbage collected environments, Chilimbi et al. [1999b] propose two techniques for data reorganization: clustering and coloring. Clustering attempts to group data structure elements that have temporal affinity in a cache line. They target tree-like data structures and develop a tool to reorganize the data structure elements into sub-trees that are laid out linearly. The coloring technique, on the other hand, organizes data in the cache to avoid resource conflicts. A cache has finite associativity, so only a limited number of concurrently accessed data elements can map to the same cache block without conflict. Coloring therefore maps concurrently accessed elements to non-conflicting regions of the cache to reduce cache conflict misses. However, these techniques require the programmer’s intervention to select related objects. In addition, they target the L2 cache to benefit from its larger size and to place many objects in the same cache line. When the cache line becomes full and other objects cannot fit into it, the authors attempt to co-locate objects in the same virtual memory page, which keeps TLB misses low. The programmer has to intervene here and add hints that these objects are likely to be accessed together.
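
The arithmetic behind coloring can be sketched as follows. The sketch is illustrative only and does not reproduce the tool of Chilimbi et al. [1999b]: the cache geometry (512 sets of 64-byte lines), the class `CacheColoringSketch`, and the helpers `hotAddress` and `coldAddress` are assumptions, and `base` is assumed to be aligned to the cache size. The cache sets are partitioned into a hot region and a cold region, and elements are given addresses whose set index falls in the matching region, so cold data can never evict hot data.

```java
// Hypothetical sketch of cache coloring under an assumed cache geometry.
final class CacheColoringSketch {
    static final int LINE_SIZE = 64;   // bytes per cache line, as in the text
    static final int NUM_SETS = 512;   // assumed number of cache sets
    static final long CACHE_SIZE = (long) LINE_SIZE * NUM_SETS;

    // Set that an address maps to: line number modulo the number of sets.
    static int setIndex(long address) {
        return (int) ((address / LINE_SIZE) % NUM_SETS);
    }

    // Address of the i-th hot element: one line in sets [0, hotSets).
    // `base` is assumed to be a multiple of CACHE_SIZE.
    static long hotAddress(long base, int i, int hotSets) {
        if (i < 0 || i >= hotSets) {
            throw new IllegalArgumentException("hot region holds only " + hotSets + " lines");
        }
        return base + (long) i * LINE_SIZE;
    }

    // Address of the j-th cold element: wraps around within sets
    // [hotSets, NUM_SETS), skipping the hot region in every cache-sized chunk.
    static long coldAddress(long base, int j, int hotSets) {
        int coldSets = NUM_SETS - hotSets;
        long round = j / coldSets;
        long slot = j % coldSets;
        return base + round * CACHE_SIZE + (hotSets + slot) * LINE_SIZE;
    }
}
```

With 128 hot sets, every address returned by `coldAddress` has a set index of at least 128, whereas hot elements occupy sets 0 to 127. The gaps this leaves in the address space are the price of avoiding conflicts between the two groups.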

The memory hierarchy and large data structures have steered research on cache locality towards finer-grained techniques. Object fields often have different access frequencies. Instead of wasting a cache line on rarely accessed fields, only the frequently accessed fields of objects should reside in the cache line.

Chilimbi et al. [1999a] suggest structure splitting to rearrange the internal organization of structure instances. The main idea is that the fields of a Java class have different access frequencies, which allows the class to be divided into hot (frequently accessed) and cold (rarely accessed) portions based on field access profiling. This technique requires static analysis to provide class information and dynamic analysis to measure class instantiation and access statistics. Profile data are given to the compiler to generate code with the structure splitting optimization. Class splitting involves injecting a pointer from the hot class to the cold class. Accordingly, hot class construction must create the cold class instance first. The results show around 20% performance improvement.
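
The shape of a split class can be sketched in Java as follows. The classes `HotAccount`, `ColdAccount`, and `SplitDemo` and the choice of hot and cold fields are hypothetical; in the approach described above the split is derived from profile data and generated by the compiler rather than written by hand.

```java
// Cold portion: fields that the (assumed) profile shows are rarely accessed.
final class ColdAccount {
    final String ownerAddress;
    final String notes;

    ColdAccount(String ownerAddress, String notes) {
        this.ownerAddress = ownerAddress;
        this.notes = notes;
    }
}

// Hot portion: frequently accessed fields plus the injected pointer
// to the cold portion.
final class HotAccount {
    final int id;
    long balance;
    final ColdAccount cold;

    HotAccount(int id, long balance, String ownerAddress, String notes) {
        // The cold instance is created first, then the hot fields are set.
        this.cold = new ColdAccount(ownerAddress, notes);
        this.id = id;
        this.balance = balance;
    }

    // Access to a cold field costs one extra indirection.
    String ownerAddress() {
        return cold.ownerAddress;
    }
}

final class SplitDemo {
    // A hot loop that touches only the hot portion of each instance.
    static long totalBalance(HotAccount[] accounts) {
        long total = 0;
        for (HotAccount account : accounts) {
            total += account.balance;
        }
        return total;
    }
}
```

Because the hot portion is smaller than the original class, more hot instances fit in each cache line, at the cost of an extra pointer per instance and an indirection on every cold access.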

Furthermore, the internal organization of large structures that span multiple cache lines can also be rearranged. Fields in a structure are ordered logically, which does not necessarily correspond to their temporal access patterns. Consequently, this logical layout may incur unnecessary cache misses. Moreover, the most active parts of the code access only a few fields of each class instance [Truong et al., 1998]. Therefore, laying out fields that are often referenced together in the same cache line improves program performance [Panda et al., 1997].

As in class splitting, Chilimbi et al. [1999a] employ static and dynamic analysis to construct a field affinity graph of a structure and generate recommendations for field reorganization. Fields with high temporal affinity are clustered in the same cache line. However, such class splitting and field reordering techniques require substantial programmer effort.
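
The effect of field reordering on cache line usage can be illustrated with a simplified sketch. It orders fields by access count alone, which is a deliberate simplification and not the field affinity graph of Chilimbi et al. [1999a]; the class `FieldReorderSketch`, the field names, sizes, and counts are hypothetical, and alignment padding is ignored.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: frequency-based ordering stands in for the
// affinity-based clustering described in the text.
final class FieldReorderSketch {
    static final int LINE_SIZE = 64; // bytes per cache line, as in the text

    static final class Field {
        final String name;
        final int sizeBytes;
        final long accessCount; // taken from an assumed profile

        Field(String name, int sizeBytes, long accessCount) {
            this.name = name;
            this.sizeBytes = sizeBytes;
            this.accessCount = accessCount;
        }
    }

    // Returns field name -> byte offset after placing the most frequently
    // accessed fields first. Alignment padding is ignored for brevity.
    static Map<String, Integer> reorder(List<Field> declared) {
        List<Field> sorted = new ArrayList<>(declared);
        sorted.sort(Comparator.comparingLong((Field f) -> f.accessCount).reversed());
        Map<String, Integer> offsets = new LinkedHashMap<>();
        int offset = 0;
        for (Field f : sorted) {
            offsets.put(f.name, offset);
            offset += f.sizeBytes;
        }
        return offsets;
    }

    // Index of the cache line, relative to the structure start, that
    // contains the given byte offset.
    static int lineOf(int offset) {
        return offset / LINE_SIZE;
    }
}
```

For a structure declared as an 8-byte `header`, a 64-byte `payload`, and two hot 8-byte fields `next` and `key`, the declared order pushes the hot fields into the second cache line, whereas the reordered layout places them at offsets 0 and 8 of the first line.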

The internal field reorganization of a class may still fill the cache line with unnecessary cold fields. Truong et al. [1998] suggest filling the cache line with frequently accessed fields of different instances of a class, a technique they call instance interleaving. It is based on the observation that a program often accesses only a few fields in each instance, and these fields are not enough to fill a cache line. In addition, because the hot fields of interleaved instances are likely to be accessed contemporaneously, their technique maps these fields to different cache sets to eliminate conflict misses.
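
Java gives the programmer no direct control over object placement, but the idea of interleaving can be approximated by storing the hot field of many instances in one primitive array, which JVMs typically store contiguously. The sketch below is an analogy under that assumption rather than the mechanism of Truong et al. [1998]; the class `InterleavedNodes` and its fields are hypothetical.

```java
// Hypothetical sketch: the hot field of all instances is stored side by
// side, so one cache line holds the hot field of several instances.
final class InterleavedNodes {
    static final int LINE_SIZE = 64; // bytes per cache line, as in the text

    private final int[] hotKey;             // hot field of every instance
    private final String[] coldDescription; // cold field, kept apart

    InterleavedNodes(int capacity) {
        hotKey = new int[capacity];
        coldDescription = new String[capacity];
    }

    void set(int index, int key, String description) {
        hotKey[index] = key;
        coldDescription[index] = description;
    }

    int key(int index) {
        return hotKey[index];
    }

    String description(int index) {
        return coldDescription[index];
    }

    // Number of instances whose hot field fits in one cache line.
    static int keysPerLine() {
        return LINE_SIZE / Integer.BYTES;
    }

    // A hot loop that scans only the interleaved hot fields.
    int countKeysAbove(int threshold) {
        int count = 0;
        for (int key : hotKey) {
            if (key > threshold) {
                count++;
            }
        }
        return count;
    }
}
```

With 4-byte keys and 64-byte lines, the hot fields of 16 instances share a cache line, so the scan in `countKeysAbove` never brings a cold field into the cache.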
