NUMA Heaps - Garbage collection optimization for non uniform memory access architectures

barrier works as a guard asking the owner thread to move the object to the global heap. The heap layout then consists of per-thread heaplets, each with two parts. The first part acts as a traditional nursery with a copying collector. The second part, called the sticky heap, contains objects that lack read barriers and are therefore private and immovable; it is collected with a mark-sweep collector. The shared heap is collected with a stop-the-world collector.

Cohen et al. [2006] attempt to cluster the sub-heaps to reduce the number of thread-local heaplets and the number of objects in the shared heap. The main idea behind this work is that a number of threads share a sub-heap whose objects are accessed only by those threads. At collection time, only these threads are suspended while the other threads continue executing. They use a software module clustering technique [Mancoridis et al., 1999, Harman et al., 2005], which is based on a hill-climbing algorithm, to find optimum heap clusters. The clustering technique works by creating a Thread Dependency Graph (TDG), a directed graph that represents threads as nodes and dependencies between threads as edges. A dependency indicates that two threads access the same object. For instance, if object O is accessed by threads A and B, then a dependency is created between threads A and B. The graph is built using system traces of previous runs and fed into parallel hill climbs. The results provide the proposed heap layout, which consists of a number of sub-heaps, each mapped to one or more threads, and a shared heap to accommodate objects that break the heap clusters. Experiments show that heap clustering reduces the number of sub-heaps and the total number of shared objects in the shared heap.
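The construction of the TDG and the notion of objects that "break" a clustering can be illustrated with a minimal sketch. It is not the implementation of Cohen et al.: the trace format, the function names `build_tdg` and `shared_objects`, and the use of undirected, weighted edges are assumptions made for illustration, and the hill-climbing search itself is omitted. The second function only computes the quantity such a search would plausibly try to minimize.

```python
from collections import defaultdict
from itertools import combinations


def build_tdg(trace):
    """Build a thread dependency graph from (thread_id, object_id) events.

    The trace format is hypothetical. Edges are stored as undirected
    pairs weighted by the number of objects both threads accessed,
    which simplifies the directed graph described in the text.
    """
    accessors = defaultdict(set)  # object_id -> set of accessing threads
    for thread_id, object_id in trace:
        accessors[object_id].add(thread_id)

    edges = defaultdict(int)  # (thread_a, thread_b) -> shared object count
    for threads in accessors.values():
        for a, b in combinations(sorted(threads), 2):
            edges[(a, b)] += 1
    return accessors, dict(edges)


def shared_objects(accessors, cluster_of):
    """Objects accessed from more than one cluster go to the shared heap."""
    return {
        obj
        for obj, threads in accessors.items()
        if len({cluster_of[t] for t in threads}) > 1
    }


# Object o1 is accessed by threads A and B, o2 by C only, o3 by A and C.
trace = [("A", "o1"), ("B", "o1"), ("C", "o2"), ("A", "o3"), ("C", "o3")]
accessors, edges = build_tdg(trace)

# Candidate layout: threads A and B share sub-heap 0, thread C uses sub-heap 1.
clustering = {"A": 0, "B": 0, "C": 1}
in_shared_heap = shared_objects(accessors, clustering)  # {"o3"}
```

Under this candidate layout only o3 crosses a cluster boundary, so it is the only object that would have to be placed in the shared heap.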

2.4 NUMA Heaps

The heap partitioning schemes described in the last section focus on various object characteristics. This section surveys a physical memory heap partitioning criterion, the NUMA heap. Heaps in modern machines are physically partitioned between NUMA nodes. Therefore, object placement techniques for NUMA heaps have gained attention recently due to the growing availability of NUMA machines. Moving objects between NUMA nodes can potentially improve mutator thread performance by placing objects frequently accessed by the threads of a NUMA node into that node.

Tikir and Hollingsworth [2005] study the impact of applying dynamic page placement techniques, which are used for applications with regular memory access patterns, on Java applications. They examine three placement policies on the SPECjbb2000 Java application. First, static-optimal: a technique that has the access information of each heap allocation. Objects are placed in a memory page local to the processor accessing them. Second, prior-knowledge: this technique knows the access information about surviving objects and migrates them, at garbage collection time, to memory pages local to the processor that accesses them the most. Third, object-migration: this technique measures the access frequency for each object since the start of execution, and the garbage collector uses this information to migrate objects to the processor's local memory pages.
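The object-migration policy can be sketched as follows. This is a minimal illustration, not the mechanism used in the study: the `AccessProfile` class, the `plan_migrations` function, and the representation of objects and nodes as plain identifiers are all assumptions. The sketch counts accesses per object and per NUMA node, and at collection time proposes moving each surviving object to the node that accessed it most.

```python
from collections import Counter, defaultdict


class AccessProfile:
    """Per-object access counts by NUMA node (hypothetical structure)."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # object_id -> Counter(node -> accesses)

    def record(self, object_id, node):
        self.counts[object_id][node] += 1

    def preferred_node(self, object_id, current_node):
        per_node = self.counts.get(object_id)
        if not per_node:
            # No recorded accesses: leave the object where it is.
            return current_node
        return per_node.most_common(1)[0][0]


def plan_migrations(profile, placement):
    """Return {object_id: target_node} for survivors that should move."""
    plan = {}
    for obj, node in placement.items():
        target = profile.preferred_node(obj, node)
        if target != node:
            plan[obj] = target
    return plan


profile = AccessProfile()
profile.record("o1", 0)
for _ in range(3):
    profile.record("o1", 1)  # o1 is mostly accessed from node 1
for _ in range(2):
    profile.record("o2", 0)  # o2 is only accessed from node 0

# Current placement of the surviving objects; o3 has no recorded accesses.
placement = {"o1": 0, "o2": 0, "o3": 1}
migrations = plan_migrations(profile, placement)  # {"o1": 1}
```

The prior-knowledge policy would differ only in where the counts come from: they would be known in advance rather than measured during execution.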

The study results show that the prior-knowledge policy provides the best results in both the young and the old generation. In addition, the object-migration technique reduces the non-local memory accesses in the old generation. Therefore, the authors suggest partitioning the heap according to the system's NUMA topology. The heap has the following layout: the Eden space of the young generation and the old generation are each segregated into a number of segments equal to the number of NUMA nodes. They did not partition the survivor space because the experiments showed few memory accesses to the survivors, and hence low potential benefit from partitioning this space. This heap layout is implemented and evaluated using a simulator. Calculating the memory access frequencies is the crucial component of this research; however, the values were obtained from previous runs of the workload and fed to the simulator in advance. The results show a reduction of non-local memory accesses by 40% compared to the original heap layout in the Hotspot JVM.

Ogasawara [2009] criticizes the method used to calculate the memory access information, which was based on trace file processing. Two problems, the inaccuracy of matching memory access events with objects and the time the garbage collector spends finding the preferred location to which an object should be moved, drive the researcher to consider a simpler, lower-overhead technique for calculating the preferred object location. He employs heuristic information to determine the preferred object location and calls this the dominant-thread (DoT) information. The heuristics include the identifier of the thread that acquires the object lock or reserves the object. This information is available in the object header, and reading it incurs very low overhead. If an object does not hold this information, it receives the preferred location calculated for the object that references it directly or indirectly.
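A minimal sketch of how dominant-thread information could propagate through the object graph is given below. It is not Ogasawara's implementation: the function name `preferred_nodes`, the dictionary-based object graph, the breadth-first traversal order, and the rule that an object's own header information takes precedence over inheritance are all assumptions. Root objects are taken to be those referenced directly from a thread stack, with the referencing thread's NUMA node as their preferred location.

```python
from collections import deque


def preferred_nodes(roots, dominant_thread, references, node_of_thread):
    """Assign a preferred NUMA node to every reachable object.

    roots:           {object_id: thread_id} objects referenced from a stack
    dominant_thread: {object_id: thread_id} lock owner or reserving thread
                     recorded in the object header, when present
    references:      {object_id: [object_id, ...]} outgoing references
    node_of_thread:  {thread_id: numa_node}
    All four structures are hypothetical stand-ins for runtime state.
    """
    preferred = {}
    queue = deque()
    for obj, thread in roots.items():
        preferred[obj] = node_of_thread[thread]
        queue.append(obj)

    while queue:
        obj = queue.popleft()
        for child in references.get(obj, ()):
            if child in preferred:
                continue  # already placed via another path
            if child in dominant_thread:
                # Header information is available: use that thread's node.
                preferred[child] = node_of_thread[dominant_thread[child]]
            else:
                # No header information: inherit from the referencing object.
                preferred[child] = preferred[obj]
            queue.append(child)
    return preferred


node_of_thread = {"T1": 0, "T2": 1}
roots = {"r": "T1"}                       # r is on T1's stack
dominant_thread = {"b": "T2"}             # b was locked by T2
references = {"r": ["a", "b"], "b": ["c"]}

result = preferred_nodes(roots, dominant_thread, references, node_of_thread)
# r and a follow T1 (node 0); b follows T2 (node 1); c inherits node 1 from b.
```

Because inheritance follows references transitively, object c receives its location indirectly, through b, without any access profiling.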

The heap layout is similar to that of Tikir and Hollingsworth; however, Ogasawara partitions the survivor space as well. The old generation consists of a number of segments matching the number of NUMA nodes. The Eden space in the young generation consists of multiple segment groups, each containing multiple segments to reflect the NUMA topology. The survivor space, in contrast, contains only one segment group. Mutators request memory from the corresponding segment in the Eden space. The allocation policy is not strict, so a mutator can take memory from the next segment as needed.
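The non-strict allocation policy can be illustrated with a short sketch. The `Segment` and `EdenGroup` classes, the byte-counting capacity model, and the wrap-around search order are assumptions made for illustration; the source states only that allocation may spill into the next segment.

```python
class Segment:
    """A fixed-capacity Eden segment bound to one NUMA node (hypothetical)."""

    def __init__(self, node, capacity):
        self.node = node
        self.capacity = capacity
        self.used = 0

    def try_allocate(self, size):
        if self.used + size > self.capacity:
            return False
        self.used += size
        return True


class EdenGroup:
    """One segment group with one segment per NUMA node."""

    def __init__(self, num_nodes, segment_capacity):
        self.segments = [Segment(n, segment_capacity) for n in range(num_nodes)]

    def allocate(self, node, size):
        """Try the local segment first, then the following segments in order.

        Returns the node that served the request, or None when every
        segment is full (a collection would be needed at that point).
        """
        n = len(self.segments)
        for offset in range(n):
            seg = self.segments[(node + offset) % n]
            if seg.try_allocate(size):
                return seg.node
        return None


eden = EdenGroup(num_nodes=2, segment_capacity=100)
first = eden.allocate(node=0, size=80)   # served locally by node 0
second = eden.allocate(node=0, size=80)  # node 0 is full, spills to node 1
third = eden.allocate(node=0, size=80)   # both segments full, returns None
```

A strict policy would instead return None as soon as the local segment is exhausted, trading locality guarantees for earlier collections.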

Garbage collector threads identify the preferred location of a survivor object using the dominant-thread information. First, for objects referenced directly from a thread stack, the preferred location is the NUMA node on which that thread is running, which can be retrieved by system calls. Second, for objects that are locked or reserved, the preferred location is the
