and Moss, 1992] and object pretenuring [Ungar and Jackson, 1992, Blackburn et al., 2001, Singer et al., 2007].

The connectivity between objects is another criterion for clustering objects. Both kinds of connectivity can reveal object grouping criteria: direct, where object A points to object B, and transitive, where object B is reachable from object A. A study by Hirzel et al. [2002] examines the relationship between different connectivity patterns and object lifetime and death time. They conclude that connected objects that are reachable only from the stack are short-lived, whereas objects that are reachable from globals live for a long time, perhaps immortally. In addition, objects that are connected by pointers tend to die at the same time.

This object connectivity behavior can be utilized to improve garbage collection performance. The same authors, Hirzel et al. [2003], segregate the heap into many partitions, each containing a set of connected objects. They use compiler analysis to determine the connectivity of objects and eliminate write barriers by avoiding pointers between partitions. Since connected objects usually die together, the garbage collector chooses partitions where much space can be reclaimed. To trigger the garbage collector, an estimator annotates the partitions that have a high proportion of garbage to indicate the need for collection. They developed a simulator, gcSim, to evaluate their solution and report improved performance over other garbage collection implementations.
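The partition selection step described above can be illustrated with a minimal sketch. It is not the algorithm of Hirzel et al. [2003]; the `Partition` record, the `est_garbage_fraction` field, and the `min_fraction` threshold are hypothetical names introduced here only to show how an estimator's annotations might drive the choice of partitions to collect.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    name: str
    size_bytes: int
    est_garbage_fraction: float  # hypothetical estimator output in [0, 1]


def choose_partitions(partitions, min_fraction=0.5):
    """Pick partitions whose estimated garbage fraction meets the threshold,
    ordered by estimated reclaimable bytes (largest first)."""
    candidates = [p for p in partitions if p.est_garbage_fraction >= min_fraction]
    return sorted(
        candidates,
        key=lambda p: p.size_bytes * p.est_garbage_fraction,
        reverse=True,
    )


# Illustrative heap: the numbers are made up for the example.
heap = [
    Partition("A", 4096, 0.9),
    Partition("B", 8192, 0.2),
    Partition("C", 2048, 0.6),
]
selected = [p.name for p in choose_partitions(heap)]
print(selected)
```

In this sketch, partition B is skipped even though it is the largest, because its estimated garbage proportion is low and collecting it would reclaim little space relative to the work involved.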

Once a garbage collector implements a policy that segregates objects based on certain object properties, it applies this policy to all objects regardless of the application behavior. However, a user program usually exhibits different phases with different memory requirements, and the memory manager should recognize and exploit this phased behavior. Jones and Ryder [2008] study Java object demographics and find a relationship between allocation sites (for example, JIT compiler allocations and user program allocations) and both the program phase behavior and the object lifetime distribution. Objects allocated by these allocation sites cluster strongly and are stable across different inputs. They conclude that allocation sites create objects with consistent behavior; thus the garbage collector works on objects from key allocation sites where objects are expected to die [Jones and Ryder, 2006].

2.7 Data Placement Policies

A NUMA architecture increases the space of memory page mapping options. Data can be allocated in a NUMA node that is local or remote relative to the node of the thread that accesses it. In contrast to a UMA architecture, data location may affect the access latency and, hence, the overall application performance.

Managed runtime systems reclaim garbage memory automatically, and this implies a high likelihood of object movement between NUMA nodes for generational heaps. In fact, an object may live in multiple NUMA nodes during the course of its life. A memory placement policy determines an initial object location for the application threads; however, the garbage collector changes the object location and alters the memory placement policy. Furthermore, programs exhibit different execution phases, and data access characteristics may differ from phase to phase; hence, a static data placement policy may be suboptimal.

Existing deployed data placement strategies consider two factors to provide optimal NUMA machine performance. Firstly, threads and data are placed in the same NUMA node to increase locality and avoid remote access overhead. Secondly, data is distributed across the system’s nodes to avoid bandwidth saturation. This section will review various data placement policies.
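The tension between these two factors can be seen in a small sketch. The functions below are hypothetical and model no particular operating system or runtime: `place_local` puts every page on the accessing thread's node, and `place_interleaved` spreads pages round-robin across nodes. The sketch assumes a single accessing thread and treats page identifiers as plain integers.

```python
from collections import Counter


def place_local(page_ids, accessing_node):
    """Locality-first placement: every page goes to the accessing thread's node."""
    return {page: accessing_node for page in page_ids}


def place_interleaved(page_ids, num_nodes):
    """Bandwidth-first placement: pages are spread round-robin over all nodes."""
    return {page: i % num_nodes for i, page in enumerate(page_ids)}


def remote_fraction(placement, accessing_node):
    """Fraction of pages that live on a node other than the accessing thread's."""
    remote = sum(1 for node in placement.values() if node != accessing_node)
    return remote / len(placement)


pages = list(range(8))
local = place_local(pages, accessing_node=0)
interleaved = place_interleaved(pages, num_nodes=4)

local_remote = remote_fraction(local, accessing_node=0)
interleaved_remote = remote_fraction(interleaved, accessing_node=0)
pages_per_node = Counter(interleaved.values())
print(local_remote, interleaved_remote, dict(pages_per_node))
```

Under these assumptions, local placement avoids remote accesses entirely but concentrates all pages on one node's memory, whereas interleaving balances the load across nodes at the price of making most accesses remote.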

In the late 1990s, remote accesses in NUMA systems took 3 to 5 times longer than local accesses [Verghese et al., 1996]. This overhead was due to legacy wiring techniques, which resulted in major delays in the interconnection links between nodes. To overcome this issue, a large body of research attempts to improve locality by placing data in the local node of the core accessing it.

Several techniques have been developed to improve data locality. To avoid the large remote/local access latency ratio, memory pages can freely move between NUMA nodes to enable local access. This technique is called memory page migration. A memory page that is frequently accessed by a remote core is migrated to the core’s node.

Earlier memory page migration policies were developed in the context of non-cache-coherent NUMA systems, for example, Bolosky et al. [1991] and LaRowe et al. [1991]. In these systems, kernel-based NUMA management policies are modified to explicitly move memory pages in response to page fault events. LaRowe et al. [1991] exploit page fault signals and modify operating system memory management modules to implement a parameterized memory page migration policy. For instance, a memory page shared between NUMA nodes may be accessed by different nodes at different times. Thus, the memory page may bounce between NUMA nodes and affect access latency. To prevent actively shared memory pages from bouncing between NUMA nodes, they set a freeze window that suspends migration of the page for a certain period and then defrost the page. They conclude that tunable dynamic memory page migration can have a dramatic impact on application performance.
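The freeze-window idea can be sketched as follows. This is a simplified illustration, not the policy of LaRowe et al. [1991]: the class name, the `threshold` and `freeze_window` parameters, the use of a logical clock `now`, and the choice to freeze a page after every migration and to place it on the node of its first access are all assumptions made for the example.

```python
from collections import defaultdict


class FreezeWindowMigrator:
    """Toy page migration policy with a freeze window (hypothetical sketch)."""

    def __init__(self, threshold=2, freeze_window=10):
        self.threshold = threshold               # remote accesses needed to migrate
        self.freeze_window = freeze_window       # time a page stays pinned after moving
        self.home = {}                           # page -> current node
        self.remote_hits = defaultdict(int)      # (page, node) -> remote access count
        self.frozen_until = defaultdict(int)     # page -> time at which it defrosts

    def access(self, page, node, now):
        """Record an access and return the node holding the page afterwards."""
        home = self.home.setdefault(page, node)  # assumed: first access places the page
        if node == home:
            return home                          # local access, nothing to do
        if now < self.frozen_until[page]:
            return home                          # frozen: serve remotely, do not migrate
        self.remote_hits[(page, node)] += 1
        if self.remote_hits[(page, node)] >= self.threshold:
            self.home[page] = node               # migrate to the frequent remote accessor
            self.frozen_until[page] = now + self.freeze_window
            for key in [k for k in self.remote_hits if k[0] == page]:
                del self.remote_hits[key]        # start counting afresh after the move
        return self.home[page]


m = FreezeWindowMigrator(threshold=2, freeze_window=10)
accesses = [(0, 0), (1, 1), (1, 2), (0, 3), (0, 4), (0, 13), (0, 14)]  # (node, time)
trace = [m.access("p", node, now) for node, now in accesses]
print(trace)
```

In the trace, the page moves to node 1 at time 2 and then stays there while node 0 accesses it at times 3 and 4, because the page is frozen until time 12. Only after it defrosts do node 0's accesses count toward moving it back.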

Bolosky et al. [1991] use reference traces from a variety of applications to drive simulations of different NUMA page placement policies. In addition, they employ a cost/benefit model to decide whether the cost of moving a memory page outweighs the cost of remote memory access overhead. Chandra et al. [1994] investigate the effectiveness of using TLB misses as an indicator for memory page migration on cache-coherent NUMA systems. However, they report no improvement in response time for the workloads due to internal issues of their virtual memory system. Instead, they carried out a trace-driven study