Parallel Architectures - Garbage collection optimization for non uniform memory access architec

1 10 16 16 22 3 16 10 16 22 4 16 16 10 16 6 22 22 16 10

Table 3.1: NUMA delay time between nodes.

Figure 3.1: The performance gab between processor and access to main memory. Source: [Hennessy and Patterson, 2011a]

memory allocation policy may separate data from processors and increase memory access latency. Therefore, memory allocation policies need to be aware of the underlying hardware to reduce the overhead of remote memory accesses.

This chapter describes related the hardware and software components that affect an appli- cation’s data placement. Section 3.2 describes parallel architecture evolution. Section 3.3 highlights NUMA architecture and low level hardware specifications. In particular, this section is dedicated to the AMD Opteron processor architecture, which is the platform used for all experiments in this dissertation. An overview of Linux memory allocation policies is provided in Section 3.4, then Section 3.5 discusses the virtual to physical memory page map- ping mechanisms. Section 3.6 describes the implementation of garbage collection policies in the OpenJDK Hotsopt JVM. Specifically, it considers the Parallel Scavenge collector: a Stop-The-World garbage collector and reviews implementation details from the source code.

3.2 Parallel Architectures

The current trend to support scalable performance is to augment the number of cores per processor. These cores can be organized in different ways according to their data flow and control flow. Flynn [1972] identifies four categories of parallel architectures:

3.2. PARALLEL ARCHITECTURES

1. Single-Instruction stream, Single-Data stream (SISD): This architecture consists of a single processing element and can access a program and data storage. For example, unicore processors implement SISD architecture and represent conventional sequential computers following the Von Neumann model.

2. Multiple-Instruction stream, Single-Data stream (MISD): There are multiple processing elements, each executing its own program. However, they have single access to data in the global memory. Processors may execute different instructions but they have identical data as operand. For example, systolic arrays include a network of hard-wired processor nodes to perform a specific operation such as parallel convolu- tion tasks. This type of architecture is very limited and has not been built commercially [Rauber and R¨unger, 2010a].

3. Single-Instruction stream, Multiple-Data stream (SIMD): In this architecture, multiple processing elements execute the same instruction stream but different data is loaded from global memory with private access. Vector processors are good examples of this category.

4. Multiple-Instruction stream, Multiple-Data stream (MIMD): Here, multiple processing elements load separate programs and separate data from the global memory. They work asynchronously. Multicore processors are example of MIMD category.

General-purpose computers largely adopt the MIMD model for parallelism. With its widespread implementation, MIMD computers can be further classified into two categories according to two aspects of their memory organization: the virtual and the physical memory. Proces- sors in MIMD machines could have physically shared memory, called multiprocessors, or physically distributed memory which are called multicomputers. From a virtual view of the memory, MIMD computers could use shared address space or distributed address space. The two views of the memory (physical and virtual) need not be the same; a system with shared virtual address space can run on top of physically distributed memory.

3.2.1 Distributed Memory Architectures

A distributed memory system consists of a number of nodes, each node contains a processing element, local memory, and possibly I/O elements. Nodes are connected via an interconnection network that communicates data between nodes. Data stored in local memory is private to its processor. When a processor needs data from other nodes, it typically exchange send/receive messages with the target node. Therefore, message passing is the preferred parallel programming model for distributed memory systems [Rauber and R¨unger, 2010b].

3.2. PARALLEL ARCHITECTURES

3.2.2 Shared Memory Architectures

A shared memory machine consists of a number of processing elements, a global shared memory, and an interconnection network to connect processors with the global memory. The global memory in modern machines is implemented as a set of multiple memory modules. Data communication between processors is performed by writing or reading from shared variables. Write operations to shared variables must not be concurrent; otherwise a race condition would occur with an unpredictable result.

The shared memory model can form two different architectures with reference to the memory access latency: Uniform Memory Access (UMA) and Non-Uniform Memory Access (NUMA) architectures. In UMA architecture, processors are connected to the shared memory via a central bus, Front-Side Bus. The distance between processors and the shared memory is uniform; therefore processors have equal access latency to the memory. This central bus provides constant bandwidth to all processors; thus it increases access collisions and causes additional access latency. In addition, there is no private memory for processors, but they use a cache hierarchy to accelerate access to data. Hierarchical organization of processors, cores, and caches imposes communication overhead between processing elements based on the distance between each other [Cruz et al., 2010]. For instance, the AMD Bull- dozer micro-architecture incorporates a shared L2 cache between every two cores. A data placement policy should consider such a design to improve cache efficiency.

A Symmetric Multiprocessor (SMP) is an implementation of UMA shared memory architecture, where multiple processing elements “cores” are integrated in a single die [Gepner and Kowalik, 2006]. Each core is considered as a processor and has its own resources. In addition, each core has its own cache subsystem and may share part of the cache hierarchy with other cores in the same die. SMPs usually employ a small number of processors be- cause adding more processors to the SMP chip would increase memory access collisions on the central bus. Consequently, scalability of SMP processors is limited [Esmaeilzadeh et al., 2012]. The maximum number of processors in a bus-based SMPs is between 32 and 64 [Rauber and R¨unger, 2010a]. Figure 3.2 depicts a standard multicore UMA architecture. For more scalable SMP processors, a processor can integrate a number of cores in a single chip and distribute the memory among the cores. The memory address space is shared between cores, however, the memory access latency is non-uniform. A core may exhibit long access latency time to access remote memory. To minimize remote memory access overhead, the cores may use a cache subsystem. Cores must ensure that a memory address contains the most recently updated value by using a suitable cache coherency protocol. This kind of architecture is called cache coherent NUMA (ccNUMA). Figure 3.3 presents an example of multi-hop NUMA architecture.

3.2. PARALLEL ARCHITECTURES

Core

Memory

Core Core Core

BU

S

Multicore Processor

Cache Cache Cache Cache Shared Cache

Figure 3.2: A standard multicore UMA architecture diagram.

Core M em ory Core Core Core Mul tic o re P ro ce ss o r Cache Core Memory

Core Core Core

Multicore Processor Cache Core M em ory Cor e Core Core Mul tic o re P ro ce ss o r Cache Core Memory Core Core Core Multicore Processor Cache

In document Garbage collection optimization for non uniform memory access architectures (Page 51-55)