Background and Related Work - Improving Storage Performance with Non-Volatile Memory-based Caching Systems

5.2.1 Checkpoint/Restart

There are two types of C/R tools: application-level and system-level. Application-level C/R tools are built into the applications themselves and store only the data needed for restart, so the checkpoint data can be very small. System-level C/R tools are transparent to applications and usually checkpoint the whole memory space touched by the applications, so the checkpoint data can be much larger. System-level C/R tools are used to checkpoint applications that have no innate C/R functionality.
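To make the application-level case concrete, the following is a minimal sketch, not taken from any particular HPC application. It assumes a toy computation (a running sum) whose entire restart state is a loop counter and a running total; the function names `save_checkpoint`, `load_checkpoint`, and `run` are hypothetical.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write only the state needed to resume, replacing the old file atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    # Rename over the old checkpoint so a crash never leaves a partial file.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, checkpoint_every):
    """Sum 0..total_steps-1, resuming from the latest checkpoint if one exists."""
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < total_steps:
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

The checkpoint here holds two integers regardless of how much memory the process uses, which is the reason application-level checkpoints can stay small. A system-level tool has no such knowledge of the application and therefore saves the memory image as a whole.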

Here, we use a very popular system-level C/R tool, DMTCP (Distributed Multi-Threaded CheckPointing) [99], as a reference to explain how C/R tools work. DMTCP runs in user space, does not require root privilege, and is independent of the system kernel version, which makes it very flexible and user-friendly. DMTCP has a dmtcp_coordinator process, which must be started before running dmtcp_checkpoint or dmtcp_restart. Checkpoints can be performed automatically at a fixed interval, or they can be initiated manually on the command line of the dmtcp_coordinator. Once a checkpoint request is issued, the dmtcp_coordinator informs all the corresponding processes to halt, and each process generates a checkpoint image individually. At the same time, a script is created for restart purposes.

[Figure: timeline of compute nodes 1 through x running Application A, Application B, ..., Application N; the phases shown along the Time axis are Reading, Computation, and Checkpointing.]

Figure 5.1: An example of HPC application execution patterns
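The coordination steps described above can be sketched as a toy model. This is not DMTCP's implementation or API: the classes `Process` and `Coordinator` and all of their methods are hypothetical, and the "image" is an in-memory copy rather than a file on disk.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical stand-in for one process of a checkpointed application."""
    pid: int
    memory: dict = field(default_factory=dict)
    running: bool = True

    def halt(self):
        self.running = False

    def write_image(self):
        # A system-level tool saves the whole memory space, not selected variables.
        return {"pid": self.pid, "memory": dict(self.memory)}

    def resume(self):
        self.running = True


class Coordinator:
    """Toy coordinator: it must exist before any process can be checkpointed."""

    def __init__(self):
        self.processes = []

    def register(self, process):
        self.processes.append(process)

    def checkpoint(self):
        # Step 1: tell every registered process to halt.
        for p in self.processes:
            p.halt()
        # Step 2: each process generates its own checkpoint image.
        images = [p.write_image() for p in self.processes]
        # Step 3: record what a restart needs (here, just the pids of the images).
        restart_plan = [image["pid"] for image in images]
        # Step 4: let the computation continue.
        for p in self.processes:
            p.resume()
        return images, restart_plan

    @staticmethod
    def restart(images):
        """Rebuild processes from their checkpoint images."""
        return [Process(pid=i["pid"], memory=dict(i["memory"])) for i in images]
```

Halting every process before any image is written is what keeps the per-process images mutually consistent; in this sketch the `restart_plan` list plays the role of the restart script mentioned above.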

5.2.2 HPC Application Characteristics

In a typical HPC cluster with hundreds or thousands of compute nodes, usually there are tens or hundreds of applications running concurrently. We used the showq command to show the job queue of the Mesabi cluster at the Minnesota Supercomputing Institute and found that 636 active jobs were running [100]. Also, the online real-time job queue report of the Stampede supercomputer at the Texas Advanced Computing Center showed 699 active jobs were running [101].

Figure 5.1 is a simplified, high-level illustration of HPC application execution patterns. As shown in the figure, many applications, which start at different times, run in the cluster. These applications need to read data (usually from PFSs) and perform computation. Applications with C/R requirements perform checkpointing at frequencies set by the applications or users. After a checkpointing operation is done, the computation resumes. This pattern repeats until either the application finishes or a failure occurs, in which case the application restarts from the latest checkpoint image.

[Figure: a BB Coordinator; HPC Applications I, II, ..., N, each spanning several compute nodes with BBs and a BB CKPT Coordinator; a parallel file system with storage servers and metadata servers; the legend distinguishes control path and data path.]

Figure 5.2: An overview of the CDBB coordination system
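The read, compute, checkpoint, and restart cycle described above can be illustrated with a small simulation. This is a sketch under simplifying assumptions: time advances in ticks that count compute attempts only, checkpoints are taken every `checkpoint_interval` completed steps, and a failure before the first checkpoint restarts the application from the beginning. The function name `simulate` and its parameters are hypothetical.

```python
def simulate(total_steps, checkpoint_interval, failures):
    """Return (timeline, work_redone) for one application.

    failures is a set of ticks at which a failure strikes instead of a
    compute step completing.
    """
    timeline = ["read"]      # input is read once before computing
    done = 0                 # compute steps completed so far
    last_checkpoint = 0      # progress saved in the latest checkpoint image
    redone = 0               # completed steps lost to failures
    tick = 0
    while done < total_steps:
        tick += 1
        if tick in failures:
            # Roll back to the latest checkpoint image and try again.
            redone += done - last_checkpoint
            done = last_checkpoint
            timeline.append("restart")
            continue
        done += 1
        timeline.append("compute")
        if done % checkpoint_interval == 0 and done < total_steps:
            last_checkpoint = done
            timeline.append("checkpoint")
    return timeline, redone
```

For example, `simulate(6, 2, {4})` loses exactly one completed step: the failure at tick 4 strikes after three steps have been computed, and the latest checkpoint covers only the first two.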

Figure 5.1 clearly shows that the execution patterns of compute nodes assigned to the same application are quite similar to each other, whereas those of compute nodes assigned to different applications can be quite distinct. For example, while the compute nodes running Application A are performing checkpointing, the compute nodes running Application B are doing computation. In addition, some applications do not perform checkpointing at all, so they compute continuously until the end (Application N in Figure 5.1). These insights give CDBB opportunities to optimize BB utilization. If only one application is running in the whole cluster, or if all the applications in the cluster happen to have exactly the same execution pattern, then CDBB would offer little benefit, since all the BBs would be either in use or idle at the same time.
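The contrast between staggered and identical execution patterns can be made concrete with a short sketch. The schedules below are invented for illustration, with one phase per time slot, and the function name `checkpointing_per_slot` is hypothetical; the code only counts how many applications are checkpointing in each slot and does not model how CDBB itself assigns BBs.

```python
def checkpointing_per_slot(schedules):
    """Count, for each time slot, how many applications are checkpointing."""
    slots = len(next(iter(schedules.values())))
    return [
        sum(1 for phases in schedules.values() if phases[t] == "ckpt")
        for t in range(slots)
    ]


# Hypothetical schedules: A and B checkpoint in alternating slots; N never does.
staggered = {
    "A": ["comp", "ckpt", "comp", "ckpt"],
    "B": ["ckpt", "comp", "ckpt", "comp"],
    "N": ["comp", "comp", "comp", "comp"],
}

# Hypothetical schedules in which every application follows the same pattern.
identical = {
    "A": ["comp", "ckpt", "comp", "ckpt"],
    "B": ["comp", "ckpt", "comp", "ckpt"],
}

print(checkpointing_per_slot(staggered))  # [1, 1, 1, 1]
print(checkpointing_per_slot(identical))  # [0, 2, 0, 2]
```

In the staggered case only one of the three applications is checkpointing in any slot, so the BBs of the other two are idle and could in principle be put to use. In the identical case every BB is busy in the same slots and idle in the same slots, which leaves nothing to share.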

5.2.3 Non-volatile Memory

Current memory technologies such as DRAM and SRAM face technological limitations to continued improvement [31]. As a result, there are intense efforts to develop new DRAM-alternative memory technologies. Most of these new technologies are non-volatile memories, because non-volatility can provide additional advantages such as new power saving modes for quick wakeup as well as faster power-off recovery and restart for HPC applications [31]. These new technologies include PCM, STT-RAM, MRAM, RRAM, and 3D XPoint.

Phase Change Memory (PCM) is one of the most promising new NVM technologies and can provide higher scalability and storage density than DRAM [44, 45]. In general, however, PCM latency is still 5–10× longer than that of DRAM. To overcome PCM's speed deficiency, various system architectures have been designed to integrate PCM into current systems without performance degradation [25, 46, 47, 48, 49, 50, 51]. Magnetic RAM (MRAM) and Spin-Transfer Torque RAM (STT-RAM) are expected to replace SRAM and DRAM within the next few years [52, 53, 54]. STT-RAM reduces the transistor count and, consequently, provides a low-cost, high-density solution. Many enterprise and personal devices use MRAM as an embedded cache memory. Resistive RAM (RRAM) is considered a potential candidate to replace NAND Flash memory [55]. SanDisk and Hewlett Packard Enterprise are actively developing next-generation RRAM technology. Micron and Intel recently introduced 3D XPoint non-volatile memory technology, which is presently considered another DRAM alternative [56]. 3D XPoint technology has high endurance, high density, and promising performance that is much better than NAND Flash but slightly slower than DRAM. Thus, it is expected to target high-performance in-memory processing [57].
