Coding for the High Performance Computing Environment

Chapter 3 Simulation Method

3.2 Coding for the High Performance Computing Environment

The programming scope of this project fits squarely into the realm of Big Data. Big Data is defined as “Information assets characterized by such a high volume, velocity and variety to require specific technology and analytical methods for its transformation into value” [13]. First, over one million dumbbells are simulated in this project to provide clear results from the stochastic differential equations describing them, which fulfills the volume requirement. Second, to do this efficiently, the parallel computation is written in CUDA C, which requires an understanding of the memory and execution architecture of the hardware [11]. Moreover, the scale necessitates moving to a high performance computing (HPC) cluster, where GPU accelerators and file storage systems can process and store the large amounts of data produced. These are the specific technological requirements. Finally, a second set of MATLAB codes is used to analyze the data and produce useful information. Together these tools form the complex workflow necessary for Big Data. Indeed, the results achieved in this project are not currently possible with routine coding methods and computation platforms.

The simulation code consists of about 3400 lines of CUDA C. The cuRAND library is used to generate random numbers on the GPU [36]. Beyond this, however, no special packages or libraries are employed. The main execution code follows a macro-micro loop design, macro for the CPU and micro for the GPU (see Figure 3.1). The main execution loop is as follows: code on the CPU sends a full set of the dumbbell data to the GPU to be evolved for a set number of time steps; the GPU evolves each dumbbell for that number of time steps in parallel and then returns the data to the CPU; the CPU records the new configuration and updates the old one; the loop repeats until the desired time is reached. At the end of the simulation, only the configurations recorded on the CPU are written to the CSV file for output. The upshot of this arrangement is that it limits how often large amounts of data are sent between the CPU and GPU, allowing for efficient computation. The downside is that intermediate configuration changes on the GPU are not recorded.
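The macro-micro loop described above can be sketched in CUDA C as follows. This is a minimal illustration under stated assumptions, not the simulation code itself: the `Dumbbell` struct, the kernel and function names, the seed, and the placeholder update rule (a linear spring with Brownian noise) are all hypothetical, and the species logic of Figure 3.1 is omitted.

```cuda
#include <cassert>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Hypothetical per-dumbbell state: a single end-to-end vector.
struct Dumbbell { float x, y, z; };

// One cuRAND state per dumbbell, initialised once before the macro loop.
__global__ void init_rng(curandState *rng, unsigned long long seed, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) curand_init(seed, i, 0, &rng[i]);
}

// Micro loop: each thread evolves one dumbbell for n_micro steps on the GPU.
__global__ void evolve(Dumbbell *d, curandState *rng, int n, int n_micro, float dt) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    Dumbbell q = d[i];
    curandState s = rng[i];
    const float amp = sqrtf(dt);
    for (int k = 0; k < n_micro; ++k) {
        // Placeholder Euler-Maruyama update (assumption, not the thesis model).
        q.x += -0.5f * q.x * dt + amp * curand_normal(&s);
        q.y += -0.5f * q.y * dt + amp * curand_normal(&s);
        q.z += -0.5f * q.z * dt + amp * curand_normal(&s);
    }
    d[i] = q;    // intermediate micro-step configurations are not kept
    rng[i] = s;  // preserve the RNG stream for the next macro step
}

// Macro loop on the CPU. Appends the mean squared extension after each macro
// step to `history`; returns the number of recorded steps, or -1 on a CUDA error.
int run_macro_micro(int n, int n_macro, int n_micro, float dt,
                    std::vector<float> &history) {
    assert(n > 0 && n_macro >= 0 && n_micro > 0);
    std::vector<Dumbbell> host(n, Dumbbell{1.0f, 0.0f, 0.0f});
    const size_t bytes = n * sizeof(Dumbbell);
    Dumbbell *dev = nullptr;
    curandState *rng = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&rng, n * sizeof(curandState)) != cudaSuccess) {
        cudaFree(dev);
        return -1;
    }
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    init_rng<<<blocks, threads>>>(rng, 1234ULL, n);

    int recorded = 0;
    for (int m = 0; m < n_macro; ++m) {
        cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);  // data to GPU
        evolve<<<blocks, threads>>>(dev, rng, n, n_micro, dt);        // micro loop
        // The blocking copy back also waits for the kernel to finish.
        if (cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost) != cudaSuccess) {
            recorded = -1;
            break;
        }
        double sum = 0.0;  // metadata computed on the CPU from the returned data
        for (const Dumbbell &q : host) sum += q.x * q.x + q.y * q.y + q.z * q.z;
        history.push_back(static_cast<float>(sum / n));
        ++recorded;
    }
    cudaFree(rng);
    cudaFree(dev);
    return recorded;
}
```

In this sketch, a call such as `run_macro_micro(1024000, 100, 1000, 0.001f, history)` would perform 100 host-device round trips for 100000 time steps in total. The parameter values are illustrative only.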

In certain configurations the simulation code can quickly produce large amounts of data. At 20 bytes of data per dumbbell per time step, it is easy to see how 1024000 dumbbells quickly add up over the course of simulations that have an average of 10^8 time steps (that is 2048 TB of data from a single run). The macro-micro loop itself helps to limit the data size; however, it alone is not enough. In order to manage and derive information from the output, much of the analysis revolves around metadata, such as the calculated stress, average lengths, average angle variance, etc. These results are stored in the output CSV file. When the configuration of every dumbbell is needed, it is recorded at specific intervals and over time periods of interest, such as the steady state. This type of data was used to create figures such as the dumbbell configuration histograms in Figure 4.15. It is also possible to track every change of a single dumbbell, as displayed in Figure 4.16. Enabling these features results in a bin file that is produced concurrently with the CSV output.
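The data-volume estimate can be checked directly. The calculation below takes the dumbbell count as 1024000 (the value given in the caption of Figure 3.1), the number of time steps as $10^{8}$, and 1 TB as $10^{12}$ bytes. The symbol $N_{\text{micro}}$, the number of micro time steps per recorded macro step, is introduced here for illustration only.

```latex
\[
20~\frac{\text{bytes}}{\text{dumbbell}\cdot\text{step}}
\times 1\,024\,000~\text{dumbbells}
\times 10^{8}~\text{steps}
= 2.048 \times 10^{15}~\text{bytes}
= 2048~\text{TB}
\]
\[
\text{Recording only once per macro step:}\qquad
\frac{2048~\text{TB}}{N_{\text{micro}}}
\]
```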

Simulations were run in batches on multi-core nodes with Nvidia Tesla M2090, K80 and P100 GPU accelerators. Since the memory footprint on the GPU is small and the CPU core is usually fully utilized, multiple simulations were run simultaneously, depending on the environment. Table 3.1 lists the CPU-GPU combinations used in this work. Simulations were found to run significantly faster when writing to local storage. On certain file systems, writing to non-local storage was enough to slow down a large cluster, and therefore care should be exercised if this is the case. Computation time ranges from minutes to days depending on simulation parameters, with low flow rate steady shear flow and high frequency SAOS requiring the most time. Improving code performance beyond usable runtimes was not the primary concern of this work, and thus there are many areas for improvement.
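The text does not say how simultaneous runs were assigned to GPUs. One simple possibility, shown here purely as an assumption, is to distribute runs round-robin over the visible devices using the CUDA runtime calls `cudaGetDeviceCount` and `cudaSetDevice`. The function name `select_device_for_run` and the `run_index` parameter are hypothetical.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: assign a run to one of the visible GPUs round-robin.
// Returns the chosen device index, or -1 if the index is invalid or no
// usable device is found.
int select_device_for_run(int run_index) {
    if (run_index < 0) return -1;
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) return -1;
    int device = run_index % count;  // e.g. 4 runs on 2 devices gives 2 per device
    if (cudaSetDevice(device) != cudaSuccess) return -1;
    return device;
}
```

With two visible devices, run indices 0 to 3 would map to devices 0, 1, 0, 1, which is one way to arrive at two runs per device.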

[Figure 3.1 flow chart. CPU code boxes: Parse Input; Initial File I/O; Allocate Memory; Initialize RNGs (CPU and GPU); Generate Initial State; Main Loop (Macro Time Step) with Data to GPU, Data to CPU, Record Configuration Data, Calculate Metadata (i.e. Stress), and End Macro Loop? (Yes/No); File I/O; Close Down. GPU code boxes: GPU Loop (Micro Time Step) with Determine Species Type, Evolve Active, Evolve Dangling, Transform Species, No Change, and End Micro Loop? (Yes/No).]

Figure 3.1: Code flow chart. The design of the code can be divided into CPU (left) and GPU (right) parts. Sending data between the CPU and GPU is a time intensive operation. The loop on the GPU represents a single thread and is run independently for each of the 1024000 dumbbells in the simulation. The term ‘RNG’ refers to random number generator.

Table 3.1: Table of computation environments and run combinations. The code has a small memory footprint on the GPU but fully utilizes the CPU core. Therefore, to saturate the hardware, multiple runs were executed simultaneously.

CPU                             GPU                  Number of Simultaneous Simulations
Intel Xeon X5660 @2.8GHz        Nvidia Tesla M2090   1
Intel Xeon E5-2620 v3 @2.4GHz   Nvidia Tesla K80     4 (2 per GPU core)
Intel Xeon E5-2680 v4 @2.4GHz   Nvidia Tesla P100    3
