4.6 DEEP Booster Architecture
4.6.3 Prototype Performance Evaluation
This section describes the test environment and the performance evaluation of internode MIC-to-MIC communication. The evaluation uses micro-benchmarks of the Extoll software stack, the OSU Micro-Benchmarks, and the LAMMPS molecular dynamics simulator.
4.6.3.1 Hardware Environment
The BIC prototype node is a standard server machine with two Intel Xeon E5-2630 processors running at 2.30 GHz and 128 GB of memory. An Extoll NIC of the Galibier [107] generation is used, which utilizes a Xilinx Virtex6 FPGA design with a 128 bit-wide data path running at 156.25 MHz and one x4 16 Gbits/s Extoll link. The Galibier card has a memory-mapped region of 16 GB, which can be used to manage two Intel Xeon Phi coprocessors (MICs) with 8 GB GDDR RAM each. The network link of the Galibier card is connected to the seventh link of the BNC to get access to the 3D torus. The test environment contains one BNC with two Altera StratixV FPGAs. The FPGAs implement a Galibier-compatible 128 bit-wide data path running at 100 MHz and seven x4 16 Gbits/s links. Each StratixV Extoll NIC is connected to one MIC through an x8 PCIe Gen2 host interface. The StratixV FPGAs are connected to each other via an Extoll x4 16 Gbits/s link.
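As a rough plausibility check, the raw throughput of an internal data path is its width multiplied by its clock frequency. The sketch below applies this to the two FPGA designs described above. The function name is illustrative, and protocol and encoding overhead are ignored, so the numbers are upper bounds rather than measured values.

```python
def datapath_gbit_per_s(width_bits: int, clock_mhz: float) -> float:
    """Raw data path throughput in Gbits/s (width x clock, no overhead)."""
    return width_bits * clock_mhz * 1e6 / 1e9


# Width and clock values are taken from the hardware description above.
virtex6 = datapath_gbit_per_s(128, 156.25)   # Galibier NIC in the BIC
stratix5 = datapath_gbit_per_s(128, 100.0)   # StratixV NICs in the BNC

print(f"Virtex6:  {virtex6:.1f} Gbits/s ({virtex6 / 8:.2f} GB/s)")
print(f"StratixV: {stratix5:.1f} Gbits/s ({stratix5 / 8:.2f} GB/s)")
```

Under these assumptions, the StratixV data path is bounded at 1.6 GB/s, which lies above the peak RMA bandwidth of about 1.2 GB/s reported in Section 4.6.3.3.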
4.6.3.2 Software Environment
The BIC runs CentOS 6.3 with kernel version 2.6.32-279.19.1.el6.x86_64 as the operating system, and has Intel MPSS 2.1.6720-16 and Intel Composer_xe_2013_sp1.2.144 installed. The micro-benchmarks of the Extoll software stack are used to evaluate the latency and bandwidth of MIC-to-MIC communication over Extoll. In addition, the MPI performance is evaluated between two MICs connected over Extoll (using OpenMPI 1.6.1) and between two MICs directly connected to the BIC utilizing SCIF and OFED/SCIF (using Intel MPI Library 4.1.3.049 and OFED-1.5.4.1). The OpenFabrics Enterprise Distribution (OFED) [108] provides an open-source software solution for RDMA and kernel bypass applications. The Symmetric Communications InterFace (SCIF) [87] is used for internode communication within a single system. Four different system configurations are used:
Booster This setup is the DEEP prototype booster system connecting two BNs with one BNC.
TCP/SCIF In this setup, two MICs are directly connected to the BIC over PCIe. The BIC acts as the host and runs the Intel MPSS without any OFED support. All communication is tunneled over SCIF.
OFED/SCIF The setup is an optimized version of the mic0-mic1 TCP/SCIF setup, where the Intel MPSS is run on top of the OFED stack. The communication is virtualized over the OFED/SCIF software stack, which implements RDMA by virtualizing direct access to a hardware InfiniBand Host Channel Adapter (HCA) between two MICs.
rEXTOLL Two hosts, equivalent to the BIC, are connected over Extoll. Each host has one MIC attached over PCIe.
4.6.3.3 Micro-benchmark Evaluation
For the micro-benchmark experiments, two prototype BNs are used, denoted as mic0 and mic1. The micro-benchmarks are launched on mic0. The communication is set up over the low-level user-space library of the Extoll NIC. Figure 4.18a displays the results of the latency benchmark for the FPGA-based booster implementation. For messages smaller than 64 B between mic0 and mic1, the latency performance of VELO outperforms the RMA unit by about 50%. Figure 4.18b presents the bandwidth results. For small messages, VELO provides a better bandwidth than RMA. The peak bandwidth provided by RMA is about 1.2 GB/s. These performance results have a direct impact on the direct accelerator-to-accelerator communication results. All communication traffic between the BNs is tunneled over Extoll.
4 Network-Attached Accelerators
(a) Latency. (b) Bandwidth.
Figure 4.18: Micro-benchmark performance of internode MIC-to-MIC communication using the Extoll interconnect.
4.6.3.4 MPI Performance Evaluation
The point-to-point MPI benchmarks of the OSU Micro-Benchmarks 4.3 (OMB) [74] are used for the evaluation. Each benchmark is run 100 times. The results in the graphs are calculated as the arithmetic mean of the runs. All benchmark results are verified with the Intel MPI Benchmark 3.2.3 (IMB) [109].
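A ping-pong latency test typically reports half of the measured round-trip time, and the graphs in this section average that value over repeated runs. The following minimal sketch illustrates both steps. The function names and the timing values are hypothetical and are not taken from the OMB sources or from the measurements.

```python
from statistics import mean


def half_round_trip_us(total_s: float, iterations: int) -> float:
    """One-way latency estimate in microseconds from a ping-pong loop.

    total_s is the wall time needed for `iterations` full round trips.
    """
    return total_s / (2 * iterations) * 1e6


def average_over_runs(per_run_us: list[float]) -> float:
    """Arithmetic mean across repeated benchmark runs."""
    return mean(per_run_us)


# Hypothetical timings: three runs of 1000 ping-pong iterations each.
runs = [half_round_trip_us(t, 1000) for t in (0.0100, 0.0104, 0.0096)]
print(f"mean half round-trip latency: {average_over_runs(runs):.2f} us")
```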
Latency Figure 4.19a displays the half round-trip latency results for small messages (<2 KB). The results for mic0-mic1 TCP/SCIF are not displayed because the half round-trip latency is too large (>300 µs). Even though the prototype only uses an FPGA implementation of the NIC, the half round-trip latency of the booster architecture is lower than the latency measured with the OFED/SCIF software stack. Furthermore, Figure 4.19b shows that the half round-trip latency for large messages is also competitive with OFED/SCIF, although the bandwidth of the underlying hardware (PCIe Gen2 x16) is much higher (>6 GB/s).
With the ASIC implementation of Extoll, the latency is expected to be even lower. The mic0-mic1 TCP/SCIF bandwidth is only displayed as a reference, since its performance is very poor. This is probably because standard kernel-level TCP/IP communication has to be performed on the MIC.
(a) Small messages. (b) Large messages.
Figure 4.19: Half round-trip latency performance of internode MIC-to-MIC communication using MPI.
Bandwidth Figure 4.20 displays the performance results of the bandwidth and bidirectional bandwidth tests. The FPGA implementation is competitive with the OFED/SCIF solution; the peak bandwidth corresponds to the measured low-level performance of the RMA unit. It is noteworthy that the peak MPI bandwidth using OFED/SCIF over PCIe is unable to utilize the bandwidth of the underlying PCI Express fabric.
(a) Bandwidth. (b) Bidirectional bandwidth.
Figure 4.20: Bandwidth and bidirectional bandwidth performance of internode MIC-to-MIC communication using MPI.
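Windowed bandwidth tests generally derive throughput from the total number of bytes sent and the elapsed time. How far a transport falls short of the underlying fabric can then be expressed as a utilization ratio. The sketch below is a simplified illustration under that assumption. The function names, the window parameters, and the 2.0 GB/s sample value are hypothetical, and only the 6 GB/s figure is the lower bound quoted above for PCIe Gen2 x16.

```python
def bandwidth_mb_per_s(msg_bytes: int, window: int, iterations: int,
                       elapsed_s: float) -> float:
    """Throughput in MB/s (1 MB = 10**6 bytes) for a windowed send loop."""
    return msg_bytes * window * iterations / elapsed_s / 1e6


def utilization(measured_gb_per_s: float, available_gb_per_s: float) -> float:
    """Fraction of the available fabric bandwidth that was achieved."""
    return measured_gb_per_s / available_gb_per_s


# Hypothetical run: 1 MiB messages, 64 per window, 20 windows in 1.0 s.
bw = bandwidth_mb_per_s(1 << 20, 64, 20, 1.0)
print(f"{bw:.1f} MB/s")

# Hypothetical 2.0 GB/s measurement against a 6 GB/s fabric.
print(f"{utilization(2.0, 6.0):.0%} of the fabric bandwidth")
```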
4.6.3.5 Application Level Evaluation
In addition to the micro-benchmark and MPI performance evaluation, the communication architecture for network-attached accelerators is evaluated using a life science application. The MPI version of the LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) molecular dynamics simulator is used. LAMMPS is a classical molecular dynamics code [110]. It is written in C++ and MPI. The benchmark considered for the evaluation simulates a bead-spring polymer melt of 100-mer chains, with finite extensible nonlinear elastic (FENE) bonds, Lennard-Jones interactions with a 2^(1/6)σ cutoff (5 neighbors per atom), and micro-canonical (NVE) integration.
Table 4.1: Description of LAMMPS timings output.
Name    Description
Loop    Total time spent in benchmark.
Comm    Time spent in communications.
Bond    Time spent computing forces due to covalent bonds.
Pair    Time spent computing pairwise interactions.
Neigh   Time spent computing new neighbor lists.
Outpt   Time to output restart, atom position, velocity and force files.
Other   Difference between loop time and all other times listed.
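The relationship between the entries of Table 4.1 can be sketched as follows: Other is whatever remains of the loop time after the listed categories are subtracted, and the communication share is Comm divided by Loop. The function names and the sample timings are hypothetical and do not correspond to measured values.

```python
def other_time(timings: dict[str, float]) -> float:
    """'Other' = Loop minus all separately listed categories."""
    listed = ("Pair", "Bond", "Neigh", "Comm", "Outpt")
    return timings["Loop"] - sum(timings[name] for name in listed)


def comm_share(timings: dict[str, float]) -> float:
    """Fraction of the total loop time spent in communication."""
    return timings["Comm"] / timings["Loop"]


# Hypothetical timings in seconds, for illustration only.
sample = {"Loop": 10.0, "Pair": 2.0, "Bond": 0.5,
          "Neigh": 1.0, "Comm": 6.0, "Outpt": 0.25}

print(f"Other: {other_time(sample):.2f} s")
print(f"Comm share: {comm_share(sample):.0%}")
```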
Figure 4.21 shows the impact of the communication architecture on the communication time for the LAMMPS bead-spring polymer melt benchmark. The benchmark is run with 8, 16, 32, and 64 threads in total, distributed equally across the two MICs. It can be observed that the communication time for 32 threads/MIC is improved by 32%, while smaller runs with up to 4 threads/MIC improve the communication time by up to 47%. Furthermore, running the LAMMPS simulator with a Lennard-Jones (LJ) benchmark (atomic fluid, LJ potential with 2.5σ cutoff (55 neighbors per atom), NVE integration) and the embedded atom model (EAM) metallic solid benchmark (metallic solid, copper EAM potential with 4.95 Angstrom cutoff (45 neighbors per atom), NVE integration) results in a similar improvement of communication time.
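Improvements of this kind are commonly expressed as the relative reduction of the communication time with respect to a baseline configuration. The short sketch below shows the calculation. The function name and the timing values are hypothetical, and the choice of baseline is an assumption, since the paragraph above does not name the reference configuration.

```python
def relative_improvement(baseline_s: float, improved_s: float) -> float:
    """Relative reduction of a timing with respect to a baseline."""
    return (baseline_s - improved_s) / baseline_s


# Hypothetical communication times in seconds, for illustration only.
gain = relative_improvement(baseline_s=10.0, improved_s=6.8)
print(f"communication time improved by {gain:.0%}")
```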
Figures 4.22a–4.22c display the overall application time for the bead-spring polymer melt, Lennard-Jones, and copper metallic solid benchmarks run with 32 threads/MIC. Table 4.1 summarizes the timings shown in Figure 4.22. It can be seen that most of the application time is spent in communication. Therefore, optimizing the internode MIC-to-MIC communication time plays a crucial part in optimizing the overall application execution time.
4.6.3.6 Comments
Compared to the FPGA, the ASIC version of Extoll offers vastly improved network performance with its 128 bit-wide data path running at 750 MHz. As a result, the network link will provide a bandwidth of approximately 100 Gbits/s. As mentioned before, MVAPICH2-MIC is a proxy-based implementation of the MVAPICH2 MPI library. It reports a unidirectional bandwidth of up to 5.2 GB/s for internode MIC-to-MIC communication with InfiniBand HCAs. Using Tourmalet is expected to increase the peak bandwidth sevenfold.
(a) 2 MICs: 4 Threads/MIC. (b) 2 MICs: 8 Threads/MIC.
(c) 2 MICs: 16 Threads/MIC. (d) 2 MICs: 32 Threads/MIC.
Figure 4.21: Communication time for the bead-spring polymer melt benchmark.
(a) Bead-spring polymer melt. (b) Lennard-Jones.
(c) Copper metallic solid.
Figure 4.23: Production-ready GreenICE cube.