Implementation - Algorithms and architectures for MCMC acceleration in FPGAs

3.5.1 IP implementation and FPGA system integration

All FPGA samplers were implemented in VHDL. The RIFFA framework (version 1.0) [143] was used for prototyping. RIFFA wraps the PT IP and uses a PCI-express connection to transfer data between the FPGA and the host PC. The framework handles all I/O modules on the hardware side and the software drivers on the host side. A small C program was written for the host side to initialize the FPGA, start the run and receive the outputs. In addition, a double-buffering architecture was built on top of RIFFA so that output data (MCMC samples and weights) can be sent to the host PC while the IP is still generating them. The measured FPGA-to-host throughput of the double-buffered PCI-express memory architecture is 120 MB/sec, which is close to the RIFFA throughput reported in [143]. This throughput is sufficient for all experiments presented in this chapter (i.e. I/O is not the bottleneck of the system), and it could be improved significantly by using more PCI-express lanes. All FPGA samplers were synthesized, placed and routed using Xilinx XPS 13.1. The clock frequency was set to freq = 210 MHz for all designs.

The sequence of operations for a complete run of the PT sampler is as follows. The C application on the host allows the user to select the parameters of the PT sampler (e.g. the constants N, B, Temp_{1:M}, θ_{1:M}^{(1)} and σ_{1:M}^2). The number of chains (M) must be fixed before synthesis. RIFFA driver functions are used to send the initialization data directly to the FPGA IP and to instruct the IP to start (no CPU operates on the FPGA). The IP performs the PT run and writes the output MCMC samples (and the weights for WPT) to the double buffer. The double buffers operate concurrently with the IP, sending data back to the host PC, where the RIFFA driver receives them and stores them in a text file.
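The double-buffering idea described above can be illustrated in software. The sketch below is a minimal Python model only; it is neither the VHDL design nor the RIFFA API. The function name `run_double_buffer`, the buffer length and the use of threads and queues are assumptions made for illustration. One thread stands in for the PT IP filling a buffer, while the other stands in for the PCI-express transfer draining the previously filled buffer.

```python
import queue
import threading


def run_double_buffer(samples, buffer_len=4):
    """Stream `samples` through two ping-pong buffers; return what the host receives."""
    free = queue.Queue()   # buffers the producer may fill
    full = queue.Queue()   # buffers ready to be sent to the host
    for _ in range(2):     # exactly two buffers: one fills while the other drains
        free.put([])
    received = []

    def producer():        # stands in for the sampler IP (hypothetical model)
        buf = free.get()
        for s in samples:
            buf.append(s)
            if len(buf) == buffer_len:
                full.put(buf)      # hand the filled buffer to the transfer side
                buf = free.get()   # blocks only if the other buffer is still busy
        full.put(buf)              # flush the final, possibly partial, buffer
        full.put(None)             # end-of-run marker

    def consumer():        # stands in for the transfer to the host
        while True:
            buf = full.get()
            if buf is None:
                break
            received.extend(buf)   # "store" the data on the host side
            buf.clear()
            free.put(buf)          # return the drained buffer for reuse

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received
```

In this model the producer stalls only when both buffers are occupied, which corresponds to the case where the link, rather than the sampler, limits the run.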

3.5.2 Platforms and devices

The performance of the proposed PT samplers is evaluated using a number of devices from each platform (Table 3.4 contains details). The devices represent recent and older generations of each platform. The two GPU generations roughly correspond to the two FPGA generations in terms of release dates. For the multi-core CPU, Intel Xeon devices were used with numbers of cores ranging from 4 to 20. Some of the devices consist of a pair of chips placed on separate sockets. All processors were installed on Imperial College’s High Performance Computing cluster. All runs were performed using 16 GB of RAM, and the code was compiled using Intel’s C++ compiler (ICC version 2015.1) with the -O3 optimization flag and with flags that optimize for the targeted CPU architecture. Every effort was made to select the combinations of optimizations (including Cilk optimizations) that maximize performance in each scenario.

For the GPU platform, measurements are presented from one device of the Nvidia GeForce 200 series (Tesla architecture) and five devices of the Nvidia GeForce 400 series (Fermi architecture). Actual runs were performed only for the C2050 model (hosted by an Intel Core 2 Q9550 CPU with 8 GB of RAM, running Linux). The remaining measurements came from the GPGPU-Sim simulator [144] (version 3.2.2). This simulator can construct a model of any GPU device by configuring various parameters in a text file. It then predicts the time required to run a CUDA kernel on that device. The reported prediction accuracy for the Tesla and Fermi architectures is 97-98%, which is sufficient for the purposes of this chapter. CUDA compute capability 1.3 (for Tesla) and 2.0 (for Fermi) were used. The Nvidia compiler (NVCC) [145] was used for compiling the GPU kernels and the Intel C++ compiler (ICC version 2015.1) was used for compiling the part of the code that runs on the CPU (with the same optimization flags mentioned for the CPU sampler).

For the FPGA platform, results are presented for one device of the Xilinx Virtex 6 series and six devices of the Xilinx Virtex 7 series. Actual runs were performed only for the Virtex 6 LX240T model (placed on an ML605 board and hosted by an Intel Core i7-2600 CPU with 4 GB of RAM, running Linux). Performance estimates for the other devices come from combining post-place-and-route resource utilization, device resources and equation (3.7) (for baseline, WPT or MPPT).

Two sequential reference implementations were used. The first was a sequential implementation in C++, which ran on an Intel Core i7-2600 device with one core activated and with all compiler and Cilk optimizations deactivated. This is an attempt to capture the approach of MCMC practitioners who are not familiar with any form of code optimization or parallelization. The second reference implementation was an identical sequential implementation in C++, which also ran on an Intel Core i7-2600 with one core activated but with all Intel compiler optimizations activated (Cilk optimizations were deactivated).

In order to produce power and energy consumption results, the Xilinx Power Estimator [146] was used for the FPGA samplers (assuming full device utilization) and the nominal thermal design power of the CPU and GPU devices was used.
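The energy figures follow from multiplying a power value by a runtime. The sketch below shows that arithmetic, assuming power is constant over the run (which is what using an estimator output or a nominal thermal design power implies). The function names are hypothetical, and no measured value from this chapter is reproduced.

```python
def energy_joules(power_watts, runtime_seconds):
    """Energy = power x time, with power held constant over the run."""
    return power_watts * runtime_seconds


def energy_ratio(power_a, runtime_a, power_b, runtime_b):
    """How many times more energy device A uses than device B for the same job."""
    return energy_joules(power_a, runtime_a) / energy_joules(power_b, runtime_b)
```

For example, with placeholder inputs, a device drawing 100 W for 10 s uses ten times the energy of a device drawing 20 W for 5 s.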

3.5.3 Runs in hardware and software

Table 3.4: Detailed list of platforms and devices

| Platform | Family | Devices | Release date | Fabrication process |
|---|---|---|---|---|
| Multi-core CPU | Intel Xeon | E5-2620 (4 cores), 2 x X5650 (2 x 6 cores), 2 x E5-2660 v2 (2 x 10 cores) | 2010-13 | 32nm |
| GPU | GeForce 200 | GTX285 | 2009 | 55nm |
| GPU | GeForce 400 | GT420, GT440, GTS450, GTX460SE, GTX465, C2050 | 2010-11 | 40nm |
| FPGA | Virtex 6 | LX240T | 2009 | 40nm |
| FPGA | Virtex 7 | VX330T, VX415T, VX485T, VX550T, VX690T, VX1140T | 2011 | 28nm |

As mentioned above, the three FPGA samplers (baseline, WPT and MPPT) were compiled and run only on the LX240T device. Moreover, bitstreams were not generated for all parameter and precision combinations (i.e. number of chains M, number of mantissa bits). The parameter and precision combinations for which a bitstream was generated are shown in Table 3.5, separately for each algorithm. The baseline sampler was compiled for M = 8 and M = 32. The WPT sampler was compiled for M = 8 combined with all custom precision configurations ((4, 11), (6, 11), (8, 11), (10, 11), (14, 11), (20, 11), (24, 11), (40, 11) and (53, 11)). The MPPT sampler was compiled for M = 8 combined with custom precision configuration (24, 11) only. Thus the runtimes and mixing results for the above combinations come from real runs on the LX240T FPGA, while the resource utilization results for these combinations are post-place-and-route results.

Runs for the remaining combinations of M and precision were performed in software using a C++ implementation (no bitstream was compiled). In order to “emulate” the custom precision calculations, the MPFR library [147] was used. This library allows all arithmetic operators to be performed in any custom precision inside C++ code. The resource utilization of these software-run combinations was calculated from the post-place-and-route results of the compiled designs combined with the post-place-and-route resource utilization of the Probability Evaluation blocks in different precisions, i.e. Probability Evaluation blocks were compiled in all precisions but did not run in hardware for all precisions. FPGA runtimes were estimated based on the latency and runtime equations of Section 3.3.1 (where P was defined based on the resource utilization of the generic parts and the custom precision Probability Evaluation blocks).
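One plausible reading of how P follows from resource utilization is sketched below. It is not the thesis's equation: the function name, the resource names and the numbers in the usage note are hypothetical. The sketch assumes that P is the largest number of Probability Evaluation blocks that fit next to the generic logic for every resource type simultaneously.

```python
def parallel_blocks(device, generic, per_block):
    """Largest P such that generic + P * per_block fits in `device` for every resource.

    All three arguments map a resource name (e.g. "LUT", "DSP") to a count.
    """
    counts = []
    for name, need in per_block.items():
        if need == 0:
            continue                      # this resource never limits P
        spare = device[name] - generic.get(name, 0)
        counts.append(spare // need)      # blocks that fit for this resource alone
    # The scarcest resource decides; a negative spare means nothing fits.
    return max(0, min(counts)) if counts else 0
```

With placeholder figures such as `device = {"LUT": 1000, "DSP": 100}`, `generic = {"LUT": 200, "DSP": 10}` and `per_block = {"LUT": 100, "DSP": 30}`, the LUT budget would allow 8 blocks but the DSP budget only 3, so the function returns 3.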

Table 3.5: PT parameter combinations for which actual FPGA bitstreams were generated and FPGA runs were performed, listed separately for each PT algorithm (baseline, WPT, MPPT), together with the combinations for which software runs were performed instead of FPGA runs. Software runs were implemented in C++, using the MPFR library for custom precision calculations.

| Algorithm | Implementation | Parameter combinations |
|---|---|---|
| Baseline | On FPGA | (M = 8), (M = 32) |
| Baseline | In software (+MPFR) | (M = 128), (M = 512), (M = 2048), (M = 8192), (M = 32768) |
| WPT | On FPGA | (M = 8, m = 4, e = 11), (M = 8, m = 6, e = 11), (M = 8, m = 8, e = 11), (M = 8, m = 10, e = 11), (M = 8, m = 14, e = 11), (M = 8, m = 20, e = 11), (M = 8, m = 24, e = 11), (M = 8, m = 40, e = 11), (M = 8, m = 53, e = 11) |
| WPT | In software (+MPFR) | All other combinations |
| MPPT | On FPGA | (M = 8, m = 24, e = 11) |
| MPPT | In software (+MPFR) | All other combinations |
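The software emulation of custom precision listed in Table 3.5 can be illustrated with a short sketch. The thesis used the MPFR library from C++; the Python code below is a stand-in and not MPFR. It assumes that m counts significand bits including the implicit leading bit (so m = 53 reproduces double precision) and that results are rounded to nearest, ties to even. The exponent range is not modelled, since e = 11 matches the exponent width of a double. The function names are hypothetical.

```python
import math


def round_to_mantissa(x, m):
    """Round the double `x` to `m` significant bits (1 <= m <= 53), ties to even."""
    if not 1 <= m <= 53:
        raise ValueError("m must be between 1 and 53")
    if x == 0.0 or math.isinf(x) or math.isnan(x):
        return x
    frac, exp = math.frexp(x)        # x = frac * 2**exp with 0.5 <= |frac| < 1
    scaled = frac * (1 << m)         # exact: scaling by a power of two
    return math.ldexp(round(scaled), exp - m)   # round() uses ties-to-even


def add_m(a, b, m):
    """Approximate an m-bit adder: add in double precision, then round to m bits."""
    return round_to_mantissa(a + b, m)
```

Note that `add_m` rounds twice (once to double, once to m bits), so it only approximates a true m-bit operator; a library such as MPFR performs the operation with a single correct rounding.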
