3.3 Accelerating PT
3.3.1 Baseline accelerators
FPGA
Here, a baseline hardware architecture for PT in double-precision floating point is proposed. Figure 3.1 shows the block diagram of the architecture. There are four computational blocks (Sample Proposal, Probability Evaluation, Accept/Reject and Exchange) and three memories (Sample, Probability and Data). There are also one Gaussian and two uniform random number generators, as well as four FIFO registers. Most control signals are omitted for clarity.
The architecture is based on extensively pipelining all computational blocks. Pipelining is possible because PT chains perform the same computations on different data during Global updates and exchanges. The system can be thought of as a long pipeline which works iteratively, performing the same steps for each Global update/exchange.
One MCMC iteration (outer loop in line 2 of Algorithm 6) includes the following steps (for iteration number i). The current samples of all parallel chains ($\theta_j^{(i-1)}$ for all $j \in \{1, \dots, M\}$) are read from Sample memory and forwarded to the Sample Proposal block. The Sample Proposal block proposes candidate samples $\theta^*$ by adding Gaussian random numbers to the current samples. If the sample's dimension is larger than one, the block proposes values for all dimensions in parallel (aided by parallel Gaussian RNGs). The candidates are then moved to the Probability Evaluation block (shown in Figure 3.2), which computes the candidate probabilities $p_j(\theta^*)$. This is done by calculating n probability sub-densities (n is the number of data), finding their logarithms and summing them to get the total log-density; an adder tree performs the reduction. For large n, the Probability Evaluation block becomes the bottleneck of the system. To reduce this bottleneck, multiple parallel pipelines can be instantiated inside this block. The number of pipelines is limited only by the FPGA chip size. Moreover, since each sub-density requires that a datum be read from the Data memory, the Data memory is designed to provide data to all pipelines in parallel in the same cycle (each memory address stores multiple data). The sizes of all system memories are shown in Table 3.2. The Accept/Reject block receives the candidate probabilities $p_j(\theta^*)$ and reads the previous probabilities of each chain ($p_j(\theta_j^{(i-1)})$) from Probability memory. It also computes the temperature $\mathrm{Temp}_j = \left(\frac{M+1-j}{M}\right)^2$. All these values (along with a uniform random number) are used to find the Metropolis ratio and accept or reject each candidate sample.

Figure 3.1: Baseline FPGA architecture.

Figure 3.2: Chain streaming through the Sample Proposal, Probability Evaluation and Update pipelines. Occupied stages are grey, unoccupied white. Numbers represent the chains that occupy each stage. There are four pipelines in the Probability Evaluation block (P = 4). The data size is n = 16. A probability value is generated every four cycles (n/P = 4). The Sample Proposal and Update pipelines are under-utilized.

Table 3.2: Baseline architecture memories. P is the number of sub-density pipelines; M, n and s are defined in Table 3.1.

| Memory | Description | Depth (entries) | Width (bits) |
|---|---|---|---|
| Sample memory | Stores current samples of all PT chains | M | 64s |
| Probability memory | Stores probabilities of current samples of all chains | M | 64 |
| Data memory | Stores the data set used for inference | n/P | 64P |
The above steps comprise the update operation. The updated samples also pass through the Exchange block before they are written back to Sample memory. Unlike CPU and GPU implementations (which are presented in the following sections), the Global update (update of all chains) does not need to finish before starting the Global exchange. As soon as a chain is updated, it is forwarded to the Exchange block while the next chains are processed by the Update block.
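The Accept/Reject step of the update operation can be sketched in software as follows. This is a minimal Python sketch under stated assumptions, not the hardware implementation: the function names `temperature` and `accept_candidate` are hypothetical, the temperature ladder is assumed to be Temp_j = ((M + 1 - j) / M)^2 with 1-based chain index j, and the test is assumed to be the usual tempered Metropolis rule for a symmetric random-walk proposal, carried out on log-values.

```python
import math


def temperature(j: int, M: int) -> float:
    """Temperature of chain j (1-based) out of M chains.

    Assumes the ladder Temp_j = ((M + 1 - j) / M) ** 2, so chain 1 has
    temperature 1 and samples the untempered target.
    """
    return ((M + 1 - j) / M) ** 2


def accept_candidate(log_p_new: float, log_p_old: float, temp: float, u: float) -> bool:
    """Tempered Metropolis test in the log domain.

    u must be a uniform random number in (0, 1]. The test uses two
    multiplications, one subtraction and one comparison.
    """
    log_numerator = temp * log_p_new      # tempered candidate log-density
    log_denominator = temp * log_p_old    # tempered current log-density
    return math.log(u) < log_numerator - log_denominator
```

With the same pair of log-densities, a chain with a small temperature accepts a downhill move more readily than chain 1, which is what lets the auxiliary chains explore more freely. In a software run, `u` could be drawn as `1.0 - random.random()` to avoid taking the logarithm of zero.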
Each exchange is performed between a pair of chains. Therefore, the block has to wait for two chains to be updated and then attempt the exchange. Because exchanges are performed between neighbouring chains only (see lines 11-12 in Algorithm 6), pairs of potentially exchanged samples conveniently reach the block successively (since chains are updated in order). The temperatures and the updated sample probabilities are used to accept or reject the exchange. FIFOs are used to store these values when they first become available (earlier in the pipeline), removing the need to write them to the memories after updates and read them back for exchanges (which is necessary in CPU and GPU implementations [15, 16]).
Finally, the new samples and probabilities are written back to memories. When all chains have traversed the pipeline, the Global update and exchange are complete and the next MCMC iteration starts. At every iteration, the current samples and probabilities of the first chain are read from the Sample memory and sent to a BRAM-based buffer, which is able to transmit the data to the host PC in real time (using double buffering). Details on the implementation of the buffer are given in Section 3.5.
Performance model
This section gives exact formulas for the latency and throughput of the various blocks in the PT architecture. The latency of the Sample Proposal block (for generating a candidate sample for one chain) is:
\[ \mathrm{Lat}_{sp} = \mathrm{Lat}_{add} \tag{3.1} \]
where $\mathrm{Lat}_{add}$ is the latency of a double-precision floating point adder (needed to add a Gaussian random number to the previous sample of the chain). All dimensions of the proposed sample are generated in parallel.
The latency of the Probability Evaluation block (to compute the probability density of one chain) is:
\[ \mathrm{Lat}_{pe} = C_{subdensity} + \frac{n}{P} \tag{3.2} \]
where $C_{subdensity}$ is the latency of a single sub-density evaluation pipeline (it depends on the target distribution) and the term $n/P$ is the number of cycles needed to pass all n data through the P parallel pipelines. Each pipeline can receive one datum per cycle (a throughput of one datum per cycle). It is assumed that n > P, which is the case for all non-trivial data sets.
The latency of the Accept/Reject block (for accepting or rejecting a candidate for one chain) is:
\[ \mathrm{Lat}_{ar} = \mathrm{Lat}_{mult} + \mathrm{Lat}_{sub} + \mathrm{Lat}_{comp} \tag{3.3} \]
where $\mathrm{Lat}_{mult}$ is the latency of a double-precision floating point multiplier, needed to multiply the proposed and previous log-densities by the temperature of the chain. Two multipliers are needed and they work in parallel. The results are the logarithms of the numerator and denominator of the acceptance ratio in line 6 of Algorithm 6. $\mathrm{Lat}_{sub}$ is the latency of a double-precision floating point subtractor (needed to subtract the log-denominator from the log-numerator in order to compute the log acceptance ratio) and $\mathrm{Lat}_{comp}$ is the latency of a double-precision floating point comparator (needed to compare the log-ratio with the logarithm of a uniform random number).
The latency of the Exchange block (for accepting or rejecting an exchange between a pair of chains) is:
\[ \mathrm{Lat}_{ex} = \mathrm{Lat}_{mult} + \mathrm{Lat}_{add} + \mathrm{Lat}_{sub} + \mathrm{Lat}_{comp} \tag{3.4} \]
Four parallel multipliers are needed to find the four terms in the exchange ratio of line 13 in Algorithm 6 (using log-values). Two parallel adders compute the logarithms of the numerator and denominator of the exchange ratio and a subtractor computes the logarithm of the ratio. The comparator compares this value with the logarithm of a uniform random number.
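The operator count above can be made concrete with a short sketch. This is a minimal Python sketch that assumes the standard PT exchange ratio between two neighbouring chains; the exact form of line 13 in Algorithm 6 is not reproduced in this section, so the formula, the function name `accept_exchange` and its argument names are assumptions made for illustration.

```python
import math


def accept_exchange(log_p_a: float, log_p_b: float,
                    temp_a: float, temp_b: float, u: float) -> bool:
    """Decide whether neighbouring chains a and b swap their samples.

    log_p_a and log_p_b are the untempered log-densities of the current
    samples of chains a and b; u is a uniform random number in (0, 1].
    """
    # Four products (the four parallel multipliers).
    swapped_a = temp_a * log_p_b   # chain a holding chain b's sample
    swapped_b = temp_b * log_p_a   # chain b holding chain a's sample
    current_a = temp_a * log_p_a
    current_b = temp_b * log_p_b
    # Two sums (the two parallel adders): log-numerator and log-denominator.
    log_numerator = swapped_a + swapped_b
    log_denominator = current_a + current_b
    # One subtraction and one comparison.
    return math.log(u) < log_numerator - log_denominator
```

Under this assumed ratio, a swap that moves the higher-density sample into the chain with the larger temperature value is always accepted, while the reverse swap is accepted only with some probability.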
The above four blocks constitute the system pipeline. By making use of chain pipelining, as described in the previous paragraphs, it is possible to feed one new chain to the pipeline every $n/P$ clock cycles (since this is the number of cycles for which the Probability Evaluation block is busy processing each chain). The total latency of the baseline architecture's pipeline (the number of cycles necessary to process one Global update and one Global exchange, i.e. one PT iteration in the loop of line 2 in Algorithm 6) is the following:
\[ \mathrm{Lat}_{iter\_pt} = \mathrm{Lat}_{sp} + C_{subdensity} + M \cdot \frac{n}{P} + \mathrm{Lat}_{up} + \mathrm{Lat}_{ex} \tag{3.5} \]
where the term $M \cdot n/P$ is the number of cycles needed to pass all chains through the system pipeline and $\mathrm{Lat}_{up}$ is the latency of the update stage (the Accept/Reject block, $\mathrm{Lat}_{ar}$ in Equation (3.3)). The latencies of the update and exchange blocks are counted only once because they overlap with the Probability Evaluations (which take significantly more cycles for realistic scenarios where n is large). Figure 3.2 demonstrates this point more clearly; it shows the utilization of the Sample Proposal, Accept/Reject and Probability Evaluation pipelines by PT chains when n = 16 and P = 4, so that n/P = 4. A sample reaches the Probability Evaluation block every four cycles. During these four cycles, n = 16 data are sent to the block (one quadruple per cycle). The Data memory is designed to "match" the consumption rate of the block, as discussed previously.
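The data flow of the n = 16, P = 4 example can be mimicked in software. The following is a minimal Python sketch, not a model of the actual RTL: the function name `log_density_lanes` and the toy sub-density used in the test are hypothetical, and accumulating one partial sum per lane before a pairwise adder tree is an assumption about the order of the additions (the total is the same for any order, up to rounding).

```python
import math


def log_density_lanes(data, sub_density, P):
    """Sum log sub-densities the way P parallel pipelines would.

    Returns (total_log_density, cycles), where cycles counts the Data
    memory words consumed (one word of P data per cycle).
    """
    n = len(data)
    assert n % P == 0, "sketch assumes P divides n"
    partial = [0.0] * P          # one running sum per pipeline (lane)
    cycles = 0
    for start in range(0, n, P):
        word = data[start:start + P]          # one memory address holds P data
        for lane, datum in enumerate(word):
            partial[lane] += math.log(sub_density(datum))
        cycles += 1
    # Pairwise adder tree over the lane sums.
    while len(partial) > 1:
        if len(partial) % 2:
            partial.append(0.0)               # pad odd levels with a zero
        partial = [partial[i] + partial[i + 1] for i in range(0, len(partial), 2)]
    return partial[0], cycles
```

For 16 data and four lanes the function reports four cycles, matching the n/P = 4 cycles per chain discussed above.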
The total latency of a PT run of N MCMC iterations, in clock cycles, is:
\[ \mathrm{Lat}_{pt} = N \cdot \mathrm{Lat}_{iter\_pt} \tag{3.6} \]
and the total time of the system is:
\[ \mathrm{Time}_{total\_pt} = \frac{\mathrm{Lat}_{pt}}{freq} + \mathrm{Time}_{input\_pt} \tag{3.7} \]
where $freq$ is the clock frequency of the PT IP in Hz and $\mathrm{Time}_{input\_pt}$ is the time needed to send input arguments from the host PC to the FPGA. No time is spent outputting the MCMC samples, since this happens simultaneously with processing (using a double buffering memory architecture).
The throughput of the baseline architecture (MCMC iterations it processes per second, where an MCMC iteration comprises a Global update and a Global exchange), excluding Timeinput pt, is:
\[ \mathrm{TP}_{pt} = \frac{freq}{\mathrm{Lat}_{iter\_pt}} \quad \text{(MCMC iterations / sec)} \tag{3.8} \]
This throughput is equal to the throughput of the Probability Evaluation block. It is clear that the critical factor for the performance of the system is $n/P$, the number of cycles the block needs in order to process each chain. By fitting more parallel sub-density pipelines in the FPGA fabric, P can be increased, resulting in higher throughput.
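Equations (3.1) to (3.8) are simple enough to evaluate directly. The following is a minimal Python sketch of the performance model; the function names are hypothetical, the operator latencies and clock frequency used in the example are made-up placeholders rather than measured values, Lat_up is assumed to equal the Accept/Reject latency of Equation (3.3), and n/P is rounded up when P does not divide n (an assumption the text does not state).

```python
import math


def lat_iteration(M, n, P, c_subdensity, lat_add, lat_mult, lat_sub, lat_comp):
    """Cycles for one PT iteration, following Equation (3.5)."""
    lat_sp = lat_add                                   # Eq. (3.1)
    lat_up = lat_mult + lat_sub + lat_comp             # Eq. (3.3), assumed Lat_up = Lat_ar
    lat_ex = lat_mult + lat_add + lat_sub + lat_comp   # Eq. (3.4)
    cycles_per_chain = math.ceil(n / P)                # n/P, rounded up (assumption)
    return lat_sp + c_subdensity + M * cycles_per_chain + lat_up + lat_ex


def throughput(freq_hz, lat_iter):
    """MCMC iterations per second, Equation (3.8)."""
    return freq_hz / lat_iter


def total_time(N, lat_iter, freq_hz, time_input=0.0):
    """Total run time in seconds, Equations (3.6) and (3.7)."""
    return N * lat_iter / freq_hz + time_input


# Placeholder numbers, for illustration only.
params = dict(M=100, n=1024, c_subdensity=50,
              lat_add=10, lat_mult=8, lat_sub=10, lat_comp=2)
lat_p8 = lat_iteration(P=8, **params)
lat_p16 = lat_iteration(P=16, **params)
print(lat_p8, lat_p16, throughput(200e6, lat_p16) / throughput(200e6, lat_p8))
```

With these placeholder values the M * n/P term dominates the iteration latency, so doubling P almost doubles the modelled throughput.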
Multi-core CPU
An optimized implementation of PT was developed on a multi-core CPU in order to achieve a fair performance comparison. In order to exploit PT's parallelism, pragmas and Intel Cilk keywords [140] were embedded in the sequential C++ code. Also, Intel Compiler optimizations [141] (including the -O3 flag and the optimizations related to the CPU architecture) were applied. More specifically:
1) The Global update loop was transformed into a cilk_for loop and the granularity of the parallelization was tuned by instructing the compiler to group loop iterations into chunks of a fixed number of iterations each (using the granularity pragma). Depending on the number and type of CPU cores and the amount of work per iteration, a specific granularity maximizes performance.
2) The reduction operation (necessary to sum the sub-densities and evaluate the total probability density) was parallelized using the simd reduction pragma. A parameter is also used here to specify the granularity of parallelization [140].
3) Sub-density evaluations were vectorized by converting the respective functions to Cilk elemental (vectorized) functions, using the __attribute__((vector)) keyword.
GPU
An optimized GPU implementation was also created based on the state-of-the-art CUDA code of Lee et al. [16]. In Lee et al. [16] the main computational work of the implementation is split into two kernels, the global update and the global exchange kernels. There are also kernels for random number generation and initialization. All of the remaining work is done on the CPU. The global update kernel updates all chains once. It exploits chain parallelism, assigning the work of every PT chain to a separate thread. This results in an implementation which does not exploit all available parallelism; it ignores intra-chain parallelism. The exchange kernel performs exchanges between neighboring chains and these are also parallelized.
Here, a PT implementation which uses an enhanced global update kernel is presented. All the remaining components of the implementation of Lee et al. [16] are unchanged (for more details on these components see [16]). The changes to the global update kernel aim at increasing thread utilization and maximizing performance. They are listed below:
1) Intra-chain computations are parallelized by assigning the calculation of the sub-densities (or groups of them) of each chain to separate threads, in contrast to Lee et al. [16] where all sub-densities of a chain were assigned to the same thread. This makes the comparison to other platforms fair.
2) The global update kernel processes M chains, each of which contains n sub-density evaluations. These Mn tasks can be allocated to CUDA blocks and threads in many combinations. The number of blocks ranges from 1 to M; within each block, 1 to n tasks are allocated to each thread. Combinations which split the work of a chain across separate blocks are not examined because this would require communication between different blocks during chain updates (which is expensive since the global GPU memory needs to be used instead of the shared memory). For each (M, n) setting, the combination of blocks and tasks per thread which maximizes the kernel's throughput is chosen. For example, when few PT chains are used, there is not enough parallelism at the inter-chain level. It is then beneficial to assign each sub-density (task) of each chain to a separate thread to introduce as much intra-chain parallelism as possible. On the other hand, when the number of chains reaches a few thousand, it is preferable to assign more than one sub-density task to each thread, because (a) there is now enough inter-chain parallelism to saturate the device and (b) inter-chain parallelism does not require a reduction, unlike intra-chain parallelism. The above optimization is described in detail in Section 3.6.4 and leads to increased GPU utilization compared to Lee et al. [16] (also considering that much larger data sets are used compared to [16]).
3) Reduction operations inside the density computation of each chain are unrolled. After the independent sub-densities are computed by the threads, they need to be summed to get the likelihood. This requires communication between threads through the shared memory. Although this reduction can easily be done using a reduction tree, such a tree leaves half of the threads inactive in the first stage, 75% of the threads inactive in the second stage, etc. The proposed implementation instead applies the technique proposed in [142], which completely unrolls the reduction calculations and minimizes thread imbalance.
4) The implementation of Lee et al. [16] stores all the data in the GPU's constant memory (typically limited to a few tens of KB). This is possible because the data sizes used are small (100 data points, 4 bytes each). Here, data are stored in global GPU memory, which is a realistic strategy given the data sizes in real applications. During execution, data are moved, in chunks, to the shared memory of all blocks. All the chains of a block can use the data to compute part of the log-density before the next chunk is read, increasing the compute-to-memory ratio of the kernel by a factor equal to the number of chains per block (ranging from 2 to 32).
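The thread-idleness argument of item 3 can be checked with a small simulation. The following is a minimal Python sketch of a plain shared-memory style reduction tree, written sequentially; it illustrates the baseline tree whose imbalance motivates the optimization, not the unrolled technique of [142], and the function name `tree_reduce` is hypothetical.

```python
def tree_reduce(values):
    """Sum a power-of-two-length list with a strided reduction tree.

    Returns (total, active), where active[k] is the number of simulated
    threads that perform an addition in stage k.
    """
    buf = list(values)
    n_threads = len(buf)
    assert n_threads > 0 and n_threads & (n_threads - 1) == 0, "length must be a power of two"
    active = []
    stride = n_threads // 2
    while stride >= 1:
        for tid in range(stride):          # only threads with tid < stride do work
            buf[tid] += buf[tid + stride]
        active.append(stride)
        stride //= 2
    return buf[0], active


total, active = tree_reduce([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
idle_fraction = [1 - a / 8 for a in active]
```

For eight simulated threads, four are active in the first stage and two in the second, i.e. 50% and 75% of the threads are idle, as stated in item 3.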