Dataflow programming on FPGAs - Case Studies in Acceleration of Heston’s Stochastic Volatility Financial Engineering Model: GPU, Cloud and FPGA Implementations

This approach to accelerating code takes advantage of the fact that, most of the time, the Central Processing Unit (CPU) is busy working out the instruction scheduling and branch prediction of the program. The purpose of a Field-Programmable Gate Array (FPGA) is to provide a customisable “field-programmable” chip that can be optimised to perform calculations for a specific problem domain. This is achieved by making the logic blocks on the chip re-wirable, so that even after a board has been shipped it can be re-wired and re-purposed.

This re-wiring is achieved via a Hardware Description Language (HDL). The language offers the ability to interconnect the logic blocks in different combinations to implement complex combinational functions, and also to manage the on-chip memory¹.

5.1 Applications of FPGAs in computational finance

One company that has made breakthroughs in accelerating financial models on FPGAs is Maxeler Technologies. Their paradigm shift, from the Von Neumann control-flow architecture to the Dataflow architecture, allows for much higher computational specialisation and acceleration. The control-flow architecture can be likened to a mechanic’s workshop, where one person carries out all stages of constructing a product, e.g. a motorcycle. This work does not have to follow any fixed sequence. While building the motorcycle, the mechanic can also side-step and work on a part of a car before returning to the motorcycle according to a schedule. The opposite of this paradigm is a motorcycle production line, where each station on the line is optimised to perform one action. The overall process is as quick as the flow through the line. What the FPGAs provide is a way to create workers (kernels, as defined by Maxeler) that are highly specialised and extremely quick at conducting a specialised operation. The kernels are large synchronous dataflow pipelines that implement the mathematics and the control of the problem. They are asynchronously coupled to other kernels and to I/O sources and sinks (DRAM, Peripheral Component Interconnect Express (PCIe), inter-chip links, etc.) by the manager.

¹ This is mostly in the form of either flip-flops or more structured blocks of memory.
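The production-line picture of kernels can be sketched in software. The following is a minimal illustration only, assuming Python generators as stand-ins for kernels; the stage names `scale` and `moving_average_3` and the function `run_pipeline` are hypothetical and are not part of Maxeler's tooling. Each stage consumes a stream and emits a stream, and a small wiring function plays the role of the manager by connecting source, stages and sink. The 3-value moving average is the example kernel mentioned in Section 5.4.

```python
from collections import deque
from typing import Iterable, Iterator, List


def scale(stream: Iterable[float], factor: float) -> Iterator[float]:
    """Stage 1 (hypothetical kernel): multiply every element by a constant."""
    for x in stream:
        yield x * factor


def moving_average_3(stream: Iterable[float]) -> Iterator[float]:
    """Stage 2 (hypothetical kernel): mean of the three most recent elements."""
    window: deque = deque(maxlen=3)  # oldest value drops out automatically
    for x in stream:
        window.append(x)
        if len(window) == 3:  # emit only once the window is full
            yield sum(window) / 3.0


def run_pipeline(source: Iterable[float]) -> List[float]:
    """Plays the role of the manager: wire source -> stage 1 -> stage 2 -> sink."""
    return list(moving_average_3(scale(source, 2.0)))


result = run_pipeline([1.0, 2.0, 3.0, 4.0, 5.0])
print(result)  # [4.0, 6.0, 8.0]
```

Unlike hardware, this Python version executes its stages one after another on a single processor. It only shows how the data moves from station to station; on an FPGA every stage would operate concurrently on a different element of the stream.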

An additional benefit of this technology is its low power footprint when compared against a standard CPU. Since the clock frequency of a CPU is much higher than that of an FPGA, its transistors switch more often and dissipate more energy in the form of heat. This heat accumulates in a server room and needs to be dissipated. In large server clusters, the cooling of the servers can amount to a significant cost. Typically, the power consumption of a cluster is split roughly half and half between the servers themselves and the cooling systems of those servers. FPGA chips tend to offer a significant reduction in these electricity costs.

5.2 FPGA versus Intel multi-core

Thus far, CPU developments have adhered to Moore’s law². However, limits on how far transistors can be scaled down, coupled with power consumption rising as more transistors are fitted onto a chip, cast doubt on the continued relevance of Moore’s law.

What might happen instead is that transistor counts continue to double every 18 months, but mainly because the number of cores in each chip doubles. This means that Operating Systems (OSs) will be able to take advantage of multiple cores within a CPU and, via efficient scheduling, maximise performance while minimising power consumption.

The benefit of the Intel multi-core approach is that the current programming paradigm can remain in place, and most existing code can be ported to the many-core architecture much more quickly than to more exotic implementations on GPUs and FPGAs.

On the one hand, the FPGA can leverage two advantages over the CPU approach. First, it has more silicon dedicated to calculations than the CPU. Second, it relies on the DataFlow architecture to do away with the taxing aspects of instruction scheduling and branch prediction. This way the calculation pipeline is always full and a result is calculated every clock cycle (see Figure 5.2 for more details). On the other hand, the CPU, as shown in Figure 5.1, needs to handle concurrent threads vying for their turn on the Arithmetic Logic Unit (ALU) in order to progress their calculations.

² Moore observed in 1965 that the transistor density of semiconductor chips doubled roughly every year, and in 1975 revised this to every two years; a doubling every 18 months is the commonly quoted variant.

Figure 5.1

The process architecture on a CPU, where the ALU is referred to as the Function Unit. Data has to be moved into the Function Unit from memory and then moved back into memory for storage. (Photo used by permission of Maxeler Technologies).
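The statement that a full pipeline delivers one result per clock cycle can be made concrete with a small cycle count. The sketch below is a deliberately simplified model with hypothetical function names: it compares a pipeline of a given depth against a unit that must finish each item completely before starting the next. Real CPUs are themselves pipelined, so the unpipelined figure is an illustrative bound and not a measurement of any processor.

```python
_EMPTY = object()  # marks a pipeline stage that currently holds no data


def simulate_pipeline(items, depth):
    """Clock a `depth`-stage pipeline (depth >= 1) until every item has left it.

    One new item enters per clock cycle. Returns (outputs, cycles).
    """
    pending = list(items)
    stages = [_EMPTY] * depth
    outputs = []
    cycles = 0
    next_in = 0
    while len(outputs) < len(pending):
        # Clock edge: a new item (if any) enters, everything else shifts on.
        incoming = pending[next_in] if next_in < len(pending) else _EMPTY
        next_in += 1
        stages = [incoming] + stages[:-1]
        cycles += 1
        if stages[-1] is not _EMPTY:
            outputs.append(stages[-1])  # item has reached the final stage
    return outputs, cycles


def cycles_pipelined(n_items, depth):
    """Closed form: `depth` cycles to fill, then one result per cycle."""
    return 0 if n_items == 0 else depth + n_items - 1


def cycles_unpipelined(n_items, depth):
    """Each item occupies the whole unit for `depth` cycles."""
    return n_items * depth


print(simulate_pipeline(["a", "b", "c"], 3))  # (['a', 'b', 'c'], 5)
print(cycles_pipelined(1000, 10), cycles_unpipelined(1000, 10))  # 1009 10000
```

For long streams the fill time becomes negligible, so the cost per result approaches a single clock cycle regardless of how deep the pipeline is.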

5.3 Scope for FPGAs

The FPGA lends itself more aptly to problems of a difference-engine nature. For instance, it has been used successfully in the seismic acquisition industry to perform finite difference modelling of geophysical models, to perform reverse time migrations, and to do CRS stacking. Lately, implementations in credit derivatives pricing have been appearing as well [WSRM12]. Essentially, any mathematical process that can be decomposed into distinguishable, self-sufficient sequential calculations can achieve high acceleration on the FPGA architecture.

Figure 5.2

The FPGA architecture as implemented by Maxeler Technologies. The MaxCompiler constructs the DataFlow tree which defines the circuit architecture on the FPGA chip. From then on, data from memory is piped through the different DataFlow cores until it exits the calculation pipe and is committed to memory. (Photo used by permission of Maxeler Technologies).

5.4 Application to the Heston model

The implementation of Heston’s stochastic volatility model has two aspects: first, the code that runs on the host, and second, the code that defines the circuit architecture of the FPGA and performs the necessary calculations.

Since only repetitive calculations can benefit from the DataFlow architecture, certain elements need to run on the host and others on the FPGA card. Maxeler Technologies use the nomenclature of a kernel and a manager. The kernel comprises a set of calculations that produce a distinct result, e.g. a 3-value moving average. The manager’s responsibility is to instantiate and administer the life cycle and functions of each kernel that is assigned to it. For this implementation the manager creates numerous pipes within a given MaxCard³ to handle different operations. The more pipes that can be fitted into the available silicon, the better the overall performance of the FPGA. The manager is responsible for creating the pipes and populating them with kernels that generate random variates from the Gamma distribution, and also with kernels that calculate the next values for the variance and the price of the underlying. Once all the prices of the underlying have been generated for every timestep, the results are aggregated back on the host’s CPU.

³ The latest models of Maxeler’s FPGA cards provide an ever-increasing number of resources on board the chip.

Figure 5.3

This figure illustrates how code interacts between the host CPU and the FPGA kernels. (Photo used by permission of Maxeler Technologies).
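The division of labour described above, per-timestep updates of the variance and the price on the card followed by aggregation on the host, can be sketched as follows. This is not the discretisation used in the implementation: the kernels there draw Gamma variates, whereas this sketch substitutes a full-truncation Euler step driven by standard normal draws, purely to show the data dependencies within one timestep. All function and parameter names (`heston_step`, `kappa`, `theta`, `xi`, `rho`) are assumptions made for illustration.

```python
import math
import random


def heston_step(s, v, z_v, z_s, *, r, kappa, theta, xi, rho, dt):
    """One timestep for one path (the work a kernel would repeat).

    Full-truncation Euler: a negative variance is clamped to zero wherever
    it enters a drift or diffusion term. z_v and z_s are independent
    standard normal draws supplied by the caller.
    """
    v_pos = max(v, 0.0)
    sqrt_v_dt = math.sqrt(v_pos * dt)
    # Correlate the price shock with the variance shock.
    z_corr = rho * z_v + math.sqrt(1.0 - rho * rho) * z_s
    v_next = v + kappa * (theta - v_pos) * dt + xi * sqrt_v_dt * z_v
    s_next = s * math.exp((r - 0.5 * v_pos) * dt + sqrt_v_dt * z_corr)
    return s_next, v_next


def simulate_terminal_prices(n_paths, n_steps, s0, v0, t_end, seed=0, **params):
    """Stand-in for the card: run every path through all timesteps."""
    rng = random.Random(seed)
    dt = t_end / n_steps
    prices = []
    for _ in range(n_paths):
        s, v = s0, v0
        for _ in range(n_steps):
            s, v = heston_step(s, v, rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0),
                               dt=dt, **params)
        prices.append(s)
    return prices


def discounted_call_price(prices, strike, r, t_end):
    """Host-side aggregation: discounted mean payoff of a European call."""
    payoff = sum(max(s - strike, 0.0) for s in prices) / len(prices)
    return math.exp(-r * t_end) * payoff


params = dict(r=0.05, kappa=2.0, theta=0.04, xi=0.3, rho=-0.7)
prices = simulate_terminal_prices(2000, 50, 100.0, 0.04, 1.0, seed=1, **params)
print(discounted_call_price(prices, 100.0, 0.05, 1.0))
```

Each path depends only on its own previous state and two fresh random draws, which is why many such paths can be streamed through parallel pipes and only the terminal prices need to return to the host.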
