Chapter 3 ABB – Interfacing networks to hosts
3.7 DMA engine variations
We have used four development boards for the Active Buffer project, shown in Figures 3-32 to 3-35. On every board, we implemented the DMA engine on the transaction layer of PCI Express. The PCI Express core from Xilinx handles the transactions on the physical layer and the data link layer. In the Virtex4 FPGA, the PCI Express core is a “soft” core; from Virtex5 onwards, it is a hardware primitive. For the transaction layer user, however, the implementation is no different.
Figure 3-32 ABB1 (ZITI)
Figure 3-33 MPRACE2 (ZITI)
Figure 3-34 ABB2 (AVNET)
Figure 3-35 ML605 (Xilinx)
Xilinx provides general DMA engine reference designs for its FPGAs. These designs do not fit our applications well, so we developed a DMA engine of our own.
3.7.1 ABB1 – Virtex4 FX20/FX60
The first board has a Xilinx Virtex4 FPGA as its central device, with a 4-lane soft PCI Express core. We originally used this board with a Virtex4 FX20 FPGA, but because of tight resource limits we upgraded it to an FX60, which is pin-compatible with the FX20. Even so, the resource limitation has not disappeared entirely, which is why the PCI Express core in this version has a 32-bit data bus rather than a 64-bit one. On this board, 32 MB of DDR SDRAM is available, and more memory chips can be plugged into the mezzanine slots. We have also practised DPR (Dynamic Partial Reconfiguration) on this board. [70]
The resource consumption can be found in Appendix A.1.
3.7.2 MPRACE2
MPRACE2 is another Virtex4 board. It is equipped with two Virtex4 FPGAs: an FX20 for the PCI Express interface, focused on data transfer, and a larger FX60, focused on accelerating scientific computing, e.g. SPH (Smoothed Particle Hydrodynamics). The FX20 is labelled BRIDGE and the FX60 is labelled MAIN. DDR2 modules are attached to the MAIN as working storage. Because the data rate between the two FPGAs is high, they are connected by 10 Gbps Aurora links, as illustrated in figure 3-36. [71] [72] [73]
Ideally, communication between the two FPGAs should behave as if they were a single FPGA. However, the PCI Express module must always stay alive, and the MAIN is reconfigured through the BRIDGE. When the board was designed, DPR was not yet powerful enough to be an alternative.
Figure 3-36 MPRACE2 structure
The DMA engine is ported to the smaller FPGA on the MPRACE2 board, the BRIDGE FPGA. The BRIDGE manages all PCI Express interface transactions, including PIO and DMA. The host machine accesses the MAIN via the BRIDGE, over the Aurora links (5 Gbps × 2). The system has 4 BARs. BAR[0] and BAR[1] reside in the BRIDGE; BAR[2] and BAR[3] reside in the larger FPGA, the MAIN:
BAR[0] is the system register space in the BRIDGE, 64 KB;
BAR[1] is the BRAM space in the BRIDGE, 1 MB;
BAR[2] is the register space in the MAIN, 64 KB;
BAR[3] is the memory space in the MAIN, 1 MB.
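The BAR layout above can be written down as a small lookup table. The following is only an illustrative sketch: the sizes and locations come from the list above, while the names `Bar`, `route` and `crosses_aurora` are hypothetical and not part of the design.

```python
from dataclasses import dataclass

KB = 1024
MB = 1024 * KB


@dataclass(frozen=True)
class Bar:
    index: int
    fpga: str   # "BRIDGE" or "MAIN"
    role: str
    size: int   # size in bytes


# Sizes and locations as listed in the text; everything else is illustrative.
BARS = (
    Bar(0, "BRIDGE", "system registers", 64 * KB),
    Bar(1, "BRIDGE", "BRAM space", 1 * MB),
    Bar(2, "MAIN", "register space", 64 * KB),
    Bar(3, "MAIN", "memory space", 1 * MB),
)


def route(bar_index: int, offset: int) -> str:
    """Return the FPGA that serves an access, or raise if the offset is out of range."""
    bar = BARS[bar_index]
    if not 0 <= offset < bar.size:
        raise ValueError(f"offset {offset:#x} is outside BAR[{bar_index}]")
    return bar.fpga


def crosses_aurora(bar_index: int) -> bool:
    """Accesses to BARs located in the MAIN have to travel over the Aurora links."""
    return BARS[bar_index].fpga == "MAIN"
```

For example, `route(3, 0x100)` returns `"MAIN"`, and `crosses_aurora(3)` is true because the host reaches that space only through the BRIDGE.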
In real operation, BAR[1] is set to a dummy to save resources in the BRIDGE. The DMA write transfers data from the host to the MAIN over the BRIDGE, and the DMA read works in the opposite direction.
DMA write moves data to the Aurora link buffer, then into the MAIN RAM space. After the SPH computation, the host software learns the status by polling and then issues a GET-DATA command to the MAIN. The MAIN puts the requested data onto the Aurora link, and the data arrive in a link buffer in the BRIDGE. At the same time, the host initiates a DMA read, which waits until the BRIDGE link buffer holds enough data for the first packet, moves that packet into the host memory, and then waits for the next packet of data to be ready.
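The packet-by-packet waiting of the DMA read can be illustrated with a toy software model. This is a sketch under simplifying assumptions, not the HDL implementation: the link buffer is modelled as a byte FIFO, data arriving from the MAIN are modelled as an iterator of chunks, and the names `LinkBuffer` and `dma_read` are hypothetical.

```python
from collections import deque


class LinkBuffer:
    """Toy model of the BRIDGE-side Aurora link buffer (a byte FIFO)."""

    def __init__(self):
        self._fifo = deque()

    def push(self, data: bytes) -> None:
        self._fifo.extend(data)

    def available(self) -> int:
        return len(self._fifo)

    def pop(self, n: int) -> bytes:
        return bytes(self._fifo.popleft() for _ in range(n))


def dma_read(buf, arrivals, total, packet_size):
    """Move `total` bytes into host memory, one packet at a time.

    `arrivals` yields the chunks that the MAIN sends over the link.
    Each chunk pulled while the buffer is short stands for one wait cycle.
    """
    host_memory = bytearray()
    waits = 0
    while len(host_memory) < total:
        need = min(packet_size, total - len(host_memory))
        # Wait until the link buffer holds enough data for the next packet.
        while buf.available() < need:
            chunk = next(arrivals, None)
            if chunk is None:
                raise RuntimeError("link ran dry before the transfer completed")
            buf.push(chunk)
            waits += 1
        host_memory += buf.pop(need)
    return bytes(host_memory), waits
```

With 256 bytes arriving in 48-byte chunks and a packet size of 128 bytes, the model waits for three chunks before it can move each packet.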
MPRACE2 uses separate FPGAs for flexible reconfigurability. However, the high-speed links between the two FPGAs caused problems in debugging and development. The link does not always work properly when flow control occurs. An external high-speed link is harder to bring into stable operation than an internal one, and therefore takes more debugging effort. Anticipating the later discussion, DPR technology is a better solution for very similar purposes.
The resource consumption can be found in Appendix A.2.
3.7.3 ABB2 – AVNET Virtex5 PCIE development board
The third development board is the AES-XLX-V5LXT-PCIE110-G from AVNET Inc., with a Virtex5 LX110T FPGA, an 8-lane hardware PCI Express core, a 64-bit transaction layer interface, and a larger DDR2 SDRAM memory. [74]
We use it in a 4-lane configuration. DMA read performance is about 543 MB/s and DMA write performance is about 790 MB/s. DPR is also practised on this board. It has been widely used in recent CBM DAQ beam tests. A generalized HDL design of this project has been committed to OpenCores.org. [75]
The resource consumption can be found in Appendix A.4.
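To put the measured figures into perspective, they can be compared with the raw link bandwidth. The sketch below assumes a first-generation PCI Express link (2.5 GT/s per lane with 8b/10b encoding) and MB = 10^6 bytes; both are assumptions not stated in the text, and packet header overhead is ignored. The function names are hypothetical.

```python
def raw_bandwidth_mb_s(lanes, gt_per_s=2.5, encoding=8 / 10):
    """Raw link bandwidth per direction in MB/s (assumed Gen1 defaults)."""
    # GT/s times encoding efficiency gives Gbit/s per lane;
    # dividing by 8 gives GB/s, multiplying by 1000 gives MB/s.
    return lanes * gt_per_s * encoding / 8 * 1000


def efficiency(measured_mb_s, lanes):
    """Fraction of the raw link bandwidth that a measured rate reaches."""
    return measured_mb_s / raw_bandwidth_mb_s(lanes)


if __name__ == "__main__":
    for name, rate in (("DMA read", 543), ("DMA write", 790)):
        print(f"{name}: {efficiency(rate, lanes=4):.1%} of the raw 4-lane bandwidth")
```

Under these assumptions a 4-lane link offers about 1000 MB/s per direction, so the measured rates correspond to roughly 54% (read) and 79% (write) of the raw bandwidth.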
3.7.4 Viroquant application – ML605
We also ported our DMA engine to a Virtex6 FPGA, the XC6VLX240T on the ML605 board, sketched in figure 3-37. On that board, the memory interface is upgraded to DDR3 SDRAM modules. Unlike in the MPRACE2 project, the data for a DMA read are not requested by a host PIO command. Instead, the data are requested by the read DMA engine itself. [76]
Because the DDR3 memory controller has a larger overhead for processing a read request, requests of MAX_PAYLOAD_SIZE are too small for it. For this project, we therefore adapted the DMA read module to aggregate the multiple smaller read requests into one for every DMA descriptor.
Accordingly, the DMA read DONE status was redesigned to give a correct END signal, because the data fetching mechanism in this project differs from that of the others.
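The effect of aggregating the read requests can be sketched in software. The model below is only an illustration under assumed numbers: the function names, the 128-byte payload size and the per-request overhead of 20 cycles are hypothetical and do not describe the actual HDL.

```python
def split_by_payload(address, length, max_payload):
    """One memory read request per MAX_PAYLOAD_SIZE-sized piece of a descriptor."""
    requests = []
    offset = 0
    while offset < length:
        size = min(max_payload, length - offset)
        requests.append((address + offset, size))
        offset += size
    return requests


def aggregate(requests):
    """Merge address-contiguous requests into as few requests as possible."""
    merged = []
    for addr, size in requests:
        if merged and merged[-1][0] + merged[-1][1] == addr:
            last_addr, last_size = merged[-1]
            merged[-1] = (last_addr, last_size + size)
        else:
            merged.append((addr, size))
    return merged


def total_overhead(requests, per_request_overhead):
    """Fixed controller overhead paid once per read request (assumed model)."""
    return len(requests) * per_request_overhead


if __name__ == "__main__":
    small = split_by_payload(0x1000, 1000, max_payload=128)
    single = aggregate(small)
    print(len(small), "requests cost", total_overhead(small, 20), "cycles")
    print(len(single), "request costs", total_overhead(single, 20), "cycles")
```

In this example a 1000-byte descriptor would need eight payload-sized requests, whereas the aggregated form issues a single request and pays the controller overhead only once.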