ARM System
3.3 Performance Considerations
The HPS on the Cyclone V provides a highly configurable compute platform. The Cortex-A9 MPCore processor system alone provides innumerable configuration options. This section attempts to provide justification and intuition for some of the configuration choices made for the LegUp ARM Hybrid system. Sections 3.3.1 and 3.3.2 investigate the configuration of the ARM Cortex-A9 processor, while Section 3.3.3 explores the memory interface between the accelerator and processor memory.
3.3.1 Configuration of the ARM Cortex-A9 Processor
The performance of the Cortex-A9 processor relies heavily on certain configuration options. In particular, the configuration of the caches, branch predictor, and MMU has a significant effect on processor performance. Since this thesis is focused on accelerating applications executed on the processor by using hardware accelerators in an FPGA, it is important to first explore the performance of the processor itself. In this section we measure the effect of the key features on execution performance. We experiment with six different configurations, starting with those that are simplest to control. For example, the L1 instruction cache and branch prediction can each be enabled by setting a single bit in the System Control Register, while enabling the L1 data cache requires the MMU to first be enabled, which in turn requires the translation tables to be constructed and several settings to be made in control registers.
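As a rough illustration of the single-bit controls described above, the following sketch models the relevant System Control Register (SCTLR) bits using the positions defined by the ARMv7-A architecture (M = bit 0, C = bit 2, Z = bit 11, I = bit 12). It is written in Python purely to show the bit manipulation; the helper names are hypothetical and are not taken from the LegUp startup code. On the real processor the register is read and written with CP15 instructions from privileged code, and the cache invalidation and translation table setup that must precede these writes are omitted here.

```python
# Bit positions in the ARMv7-A System Control Register (SCTLR).
SCTLR_M = 1 << 0    # MMU enable
SCTLR_C = 1 << 2    # data and unified cache enable
SCTLR_Z = 1 << 11   # branch prediction enable
SCTLR_I = 1 << 12   # instruction cache enable


def enable_icache_and_branch_prediction(sctlr: int) -> int:
    """Return the SCTLR value with the I and Z bits set (hypothetical helper)."""
    return sctlr | SCTLR_I | SCTLR_Z


def enable_dcache(sctlr: int) -> int:
    """Return the SCTLR value with the C bit set.

    Mirrors the ordering constraint described in the text: the MMU must
    already be enabled before the L1 data cache is turned on.
    """
    if not sctlr & SCTLR_M:
        raise ValueError("enable the MMU (SCTLR.M) before the L1 data cache")
    return sctlr | SCTLR_C


if __name__ == "__main__":
    value = enable_icache_and_branch_prediction(0)
    print(f"I and Z set: {value:#06x}")
```

The check in `enable_dcache` only encodes the dependency stated in this section; it does not model the translation table construction that enabling the MMU requires.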
Throughout this section we will use a set of fourteen benchmarks composed of the CHStone suite [25], plus the Dhrystone and Mandelbrot benchmarks. These benchmarks were chosen because they form the core of LegUp’s benchmark set and are supported by LegUp’s software flow targeting the soft Tiger MIPS processor. In addition, these benchmarks encompass a variety of application domains and include both compute-bound (e.g. Mandelbrot) and memory-bound (e.g. CHStone/motion) applications. The same compiler flags are used when compiling the benchmarks for the first five configurations, and -O3 optimizations are enabled. For the sixth configuration, an additional flag is passed to the compiler to enable the generation of NEON instructions.
Table 3.1 shows the cycle counts of the fourteen benchmarks with the processor in six different configurations:
1. The Baseline configuration leaves most features of the processor disabled. This is essentially the state of the processor when the preloader passes off control to the LegUp startup code, except that the L1 instruction cache and branch predictor have been disabled.
2. The + L1 Instruction Cache configuration is the same as configuration 1, with the addition of the L1 instruction cache. All benchmarks benefit from the L1 instruction cache, and the compute-bound Mandelbrot application in particular sees a 12× speedup.
3. The + Branch Prediction configuration is the same as configuration 2, with the addition of branch prediction. Branch prediction provides a nearly 40% speedup for Mandelbrot and a 34% speedup for CHStone/mips; the other benchmarks see a much more modest improvement. This is the state of the processor when control is passed from the preloader to the LegUp startup code.
4. The + L2 Cache configuration is the same as configuration 3, with the addition of the L2 cache. Since the L2 cache is a shared data and instruction cache, all benchmarks see a significant improvement, with the exception of Mandelbrot. The Mandelbrot benchmark does not see much benefit from the L2 cache since it has almost no memory operations; the computation is simple enough that all operands can be kept in registers, and there is only a single store to memory for each calculated pixel.
5. The + L1 Data Cache and MMU configuration is the same as configuration 4, with the addition of the MMU and L1 data cache. These are bundled together since, as mentioned earlier, the L1 data cache cannot be enabled until the MMU is enabled. The MMU is configured with the inner and outer write-back, write-allocate cache attribute. An analysis of MMU configuration is presented in the next section. The performance benefit is similar to that of enabling the L2 cache, since all memory operations are now much more efficient.
6. The + NEON configuration is the same as configuration 5, with the addition of NEON vector processing instructions. NEON instructions are generated by passing the appropriate flags to the compiler. Overall, enabling NEON instruction generation shows a speedup on this benchmark set; however, some benchmarks do show a slowdown as a result of using NEON instructions. In particular, CHStone/blowfish and CHStone/jpeg show an appreciable slowdown. This slowdown could be due to the compiler producing sub-optimal NEON instruction sequences, or to the nature of the NEON unit itself: the NEON unit performs in-order execution and can therefore be more adversely affected by memory latency. The CHStone/motion benchmark is the only one which shows a significant speedup with NEON instructions. It is also likely that this benchmark set does not thoroughly exercise the capabilities of the NEON unit.
Table 3.1: Effect of MPU features on cycle count.
                                      + L1 Instruction          + Branch              + L2   + L1 Data Cache
Benchmark                   Baseline             Cache        Prediction             Cache           and MMU            + NEON
chstone/adpcm              1,164,469           948,724           935,621           276,809            59,603            58,869
chstone/aes                  404,265           249,641           240,396            81,933            29,703            29,433
chstone/blowfish           9,883,747         7,917,584         7,661,289         2,344,230           497,646           508,008
chstone/dfadd                105,951            36,612            36,043            17,377             8,547             8,679
chstone/dfdiv                249,905            61,967            54,336            23,701            12,842            12,914
chstone/dfmul                 37,831            15,650            15,161             7,223             3,743             3,717
chstone/dfsin              9,470,851         1,722,331         1,422,394           598,749           354,455           353,579
chstone/gsm                  360,433           197,776           192,133            68,619            17,866            17,188
chstone/jpeg              55,058,607        30,061,048        27,177,185         7,754,998         1,800,510         1,834,268
chstone/mips                 647,963           236,063           176,072            57,315            23,376            23,086
chstone/motion               426,015           181,650           179,545            43,294             9,688             7,113
chstone/sha               13,086,651         6,437,187         5,510,548         1,984,479           429,716           418,550
dhrystone                    330,529           160,532           156,845            42,260             9,786             9,968
mandelbrot               423,452,731        34,162,488        24,599,551        23,202,160        22,907,228        22,907,072
Geomean                    1,545,331           611,919           553,828           205,730            71,431            69,722
Aggregate Speedup              1.000             2.525             2.790             7.511            21.634            22.164
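The Aggregate Speedup row of Table 3.1 is consistent with dividing the Baseline geomean cycle count by the geomean of each configuration. The sketch below reproduces that arithmetic from the Geomean row; the function and variable names are illustrative only, and the `geomean` helper is included to show how such a row could be computed from per-benchmark counts.

```python
import math


def geomean(values):
    """Geometric mean, computed in log space to avoid overflow."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Geomean cycle counts copied from Table 3.1, in configuration order.
GEOMEAN_CYCLES = {
    "Baseline": 1_545_331,
    "+ L1 Instruction Cache": 611_919,
    "+ Branch Prediction": 553_828,
    "+ L2 Cache": 205_730,
    "+ L1 Data Cache and MMU": 71_431,
    "+ NEON": 69_722,
}


def aggregate_speedups(geomeans):
    """Speedup of every configuration relative to the first (baseline) entry."""
    baseline = next(iter(geomeans.values()))
    return {name: baseline / cycles for name, cycles in geomeans.items()}


if __name__ == "__main__":
    for name, speedup in aggregate_speedups(GEOMEAN_CYCLES).items():
        print(f"{name:26s} {speedup:7.3f}")
```

Using the geometric rather than the arithmetic mean keeps the long-running Mandelbrot and jpeg benchmarks from dominating the aggregate.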
Table 3.2 shows a comparison of the two processors used in LegUp’s software and hybrid flows. The soft Tiger MIPS processor was synthesized using the Quartus II software, version 15.0.1, targeting the Cyclone V FPGA (5CSEMA5F31C6). The achieved frequency was 74.55 MHz. By contrast, the ARM processor runs at 800 MHz. This frequency difference alone results in an order-of-magnitude performance difference between the two processors. In addition, the more advanced nature of the ARM processor results in it requiring fewer cycles to execute the same code. Over our set of benchmarks this results in a 3× speedup for the ARM processor compared to the MIPS processor in terms of raw cycles. Together with the frequency advantage, this results in a 35× speedup for the ARM processor.
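The relationship between cycle count, clock frequency, and execution time used in this comparison can be sketched as follows. The cycle counts and frequencies are those reported in Table 3.2; the function and variable names are illustrative. Because the sketch multiplies unrounded ratios, its overall figure comes out at roughly 35× and may differ slightly in the last digits from a value derived from rounded execution times.

```python
MIPS_MHZ = 74.55   # achieved frequency of the soft Tiger MIPS processor
ARM_MHZ = 800.0    # frequency of the hard ARM processor


def exec_time_us(cycles, freq_mhz):
    """Execution time in microseconds: cycles divided by cycles-per-microsecond."""
    return cycles / freq_mhz


# Geomean cycle counts from Table 3.2.
mips_geomean_cycles = 227_146
arm_geomean_cycles = 69_722

cycle_speedup = mips_geomean_cycles / arm_geomean_cycles   # about 3.26x
frequency_ratio = ARM_MHZ / MIPS_MHZ                       # about 10.7x
overall_speedup = cycle_speedup * frequency_ratio          # roughly 35x

if __name__ == "__main__":
    print(f"cycle speedup:   {cycle_speedup:.3f}")
    print(f"frequency ratio: {frequency_ratio:.3f}")
    print(f"overall speedup: {overall_speedup:.1f}")
```

The two factors are independent: the cycle speedup reflects the microarchitecture, while the frequency ratio reflects hard silicon versus a soft core in the FPGA fabric.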
3.3.2 Configuration of the MMU
On the ARM core there are three memory types: normal, device, and strongly ordered. Reads and writes to normal memory may be coalesced and reordered in a nearly arbitrary fashion. Caching and prefetching may also be applied to accesses to normal memory. By contrast, reads and writes to device and strongly ordered memory are always the same size as expressed in the program, and always occur in the same order as expressed in the program. The primary difference between device and strongly ordered memory is that reads and writes to normal memory can be reordered around reads and writes to device memory, whereas reads and writes to strongly ordered memory essentially act as an implicit memory barrier.
For normal memory, there are a number of cacheability attributes that can be applied to the memory region. These include whether the memory is write-through or write-back, and whether write-allocation is allowed. Write-allocation, also known as ‘fetch on write’, causes the cache line for a missed write to be loaded into the cache. Cacheability attributes can be specified separately for the inner (L1) and outer (L2) cache systems.
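To make the attribute choices concrete, the sketch below encodes them as they appear in an ARMv7-A short-descriptor first-level section entry with TEX remap disabled. It is an assumption that 1 MB section entries are used; the access permission (AP) and domain values are placeholders, and the helper names are hypothetical. Only the TEX, C, and B field encodings follow the architecture definition: for normal memory with separate policies, TEX = 0b1BB carries the outer policy and C/B carry the inner policy.

```python
# Two-bit cache policy encodings shared by the inner and outer fields.
NON_CACHEABLE = 0b00
WB_WA = 0b01       # write-back, write-allocate
WT_NO_WA = 0b10    # write-through, no write-allocate
WB_NO_WA = 0b11    # write-back, no write-allocate

# (TEX, C, B) for the two non-normal memory types.
STRONGLY_ORDERED = (0b000, 0, 0)
SHARED_DEVICE = (0b000, 0, 1)


def normal_memory_attrs(inner, outer):
    """Return (TEX, C, B) for normal memory with separate inner/outer policies."""
    tex = 0b100 | outer
    c = (inner >> 1) & 1
    b = inner & 1
    return tex, c, b


def section_descriptor(base, attrs, ap=0b011, domain=0):
    """Build a 1 MB section entry (AP and domain defaults are assumptions)."""
    if base & 0xFFFFF:
        raise ValueError("section base must be 1 MB aligned")
    tex, c, b = attrs
    return (
        base
        | ((ap >> 2) & 1) << 15   # AP[2]
        | tex << 12               # TEX[2:0]
        | (ap & 0b11) << 10       # AP[1:0]
        | domain << 5             # domain number
        | c << 3                  # C bit
        | b << 2                  # B bit
        | 0b10                    # descriptor type: section
    )


if __name__ == "__main__":
    best = section_descriptor(0, normal_memory_attrs(WB_WA, WB_WA))
    print(f"inner/outer write-back, write-allocate: {best:#010x}")
```

Changing the memory type of a region therefore amounts to rewriting a few bits per translation table entry, which is what makes the comparison across attribute combinations straightforward to run.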
Tables 3.3 and 3.4 compare the cacheability attributes that can be used with the MMU. Table 3.3 shows the number of cycles required to execute each of our 14 benchmarks when the MMU is configured with each of five different cacheability attribute combinations. The strongly ordered memory type is used as the baseline, and geomean cycle count and speedup are shown with respect to this baseline. Enabling caching and using write-back caching are the two factors which most significantly affect performance. Table 3.4 shows the relative speedups obtained for all possible combinations of cacheability for the inner (L1) and outer (L2) caches. Inner and outer write-through with no write-allocate is used as the baseline for the comparison. It is clear that enabling write-back on the inner cache produces the most significant performance boost, while enabling inner and outer write-allocate and outer write-back all offer more modest improvements. To obtain the best performance from the ARM core, programs are run from normal memory with inner and outer write-back and write-allocate enabled.

Table 3.2: Comparison of Tiger MIPS Soft Processor and Hard ARM Processor.
                                  MIPS                                 ARM
                                 Frequency        Time               Frequency        Time
Benchmark               Cycles       (MHz)        (µs)      Cycles       (MHz)        (µs)
chstone/adpcm          193,607       74.55       2,597      58,869         800          74
chstone/aes             73,777       74.55         990      29,433         800          37
chstone/blowfish       954,563       74.55      12,804     508,008         800         635
chstone/dfadd           16,496       74.55         221       8,679         800          11
chstone/dfdiv           71,507       74.55         959      12,914         800          16
chstone/dfmul            6,796       74.55          91       3,717         800           5
chstone/dfsin        2,993,369       74.55      40,153     353,579         800         442
chstone/gsm             39,108       74.55         525      17,188         800          21
chstone/jpeg        29,802,639       74.55     399,767   1,834,268         800       2,293
chstone/mips            43,384       74.55         582      23,086         800          29
chstone/motion          36,753       74.55         493       7,113         800           9
chstone/sha          1,209,523       74.55      16,224     418,550         800         523
dhrystone               28,855       74.55         387       9,968         800          12
mandelbrot          45,868,987       74.55     615,278  22,907,072         800      28,634
Geomean                227,146           -       3,047      69,722           -          87
Speedup                  1.000           -       1.000       3.258           -      35.022
Table 3.3: Comparison of MMU cacheability attributes: part 1.
                                                 Normal Memory;     Normal Memory;     Normal Memory;
                       Strongly                 Inner and Outer    Inner and Outer    Inner and Outer
                        Ordered       Device     Write-Through,     Write-Back, No        Write-Back,
Benchmark                Memory       Memory  No Write-Allocate     Write-Allocate     Write-Allocate
chstone/adpcm         2,163,373    1,465,479            355,881             58,899             58,922
chstone/aes             474,147      414,331             94,575             29,583             29,399
chstone/blowfish     15,891,977    8,948,215          2,621,543            515,069            508,064
chstone/dfadd           111,875      103,869             20,124              8,527              8,559
chstone/dfdiv           217,705      216,937             28,505             13,012             12,892
chstone/dfmul            40,665       37,943              8,794              3,719              3,705
chstone/dfsin         7,752,407    7,743,651            738,315            354,288            354,123
chstone/gsm             416,551      318,199             82,003             17,167             17,314
chstone/jpeg         56,588,081   44,633,431          9,122,910          1,825,810          1,825,063
chstone/mips            514,063      505,895             71,130             23,043             22,814
chstone/motion           50,167       29,565             17,117              8,109              7,032
chstone/sha          16,150,701   10,888,183          2,361,962            420,734            418,830
dhrystone               285,275      228,107             48,216              9,826              9,865
mandelbrot          418,287,671  411,864,129         23,202,244         22,907,078         22,907,088
Geomean               1,428,010    1,158,206            223,803             70,352             69,481
Speedup                   1.000        1.233              6.381             20.298             20.552
3.3.3 Configuration of the FPGA to HPS Interfaces
Section 3.1.3 introduced several ways in which HPS-side memory can be accessed from the FPGA fabric. For hybrid processor-accelerator systems it is important that the processor and accelerator share a coherent view of memory. There are a number of ways this can be achieved:
• DMA can be used to transfer memory between the processor system and FPGA fabric before and after accelerator invocations. All FPGA-side memory accesses could then read directly from on-chip memory on the FPGA.
• Processor caches can be flushed to SDRAM before an accelerator is invoked. This would allow all FPGA memory accesses to be routed directly to the SDRAM controller, or to the SDRAM controller via the L3 interconnect.
• Memory accesses from the FPGA fabric can be routed to memory through the ACP. Coherency would be inherently maintained by the ACP.
The second and third options could potentially benefit from an FPGA-side cache.
LegUp accelerators generally read a single word from memory at a time, and stall until it is received. This means that round-trip latency to memory is very important for LegUp accelerator performance. However, in [17], the authors demonstrate use of a DMA (direct memory access) engine with LegUp accelerators in order to transfer large memory buffers between the memory system and FPGA accelerators. This serves to increase the effective memory bandwidth of the accelerators. Unfortunately the flow presented in [17] requires manual work to instantiate the DMA cores, so it cannot be used with an automatic accelerator selection flow.
Table 3.5 shows the latencies for reading a single word from the processor memory subsystem from the FPGA. Figure 3.3 shows the read latencies for the three different memory interfaces. Accessing memory via the L3 interconnect is very slow, and is comparable to the time required to service a coherent memory access that misses in the processor caches. Accessing memory directly through the SDRAM controller is slightly slower than an ACP access that results in a cache hit.

Table 3.4: Comparison of MMU cacheability attributes: part 2.
                                                  Inner Cache Attribute
                                        Write-through         Write-back         Write-back
Outer Cache Attribute               No write-allocate  No write-allocate     Write-allocate
Write-through, no write-allocate                1.000              3.097              3.137
Write-back, no write-allocate                   1.170              3.181              3.165
Write-back, write-allocate                      1.204              3.205              3.221
Table 3.5: Single-word read latencies for FPGA to HPS interfaces.
HW Frequency (MHz)                     50       100       150
ACP Hit Latency (Cycles)               10        14        17
ACP Miss Latency (Cycles)           15-16     23-25     34-35
SDRAM-Direct Latency (Cycles)       10-11     16-17     21-22
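Because the latencies in Table 3.5 are expressed in cycles of the FPGA-side clock, it can help to convert them to absolute time. The sketch below does this for the ACP hit case and estimates the resulting throughput of an accelerator that issues one blocking single-word read at a time. The 4-byte word size is an assumption, the names are illustrative, and the estimate ignores any cycles spent between consecutive requests.

```python
# ACP hit latency in FPGA clock cycles, keyed by FPGA frequency in MHz (Table 3.5).
ACP_HIT_CYCLES = {50: 10, 100: 14, 150: 17}

WORD_BYTES = 4  # assumed word size for a single-word read


def cycles_to_ns(cycles, freq_mhz):
    """Convert a cycle count at the given clock frequency to nanoseconds."""
    return cycles * 1000.0 / freq_mhz


def blocking_read_bandwidth_mb_s(cycles, freq_mhz, word_bytes=WORD_BYTES):
    """Throughput in MB/s when each read stalls for the full round-trip latency."""
    return word_bytes / cycles_to_ns(cycles, freq_mhz) * 1000.0


if __name__ == "__main__":
    for freq_mhz, cycles in ACP_HIT_CYCLES.items():
        latency = cycles_to_ns(cycles, freq_mhz)
        bandwidth = blocking_read_bandwidth_mb_s(cycles, freq_mhz)
        print(f"{freq_mhz:3d} MHz: {latency:6.1f} ns per read, {bandwidth:5.1f} MB/s")
```

Under these assumptions the cycle count grows with FPGA frequency while the absolute latency shrinks, which suggests that a large part of the round trip is spent on the HPS side and does not scale with the FPGA clock.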