ARM System
3.2 Multiprocessor Unit Subsystem
The low-cost SoC FPGAs of both Xilinx and Intel FPGA feature Cortex-A9 MPCore processors.
Several features of this processor are important for achieving good performance, including the caches, branch predictor, memory management unit, SIMD engine and floating-point unit.
These features are discussed below.
3.2.1 L1 Caches
Each processor has its own L1 instruction cache and L1 data cache. Both caches are 4-way set-associative and 32 KB in size. L1 cache lines are 32 bytes long and are filled critical-word-first: the words of a line may arrive out of order so that the requested word is available as soon as possible. The L1 instruction cache has no prerequisites for being enabled, whereas the L1 data cache cannot be used until the memory management unit (MMU) is configured and enabled. This restriction is necessary because some parts of the address space are not suitable for caching. The MMU is required to distinguish between regular memory devices, such as RAM and Flash memory, and memory-mapped peripheral devices and configuration registers. Caching accesses to peripherals and memory-mapped configuration registers is undesirable because reads could return stale values and writes might not reach the device when expected. It typically takes 1-2 cycles to access the L1 caches [8]. The L1 data cache is physically indexed and physically tagged, meaning that the physical, rather than virtual, address is used for both indexing and tagging; the L1 instruction cache is virtually indexed and physically tagged.
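The cache geometry given above fixes how an address is divided during a lookup: 32 KB across 4 ways with 32-byte lines yields 256 sets, so 5 bits select the byte within a line, 8 bits select the set, and the remaining upper bits form the tag. The following is a minimal sketch of that arithmetic in C; the type `cache_addr_t` and the function `split_address` are hypothetical names introduced only for illustration, and the code models the address split rather than any hardware interface.

```c
#include <assert.h>
#include <stdint.h>

/* Geometry from the text: 32 KB, 4-way set-associative, 32-byte lines. */
#define CACHE_BYTES (32u * 1024u)
#define CACHE_WAYS  4u
#define LINE_BYTES  32u
#define NUM_SETS    (CACHE_BYTES / (CACHE_WAYS * LINE_BYTES)) /* 256 sets */
#define OFFSET_BITS 5u  /* log2(LINE_BYTES) */
#define INDEX_BITS  8u  /* log2(NUM_SETS)   */

/* Hypothetical helper type: the three fields a cache lookup uses. */
typedef struct {
    uint32_t tag;    /* address bits [31:13] */
    uint32_t set;    /* address bits [12:5]  */
    uint32_t offset; /* address bits [4:0]   */
} cache_addr_t;

/* Split a 32-bit address into tag, set index and byte offset. */
static cache_addr_t split_address(uint32_t addr)
{
    /* Guard against the constants above drifting out of sync. */
    assert((1u << OFFSET_BITS) == LINE_BYTES);
    assert((1u << INDEX_BITS) == NUM_SETS);

    cache_addr_t f;
    f.offset = addr & (LINE_BYTES - 1u);
    f.set    = (addr >> OFFSET_BITS) & (NUM_SETS - 1u);
    f.tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    return f;
}
```

One consequence of this split is that two addresses exactly 8 KB apart (the size of one way) fall into the same set and differ only in their tags, so more than four such lines cannot reside in the cache at once.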
3.2.2 L2 Cache
A single 512 KB L2 cache is shared by both processors in the MPCore cluster. The L2 cache is backed by the SDRAM controller, which, on the DE1-SoC board, is connected to off-chip DDR3 memory. Before use, the L2 cache controller must be properly configured using memory-mapped control registers. The L2 cache supports both instruction and data prefetch, as well as lockdown features that can prevent certain instructions or data from being evicted from the cache. The L2 cache is physically indexed and physically tagged. The typical access time for the L2 cache is 8 cycles, while an access to off-chip DRAM takes between 30 and 100 cycles [8].
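The latencies quoted above can be combined into a rough average memory access time (AMAT) using the standard two-level formula AMAT = t_L1 + m_L1 × (t_L2 + m_L2 × t_DRAM). The sketch below is only an illustration: the function name `amat_cycles` is hypothetical, the miss rates passed to it are assumed values rather than measurements, and it assumes that each quoted latency is paid in addition to the level above it.

```c
#include <assert.h>

/* Average memory access time, in cycles, for a two-level cache hierarchy:
 *   AMAT = t_L1 + miss_L1 * (t_L2 + miss_L2 * t_DRAM)
 * Miss rates are fractions in [0, 1] and must be supplied by the caller. */
static double amat_cycles(double l1_cycles, double l1_miss_rate,
                          double l2_cycles, double l2_miss_rate,
                          double dram_cycles)
{
    assert(l1_miss_rate >= 0.0 && l1_miss_rate <= 1.0);
    assert(l2_miss_rate >= 0.0 && l2_miss_rate <= 1.0);
    return l1_cycles + l1_miss_rate * (l2_cycles + l2_miss_rate * dram_cycles);
}
```

For example, with an assumed 5% L1 miss rate, an assumed 20% L2 miss rate, a 1-cycle L1, the 8-cycle L2 and a 100-cycle DRAM access, the estimate is 1 + 0.05 × (8 + 0.20 × 100) = 2.4 cycles per access.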
3.2.3 Branch Predictor
Each Cortex-A9 core has a branch prediction unit to assist in the prediction of program control flow. The branch predictor includes a branch target address cache for predicting the target of both conditional and unconditional branches. Also included is a return stack for predicting the target of return instructions. The branch predictor is software-visible, allowing maintenance operations to be performed on it. For example, the branch predictor can be invalidated via software. The branch predictor is enabled using the System Control Register and has no prerequisites.
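The enable bits for the branch predictor, the L1 caches and the MMU all reside in the System Control Register (SCTLR). The sketch below computes an updated SCTLR value in portable C; the bit positions are those defined by the ARMv7-A architecture, while the helper `sctlr_enable` is a hypothetical name used only for illustration. On real hardware the value would be read and written through the coprocessor interface (CP15 register c1), which is deliberately left out here so that the example remains host-runnable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* SCTLR enable bits as defined by the ARMv7-A architecture. */
#define SCTLR_M (1u << 0)   /* MMU enable                */
#define SCTLR_C (1u << 2)   /* data cache enable         */
#define SCTLR_Z (1u << 11)  /* branch prediction enable  */
#define SCTLR_I (1u << 12)  /* instruction cache enable  */

/* Hypothetical helper: return `sctlr` with the requested features enabled.
 * The instruction cache and branch predictor have no prerequisites; the
 * data cache may only be requested once the MMU bit is already set. */
static uint32_t sctlr_enable(uint32_t sctlr, bool icache, bool bpred,
                             bool dcache)
{
    if (icache)
        sctlr |= SCTLR_I;
    if (bpred)
        sctlr |= SCTLR_Z;
    if (dcache) {
        /* Prerequisite from the text: the MMU must be enabled first. */
        assert(sctlr & SCTLR_M);
        sctlr |= SCTLR_C;
    }
    return sctlr;
}
```

Starting from a value of zero, requesting only the instruction cache and branch predictor yields 0x1800; the data cache bit can be added later, once the page tables are in place and the MMU bit is set.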
3.2.4 Memory Management Unit
Each core in the Cortex-A9 MPCore processor has a dedicated memory management unit, or MMU. The MMU translates virtual memory addresses to physical memory addresses. This translation is accomplished by breaking the virtual address space into pages, each of which maps to part of the physical memory. A page table contains a page table entry for each page.
In addition to specifying the corresponding physical page, each page table entry specifies the permissions and memory attributes of the physical page. For example, the page table entry may specify whether the page is read-only, whether caching is allowed on the page at each cache level, whether caching is write-back or write-through, whether the memory is executable, whether non-secure access is allowed to the page, and several other attributes. ARM uses two levels of page tables. There are several types of first-level page table entries:
• fault entries mean no valid virtual-to-physical mapping is available.
• section entries point to a 1 MB section of physical memory.
• supersection entries point to a 16 MB section of physical memory.
• page table entries point to a second-level page table whose entries map memory at a finer granularity (4 KB small pages or 64 KB large pages).
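The Cortex-A9 supports two page tables, each located by its own translation table base register (TTBR0 and TTBR1); a configuration register, the Translation Table Base Control Register (TTBCR), determines which of the two is used for a given virtual address. Page tables are also referred to as translation tables in the ARM documentation.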
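The first-level entry types listed above are distinguished by the low-order bits of each 32-bit entry. The sketch below decodes them according to the ARMv7-A short-descriptor format and shows how a section entry maps a virtual address. The names `l1_type_t`, `l1_entry_type`, `l1_index` and `section_translate` are hypothetical, and the index calculation assumes a single first-level table covering the full 4 GB address space (4096 entries, one per 1 MB).

```c
#include <assert.h>
#include <stdint.h>

/* Kinds of first-level entry in the ARMv7-A short-descriptor format. */
typedef enum {
    L1_FAULT,
    L1_PAGE_TABLE,
    L1_SECTION,
    L1_SUPERSECTION,
    L1_RESERVED
} l1_type_t;

/* Classify a first-level entry from bits [1:0] and, for sections, bit 18. */
static l1_type_t l1_entry_type(uint32_t desc)
{
    switch (desc & 0x3u) {
    case 0x0: return L1_FAULT;
    case 0x1: return L1_PAGE_TABLE;
    case 0x2: return (desc & (1u << 18)) ? L1_SUPERSECTION : L1_SECTION;
    default:  return L1_RESERVED;
    }
}

/* Index of the first-level entry for `va`: one entry per 1 MB, so VA[31:20].
 * Assumes one table spans the whole 4 GB address space. */
static uint32_t l1_index(uint32_t va)
{
    return va >> 20;
}

/* Translate `va` through a section entry: the entry supplies PA[31:20]
 * and the virtual address supplies the 1 MB offset in bits [19:0]. */
static uint32_t section_translate(uint32_t desc, uint32_t va)
{
    assert(l1_entry_type(desc) == L1_SECTION);
    return (desc & 0xFFF00000u) | (va & 0x000FFFFFu);
}
```

For instance, a section entry with the value 0x40100002 maps the virtual address 0x00012345 to the physical address 0x40112345. The permission and memory attribute bits of the entry are ignored in this sketch.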
Since the page table is implemented in memory, the MMU maintains small caches of page table entries to speed up address translation. These caches are called translation lookaside buffers, or TLBs. The Cortex-A9 includes separate instruction and data micro TLBs that provide single-cycle access to translations; each has 32 entries. The micro TLBs are backed by a shared main TLB with 128 entries. If the requested page cannot be found in any of the TLBs, the appropriate page table entry must be read from the memory system.
3.2.5 NEON SIMD Engine with FPU
The base ARM instruction set can be extended in a non-intrusive way using what are called ‘coprocessors’. The ARM architecture supports up to 16 different coprocessors, which are accessed using special instructions described in more detail in Section 3.4.3. The coprocessor interface allows low-latency access to a peripheral, especially compared to mapping the peripheral’s registers into the ARM memory space. On the Cortex-A9, each processor includes an ARM NEON single instruction, multiple data (SIMD) engine connected to the ARM processor through the coprocessor interface. The SIMD coprocessor also includes a vector floating-point unit that conforms to the IEEE 754 specification and supports add, subtract, multiply, divide, multiply-accumulate, and square root operations. Together, these features allow the ARM processor to excel at many computationally intensive applications, especially when compared to soft processors implemented in the FPGA fabric.
3.2.6 Snoop Control Unit
The snoop control unit, or SCU, maintains data coherency between the processors in the MPCore cluster. The SCU also manages accesses from the accelerator coherency port, or ACP, which enables cache-coherent memory accesses from the L3 interconnect, and therefore from the FPGA fabric. The SCU can issue speculative linefill requests to the L2 cache.
3.2.7 Performance Monitor Unit
Each of the Cortex-A9 cores has a performance monitoring unit (PMU) that provides hardware counters for collecting statistics related to the processor and memory system during operation.
Six counters are available for counting nearly five dozen different events. The PMU can be controlled and configured through software using an ARM coprocessor interface. Chapter 4 describes the use of the PMU and its features in greater depth.