Introduction
SW/SWJ-DP
1.5 Prefetch Unit
The purpose of the Prefetch Unit (PFU) is to:
• Fetch instructions in advance and forward PC relative branch instructions. Fetches are speculative in the case of conditional branches
• Detect Thumb-2 instructions and present these as a single instruction word. • Perform vector loads.
The PFU fetches instructions from the memory system that can supply one word each cycle. The PFU buffers up to three word fetches in its FIFO, which means that it can buffer up to three Thumb-2 instructions or six Thumb instructions.
The majority of branches that are generated as the ALU addition of PC plus immediate are generated no later than the decode phase of the branch opcode. In the case of conditionally executed branches, the address is speculatively presented (consuming a fetch slot on the bus), and the forwarded result determines if the branch path flushes the fetch queue or is preserved.
Short subroutine returns are optimized to take advantage of the forwarding behavior in the case of BX LR.
branch is presented at least a cycle earlier than when the opcode reaches execute. Branch forwarding increases the performance of the core, because branches are a significant part of embedded controller applications. Branches affected are PC relative with immediate offset, or use LR as the target register. For conditional branches, by opcode definition or within IT block, that are forwarded, the address must be presented speculatively as the condition evaluation in an internal critical path.
Branch forwarding loses a fetch opportunity if speculated on a conditional opcode, but is mitigated by a three-entry fetch queue and a mix of 16/32-bit opcodes and single cycle ALU. The additional penalty is a cycle of pipeline stalling. The worst case is three 32-bit load/store single opcodes, the instructions word-unaligned, with no data waitstates. The BRCHSTAT interface provides information on forwarded branches to conditional execution, the direction if conditional, and a trailing registered evaluation of success of the preceding conditional opcode. For more information on BRCHSTAT see Branch status interface on page 16-6.
The performance of the core with ICODE registered with prefetch is effectively the same as the core without the branch forwarding interface, around 10% slower. Branch forwarding can be thought of as the internal address generation logic pre-registration to the address interface, increasing flexibility to the memory controller if you have the timing budget to make use of the information a cycle sooner. For example lower MHz power sensitive targets, in 0.13u down to 65nm. Otherwise, you have the flexibility of having access to this early address in your memory controller for lookups before registration to the system.
Branch speculation is more costly against a wait-stated memory because of
mispredictions. To avoid this overhead, a rule in the controller that conditional branches are not speculated but instead registered gives subroutine calls and returns the benefits of branch forwarding without the mispredictions penalty. A refinement is to only predict backward conditional branches to accelerate loops. Alternatively, with ARM compilers favouring loops with unconditional branch backwards at the bottom and then
conditional branch forward tests on the loop limit, the core fetch queue being ahead at the start of the loop yields good behavior.
The BRCHSTAT also includes other information about the next opcode to reach execute. Unlike the forwarded branches where BRCHSTAT is incident with the transaction, BRCHSTAT with respect to execute opcodes is a hint unrelated to any transaction and can be asserted for multiple cycles. The controller can use this information to suppress additional prefetching because it knows a branch is taken shortly. This helps to avoid any trailing waitstates of the controller prefetch from impacting the branch target when it is generated in execute.
The following scenarios show how you can use branch forwarding and the BRCHSTAT control to get the best performance from your memory system. The scenarios focus on the ideal Harvard setup, where instructions execute from ICODE, literals execute from DCODE (unified to ICODE), and stack/heap/application data executes from SYSTEM. • Zero waitstate
• Zero waitstate, registered fetch interface (ICODE)
• One wait state flash
• One wait state flash, registered fetch interface (ICODE)
• Two wait states flash on page 1-18. 1.6.1 Zero waitstate
Branch prediction provides approximately 10% gain over not having the feature, and except for extreme cases, the processor has all the benefits of 100% branch prediction but with no penalty from branch speculation.
1.6.2 Zero waitstate, registered fetch interface (ICODE)
Branch forwarding results in more aggressive timing on the ICODE interface. If this bus is a critical path in the system, the ICODE interface might be registered. To avoid an approximate 25% penalty of adding a wait state, you can add a circuit that acts as a single-entry prefetcher to avoid the push-pull problems of the registered interface.
1.6.3 One wait state flash
Adding wait states to the flash impacts performance of any core. You can use a cache to lessen this penalty, but this has a dramatic effect on determinism and silicon area. A line prefetcher with two line entries can provide comparable performance to a cache using many less gates. 128-bits is a common prefetch width for ARM7 targets because of the 32-bit instruction set. The processor has the benefit of Thumb-2, a mixed 16/32-bit instruction set. This means that a 64-bit prefetch width provides comparable benefits to a 128-bit interface.
1.6.4 One wait state flash, registered fetch interface (ICODE)
If the ICODE interface must be registered, you can reduce the cost of mispredictions to just the slave side of the prefetch controller. The core still loses the opportunity of the fetch queue request on the ICODE interface, as in the zero wait state case. However, the trailing registered BRCHSTAT[3] status of the conditional execution can mask the external mispredict on the output of the controller's registered system interface, appearing as an idle cycle.
to which the compiler tools reduce the overhead of branches, conditioning loops towards the strengths of the hardware, the less the effects of the mismatch between core and memory system speeds. A 128-bit interface is better at this point.