Cortex-A Series Programmer’s Guide. Version: 2.0. Copyright 2011 A


4.5 Branch prediction

As we have seen, branch prediction logic is an important factor in achieving high throughput in Cortex-A series processors. With no branch prediction, we would have to wait until a conditional branch executes before we could determine where to fetch the next instruction from. The first time that a conditional branch instruction is fetched, there is little information on which to base a prediction about the address of the next instruction.

Older ARM processors used static branch prediction. This is the simplest branch prediction method, as it needs no prior information about the branch. We speculate that backward branches will be taken and forward branches will not. A backward branch has a target address that is lower than its own address. This can easily be recognized in hardware because the branch offset is encoded as a two’s complement number, so we need only look at a single opcode bit (the sign bit of the offset) to determine the branch direction. This technique gives reasonable prediction accuracy owing to the prevalence of loops in code: loops almost always contain backward-pointing branches, and those branches are taken more often than not.

Due to the pipeline length of Cortex-A series processors, we get better performance by using more complex branch prediction schemes, which give better prediction accuracy. This comes at a small price, as additional logic is required.
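The static scheme described above can be sketched in C. This is an illustration only, not a description of any particular processor's logic. It assumes the ARM-state B/BL encoding, in which bits [23:0] hold a signed word offset, so bit 23 is the sign bit. The function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: static prediction for an ARM-state B/BL opcode.
 * Bits [23:0] hold a two's complement word offset, so bit 23 is the
 * sign bit. Sign bit set means a backward branch: predict taken. */
static bool static_predict_taken(uint32_t opcode)
{
    return ((opcode >> 23) & 1u) != 0u;
}

/* Hypothetical helper: byte offset of the branch target relative to the
 * address of the branch instruction itself. In ARM state the PC reads as
 * the instruction address plus 8, hence the final adjustment. */
static int32_t branch_byte_offset(uint32_t opcode)
{
    uint32_t imm24 = opcode & 0x00FFFFFFu;
    /* Sign-extend 24 bits to 32 bits using only well-defined arithmetic. */
    int32_t words = (int32_t)(imm24 ^ 0x00800000u) - 0x00800000;
    return words * 4 + 8;
}
```

For example, the opcode 0x1AFFFFFC (a BNE that targets the instruction two words before it) has bit 23 set and would be predicted taken, while 0x0A000002 (a forward BEQ) would be predicted not taken. Because the offset is measured from the PC value rather than from the branch address, the sign bit is a close approximation of "target lower than own address" and not an exact test.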

Dynamic prediction hardware can further reduce the average branch penalty by making use of history information about whether conditional branches were taken or not taken on previous execution. A Branch Target Address Cache (BTAC), also called Branch Target Buffer (BTB) in the Cortex-A8 processor, is a cache which holds information about previous branch instruction execution. It enables the hardware to speculate on whether a conditional branch will or will not be taken.
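To make the idea of history-based prediction concrete, here is a minimal sketch of one textbook technique: a table of 2-bit saturating counters indexed by the branch address. The text above does not say which scheme any Cortex-A processor uses, so the counter width, the table size, the indexing and all names here are assumptions chosen for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define BHT_ENTRIES 256u   /* hypothetical table size */

/* One 2-bit saturating counter per entry (an assumed scheme):
 * values 0 and 1 predict not taken, values 2 and 3 predict taken. */
typedef struct {
    uint8_t counter[BHT_ENTRIES];
} BranchHistory;

static void bht_init(BranchHistory *bht)
{
    for (uint32_t i = 0; i < BHT_ENTRIES; i++)
        bht->counter[i] = 1u;              /* start at "weakly not taken" */
}

static uint32_t bht_index(uint32_t pc)
{
    return (pc >> 2) % BHT_ENTRIES;        /* assumes word-aligned instructions */
}

static bool bht_predict(const BranchHistory *bht, uint32_t pc)
{
    return bht->counter[bht_index(pc)] >= 2u;
}

/* Record the real outcome once the branch has executed. */
static void bht_update(BranchHistory *bht, uint32_t pc, bool taken)
{
    uint8_t *c = &bht->counter[bht_index(pc)];
    if (taken && *c < 3u)
        (*c)++;
    else if (!taken && *c > 0u)
        (*c)--;
}

/* Model a loop-closing branch at address pc: taken on every iteration
 * except the last. Returns the number of mispredictions. */
static unsigned simulate_loop_branch(BranchHistory *bht, uint32_t pc,
                                     unsigned iterations)
{
    unsigned mispredicts = 0;
    for (unsigned i = 0; i < iterations; i++) {
        bool taken = (i + 1u < iterations);
        if (bht_predict(bht, pc) != taken)
            mispredicts++;
        bht_update(bht, pc, taken);
    }
    return mispredicts;
}
```

With these assumptions, a ten-iteration loop mispredicts twice on its first run (the first iteration and the loop exit) and once on each later run (the loop exit only), because the counter remembers that the branch is usually taken.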

The processor must still evaluate the condition code attached to a branch instruction. If the branch prediction hardware predicts correctly, the pipeline does not need to be stalled. If the branch prediction hardware speculation was wrong, the processor will flush the pipeline and refill it.

4.5.1 Return stack

Readers who are not at all familiar with ARM assembly language may want to omit this section until they have read Chapter 5 and Chapter 6.

The description in Branch prediction looked at strategies the processor can use to predict whether branches are taken or not. For most branch instructions, the target address is fixed and encoded in the instruction. However, there is a class of branches where the target cannot be determined by looking at the instruction alone. For example, if we perform a data processing operation which modifies the PC (for example, MOV, ADD or SUB), we must wait for the ALU to evaluate the result before we can know the branch target. Similarly, if we load the PC from memory using an LDR, LDM or POP instruction, we cannot know the target address until the load completes.

Such branches (often called indirect branches) cannot, in general, be predicted in hardware. There is, however, one common case that can usefully be optimized, using a last-in-first-out stack in the pre-fetch hardware (the return stack). Whenever a function call instruction (BL or BLX) is executed, we enter the address of the following instruction into this stack. Whenever we encounter an instruction which can be recognized as a function return (BX LR, or a stack pop which contains the PC in its register list), we can speculatively pop an entry from the stack and start fetching instructions from that address. When the return instruction actually executes, the hardware compares the address generated by the instruction with that predicted by the stack. If there is a mismatch, the pipeline is flushed and we restart from the correct location.


The return stack is of a fixed size (eight entries in the Cortex-A8 or Cortex-A9 processors, for example). If a particular code sequence contains a large number of nested function calls, the return stack can predict only the first eight function returns. The effect of this is likely to be very small, as most functions do not invoke eight levels of nested functions.
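The behavior of a fixed-size return stack can be modeled in a few lines of C. This is a sketch under stated assumptions, not the hardware design: the eight-entry depth comes from the text above, but the overflow policy (a new entry overwrites the oldest one) and every name and address used here are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define RS_DEPTH 8u   /* eight entries, as quoted for Cortex-A8 and Cortex-A9 */

typedef struct {
    uint32_t entry[RS_DEPTH];
    unsigned top;     /* index of the next slot to write */
    unsigned count;   /* number of valid entries, at most RS_DEPTH */
} ReturnStack;

static void rs_init(ReturnStack *rs)
{
    rs->top = 0;
    rs->count = 0;
}

/* On BL/BLX: record the address of the following instruction.
 * Assumption: when the stack is full, the oldest entry is overwritten. */
static void rs_push(ReturnStack *rs, uint32_t return_addr)
{
    rs->entry[rs->top] = return_addr;
    rs->top = (rs->top + 1u) % RS_DEPTH;
    if (rs->count < RS_DEPTH)
        rs->count++;
}

/* On a recognized return: pop a predicted target address.
 * Returns false when the stack is empty and no prediction is available. */
static bool rs_pop(ReturnStack *rs, uint32_t *predicted)
{
    if (rs->count == 0u)
        return false;
    rs->top = (rs->top + RS_DEPTH - 1u) % RS_DEPTH;
    rs->count--;
    *predicted = rs->entry[rs->top];
    return true;
}

/* Nest `depth` calls, then unwind them all, and count how many returns
 * the stack predicts correctly. Return addresses are made up. */
static unsigned simulate_nested_calls(unsigned depth)
{
    ReturnStack rs;
    unsigned correct = 0;

    rs_init(&rs);
    for (unsigned i = 0; i < depth; i++)
        rs_push(&rs, 0x8000u + 4u * i);

    for (unsigned i = depth; i-- > 0; ) {
        uint32_t actual = 0x8000u + 4u * i;   /* address the return produces */
        uint32_t predicted;
        if (rs_pop(&rs, &predicted) && predicted == actual)
            correct++;
        /* On a mismatch, real hardware would flush the pipeline here. */
    }
    return correct;
}
```

Under these assumptions, ten levels of nesting yield eight correctly predicted returns (the innermost eight, which execute first), and the two outermost returns find the stack empty. Five levels of nesting are predicted in full.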

4.5.2 Programmer’s view

For the majority of application level programmers, branch prediction is a part of the hardware implementation which can safely be ignored. However, knowledge of the processor behavior with branches can be useful when writing highly optimized code. The hardware performance monitor counters can generate information about the numbers of branches correctly or incorrectly predicted. This hardware is described further in Chapter 17.

Branch prediction logic is disabled at reset. Part of the boot code sequence will typically be to set the Z bit in the CP15:SCTLR, System Control Register, which enables branch prediction. There is one other situation where the programmer might need to take care. When moving or modifying code at an address from which code has already been executed in the system, it might be necessary (and is always prudent) to remove stale entries from the branch history logic by using the CP15 instruction which invalidates all entries.
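The two operations just described might look like the following in boot code. This is a sketch that assumes an ARMv7-A processor running in a privileged mode and a GCC-compatible compiler. The Z bit is bit 11 of SCTLR, and the invalidate-all operation is the CP15 BPIALL operation. The function names and the ARMV7A_PRIVILEGED_BUILD macro are hypothetical; the macro simply keeps the CP15 accesses out of builds for other targets.

```c
#include <stdint.h>

#define SCTLR_Z_BIT (1u << 11)   /* SCTLR.Z: branch prediction enable */

/* Pure helper: the SCTLR value with branch prediction enabled. */
static uint32_t sctlr_with_branch_prediction(uint32_t sctlr)
{
    return sctlr | SCTLR_Z_BIT;
}

#ifdef ARMV7A_PRIVILEGED_BUILD   /* hypothetical macro, see the note above */

/* Read-modify-write SCTLR to set the Z bit. */
static void enable_branch_prediction(void)
{
    uint32_t sctlr;
    __asm__ volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(sctlr));
    sctlr = sctlr_with_branch_prediction(sctlr);
    __asm__ volatile("mcr p15, 0, %0, c1, c0, 0" : : "r"(sctlr) : "memory");
    __asm__ volatile("isb" : : : "memory");
}

/* BPIALL: invalidate all branch predictor entries, for use after code
 * has been moved or modified. The register value written is ignored. */
static void invalidate_branch_predictor(void)
{
    uint32_t zero = 0;
    __asm__ volatile("mcr p15, 0, %0, c7, c5, 6" : : "r"(zero) : "memory");
    __asm__ volatile("dsb" : : : "memory");
    __asm__ volatile("isb" : : : "memory");
}

#endif /* ARMV7A_PRIVILEGED_BUILD */
```

Keeping the bit manipulation in a separate pure function lets it be checked on a host machine, while the CP15 accesses compile only for the intended target.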