The changes from ARMv3 to ARMv4 include a new execution mode and new instructions: long multiplication, halfword memory access operations, and signed memory access operations. The changes from ARMv4 to ARMv4T consist of a new branch instruction, a new bit in the PSRs, and the triggering of an undefined exception when trying to enter Thumb state.
3.4.1 System Mode
ARMv4 introduces the system execution mode. System mode shares its registers with user mode and does not have any banked registers. It is a privileged mode, and is therefore able to modify the program status registers. Since it shares its registers with user mode, adding support for system mode requires fewer changes than adding support for abort mode or undefined mode would. The only change required is to allow system mode to modify the entire PSR.
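The privilege rule described above can be illustrated with a small behavioral model. This is only a sketch, not the Rav hardware description: the function name write_cpsr is hypothetical, the mode encodings (user 0b10000, system 0b11111) are the standard ARM values, and the assumption that user mode may only change the four condition flags in bits 31:28 is taken from the ARM architecture rather than from the text above.

```python
MODE_MASK = 0x1F
MODE_USER = 0b10000      # standard ARM encoding for user mode
MODE_SYSTEM = 0b11111    # standard ARM encoding for system mode
FLAGS_MASK = 0xF0000000  # N, Z, C, V condition flags (assumed user-writable)
FULL_MASK = 0xFFFFFFFF


def write_cpsr(cpsr, new_value):
    """Return the CPSR after a write attempt made in the current mode.

    User mode may only change the condition flags; every other mode,
    including system mode, is privileged and may change the whole register.
    """
    current_mode = cpsr & MODE_MASK
    mask = FLAGS_MASK if current_mode == MODE_USER else FULL_MASK
    return ((cpsr & ~mask) | (new_value & mask)) & FULL_MASK
```

In this model a write from user mode leaves the mode bits untouched, while the same write from system mode replaces the entire register.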
3.4.2 Long Multiplication
ARMv4 adds long multiplication and long multiply accumulate instructions. These instructions use 32-bit registers as operands and store the 64-bit result in a pair of 32-bit registers, whereas the ARMv3 multiply instructions store their result in a single 32-bit register. The Amber processor tile used an implementation of Booth's multiplication algorithm to perform 32-bit multiplication [27]. This implementation consists of 33 cycles of shifts and additions. A multiply accumulate instruction requires one extra addition cycle. This multiplication unit was replaced with an FPGA-specific multiplication unit which is able to perform 64-bit multiplication in one cycle. Although the unit can complete a multiplication in one cycle, the new design was unable to meet timing constraints at the original clock frequency, which would have had to be lowered. The multiplication unit was therefore changed to take two cycles instead, in order to maintain the processor clock frequency.
The Amber processor tile was only able to read three register values each cycle, while a long multiply accumulate instruction needs to read four registers in a single cycle. In order to support this instruction, Rav adds support for reading four registers. This has been done by splitting a single multiplexer, which was used for both Rd and Rs, into two multiplexers, as seen in Figure 3.4.
The new implementation of the multiplication unit performs a signed 33-bit multiplication of two register operands, optionally followed by a 64-bit addition to the 64-bit multiplication result, producing a 64-bit result.
Two registers are used to store the 64-bit result. The register bank was therefore modified to allow two register writes simultaneously. The reg_high and
Figure 3.4: Going from three-register read support (left) in ARMv3 to four-register read support (right) in ARMv4.
A 33-bit multiplication is used because the FPGA multiplication unit always performs a signed multiplication. The extra bit in each operand makes it possible to force unsigned behavior. For an unsigned long multiplication, umull, or unsigned long multiply accumulate, umlal, the register arguments are padded with an extra 0 as the most significant bit. For signed multiplication, smull and smlal, both register arguments are sign extended by replicating the most significant bit, producing two 33-bit arguments.
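The effect of the extra operand bit can be checked with a small behavioral model. The following is a sketch under stated assumptions, not the FPGA multiplier itself: the function names are hypothetical, and Python integers stand in for the hardware's signed 33-bit operands.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def to_33bit_operand(reg, signed):
    """Build the 33-bit operand pattern from a 32-bit register value.

    Unsigned operands get a 0 as bit 32; signed operands replicate bit 31.
    """
    reg &= MASK32
    top = (reg >> 31) & 1 if signed else 0
    return (top << 32) | reg


def as_signed(value, bits):
    """Interpret a bit pattern of the given width as two's complement."""
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


def long_multiply(rm, rs, signed):
    """Model umull (signed=False) or smull (signed=True).

    The multiplier is always signed; only the 33-bit extension differs.
    Returns the 64-bit product as a (high word, low word) pair.
    """
    a = as_signed(to_33bit_operand(rm, signed), 33)
    b = as_signed(to_33bit_operand(rs, signed), 33)
    product = (a * b) & MASK64
    return product >> 32, product & MASK32
```

With both operands equal to 0xFFFFFFFF, the unsigned case yields the pair (0xFFFFFFFE, 0x00000001), while the signed case treats both operands as -1 and yields (0x00000000, 0x00000001).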
The new multiplication unit is also able to handle the old word-sized mul and mla instructions. This is done by discarding the top bits of the result and writing only the low 32 bits. If the long_multiply signal is high, Rn and Rd are concatenated to form the 64-bit argument; otherwise Rd is sign extended into a 64-bit argument. This argument is only added to the result when the mult_accumulate signal is high.
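The interaction of the long_multiply and mult_accumulate signals can be sketched as a behavioral model. The function name multiply_unit and its argument order are hypothetical. The text does not state which of Rn and Rd forms the upper half of the concatenated argument; this sketch assumes Rn is the upper word and Rd the lower word.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    sign_bit = 1 << (bits - 1)
    value &= (1 << bits) - 1
    return (value ^ sign_bit) - sign_bit


def multiply_unit(rm, rs, rn, rd, signed, long_multiply, mult_accumulate):
    """Model mul, mla, umull, umlal, smull and smlal in one unit.

    Returns (high word, low word); the high word is None for the
    word-sized instructions, which only write the low 32 bits.
    """
    # 33-bit operands: zero-padded when unsigned, sign-extended when signed.
    a = sign_extend(rm, 32) if signed else rm & MASK32
    b = sign_extend(rs, 32) if signed else rs & MASK32
    result = a * b

    if mult_accumulate:
        if long_multiply:
            # Assumption: Rn is the upper word and Rd the lower word.
            addend = ((rn & MASK32) << 32) | (rd & MASK32)
        else:
            addend = sign_extend(rd, 32)
        result += addend

    result &= MASK64
    if long_multiply:
        return result >> 32, result & MASK32
    return None, result & MASK32
```

For the word-sized instructions the choice between signed and unsigned extension does not affect the outcome, since only the low 32 bits of the result are kept.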
3.4.3 Halfword Load and Store
ARMv4 adds new load and store instructions. These are load halfword, store halfword, load signed halfword and load signed byte: ldrh, strh, ldrsh and ldrsb, respectively.
The Amber processor tile memory interface writes a word, 4 bytes, at a time. It uses a 4-bit signal, byte_enable, to select which of these four bytes should be written. For halfword memory access instructions, the byte_enable signal is decoded from the memory address by looking at the second least significant bit. If it is 1, the two upper bytes in the word are written; if it is 0, the two lower bytes in the word are written.
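The halfword decoding of byte_enable can be sketched as follows. This is a behavioral illustration only: the function names are hypothetical, and the sketch assumes that bit i of byte_enable controls byte lane i, with lane 0 holding the least significant byte of the word.

```python
def halfword_byte_enable(address):
    """Decode byte_enable for a halfword store from address bit 1."""
    return 0b1100 if (address >> 1) & 1 else 0b0011


def write_word(old_word, new_word, byte_enable):
    """Merge new_word into old_word, one byte lane per enabled bit."""
    result = old_word
    for lane in range(4):
        if (byte_enable >> lane) & 1:
            mask = 0xFF << (8 * lane)
            result = (result & ~mask) | (new_word & mask)
    return result & 0xFFFFFFFF
```

For example, a halfword store to an address with bit 1 set yields byte_enable 0b1100, so only the two upper bytes of the stored word are replaced.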
In order to load a value into a register, the Amber processor tile memory interface loads a word from memory. The value then passes through the load function, which processes the value based on three input signals. The byte signal tells the load function whether a single byte is to be loaded. The desired byte is selected based on the two least significant bits of the memory address. The halfword signal tells the load function whether a halfword is to be loaded. The desired bytes are selected by looking at the second least significant bit of the memory address. The sign signal tells the load function whether it should sign extend the value. This signal cannot be 1 unless either the halfword or the byte signal is 1. If neither the byte nor the halfword signal is high, the memory transaction is a regular load and the entire word is written to a register.
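The behavior of the load function can be summarized in a short model. This is a sketch rather than the actual hardware description: the Python signature is hypothetical, and little-endian byte ordering within the word is an assumption.

```python
MASK32 = 0xFFFFFFFF


def load(word, address, byte=False, halfword=False, sign=False):
    """Select and optionally sign extend the loaded value.

    byte     -> select one byte using address bits 1:0
    halfword -> select two bytes using address bit 1
    sign     -> sign extend the selected byte or halfword to 32 bits
    """
    if byte:
        bits = 8
        value = (word >> (8 * (address & 3))) & 0xFF
    elif halfword:
        bits = 16
        value = (word >> (16 * ((address >> 1) & 1))) & 0xFFFF
    else:
        # Regular load: the entire word is passed on unchanged.
        return word & MASK32

    if sign and value & (1 << (bits - 1)):
        # Fill all bits above the selected value with ones.
        value |= MASK32 & ~((1 << bits) - 1)
    return value
```

With the word 0x80FF7F01, a signed byte load from an address ending in binary 10 selects 0xFF and returns 0xFFFFFFFF, while an unsigned halfword load from the same address returns 0x000080FF.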
3.4.4 Branch and Exchange
In order to be ARMv4T compliant, the bx instruction needs to be supported. It is a branch instruction which also allows the user to change the execution state from regular ARM state to Thumb state.
Figure 3.5: The program status register format in ARMv4T.
Since the value in the PC register is word aligned, the least significant bit of any value written to it is ignored. The T-bit is added to the PSRs, as shown in Figure 3.5. The bx instruction uses the otherwise ignored least significant bit of its register operand to set the T-bit of the CPSR. If this bit is 1, the processor enters the 16-bit Thumb execution state. Otherwise the processor stays in the 32-bit ARM execution state. As the Thumb execution state is not supported by Rav, writing 1 to the T-bit of the CPSR triggers a special Thumb exception. Upon the next instruction, the decode stage then triggers an undefined exception and puts the processor back in the 32-bit ARM execution state. The undefined exception handler can detect this case by checking the T-bit in the SPSR.
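The sequence of events around bx can be sketched with a small behavioral model. The function names bx and decode_next are hypothetical, and other side effects of exception entry (such as saving the return address or masking interrupts) are deliberately left out. The T-bit position (bit 5) and the undefined mode encoding (0b11011) are the standard ARMv4T values.

```python
T_BIT = 1 << 5            # T-bit position in the ARMv4T PSR
MODE_MASK = 0x1F
MODE_UNDEFINED = 0b11011  # standard ARM encoding for undefined mode


def bx(cpsr, rm):
    """Branch and exchange: bit 0 of rm selects the state, the rest is the PC."""
    pc = rm & ~1 & 0xFFFFFFFF
    cpsr = (cpsr | T_BIT) if rm & 1 else (cpsr & ~T_BIT)
    return pc, cpsr


def decode_next(cpsr):
    """Model the decode stage reacting to a set T-bit.

    Returns (new cpsr, saved spsr or None, whether an undefined
    exception was taken).
    """
    if cpsr & T_BIT:
        spsr_und = cpsr  # the handler can inspect the T-bit here
        cpsr = (cpsr & ~(T_BIT | MODE_MASK)) | MODE_UNDEFINED
        return cpsr, spsr_und, True
    return cpsr, None, False
```

In this model, a bx to an odd address sets the T-bit, and the following call to decode_next clears it again while preserving the set T-bit in the saved SPSR value.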