Processor core
The processor core implements the ARMv7-M architecture. It has the following main features:
• Thumb-2 instruction set architecture (ISA) subset consisting of all base Thumb-2 instructions, 16-bit and 32-bit.
• Harvard processor architecture enabling simultaneous instruction fetch with data load/store.
• Three-stage pipeline.
• Single cycle 32-bit multiply.
• Hardware divide.
• Thumb and Debug states.
• Handler and Thread modes.
• Low latency ISR entry and exit:
  – Processor state saving and restoration, with no instruction fetch overhead. The exception vector is fetched from memory in parallel with the state saving, enabling faster ISR entry.
  – Support for late-arriving interrupts.
  – Tightly coupled interface to the interrupt controller, enabling efficient processing of late-arriving interrupts.
  – Tail-chaining of interrupts, enabling back-to-back interrupt processing without the overhead of state saving and restoration between interrupts.
• Interruptible-continued LDM/STM, PUSH/POP.
• ARMv6 style BE8/LE support.
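The state saving listed above can be pictured as a fixed block of eight words that the hardware pushes on ISR entry. The sketch below models that block as a C struct, the way a fault handler might view it. It assumes the basic ARMv7-M frame with no floating-point context; the names exception_frame_t and frame_return_address are illustrative and do not come from any vendor header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Registers pushed by hardware on exception entry, lowest address first.
   Assumption: basic ARMv7-M frame, no floating-point context. */
typedef struct {
    uint32_t r0;
    uint32_t r1;
    uint32_t r2;
    uint32_t r3;
    uint32_t r12;
    uint32_t lr;    /* LR of the interrupted code */
    uint32_t pc;    /* address at which execution resumes */
    uint32_t xpsr;  /* program status at the time of the interrupt */
} exception_frame_t;

/* Eight 32-bit words, so the frame occupies 32 bytes of stack. */
_Static_assert(sizeof(exception_frame_t) == 32, "frame is eight words");

/* Hypothetical helper: where the interrupted code will resume. */
static inline uint32_t frame_return_address(const exception_frame_t *frame)
{
    assert(frame != NULL);
    return frame->pc;
}
```

On a target, a handler would obtain the frame pointer from the active stack pointer; that step is toolchain-specific and is left out here.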
Registers
The processor contains:
• 13 general purpose 32-bit registers
• Link Register (LR)
• Program Counter (PC)
• Program Status Register, xPSR
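As an illustration of what the xPSR holds, the following sketch decodes a few of its fields from a raw 32-bit value. The bit positions follow the ARMv7-M layout (flags in the top bits, Thumb bit at position 24, exception number in the low nine bits); the macro and function names are made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Field positions per the ARMv7-M xPSR layout. Names are illustrative. */
#define XPSR_N_BIT     (1u << 31)  /* negative flag */
#define XPSR_Z_BIT     (1u << 30)  /* zero flag */
#define XPSR_C_BIT     (1u << 29)  /* carry flag */
#define XPSR_V_BIT     (1u << 28)  /* overflow flag */
#define XPSR_T_BIT     (1u << 24)  /* Thumb state */
#define XPSR_ISR_MASK  0x1FFu      /* exception number, bits [8:0] */

/* Exception number currently being handled; 0 means Thread mode. */
static inline uint32_t xpsr_exception_number(uint32_t xpsr)
{
    return xpsr & XPSR_ISR_MASK;
}

/* A non-zero exception number means the core is in Handler mode. */
static inline bool xpsr_in_handler_mode(uint32_t xpsr)
{
    return xpsr_exception_number(xpsr) != 0u;
}

static inline bool xpsr_zero_flag(uint32_t xpsr)
{
    return (xpsr & XPSR_Z_BIT) != 0u;
}

static inline bool xpsr_thumb_state(uint32_t xpsr)
{
    return (xpsr & XPSR_T_BIT) != 0u;
}
```

For example, the value 0x61000010 decodes as Z and C set, Thumb state, and exception number 16.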
Memory interface
The processor has a Harvard interface to enable simultaneous instruction fetches with data load/stores. Memory accesses are controlled by:
• A separate Load Store Unit (LSU) that decouples load and store operations from the Arithmetic and Logic Unit (ALU).
• A 3-word entry Prefetch Unit (PFU). One word is fetched at a time. This can be two Thumb instructions, one word-aligned Thumb-2 instruction, or the upper/lower halfword of a halfword-aligned Thumb-2 instruction with one Thumb instruction, or the lower/upper halfword of another halfword-aligned Thumb-2 instruction. All fetch addresses from the core are word aligned. If a Thumb-2 instruction is
halfword aligned, two fetches are necessary to fetch the Thumb-2 instruction. However, the 3-entry prefetch buffer ensures that a stall cycle is only necessary for the first halfword-aligned Thumb-2 instruction fetched.
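The fetch-count rule in this section reduces to a small calculation: an instruction needs two word fetches only when it straddles a word boundary. A minimal sketch, with the hypothetical helper name fetches_needed:

```c
#include <assert.h>
#include <stdint.h>

/* Number of word-aligned fetches needed to read one instruction.
   size_bytes is 2 for a 16-bit Thumb instruction and 4 for a 32-bit
   Thumb-2 instruction. Illustrative model only. */
static inline unsigned fetches_needed(uint32_t addr, unsigned size_bytes)
{
    assert((addr & 1u) == 0u);                   /* halfword aligned */
    assert(size_bytes == 2u || size_bytes == 4u);

    uint32_t first_word = addr & ~3u;                       /* word holding the first byte */
    uint32_t last_word  = (addr + size_bytes - 1u) & ~3u;   /* word holding the last byte */
    return (first_word == last_word) ? 1u : 2u;
}
```

A 32-bit instruction at 0x100 needs one fetch, while the same instruction at 0x102 needs two. A 16-bit instruction always fits in one fetch.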
NVIC
The NVIC is tightly coupled to the processor core. This facilitates low latency exception processing. The main features include:
• a configurable number of external interrupts, from 1 to 240
• a configurable number of bits of priority, from three to eight bits
• level and pulse interrupt support
• dynamic reprioritization of interrupts
• priority grouping
• support for tail-chaining of interrupts
• processor state automatically saved on interrupt entry, and restored on interrupt exit, with no instruction overhead.
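Priority grouping and the configurable priority width can be illustrated with two small helpers. The sketch assumes the ARMv7-M convention that a PRIGROUP value n places the group (preemption) priority in bits [7:n+1] and the subpriority in bits [n:0] of the 8-bit priority field, and that implemented priority bits are the most significant ones. The type and function names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint8_t group;  /* preemption priority */
    uint8_t sub;    /* subpriority within the group */
} nvic_priority_t;

/* Split an 8-bit priority value according to PRIGROUP (0..7). */
static inline nvic_priority_t nvic_split_priority(uint8_t priority, unsigned prigroup)
{
    assert(prigroup <= 7u);
    unsigned sub_bits = prigroup + 1u;   /* 1..8 subpriority bits */
    nvic_priority_t p;
    p.sub   = (uint8_t)(priority & ((1u << sub_bits) - 1u));
    p.group = (uint8_t)(priority >> sub_bits);
    return p;
}

/* Keep only the implemented (most significant) priority bits.
   implemented_bits is the configured width, from 3 to 8. */
static inline uint8_t nvic_mask_implemented(uint8_t priority, unsigned implemented_bits)
{
    assert(implemented_bits >= 3u && implemented_bits <= 8u);
    return (uint8_t)(priority & (0xFFu << (8u - implemented_bits)));
}
```

With PRIGROUP set to 4, the priority value 0x58 splits into group 2 and subpriority 0x18. With three implemented bits, writing 0xFF yields 0xE0.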
Bus matrix
The bus matrix connects the processor and debug interface to the external buses. The bus matrix interfaces to the following external buses:
• ICode bus. This is for instruction and vector fetches from code space. This is a 32-bit AHB-Lite bus.
• DCode bus. This is for data load/stores and debug accesses to code space. This is a 32-bit AHB-Lite bus.
• System bus. This is for instruction and vector fetches, data load/stores and debug accesses to system space. This is a 32-bit AHB-Lite bus.
• PPB. This is for data load/stores and debug accesses to PPB space. This is a 32-bit APB (v2.0) bus.
The bus matrix also controls the following:
• Unaligned accesses. The bus matrix converts unaligned processor accesses into aligned accesses.
• Bit-banding. The bus matrix converts bit-band alias accesses into bit-band region accesses. It performs:
  – bit field extract for bit-band loads
  – atomic read-modify-write for bit-band stores.
• Write buffering. The bus matrix contains a one-entry write buffer to decouple bus stalls from the processor core.
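The bit-band conversion can be made concrete with the address arithmetic involved. The sketch below computes the alias word address for a given bit of a byte in the SRAM bit-band region. It assumes the standard ARMv7-M memory map (region at 0x20000000 to 0x200FFFFF, alias base at 0x22000000); the macro and function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* SRAM bit-band region and alias base in the standard ARMv7-M memory map. */
#define SRAM_BITBAND_BASE   0x20000000u
#define SRAM_BITBAND_LIMIT  0x200FFFFFu
#define SRAM_ALIAS_BASE     0x22000000u

/* Alias word address that maps to one bit of one byte in the region.
   Each bit gets its own 32-bit word: 32 alias bytes per region byte,
   4 alias bytes per bit. */
static inline uint32_t bitband_alias(uint32_t byte_addr, unsigned bit)
{
    assert(byte_addr >= SRAM_BITBAND_BASE && byte_addr <= SRAM_BITBAND_LIMIT);
    assert(bit < 8u);

    uint32_t byte_offset = byte_addr - SRAM_BITBAND_BASE;
    return SRAM_ALIAS_BASE + byte_offset * 32u + bit * 4u;
}
```

For instance, bit 2 of the byte at 0x20000300 is reached through the alias word at 0x22006008. On a target, a store to that alias word is turned into an atomic read-modify-write of the underlying bit.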
FPB
The FPB unit implements hardware breakpoints and patches accesses from code space to system space. The FPB has eight comparators as follows:
• Six instruction comparators that you can individually configure to either remap instruction fetches from code space to system space, or perform a hardware breakpoint.
• Two literal comparators that can remap literal accesses from code space to system space.
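To show what configuring an instruction comparator as a hardware breakpoint could look like, here is a sketch that builds a comparator register value. It assumes a layout with the enable flag in bit 0, the compare address in bits [28:2], and a two-bit REPLACE field in bits [31:30] that selects remap or a breakpoint on the lower or upper halfword. Treat the layout and every name below as assumptions to be checked against the reference manual for the device in use.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed comparator register layout; verify against the device manual. */
#define FP_COMP_ENABLE          (1u << 0)
#define FP_COMP_ADDR_MASK       0x1FFFFFFCu   /* compare address, bits [28:2] */
#define FP_REPLACE_REMAP        (0u << 30)    /* remap instead of breakpoint */
#define FP_REPLACE_BKPT_LOWER   (1u << 30)    /* breakpoint on lower halfword */
#define FP_REPLACE_BKPT_UPPER   (2u << 30)    /* breakpoint on upper halfword */

/* Build a comparator value that breaks on the instruction at instr_addr.
   The comparator matches a word; the REPLACE field picks the halfword. */
static inline uint32_t fp_comp_breakpoint(uint32_t instr_addr)
{
    assert(instr_addr < 0x20000000u);    /* comparators cover code space only */
    assert((instr_addr & 1u) == 0u);     /* instructions are halfword aligned */

    uint32_t replace = (instr_addr & 2u) ? FP_REPLACE_BKPT_UPPER
                                         : FP_REPLACE_BKPT_LOWER;
    return replace | (instr_addr & FP_COMP_ADDR_MASK) | FP_COMP_ENABLE;
}
```

Under these assumptions, a breakpoint at 0x08000102 gives the value 0x88000101, and one at 0x08000100 gives 0x48000101.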
DWT
The DWT unit incorporates the following debug functionality:
• Four comparators that you can configure either as a hardware watchpoint, an ETM trigger, a PC sampler event trigger, or a data address sampler event trigger.
• Several counters or a data match event trigger for performance profiling.
• Configurable to emit PC samples at defined intervals, and to emit interrupt event information.
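When one of the DWT counters is used for profiling, the usual pattern is to take a snapshot before and after the code of interest and subtract. The sketch below shows only that arithmetic, assuming a free-running 32-bit cycle counter; how the snapshots are read on a target is device-specific and not shown. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles between two snapshots of a free-running 32-bit counter.
   Unsigned subtraction wraps modulo 2^32, so a single counter
   overflow between the snapshots is handled correctly. */
static inline uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert a cycle count to microseconds for a given core clock in Hz.
   64-bit intermediate avoids overflow in the multiplication. */
static inline uint32_t cycles_to_us(uint32_t cycles, uint32_t core_hz)
{
    assert(core_hz != 0u);
    return (uint32_t)(((uint64_t)cycles * 1000000u) / core_hz);
}
```

A measurement that starts at 0xFFFFFFF0 and ends at 0x00000010 still reports 32 cycles, because the subtraction wraps in the same way the counter does.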
ITM
The ITM is an application-driven trace source that supports application event trace and printf style debugging.
The ITM provides the following sources of trace information:
• Software trace. Software can write directly to ITM stimulus registers. This causes packets to be emitted.
• Hardware trace. These packets are generated by the DWT, and emitted by the ITM.
• Time stamping. Timestamps are emitted relative to packets.
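The packets produced by a software trace write can be sketched as a header byte followed by the payload. The encoder below assumes the ARMv7-M instrumentation packet format, in which the header carries the stimulus port number in bits [7:3] and a size code in bits [1:0], and the payload follows least significant byte first. It is a host-side model for understanding or decoding trace streams, not driver code, and the function name is invented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one software trace packet into out[] and return its length.
   Assumed format: header = (port << 3) | size code, then the payload
   in little-endian order. Size codes: 1 byte -> 1, 2 bytes -> 2,
   4 bytes -> 3. */
static inline size_t itm_encode_sw_packet(uint8_t port, uint32_t value,
                                          unsigned size_bytes, uint8_t out[5])
{
    assert(port < 32u);
    assert(size_bytes == 1u || size_bytes == 2u || size_bytes == 4u);

    uint8_t size_code = (size_bytes == 4u) ? 3u : (uint8_t)size_bytes;
    out[0] = (uint8_t)((port << 3) | size_code);
    for (unsigned i = 0; i < size_bytes; i++) {
        out[1u + i] = (uint8_t)(value >> (8u * i));
    }
    return 1u + size_bytes;
}
```

Writing the character 'A' as a single byte to port 0 would, under these assumptions, appear in the trace stream as the two bytes 0x01 0x41.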
MPU
An optional MPU is available for the processor to provide memory protection. The MPU checks access permissions and memory attributes. It contains eight regions, and an optional background region that implements the default memory map attributes.
ETM
The ETM is a low-cost trace macrocell that supports instruction trace only.
TPIU
The TPIU acts as a bridge between the Cortex-M3 trace data from the ITM, an ETM if present, and an off-chip Trace Port Analyzer. You can configure the TPIU to support either serial pin trace for low-cost debug, or multi-pin trace for higher bandwidth trace. The TPIU is CoreSight compatible.
SW/SWJ-DP
You can configure the processor to have SW-DP or SWJ-DP debug port interfaces. The debug port provides debug access to all registers and memory in the system, including the processor registers.
Prefetch Unit
The purpose of the Prefetch Unit (PFU) is to:
• Fetch instructions in advance and forward PC-relative branch instructions. Fetches are speculative in the case of conditional branches. The prefetch buffer holds up to three Thumb-2 instructions.
• Detect Thumb-2 instructions and present these as a single instruction word.
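Detecting a 32-bit Thumb-2 instruction comes down to inspecting the top five bits of the first halfword: the patterns 0b11101, 0b11110 and 0b11111 mark the first half of a 32-bit encoding, and anything else is a complete 16-bit instruction. A minimal sketch of that check, with illustrative function names:

```c
#include <stdbool.h>
#include <stdint.h>

/* True if this halfword is the first half of a 32-bit Thumb-2 instruction. */
static inline bool is_thumb32_first_halfword(uint16_t halfword)
{
    unsigned top5 = (unsigned)halfword >> 11;   /* bits [15:11] */
    return top5 == 0x1Du || top5 == 0x1Eu || top5 == 0x1Fu;
}

/* Instruction length in bytes, given its first halfword. */
static inline unsigned thumb_instruction_length(uint16_t first_halfword)
{
    return is_thumb32_first_halfword(first_halfword) ? 4u : 2u;
}
```

For example, 0x4770 (BX LR) is a 16-bit instruction, while a first halfword of 0xF000 starts a 32-bit one.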
Branch target forwarding (BRCHSTAT)
• Presents the memory transaction for the branch target earlier than when the opcode reaches the execute stage.
• Increases the performance of the core.
• Loses a fetch opportunity if it speculates on a conditional opcode.
• The additional penalty is one cycle of pipeline stall.
• Branch forwarding can be thought of as assigning internal memory for the branch.
• Branch forwarding is costlier than a wait state.
– It gives control to the subroutine when a conditional branch is present.
– A refinement is to only predict backward conditional branches to accelerate loops.
– With ARM compilers favoring loops with an unconditional branch backwards at the bottom and conditional forward branch tests on the loop limit, the core fetch queue being ahead at the start of the loop yields good behavior.
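The refinement of predicting only backward conditional branches can be written as a one-line rule. The sketch below is an illustrative model of that rule, not a description of the core's internal logic: unconditional branches are always forwarded, and conditional branches are forwarded only when the target lies before the branch, which is the typical shape of a loop. The function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decide whether to forward a branch target early.
   Model of the "backward conditional branches only" refinement. */
static inline bool should_forward_branch(uint32_t branch_pc, uint32_t target,
                                         bool conditional)
{
    if (!conditional) {
        return true;              /* always taken, nothing to speculate */
    }
    return target < branch_pc;    /* backward branch: likely a loop */
}
```

Under this rule a conditional branch at 0x1000 to 0x0FF0 is forwarded, while a conditional forward branch to 0x1010 is left to the sequential fetch.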