Memory Ordering
10.1.2 Normal memory
Normal memory is used to describe most parts of the memory system. All ROM and RAM devices are considered to be Normal memory.
The properties of Normal memory are as follows:
• The core can repeat read and some write accesses.
• The core can pre-fetch or speculatively access additional memory locations, with no side-effects (if permitted by MMU access permission settings). The core will not perform speculative writes, however.
• Unaligned accesses can be performed.
• Multiple accesses can be merged by core hardware into a smaller number of accesses of a larger size. Multiple byte writes could be merged into a single double-word write, for example.
Regions of Normal memory must also have cacheability attributes described. See Chapter 8 for details of the supported cache policies. The ARM architecture supports cacheability attributes for Normal memory for two levels of cache, the inner and outer cache. The mapping between these levels of cache and the implemented physical levels of cache is implementation defined. Inner refers to the innermost caches, and always includes the core level 1 cache. An implementation might not have any outer cache, or it can apply the outer cacheability attribute to an L2 or L3 cache. For example, in a system containing a Cortex-A9 processor and the L2C-310 level 2 cache controller, the L2C-310 is considered to be the outer cache. The Cortex-A8 L2 cache can be configured to use either inner or outer cache policy.
Shareability
Normal memory must also be designated either as Shareable or Non-Shareable. A region of Normal memory with the Non-Shareable attribute is one that is used only by this core. There is no requirement for the core to make accesses to this location coherent with other cores. If other cores do share this memory, any coherency issues must be handled in software. For example, this can be done by having individual cores perform cache maintenance and barrier operations.
A region with the Shareable attribute set is one that can be accessed by other agents in the system. Accesses to memory in this region by other processors within the same shareability domain are coherent. This means that you do not have to manage the effects of data caches yourself. Without the Shareable attribute, in situations where cache coherency is not maintained between cores for a region of shared memory, you would have to explicitly manage coherency yourself.
The ARMv7 architecture enables you to specify Shareable memory as Inner Shareable or Outer Shareable (this latter case means that the location is both Inner and Outer Shareable).
The Outer Shareable attribute enables the definition of systems containing multiple levels of coherency control. For example, an Inner Shareable domain could consist of a Cortex-A15 cluster and a Cortex-A7 cluster. Within a cluster, the data caches of the cores are coherent for all data accesses that have the Inner Shareable attribute. The Outer Shareable domain, meanwhile, might consist of this cluster and a graphics processor with multiple cores. An Outer Shareable domain can consist of multiple Inner Shareable domains, but an Inner Shareable domain can only be part of one Outer Shareable domain.
10.2 Memory barriers
A memory barrier is an instruction that requires the core to apply an ordering constraint between memory operations that occur before and after the memory barrier instruction in the program. Such instructions can also be called memory fences in other architectures.
The term memory barrier can also be used to refer to a compiler mechanism that prevents the compiler from scheduling data access instructions across the barrier when performing optimizations. For example, in GCC you can use the inline assembler memory clobber to indicate that the instruction changes memory, so the optimizer cannot re-order memory accesses across the barrier. The syntax is as follows:
asm volatile("" ::: "memory");
ARM RVCT includes a similar intrinsic, called __schedule_barrier().
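As a minimal sketch of how the GCC memory clobber might be used, the following wraps it in a macro and places it between two stores. The `compiler_barrier` macro, the `mailbox` structure and the `mailbox_post` function are hypothetical names introduced only for illustration. The `__asm__ __volatile__` spelling is used so that the code also builds in strict ISO C modes. Note that this constrains the compiler only; it emits no instruction and places no ordering requirement on the core.

```c
#include <assert.h>
#include <stddef.h>

/* Compiler-only barrier: emits no instruction, but GCC and Clang will not
 * move memory accesses across it when optimizing. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Hypothetical mailbox, for example one shared with an interrupt handler
 * running on the same core. */
struct mailbox {
    int payload;
    int ready;
};

/* Store the payload, then set the flag. The barrier keeps the two stores
 * in program order in the generated code. */
void mailbox_post(struct mailbox *mb, int value)
{
    assert(mb != NULL);
    mb->payload = value;
    compiler_barrier();   /* the payload store may not sink below this point */
    mb->ready = 1;
}
```

Where another core or a device observes the mailbox, a hardware barrier of the kind described next would be needed as well.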
Here, however, we are looking at hardware memory barriers, provided through dedicated ARM assembly language instructions. As we have seen, core optimizations such as caches, write buffers and out-of-order execution can result in memory operations occurring in an order different from that specified in the executing code. Normally, this re-ordering is invisible to you. Application developers do not normally have to worry about memory barriers. However, there are cases where you might have to take care of such ordering issues, for example in device drivers or when you have multiple observers of the data that must be synchronized.
The ARM architecture specifies memory barrier instructions that enable you to force the core to wait for memory accesses to complete. These instructions are available in both ARM and Thumb code, in both user and privileged modes. In older versions of the architecture, these were performed using CP15 operations in ARM code only. Use of these is now deprecated, although preserved for compatibility.
Let’s start by looking at the practical effect of these instructions in a single-core system. This description is a simplified version of that given in the ARM Architecture Reference Manual; this section is intended only to introduce the use of these instructions. The term explicit access is used to describe a data access resulting from a load or store instruction in the program. It does not include instruction fetches.
Data Synchronization Barrier (DSB)
This instruction forces the core to wait for all pending explicit data accesses to complete before any additional instructions can be executed. There is no effect on pre-fetching of instructions.
Data Memory Barrier (DMB)
This instruction ensures that all memory accesses in program order before the barrier are observed in the system before any explicit memory accesses that appear in program order after the barrier. It does not affect the ordering of any other instructions executing on the core, or of instruction fetches.
Instruction Synchronization Barrier (ISB)
This flushes the pipeline and prefetch buffer(s) in the core, so that all instructions following the ISB are fetched from cache or memory only after the ISB has completed. This ensures that the effects of context-altering operations executed before the ISB instruction (for example, CP15 or ASID changes, or TLB or branch predictor operations) are visible to any instructions fetched after the ISB. This does not, in itself, cause synchronization between data and instruction caches, but is required as a part of such an operation.
Several options can be specified with the DMB or DSB instructions, to select the type of access and the shareability domain that the barrier applies to, as follows:
SY This is the default and means that the barrier applies to the full system, including all cores and peripherals.
ST A barrier that waits only for stores to complete.
ISH A barrier that applies only to the Inner Shareable domain.
ISHST A barrier that combines ST and ISH. That is, it waits only for stores to complete, and only to the Inner Shareable domain.
NSH A barrier only to the Point of Unification (PoU). (See Point of coherency and unification on page 8-19.)
NSHST A barrier that waits only for stores to complete and only out to the point of unification.
OSH Barrier operation only to the Outer Shareable domain.
OSHST Barrier operation that waits only for stores to complete, and only to the Outer Shareable domain.
To make sense of this, you must use a more general definition of the DMB and DSB operations in a multi-core system. The use of the word processor (or agent) in the following text does not necessarily mean a core and also could refer to a DSP, DMA controller, hardware accelerator or any other block that accesses shared memory.
The DMB instruction has the effect of enforcing memory access ordering within a shareability domain. All processors within the shareability domain are guaranteed to observe all explicit memory accesses before the DMB instruction, before they observe any of the explicit memory accesses after it.
The DSB instruction has the same effect as the DMB, but in addition to this, it also synchronizes the memory accesses with the full instruction stream, not only other memory accesses. This means that when a DSB is issued, execution will stall until all outstanding explicit memory accesses have completed. When all outstanding reads have completed and the write buffer is drained, execution resumes as normal.
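The barrier options can be passed straight through to the assembler as the instruction operand. The following is a sketch only: the `DMB` and `DSB` macros, the `descriptor` structure and the three functions are hypothetical names, not part of any ARM-supplied header. On non-ARM hosts the macros fall back to the GCC/Clang builtin `__sync_synchronize()`, purely so that the example still builds; that fallback is an assumption of this sketch and says nothing about ARM behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* On ARMv7 and later, emit the barrier with the requested option.
 * Elsewhere, use a full fence as a stand-in (assumption of this sketch). */
#if defined(__aarch64__) || (defined(__ARM_ARCH) && __ARM_ARCH >= 7)
#define DMB(opt) __asm__ __volatile__("dmb " #opt ::: "memory")
#define DSB(opt) __asm__ __volatile__("dsb " #opt ::: "memory")
#else
#define DMB(opt) __sync_synchronize()
#define DSB(opt) __sync_synchronize()
#endif

/* Hypothetical descriptor handed to another observer. */
struct descriptor {
    uint32_t addr;
    uint32_t len;
    volatile uint32_t valid;
};

/* Observers are other cores in the same Inner Shareable domain, and only
 * stores need ordering, so ISHST is sufficient. */
void descriptor_publish_to_cores(struct descriptor *d, uint32_t addr, uint32_t len)
{
    assert(d != NULL);
    d->addr = addr;
    d->len = len;
    DMB(ishst);           /* field stores are observed before the valid flag */
    d->valid = 1;
}

/* Observers may lie outside the Inner Shareable domain (a DMA controller,
 * for example), so the full-system, stores-only ST option is used. */
void descriptor_publish_to_system(struct descriptor *d, uint32_t addr, uint32_t len)
{
    assert(d != NULL);
    d->addr = addr;
    d->len = len;
    DMB(st);
    d->valid = 1;
}

/* Stall until all outstanding explicit accesses have completed. */
void drain_outstanding_accesses(void)
{
    DSB(sy);
}
```

Choosing the narrowest option that still covers every observer keeps the barrier from waiting on more of the system than it has to.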
It might be easier to appreciate the effect of the barriers by considering an example. Consider the case of a quad-core Cortex-A9 cluster. The cluster forms a single Inner Shareable domain. When a single core within the cluster executes a DMB instruction, that core will ensure that all data memory accesses in program order before the barrier complete before any explicit memory accesses that appear in program order after the barrier. This way, it can be guaranteed that all cores within the cluster will see the accesses on either side of that barrier in the same order as the core that performs them. If the DMB ISH variant is used, the same is not guaranteed for external observers such as DMA controllers or DSPs.
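The cluster example can be sketched as a one-shot handoff between two cores. One core writes a payload and then a flag; another polls the flag and then reads the payload. The `channel` structure, `channel_send`, `channel_try_receive` and the `dmb_ish` macro are hypothetical names introduced for this illustration, and the non-ARM fallback to `__sync_synchronize()` is an assumption made only so that the code builds on other hosts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__) || (defined(__ARM_ARCH) && __ARM_ARCH >= 7)
#define dmb_ish() __asm__ __volatile__("dmb ish" ::: "memory")
#else
#define dmb_ish() __sync_synchronize()   /* stand-in on non-ARM hosts */
#endif

/* Hypothetical single-shot channel in Inner Shareable Normal memory. */
struct channel {
    uint32_t payload;
    volatile uint32_t ready;
};

/* Producer core: make the payload observable before the flag. */
void channel_send(struct channel *ch, uint32_t value)
{
    assert(ch != NULL);
    ch->payload = value;
    dmb_ish();            /* payload store is observed before the flag store */
    ch->ready = 1;
}

/* Consumer core: returns true and writes *out once the flag has been seen.
 * The flag is never cleared, so the channel carries a single message. */
bool channel_try_receive(struct channel *ch, uint32_t *out)
{
    assert(ch != NULL && out != NULL);
    if (!ch->ready)
        return false;
    dmb_ish();            /* flag load is ordered before the payload load */
    *out = ch->payload;
    return true;
}
```

Both sides need a barrier: the producer's orders its two stores, and the consumer's keeps the payload load from being satisfied ahead of the flag load. With the ISH option this ordering is guaranteed only to observers inside the cluster's Inner Shareable domain.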