Embedded Parallel Computing
Lecture 5 - The anatomy of a modern multiprocessor: the multicore processor
Outline
Modern Multicore
• Symmetrical Multiprocessing (SMP)
• Multicore
MIMD: Symmetrical Multiprocessing
• A multi-core architecture with Symmetrical Multiprocessing (SMP) is defined by the following characteristics:
‣ The architecture consists of two or more identical CPU cores.
‣ All cores share a common system memory and are controlled by a single operating system.
‣ Each CPU can operate independently on a different workload and, whenever possible, can also share a workload with the other CPUs.
Multicore
• [Wikipedia]: A multi-core processor is a single computing component with two or more independent actual processors (called "cores"), which are the units that read and execute program instructions
• A many-core processor is a multi-core processor in which the number of cores is large enough that traditional multi-processor techniques are no longer efficient, largely because of issues with congestion in supplying instructions and data to the many processors
Several tens of cores!
An Intel Core 2 Duo
Multicore
• Symmetric multiprocessing (SMP) designs using discrete CPUs have existed for a long time
• Thus the issues of implementing a multi-core processor architecture and supporting it with software are well known
• Utilizing a proven processing-core design
Single Core Design is Hitting the f.... Wall
• Greatly diminished gains in processor performance from increasing the operating frequency. This is due to three primary factors:
‣ The memory wall
‣ The ILP wall
‣ The power wall
Multicore SMP
• The proximity of multiple CPU cores on the same die allows the cache coherency circuitry to operate at a much higher clock rate than is possible if the signals have to travel off-chip
• Combining equivalent CPUs on a single die significantly improves the performance of cache snoop (alternative: bus snooping) operations
Thread-level Parallelism
• For thread-level parallelism, ARM needed to improve exception handling to prepare for the increased complexity in handling multithreading on multiple processors
• These requirements added inherent complexity in the interrupt handler, scheduler, and context switch
MPCore Semaphores
• Earlier ARM architectures implemented semaphores with the swap instruction, which held the external bus until completion. On a multiprocessor, one processor could thus hold the entire bus and lock out all other processors. Unacceptable!
• ARMv6 introduced two new instructions, load-exclusive LDREX and store-exclusive STREX, which take advantage of an exclusive monitor in memory:
‣ LDREX loads a value from memory and sets the exclusive monitor to watch that location, and
‣ STREX checks the exclusive monitor and, if no other write has taken place to that location, performs the store to memory and returns a value to indicate whether the data was written.
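The LDREX/STREX retry pattern can be sketched in software. The following is a toy Python model, not ARM code: the class `ExclusiveMonitor`, its methods, and the `interfere` hook are hypothetical names made up for illustration, and the monitor is simplified to a set of CPU tags on a single word. It assumes the STREX convention of returning 0 on success and 1 on failure.

```python
class ExclusiveMonitor:
    """Toy model of one memory word guarded by an exclusive monitor."""

    def __init__(self, value=0):
        self.value = value
        self.tagged_by = set()   # CPUs with an outstanding exclusive load

    def ldrex(self, cpu):
        # Load the value and tag the location for this CPU.
        self.tagged_by.add(cpu)
        return self.value

    def strex(self, cpu, new_value):
        # Store only if this CPU's tag is still intact.
        # Returns 0 on success and 1 on failure (assumed status convention).
        if cpu not in self.tagged_by:
            return 1
        self.value = new_value
        self.tagged_by.clear()   # a successful store clears every CPU's tag
        return 0


def atomic_increment(mem, cpu, interfere=None):
    """Retry loop: load-exclusive, compute, store-exclusive, repeat on failure."""
    attempts = 0
    while True:
        attempts += 1
        old = mem.ldrex(cpu)
        if interfere is not None and attempts == 1:
            interfere()          # another CPU writes between the load and the store
        if mem.strex(cpu, old + 1) == 0:
            return attempts


mem = ExclusiveMonitor(10)
# CPU 1 completes a whole increment while CPU 0 is in the middle of its own.
tries = atomic_increment(mem, cpu=0,
                         interfere=lambda: atomic_increment(mem, cpu=1))
print(mem.value, tries)
```

In this run CPU 0's first store-exclusive fails because CPU 1 wrote in between, so CPU 0 retries and both increments are preserved. Unlike the swap instruction, nothing is locked while a CPU is between its load and its store.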
Physically Tagged Caches
• Should the cache use virtual or physical addresses?
• A virtually tagged cache must be flushed every time a context switch takes place because the cache contains old virtual-to-physical translations
• In the ARM11 MPCore, the memory management unit logic resides between the level 1 cache and the processor core, so the cache is physically tagged and does not need this flush
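Why the flush matters can be shown with a small Python sketch. The page tables, addresses, and the `read` and `run` helpers are all hypothetical; the cache is reduced to a dictionary from tag to data, and two processes are assumed to map the same virtual address to different physical addresses.

```python
PAGE_TABLES = {               # hypothetical per-process translations
    "A": {0x1000: 0x8000},
    "B": {0x1000: 0x9000},
}
MEMORY = {0x8000: "data of A", 0x9000: "data of B"}


def read(cache, process, vaddr, physically_tagged):
    """Read through a toy cache that is tagged by physical or virtual address."""
    paddr = PAGE_TABLES[process][vaddr]          # MMU translation
    tag = paddr if physically_tagged else vaddr
    if tag not in cache:                         # miss: fill from memory
        cache[tag] = MEMORY[paddr]
    return cache[tag]


def run(physically_tagged):
    cache = {}
    read(cache, "A", 0x1000, physically_tagged)
    # Context switch from A to B, deliberately WITHOUT flushing the cache.
    return read(cache, "B", 0x1000, physically_tagged)


virtual_result = run(physically_tagged=False)    # B hits A's stale line
physical_result = run(physically_tagged=True)    # B gets its own data
print(virtual_result, "|", physical_result)
```

With virtual tags, process B hits the line that process A left behind; with physical tags the two lines are distinct, so no flush is needed.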
Atomic Instructions
• Traditionally, swap-based and compare-and-exchange-based semaphores have been used to control access to critical data
cmpxchg8b
• Many lock-free routines use the Intel cmpxchg8b instruction because it can compare and exchange 8 bytes of data atomically.
• Typically, this involves 4 bytes for payload and 4 bytes to distinguish between payload versions that could otherwise have the same value (the so-called A-B-A problem).
• The ARM exclusives provide atomicity using the data address rather than the data value, so that the routines can atomically exchange data without experiencing the A-B-A problem
• Exploiting this would, however, require rewriting much of the existing two-word exclusive code.
• Consequently, ARM added instructions for performing load-and-store exclusives using various payload sizes (including 8 bytes), thus ensuring the direct portability of existing multithreaded code.
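The A-B-A problem and the payload-plus-version workaround can be illustrated with a short Python sketch. It is single-threaded on purpose: the "threads" are interleaved by hand, and `cas` and `versioned_update` are hypothetical helpers that only model a value-based compare-and-exchange on a one-element list.

```python
def cas(cell, expected, new):
    """Value-based compare-and-exchange on a one-element list."""
    if cell[0] == expected:
        cell[0] = new
        return True
    return False


# Plain payload: thread 1 reads "A", thread 2 changes A -> B -> A in between.
cell = ["A"]
seen = cell[0]                       # thread 1 reads
cas(cell, "A", "B")                  # thread 2
cas(cell, "B", "A")                  # thread 2
plain_ok = cas(cell, seen, "C")      # thread 1 succeeds and never notices the changes


def versioned_update(cell, expected, new_payload):
    """Compare (payload, version) together and bump the version on every write."""
    return cas(cell, expected, (new_payload, expected[1] + 1))


# Payload plus version counter, as in the 4 + 4 byte scheme.
vcell = [("A", 0)]
vseen = vcell[0]                                     # thread 1 reads ("A", 0)
versioned_update(vcell, vcell[0], "B")               # thread 2 -> ("B", 1)
versioned_update(vcell, vcell[0], "A")               # thread 2 -> ("A", 2)
versioned_ok = versioned_update(vcell, vseen, "C")   # thread 1 fails: version changed

print(plain_ok, versioned_ok, vcell[0])
```

The plain exchange succeeds although the cell was modified twice, while the versioned exchange detects the intervening writes because the version no longer matches.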
Misc MP improvements
• Improved access to localized data
• Power-conscious spin-locks
Two Main Enhancements
• The ARM11 multiprocessor includes two main SMP enhancements:
‣ Generic Interrupt Controller (GIC), providing interprocessor communication
‣ Snoop Control Unit (SCU), an intelligent memory-communication system providing cache coherence
Cache Coherency
• The ARM11 MPCore implements a Snoop Control Unit (SCU) between the processors, operating at CPU frequency.
• This configuration also provides a very rapid path for data to move directly between each CPU's cache.
MESI
• Modified
‣ The cache line is present only in the current cache, and is dirty; it has been modified from the value in main memory. The cache is required to write the data back to main memory at some time in the future, before permitting any other read of the (no longer valid) main memory state. The write-back changes the line to the Exclusive state.
• Exclusive
‣ The cache line is present only in the current cache, but is clean; it matches main memory. It may be changed to the Shared state at any time, in response to a read request. Alternatively, it may be changed to the Modified state when writing to it.
• Shared
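‣ Indicates that this cache line may be stored in other caches of the machine and is "clean"; it matches the main memory. The line may be discarded (changed to the Invalid state) at any time.
• Invalid
‣ Indicates that this cache line is invalid (unused).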
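The MESI transitions can be traced with a minimal Python sketch for a single cache line shared by several caches. The function `access` and the list-of-states representation are illustrative assumptions; write-backs and bus messages are not modelled, only the resulting states.

```python
M, E, S, I = "Modified", "Exclusive", "Shared", "Invalid"


def access(states, cpu, write):
    """Update the per-cache MESI states of one line after `cpu` reads or writes it."""
    others = [c for c in range(len(states)) if c != cpu]
    if write:
        # A write needs sole ownership: every other copy is invalidated.
        for c in others:
            states[c] = I
        states[cpu] = M
    elif states[cpu] == I:
        # Read miss: if another cache holds the line, all copies become Shared
        # (a Modified copy would be written back first); otherwise the line
        # is loaded in the Exclusive state.
        holders = [c for c in others if states[c] != I]
        for c in holders:
            states[c] = S
        states[cpu] = S if holders else E
    # A read hit in Modified, Exclusive, or Shared leaves the state unchanged.
    return states


states = [I, I]                      # two caches, line not present anywhere
trace = []
for cpu, write in [(0, False), (0, True), (1, False), (1, True)]:
    access(states, cpu, write)
    trace.append(tuple(states))

for step in trace:
    print(step)
```

The trace walks through Exclusive (first read), Modified (local write), Shared in both caches (read by the other CPU), and finally Invalid/Modified after the other CPU writes.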
MOESI
• The processor maintains cache coherence with an optimized version of the MESI (modified, exclusive, shared, invalid) protocol.
• In addition to the four common MESI protocol states, there is a fifth "Owned" state representing data that is both modified and shared. This avoids the need to write modified data back to main memory before sharing it.
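The saving from the Owned state can be made concrete by comparing what happens when another cache reads a line. The sketch below is a simplified Python lookup, with `remote_read` a hypothetical helper; it returns the holder's new state, the requester's state, and the number of write-backs to main memory.

```python
def remote_read(holder_state, protocol):
    """Return (holder state after, requester state, memory write-backs)
    when another cache reads a line held in `holder_state`."""
    if protocol not in ("MESI", "MOESI"):
        raise ValueError("unknown protocol")
    if holder_state in ("Modified", "Owned"):
        if protocol == "MOESI":
            # The dirty line is supplied cache-to-cache; memory stays stale.
            return "Owned", "Shared", 0
        if holder_state == "Owned":
            raise ValueError("MESI has no Owned state")
        # MESI must clean the line before it can be shared.
        return "Shared", "Shared", 1
    if holder_state in ("Exclusive", "Shared"):
        return "Shared", "Shared", 0
    raise ValueError("holder does not have a valid copy")


mesi = remote_read("Modified", "MESI")
moesi = remote_read("Modified", "MOESI")
print(mesi, moesi)
```

Under these assumptions, sharing a modified line costs one memory write-back with MESI and none with MOESI; the owning cache remains responsible for writing the data back later.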
Interrupt System
• Generic Interrupt Controller (GIC)
• External interrupts
• Internal Interrupts
‣ Example: One processor allocates virtual memory -> all others need to update their memory translations -> ARM uses the GIC to quickly signal that between processors
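The internal-interrupt example can be sketched as follows. This is a toy Python model with hypothetical names (`Cpu`, `on_interrupt`, `remap`): each CPU caches translations in a dictionary, and a direct method call stands in for the interrupt that the GIC would deliver.

```python
class Cpu:
    def __init__(self, name):
        self.name = name
        self.tlb = {}                 # virtual page -> physical frame (cached translation)

    def on_interrupt(self, vpage):
        # Handler for the inter-processor interrupt: drop the stale translation.
        self.tlb.pop(vpage, None)


def remap(page_table, cpus, initiator, vpage, new_frame):
    """Change a mapping on `initiator` and signal every other CPU."""
    page_table[vpage] = new_frame
    initiator.tlb[vpage] = new_frame
    for cpu in cpus:
        if cpu is not initiator:
            cpu.on_interrupt(vpage)   # stands in for an interrupt sent through the GIC


page_table = {4: 100}
cpus = [Cpu("cpu0"), Cpu("cpu1"), Cpu("cpu2")]
for cpu in cpus:
    cpu.tlb[4] = page_table[4]        # every CPU has cached the old translation

remap(page_table, cpus, cpus[0], vpage=4, new_frame=200)
stale = [cpu.name for cpu in cpus if cpu.tlb.get(4) == 100]
print(stale)
```

After the remap no CPU still holds the old translation; the other CPUs will reload it from the page table on their next access.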
Distributed Interrupt Controller
• masking of interrupts
• prioritization of the interrupts
• distribution of the interrupts to the target MP11 CPUs
• tracking the status of interrupts
• generation of interrupts by software
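The five functions listed above can be sketched as a small Python class. This is a software model under stated assumptions, not the MPCore register interface: the class and method names are hypothetical, a lower priority number is assumed to mean a more urgent interrupt, unconfigured interrupts get a default priority of 255, and an interrupt with several targets is simply made pending on each of them.

```python
class DistributedInterruptController:
    def __init__(self, num_cpus):
        self.enabled = set()      # unmasked interrupt IDs
        self.priority = {}        # interrupt ID -> priority (lower = more urgent, assumed)
        self.targets = {}         # interrupt ID -> list of target CPU indices
        self.pending = {cpu: set() for cpu in range(num_cpus)}  # status tracking

    def configure(self, irq, priority, targets, enabled=True):
        self.priority[irq] = priority
        self.targets[irq] = list(targets)
        if enabled:
            self.enabled.add(irq)

    def raise_irq(self, irq):
        if irq not in self.enabled:
            return                              # masked: the interrupt is ignored
        for cpu in self.targets[irq]:
            self.pending[cpu].add(irq)          # distribution to the target CPUs

    def software_interrupt(self, irq, targets):
        for cpu in targets:                     # generated by software on some CPU
            self.pending[cpu].add(irq)

    def acknowledge(self, cpu):
        """Hand the most urgent pending interrupt to `cpu`, or None if idle."""
        if not self.pending[cpu]:
            return None
        irq = min(self.pending[cpu],
                  key=lambda i: (self.priority.get(i, 255), i))
        self.pending[cpu].remove(irq)
        return irq


gic = DistributedInterruptController(num_cpus=2)
gic.configure(irq=40, priority=2, targets=[0])
gic.configure(irq=41, priority=1, targets=[0, 1])
gic.configure(irq=42, priority=0, targets=[1], enabled=False)   # masked
for irq in (40, 41, 42):
    gic.raise_irq(irq)
gic.software_interrupt(irq=1, targets=[1])

order_cpu0 = [gic.acknowledge(0), gic.acknowledge(0), gic.acknowledge(0)]
order_cpu1 = [gic.acknowledge(1), gic.acknowledge(1)]
print(order_cpu0, order_cpu1)
```

CPU 0 receives interrupt 41 before 40 because of its priority, the masked interrupt 42 never becomes pending, and CPU 1 sees the software-generated interrupt after its hardware one.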
[Figure: MP11 CPUs, Distributed Interrupt Controller, Interface]
Applications Using MPCore
• Frostbite is an example of a game engine that employs job-based parallelism. This engine is used by the popular Battlefield: Bad Company series of games. It is capable of using as many threads as the underlying hardware platform provides. The engine performs the primary Game and Render tasks on the GPU and divides up the other system-related work into jobs.
• Each job typically consists of 15K to 200K lines of C++ code with the average job size being around 25K lines of code. Most of these jobs are independent while some have interdependencies.
• Each frame of the game would typically contain two hundred to three hundred jobs and the engine
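Job-based parallelism with a few interdependencies can be sketched in Python with a thread pool. This is not how Frostbite is implemented: `run_frame`, the job names, and the wave-by-wave scheduling are assumptions chosen to keep the example short, and Python threads only illustrate the structure rather than real multicore speed-up.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def run_frame(jobs, deps, workers=None):
    """Run every job once; a job starts only after the jobs it depends on finish."""
    workers = workers or os.cpu_count() or 1     # as many threads as the platform has
    results, remaining = {}, dict(jobs)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while remaining:
            # All jobs whose dependencies are satisfied form the next wave.
            ready = [name for name in remaining
                     if all(d in results for d in deps.get(name, ()))]
            if not ready:
                raise ValueError("unsatisfiable or cyclic dependency")
            futures = {name: pool.submit(remaining.pop(name), results)
                       for name in ready}
            done = {name: f.result() for name, f in futures.items()}
            results.update(done)                 # publish the wave only when it is complete
    return results


# Hypothetical jobs: two independent ones and one that needs both results.
jobs = {
    "physics": lambda r: 2,
    "animation": lambda r: 3,
    "culling": lambda r: r["physics"] + r["animation"],
}
deps = {"culling": ["physics", "animation"]}

frame = run_frame(jobs, deps)
print(frame)
```

The independent jobs run in the same wave on separate threads, and the dependent job is submitted only once both of its inputs exist.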
Questions
• Study-support questions
Links
Multi-core <http://en.wikipedia.org/wiki/Multi_core>
SMP - Symmetric Multiprocessor System <http://en.wikipedia.org/wiki/Symmetric_multiprocessor>
ABA problem <http://en.wikipedia.org/wiki/ABA_problem>
MESI <http://en.wikipedia.org/wiki/MESI>
MOESI <http://en.wikipedia.org/wiki/MOESI_protocol>
Embedded moves to multicore <http://embedded-computing.com/embedded-moves-multicore>
Goodacre, J.; Sloss, A.N., "Parallelism and the ARM instruction set architecture," IEEE Computer, vol. 38, no. 7, pp. 42-50, July 2005. doi: 10.1109/MC.2005.239 <http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1463106&isnumber=31455>
Goodacre, J., "Details of a New Cortex Processor Revealed, Cortex-A9", Presentation at the ARM Developers' Conference, October 2007. <http://www.arm.com/files/downloads/cortex-a9_devcon-talk_introduction_final-02.pdf>
Stevens, A., "Introduction to AMBA 4 ACE", ARM White Paper, June 6, 2011.