Embedded Parallel Computing

(1)

Embedded Parallel Computing

Lecture 5 - The anatomy of a modern multiprocessor, the multicore processors

(2)

Outline

Modern Multicore

• Symmetrical Multiprocessing (SMP)

• Multicore

(3)

MIMD: Symmetrical Multiprocessing

• A multi-core architecture with Symmetrical Multiprocessing (SMP) is defined by the following characteristics:

‣ Architecture consists of two or more identical CPU cores.

‣ All cores share a common system memory and are controlled by a single operating system.

‣ Each CPU is capable of operating independently on different workloads and, whenever possible, is also capable of sharing workloads with the other CPUs.

(4)

Multicore

• [Wikipedia]: A multi-core processor is a single computing component with two or more independent actual processors (called "cores"), which are the units that read and execute program instructions.

• A many-core processor is a multi-core processor in which the number of cores is large enough that traditional multi-processor techniques are no longer efficient, largely because of issues with congestion in supplying instructions and data to the many processors.

Several tens of cores!

An Intel Core 2 Duo

(5)

Multicore

• Symmetric multiprocessing (SMP) designs using discrete CPUs have existed for a long time.

• Thus the issues of implementing a multi-core processor architecture and supporting it with software are well known.

• Utilizing a proven processing-core design

(6)

Single Core Design is Hitting the f.... Wall

• Gains in processor performance from increasing the operating frequency have greatly diminished. This is due to three primary factors:

‣ The memory wall

‣ The ILP wall

‣ The power wall

(7)

Multicore SMP

• The proximity of multiple CPU cores on the same die allows the cache coherency circuitry to operate at a much higher clock rate than is possible if the signals have to travel off-chip.

• Combining equivalent CPUs on a single die significantly improves the performance of cache snoop (also called bus snooping) operations.

(8)

(9)

Thread-level Parallelism

• For thread-level parallelism, ARM needed to improve exception handling to prepare for the increased complexity of handling multithreading on multiple processors.

• These requirements added inherent complexity in the interrupt handler, scheduler, and context switch.

(10)

MPCore Semaphores

• Earlier ARM architectures implemented semaphores with the swap (SWP) instruction, which held the external bus until completion. On a multiprocessor, one processor could thus hold the entire bus until completion, locking out all other processors. Unacceptable!

• ARMv6 introduced two new instructions, load-exclusive (LDREX) and store-exclusive (STREX), which take advantage of an exclusive monitor in memory:

‣ LDREX loads a value from memory and sets the exclusive monitor to watch that location, and

‣ STREX checks the exclusive monitor and, if no other write has taken place to that location, performs the store to memory and returns a value to indicate whether the data was written.
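The load-exclusive/store-exclusive pattern amounts to "read the location, then store only if nobody else wrote to it in the meantime, otherwise retry". A minimal portable sketch in C11 follows. It uses `atomic_compare_exchange_strong_explicit` instead of ARM assembly, on the assumption (not stated in the slides) that a toolchain targeting ARMv6 or later lowers this call to an LDREX/STREX retry loop. The type `ex_lock_t` and all function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type for illustration: 0 = free, 1 = held. */
typedef struct {
    atomic_int state;
} ex_lock_t;

static void ex_lock_init(ex_lock_t *l)
{
    atomic_init(&l->state, 0);
}

/* One acquisition attempt: succeeds only if the lock was free.
 * Assumption: on ARMv6+ this becomes a load-exclusive of `state`,
 * a comparison with 0, and a store-exclusive of 1. */
static bool ex_try_lock(ex_lock_t *l)
{
    int expected = 0;
    return atomic_compare_exchange_strong_explicit(
        &l->state, &expected, 1,
        memory_order_acquire, memory_order_relaxed);
}

/* Spin until the lock is acquired. No bus is held while waiting;
 * other processors can still access memory between attempts. */
static void ex_lock(ex_lock_t *l)
{
    while (!ex_try_lock(l)) {
        /* retry */
    }
}

static void ex_unlock(ex_lock_t *l)
{
    /* Releasing a lock that is not held is a caller bug. */
    assert(atomic_load_explicit(&l->state, memory_order_relaxed) == 1);
    atomic_store_explicit(&l->state, 0, memory_order_release);
}
```

A caller would wrap its critical section in `ex_lock(&l)` and `ex_unlock(&l)`. Unlike the SWP-based approach, a failed attempt costs only the retrying processor time.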

(11)

Physically Tagged Caches

• Should the cache use virtual or physical addresses?

• A virtually tagged cache must be flushed every time a context switch takes place because the cache contains old virtual-to-physical translations.

• In ARM11 with MPCore, the memory management unit logic resides between the level 1 cache and the processor core.

(12)

Atomic Instructions

• Traditionally, swap-based and compare-and-exchange-based semaphores have been used to control access to critical data.

(13)

cmpxchg8b

• Many lock-free routines use the Intel cmpxchg8b instruction because it can compare and exchange 8 bytes of data atomically.

• Typically, this involves 4 bytes for payload and 4 bytes to distinguish between payload versions that could otherwise have the same value (the so-called A-B-A problem).
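The 4-byte payload plus 4-byte version scheme can be sketched in portable C11 as follows. This is an illustration under assumptions, not code from the lecture: it uses a 64-bit `atomic_compare_exchange_strong` where x86 code would use cmpxchg8b, and the type `versioned_cell_t` and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 8-byte cell: low 32 bits = payload, high 32 bits = version. */
typedef struct {
    _Atomic uint64_t word;
} versioned_cell_t;

static uint64_t cell_pack(uint32_t payload, uint32_t version)
{
    return ((uint64_t)version << 32) | payload;
}

static uint32_t cell_payload(uint64_t w) { return (uint32_t)w; }
static uint32_t cell_version(uint64_t w) { return (uint32_t)(w >> 32); }

static void cell_init(versioned_cell_t *c, uint32_t payload)
{
    atomic_init(&c->word, cell_pack(payload, 0));
}

static uint64_t cell_snapshot(versioned_cell_t *c)
{
    return atomic_load(&c->word);
}

/* Replace the payload only if the cell still equals `snapshot`, that is,
 * both payload AND version are unchanged. Every successful update bumps
 * the version, so a payload that went A -> B -> A no longer matches. */
static bool cell_update(versioned_cell_t *c, uint64_t snapshot,
                        uint32_t new_payload)
{
    uint64_t desired = cell_pack(new_payload, cell_version(snapshot) + 1u);
    return atomic_compare_exchange_strong(&c->word, &snapshot, desired);
}
```

If a thread takes a snapshot while the payload is A, and other threads then change it to B and back to A, the stale snapshot still carries version 0 while the cell is at version 2, so the stale thread's `cell_update` fails instead of silently succeeding.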

(14)

• The ARM exclusives provide atomicity using the data address rather than the data value, so that the routines can atomically exchange data without experiencing the A-B-A problem.

• Exploiting this would, however, require rewriting much of the existing two-word exclusive code.

• Consequently, ARM added instructions for performing load-and-store exclusives using various payload sizes (including 8 bytes), thus ensuring the direct portability of existing multithreaded code.

(15)

Misc MP improvements

• Improved access to localized data

• Power-conscious spin-locks

(16)

Two Main Enhancements

• The ARM11 multiprocessor includes two main SMP enhancements:

‣ Generic Interrupt Controller (GIC), providing interprocessor communication

‣ Snoop Control Unit (SCU), an intelligent memory-communication system providing cache coherence

(17)

Cache Coherency

• The ARM11 MPCore implements a Snoop Control Unit (SCU) between the processors, operating at CPU frequency.

• This configuration also provides a very rapid path for data to move directly between each CPU's cache.

(18)

(19)

MESI

• Modified

‣ The cache line is present only in the current cache, and is dirty; it has been modified from the value in main memory. The cache is required to write the data back to main memory at some time in the future, before permitting any other read of the (no longer valid) main memory state. The write-back changes the line to the Exclusive state.

• Exclusive

‣ The cache line is present only in the current cache, but is clean; it matches main memory. It may be changed to the Shared state at any time, in response to a read request. Alternatively, it may be changed to the Modified state when writing to it.

• Shared

‣ Indicates that this cache line may be stored in other caches of the machine and is clean; it matches main memory. The line may be discarded (changed to the Invalid state) at any time.

• Invalid

‣ Indicates that this cache line is invalid (unused).
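The MESI states can be read as a small state machine per cache line. The sketch below is a simplified model for illustration, not a description of any particular hardware: the event names, the `others_have_copy` flag, and the function `mesi_next` are hypothetical, and the write-back of a Modified line is folded into the remote-read transition.

```c
#include <assert.h>

typedef enum {
    MESI_INVALID,
    MESI_SHARED,
    MESI_EXCLUSIVE,
    MESI_MODIFIED
} mesi_state_t;

/* Hypothetical events seen by ONE cache for ONE line. */
typedef enum {
    MESI_LOCAL_READ,    /* this core reads the line        */
    MESI_LOCAL_WRITE,   /* this core writes the line       */
    MESI_REMOTE_READ,   /* another core reads the line     */
    MESI_REMOTE_WRITE   /* another core writes the line    */
} mesi_event_t;

/* Next state of the line. `others_have_copy` only matters for a local
 * read miss (state Invalid): it decides between Shared and Exclusive. */
static mesi_state_t mesi_next(mesi_state_t s, mesi_event_t e,
                              int others_have_copy)
{
    switch (e) {
    case MESI_LOCAL_READ:
        if (s == MESI_INVALID)
            return others_have_copy ? MESI_SHARED : MESI_EXCLUSIVE;
        return s;                   /* read hit: state unchanged */
    case MESI_LOCAL_WRITE:
        return MESI_MODIFIED;       /* other copies are invalidated */
    case MESI_REMOTE_READ:
        if (s == MESI_INVALID)
            return MESI_INVALID;
        return MESI_SHARED;         /* a Modified line is written back first */
    case MESI_REMOTE_WRITE:
        return MESI_INVALID;        /* our copy is now stale */
    }
    return s;
}
```

For example, a line that one core holds as Exclusive becomes Modified when that core writes it, and drops to Shared as soon as another core reads it.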

(20)

MOESI

• The processor maintains cache coherence with an optimized version of the MESI (modified, exclusive, shared, invalid) protocol.

• In addition to the four common MESI protocol states, there is a fifth "Owned" state representing data that is both modified and shared. This avoids the need to write modified data back to main memory before sharing it.
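The effect of the Owned state shows up when another cache reads a dirty line. A minimal sketch, assuming a simplified snoop model with hypothetical names (`moesi_on_remote_read`, `moesi_evict_needs_writeback`): the holder keeps responsibility for the dirty data and supplies it directly, and main memory is only updated when the owner eventually evicts the line.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    MOESI_INVALID,
    MOESI_SHARED,
    MOESI_EXCLUSIVE,
    MOESI_OWNED,
    MOESI_MODIFIED
} moesi_state_t;

/* Hypothetical result of snooping a read issued by another cache. */
typedef struct {
    moesi_state_t next;   /* state of our copy afterwards            */
    bool supplies_data;   /* we answer the read from our own cache   */
} moesi_snoop_result_t;

static moesi_snoop_result_t moesi_on_remote_read(moesi_state_t s)
{
    moesi_snoop_result_t r = { s, false };
    switch (s) {
    case MOESI_MODIFIED:
    case MOESI_OWNED:
        /* Dirty data is shared cache-to-cache; no write-back yet. */
        r.next = MOESI_OWNED;
        r.supplies_data = true;
        break;
    case MOESI_EXCLUSIVE:
        r.next = MOESI_SHARED;   /* clean: memory can serve the read */
        break;
    case MOESI_SHARED:
    case MOESI_INVALID:
        break;                   /* nothing changes for our copy */
    }
    return r;
}

/* Only a cache holding dirty data must write it back on eviction. */
static bool moesi_evict_needs_writeback(moesi_state_t s)
{
    return s == MOESI_MODIFIED || s == MOESI_OWNED;
}
```

Compared with plain MESI, where a Modified line has to reach main memory before it can be shared, the Modified to Owned transition here involves no memory traffic at all.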

(21)

(22)

(23)

(24)

Interrupt System

• Generic Interrupt Controller (GIC)

• External interrupts

• Internal Interrupts

‣ Example: One processor allocates virtual memory -> all others need to update their memory translations -> ARM uses the GIC to quickly signal that between processors

(25)

Distributed Interrupt Controller

• masking of interrupts

• prioritization of the interrupts

• distribution of the interrupts to the target MP11 CPUs

• tracking the status of interrupts

• generation of interrupts by software

[Figure: MP11 CPUs, Distributed Interrupt Controller, interface]
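The functions of the interrupt controller listed on this slide (masking, prioritization, distribution to target CPUs, status tracking, software generation) can be illustrated with a toy software model. Everything below is an assumption made for illustration: the structure `toy_distributor_t`, the limits of 32 interrupts and 4 CPUs, and the convention that a lower priority number is more urgent. It does not reproduce the real MPCore register layout.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NUM_IRQS 32
#define TOY_NUM_CPUS 4

/* Toy model of a distributor; not the real register layout. */
typedef struct {
    uint32_t enabled;                  /* bit i set: IRQ i is unmasked     */
    uint32_t pending;                  /* bit i set: IRQ i is pending      */
    uint8_t  priority[TOY_NUM_IRQS];   /* assumed: lower = more urgent     */
    uint8_t  targets[TOY_NUM_IRQS];    /* bit c set: IRQ may go to CPU c   */
} toy_distributor_t;

/* Mark an interrupt pending. Called from software, this models a
 * software-generated interrupt, e.g. to signal another CPU. */
static void toy_raise(toy_distributor_t *d, int irq)
{
    assert(irq >= 0 && irq < TOY_NUM_IRQS);
    d->pending |= UINT32_C(1) << irq;
}

/* Most urgent pending, unmasked interrupt targeted at `cpu`, or -1. */
static int toy_next_irq(const toy_distributor_t *d, int cpu)
{
    assert(cpu >= 0 && cpu < TOY_NUM_CPUS);
    int best = -1;
    for (int irq = 0; irq < TOY_NUM_IRQS; irq++) {
        uint32_t bit = UINT32_C(1) << irq;
        if (!(d->pending & bit) || !(d->enabled & bit))
            continue;                              /* masking        */
        if (!(d->targets[irq] & (1u << cpu)))
            continue;                              /* distribution   */
        if (best < 0 || d->priority[irq] < d->priority[best])
            best = irq;                            /* prioritization */
    }
    return best;
}

/* A CPU takes its most urgent interrupt; the pending status is cleared. */
static int toy_acknowledge(toy_distributor_t *d, int cpu)
{
    int irq = toy_next_irq(d, cpu);
    if (irq >= 0)
        d->pending &= ~(UINT32_C(1) << irq);       /* status tracking */
    return irq;
}
```

In this model an interrupt that targets several CPUs is handled by whichever CPU acknowledges it first, after which it is no longer pending for the others.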

(26)

(27)

(28)

Applications Using MPCore

• Frostbite is an example of a game engine that employs job-based parallelism. This engine is used by the popular Battlefield: Bad Company series of games. It is capable of using as many threads as the underlying hardware platform provides. The engine performs the primary Game and Render tasks on the GPU and divides the other system-related work into jobs.

• Each job typically consists of 15K to 200K lines of C++ code, with the average job size being around 25K lines of code. Most of these jobs are independent, while some have interdependencies.

• Each frame of the game would typically contain two hundred to three hundred jobs and the engine

(29)

(30)

Questions

• Study-support questions

(31)

Links

Multi-core <http://en.wikipedia.org/wiki/Multi_core>

SMP - Symmetric Multiprocessor System <http://en.wikipedia.org/wiki/Symmetric_multiprocessor>

ABA problem <http://en.wikipedia.org/wiki/ABA_problem>

MESI <http://en.wikipedia.org/wiki/MESI>

MOESI <http://en.wikipedia.org/wiki/MOESI_protocol>

Embedded moves to multicore <http://embedded-computing.com/embedded-moves-multicore>

Goodacre, J.; Sloss, A.N., "Parallelism and the ARM instruction set architecture," IEEE Computer, vol. 38, no. 7, pp. 42-50, July 2005. doi: 10.1109/MC.2005.239 <http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1463106&isnumber=31455>

Goodacre, J., "Details of a New Cortex Processor Revealed, Cortex-A9", presentation at the ARM Developers' Conference, October 2007. <http://www.arm.com/files/downloads/cortex-a9_devcon-talk_introduction_final-02.pdf>

Stevens, A., "Introduction to AMBA 4 ACE", ARM white paper, June 6, 2011.