Embedded Systems: map to FPGA, GPU, CPU?

(1)

Jos van Eijndhoven

Bits&Chips Embedded systems

Nov 7, 2013

(2)

Moore’s law versus Amdahl’s law

[Figure: computational capacity (# of transistors) and software performance plotted against time. Annotations: introduction of multicore technology; hardware capabilities underutilized; programming bottleneck.]
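The Amdahl side of this slide can be made concrete with a small calculation. The sketch below is an illustration only, not part of the original talk: the function name `amdahl_speedup` and its parameters are assumptions. It applies the standard form of Amdahl's law, speedup = 1 / ((1 - p) + p / n), where p is the fraction of run time that can be parallelized and n is the number of cores.

```cpp
#include <cassert>

// Amdahl's law: achievable speedup on `cores` cores when a fraction
// `parallel_fraction` (between 0 and 1) of the run time can run in parallel.
// Hypothetical helper for illustration; not taken from the slides.
double amdahl_speedup(double parallel_fraction, int cores) {
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(cores >= 1);
    const double serial_fraction = 1.0 - parallel_fraction;
    // The serial part does not shrink; only the parallel part is divided over the cores.
    return 1.0 / (serial_fraction + parallel_fraction / cores);
}
```

For example, with 90% of the run time parallelizable, `amdahl_speedup(0.9, 256)` stays below 10: the remaining 10% of serial code caps the speedup no matter how many cores are added.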

(3)

Multi-core CPUs are here to stay

AMD Fusion Llano

nVidia Tegra3

CPUs grow to 2, 4, 8, .. 64 .. 256 cores

Mobile, desktop, server

Multi-threaded programming model to keep cores busy

Complex multi-level caches, hardware cache coherency

(4)

Creating parallel programs is hard…

Edward A. Lee, EECS professor at U.C. Berkeley:

“Although threads seem to be a small step from sequential computation, in fact, they represent a huge step. They discard the most essential and appealing properties of sequential computation: understandability, predictability, and determinism.”

Herb Sutter, chair of the ISO C++ standards committee, Microsoft:

“Everybody who learns concurrency thinks they understand it, ends up finding mysterious races they thought weren’t possible, and discovers that they didn’t actually understand it yet after all.”

(5)

Learning raises the feeling of complexity

Provides good insight into C++ concurrency

C++11 standardizes several concurrency primitives

Warns about many, many subtle problems

The authoritative description (4th edition)

Apparently requires 1300+ pages...

Safe concurrency by defensive design

Shows that Java shares many

(6)

Further appetite for performance?

General-purpose CPUs are (traditionally) designed to handle code with complex control-flow

Their effective usage of silicon for computations is low: Area(ALUs)/Area(total die) is about 1%

How to significantly increase operations/sec/$ and operations/J?

Hand off compute load to:

Function-specific hardware accelerators (H264 decode, LTE channel decode, GFX rendering, IP packet processing, ...)

GP-GPU: general-purpose programmable graphics processor units

(7)

Offload CPU: computational efficiency

GP-GPU:

High floating point performance (>1TFlops)

Large off-chip memory bandwidth

Needs thousands of concurrent threads

Few inter-thread data dependencies and little data-dependent control

High-end chips take huge power (>100W)

FPGA:

High integer performance (>1Tops)

Good power efficiency

Needs hundreds of concurrent instructions

Takes HW design expertise and effort

High-end chips are very expensive (>$1000)

(8)

CPU – FPGA combinations

Xilinx ‘Zynq’ or Altera ‘Cyclone’ with dual-core ARM

(9)

CPU – GPGPU combinations

NVidia Tesla for high-end compute

AMD Fusion for desktop, gaming, …

Intel Haswell: desktop, laptop, …

ODROID: ARM quad-core with embedded GP-GPU

(10)

Intel for embedded: don’t underestimate

And furthermore:

Intel Atom “Bay Trail”: dual- and quad-core, 22nm, with embedded GP-GPU

Intel Quark: 1/10 the power of Atom, 32-bit x86 architecture. Arduino-style development board…

Intel NUC (Next Unit of Computing): Core i3 or i5 on a 4”x4” board

(11)

CPU - Accelerator application mapping

Conceptually nice picture, real implementation hurdles:

Application I/O to hardware is shielded by any 'real' operating system

Thread control (sleep/wakeup) interacts with accelerator progress

C code of the SW thread is mapped to the FPGA through ‘high-level synthesis’

[Figure: an application made up of CPU-thread 1, an FPGA accelerator, and CPU-thread 2, connected by channels]

Functional partitioning

Create SW thread with appropriate functionality

Channels for synchronized inter-thread communication
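As a software-only illustration of such a channel, the sketch below shows a bounded, blocking queue between two threads. It is a minimal sketch under stated assumptions, not the implementation from the talk: the class name `Channel` and its `send`/`receive` methods are hypothetical, and a real CPU-FPGA channel would need a hardware counterpart behind the same interface.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <utility>

// Bounded blocking channel between two software threads.
// Hypothetical helper for illustration; names are not from the slides.
template <typename T>
class Channel {
public:
    explicit Channel(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Blocks the sender while the channel is full.
    void send(T value) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [this] { return queue_.size() < capacity_; });
        queue_.push(std::move(value));
        not_empty_.notify_one();
    }

    // Blocks the receiver while the channel is empty.
    T receive() {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !queue_.empty(); });
        T value = std::move(queue_.front());
        queue_.pop();
        not_full_.notify_one();
        return value;
    }

private:
    const std::size_t capacity_;
    std::mutex mutex_;
    std::condition_variable not_full_;
    std::condition_variable not_empty_;
    std::queue<T> queue_;
};
```

A producer thread would call `send` and a consumer thread `receive`; the sleep/wakeup behaviour in the two `wait` calls is the kind of thread control that has to interact with accelerator progress once one end of the channel lives in hardware.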

(12)

Creation of an FPGA Accelerator

Software functional reference:

Inter-thread communication API (channels, shared memory, mutex, …)

Compute kernel: C source code in SW thread

FPGA hardware implementation:

HW implementation of the same communication API

FPGA HW implementation of the compute kernel

High-level synthesis tooling (e.g. Xilinx’ Vivado)

Choose local (embedded) memories for some of the C variables, synthesize shared-memory access for others.

Balance amount of hardware with required performance (loop unrolling)
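The hardware-versus-performance trade-off of loop unrolling can be illustrated in plain C++. This is a hand-written sketch, assuming a simple dot-product kernel; the function names `dot_rolled` and `dot_unrolled4` are hypothetical. In practice an HLS tool usually performs the unrolling itself when asked through a directive, so the manual version below only shows what the tool is being asked to build.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Rolled loop: one multiply-add per iteration. Synthesized as-is, this
// needs roughly one multiplier that is reused n times.
std::int32_t dot_rolled(const std::int16_t* a, const std::int16_t* b, std::size_t n) {
    std::int32_t acc = 0;
    for (std::size_t i = 0; i < n; ++i) {
        acc += std::int32_t(a[i]) * b[i];
    }
    return acc;
}

// Unrolled by 4: four independent multiply-adds per iteration. In hardware
// this maps to about four multipliers working in parallel, so more area is
// spent for fewer iterations. An HLS tool would typically be asked to do
// this with a directive (for Vivado HLS, something like
// "#pragma HLS UNROLL factor=4"; check the tool documentation).
std::int32_t dot_unrolled4(const std::int16_t* a, const std::int16_t* b, std::size_t n) {
    assert(n % 4 == 0);  // sketch only: no remainder loop
    std::int32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    for (std::size_t i = 0; i < n; i += 4) {
        acc0 += std::int32_t(a[i + 0]) * b[i + 0];
        acc1 += std::int32_t(a[i + 1]) * b[i + 1];
        acc2 += std::int32_t(a[i + 2]) * b[i + 2];
        acc3 += std::int32_t(a[i + 3]) * b[i + 3];
    }
    return (acc0 + acc1) + (acc2 + acc3);
}
```

The four accumulators only pay off if all four operand pairs can be read in the same cycle, which is why the choice of local (embedded) memories for the C variables matters as much as the unroll factor.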

(13)

HW/SW communication stack

[Figure: HW/SW communication stack, with both sides connected through a PCIe / AXI memory bus]

CPU-side stack:

Application SW (virtual address space)

Compute library, e.g. lapack, crypto

Channel

User-level driver

Kernel driver

Linux

Multi-core CPU with MMU and caches

Accelerator-side (FPGA) stack:

PCIe / AXI interface

DMA streaming, caches

Fifo interfaces to accelerators

Shared access to local srams

Lapack accelerator, crypto accelerator

Also shown: snoop control unit, DDR

(14)

ARM (A9) multicore example

[Figure labels: FPGA or GPU, DDR, L2 cache]

(15)

Intel (i5) multicore example

[Figure labels: FPGA or GPU, memory bus, DDR]

Device reads will be pulled from CPU L1/L2/L3 caches

(16)

Memory-mapped communication?

Shared-memory paradigm to communicate with GPU/FPGA?

Matches C/Java programming model

Highly efficient, low run-time overhead

• No system calls for data transport: just CPU load/stores

• Take advantage of existing on-chip caches to buffer data

Sounds nice… Can I transfer a C/Java object pointer through my channel, for dereferencing inside my accelerator?

Well… that would require tackling:

Cache coherency issues
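To show what "just CPU load/stores" looks like, here is a sketch of a single-producer/single-consumer ring buffer that communicates purely through shared memory, without system calls. It is an illustration between two CPU threads only; the class name `SpscRing` is hypothetical. The acquire/release ordering it relies on assumes that both sides see coherent memory, which is exactly the assumption that becomes a problem when one side is an accelerator.

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Single-producer/single-consumer ring buffer living in shared memory.
// Holds at most N-1 elements. Hypothetical sketch, not from the slides.
template <typename T, std::size_t N>
class SpscRing {
    static_assert(N >= 2, "ring needs at least two slots");

public:
    // Called by the producer only. Returns false when the ring is full.
    bool try_push(const T& value) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t next = (head + 1) % N;
        if (next == tail_.load(std::memory_order_acquire)) {
            return false;  // full
        }
        slots_[head] = value;
        // Release: the slot contents become visible before the new head index.
        head_.store(next, std::memory_order_release);
        return true;
    }

    // Called by the consumer only. Returns false when the ring is empty.
    bool try_pop(T& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire)) {
            return false;  // empty
        }
        out = slots_[tail];
        tail_.store((tail + 1) % N, std::memory_order_release);
        return true;
    }

private:
    std::array<T, N> slots_{};
    std::atomic<std::size_t> head_{0};  // next slot to write
    std::atomic<std::size_t> tail_{0};  // next slot to read
};
```

Storing plain values in the slots works as shown; storing object pointers would additionally require that the receiving side can translate and dereference the same virtual addresses.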

(17)

Shared memory with GP-GPU?

Today, Nvidia’s “CUDA” is the popular programming environment

Based on separated memories (use on-card memory)

Explicit data transport to/from GPU card, avoid shared memory

Allows a streaming model, where CPU and GPU are concurrently active

Providers of integrated GPUs (AMD, Intel, ARM) are working to improve on this programming model:

Integrated GPUs do share the global memory with the CPU, no need to really copy data.

MMUs are being added to the GPU, allowing pointers to be shared

Cache coherency support remains (for now) only partial, …

(18)

Shared memory with FPGA?

FPGA vendors are late to provide SW+tools to integrate an accelerator with host CPU+OS:

Support for OpenCL programming model is coming

Rely on explicit data transport to/from FPGA local memory

Create ‘mmap’-capable device drivers yourself?

Also implement MMU sharing yourself in the FPGA?

GPU vendors are ahead of FPGA vendors in attracting customers with SW-oriented tooling.

(19)

Evaluating an application mapping (1)

Vector Fabrics studied the mapping of a particular video object recognition algorithm for one of our customers:

Its compute kernel contained a 2-D convolution to match images.

The software reference implementation performed 0.9G multiply-adds per second on a desktop PC: too low for actual deployment.

We created performance estimates for potential mapping to …

(20)

Evaluating an application mapping (2)

One week of optimizing the algorithm for an Intel i5 platform:

Multi-threading to utilize the available 4 cores, and vectorization (SSSE3) to speed up pixel operations

Reaching 25G multiply-adds/sec, roughly 28x the 0.9G of the software reference.

One week of mapping the C kernel to an FPGA implementation (not including the CPU-FPGA communication):

Rewriting the C kernel for use in a synthesis tool (Xilinx’ Vivado)

Carefully tuning the on-chip memory architecture for high parallelism

Remarkably, reaching the same 25G multiply-adds/sec on a (ballpark) 200€ FPGA chip.

A few days to study mapping to a midrange Nvidia GPU card:

A rough estimate showed potential to achieve about 75G multiply-adds/sec.

Required the mapping of a much larger code portion to avoid frequent data transfers.

Would be a really difficult task.
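The multi-threading step of the Intel i5 optimization can be sketched as follows. This does not reproduce the customer's kernel, which is not shown in the slides: the function names `convolve_rows` and `convolve2d`, the float pixel type, and the square kernel are all assumptions. The sketch splits the output rows of a 2-D multiply-add (valid region only, kernel not flipped) into bands, one band per thread, so that no two threads write the same output element.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Computes output rows [row_begin, row_end) of a valid-region 2-D multiply-add.
// Hypothetical helper for illustration; not the kernel from the talk.
void convolve_rows(const std::vector<float>& image, std::size_t width,
                   const std::vector<float>& kernel, std::size_t ksize,
                   std::vector<float>& out,
                   std::size_t row_begin, std::size_t row_end) {
    const std::size_t out_w = width - ksize + 1;
    for (std::size_t y = row_begin; y < row_end; ++y) {
        for (std::size_t x = 0; x < out_w; ++x) {
            float acc = 0.0f;
            for (std::size_t ky = 0; ky < ksize; ++ky) {
                // Innermost loop walks contiguous pixels, which leaves room
                // for the compiler (or hand-written SIMD) to vectorize it.
                for (std::size_t kx = 0; kx < ksize; ++kx) {
                    acc += image[(y + ky) * width + (x + kx)] * kernel[ky * ksize + kx];
                }
            }
            out[y * out_w + x] = acc;
        }
    }
}

// Splits the output rows into bands and processes each band in its own thread.
std::vector<float> convolve2d(const std::vector<float>& image,
                              std::size_t width, std::size_t height,
                              const std::vector<float>& kernel, std::size_t ksize,
                              unsigned num_threads) {
    assert(image.size() == width * height);
    assert(kernel.size() == ksize * ksize);
    assert(ksize >= 1 && ksize <= width && ksize <= height);
    assert(num_threads >= 1);

    const std::size_t out_w = width - ksize + 1;
    const std::size_t out_h = height - ksize + 1;
    std::vector<float> out(out_w * out_h);

    const std::size_t band = (out_h + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (std::size_t row_begin = 0; row_begin < out_h; row_begin += band) {
        const std::size_t row_end = std::min(row_begin + band, out_h);
        // Bands are disjoint, so the threads need no locking on `out`.
        workers.emplace_back([&, row_begin, row_end] {
            convolve_rows(image, width, kernel, ksize, out, row_begin, row_end);
        });
    }
    for (std::thread& worker : workers) {
        worker.join();
    }
    return out;
}
```

Because the bands are independent, this kernel has none of the inter-thread data dependencies that make parallel programs hard; the difficulty in the GPU case came from the surrounding code, not from the convolution itself.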