(1)

Embedded Systems: map to FPGA, GPU, CPU?

Jos van Eijndhoven

[email protected]

Bits&Chips Embedded systems

Nov 7, 2013

(2)

Moore’s law versus Amdahl’s law

[Figure: computational capacity (# of transistors) and software performance plotted over time. After the introduction of multicore technology the two diverge: hardware capabilities are underutilized because of the programming bottleneck.]
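The slide title refers to Amdahl's law, which bounds the speedup on n cores when only a fraction p of the run time can be parallelized: speedup = 1 / ((1 - p) + p / n). A minimal Python sketch follows; the function name and the sample values (p = 0.9, 4 to 256 cores) are assumptions chosen for illustration and do not come from the talk.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only `parallel_fraction` of the run time scales with cores."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Illustrative assumption: 90% of the run time is parallelizable.
    for cores in (4, 64, 256):
        print(cores, "cores ->", round(amdahl_speedup(0.9, cores), 2))
```

With p = 0.9 the bound stays below 10x no matter how many cores are added, because the serial 10% of the run time dominates.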

(3)

Multi-core CPUs are here to stay

AMD Fusion Llano

Nvidia Tegra 3

CPUs grow to 2, 4, 8, .. 64 .. 256 cores

Mobile, desktop, server

Multi-threaded programming model to keep cores busy

Complex multi-level caches, hardware cache coherency

(4)

Creating parallel programs is hard…

Edward A. Lee, EECS professor at U.C. Berkeley:

“Although threads seem to be a small step from sequential computation, in fact, they represent a huge step. They discard the most essential and appealing properties of sequential computation: understandability, predictability, and determinism.”

Herb Sutter, chair of the ISO C++ standards committee, Microsoft:

“Everybody who learns concurrency thinks they understand it, ends up finding mysterious races they thought weren’t possible, and discovers that they didn’t actually understand it yet after all.”
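The kind of race these quotes describe can be sketched with a shared counter. The code below is a hypothetical Python illustration (the name `run_workers` and the thread counts are assumptions, not from the talk): several threads perform a read-modify-write on the same variable, and a lock makes that update atomic. Without the lock, updates may be lost, and whether that happens can differ from run to run.

```python
import threading


def run_workers(num_threads: int, increments: int, use_lock: bool) -> int:
    """Increment a shared counter from several threads and return its final value."""
    counter = 0
    lock = threading.Lock()

    def worker() -> None:
        nonlocal counter
        for _ in range(increments):
            if use_lock:
                with lock:        # the read-modify-write is protected
                    counter += 1
            else:
                counter += 1      # unprotected: another thread may interleave here

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == "__main__":
    print("with lock:   ", run_workers(4, 10_000, use_lock=True))
    print("without lock:", run_workers(4, 10_000, use_lock=False))
```

Only the locked variant is guaranteed to return `num_threads * increments`; the unlocked result depends on thread scheduling, which is exactly the loss of determinism the first quote warns about.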

(5)

Learning raises the feeling of complexity

Provides good insight into C++ concurrency

C++11 standardizes several concurrency primitives

Warns about many, many subtle problems

The authoritative description (4th edition)

Apparently requires 1300+ pages...

Safe concurrency by defensive design

Shows that Java shares many

(6)

Further appetite for performance?

General-purpose CPUs are (traditionally) designed to handle code with complex control flow

Their effective usage of silicon for computations is low

Area(ALUs)/Area(total die) is about 1%

How to significantly increase operations/sec/$ and operations/J?

Hand off compute load to:

Function-specific hardware accelerators (H.264 decode, LTE channel decode, GFX rendering, IP packet processing, ...)

GP-GPU: general-purpose programmable graphics processing units

(7)

Offload CPU: computational efficiency

GP-GPU:

High floating-point performance (>1 TFLOPS)

Large off-chip memory bandwidth

Needs thousands of concurrent threads

Requires few inter-thread data dependencies and little data-dependent control

High-end chips take huge power (>100 W)

FPGA:

High integer performance (>1 TOPS)

Good power efficiency

Needs hundreds of concurrent instructions

Takes HW design expertise and effort

High-end chips are very expensive (>$1000)

(8)

CPU – FPGA combinations

Xilinx ‘Zynq’ or Altera ‘Cyclone’ with dual-core ARM

(9)

CPU – GPGPU combinations

Nvidia Tesla for high-end compute

AMD Fusion for desktop, gaming, …

Intel Haswell: desktop, laptop, …

ODROID: ARM quad-core with embedded GP-GPU

(10)

Intel for embedded: don’t underestimate

And furthermore:

Intel Atom “Bay Trail”: dual- and quad-core, 22nm, with embedded GP-GPU

Intel Quark: 1/10 power of Atom, 32-bit x86 architecture.

Arduino-style development board….

Intel NUC (Next Unit of Computing): Core i3 or i5 on a 4” x 4” board

(11)

CPU - Accelerator application mapping

Conceptually nice picture, real implementation hurdles:

Application I/O to hardware is shielded by any 'real' operating system

Thread control (sleep/wakeup) interacts with accelerator progress

C code of a SW thread is mapped to the FPGA through ‘high-level synthesis’

[Figure: an application partitioned into CPU-thread 1, an FPGA accelerator, and CPU-thread 2, connected by channels.]

Functional partitioning

Create SW thread with appropriate functionality

Channels for synchronized inter-thread communication
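The partitioning described on this slide (threads connected by synchronized channels) can be sketched in software before any hardware exists. The Python sketch below is a minimal illustration under stated assumptions: bounded `queue.Queue` objects stand in for the channels, `accelerator_stub` is a hypothetical placeholder thread for the FPGA accelerator (here it only squares its input), and all function names are invented for this example.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def producer(out_ch: queue.Queue, items) -> None:
    """CPU-thread 1: feeds work items into the first channel."""
    for item in items:
        out_ch.put(item)       # blocks while the channel is full
    out_ch.put(_DONE)


def accelerator_stub(in_ch: queue.Queue, out_ch: queue.Queue) -> None:
    """Stand-in for the accelerator: squares each value it receives."""
    while True:
        item = in_ch.get()     # blocks until data is available
        if item is _DONE:
            out_ch.put(_DONE)
            return
        out_ch.put(item * item)


def consumer(in_ch: queue.Queue, results: list) -> None:
    """CPU-thread 2: collects results from the second channel."""
    while True:
        item = in_ch.get()
        if item is _DONE:
            return
        results.append(item)


def run_pipeline(items, depth: int = 2) -> list:
    """Connect the three stages with two bounded channels and run them to completion."""
    ch1 = queue.Queue(maxsize=depth)
    ch2 = queue.Queue(maxsize=depth)
    results: list = []
    stages = [
        threading.Thread(target=producer, args=(ch1, items)),
        threading.Thread(target=accelerator_stub, args=(ch1, ch2)),
        threading.Thread(target=consumer, args=(ch2, results)),
    ]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
    return results


if __name__ == "__main__":
    print(run_pipeline(range(5)))
```

Because each stage only touches its channels, the middle stage could later be replaced by a real accelerator without changing the two CPU threads, provided the hardware implements the same channel semantics.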

(12)

Creation of an FPGA Accelerator

[Figure: software functional reference versus FPGA hardware implementation]

Software functional reference: compute kernel as C source code in a SW thread, using an inter-thread communication API (channels, shared memory, mutex, …)

FPGA hardware implementation: HW implementation of the compute kernel, with a HW implementation of the same communication API

High-level synthesis tooling (e.g. Xilinx’ Vivado)

Choose local (embedded) memories for some of the C variables, synthesize shared-memory access for others.

Balance the amount of hardware against the required performance (loop unrolling).

(13)

HW/SW communication stack

[Figure: HW/SW communication stack]

CPU-side stack: application SW (virtual address space); compute library (e.g. LAPACK, crypto); channel; user-level driver; kernel driver; Linux; multi-core CPU with MMU and caches; snoop control unit

Interconnect: PCIe / AXI memory bus; DDR

Accelerator-side (FPGA) stack: PCIe / AXI interface; DMA streaming, caches; FIFO interfaces to accelerators; shared access to local SRAMs; LAPACK accelerator; crypto accelerator

(14)

ARM (A9) multicore example

[Figure: ARM A9 multicore block diagram with L2 cache, DDR, and an attached FPGA or GPU.]

(15)

Intel (i5) multicore example

[Figure: Intel i5 multicore block diagram with memory bus, DDR, and an attached FPGA or GPU.]

Device reads will be pulled from CPU L1/L2/L3 caches

(16)

Memory-mapped communication?

Shared-memory paradigm to communicate with GPU/FPGA?

Matches C/Java programming model

Highly efficient, low run-time overhead

No system calls for data transport: just CPU load/stores

Take advantage of existing on-chip caches to buffer data

Sounds nice… Can I transfer a C/Java object pointer through my channel, for dereferencing inside my accelerator?

Well… that would require tackling:

Cache coherency issues

(17)

Shared memory with GP-GPU?

Today, Nvidia’s “CUDA” is the popular programming environment

Based on separated memories (use on-card memory)

Explicit data transport to/from GPU card, avoid shared memory

Allows a streaming model, where CPU and GPU are concurrently active

Providers of integrated GPUs (AMD, Intel, ARM) are working to improve on this programming model:

Integrated GPUs do share the global memory with the CPU, so there is no need to really copy data.

MMUs are being added to the GPU, allowing pointers to be shared

Cache coherency support remains (for now) only partial,

(18)

Shared memory with FPGA?

FPGA vendors are late to provide SW+tools to integrate an accelerator with host CPU+OS:

Support for the OpenCL programming model is coming

Rely on explicit data transport to/from FPGA local memory

Create ‘mmap’-capable device drivers yourself?

Also, implement MMU sharing yourself in the FPGA?

GPU vendors are ahead of FPGA vendors in attracting customers with SW-oriented tooling.

(19)

Evaluating an application mapping (1)

Vector Fabrics studied the mapping of a particular video object recognition algorithm for one of our customers:

Its compute kernel contained a 2-D convolution to match images.

The software reference implementation performed 0.9G multiply-adds per second on a desktop PC: too low for actual deployment.

We created performance estimates for potential mapping to
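To make the multiply-add metric concrete, the sketch below shows a direct 2-D convolution that also counts its multiply-adds. It is a plain Python illustration, not the customer's kernel: the function name, the list-of-lists image format, and the restriction to the 'valid' output region are all assumptions made for this example.

```python
def convolve2d_valid(image, kernel):
    """Direct 2-D convolution over the 'valid' region; returns (output, multiply_add_count)."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = ih - kh + 1, iw - kw + 1
    if oh <= 0 or ow <= 0:
        raise ValueError("kernel must not be larger than the image")
    output = [[0] * ow for _ in range(oh)]
    macs = 0
    for y in range(oh):
        for x in range(ow):
            acc = 0
            for ky in range(kh):
                for kx in range(kw):
                    # Convolution flips the kernel in both dimensions.
                    acc += image[y + ky][x + kx] * kernel[kh - 1 - ky][kw - 1 - kx]
                    macs += 1
            output[y][x] = acc
    return output, macs


if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    kernel = [[1, 0], [0, -1]]
    print(convolve2d_valid(image, kernel))
```

The count is always `oh * ow * kh * kw`. As a hypothetical sizing example, a 640x480 image with a 16x16 kernel needs 465 * 625 * 256 = 74.4 million multiply-adds per frame, which takes roughly 83 ms at the reference rate of 0.9G multiply-adds per second.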

(20)

Evaluating an application mapping (2)

One week of optimizing the algorithm for an Intel i5 platform

Multi-threading to utilize the available 4 cores, and vectorization (SSSE3) to speed up pixel operations

Reaching 25G multiply-adds/sec.

One week of mapping the C kernel to an FPGA implementation (not including the CPU-FPGA communication)

Rewriting the C kernel for use in a synthesis tool (Xilinx’ Vivado)

Carefully tuning the on-chip memory architecture for high parallelism

Reaching, amazingly, the same 25G multiply-adds/sec on a (ballpark) 200€ FPGA chip.

A few days to study the mapping to a midrange Nvidia GPU card

A rough estimate showed potential to achieve about 75G multiply-adds/sec.

Required the mapping of a much larger code portion to avoid frequent data transfers.

Would be a really difficult task.
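The throughput figures on this slide can be expressed as speedups over the 0.9G multiply-adds/sec software reference from the previous slide. The short Python sketch below only performs that arithmetic; the dictionary keys and the `speedup` helper are names invented here, and the GPU entry is the talk's rough estimate rather than a measurement.

```python
REFERENCE_GMACS = 0.9  # software reference on a desktop PC, as stated in the talk

# Throughput in G multiply-adds/sec, as reported on the slide.
THROUGHPUT_GMACS = {
    "Intel i5 (4 cores + SSSE3)": 25.0,   # after one week of tuning
    "FPGA via HLS": 25.0,                 # excludes CPU-FPGA communication
    "Midrange Nvidia GPU": 75.0,          # rough estimate only
}


def speedup(gmacs: float, reference: float = REFERENCE_GMACS) -> float:
    """Ratio of a platform's throughput to the software reference."""
    return gmacs / reference


if __name__ == "__main__":
    for platform, gmacs in THROUGHPUT_GMACS.items():
        print(f"{platform}: {speedup(gmacs):.1f}x")
```

The tuned CPU and the FPGA both land at about 28x the reference, and the GPU estimate at about 83x, before accounting for the data-transfer cost the slide warns about.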

(21)

Conclusion

Multi-core CPUs are everywhere, yet multi-threaded programming is difficult and error-prone. Heterogeneous system programming adds further complexity.

GP-GPU vendors have served the SW programmer better than FPGA vendors, by delivering integrated compilers and OS device drivers (and are now proceeding with memory-mapped integration).

Spending three weeks on code tuning and mapping was sufficient to obtain good insights into heterogeneous architecture opportunities.

(22)

Thank you
