Embedded Systems:
map to FPGA, GPU, CPU?
Jos van Eijndhoven
Bits&Chips Embedded systems
Nov 7, 2013
Moore’s law versus Amdahl’s law
[Figure: computational capacity (# of transistors) and software performance versus time. Annotations: introduction of multicore technology; hardware capabilities underutilized; programming bottleneck.]
Multi-core CPUs are here to stay
AMD Fusion Llano
Nvidia Tegra 3
CPUs grow to 2, 4, 8, .. 64 .. 256 cores
Mobile, desktop, server
Multi-threaded programming model to keep cores busy
Complex multi-level caches, hardware cache coherency
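Amdahl's law, named in the earlier slide title, puts a number on why more cores alone do not help: if only a fraction p of the run time can be parallelized, n cores give a speedup of at most 1 / ((1 - p) + p / n). A minimal C++ sketch follows; the function name and the sample fractions in the note are illustrative assumptions, not figures from the talk.

```cpp
#include <cassert>

// Amdahl's law: upper bound on the speedup when a fraction
// `parallel_fraction` (0..1) of the sequential run time is spread
// perfectly over `cores` cores and the rest stays sequential.
// (Hypothetical helper, for illustration only.)
double amdahl_speedup(double parallel_fraction, int cores) {
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(cores >= 1);
    const double serial_fraction = 1.0 - parallel_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / cores);
}
```

With 90% parallel code, amdahl_speedup(0.9, 4) is about 3.08, and amdahl_speedup(0.9, 256) stays below 10: the sequential 10% dominates.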
Creating parallel programs is hard…
Edward A. Lee, EECS professor at U.C. Berkeley:
“Although threads seem to be a small step from sequential computation, in fact, they represent a huge step. They discard the most essential and appealing properties of sequential computation: understandability, predictability, and determinism.”
Herb Sutter, chair of the ISO C++ standards committee, Microsoft:
“Everybody who learns concurrency thinks they understand it, ends up finding mysterious races they thought weren’t possible, and discovers that they didn’t actually understand it yet after all.”
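The "mysterious races" in the quote can be made concrete with the smallest possible case: several threads incrementing one shared counter. The sketch below shows two safe C++11 variants; the function names and the thread and iteration counts are illustrative assumptions. The unsafe variant is described only in a comment, because running it would be undefined behavior.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Each of `threads` workers increments a shared counter `iterations` times.
// Replacing std::atomic<long> by a plain `long` would be a data race
// (undefined behavior in C++11) and typically loses updates.
long count_with_atomic(int threads, int iterations) {
    assert(threads >= 0 && iterations >= 0);
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&counter, iterations] {
            for (int i = 0; i < iterations; ++i) {
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (std::thread& w : workers) w.join();
    return counter.load();
}

// Same work, with the counter protected by a mutex instead of an atomic.
long count_with_mutex(int threads, int iterations) {
    assert(threads >= 0 && iterations >= 0);
    long counter = 0;
    std::mutex guard;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&counter, &guard, iterations] {
            for (int i = 0; i < iterations; ++i) {
                std::lock_guard<std::mutex> lock(guard);
                ++counter;
            }
        });
    }
    for (std::thread& w : workers) w.join();
    return counter;
}
```

Both variants return threads * iterations on every run; the unsynchronized version gives no such guarantee, which is what makes such bugs hard to reproduce.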
Learning raises the feeling of complexity
Provides good insight into C++ concurrency
C++11 standardizes several concurrency primitives
Warns of many, many subtle problems
The authoritative description (4th edition)
Apparently requires 1300+ pages...
Safe concurrency by defensive design
Shows that Java shares many
Further appetite for performance?
General-purpose CPUs are (traditionally) designed to handle code with complex control-flow
Their effective usage of silicon for computations is low
Area(ALUs)/Area(total die) is about 1%
How to significantly increase operations/sec/$ and operations/J?
Hand off compute load to:
Function-specific hardware accelerators (H264 decode, LTE channel decode, GFX rendering, IP packet processing, ...)
GP-GPU: general-purpose programmable graphics processor units
Offload CPU: computational efficiency
GP-GPU:
High floating-point performance (>1 TFLOPS)
Large off-chip memory bandwidth
Needs thousands of concurrent threads
Few inter-thread data dependencies and little data-dependent control
High-end chips take huge power (>100 W)
FPGA:
High integer performance (>1 Tops)
Good power efficiency
Needs hundreds of concurrent instructions
Takes HW design expertise and effort
High-end chips are very expensive (>$1000)
CPU – FPGA combinations
Xilinx ‘Zynq’ or Altera ‘Cyclone’ with dual-core ARM
CPU – GPGPU combinations
Nvidia Tesla for high-end compute
AMD Fusion for desktop, gaming, …
Intel Haswell: desktop, laptop, …
ODROID: ARM quad-core with embedded GP-GPU
Intel for embedded: don’t underestimate
And furthermore:
Intel Atom “Bay Trail”: dual- and quad-core, 22nm, with embedded GP-GPU
Intel Quark: 1/10 power of Atom, 32-bit x86 architecture.
Arduino-style development board…
Intel NUC (Next Unit of Computing): Core i3 or i5 on a 4”x4” board
CPU - Accelerator application mapping
Conceptually nice picture, real implementation hurdles:
Application I/O to hardware is shielded by any 'real' operating system
Thread control (sleep/wakeup) interacts with accelerator progress
C code of SW thread mapped to FPGA through ‘high-level synthesis’
[Diagram: an application containing CPU-thread 1, an FPGA accelerator, and CPU-thread 2, connected by two channels]
Functional partitioning
Create SW thread with appropriate functionality
Channels for synchronized inter-thread communication
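A channel in this sense can be sketched in software as a bounded blocking queue: the sender waits while the channel is full, the receiver waits while it is empty. The code below is a minimal C++11 sketch under that assumption; the class name `Channel`, its capacity, and the helper `sum_through_channel` are hypothetical and not taken from the talk or from any vendor API.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <utility>

// Bounded blocking channel: send() waits while the channel is full,
// receive() waits while it is empty.
template <typename T>
class Channel {
public:
    explicit Channel(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    void send(T value) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [this] { return queue_.size() < capacity_; });
        queue_.push_back(std::move(value));
        not_empty_.notify_one();
    }

    T receive() {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !queue_.empty(); });
        T value = std::move(queue_.front());
        queue_.pop_front();
        not_full_.notify_one();
        return value;
    }

private:
    std::mutex mutex_;
    std::condition_variable not_full_;
    std::condition_variable not_empty_;
    std::deque<T> queue_;
    const std::size_t capacity_;
};

// Hypothetical two-stage pipeline: a producer thread sends 1..count,
// the calling thread receives the values and sums them.
long sum_through_channel(int count) {
    Channel<int> channel(4);
    std::thread producer([&channel, count] {
        for (int i = 1; i <= count; ++i) channel.send(i);
    });
    long sum = 0;
    for (int i = 0; i < count; ++i) sum += channel.receive();
    producer.join();
    return sum;
}
```

Because both sides see only send() and receive(), one end of such a channel could in principle be replaced by an accelerator without changing the code of the other thread.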
Creation of an FPGA Accelerator
Software functional reference:
Inter-thread communication API (channels, shared memory, mutex, …)
Compute kernel: C source code in SW thread
FPGA hardware implementation:
HW implementation of same communication API
FPGA HW implementation of compute kernel
High-level synthesis tooling (e.g. Xilinx’ Vivado):
Choose local (embedded) memories for some of the C variables, synthesize shared-memory access for others.
Balance amount of hardware with required performance (loop unrolling)
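The two tuning decisions above, local memories and loop unrolling, can be illustrated with a small multiply-accumulate kernel. This is a sketch in plain C++, not tool-specific code: the kernel, its size, and the unroll factor are hypothetical, and the unrolling is written out by hand so the block compiles with any compiler. In an HLS flow the same effect is normally requested through tool directives (in Vivado HLS, to my understanding, a pragma such as `#pragma HLS UNROLL factor=4`).

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kTaps = 8;    // hypothetical kernel size
constexpr std::size_t kUnroll = 4;  // hypothetical unroll factor
static_assert(kTaps % kUnroll == 0, "kTaps must be a multiple of kUnroll");

// Multiply-accumulate over kTaps elements.
int mac_kernel(const int* shared_in, const int* shared_coeff) {
    assert(shared_in != nullptr && shared_coeff != nullptr);

    // Small fixed-size arrays: candidates for local (embedded) memories.
    int in[kTaps];
    int coeff[kTaps];
    for (std::size_t i = 0; i < kTaps; ++i) {
        in[i] = shared_in[i];        // in hardware: a shared-memory access
        coeff[i] = shared_coeff[i];
    }

    // Unrolled by kUnroll: each outer iteration holds kUnroll independent
    // multiply-adds, which hardware could execute in parallel at the cost
    // of kUnroll multipliers instead of one.
    int acc[kUnroll] = {0, 0, 0, 0};
    for (std::size_t i = 0; i < kTaps; i += kUnroll) {
        for (std::size_t u = 0; u < kUnroll; ++u) {
            acc[u] += in[i + u] * coeff[i + u];
        }
    }
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```

A larger unroll factor buys throughput with more hardware, which is the balance the slide refers to.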
HW/SW communication stack
[Diagram: CPU-side stack and accelerator-side stack]
CPU-side stack:
Application SW, virtual address space
Compute library (e.g. LAPACK, crypto), channel
User-level driver
Kernel driver
Linux
Multi-core CPU with MMU and caches, snoop control unit
Accelerator-side (FPGA) stack:
PCIe / AXI interface
DMA streaming, caches
FIFO interfaces to accelerators
Shared access to local SRAMs
LAPACK accelerator, crypto accelerator
Also in the diagram: PCIe / AXI memory bus, DDR
ARM (A9) multicore example
[Diagram labels: FPGA or GPU, DDR, L2 cache]
Intel (i5) multicore example
[Diagram labels: FPGA or GPU, memory bus, DDR]
Device reads will be pulled from CPU L1/L2/L3 caches
Memory-mapped communication?
Shared-memory paradigm to communicate with GPU/FPGA?
Matches C/Java programming model
Highly efficient, low run-time overhead
No system calls for data transport: just CPU load/stores
Take advantage of existing on-chip caches to buffer data
Sounds nice… Can I transfer a C/Java object pointer through my channel, for dereferencing inside my accelerator?
Well… that would require tackling:
Cache coherency issues
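The "no system calls for data transport" point can be sketched with POSIX mmap: after a one-time mapping, data moves with ordinary loads and stores. The example below is an illustration under stated assumptions only. A temporary file stands in for the accelerator, whereas a real design would map a device node exposed by an mmap-capable driver; the function names and the 4096-byte window are hypothetical. It does not address the cache coherency issues listed above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <sys/mman.h>
#include <unistd.h>

// Maps `bytes` of the object behind `fd` into the caller's address space.
// Returns nullptr on failure.
volatile std::uint32_t* map_window(int fd, std::size_t bytes) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) return nullptr;
    return static_cast<volatile std::uint32_t*>(p);
}

// Writes `value` through the mapping and reads it back.
// A temporary file is used here as a stand-in for a device (assumption).
bool mmap_round_trip(std::uint32_t value) {
    std::FILE* file = std::tmpfile();
    if (file == nullptr) return false;
    const int fd = fileno(file);
    const std::size_t bytes = 4096;
    if (ftruncate(fd, static_cast<off_t>(bytes)) != 0) {
        std::fclose(file);
        return false;
    }
    volatile std::uint32_t* window = map_window(fd, bytes);
    if (window == nullptr) {
        std::fclose(file);
        return false;
    }
    window[0] = value;                        // plain store, no system call
    const bool ok = (window[0] == value);     // plain load, no system call
    munmap(const_cast<std::uint32_t*>(window), bytes);
    std::fclose(file);
    return ok;
}
```

Only the setup (mmap, munmap) enters the kernel; the data path itself is the pair of load/store lines in the middle.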
Shared memory with GP-GPU?
Today, Nvidia’s “CUDA” is the popular programming environment
Based on separate memories (uses on-card memory)
Explicit data transport to/from GPU card, avoiding shared memory
Allows a streaming model, where CPU and GPU are concurrently active
Providers of integrated GPUs (AMD, Intel, ARM) are working to improve on this programming model:
Integrated GPUs do share the global memory with the CPU, no need to really copy data.
MMUs are being added to the GPU, allowing pointers to be shared
Cache coherency support remains (for now) only partial
Shared memory with FPGA?
FPGA vendors are late in providing SW and tools to integrate an accelerator with host CPU+OS:
Support for OpenCL programming model is coming
Relies on explicit data transport to/from FPGA local memory
Can you create ‘mmap’-capable device drivers yourself?
Can you also implement MMU sharing in the FPGA yourself?
GPU vendors are ahead of FPGA vendors in attracting customers with SW-oriented tooling.
Evaluating an application mapping (1)
Vector Fabrics studied the mapping of a particular video object recognition algorithm for one of our customers:
Its compute kernel contained a 2-D convolution to match images.
The software reference implementation performed 0.9G multiply-adds per second on a desktop PC: too low for actual deployment.
We created performance estimates for potential mapping to
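To make the unit "multiply-adds" concrete, here is a plain reference kernel of the kind described: a 2-D sliding-window sum of products that also counts its multiply-adds. It is a sketch under assumptions, not the customer's code. The `Image` type and function name are hypothetical, integer pixels are assumed, and the kernel is applied without flipping (the correlation form commonly used for matching).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major image; pixels.size() == width * height.
struct Image {
    std::size_t width;
    std::size_t height;
    std::vector<int> pixels;
};

// Slides `kernel` over `image` (valid region only, no padding) and stores
// the sum of products per position. `multiply_adds` receives the number
// of multiply-add operations performed.
Image convolve2d(const Image& image, const Image& kernel,
                 unsigned long long& multiply_adds) {
    assert(kernel.width >= 1 && kernel.height >= 1);
    assert(kernel.width <= image.width && kernel.height <= image.height);

    Image out;
    out.width = image.width - kernel.width + 1;
    out.height = image.height - kernel.height + 1;
    out.pixels.assign(out.width * out.height, 0);
    multiply_adds = 0;

    for (std::size_t y = 0; y < out.height; ++y) {
        for (std::size_t x = 0; x < out.width; ++x) {
            int sum = 0;
            for (std::size_t ky = 0; ky < kernel.height; ++ky) {
                for (std::size_t kx = 0; kx < kernel.width; ++kx) {
                    sum += image.pixels[(y + ky) * image.width + (x + kx)] *
                           kernel.pixels[ky * kernel.width + kx];
                    ++multiply_adds;
                }
            }
            out.pixels[y * out.width + x] = sum;
        }
    }
    return out;
}
```

The count equals output width × output height × kernel width × kernel height; dividing it by the measured run time gives a multiply-adds-per-second figure comparable to the 0.9G quoted above.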
Evaluating an application mapping (2)
One week of optimizing the algorithm for an Intel i5 platform
Multi-threading to utilize the available 4 cores, and vectorization (SSSE3) to speed up pixel operations
Reaching 25G multiply-adds/sec.
One week of mapping the C kernel to an FPGA implementation (not including the CPU-FPGA communication)
Rewriting the C kernel for use in a synthesis tool (Xilinx’ Vivado)
Carefully tuning the on-chip memory architecture for high parallelism
Reaching, remarkably, the same 25G multiply-adds/sec on a (ballpark) 200€ FPGA chip.
A few days to study mapping to a midrange Nvidia GPU card.
A rough estimate showed potential to achieve about 75G multiply-adds/sec.
Required mapping a much larger code portion to the GPU to avoid frequent data transfers.
Would be a really difficult task.
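The multi-threading step of the Intel i5 experiment can be sketched as a row-wise split of the convolution over four threads. This is a minimal illustration, not the optimized code from the study: all names are hypothetical, a square kernel and integer pixels are assumed, and the SSSE3 vectorization of the inner loops is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Computes output rows [row_begin, row_end) of a valid-region 2-D
// sliding-window sum of products with a square ksize x ksize kernel.
void convolve_rows(const std::vector<int>& image, std::size_t width,
                   const std::vector<int>& kernel, std::size_t ksize,
                   std::vector<int>& out, std::size_t out_width,
                   std::size_t row_begin, std::size_t row_end) {
    for (std::size_t y = row_begin; y < row_end; ++y) {
        for (std::size_t x = 0; x < out_width; ++x) {
            int sum = 0;
            for (std::size_t ky = 0; ky < ksize; ++ky) {
                for (std::size_t kx = 0; kx < ksize; ++kx) {
                    sum += image[(y + ky) * width + (x + kx)] *
                           kernel[ky * ksize + kx];
                }
            }
            out[y * out_width + x] = sum;
        }
    }
}

// Splits the output rows into contiguous bands, one band per thread.
// Threads write disjoint rows, so no locking is needed.
std::vector<int> convolve_parallel(const std::vector<int>& image,
                                   std::size_t width, std::size_t height,
                                   const std::vector<int>& kernel,
                                   std::size_t ksize, unsigned threads = 4) {
    assert(threads >= 1 && ksize >= 1);
    assert(ksize <= width && ksize <= height);
    assert(image.size() == width * height);
    assert(kernel.size() == ksize * ksize);

    const std::size_t out_width = width - ksize + 1;
    const std::size_t out_height = height - ksize + 1;
    std::vector<int> out(out_width * out_height, 0);

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threads; ++t) {
        const std::size_t begin = out_height * t / threads;
        const std::size_t end = out_height * (t + 1) / threads;
        workers.emplace_back([&, begin, end] {
            convolve_rows(image, width, kernel, ksize, out, out_width,
                          begin, end);
        });
    }
    for (std::thread& w : workers) w.join();
    return out;
}
```

The row split works because each output pixel depends only on read-only input, the "few inter-thread data dependencies" property that also makes such kernels candidates for GPU or FPGA mapping.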