Parallelism and Energy-efficient GPUs
from Mobile to Supercomputers
Jem Davies
ARM Fellow, VP of Technology
Media Processing Division
The Eras of Computing
[Figure: units per era of computing, on an axis from 1M to 100 Billion. 1st Era: Mainframe, Mini. 2nd Era: PC, Desktop Internet. The Internet of Things.]
Merging Our Digital and Physical Worlds
ARM® technology and the ARM ecosystem scale from sensors to servers, connecting our world, from the Internet of Things to supercomputers
ARM’s partners make SoCs containing heterogeneous compute engines
From 1mm³ to 1km³
Battery, solar cells
Processor, SRAM and PMU
8.75mm³ platform
Solar cell; 0.18µm Cortex™-M3
12µAh Li-ion battery
University of Michigan
1mm³ platform
4200 ARM-powered neutrino detectors
70 bore holes, 2.5km deep
60 detectors per bore hole
Mobility Is Reshaping Behaviour
71% of enterprises plan to establish own store
Tablets outsell laptop PCs
60% of Facebook users access from mobile
IDG, Pew Research
79% of iPad users use exclusively for Internet
1 billion smartphones will ship in 2013
50% of mobile computing devices in 2013 will be ARM-based
36% of Chinese Android™ users spend 2+ hrs a day on mobile apps
Pushing the Boundaries of Mobile Computing
Delivering tomorrow’s features and capabilities
1080p HD video and beyond, multiple streams, HEVC
Live image and scene recognition for augmented reality
Intuitive gesture and natural language input
Video & image editing, manipulation, and rendering
Console-quality gaming experience
Facial recognition and unlock
2.5k/4k displays, multiple simultaneous displays
Multi-tasking and multi-window
Interactive Mobile Devices Drive Graphics
Graphics capability is a key factor in consumer purchasing decisions
Rich graphics is a priority for anything with a screen
Smartphones, DTVs, STBs, Tablets, hand-held games consoles, netbooks
In-car infotainment
In the post-PC era, data from the cloud and graphically-rich, interactive, connected devices mean a growing market for GPUs
4 billion internet-connected screens in 2016, most with embedded graphics
Performance/Power is the Big Challenge
Requirements on the GPU continue to grow exponentially but still have to fit within constant power boundaries
Mali GPU power already in mobile power budget; 35% additional energy efficiency improvements required every year to fit new performance requirements within SoC thermal limits
[Figure: relative power over time; ARM GPU and system savings of 35% annually]
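As a rough illustration of what a 35% annual saving compounds to, the sketch below assumes the figure means that the energy needed for a fixed workload falls by 35% each year (a factor of 0.65 per year). That reading, and the function name `relative_energy`, are assumptions made for this example, not something the slide states.

```python
def relative_energy(years: int, annual_saving: float = 0.35) -> float:
    """Energy for a fixed workload after `years`, relative to year 0.

    Assumes (hypothetically) that the saving compounds: each year needs
    (1 - annual_saving) of the previous year's energy.
    """
    if years < 0:
        raise ValueError("years must be non-negative")
    if not 0.0 <= annual_saving < 1.0:
        raise ValueError("annual_saving must be in [0, 1)")
    return (1.0 - annual_saving) ** years


if __name__ == "__main__":
    for year in range(6):
        print(f"year {year}: {relative_energy(year):.3f} of the original energy")
```

Under this assumption, the same work needs roughly 12% of the original energy after five years (0.65^5 ≈ 0.116), which leaves room for about 8.6x more work inside a constant power budget.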
ARM Mali GPU Momentum
Delivering Tomorrow’s Features and Capabilities
>230 OEM products shipping with Mali GPUs
#1 graphics in Android tablets (>50%)
>20% Android smartphones
#1 in graphics-enabled DTV (>70%)
75 licences across 54 partners, making Mali-based SoCs
Over 150M Mali™-based GPUs reported in 2012
ARM is not just about Mobile…
…but Mobile has been driving a lot of innovation
and enabling diversity
Mobile Revolution Built On Diversity
[Figure: diversity across the mobile ecosystem]
Devices: Super-phone, Mass-market smartphone, Tablet, Clamshell
CPUs: Cortex-A57, Cortex-A53, Cortex-A15, Cortex-A9, Cortex-A7, Cortex-A5
GPUs: Mali-T678, Mali-T628, Mali-T604, Mali-450, Mali-400
Process: 40LP, 32 HKMG, 20nm, 14/16 FinFET, SOI
Other blocks: Video, Camera, Modem, Connectivity, Memory, Reference Design
Operating systems: Android, Windows Phone, Firefox Mobile, Chrome OS, Windows RT, Aliyun OS
Post-PC World Built on Diversity
Rapid pace of development has embraced numerous changes, built on diverse, nimble, fast-responding ecosystem of partners
One size does NOT fit all (nor should it!)
Not just desktops and laptops any more
New form factors and new usage models: Phones, tablets, phablets, netbooks, hybrid tablets, TVs, consoles, Chromebooks, watches!
Always-on, always-connected, long battery-life
Different form factors in different regions
Diverse Silicon foundry ecosystem: bulk CMOS, FDSOI, FinFET
ARM’s partners are driven by innovation and user requirements, not by ARM
big Performance, LITTLE Power
[Figure: performance axis; labels: Highest Performance, big]
Smartphone needs to be very responsive
Constant connectivity and usage
Voice, SMS, trickle feeds are low-intensity tasks
Mobile performance needs are highly elastic
GPU Compute Making the Difference
Computer Vision
Real Time Still and Moving Image Perfection
Multi-Perspective Vision
2D to 3D
Information Extraction
Multi-User Interaction
Benefits
More energy-efficient processing
BOM reduction
Improved accuracy/quality
Improved existing use cases
Unlock new use cases
Light-Field Photography
Computational Photography
Trends
Heterogeneous computing
Portability
Parallel computation
Hardware acceleration
GPU Computing
Unifying CPU and GPU Compute
High efficiency approaches for:
Computational photography
Immersive visual computing
Augmented Reality
Improved user interfaces
GPU for highly-threaded tasks and CPU for low-threaded tasks
Efficient ARM IP synergy makes it worthwhile to move even small tasks to the GPU
Full Profile GPU Compute is key
Continued Performance & Efficiency Gains
Right-sized computing
Energy Efficiency Underlies It All
Efficiency is a requirement for a connected world
Energy-efficiency Has Always Mattered
Segment differentiation matters less than you think
High-end Mobile driven by thermal constraints And “You can never be too thin”
Sensors need 10 years from a button cell battery
Everybody hates fans
Desktops struggle to:
Dissipate more than ~150W out of a single chip package
Supply more than ~300W from a PSU onto a PCI card
Servers struggle to:
Get more than ~10kW into a rack
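To give a feel for the sensor constraint on this slide, the sketch below works out the average current a device may draw if it must last a given time on one cell. The 225 mAh capacity is an assumed, illustrative figure for a coin cell and does not come from the talk; self-discharge and conversion losses are ignored.

```python
HOURS_PER_YEAR = 365.25 * 24  # 8766 hours


def average_current_budget_ua(capacity_mah: float, years: float) -> float:
    """Average current (in microamps) that drains `capacity_mah` in `years`."""
    if capacity_mah <= 0 or years <= 0:
        raise ValueError("capacity and lifetime must be positive")
    hours = years * HOURS_PER_YEAR
    return capacity_mah / hours * 1000.0  # convert mA to uA


if __name__ == "__main__":
    # Assumed capacity: 225 mAh (illustrative, not from the slides).
    budget = average_current_budget_ua(capacity_mah=225.0, years=10.0)
    print(f"average current budget: {budget:.2f} uA")
```

With these assumed numbers the budget comes to roughly 2.6 µA on average, which is why such a device has to spend almost all of its life asleep.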
GPUs from Mobile to Supercomputer
Mont-Blanc project selects Samsung Exynos 5 Processor - 13 Nov 2012
The project continues its research effort towards an energy efficient HPC prototype using low-power embedded technology
Salt Lake City, 13th November 2012.- The Mont-Blanc European project has selected the Samsung Exynos platform as the building block for powering its first integrated low-power High Performance Computing (HPC) prototype. The aim of Mont-Blanc project is to design a new type of computer architecture capable of setting future global HPC standards, built from today’s energy efficient solutions used in embedded and mobile devices
The Samsung Exynos 5 Dual is built on 32nm low-power HKMG (High-K Metal Gate), and features a dual-core 1.7GHz mobile CPU built on ARM® Cortex™-A15 architecture plus an integrated ARM Mali™-T604 GPU for increased performance density and energy efficiency. It has been featured and market proven in consumer and mobile devices such as Samsung
The Search for Parallelism
ARM (like the industry) utilizes parallelism across all segments, across multiple devices to drive down energy use
CPUs use instruction-level parallelism: modern ARM CPUs have IPC well over 3
CPUs and OSs exploit task-level parallelism: fast context swap, MMUs, virtual memory, hypervisors
Graphics is one of the few problem spaces that have very high levels of parallelism:
Do <foo> on every vertex in the frame (10s - 100s of kvertices/frame)
Do <bar> on every pixel in the frame (millions of pixels/frame)
Data-level parallelism through thread-level parallelism
GPU compute (throughput architectures/implementations) has great advantages: Performance and Energy efficiency
CPUs remain “easier” to program
Amdahl’s Law is Alive and Well
[Figure: maximum speedup (axis 0 to 32) for programs that are 100%, 95%, 90%, 75% and 50% parallel]
Speedup on parallel processors is limited by the sequential portion of the program
Sequential portion need not be large to constrain speedup significantly!
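The limit described on this slide is Amdahl's Law: with a fraction p of the work parallelisable across n processors, the speedup is 1 / ((1 - p) + p / n). A minimal sketch follows; the function name and the choice of 32 processors are assumptions for illustration, while the parallel fractions are the ones named in the figure.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Speedup predicted by Amdahl's Law for a fixed-size problem."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    sequential_fraction = 1.0 - parallel_fraction
    return 1.0 / (sequential_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    # Parallel fractions taken from the figure; 32 processors is an assumption.
    for fraction in (1.00, 0.95, 0.90, 0.75, 0.50):
        speedup = amdahl_speedup(fraction, processors=32)
        print(f"{fraction:.0%} parallel on 32 processors: {speedup:.1f}x")
```

A program that is 95% parallel reaches only about 12.5x on 32 processors, and no processor count can take it past 20x (1 / 0.05).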
ARM’s Heterogeneous Computing Advantages
ARM has always been involved in heterogeneous systems
For 23 years, mobile SoCs have contained CPUs, DSPs and other compute elements
ARM’s first multicore CPU was produced 10 years ago
ARM has produced several generations of coherent interconnect
ARM introduced the world’s first embedded multicore GPU
Now have TFLOP GPUs
big.LITTLE™ and GPU Computing
Right task, right processor
Today’s modern mobile SoC contains ISPs, DSPs, GPUs, baseband accelerators, crypto accelerators and VPUs
“Helper” ARM CPUs, security, power controllers, baseband, WiFi
Heterogeneous, multicore applications CPUs
ARM is introducing multicore VPUs
Improving the Granule of Offload
Hardware: sharing data
Traditional CPU -> GPU was unidirectional offload
GPU is I/O-coherent with CPU
◦ GPU reads snoop CPU data
Frame buffer is not cached
Too big, especially with multi-buffering
Typically large, pre-arranged buffers
Set up by a driver
GPU Compute is much more bidirectional
GPU needs to be fully coherent with the CPU
Smaller-scale sharing of data
Exchanging pointers to data structures
[Figure: block diagrams with labels CPU, GPU, processor, Display; and CPU, GPU, Object Database]
CPU + GPU Heterogeneous benefits
ARM leadership in system coherency is key to enable heterogeneous computing and the future of graphics
GPU participating in system coherency enables a broad set of benefits and values
Avoids unnecessary cache flush/invalidate operations
Maintain warm caches: lower power, higher performance
Avoids DRAM access to data present in CPU caches
Lower power by elimination of off-chip memory accesses
Eliminates software-managed cache coherency
Greater efficiency, reduced complexity
Faster TTM by removing source of complex synchronization bugs
[Figure: system block diagram. Labels: CoreLink™ CCN-504 Cache Coherent Network with AMBA® 4 ACE™ Interfaces; CoreLink DMC 520; System and I/O; NIC Network interconnect; L2 caches; Up to Quad]
Building Smarter Systems
Visual computing is about designing the best compute systems to match the competing demands of power and performance
ARM provides the four fundamental IP building blocks needed to enable high-performance, energy efficient, coherent System-on-Chips
ARM’s unique experience and portfolio of processors, memory systems, physical IP and interconnect designed together
ARM’s leading system IP extends coherency for optimized multicore solutions
ARMv8: 64-bit Processing Now Available In Mobile Power
64-bit processing and full backward compatibility with 32-bit
Mobile devices will need more than 4GBytes
Scalable power and performance with big.LITTLE
Sub 1mm2 processors
Co-designed with Mali-T600 GPUs
[Figure: ARMv8 architecture; labels: 32-bit, 64-bit, CRYPTO, ARMv8, Scalar FP, Advanced SIMD, ARM v7A Software]
Complexity
Out-of-order superscalar CPUs introduce some complexity
Multicore CPUs/GPUs – more complexity
Memory consistency model
Threading models
Heterogeneous computing - more complexity
Some GPU compute engines are:
Complex
Badly described
Difficult to reason about
Most graphics developers want to create visually stunning
Complexity drives the need for standards
If we do not abstract away from this complexity and present a simpler world to the developer…
Either they won’t use these new systems
Or, they’ll use them and get it wrong
Either way, it will be our fault, and they will hate us for it
If we all provide different abstractions…
They will hate us for it
We need standard(s)
ARM and HSA
Heterogeneous System Architecture (HSA) Foundation
Not-for-profit consortium founded by industry leaders to create hardware and software standards for heterogeneous computing to:
Simplify the programming environment
Make compute at low power pervasive
Introduce new capabilities in modern computing devices
ARM is a founding member and board member
Heterogeneous computing is the essence of efficient computation: the right processor for the right task
More research required – not a solved problem
Better models, compilation technologies, debug, programming methodologies
Diverse Partners Driving Future of Heterogeneous Computing
Founders, Promoters, Supporters, Contributors, Academic
Towards the Future
Don’t miss opportunities. Push the boundaries!
Innovation in the post-PC era requires:
Focus on energy-efficiency across diverse market segments with diverse devices
Continue to utilize parallelism (of all forms)
Compute on the right device
Highest efficiency CPU for required performance
Use the right programmable device for the right task, such as GPU computing
Domain-specific processors for specific tasks
Not just a hardware problem, need more software investment:
Operating systems, hypervisors, managed environments, APIs, tools
Application developers across the ecosystem