Parallelism and Energy-efficient GPUs from Mobile to Supercomputers. Jem Davies ARM Fellow, VP of Technology Media Processing Division 24-Feb-13


(1)

Parallelism and Energy-efficient GPUs

from Mobile to Supercomputers

Jem Davies

ARM Fellow, VP of Technology

Media Processing Division

(2)
(3)

The Eras of Computing

[Figure: units per era of computing, on a scale of 1M, 10M, 100M, 1 billion, 10 billion and 100 billion. 1st Era: Mainframe, Mini. 2nd Era: PC, Desktop Internet. Then: The Internet of Things.]

(4)


(5)

Merging Our Digital and Physical Worlds

ARM® technology and the ARM ecosystem scale from sensors to servers, connecting our world, from the Internet of Things to supercomputers

ARM’s partners make SoCs containing heterogeneous compute engines

(6)

From 1mm³ to 1km³

Battery Solar Cells

Processor, SRAM and PMU

8.75mm³ platform

solar cell 0.18µm Cortex™-M3

12µAh Li-ion battery

University of Michigan

1mm³ platform

4200 ARM-powered neutrino detectors: 70 bore holes, 2.5km deep, 60 detectors per bore hole

(7)

Mobility Is Reshaping Behaviour

71% of enterprises plan to establish own store

Tablets outsell laptop PCs

60% of Facebook users access from mobile

79% of iPad users use exclusively for Internet

36% of Chinese Android™ users spend 2+ hrs a day on mobile apps

1 billion smartphones will ship in 2013

50% of mobile computing devices in 2013 will be ARM-based

Sources: IDG, Pew Research

(8)

Pushing the Boundaries of Mobile Computing

Delivering tomorrow’s features and capabilities

 1080p HD video and beyond, multiple streams, HEVC

 Live image and scene recognition for augmented reality

 Intuitive gesture and natural language input

 Video & image editing, manipulation, and rendering

 Console-quality gaming experience

 Facial recognition and unlock

 2.5k/4k displays, multiple simultaneous displays

 Multi-tasking and multi-window

(9)

Interactive Mobile Devices Drive Graphics

Graphics capability is a key factor in consumer purchasing decisions

Rich graphics is a priority for anything with a screen

 Smartphones, DTVs, STBs, tablets, hand-held games consoles, netbooks, in-car infotainment

In the post-PC era, data from the cloud and graphically-rich, interactive, connected devices mean a growing market for GPUs

 4 billion internet-connected screens in 2016, most with embedded graphics

(10)
(11)

Performance/Power is the Big Challenge

 Requirements on the GPU continue to grow exponentially but still have to fit within constant power boundaries

 Mali GPU power already in mobile power budget; 35% additional energy efficiency improvements required every year to fit new performance requirements within SoC thermal limits

[Figure: relative power over time, showing ARM GPU and system savings of 35% annually.]
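The 35% figure compounds year over year. The slide can be read two ways, and the sketch below shows both: a 35% annual power saving for a fixed workload (power multiplied by 0.65 each year), or a 35% annual gain in performance per watt (efficiency multiplied by 1.35 each year). The function names and the five-year horizon are illustrative assumptions, not taken from the talk.

```python
def relative_power_after(years: int, annual_saving: float = 0.35) -> float:
    """Relative power for a fixed workload after `years` of compounding savings."""
    return (1.0 - annual_saving) ** years


def relative_efficiency_after(years: int, annual_gain: float = 0.35) -> float:
    """Relative performance per watt after `years` of compounding gains."""
    return (1.0 + annual_gain) ** years


# Print both readings side by side for an assumed five-year horizon.
for year in range(0, 6):
    power = relative_power_after(year)
    efficiency = relative_efficiency_after(year)
    print(f"year {year}: power x{power:.3f}, perf/W x{efficiency:.2f}")
```

Under either reading, a constant annual percentage compounds quickly, which is why the requirement is demanding when the power budget stays fixed.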

(12)

ARM Mali GPU Momentum

Delivering Tomorrow’s Features and Capabilities

 >230 OEM products shipping with Mali GPUs

 #1 graphics in Android tablets (>50%)

 >20% Android smartphones

 #1 in graphics-enabled DTV (>70%)

 75 licences across 54 partners, making Mali-based SoCs

 Over 150M Mali™-based GPUs reported in 2012

(13)

ARM is not just about Mobile…

…but Mobile has been driving a lot of innovation

and enabling diversity

(14)

Mobile Revolution Built On Diversity

Devices: Super-phone, Mass-market smartphone, Tablet, Clamshell

CPUs: Cortex-A57, Cortex-A53, Cortex-A15, Cortex-A9, Cortex-A7, Cortex-A5

GPUs: Mali-T678, Mali-T628, Mali-T604, Mali-450, Mali-400

Other blocks: Video, Camera, Modem, Connectivity, Memory, Reference Design

Process: 40LP, 32 HKMG, 20nm, 14/16 FinFET, SOI

Operating systems: Android, Windows Phone, Firefox Mobile, Chrome OS, Windows RT, Aliyun OS

(15)

Post-PC World Built on Diversity

Rapid pace of development has embraced numerous changes, built on a diverse, nimble, fast-responding ecosystem of partners

 One size does NOT fit all (nor should it!)

 Not just desktops and laptops any more

New form factors and new usage models:

 Phones, tablets, phablets, netbooks, hybrid tablets, TVs, consoles, Chromebooks, watches!

 Always-on, always-connected, long battery-life

 Different form factors in different regions

Diverse Silicon foundry ecosystem: bulk CMOS, FDSOI, FinFET

ARM’s partners are driven by innovation and user requirements, not by ARM

(16)

big Performance, LITTLE Power

Mobile performance needs are highly elastic

 Smartphone needs to be very responsive

 Constant connectivity and usage

 Voice, SMS, trickle feeds are low-intensity tasks

[Figure: performance axis, with "big" at highest performance.]
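One way to picture elastic performance needs is a policy that runs low-intensity work on LITTLE cores and moves to big cores only when load spikes. The sketch below is a toy model under stated assumptions: `pick_cluster`, its thresholds and the load trace are all hypothetical, and real big.LITTLE scheduling software is considerably more involved.

```python
def pick_cluster(load: float, current: str, up: float = 0.8, down: float = 0.3) -> str:
    """Pick "big" or "LITTLE" for a load in [0, 1].

    The two thresholds (assumed values) give hysteresis, so a load that
    hovers between them does not bounce between clusters.
    """
    if load >= up:
        return "big"
    if load <= down:
        return "LITTLE"
    return current  # between thresholds: stay where we are


# Hypothetical load trace: idle, a burst of interactive work, then idle again.
loads = [0.05, 0.10, 0.90, 0.60, 0.20, 0.05]
cluster = "LITTLE"
trace = []
for load in loads:
    cluster = pick_cluster(load, cluster)
    trace.append(cluster)
print(trace)
```

The burst is served by the big cluster and the rest of the trace stays on LITTLE, which is the "big performance, LITTLE power" trade in miniature.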

(17)

GPU Compute Making the Difference

Computer Vision

Real Time Still and Moving Image Perfection

Multi-Perspective Vision

2D to 3D

Information Extraction

Multi-User Interaction

Light-Field Photography

Computational Photography

Benefits

 More energy-efficient processing

 BOM reduction

 Improved accuracy/quality

 Improved existing use cases

 Unlock new use cases

Trends

 Heterogeneous computing

 Portability

 Parallel computation

 Hardware acceleration

 GPU Computing

(18)

Unifying CPU and GPU Compute

High efficiency approaches for:

Computational photography

Immersive visual computing

Augmented Reality

Improved user interfaces

GPU for highly-threaded tasks and CPU for low-threaded tasks

Efficient ARM IP synergy makes it worthwhile to move even small tasks to the GPU

Full Profile GPU Compute is key

Continued Performance & Efficiency Gains

Right-sized computing

(19)

Energy Efficiency Underlies It All

Efficiency is a requirement for a connected world

(20)

Energy-efficiency Has Always Mattered

Segment differentiation matters less than you think

High-end Mobile driven by thermal constraints

And “You can never be too thin”

Sensors need 10 years from a button cell battery

Everybody hates fans

Desktops struggle to:

 Dissipate more than ~150W out of a single chip package

 Supply more than ~300W from a PSU onto a PCI card

Servers struggle:

 To get more than ~10kW into a rack

(21)

GPUs from Mobile to Supercomputer

Mont-Blanc project selects Samsung Exynos 5 Processor - 13 Nov 2012

The project continues its research effort towards an energy efficient HPC prototype using low-power embedded technology

Salt Lake City, 13th November 2012. The Mont-Blanc European project has selected the Samsung Exynos platform as the building block for powering its first integrated low-power High Performance Computing (HPC) prototype. The aim of the Mont-Blanc project is to design a new type of computer architecture capable of setting future global HPC standards, built from today’s energy-efficient solutions used in embedded and mobile devices.

The Samsung Exynos 5 Dual is built on 32nm low-power HKMG (High-K Metal Gate), and features a dual-core 1.7GHz mobile CPU built on ARM® Cortex™-A15 architecture plus an integrated ARM Mali™-T604 GPU for increased performance density and energy efficiency. It has been featured and market proven in consumer and mobile devices such as Samsung

(22)

The Search for Parallelism

ARM (like the industry) utilizes parallelism across all segments, across multiple devices to drive down energy use

 CPUs use instruction-level parallelism: modern ARM CPUs have IPC well over 3

 CPUs and OSs exploit task-level parallelism: fast context swap, MMUs, virtual memory, hypervisors

Graphics is one of the few problem spaces that have very high levels of parallelism:

 Do <foo> on every vertex in the frame (10s - 100s of kvertices/frame)

 Do <bar> on every pixel in the frame (millions of pixels/frame)

 Data-level parallelism through thread-level parallelism

GPU compute (throughput architectures/implementations) has great advantages:

 Performance and Energy efficiency

CPUs remain “easier” to program
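The "do <bar> on every pixel" pattern can be sketched as one small function mapped independently over every element of a frame. In the sketch below, `shade` is a hypothetical stand-in for <bar> (it inverts an 8-bit RGB pixel), and a thread pool stands in for the GPU's many threads. In CPython these threads will not actually run the shading in parallel, so this illustrates the programming model only, not the performance.

```python
from concurrent.futures import ThreadPoolExecutor


def shade(pixel):
    """Hypothetical <bar>: invert one 8-bit RGB pixel."""
    r, g, b = pixel
    return (255 - r, 255 - g, 255 - b)


def shade_frame(frame, workers=4):
    """Apply shade() to every pixel; no pixel depends on any other."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so the output frame lines up with the input.
        return list(pool.map(shade, frame))


frame = [(0, 0, 0), (255, 128, 0), (10, 20, 30)]
print(shade_frame(frame))
```

Because each call touches only its own pixel, the work can be split across as many threads as the hardware offers, which is the data-level parallelism through thread-level parallelism that the slide refers to.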

(23)

Amdahl’s Law is Alive and Well

[Figure: maximum speedup (axis from 0 to 32) for programs that are 100%, 95%, 90%, 75% and 50% parallel.]

Speedup on parallel processors is limited by the sequential portion of the program

The sequential portion need not be large to constrain speedup significantly!
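Amdahl's Law gives the speedup on N processors for a program whose parallel fraction is p as 1 / ((1 - p) + p / N), which can never exceed 1 / (1 - p). A minimal sketch follows, using the parallel fractions from the slide; the processor counts are an assumption chosen to match the 0 to 32 axis.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Speedup predicted by Amdahl's Law for a given parallel fraction."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


# Parallel fractions from the slide; processor counts are assumed.
for p in (1.0, 0.95, 0.90, 0.75, 0.50):
    row = [round(amdahl_speedup(p, n), 2) for n in (4, 8, 16, 32)]
    print(f"{p:.0%} parallel: {row}")
```

With 32 processors, a program that is 95% parallel reaches a speedup of only about 12.5, and no number of processors can take it beyond 20.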

(24)

ARM’s Heterogeneous Computing Advantages

ARM has always been involved in heterogeneous systems

 For 23 years, mobile SoCs have contained CPUs, DSPs and other compute elements

ARM’s first multicore CPU was produced 10 years ago

 ARM has produced several generations of coherent interconnect

ARM introduced the world’s first embedded multicore GPU

 Now have TFLOP GPUs

big.LITTLE™ and GPU Computing

 Right task, right processor

Today’s modern mobile SoC contains ISPs, DSPs, GPUs, baseband accelerators, crypto accelerators and VPUs

 “Helper” ARM CPUs, security, power controllers, baseband, WiFi

 Heterogeneous, multicore applications CPUs

ARM is introducing multicore VPUs

(25)

Improving the Granule of Offload

Hardware: sharing data

 Traditional CPU -> GPU was unidirectional offload

 GPU is I/O-coherent with CPU

◦ GPU reads snoop CPU data

 Frame buffer is not cached

 Too big, especially with multi-buffering

 Typically large, pre-arranged buffers

 Set up by a driver

GPU Compute is much more bidirectional

 GPU needs to be fully coherent with the CPU

 Smaller-scale sharing of data

 Exchanging pointers to data structures

[Figure: two diagrams, one labelled CPU, GPU, processor and Display, the other labelled CPU, GPU and Object Database.]
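The effect of a smaller offload granule can be sketched with a simple break-even model: offloading pays off once the per-item time saved on the GPU outweighs the fixed cost of handing the work over. The model and every number below are illustrative assumptions, not measurements. The point is only that cutting the fixed cost (for example by sharing pointers coherently instead of setting up large buffers through a driver) makes much smaller tasks worth moving.

```python
def break_even_items(overhead_us: float, cpu_us_per_item: float, gpu_us_per_item: float) -> float:
    """Smallest task size (in items) at which GPU offload beats the CPU.

    Assumed model: CPU time = n * cpu_us_per_item,
                   GPU time = overhead_us + n * gpu_us_per_item.
    """
    if gpu_us_per_item >= cpu_us_per_item:
        raise ValueError("GPU must be faster per item for offload to ever pay off")
    return overhead_us / (cpu_us_per_item - gpu_us_per_item)


# Hypothetical costs: a heavy driver-managed hand-off versus a light coherent one.
driver_offload = break_even_items(overhead_us=600.0, cpu_us_per_item=1.0, gpu_us_per_item=0.25)
coherent_offload = break_even_items(overhead_us=30.0, cpu_us_per_item=1.0, gpu_us_per_item=0.25)
print(driver_offload, coherent_offload)
```

With these assumed costs, the break-even task size drops from 800 items to 40 when the hand-off overhead falls from 600 to 30 microseconds.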

(26)

CPU + GPU Heterogeneous benefits

ARM leadership in system coherency is key to enable heterogeneous computing and the future of graphics

GPU participating in system coherency enables a broad set of benefits and values

Avoids unnecessary cache flush/invalidate operations

 Maintain warm caches: lower power, higher performance

Avoids DRAM access to data present in CPU caches

 Lower power by elimination of off-chip memory accesses

Eliminates software-managed cache coherency

 Greater efficiency, reduced complexity

 Faster time to market (TTM) by removing a source of complex synchronization bugs

[Figure: CoreLink™ CCN-504 Cache Coherent Network with AMBA® 4 ACE interfaces. Labels: CoreLink DMC 520, System and I/O, NIC network interconnect, L2 caches, Up to Quad.]

(27)
(28)

Building Smarter Systems

Visual computing is about designing the best compute systems to match the competing demands of power and performance

ARM provides the four fundamental IP building blocks needed to enable high-performance, energy-efficient, coherent System-on-Chips

ARM’s unique experience and portfolio of processors, memory systems, physical IP and interconnect designed together

ARM’s leading system IP extends coherency for optimized multicore solutions

(29)

ARMv8: 64-bit Processing Now Available In Mobile Power

64-bit processing and full backward compatibility with 32-bit

Mobile devices will need more than 4GBytes

Scalable power and performance with big.LITTLE

Sub-1mm² processors

Co-designed with Mali-T600 GPUs

[Figure: ARMv8 block diagram. Labels: 32-bit, 64-bit, CRYPTO, Scalar FP, Advanced SIMD, ARMv7-A software.]
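The 4GByte figure follows from address width: a flat N-bit byte address can name 2^N bytes, so 32 bits reach exactly 4 GiB. The sketch below is plain arithmetic. The helper name is an assumption, and real processors may implement fewer address bits than the architectural width.

```python
def addressable_bytes(address_bits: int) -> int:
    """Number of bytes a flat byte address of the given width can name."""
    return 2 ** address_bits


GIB = 2 ** 30  # bytes per GiB

print(addressable_bytes(32) // GIB)  # 4
print(addressable_bytes(64) // GIB)  # 17179869184
```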

(30)

Complexity

Out-of-order superscalar CPUs introduce some complexity

Multicore CPUs/GPUs – more complexity

 Memory consistency model

 Threading models

Heterogeneous computing - more complexity

Some GPU compute engines are:

 Complex

 Badly described

 Difficult to reason about

Most graphics developers want to create visually stunning

(31)

Complexity drives the need for standards

If we do not abstract away from this complexity and present a simpler world to the developer…

 Either they won’t use these new systems

 Or, they’ll use them and get it wrong

Either way, it will be our fault, and they will hate us for it ☺

If we all provide different abstractions…

 They will hate us for it

We need standard(s)

(32)

ARM and HSA

Heterogeneous System Architecture (HSA) Foundation

 Not-for-profit consortium founded by industry leaders to create hardware and software standards for heterogeneous computing to:

 Simplify the programming environment

 Make compute at low power pervasive

 Introduce new capabilities in modern computing devices

ARM is a founding member and board member

Heterogeneous computing is the essence of efficient computation: the right processor for the right task

More research required – not a solved problem

 Better models, compilation technologies, debug, programming methodologies

(33)

Diverse Partners Driving Future of Heterogeneous Computing

[Figure: HSA Foundation member logos by tier: Founders, Promoters, Supporters, Contributors, Academic.]

(34)
(35)

Towards the Future

Don’t miss opportunities. Push the boundaries!

Innovation in the post-PC era requires:

 Focus on energy-efficiency across diverse market segments with diverse devices

 Continue to utilize parallelism (of all forms)

 Compute on the right device

 Highest efficiency CPU for required performance

 Use the right programmable device for the right task, such as GPU computing

 Domain-specific processors for specific tasks

Not just a hardware problem; more software investment is needed:

 Operating systems, hypervisors, managed environments, APIs, tools

 Application developers across the ecosystem

(36)
(37)

Thank you

谢谢

(38)
