Maximize Application Performance On the Go and In the Cloud with OpenCL* on Intel Architecture

(1)

Maximize Application Performance On the Go and In

the Cloud with OpenCL* on Intel® Architecture

Arnon Peleg (Intel)

Ben Ashbaugh (Intel)

Dave Helmly (Adobe)

(2)

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED,

BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS

OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR

INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

Intel may make changes to specifications and product descriptions at any time, without notice. All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.

Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families: Go to: Learn About Intel® Processor Numbers

Any code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user. Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel’s current plan of record product roadmaps.

Performance claims: Software and workloads used in performance tests may have been optimized for performance only on Intel®_{microprocessors. Performance tests, such as SYSmark and}

MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.Intel.com/performance

Intel, Intel Inside, the Intel logo, Centrino, Intel Core, Intel Atom, Pentium, VTune, and Ultrabook are trademarks of Intel Corporation in the United States and other countries. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos. *Other names and brands may be claimed as the property of others.

(3)

Today Intel and OpenCL* sessions

3

12:15 PM -1:45 PM

Maximize Application Performance On the Go and in the

Cloud with OpenCL* on Intel® Architecture

(Intel, Adobe Systems Inc.)

2:00 PM – 3:00 PM

Faster Video Creation with Higher Productivity Using Intel®

Developer Tools and OpenCL*

(Intel)

3:15 PM – 4:15 PM

Journey of Pixels in Adobe Photoshop from CPU, GPU to

Cloud on Intel HD Graphics

(4)

This Session Agenda

4

•

Introduction to the Intel® SDK for OpenCL* Applications

Presenter: Arnon Peleg (Intel)

Product Manager, Intel SDK for OpenCL Applications, Intel Corporation

•

Optimize OpenCL* applications for Intel® Iris™ Graphics

Presenter: Ben Ashbaugh (Intel)

Sr. Graphics Complier Engineer, Intel Corporation

•

Accelerating video production with OpenCL* and Intel® Iris™ Graphics

Presenter: Dave Helmly,

(5)

5

Arnon Peleg, Intel® Cooperation

(6)

The Question Is…

How to get maximum performance out of the platform?

Better Together



Use all resources (CPU, Graphics, Media)



Target the right task on the right device

Complex game & graphics engine graphs,

variable bitrate compression, …

Film & image post-processing filters,

graphics, video analytics, …

Programmable

Graphics

Multicore

CPU

Task-parallel or irregular workloads

better suited to a CPU

Highly data-parallel tasks better suited

to Intel® Processor Graphics

Be the conductor of our orchestra:

Use a dynamic set of instruments, whose ranges overlap

and compliment each other. When these instruments

come together, your composition has even more power to

move your audience.

(7)

What is Programmability With OpenCL*?

Unified Code Base

Video Editing

& Playback

Image

Processing

Perceptual

Computing

Games

Maximize platform performance with Intel®

Iris™ Graphics and Intel® HD Graphics for Visual

Computing Usages

Standard based by

Khronos* with Intel

participation

kernel void dp_mul(global const float *a, global const float *b, global float *c) {

int id = get_global_id(0); c[id] = a[id] * b[id];

} // execute over “n” work-items

CPU

Proc.

GFX

A Standard-based Environment to Write Portable Parallel Code for a Diverse Mix

of Multi-core CPUs, Processor Graphics, and Coprocessors

(8)

The

“BIG“ Idea Behind

OpenCL*

8

OpenCL* execution model …



Define N-dimensional computation domain



Choose a target device



Execute a kernel at each point in computation domain

void

trad_mul(int n,

const float *a,

const float *b,

float *c)

{

int i;

for (i=0; i<n; i++)

c[i] = a[i] * b[i];

}

Traditional loops

kernel

void

dp_mul(

global

const float *a,

global

const float *b,

global

float *c)

{

int id =

get_global_id

(0);

c[id] = a[id] * b[id];

}

// execute over “n”

work-items

(9)

Intel

®

SDK for OpenCL* Applications 2013

9

FREE download at intel.com/software/opencl

Supporting the OpenCL 1.2 Full-Profile

on 3

rd

and 4

th

Generation Intel® Core™ Processors with Intel® Iris™ Graphics and Intel®

HD Graphics product family, Intel® Xeon® Processors, and Intel® Xeon Phi™

Coprocessors

(10)

Intel

®

SDK for OpenCL* Applications

XE

2013



OpenCL* runtime and compiler for Intel® Processors (CPU) and Intel® Xeon® Phi™ Coprocessors



Linux OSs support:



Red Hat EL 6.x



SUSE SLES 11.x



OpenCL headers and libs



Source Code Samples



Product documentation – User guide, Optimization Guide



Development tools



Offline build and analysis tools



Debugger for OpenCL kernels on CPU



Profiling with Intel® VTune™ Amplifier XE

What is it?

(11)

Intel

®

SDK for OpenCL* Applications 2013



OpenCL* 1.2 support for 3rd and Future 4th Generation Intel® Core™

Processors



Accelerating performance with Intel® Iris™ Graphics and Intel® HD Graphics

family



Enhanced graphics and media APIs interoperability



DirectX*, OpenGL*, and the Intel® Media SDK



Support for Microsoft Windows 7* and Windows 8* Operating Systems



Developers tools for build, debug and tune of OpenCL applications.

What is it?

(12)

The Intel® SDK for OpenCL* Applications 2013 Software Stack

For Intel® Core™ Processors with Intel® Iris™ Graphics and Intel® HD Graphics

3rd and 4th Generation Intel® Core™ Processors

With Intel® Iris™ Graphics and Intel® HD Graphics

Intel HD Graphics Driver

With OpenCL* 1.2 support for CPU and the Processor Graphics

Intel® SDK for OpenCL* Applications

Microsoft*

Visual Studio*

Integration

SDK Tools

- Kernel Debugger

- Kernel Builder

Developer

Environment

(libs and

headers)

Profiling Tools

Integration

Online

Resources:

- Code Samples

- Optimization Guide

- Tech Articles and

Videos

Applications

Develop

Run

(13)

Visual Computing

Domain

Data Center

Domain

Intel® SDK for OpenCL* Applications Support Matrix

(Version 2013)

(Version XE 2013)

Supported Processors:

Intel® HD & Iris™ Graphics

Intel® Core Processor (CPU)

Intel® Xeon® Processor

Intel® Xeon Phi™ Coprocessor

Operating Systems:

Windows 7*

Window 8*

Red Hat* Linux*

SUSE* Linux*

Know the Processors and Operating Systems

You Use and download the SDK You Need

(14)

The Intel® SDK for OpenCL* Applications Online Resource

intel.com/software/opencl

@intelopencl

The SDK section of the Intel® Developers Zone is a one-stop shop for

resources, support and information for OpenCL* developers



Free Downloads



Code Samples



Tech Articles



Case Studies



Forums and Support



Beta Programs

(15)

Optimize OpenCL* applications for Intel® Iris™ Graphics

Ben Ashbaugh (Intel)

(16)

Agenda

Understanding Occupancy



How Intel® Iris™ Graphics executes OpenCL* Kernels

Memory Matters



Host to Device



Device Access

Compute Characteristics



Maximizing GFlops

Summary / Questions

16

(17)

Agenda

Understanding Occupancy



How Intel® Iris™ Graphics executes OpenCL* Kernels

(18)

Intel® Iris™

Graphics Architecture Overview

18

R

ing

B

us

/

LL

C

/

M

em

ory

Command Streamer (CS) Vertex Fetch (VF) Vertex Shader (VS) Hull Shader (HS) Tessellator Domain Shader (DS) Geometry Shader (GS) Stream Out (SOL) Clip/Setup Th rea d D is pa tc h Video Front End (VFE) Video Quality Engine Multi-Format

CODEC Blitter Display

Slice 0

L3$ Pixel Ops Rasterizer / Depth Render$ Depth$ L1 IC$ 3D Sampler Media Sampler Data Port Tex$ Su b Sl ic e 1 L1 IC$ 3D Sampler Media Sampler Data Port Tex$ Su b Sl ic e 0 EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU

Slice 1

(19)

Intel® Iris™

19

R

ing

B

us

/

LL

C

/

M

em

ory

Slice 0

Slice 1

Global Assets

•

Command Streamer

•

Thread Dispatch

(20)

Intel® Iris™

20

R

ing

B

us

/

LL

C

/

M

em

ory

Slice 0

Slice 1

Slice Common

•

L3 Cache

(21)

Intel® Iris™

21

R

ing

B

us

/

LL

C

/

M

em

ory

Slice 0

Slice 1

Sub Slice

•

Execution Units

•

Samplers and Data Port

(22)

Intel® Iris™ Graphics Architecture Building Blocks

OpenCL* Kernels run on an Execution Unit (EU)

Each EU is a Multi-Threaded SIMD Processor

Up to 7 threads per EU



128 x 8 x 32-bit registers per thread

Up to 8, 16, or 32 OpenCL* work items per

thread (compiler-controlled)



“SIMD8”, “SIMD16”, “SIMD32”



SIMD8



More Registers



SIMD16 and SIMD32



Better Efficiency

22

EU

Thread 0

Thread 2

Thread 4

Thread 6

Thread 1

`

Thread 3

Thread 5

(23)

OpenCL* Work Groups run on a Sub Slice



10 EUs per Sub Slice



Texture Sampler (Images)



Data Port (Buffers)



Instruction and Texture Caches

23

OpenCL* Work Groups may run on multiple EU threads, on multiple EUs!

Su

b

Sl

ic

e

L1

IC$

3D Sampler

Media

Sampler

Data Port

Tex$

EU

(24)

Two Sub Slices make a Slice

Shared Resources: “Slice Common”



L3 Cache + Shared Local Memory



Barriers

Intel® Iris™ Graphics has Two Slices



2 x 2 = 4 Sub Slices



4 x 10 = 40 EUs



Up to 40 x 7 = 280 EU threads



Up to 8960 OpenCL* work items in flight!

24

Slice

L3$ Pixel Ops Rasterizer / Depth Render$ Depth$ L1 IC$ 3D Sampler Media Sampler Data Port Tex$ Su b Sl ic e L1 IC$ 3D Sampler Media Sampler Data Port Tex$ Su b Sl ic e EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU

(25)

How Intel® Iris™

Graphics Runs OpenCL*

25

(26)

How Intel® Iris™

Graphics Runs OpenCL*

26

(27)

Su

b

Sl

ic

e

L1

IC$

3D Sampler

Media

Sampler

Data Port

Tex$

EU

How Intel® Iris™ Graphics Runs OpenCL*

27

2. Launch EU Threads for the Work Group Onto a Sub Slice



Repeat for each Work Group



Must have enough room in the Sub Slice for all EU threads for the Work Group



Not enough room in any Sub Slice



EU threads must wait

(28)

Occupancy

Goal: Use All Machine Resources

This is harder than it sounds! Many factors to consider…

1. Launch Enough Work



One thread sufficient to prevent an EU from going

idle



Too few EU threads can result in an EU being

stalled



More EU threads



better latency coverage



keeps an EU

active

(29)

Occupancy

–

Intel® VTune

™ Amplifier XE 2013

(30)

Occupancy (continued…)

2. Don’t Waste SIMD Lanes



Use an optimal Local Work Size



Good: Query for compiled SIMD size:

CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE



Occasionally Helpful: Compile for a specific local work size (8, 16, or 32):

__attribute__((reqd_work_group_size(X, Y, Z)))



Best: Let the driver pick (Local Work Size == NULL)

Ideal for kernels with no barriers or shared local memory

(31)

Occupancy (continued…)

More subtle factors:

3. Barriers – 16 barriers per sub slice



Can be a limiting factor for very small local work groups

4. Shared Local Memory – 64KB shared local memory per sub slice



Can be a limiting factor for kernels that use lots of shared local memory

(32)

Agenda

Memory Matters



Host to Device

(33)

Optimizing Host to Device Transfers

33

Host (CPU) and Device (GPU) share the same physical memory

For OpenCL* buffers:



No transfer needed (zero copy)!



Allocate system memory aligned to a cache line (64 bytes)



Create buffer with system memory pointer and

CL_MEM_USE_HOST_PTR



Use

clEnqueueMapBuffer()

to access data

For OpenCL* images:

(34)

Operating on Buffers as Images

34

Intel® Iris™ Graphics supports

cl_khr_image2d_from_buffer



New OpenCL* 1.2 Extension



Treat data as a buffer for some kernels, as an image for others



Some restrictions for zero copy: buffer size, image pitch

0x123

0x456

0x789

(35)

Interop with Graphics and Media APIs / SDKs

35

Intel® Iris™ Graphics supports many Graphics and Media interop extensions:



cl_khr_dx9_media_sharing

(includes DXVA for Intel® Media SDK)



cl_khr_d3d10_sharing



cl_khr_d3d11_sharing



cl_khr_gl_sharing



cl_khr_gl_depth_images



cl_khr_gl_event



cl_khr_gl_msaa_sharing

(36)

Agenda

Memory Matters



Device Access

(37)

EU

Sampler

L1

L3

Sampler

L2

LLC

DRAM

images

buffers

256KB/slice

2-8MB/package

(shared w/ CPU)

EDRAM

(non-inclusive

victim cache)

128MB/package

(Intel® Iris

™ Pro 5200)

Intel® Iris™

Graphics Cache Hierarchy

(38)

__global

and

__constant

Memory

38

Global Memory Accesses go through the L3 Cache

L3 Cache Line is 64 bytes

EU thread accesses to the same L3 Cache Line are collapsed



Order of data within cache line does not matter



Bandwidth determined by number of cache lines accessed



Maximum Bandwidth: 64 bytes / clock / sub slice

Good: Load at least 32-bits of data at a time, starting from a 32-bit aligned address

Best: Load 4 x 32-bits of data at a time, starting from a cache line aligned address

(39)

Global and Constant Memory Access Examples

1. x = data[ get_global_id(0) ]



One cache line, full bandwidth

2. x = data[ n – get_global_id(0) ]



Reverse order, full bandwidth

3. x = data[ get_global_id(0) + 1 ]



Offset, two cache lines, half bandwidth

4. x = data[ get_global_id(0) * 2 ]



Strided, half bandwidth



Very strided, worst-case

39

Cache Line n 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Cache Line n + 1 Global ID: Cache Line n - 1 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 Cache Line n Global ID: Cache Line n 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Cache Line n + 1 Global ID: Cache Line n 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Cache Line n + 1 Global ID: Cache Line n 0 1 2 ...

Cache Line n + 1 Cache Line n + 2 ... Global ID:

(40)

__local

Memory Accesses

40

Local Memory Accesses also go through the L3 Cache!

Key Difference: Local Memory is Banked



Banked at a DWORD granularity, 16 banks



Bandwidth determined by number of bank conflicts



Maximum Bandwidth: Still 64 bytes / clock / sub slice

Supports more access patterns with full bandwidth than Global Memory



No bank conflicts



full bandwidth

(41)

Local Memory Access Examples

1. x = data[ get_global_id(0) + 1 ]



Unique banks, full bandwidth

2. x = data[ get_global_id(0) & ~1 ]



Same address read, full bandwidth



Strided, half bandwidth



Very strided, worst-case



Full bandwidth!

41

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 Bank: 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Global ID: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 Bank: 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Global ID: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 Bank: 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Global ID: 0 1 2 3 4 5 6 0 _... 0 0 0 0 Bank: 0 0 0 Global ID: ... ... ... ... ... ... ... 7 0 1 2 3 4 5 6 0 _... 1 2 3 4 Bank: 5 6 Global ID: ... ... ... ... ... ...

(42)

__private

Memory

Compiler can

usually

allocate Private Memory in the

Register File



Even if Private Memory is dynamically indexed



Good Performance

Fallback: Private Memory allocated in Global Memory



Accesses are very strided



Bad Performance

__private int a[100]

…

EU

Th

read

n

-1

EU

Th

read

n

EU

Th

read

n

+1

Work Item 0 Work Item 1 … Work Item n

__private int b[100]

__private int c[200]

Work Item 0 Work Item 1 … Work Item n Work Item 0 Work Item 1 … Work Item n

(43)

Agenda

Compute Characteristics



Maximizing GFlops

(44)

ISA

“SIMT” ISA with Predication and Branching

“Divergent” code executes both branches



Reduced SIMD Efficiency

44

this();

if ( x )

that();

else

another();

finish();

SIMD lane

time

Example: “x” sometimes true

SIMD lane

time

(45)

Compute GFlops

45

EUs have 2 x 4-wide vector ALUs

Second ALU has limitations:



Subset of instructions:

add

,

mov

,

mad

,

mul

,

cmp



Instruction must come from another EU thread



Only float operands!

Peak GFlops: #EUs x ( 2 x 4-wide ALUs ) x ( MUL + ADD ) x Clock Rate

For Intel® Iris™ Pro 5200: 40 x 8 x 2 x 1.3 = 832 GFlops!

(46)

Maximizing Compute Performance

46

Use

mad()

/

fma()

: Either explicitly with built-ins, or via

-cl-mad-enable

Use

floats

wherever possible to maximize co-issue

Avoid

long

and

size_t

data types



Prefer

float

over

int

, if possible



Using

short

data types may improve performance

Trade accuracy for speed: “native” built-ins,

-cl-fast-relaxed-math

(47)

Agenda

Summary / Questions

(48)

Summary

48

Maximize Occupancy



Choose a Good Local Work Size



Or, Let the Driver Choose (Local Work Size == NULL)

Avoid Host-to-Device Transfers



Create Buffers with

CL_MEM_USE_HOST_PTR

Access Device Memory Efficiently



Minimize Cache Lines for

__global

Memory



Minimize Bank Conflicts for

__local

Memory

Maximize Compute



Avoid Divergent Branches

(49)

Questions / Acknowledgements

This presentation would not have been possible without material and review

comments from many people – Thank you!

Murali Sundaresan, Sushma Rao, Aaron Kunze, Tom Craver, Brijender Bharti, Rami Jiossy,

Michal Mrozek, Jay Rao, Pavan Lanka, Adam Lake, Arnon Peleg, Raun Krisch, Berna Adalier

(50)

Additional Resources

Intel® SDK for OpenCL* Applications 2013

Intel® OpenCL* Optimization Guide

Intel® VTune™ Amplifier XE 2013

Intel® Graphics Performance Analyzers

(51)

Accelerating video production with OpenCL* and Intel® Iris™

Graphics

Dave Helmly,

Sr. Manager, Solutions Consulting Pro Video/Audio Americas, Adobe* Systems Inc.

(52)

Intel is Hiring!

We want to work with you!

www.intel.com/jobs/

‎

Head to our booth (201) to hear about our exciting opportunities!

(53)

Coming up next…

•

2:00 p.m. – 3:00 p.m.

•

Faster Content Creation with Higher Productivity using Intel® Developer Tools and

OpenCL*

•

Presenters:

•

Arnon Peleg (Intel)

•

Raghu Muthyalampalli (Intel)

For more information:

http://software.intel.com/en-us/siggraph2013

(54)

Learn About Intel® Processor Numbers

Intel® SDK for OpenCL* Applications 2013

Intel® OpenCL* Optimization Guide

Intel® VTune™ Amplifier XE 2013

Intel® Graphics Performance Analyzers

http://software.intel.com/en-us/siggraph2013