• No results found

Maximize Application Performance On the Go and In the Cloud with OpenCL* on Intel Architecture

N/A
N/A
Protected

Academic year: 2021

Share "Maximize Application Performance On the Go and In the Cloud with OpenCL* on Intel Architecture"

Copied!
54
0
0

Loading.... (view fulltext now)

Full text

(1)

Maximize Application Performance On the Go and In

the Cloud with OpenCL* on Intel® Architecture

Arnon Peleg (Intel)

Ben Ashbaugh (Intel)

Dave Helmly (Adobe)

(2)

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED,

BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS

OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR

INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

Intel may make changes to specifications and product descriptions at any time, without notice. All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.

Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families: Go to: Learn About Intel® Processor Numbers

Any code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user. Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel’s current plan of record product roadmaps.

Performance claims: Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark and

MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.Intel.com/performance

Iris™ graphics is available on select systems. Consult your system manufacturer. Copyright © 2013 Intel Corporation. All rights reserved.

Intel, Intel Inside, the Intel logo, Centrino, Intel Core, Intel Atom, Pentium, VTune, and Ultrabook are trademarks of Intel Corporation in the United States and other countries. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos. *Other names and brands may be claimed as the property of others.

(3)

Today Intel and OpenCL* sessions

3

12:15 PM -1:45 PM

Maximize Application Performance On the Go and in the

Cloud with OpenCL* on Intel® Architecture

(Intel, Adobe Systems Inc.)

2:00 PM – 3:00 PM

Faster Video Creation with Higher Productivity Using Intel®

Developer Tools and OpenCL*

(Intel)

3:15 PM – 4:15 PM

Journey of Pixels in Adobe Photoshop from CPU, GPU to

Cloud on Intel HD Graphics

(4)

This Session Agenda

4

Introduction to the Intel® SDK for OpenCL* Applications

Presenter: Arnon Peleg (Intel)

Product Manager, Intel SDK for OpenCL Applications, Intel Corporation

Optimize OpenCL* applications for Intel® Iris™ Graphics

Presenter: Ben Ashbaugh (Intel)

Sr. Graphics Complier Engineer, Intel Corporation

Accelerating video production with OpenCL* and Intel® Iris™ Graphics

Presenter: Dave Helmly,

(5)

5

Arnon Peleg, Intel® Cooperation

(6)

The Question Is…

How to get maximum performance out of the platform?

Better Together

Use all resources (CPU, Graphics, Media)

Target the right task on the right device

Complex game & graphics engine graphs,

variable bitrate compression, …

Film & image post-processing filters,

graphics, video analytics, …

Programmable

Graphics

Multicore

CPU

Task-parallel or irregular workloads

better suited to a CPU

Highly data-parallel tasks better suited

to Intel® Processor Graphics

Be the conductor of our orchestra:

Use a dynamic set of instruments, whose ranges overlap

and compliment each other. When these instruments

come together, your composition has even more power to

move your audience.

(7)

What is Programmability With OpenCL*?

Unified Code Base

Video Editing

& Playback

Image

Processing

Perceptual

Computing

Games

Maximize platform performance with Intel®

Iris™ Graphics and Intel® HD Graphics for Visual

Computing Usages

Standard based by

Khronos* with Intel

participation

kernel void dp_mul(global const float *a, global const float *b, global float *c) {

int id = get_global_id(0); c[id] = a[id] * b[id];

} // execute over “n” work-items

CPU

Proc.

GFX

A Standard-based Environment to Write Portable Parallel Code for a Diverse Mix

of Multi-core CPUs, Processor Graphics, and Coprocessors

(8)

The

“BIG“ Idea Behind

OpenCL*

8

OpenCL* execution model …

Define N-dimensional computation domain

Choose a target device

Execute a kernel at each point in computation domain

void

trad_mul(int n,

const float *a,

const float *b,

float *c)

{

int i;

for (i=0; i<n; i++)

c[i] = a[i] * b[i];

}

Traditional loops

kernel

void

dp_mul(

global

const float *a,

global

const float *b,

global

float *c)

{

int id =

get_global_id

(0);

c[id] = a[id] * b[id];

}

// execute over “n”

work-items

(9)

Intel

®

SDK for OpenCL* Applications 2013

9

FREE download at intel.com/software/opencl

Supporting the OpenCL 1.2 Full-Profile

on 3

rd

and 4

th

Generation Intel® Core™ Processors with Intel® Iris™ Graphics and Intel®

HD Graphics product family, Intel® Xeon® Processors, and Intel® Xeon Phi™

Coprocessors

(10)

Intel

®

SDK for OpenCL* Applications

XE

2013

OpenCL* runtime and compiler for Intel® Processors (CPU) and Intel® Xeon® Phi™ Coprocessors

Linux OSs support:

Red Hat EL 6.x

SUSE SLES 11.x

OpenCL headers and libs

Source Code Samples

Product documentation – User guide, Optimization Guide

Development tools

Offline build and analysis tools

Debugger for OpenCL kernels on CPU

Profiling with Intel® VTune™ Amplifier XE

What is it?

(11)

Intel

®

SDK for OpenCL* Applications 2013

OpenCL* 1.2 support for 3rd and Future 4th Generation Intel® Core™

Processors

Accelerating performance with Intel® Iris™ Graphics and Intel® HD Graphics

family

Enhanced graphics and media APIs interoperability

DirectX*, OpenGL*, and the Intel® Media SDK

Support for Microsoft Windows 7* and Windows 8* Operating Systems

Developers tools for build, debug and tune of OpenCL applications.

What is it?

(12)

The Intel® SDK for OpenCL* Applications 2013 Software Stack

For Intel® Core™ Processors with Intel® Iris™ Graphics and Intel® HD Graphics

3rd and 4th Generation Intel® Core™ Processors

With Intel® Iris™ Graphics and Intel® HD Graphics

Intel HD Graphics Driver

With OpenCL* 1.2 support for CPU and the Processor Graphics

Intel® SDK for OpenCL* Applications

Microsoft*

Visual Studio*

Integration

SDK Tools

- Kernel Debugger

- Kernel Builder

Developer

Environment

(libs and

headers)

Profiling Tools

Integration

Online

Resources:

- Code Samples

- Optimization Guide

- Tech Articles and

Videos

Applications

Develop

Run

(13)

Visual Computing

Domain

Data Center

Domain

Intel® SDK for OpenCL* Applications Support Matrix

(Version 2013)

(Version XE 2013)

Supported Processors:

Intel® HD & Iris™ Graphics

Intel® Core Processor (CPU)

Intel® Xeon® Processor

Intel® Xeon Phi™ Coprocessor

Operating Systems:

Windows 7*

Window 8*

Red Hat* Linux*

SUSE* Linux*

Know the Processors and Operating Systems

You Use and download the SDK You Need

(14)

The Intel® SDK for OpenCL* Applications Online Resource

intel.com/software/opencl

@intelopencl

The SDK section of the Intel® Developers Zone is a one-stop shop for

resources, support and information for OpenCL* developers

Free Downloads

Code Samples

Tech Articles

Case Studies

Forums and Support

Beta Programs

(15)

Optimize OpenCL* applications for Intel® Iris™ Graphics

Ben Ashbaugh (Intel)

(16)

Agenda

Understanding Occupancy

How Intel® Iris™ Graphics executes OpenCL* Kernels

Memory Matters

Host to Device

Device Access

Compute Characteristics

Maximizing GFlops

Summary / Questions

16

(17)

Agenda

Understanding Occupancy

How Intel® Iris™ Graphics executes OpenCL* Kernels

(18)

Intel® Iris™

Graphics Architecture Overview

18

R

ing

B

us

/

LL

C

/

M

em

ory

Command Streamer (CS) Vertex Fetch (VF) Vertex Shader (VS) Hull Shader (HS) Tessellator Domain Shader (DS) Geometry Shader (GS) Stream Out (SOL) Clip/Setup Th rea d D is pa tc h Video Front End (VFE) Video Quality Engine Multi-Format

CODEC Blitter Display

Slice 0

L3$ Pixel Ops Rasterizer / Depth Render$ Depth$ L1 IC$ 3D Sampler Media Sampler Data Port Tex$ Su b Sl ic e 1 L1 IC$ 3D Sampler Media Sampler Data Port Tex$ Su b Sl ic e 0 EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU

Slice 1

L3$ Pixel Ops Rasterizer / Depth Render$ Depth$ L1 IC$ 3D Sampler Media Sampler Data Port Tex$ Su b Sl ic e 3 L1 IC$ 3D Sampler Media Sampler Data Port Tex$ Su b Sl ic e 2 EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU
(19)

Intel® Iris™

Graphics Architecture Overview

19

R

ing

B

us

/

LL

C

/

M

em

ory

Command Streamer (CS) Vertex Fetch (VF) Vertex Shader (VS) Hull Shader (HS) Tessellator Domain Shader (DS) Geometry Shader (GS) Stream Out (SOL) Clip/Setup Th rea d D is pa tc h Video Front End (VFE) Video Quality Engine Multi-Format

CODEC Blitter Display

Slice 0

L3$ Pixel Ops Rasterizer / Depth Render$ Depth$ L1 IC$ 3D Sampler Media Sampler Data Port Tex$ Su b Sl ic e 1 L1 IC$ 3D Sampler Media Sampler Data Port Tex$ Su b Sl ic e 0 EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU

Slice 1

L3$ Pixel Ops Rasterizer / Depth Render$ Depth$ L1 IC$ 3D Sampler Media Sampler Data Port Tex$ Su b Sl ic e 3 L1 IC$ 3D Sampler Media Sampler Data Port Tex$ Su b Sl ic e 2 EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU

Global Assets

Command Streamer

Thread Dispatch

(20)

Intel® Iris™

Graphics Architecture Overview

20

R

ing

B

us

/

LL

C

/

M

em

ory

Command Streamer (CS) Vertex Fetch (VF) Vertex Shader (VS) Hull Shader (HS) Tessellator Domain Shader (DS) Geometry Shader (GS) Stream Out (SOL) Clip/Setup Th rea d D is pa tc h Video Front End (VFE) Video Quality Engine Multi-Format

CODEC Blitter Display

Slice 0

L3$ Pixel Ops Rasterizer / Depth Render$ Depth$ L1 IC$ 3D Sampler Media Sampler Data Port Tex$ Su b Sl ic e 1 L1 IC$ 3D Sampler Media Sampler Data Port Tex$ Su b Sl ic e 0 EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU

Slice 1

L3$ Pixel Ops Rasterizer / Depth Render$ Depth$ L1 IC$ 3D Sampler Media Sampler Data Port Tex$ Su b Sl ic e 3 L1 IC$ 3D Sampler Media Sampler Data Port Tex$ Su b Sl ic e 2 EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU

Slice Common

L3 Cache

(21)

Intel® Iris™

Graphics Architecture Overview

21

R

ing

B

us

/

LL

C

/

M

em

ory

Command Streamer (CS) Vertex Fetch (VF) Vertex Shader (VS) Hull Shader (HS) Tessellator Domain Shader (DS) Geometry Shader (GS) Stream Out (SOL) Clip/Setup Th rea d D is pa tc h Video Front End (VFE) Video Quality Engine Multi-Format

CODEC Blitter Display

Slice 0

L3$ Pixel Ops Rasterizer / Depth Render$ Depth$ L1 IC$ 3D Sampler Media Sampler Data Port Tex$ Su b Sl ic e 1 L1 IC$ 3D Sampler Media Sampler Data Port Tex$ Su b Sl ic e 0 EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU

Slice 1

L3$ Pixel Ops Rasterizer / Depth Render$ Depth$ L1 IC$ 3D Sampler Media Sampler Data Port Tex$ Su b Sl ic e 3 L1 IC$ 3D Sampler Media Sampler Data Port Tex$ Su b Sl ic e 2 EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU

Sub Slice

Execution Units

Samplers and Data Port

(22)

Intel® Iris™ Graphics Architecture Building Blocks

OpenCL* Kernels run on an Execution Unit (EU)

Each EU is a Multi-Threaded SIMD Processor

Up to 7 threads per EU

128 x 8 x 32-bit registers per thread

Up to 8, 16, or 32 OpenCL* work items per

thread (compiler-controlled)

“SIMD8”, “SIMD16”, “SIMD32”

SIMD8

More Registers

SIMD16 and SIMD32

Better Efficiency

22

EU

Thread 0

Thread 2

Thread 4

Thread 6

Thread 1

`

Thread 3

Thread 5

(23)

Intel® Iris™ Graphics Architecture Building Blocks

OpenCL* Work Groups run on a Sub Slice

10 EUs per Sub Slice

Texture Sampler (Images)

Data Port (Buffers)

Instruction and Texture Caches

23

OpenCL* Work Groups may run on multiple EU threads, on multiple EUs!

Su

b

Sl

ic

e

L1

IC$

3D Sampler

Media

Sampler

Data Port

Tex$

EU

EU

EU

EU

EU

EU

EU

EU

EU

EU

(24)

Intel® Iris™ Graphics Architecture Building Blocks

Two Sub Slices make a Slice

Shared Resources: “Slice Common”

L3 Cache + Shared Local Memory

Barriers

Intel® Iris™ Graphics has Two Slices

2 x 2 = 4 Sub Slices

4 x 10 = 40 EUs

Up to 40 x 7 = 280 EU threads

Up to 8960 OpenCL* work items in flight!

24

Slice

L3$ Pixel Ops Rasterizer / Depth Render$ Depth$ L1 IC$ 3D Sampler Media Sampler Data Port Tex$ Su b Sl ic e L1 IC$ 3D Sampler Media Sampler Data Port Tex$ Su b Sl ic e EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU
(25)

How Intel® Iris™

Graphics Runs OpenCL*

25

(26)

How Intel® Iris™

Graphics Runs OpenCL*

26

(27)

Su

b

Sl

ic

e

L1

IC$

3D Sampler

Media

Sampler

Data Port

Tex$

EU

EU

EU

EU

EU

EU

EU

EU

EU

EU

How Intel® Iris™ Graphics Runs OpenCL*

27

2. Launch EU Threads for the Work Group Onto a Sub Slice

Repeat for each Work Group

Must have enough room in the Sub Slice for all EU threads for the Work Group

Not enough room in any Sub Slice

EU threads must wait

(28)

Occupancy

Goal: Use All Machine Resources

This is harder than it sounds! Many factors to consider…

1. Launch Enough Work

One thread sufficient to prevent an EU from going

idle

Too few EU threads can result in an EU being

stalled

More EU threads

better latency coverage

keeps an EU

active

(29)

Occupancy

Intel® VTune

™ Amplifier XE 2013

(30)

Occupancy (continued…)

2. Don’t Waste SIMD Lanes

Use an optimal Local Work Size

Good: Query for compiled SIMD size:

CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE

Occasionally Helpful: Compile for a specific local work size (8, 16, or 32):

__attribute__((reqd_work_group_size(X, Y, Z)))

Best: Let the driver pick (Local Work Size == NULL)

Ideal for kernels with no barriers or shared local memory

(31)

Occupancy (continued…)

More subtle factors:

3. Barriers – 16 barriers per sub slice

Can be a limiting factor for very small local work groups

4. Shared Local Memory – 64KB shared local memory per sub slice

Can be a limiting factor for kernels that use lots of shared local memory

(32)

Agenda

Memory Matters

Host to Device

(33)

Optimizing Host to Device Transfers

33

Host (CPU) and Device (GPU) share the same physical memory

For OpenCL* buffers:

No transfer needed (zero copy)!

Allocate system memory aligned to a cache line (64 bytes)

Create buffer with system memory pointer and

CL_MEM_USE_HOST_PTR

Use

clEnqueueMapBuffer()

to access data

For OpenCL* images:

(34)

Operating on Buffers as Images

34

Intel® Iris™ Graphics supports

cl_khr_image2d_from_buffer

New OpenCL* 1.2 Extension

Treat data as a buffer for some kernels, as an image for others

Some restrictions for zero copy: buffer size, image pitch

0x123

0x456

0x789

(35)

Interop with Graphics and Media APIs / SDKs

35

Intel® Iris™ Graphics supports many Graphics and Media interop extensions:

cl_khr_dx9_media_sharing

(includes DXVA for Intel® Media SDK)

cl_khr_d3d10_sharing

cl_khr_d3d11_sharing

cl_khr_gl_sharing

cl_khr_gl_depth_images

cl_khr_gl_event

cl_khr_gl_msaa_sharing

(36)

Agenda

Memory Matters

Device Access

(37)

EU

Sampler

L1

L3

Sampler

L2

LLC

DRAM

images

buffers

256KB/slice

2-8MB/package

(shared w/ CPU)

EDRAM

(non-inclusive

victim cache)

128MB/package

(Intel® Iris

™ Pro 5200)

Intel® Iris™

Graphics Cache Hierarchy

(38)

__global

and

__constant

Memory

38

Global Memory Accesses go through the L3 Cache

L3 Cache Line is 64 bytes

EU thread accesses to the same L3 Cache Line are collapsed

Order of data within cache line does not matter

Bandwidth determined by number of cache lines accessed

Maximum Bandwidth: 64 bytes / clock / sub slice

Good: Load at least 32-bits of data at a time, starting from a 32-bit aligned address

Best: Load 4 x 32-bits of data at a time, starting from a cache line aligned address

(39)

Global and Constant Memory Access Examples

1. x = data[ get_global_id(0) ]

One cache line, full bandwidth

2. x = data[ n – get_global_id(0) ]

Reverse order, full bandwidth

3. x = data[ get_global_id(0) + 1 ]

Offset, two cache lines, half bandwidth

4. x = data[ get_global_id(0) * 2 ]

Strided, half bandwidth

5. x = data[ get_global_id(0) * 16 ]

Very strided, worst-case

39

Cache Line n 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Cache Line n + 1 Global ID: Cache Line n - 1 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 Cache Line n Global ID: Cache Line n 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Cache Line n + 1 Global ID: Cache Line n 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Cache Line n + 1 Global ID: Cache Line n 0 1 2 ...

Cache Line n + 1 Cache Line n + 2 ... Global ID:

(40)

__local

Memory Accesses

40

Local Memory Accesses also go through the L3 Cache!

Key Difference: Local Memory is Banked

Banked at a DWORD granularity, 16 banks

Bandwidth determined by number of bank conflicts

Maximum Bandwidth: Still 64 bytes / clock / sub slice

Supports more access patterns with full bandwidth than Global Memory

No bank conflicts

full bandwidth

(41)

Local Memory Access Examples

1. x = data[ get_global_id(0) + 1 ]

Unique banks, full bandwidth

2. x = data[ get_global_id(0) & ~1 ]

Same address read, full bandwidth

3. x = data[ get_global_id(0) * 2 ]

Strided, half bandwidth

4. x = data[ get_global_id(0) * 16 ]

Very strided, worst-case

5. x = data[ get_global_id(0) * 17 ]

Full bandwidth!

41

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 Bank: 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Global ID: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 Bank: 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Global ID: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 Bank: 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Global ID: 0 1 2 3 4 5 6 0 ... 0 0 0 0 Bank: 0 0 0 Global ID: ... ... ... ... ... ... ... 7 0 1 2 3 4 5 6 0 ... 1 2 3 4 Bank: 5 6 Global ID: ... ... ... ... ... ...
(42)

__private

Memory

Compiler can

usually

allocate Private Memory in the

Register File

Even if Private Memory is dynamically indexed

Good Performance

Fallback: Private Memory allocated in Global Memory

Accesses are very strided

Bad Performance

__private int a[100]

EU

Th

read

n

-1

EU

Th

read

n

EU

Th

read

n

+1

Work Item 0 Work Item 1 Work Item n

__private int b[100]

__private int c[200]

Work Item 0 Work Item 1 Work Item n Work Item 0 Work Item 1 Work Item n
(43)

Agenda

Compute Characteristics

Maximizing GFlops

(44)

ISA

“SIMT” ISA with Predication and Branching

“Divergent” code executes both branches

Reduced SIMD Efficiency

44

this();

if ( x )

that();

else

another();

finish();

SIMD lane

time

Example: “x” sometimes true

SIMD lane

time

(45)

Compute GFlops

45

EUs have 2 x 4-wide vector ALUs

Second ALU has limitations:

Subset of instructions:

add

,

mov

,

mad

,

mul

,

cmp

Instruction must come from another EU thread

Only float operands!

Peak GFlops: #EUs x ( 2 x 4-wide ALUs ) x ( MUL + ADD ) x Clock Rate

For Intel® Iris™ Pro 5200: 40 x 8 x 2 x 1.3 = 832 GFlops!

(46)

Maximizing Compute Performance

46

Use

mad()

/

fma()

: Either explicitly with built-ins, or via

-cl-mad-enable

Use

floats

wherever possible to maximize co-issue

Avoid

long

and

size_t

data types

Prefer

float

over

int

, if possible

Using

short

data types may improve performance

Trade accuracy for speed: “native” built-ins,

-cl-fast-relaxed-math

(47)

Agenda

Summary / Questions

(48)

Summary

48

Maximize Occupancy

Choose a Good Local Work Size

Or, Let the Driver Choose (Local Work Size == NULL)

Avoid Host-to-Device Transfers

Create Buffers with

CL_MEM_USE_HOST_PTR

Access Device Memory Efficiently

Minimize Cache Lines for

__global

Memory

Minimize Bank Conflicts for

__local

Memory

Maximize Compute

Avoid Divergent Branches

(49)

Questions / Acknowledgements

This presentation would not have been possible without material and review

comments from many people – Thank you!

Murali Sundaresan, Sushma Rao, Aaron Kunze, Tom Craver, Brijender Bharti, Rami Jiossy,

Michal Mrozek, Jay Rao, Pavan Lanka, Adam Lake, Arnon Peleg, Raun Krisch, Berna Adalier

(50)

Additional Resources

Intel® SDK for OpenCL* Applications 2013

Intel® OpenCL* Optimization Guide

Intel® VTune™ Amplifier XE 2013

Intel® Graphics Performance Analyzers

(51)

Accelerating video production with OpenCL* and Intel® Iris™

Graphics

Dave Helmly,

Sr. Manager, Solutions Consulting Pro Video/Audio Americas, Adobe* Systems Inc.

(52)

Intel is Hiring!

We want to work with you!

www.intel.com/jobs/

Head to our booth (201) to hear about our exciting opportunities!

(53)

Coming up next…

2:00 p.m. – 3:00 p.m.

Faster Content Creation with Higher Productivity using Intel® Developer Tools and

OpenCL*

Presenters:

Arnon Peleg (Intel)

Raghu Muthyalampalli (Intel)

For more information:

http://software.intel.com/en-us/siggraph2013

(54)
Learn About Intel® Processor Numbers Intel® SDK for OpenCL* Applications 2013 Intel® OpenCL* Optimization Guide Intel® VTune™ Amplifier XE 2013 Intel® Graphics Performance Analyzers http://software.intel.com/en-us/siggraph2013

References

Related documents

Immediate assi#nment success rate indicates t/e success rate of t/e MS accessin# t/e si#nalin# c/annel9 It concerns t/e &#34;rocedure from t/e MS sendin# a c/annel re&gt;uired

The Intel Xeon processor 5500 series- based servers are ideal for highly dense cloud architecture applications because of the optimized power foot print and Intel® Intelligent Power

Abel minden energiáját arra összpontosította, hogy egyre több szállodát építsen – Osborne képvisel segítségével, aki a jelek szerint mindenhol meg tudta

Listen, I tell you one more fact that this truth is revealed only to the one who develops love for God in the heart... Have faith that only Sat Guru can

Both programmes also categorised the suppliers into several categories: the Australian programme into three categories, while for the Malaysian programme, the

Material and methods: Data on hysterectomy procedures performed between 2011 and 2016 in Poland acquired from the National Health Fund reports were extracted and analyzed..

Caring and in the yogscast swear and listen to make references to does santa claus, playlists if you and your music?. Troll or phone number of your account, brand new music to

Intel Iris Xe Graphics capability requires system to be configured with Intel Core i5 or i7 processor and dual channel memory.. On the system with Intel Core i5 or i7 processor