Maximize Application Performance On the Go and In
the Cloud with OpenCL* on Intel® Architecture
Arnon Peleg (Intel)
Ben Ashbaugh (Intel)
Dave Helmly (Adobe)
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED,
BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS
OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR
INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.
Intel may make changes to specifications and product descriptions at any time, without notice. All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.
Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families: Go to: Learn About Intel® Processor Numbers
Any code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user. Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel’s current plan of record product roadmaps.
Performance claims: Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark and
MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.Intel.com/performance
Iris™ graphics is available on select systems. Consult your system manufacturer. Copyright © 2013 Intel Corporation. All rights reserved.
Intel, Intel Inside, the Intel logo, Centrino, Intel Core, Intel Atom, Pentium, VTune, and Ultrabook are trademarks of Intel Corporation in the United States and other countries. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos. *Other names and brands may be claimed as the property of others.
Today Intel and OpenCL* sessions
3
12:15 PM -1:45 PM
Maximize Application Performance On the Go and in the
Cloud with OpenCL* on Intel® Architecture
(Intel, Adobe Systems Inc.)
2:00 PM – 3:00 PM
Faster Video Creation with Higher Productivity Using Intel®
Developer Tools and OpenCL*
(Intel)
3:15 PM – 4:15 PM
Journey of Pixels in Adobe Photoshop from CPU, GPU to
Cloud on Intel HD Graphics
This Session Agenda
4
•
Introduction to the Intel® SDK for OpenCL* Applications
Presenter: Arnon Peleg (Intel)
Product Manager, Intel SDK for OpenCL Applications, Intel Corporation
•
Optimize OpenCL* applications for Intel® Iris™ Graphics
Presenter: Ben Ashbaugh (Intel)
Sr. Graphics Complier Engineer, Intel Corporation
•
Accelerating video production with OpenCL* and Intel® Iris™ Graphics
Presenter: Dave Helmly,
5
Arnon Peleg, Intel® Cooperation
The Question Is…
How to get maximum performance out of the platform?
Better Together
Use all resources (CPU, Graphics, Media)
Target the right task on the right device
Complex game & graphics engine graphs,
variable bitrate compression, …
Film & image post-processing filters,
graphics, video analytics, …
Programmable
Graphics
Multicore
CPU
Task-parallel or irregular workloads
better suited to a CPU
Highly data-parallel tasks better suited
to Intel® Processor Graphics
Be the conductor of our orchestra:
Use a dynamic set of instruments, whose ranges overlap
and compliment each other. When these instruments
come together, your composition has even more power to
move your audience.
What is Programmability With OpenCL*?
Unified Code Base
Video Editing
& Playback
Image
Processing
Perceptual
Computing
Games
Maximize platform performance with Intel®
Iris™ Graphics and Intel® HD Graphics for Visual
Computing Usages
Standard based by
Khronos* with Intel
participation
kernel void dp_mul(global const float *a, global const float *b, global float *c) {
int id = get_global_id(0); c[id] = a[id] * b[id];
} // execute over “n” work-items
CPU
Proc.
GFX
A Standard-based Environment to Write Portable Parallel Code for a Diverse Mix
of Multi-core CPUs, Processor Graphics, and Coprocessors
The
“BIG“ Idea Behind
OpenCL*
8
OpenCL* execution model …
Define N-dimensional computation domain
Choose a target device
Execute a kernel at each point in computation domain
void
trad_mul(int n,
const float *a,
const float *b,
float *c)
{
int i;
for (i=0; i<n; i++)
c[i] = a[i] * b[i];
}
Traditional loops
kernel
void
dp_mul(
global
const float *a,
global
const float *b,
global
float *c)
{
int id =
get_global_id
(0);
c[id] = a[id] * b[id];
}
// execute over “n”
work-items
Intel
®
SDK for OpenCL* Applications 2013
9
FREE download at intel.com/software/opencl
Supporting the OpenCL 1.2 Full-Profile
on 3
rd
and 4
th
Generation Intel® Core™ Processors with Intel® Iris™ Graphics and Intel®
HD Graphics product family, Intel® Xeon® Processors, and Intel® Xeon Phi™
Coprocessors
Intel
®
SDK for OpenCL* Applications
XE
2013
OpenCL* runtime and compiler for Intel® Processors (CPU) and Intel® Xeon® Phi™ Coprocessors
Linux OSs support:
Red Hat EL 6.x
SUSE SLES 11.x
OpenCL headers and libs
Source Code Samples
Product documentation – User guide, Optimization Guide
Development tools
Offline build and analysis tools
Debugger for OpenCL kernels on CPU
Profiling with Intel® VTune™ Amplifier XE
What is it?
Intel
®
SDK for OpenCL* Applications 2013
OpenCL* 1.2 support for 3rd and Future 4th Generation Intel® Core™
Processors
Accelerating performance with Intel® Iris™ Graphics and Intel® HD Graphics
family
Enhanced graphics and media APIs interoperability
DirectX*, OpenGL*, and the Intel® Media SDK
Support for Microsoft Windows 7* and Windows 8* Operating Systems
Developers tools for build, debug and tune of OpenCL applications.
What is it?
The Intel® SDK for OpenCL* Applications 2013 Software Stack
For Intel® Core™ Processors with Intel® Iris™ Graphics and Intel® HD Graphics
3rd and 4th Generation Intel® Core™ Processors
With Intel® Iris™ Graphics and Intel® HD Graphics
Intel HD Graphics Driver
With OpenCL* 1.2 support for CPU and the Processor Graphics
Intel® SDK for OpenCL* Applications
Microsoft*
Visual Studio*
Integration
SDK Tools
- Kernel Debugger
- Kernel Builder
Developer
Environment
(libs and
headers)
Profiling Tools
Integration
Online
Resources:
- Code Samples
- Optimization Guide
- Tech Articles and
Videos
Applications
Develop
Run
Visual Computing
Domain
Data Center
Domain
Intel® SDK for OpenCL* Applications Support Matrix
(Version 2013)
(Version XE 2013)
Supported Processors:
Intel® HD & Iris™ Graphics
Intel® Core Processor (CPU)
Intel® Xeon® Processor
Intel® Xeon Phi™ Coprocessor
Operating Systems:
Windows 7*
Window 8*
Red Hat* Linux*
SUSE* Linux*
Know the Processors and Operating Systems
You Use and download the SDK You Need
The Intel® SDK for OpenCL* Applications Online Resource
intel.com/software/opencl
@intelopencl
The SDK section of the Intel® Developers Zone is a one-stop shop for
resources, support and information for OpenCL* developers
Free Downloads
Code Samples
Tech Articles
Case Studies
Forums and Support
Beta Programs
Optimize OpenCL* applications for Intel® Iris™ Graphics
Ben Ashbaugh (Intel)
Agenda
Understanding Occupancy
How Intel® Iris™ Graphics executes OpenCL* Kernels
Memory Matters
Host to Device
Device Access
Compute Characteristics
Maximizing GFlops
Summary / Questions
16
Agenda
Understanding Occupancy
How Intel® Iris™ Graphics executes OpenCL* Kernels
Intel® Iris™
Graphics Architecture Overview
18
R
ing
B
us
/
LL
C
/
M
em
ory
Command Streamer (CS) Vertex Fetch (VF) Vertex Shader (VS) Hull Shader (HS) Tessellator Domain Shader (DS) Geometry Shader (GS) Stream Out (SOL) Clip/Setup Th rea d D is pa tc h Video Front End (VFE) Video Quality Engine Multi-FormatCODEC Blitter Display
Slice 0
L3$ Pixel Ops Rasterizer / Depth Render$ Depth$ L1 IC$ 3D Sampler Media Sampler Data Port Tex$ Su b Sl ic e 1 L1 IC$ 3D Sampler Media Sampler Data Port Tex$ Su b Sl ic e 0 EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EUSlice 1
L3$ Pixel Ops Rasterizer / Depth Render$ Depth$ L1 IC$ 3D Sampler Media Sampler Data Port Tex$ Su b Sl ic e 3 L1 IC$ 3D Sampler Media Sampler Data Port Tex$ Su b Sl ic e 2 EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EUIntel® Iris™
Graphics Architecture Overview
19
R
ing
B
us
/
LL
C
/
M
em
ory
Command Streamer (CS) Vertex Fetch (VF) Vertex Shader (VS) Hull Shader (HS) Tessellator Domain Shader (DS) Geometry Shader (GS) Stream Out (SOL) Clip/Setup Th rea d D is pa tc h Video Front End (VFE) Video Quality Engine Multi-FormatCODEC Blitter Display
Slice 0
L3$ Pixel Ops Rasterizer / Depth Render$ Depth$ L1 IC$ 3D Sampler Media Sampler Data Port Tex$ Su b Sl ic e 1 L1 IC$ 3D Sampler Media Sampler Data Port Tex$ Su b Sl ic e 0 EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EUSlice 1
L3$ Pixel Ops Rasterizer / Depth Render$ Depth$ L1 IC$ 3D Sampler Media Sampler Data Port Tex$ Su b Sl ic e 3 L1 IC$ 3D Sampler Media Sampler Data Port Tex$ Su b Sl ic e 2 EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EUGlobal Assets
•
Command Streamer
•
Thread Dispatch
Intel® Iris™
Graphics Architecture Overview
20
R
ing
B
us
/
LL
C
/
M
em
ory
Command Streamer (CS) Vertex Fetch (VF) Vertex Shader (VS) Hull Shader (HS) Tessellator Domain Shader (DS) Geometry Shader (GS) Stream Out (SOL) Clip/Setup Th rea d D is pa tc h Video Front End (VFE) Video Quality Engine Multi-FormatCODEC Blitter Display
Slice 0
L3$ Pixel Ops Rasterizer / Depth Render$ Depth$ L1 IC$ 3D Sampler Media Sampler Data Port Tex$ Su b Sl ic e 1 L1 IC$ 3D Sampler Media Sampler Data Port Tex$ Su b Sl ic e 0 EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EUSlice 1
L3$ Pixel Ops Rasterizer / Depth Render$ Depth$ L1 IC$ 3D Sampler Media Sampler Data Port Tex$ Su b Sl ic e 3 L1 IC$ 3D Sampler Media Sampler Data Port Tex$ Su b Sl ic e 2 EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EUSlice Common
•
L3 Cache
Intel® Iris™
Graphics Architecture Overview
21
R
ing
B
us
/
LL
C
/
M
em
ory
Command Streamer (CS) Vertex Fetch (VF) Vertex Shader (VS) Hull Shader (HS) Tessellator Domain Shader (DS) Geometry Shader (GS) Stream Out (SOL) Clip/Setup Th rea d D is pa tc h Video Front End (VFE) Video Quality Engine Multi-FormatCODEC Blitter Display
Slice 0
L3$ Pixel Ops Rasterizer / Depth Render$ Depth$ L1 IC$ 3D Sampler Media Sampler Data Port Tex$ Su b Sl ic e 1 L1 IC$ 3D Sampler Media Sampler Data Port Tex$ Su b Sl ic e 0 EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EUSlice 1
L3$ Pixel Ops Rasterizer / Depth Render$ Depth$ L1 IC$ 3D Sampler Media Sampler Data Port Tex$ Su b Sl ic e 3 L1 IC$ 3D Sampler Media Sampler Data Port Tex$ Su b Sl ic e 2 EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EUSub Slice
•
Execution Units
•
Samplers and Data Port
Intel® Iris™ Graphics Architecture Building Blocks
OpenCL* Kernels run on an Execution Unit (EU)
Each EU is a Multi-Threaded SIMD Processor
Up to 7 threads per EU
128 x 8 x 32-bit registers per thread
Up to 8, 16, or 32 OpenCL* work items per
thread (compiler-controlled)
“SIMD8”, “SIMD16”, “SIMD32”
SIMD8
More Registers
SIMD16 and SIMD32
Better Efficiency
22
EU
Thread 0
Thread 2
Thread 4
Thread 6
Thread 1
`Thread 3
Thread 5
Intel® Iris™ Graphics Architecture Building Blocks
OpenCL* Work Groups run on a Sub Slice
10 EUs per Sub Slice
Texture Sampler (Images)
Data Port (Buffers)
Instruction and Texture Caches
23
OpenCL* Work Groups may run on multiple EU threads, on multiple EUs!
Su
b
Sl
ic
e
L1
IC$
3D Sampler
Media
Sampler
Data Port
Tex$
EU
EU
EU
EU
EU
EU
EU
EU
EU
EU
Intel® Iris™ Graphics Architecture Building Blocks
Two Sub Slices make a Slice
Shared Resources: “Slice Common”
L3 Cache + Shared Local Memory
Barriers
Intel® Iris™ Graphics has Two Slices
2 x 2 = 4 Sub Slices
4 x 10 = 40 EUs
Up to 40 x 7 = 280 EU threads
Up to 8960 OpenCL* work items in flight!
24
Slice
L3$ Pixel Ops Rasterizer / Depth Render$ Depth$ L1 IC$ 3D Sampler Media Sampler Data Port Tex$ Su b Sl ic e L1 IC$ 3D Sampler Media Sampler Data Port Tex$ Su b Sl ic e EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EU EUHow Intel® Iris™
Graphics Runs OpenCL*
25
How Intel® Iris™
Graphics Runs OpenCL*
26
Su
b
Sl
ic
e
L1
IC$
3D Sampler
Media
Sampler
Data Port
Tex$
EU
EU
EU
EU
EU
EU
EU
EU
EU
EU
How Intel® Iris™ Graphics Runs OpenCL*
27
2. Launch EU Threads for the Work Group Onto a Sub Slice
Repeat for each Work Group
Must have enough room in the Sub Slice for all EU threads for the Work Group
Not enough room in any Sub Slice
EU threads must wait
Occupancy
Goal: Use All Machine Resources
This is harder than it sounds! Many factors to consider…
1. Launch Enough Work
One thread sufficient to prevent an EU from going
idle
Too few EU threads can result in an EU being
stalled
More EU threads
better latency coverage
keeps an EU
active
Occupancy
–
Intel® VTune
™ Amplifier XE 2013
Occupancy (continued…)
2. Don’t Waste SIMD Lanes
Use an optimal Local Work Size
Good: Query for compiled SIMD size:
CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE
Occasionally Helpful: Compile for a specific local work size (8, 16, or 32):
__attribute__((reqd_work_group_size(X, Y, Z)))
Best: Let the driver pick (Local Work Size == NULL)
Ideal for kernels with no barriers or shared local memory
Occupancy (continued…)
More subtle factors:
3. Barriers – 16 barriers per sub slice
Can be a limiting factor for very small local work groups
4. Shared Local Memory – 64KB shared local memory per sub slice
Can be a limiting factor for kernels that use lots of shared local memory
Agenda
Memory Matters
Host to Device
Optimizing Host to Device Transfers
33
Host (CPU) and Device (GPU) share the same physical memory
For OpenCL* buffers:
No transfer needed (zero copy)!
Allocate system memory aligned to a cache line (64 bytes)
Create buffer with system memory pointer and
CL_MEM_USE_HOST_PTR
Use
clEnqueueMapBuffer()
to access data
For OpenCL* images:
Operating on Buffers as Images
34
Intel® Iris™ Graphics supports
cl_khr_image2d_from_buffer
New OpenCL* 1.2 Extension
Treat data as a buffer for some kernels, as an image for others
Some restrictions for zero copy: buffer size, image pitch
0x123
0x456
0x789
Interop with Graphics and Media APIs / SDKs
35
Intel® Iris™ Graphics supports many Graphics and Media interop extensions:
cl_khr_dx9_media_sharing
(includes DXVA for Intel® Media SDK)
cl_khr_d3d10_sharing
cl_khr_d3d11_sharing
cl_khr_gl_sharing
cl_khr_gl_depth_images
cl_khr_gl_event
cl_khr_gl_msaa_sharing
Agenda
Memory Matters
Device Access
EU
Sampler
L1
L3
Sampler
L2
LLC
DRAM
images
buffers
256KB/slice
2-8MB/package
(shared w/ CPU)
EDRAM
(non-inclusive
victim cache)
128MB/package
(Intel® Iris
™ Pro 5200)
Intel® Iris™
Graphics Cache Hierarchy
__global
and
__constant
Memory
38
Global Memory Accesses go through the L3 Cache
L3 Cache Line is 64 bytes
EU thread accesses to the same L3 Cache Line are collapsed
Order of data within cache line does not matter
Bandwidth determined by number of cache lines accessed
Maximum Bandwidth: 64 bytes / clock / sub slice
Good: Load at least 32-bits of data at a time, starting from a 32-bit aligned address
Best: Load 4 x 32-bits of data at a time, starting from a cache line aligned address
Global and Constant Memory Access Examples
1. x = data[ get_global_id(0) ]
One cache line, full bandwidth
2. x = data[ n – get_global_id(0) ]
Reverse order, full bandwidth
3. x = data[ get_global_id(0) + 1 ]
Offset, two cache lines, half bandwidth
4. x = data[ get_global_id(0) * 2 ]
Strided, half bandwidth
5. x = data[ get_global_id(0) * 16 ]
Very strided, worst-case
39
Cache Line n 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Cache Line n + 1 Global ID: Cache Line n - 1 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 Cache Line n Global ID: Cache Line n 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Cache Line n + 1 Global ID: Cache Line n 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Cache Line n + 1 Global ID: Cache Line n 0 1 2 ...Cache Line n + 1 Cache Line n + 2 ... Global ID:
__local
Memory Accesses
40
Local Memory Accesses also go through the L3 Cache!
Key Difference: Local Memory is Banked
Banked at a DWORD granularity, 16 banks
Bandwidth determined by number of bank conflicts
Maximum Bandwidth: Still 64 bytes / clock / sub slice
Supports more access patterns with full bandwidth than Global Memory
No bank conflicts
full bandwidth
Local Memory Access Examples
1. x = data[ get_global_id(0) + 1 ]
Unique banks, full bandwidth
2. x = data[ get_global_id(0) & ~1 ]
Same address read, full bandwidth
3. x = data[ get_global_id(0) * 2 ]
Strided, half bandwidth
4. x = data[ get_global_id(0) * 16 ]
Very strided, worst-case
5. x = data[ get_global_id(0) * 17 ]
Full bandwidth!
41
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 Bank: 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Global ID: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 Bank: 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Global ID: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 Bank: 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Global ID: 0 1 2 3 4 5 6 0 ... 0 0 0 0 Bank: 0 0 0 Global ID: ... ... ... ... ... ... ... 7 0 1 2 3 4 5 6 0 ... 1 2 3 4 Bank: 5 6 Global ID: ... ... ... ... ... ...__private
Memory
Compiler can
usually
allocate Private Memory in the
Register File
Even if Private Memory is dynamically indexed
Good Performance
Fallback: Private Memory allocated in Global Memory
Accesses are very strided
Bad Performance
__private int a[100]
…
…
EU
Th
read
n
-1
EU
Th
read
n
EU
Th
read
n
+1
Work Item 0 Work Item 1 … Work Item n__private int b[100]
__private int c[200]
Work Item 0 Work Item 1 … Work Item n Work Item 0 Work Item 1 … Work Item nAgenda
Compute Characteristics
Maximizing GFlops
ISA
“SIMT” ISA with Predication and Branching
“Divergent” code executes both branches
Reduced SIMD Efficiency
44
this();
if ( x )
that();
else
another();
finish();
SIMD lane
time
Example: “x” sometimes true
SIMD lane
time
Compute GFlops
45
EUs have 2 x 4-wide vector ALUs
Second ALU has limitations:
Subset of instructions:
add
,
mov
,
mad
,
mul
,
cmp
Instruction must come from another EU thread
Only float operands!
Peak GFlops: #EUs x ( 2 x 4-wide ALUs ) x ( MUL + ADD ) x Clock Rate
For Intel® Iris™ Pro 5200: 40 x 8 x 2 x 1.3 = 832 GFlops!
Maximizing Compute Performance
46
Use
mad()
/
fma()
: Either explicitly with built-ins, or via
-cl-mad-enable
Use
floats
wherever possible to maximize co-issue
Avoid
long
and
size_t
data types
Prefer
float
over
int
, if possible
Using
short
data types may improve performance
Trade accuracy for speed: “native” built-ins,
-cl-fast-relaxed-math
Agenda
Summary / Questions
Summary
48
Maximize Occupancy
Choose a Good Local Work Size
Or, Let the Driver Choose (Local Work Size == NULL)
Avoid Host-to-Device Transfers
Create Buffers with
CL_MEM_USE_HOST_PTR
Access Device Memory Efficiently
Minimize Cache Lines for
__global
Memory
Minimize Bank Conflicts for
__local
Memory
Maximize Compute
Avoid Divergent Branches
Questions / Acknowledgements
This presentation would not have been possible without material and review
comments from many people – Thank you!
Murali Sundaresan, Sushma Rao, Aaron Kunze, Tom Craver, Brijender Bharti, Rami Jiossy,
Michal Mrozek, Jay Rao, Pavan Lanka, Adam Lake, Arnon Peleg, Raun Krisch, Berna Adalier
Additional Resources
Intel® SDK for OpenCL* Applications 2013
Intel® OpenCL* Optimization Guide
Intel® VTune™ Amplifier XE 2013
Intel® Graphics Performance Analyzers
Accelerating video production with OpenCL* and Intel® Iris™
Graphics
Dave Helmly,
Sr. Manager, Solutions Consulting Pro Video/Audio Americas, Adobe* Systems Inc.
Intel is Hiring!
We want to work with you!
www.intel.com/jobs/
Head to our booth (201) to hear about our exciting opportunities!
Coming up next…
•
2:00 p.m. – 3:00 p.m.
•
Faster Content Creation with Higher Productivity using Intel® Developer Tools and
OpenCL*
•
Presenters:
•
Arnon Peleg (Intel)
•
Raghu Muthyalampalli (Intel)
For more information:
http://software.intel.com/en-us/siggraph2013