Intel Xeon Phi Coprocessor High-Performance Programming

Intel® Xeon Phi™ Coprocessor High-Performance Programming

Jim Jeffers, James Reinders

ELSEVIER

AMSTERDAM • BOSTON • HEIDELBERG • LONDON • NEW YORK • OXFORD • PARIS • SAN DIEGO • SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO

Morgan Kaufmann is an imprint of Elsevier

MORGAN KAUFMANN

Foreword xiii

Preface xvii

Acknowledgements xix

CHAPTER 1 Introduction 1

Trend: more parallelism 1

Why Intel® Xeon Phi™ coprocessors are needed 2

Platforms with coprocessors 5

The first Intel® Xeon Phi™ coprocessor 6

Keeping the "Ninja Gap" under control 9

Transforming-and-tuning double advantage 10

When to use an Intel® Xeon Phi™ coprocessor 11

Maximizing performance on processors first 11

Why scaling past one hundred threads is so important 12

Maximizing parallel program performance 15

Measuring readiness for highly parallel execution 15

What about GPUs? 16

Beyond the ease of porting to increased performance 16

Transformation for performance 17

Hyper-threading versus multithreading 17

Coprocessor major usage model: MPI versus offload 18

Compiler and programming models 19

Cache optimizations 20

Examples, then details 21

For more information 21

CHAPTER 2 High Performance Closed Track Test Drive! 23

Looking under the hood: coprocessor specifications 24

Starting the car: communicating with the coprocessor 26

Taking it out easy: running our first code 28

Starting to accelerate: running more than one thread 32

Pedal to the metal: hitting full speed using all cores 38

Easing in to the first curve: accessing memory bandwidth 49

High speed banked curve: maximizing memory bandwidth 54

Back to the pit: a summary 57

CHAPTER 3 A Friendly Country Road Race 59

Preparing for our country road trip: chapter focus 59

Getting a feel for the road: the 9-point stencil algorithm 60

At the starting line: the baseline 9-point stencil implementation 61

Rough road ahead: running the baseline stencil code 68

vi Contents

Cobblestone street ride: vectors but not yet scaling 70

Open road all-out race: vectors plus scaling 72

Some grease and wrenches!: a bit of tuning 75

Adjusting the "Alignment" 76

Using streaming stores 77

Using huge 2-MB memory pages 79

Summary 81

CHAPTER 4 Driving Around Town: Optimizing A Real-World Code Example 83

Choosing the direction: the basic diffusion calculation 84

Turn ahead: accounting for boundary effects 84

Finding a wide boulevard: scaling the code 91

Thunder road: ensuring vectorization 93

Peeling out: peeling code from the inner loop 97

Trying higher octane fuel: improving speed using data locality and tiling 100

High speed driver certificate: summary of our high speed tour 105

CHAPTER 5 Lots of Data (Vectors) 107

Why vectorize? 107

How to vectorize 108

Five approaches to achieving vectorization 108

Six step vectorization methodology 110

Step 1. Measure baseline release build performance 111

Step 2. Determine hotspots using Intel VTune™ amplifier XE 111

Step 3. Determine loop candidates using Intel compiler vec-report 111

Step 4. Get advice using the Intel Compiler GAP report and toolkit resources 112

Step 5. Implement GAP advice and other suggestions (such as using elemental functions and/or array notations) 112

Step 6: Repeat! 112

Streaming through caches: data layout, alignment, prefetching, and so on 112

Why data layout affects vectorization performance 113

Data alignment 114

Prefetching 116

Streaming stores 121

Compiler tips 123

Avoid manual loop unrolling 123

Requirements for a loop to vectorize (Intel Compiler) 124

Importance of inlining, interference with simple profiling 126

Compiler options 126

Memory disambiguation inside vector-loops 127

Compiler directives 128

SIMD directives 129

The VECTOR and NOVECTOR directives 134

The IVDEP directive 135

Random number function vectorization 137

Utilizing full vectors, -opt-assume-safe-padding 138

Option -opt-assume-safe-padding 142

Data alignment to assist vectorization 142

Tradeoffs in array notations due to vector lengths 146

Use array sections to encourage vectorization 150

Fortran array sections 150

Cilk Plus array sections and elemental functions 152

Look at what the compiler created: assembly code inspection 156

How to find the assembly code 157

Quick inspection of assembly code 158

Numerical result variations with vectorization 163

Summary 163

CHAPTER 6 Lots of Tasks (not Threads) 165

OpenMP, Fortran 2008, Intel® TBB, Intel® Cilk™ Plus, Intel® MKL 166

Task creation needs to happen on the coprocessor 166

Importance of thread pools 168

OpenMP 168

Parallel processing model 168

Directives 169

Significant controls over OpenMP 169

Nesting 170

Fortran 2008 171

DO CONCURRENT 171

DO CONCURRENT and DATA RACES 171

DO CONCURRENT definition 172

DO CONCURRENT vs. FORALL 173

DO CONCURRENT vs. OpenMP "Parallel" 173

Intel® TBB 174

History 175

Using TBB 177

parallel_for 177

blocked_range 177

Partitioners 178

parallel_reduce 179

parallel_invoke 180

Notes on C++11 180

TBB summary 181

Cilk Plus 181

History 183

Borrowing components from TBB 183

Loaning components to TBB 184

Keyword spelling 184

cilk_for 184

cilk_spawn and cilk_sync 185

Reducers (Hyperobjects) 187

Array notation and elemental functions 187

Cilk Plus summary 187

Summary 187

CHAPTER 7 Offload 189

Two offload models 190

Choosing offload vs. native execution 191

Non-shared memory model: using offload pragmas/directives 191

Shared virtual memory model: using offload with shared VM 191

Intel® Math Kernel Library (Intel MKL) automatic offload 192

Language extensions for offload 192

Compiler options and environment variables for offload 193

Sharing environment variables for offload 195

Offloading to multiple coprocessors 195

Using pragma/directive offload 195

Placing variables and functions on the coprocessor 198

Managing memory allocation for pointer variables 200

Optimization for time: another reason to persist allocations 206

Target-specific code using a pragma in C/C++ 206

Target-specific code using a directive in Fortran 209

Code that should not be built for processor-only execution 209

Predefined macros for Intel MIC architecture 211

Fortran arrays 211

Allocating memory for parts of C/C++ arrays 212

Allocating memory for parts of Fortran arrays 213

Moving data from one variable to another 214

Restrictions on offloaded code using a pragma 215

Using offload with shared virtual memory 217

Using shared memory and shared variables 217

About shared functions 219

Shared memory management functions 219

Synchronous and asynchronous function execution: _Cilk_offload 219

Sharing variables and functions: _Cilk_shared 220

Rules for using _Cilk_shared and _Cilk_offload 222

Synchronization between the processor and the target 222

Writing target-specific code with _Cilk_offload 223

Restrictions on offloaded code using shared virtual memory 224

Persistent data when using shared virtual memory 225

C++ declarations of persistent data with shared virtual memory 227

About asynchronous computation 228

About asynchronous data transfer 229

Asynchronous data transfer from the processor to the coprocessor 229

Applying the target attribute to multiple declarations 234

Vec-report option used with offloads 235

Measuring timing and data in offload regions 236

Offload_report 236

Using libraries in offloaded code 237

About creating offload libraries with xiar and xild 237

Performing file I/O on the coprocessor 238

Logging stdout and stderr from offloaded code 240

Summary 241

CHAPTER 8 Coprocessor Architecture 243

The Intel® Xeon Phi™ coprocessor family 244

Coprocessor card design 245

Intel Xeon Phi™ coprocessor silicon overview 246

Individual coprocessor core architecture 247

Instruction and multithread processing 249

Cache organization and memory access considerations 251

Prefetching 252

Vector processing unit architecture 253

Vector instructions 254

Coprocessor PCIe system interface and DMA 257

DMA capabilities 258

Coprocessor power management capabilities 260

Reliability, availability, and serviceability (RAS) 263

Machine check architecture (MCA) 264

Coprocessor system management controller (SMC) 265

Sensors 265

Thermal design power monitoring and control 266

Fan speed control 266

Potential application impact 266

Benchmarks 267

Summary 267

CHAPTER 9 Coprocessor System Software 269

Coprocessor software architecture overview 269

Symmetry 271

Ring levels: user and kernel 271

Coprocessor programming models and options 271

Breadth and depth 273

Coprocessor MPI programming models 274

Coprocessor software architecture components 276

Development tools and application layer 276

Intel manycore platform software stack 277

MYO: mine yours ours 277

COI: coprocessor offload infrastructure 278

SCIF: symmetric communications interface 278

Virtual networking (NetDev), TCP/IP, and sockets 278

Coprocessor system management 279

Coprocessor components for MPI applications 282

Linux support for Intel Xeon Phi™ coprocessors 287

Tuning memory allocation performance 288

Controlling the number of 2 MB pages 288

Monitoring the number of 2 MB pages on the coprocessor 288

A sample method for allocating 2 MB pages 289

Summary 290

CHAPTER 10 Linux on the Coprocessor 293

Coprocessor Linux baseline 293

Introduction to coprocessor Linux bootstrap and configuration 294

Default coprocessor Linux configuration 295

Step 1: Ensure root access 296

Step 2: Generate the default configuration 296

Step 3: Change configuration 296

Step 4: Start the Intel® MPSS service 296

Changing coprocessor configuration 297

Configurable components 297

Configuration files 298

Configuring boot parameters 298

Coprocessor root file system 300

The micctrl utility 305

Coprocessor state control 306

Booting coprocessors 306

Shutting down coprocessors 306

Rebooting the coprocessors 306

Resetting coprocessors 307

Coprocessor configuration initialization and propagation 308

Helper functions for configuration parameters 309

Other file system helper functions 311

Adding software 312

Adding files to the root file system 313

Example: Adding a new global file set 314

Coprocessor Linux boot process 315

Booting the coprocessor 315

Coprocessors in a Linux cluster 318

Intel® Cluster Ready 319

How Intel Cluster Checker works 319

Intel Cluster Checker support for coprocessors 320

Summary 322

CHAPTER 11 Math Library 325

Intel Math Kernel Library overview 326

Intel MKL differences on the coprocessor 327

Intel MKL and Intel compiler 327

Coprocessor support overview 327

Control functions for automatic offload 328

Examples of how to set the environment variables 330

Using the coprocessor in native mode 330

Tips for using native mode 332

Using automatic offload mode 332

How to enable automatic offload 333

Examples of using control work division 333

Tips for effective use of automatic offload 333

Some tips for effective use of Intel MKL with or without offload 336

Using compiler-assisted offload 337

Tips for using compiler assisted offload 338

Precision choices and variations 339

Fast transcendentals and mathematics 339

Understanding the potential for floating-point arithmetic variations 339

Summary 342

CHAPTER 12 MPI 343

MPI overview 343

Using MPI on Intel® Xeon Phi™ coprocessors 345

Heterogeneity (and why it matters) 345

Prerequisites (batteries not included) 348

Offload from an MPI rank 349

Hello world 350

Trapezoidal rule 350

Using MPI natively on the coprocessor 354

Hello world (again) 354

Trapezoidal rule (revisited) 356

Summary 361

CHAPTER 13 Profiling and Timing 363

Event monitoring registers on the coprocessor 364

List of events used in this guide 364

Efficiency metrics 364

CPI 365

Compute to data access ratio 369

Potential performance issues 370

General cache usage 371

TLB misses 373

VPU usage 374

Memory bandwidth 376

Intel® VTune™ Amplifier XE product 377

Avoid simple profiling 378

Performance application programming interface 378

MPI analysis: Intel Trace Analyzer and Collector 378

Generating a trace file: coprocessor only application 379

Generating a trace file: processor + coprocessor application 379

Timing 380

Clocksources on the coprocessor 380

MIC elapsed time counter (micetc) 380

Time stamp counter (tsc) 380

Setting the clocksource 381

Time structures 381

Time penalty 382

Measuring timing and data in offload regions 383

Summary 383

CHAPTER 14 Summary 385

Advice 385

Additional resources 386

Another book coming? 386

Feedback appreciated 386

Glossary 387

Index 401