CUDA Debugging

GPGPU Workshop, August 2012

Sandra Wienke

Center for Computing and Communication (Rechen- und Kommunikationszentrum, RZ), RWTH Aachen University

Nikolay Piskun, Chris Gottbrath


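[Figure: GPU architecture and execution model. The host (CPU with host memory) is connected via PCIe to the device (GPU). The device contains streaming multiprocessors SM-1 … SM-n, each with cores, registers, shared memory and an L1 cache, and all SMs share an L2 cache and global memory. Mapping of the execution model to hardware: thread to core, block to streaming multiprocessor (SM), grid (kernel) to device. Per-thread example code shown in the figure: float x = input[threadID]; float y = func(x); output[threadID] = y; Figure © NVIDIA Corporation 2010]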


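CUDA in a Nutshell

CUDA Runtime API

#include <stdlib.h>
#include <cuda_runtime.h>

// __global__ indicates a kernel that executes on the GPU.
// It is defined before main so that the launch below can see it.
__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    // Compute the global thread ID
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + y[i];  // Compute SAXPY
    }
}

int main(int argc, char* argv[])
{
    int n = 10240;
    float *h_x, *h_y;  // Pointers to CPU memory
    // Allocate and initialize h_x and h_y (example values)
    h_x = (float*)malloc(n * sizeof(float));
    h_y = (float*)malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

    float *d_x, *d_y;  // Pointers to GPU memory
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);

    // Invoke parallel SAXPY kernel; round the block count up so that
    // every element is covered even if n is not a multiple of 128
    dim3 threadsPerBlock(128);
    dim3 blocksPerGrid((n + threadsPerBlock.x - 1) / threadsPerBlock.x);
    saxpy_parallel<<<blocksPerGrid, threadsPerBlock>>>(n, 2.0f, d_x, d_y);

    cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x); cudaFree(d_y);
    free(h_x); free(h_y);
    return 0;
}

Steps in the code above and the corresponding CUDA constructs:

 Allocate data on GPU: cudaMalloc
 Copy/transfer data to GPU: cudaMemcpy
 Invoke kernel on GPU: saxpy_parallel<<<blocksPerGrid, threadsPerBlock>>>
 Copy/transfer data to CPU: cudaMemcpy
 Free data on GPU: cudaFree
 Indicate kernel execution: __global__
 Compute thread ID: blockIdx.x * blockDim.x + threadIdx.x
 Compute SAXPY: y[i] = a*x[i] + y[i]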




CUDA Toolkit

 Developer kit: libs, header, profiler, compiler,…



Compiling CUDA applications

 nvcc [-arch=sm_20] myKernel.cu

 Debugging flags: -g (debug information for host code), -G (debug information for device code)



CUDA command line tools

 Debugger: cuda-gdb

 Detecting memory access errors: cuda-memcheck



CUDA GUI-based debugger: TotalView

 Debugging host and device code in same session

 Thread navigation by logical or physical coordinates

 Displaying hierarchical memory,…


CUDA Debugging



Setting breakpoints in CUDA kernels

 Start debugging (e.g. “Go”)

 Message box when kernel is loaded:




Debugger thread IDs in Linux CUDA process

 Host thread: positive number

 CUDA thread: negative number



GPU thread navigation

 Logical coordinates: blocks (3 dimensions), threads (3 dimensions)

 Physical coordinates: device, SM, warp, core/lane




Warp: group of 32 threads

 Share one PC

 Advance synchronously



Single Stepping

 Advances all GPU hardware threads within same warp

 Stepping over a __syncthreads() call advances all threads within the block



Advancing more than just one warp

 “Run To” a selected line number in the source pane

 Set a breakpoint and “Continue” the process



Halt

 Stops all the host and device threads

Problem: Diverging threads

if (threadIdx.x > 2) {...} else {...}
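
The divergence problem can be reproduced with a very small program. The following is a minimal sketch, not taken from the slides: the kernel name divergent_kernel and the values written are illustrative assumptions. It launches exactly one warp (32 threads), in which lanes 0 to 2 take the else branch and lanes 3 to 31 take the if branch, so the threads of that warp no longer follow the same path while a debugger steps through the kernel.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: one warp, two different code paths.
__global__ void divergent_kernel(int *out)
{
    int i = threadIdx.x;
    if (threadIdx.x > 2) {
        out[i] = 1;   // path taken by lanes 3..31
    } else {
        out[i] = 0;   // path taken by lanes 0..2
    }
}

int main()
{
    const int n = 32;          // exactly one warp
    int h_out[n];
    int *d_out = NULL;

    cudaMalloc(&d_out, n * sizeof(int));
    divergent_kernel<<<1, n>>>(d_out);
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    // Count how many lanes took the if branch.
    int taken = 0;
    for (int i = 0; i < n; i++) taken += h_out[i];
    printf("%d of %d lanes took the if branch\n", taken, n);
    return 0;
}
```

Compiled with the debugging flags -g -G, a breakpoint inside either branch should stop only the lanes that actually execute that branch.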




Displaying CUDA device properties

 “Tools” - “CUDA Devices”

 Helps mapping between logical and physical coordinates

PCs across SMs, warps, lanes

 GPU thread divergence? Different PCs within a warp indicate diverging threads




Displaying GPU data

 “Dive” into variable or watch “Type” in “Expression List”

 Device memory spaces: “@” notation

Storage Qualifier   Meaning of address
@global             Offset within global storage
@shared             Offset within shared storage
@local              Offset within local storage
@register           PTX register name
@generic            Offset within generic address space (e.g. pointer to global, local or shared memory)
@constant           Offset within constant storage
@texture            Offset within texture storage
@parameter          Offset within parameter storage




Checking GPU memory

 Enable “CUDA Memory checking” during startup or in the “Debug” menu

 Detects global memory addressing violations and misaligned global memory accesses



Further features

 Multi-device support

 Host-pinned memory support

 MPI-CUDA applications


CUDA Debugging - Tips



Check CUDA API calls

 All CUDA API routines return an error code (cudaError_t)

 Alternatively, cudaGetLastError() returns the last error from a CUDA runtime call

 cudaGetErrorString(cudaError_t) returns the corresponding message

1. Write a macro to check CUDA API return codes or use the SafeCall and CheckError macros from cutil.h (NVIDIA GPU Computing SDK)

2. Use TotalView to examine the return code

 Evaluate the CUDA API call in the expression list

 If needed, dive on the error value and typecast it to a cudaError_t type

 You can also surround the API call by cudaGetErrorString() in the
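
Option 1 above (a self-written checking macro) could look like the following minimal sketch. The macro name CUDA_CHECK and the kernel dummy_kernel are hypothetical names chosen for this example; they are not part of the CUDA runtime or of cutil.h. Only cudaGetLastError(), cudaGetErrorString(), cudaDeviceSynchronize() and the allocation calls come from the CUDA runtime API.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (the name is an assumption): evaluates a CUDA
// runtime call, and on failure prints file, line and message, then exits.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",               \
                    __FILE__, __LINE__, cudaGetErrorString(err_));     \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Hypothetical kernel used only to have something to launch.
__global__ void dummy_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 1.0f;
}

int main()
{
    const int n = 1024;
    float *d_data = NULL;

    CUDA_CHECK(cudaMalloc(&d_data, n * sizeof(float)));

    // A kernel launch returns no error code itself, so query it afterwards.
    dummy_kernel<<<n / 128, 128>>>(d_data, n);
    CUDA_CHECK(cudaGetLastError());        // errors from the launch itself
    CUDA_CHECK(cudaDeviceSynchronize());   // errors raised while the kernel ran

    CUDA_CHECK(cudaFree(d_data));
    printf("all CUDA calls succeeded\n");
    return 0;
}
```

Wrapping the body in do { ... } while (0) lets the macro be used like a normal statement, including inside an if/else without braces.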




Check and use available hardware features

 printf statements are possible within kernels (since Fermi)

 Use double precision floating point operations (since GT200)

 Enable ECC and check whether single or double bit errors occurred using nvidia-smi -q (since Fermi)
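
As an illustration of the first point, here is a minimal sketch of printf inside a kernel. The kernel name print_ids and the launch configuration are assumptions made for this example. It needs a device of compute capability 2.0 (Fermi) or higher and a matching -arch setting when compiling with nvcc.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread prints its logical coordinates.
__global__ void print_ids()
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    printf("block %d, thread %d -> global index %d\n",
           blockIdx.x, threadIdx.x, i);
}

int main()
{
    // Small launch on purpose: device printf output grows with thread count.
    print_ids<<<2, 4>>>();

    // Device-side printf output is buffered; synchronizing with the device
    // flushes the buffer so the lines appear on the host's stdout.
    cudaDeviceSynchronize();
    return 0;
}
```

The order of the printed lines across blocks is not guaranteed, so the output should be read as a set of messages rather than as a sequence.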



Check final numerical results on host

 While porting, it is recommended to compare all computed GPU results with host results

1. Compute checksums of GPU and host array values

2. If not sufficient, compare arrays element-wise
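
The two comparison steps can be sketched as follows, reusing the SAXPY kernel from earlier in the slides. The helper names saxpy_host, checksum and first_mismatch, the input values and the relative tolerance of 1e-5 are assumptions made for this example; a suitable tolerance depends on the application, since GPU and host may round floating point operations differently.

```cuda
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void saxpy_parallel(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Host reference implementation of the same computation (hypothetical helper).
void saxpy_host(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; i++) y[i] = a * x[i] + y[i];
}

// Step 1: checksum, accumulated in double to limit rounding error.
double checksum(const float *v, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) sum += v[i];
    return sum;
}

// Step 2: index of the first element whose relative difference exceeds tol,
// or -1 if all elements agree within the tolerance.
int first_mismatch(const float *gpu, const float *ref, int n, float tol)
{
    for (int i = 0; i < n; i++) {
        float scale = fmaxf(fabsf(ref[i]), 1.0f);
        if (fabsf(gpu[i] - ref[i]) > tol * scale) return i;
    }
    return -1;
}

int main()
{
    const int n = 10240;
    const size_t bytes = n * sizeof(float);
    float *h_x = (float *)malloc(bytes);
    float *h_y = (float *)malloc(bytes);    // receives the GPU result
    float *h_ref = (float *)malloc(bytes);  // host reference result
    for (int i = 0; i < n; i++) {
        h_x[i] = 0.001f * i;                // arbitrary example input
        h_y[i] = h_ref[i] = 1.0f;
    }

    float *d_x = NULL, *d_y = NULL;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);
    saxpy_parallel<<<(n + 127) / 128, 128>>>(n, 2.0f, d_x, d_y);
    cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);

    saxpy_host(n, 2.0f, h_x, h_ref);

    printf("checksum GPU: %f, host: %f\n", checksum(h_y, n), checksum(h_ref, n));
    int bad = first_mismatch(h_y, h_ref, n, 1e-5f);
    if (bad >= 0)
        printf("first mismatch at %d: GPU %f, host %f\n", bad, h_y[bad], h_ref[bad]);
    else
        printf("arrays match within tolerance\n");

    cudaFree(d_x); cudaFree(d_y);
    free(h_x); free(h_y); free(h_ref);
    return 0;
}
```

The checksum is cheap but can hide errors that cancel each other out, which is why the element-wise comparison is kept as the second step.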




Check intermediate results

 If results are stored directly in global memory: dive on the result array

 If results are stored in on-chip memory (e.g. registers): tedious debugging

 TotalView: viewing variables across CUDA threads is not possible yet

1. Create an additional array on the host for intermediate results with size #threads * #results * sizeof(result)

Use the array on the GPU: each thread stores its result at a unique index

Transfer the array back to the host and examine the results

2. With a limited number of thread blocks: create an additional array in shared memory within the kernel function: __shared__ myarray[size]

Use defines to exchange access to the on-chip variable with array access

Examine results by diving on the array and switching between blocks
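
A minimal sketch of approach 1, assuming one intermediate result per thread. The kernel scale_and_square, the switch DEBUG_INTERMEDIATE and the array names are hypothetical and exist only for this illustration. The intermediate value tmp would normally live only in a register; with the switch enabled, each thread also copies it to its own slot of a global debug array that the host can inspect afterwards.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define DEBUG_INTERMEDIATE 1   // hypothetical switch; set to 0 to disable the dump

__global__ void scale_and_square(int n, float a, const float *x, float *y, float *dbg)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float tmp = a * x[i];   // intermediate result, normally register-only
#if DEBUG_INTERMEDIATE
        dbg[i] = tmp;           // each thread writes to its own unique index
#endif
        y[i] = tmp * tmp;
    }
}

int main()
{
    const int n = 256;
    const size_t bytes = n * sizeof(float);   // #threads * #results * sizeof(result)
    float *h_x = (float *)malloc(bytes);
    float *h_dbg = (float *)malloc(bytes);    // host copy of the intermediate results
    for (int i = 0; i < n; i++) h_x[i] = (float)i;

    float *d_x = NULL, *d_y = NULL, *d_dbg = NULL;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMalloc(&d_dbg, bytes);
    cudaMemset(d_dbg, 0, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);

    scale_and_square<<<n / 128, 128>>>(n, 0.5f, d_x, d_y, d_dbg);

    // Transfer the debug array back and examine the per-thread values.
    cudaMemcpy(h_dbg, d_dbg, bytes, cudaMemcpyDeviceToHost);
    for (int i = 0; i < 4; i++)
        printf("thread %d: intermediate = %f\n", i, h_dbg[i]);

    cudaFree(d_x); cudaFree(d_y); cudaFree(d_dbg);
    free(h_x); free(h_dbg);
    return 0;
}
```

Guarding the extra store with a preprocessor define keeps the production kernel unchanged when the switch is off, which matches the slides' suggestion to use defines for exchanging the access.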