Rechen- und Kommunikationszentrum (RZ)
CUDA
Debugging
GPGPU Workshop, August 2012
Sandra Wienke
Center for Computing and Communication, RWTH Aachen University
Nikolay Piskun, Chris Gottbrath
CUDA Debugging
S. Wienke| Rechen- und Kommunikationszentrum
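[Figure: GPU architecture and execution model. The host (CPU, host memory) is connected via PCIe to the device (GPU). The device contains streaming multiprocessors SM-1 … SM-n, each with registers, L1 cache and shared memory; L2 cache and global memory are shared by all SMs. Execution hierarchy: a thread runs on a core, a block on a streaming multiprocessor (SM), a grid (kernel) on the device. Example thread code: float x = input[threadID]; float y = func(x); output[threadID] = y; Diagram © NVIDIA Corporation 2010]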
#include <stdlib.h>
#include <cuda_runtime.h>

// Kernel: each GPU thread computes one element of y = a*x + y
__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread ID
    if (i < n) {
        y[i] = a*x[i] + y[i];
    }
}

int main(int argc, char* argv[])
{
    int n = 10240;
    float *h_x, *h_y;  // Pointers to CPU memory
    // Allocate and initialize h_x and h_y (example values)
    h_x = (float*) malloc(n * sizeof(float));
    h_y = (float*) malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

    float *d_x, *d_y;  // Pointers to GPU memory
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);

    // Invoke parallel SAXPY kernel (number of blocks rounded up)
    dim3 threadsPerBlock(128);
    dim3 blocksPerGrid((n + threadsPerBlock.x - 1) / threadsPerBlock.x);
    saxpy_parallel<<<blocksPerGrid, threadsPerBlock>>>(n, 2.0f, d_x, d_y);

    cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x); cudaFree(d_y);
    free(h_x); free(h_y);
    return 0;
}
CUDA in a Nutshell
CUDA Runtime API
Allocate data on GPU: cudaMalloc
Copy/transfer data to GPU: cudaMemcpy
Invoke kernel on GPU: saxpy_parallel<<<blocksPerGrid, threadsPerBlock>>>
Copy/transfer data to CPU: cudaMemcpy
Free data on GPU: cudaFree
Indicate kernel execution: __global__
Compute thread ID: blockIdx.x * blockDim.x + threadIdx.x
Compute SAXPY: y[i] = a*x[i] + y[i]
CUDA Toolkit
Developer kit: libraries, headers, profiler, compiler, …
Compiling CUDA applications
nvcc [-arch=sm_20] myKernel.cu
Debugging flags: -g (debug information for host code), -G (debug information for device code)
CUDA command line tools
Debugger: cuda-gdb
Detecting memory access errors: cuda-memcheck
CUDA GUI-based debugger: TotalView
Debugging host and device code in the same session
Thread navigation by logical or physical coordinates
Displaying hierarchical memory, …
CUDA Debugging
Setting breakpoints in CUDA kernels
Start debugging (e.g. “Go”)
Message box when kernel is loaded:
Debugger thread IDs in Linux CUDA process
Host thread: positive number
CUDA thread: negative number
GPU thread navigation
Logical coordinates: blocks (3 dimensions), threads (3 dimensions)
Physical coordinates: device, SM, warp, core/lane
CUDA Debugging
Warp: group of 32 threads
Share one PC
Advance synchronously
Single Stepping
Advances all GPU hardware threads within same warp
Stepping over a __syncthreads() call advances all threads within the block
Advancing more than just one warp
“Run To” a selected line number in the source pane
Set a breakpoint and “Continue” the process
Halt
Stops all the host and device threads
Problem: Diverging threads
if (threadIdx.x > 2) {...} else {...}
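The divergence problem above can be reproduced with a small kernel. The following is a minimal sketch, not taken from the slides: the names divergent_kernel and count_else_branch are hypothetical, and the block size of 64 threads (two warps) is an assumption chosen so that only the first warp diverges. When single-stepping such a kernel in the debugger, the lanes of the first warp sit at different PCs while the other branch is executed.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Threads 0..2 of each block take the else-branch, all others the if-branch.
// Within the first warp of a block both paths are therefore executed.
__global__ void divergent_kernel(int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x > 2) {
        out[i] = 1;   // lanes 3..31 of the first warp and all later warps
    } else {
        out[i] = 2;   // lanes 0..2 of the first warp only
    }
}

// Hypothetical helper: launches one block of 64 threads (two warps) and
// returns how many threads took the else-branch, or -1 on a CUDA error.
int count_else_branch()
{
    const int n = 64;
    int h_out[n];
    int *d_out = NULL;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) return -1;

    divergent_kernel<<<1, n>>>(d_out);
    cudaError_t err = cudaMemcpy(h_out, d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int count = 0;
    for (int i = 0; i < n; ++i) {
        if (h_out[i] == 2) ++count;
    }
    return count;
}
```

Compiled with -g -G, a breakpoint inside either branch lets you inspect which lanes of the warp are active at that point.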
Displaying CUDA device properties
“Tools” - “CUDA Devices”
Helps mapping between logical & physical coordinates
PCs across SMs, warps, lanes
GPU thread divergence?
Different PC within warp
Diverging threads
CUDA Debugging
Displaying GPU data
“Dive” into a variable or watch “Type” in the “Expression List”
Device memory spaces: “@” notation

Storage Qualifier   Meaning of address
@global             Offset within global storage
@shared             Offset within shared storage
@local              Offset within local storage
@register           PTX register name
@generic            Offset within generic address space (e.g. pointer to global, local or shared memory)
@constant           Offset within constant storage
@texture            Offset within texture storage
@parameter          Offset within parameter storage
Checking GPU memory
Enable “CUDA Memory checking” during startup or in the “Debug” menu
Detects global memory addressing violations and misaligned global memory accesses
Further features
Multi-device support
Host-pinned memory support
MPI-CUDA applications
CUDA Debugging - Tips
Check CUDA API calls
All CUDA API routines return an error code (cudaError_t)
Or cudaGetLastError() returns the last error from a CUDA runtime call
cudaGetErrorString(cudaError_t) returns the corresponding message
1. Write a macro to check CUDA API return codes or use the SafeCall and CheckError macros from cutil.h (NVIDIA GPU Computing SDK)
2. Use TotalView to examine the return code
Evaluate the CUDA API call in the expression list
If needed, dive on the error value and typecast it to a cudaError_t type
You can also surround the API call by cudaGetErrorString() in the
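Tip 1 (a macro that checks CUDA API return codes) could look roughly like the sketch below. The macro name CUDA_CHECK, the kernel fill_kernel and the helper run_checked_fill are hypothetical names chosen for illustration; they are not the macros from cutil.h. Only cudaGetLastError(), cudaGetErrorString() and the cudaError_t return codes are taken from the tips above.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical macro: evaluates a CUDA runtime call, and on failure prints
// file, line and the message from cudaGetErrorString(), then exits.
#define CUDA_CHECK(call)                                             \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",             \
                    __FILE__, __LINE__, cudaGetErrorString(err_));   \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Illustrative kernel: sets every element of data to value.
__global__ void fill_kernel(int n, float value, float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] = value;
    }
}

// Hypothetical helper: returns 0 if the first element arrives as expected.
int run_checked_fill()
{
    const int n = 1000;
    float *d_data = NULL;
    float h_first = 0.0f;

    CUDA_CHECK(cudaMalloc(&d_data, n * sizeof(float)));

    fill_kernel<<<(n + 127) / 128, 128>>>(n, 1.0f, d_data);
    // A kernel launch returns no error code itself, so query it explicitly.
    CUDA_CHECK(cudaGetLastError());        // launch errors, e.g. bad configuration
    CUDA_CHECK(cudaDeviceSynchronize());   // errors raised while the kernel ran

    CUDA_CHECK(cudaMemcpy(&h_first, d_data, sizeof(float),
                          cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_data));
    return (h_first == 1.0f) ? 0 : 1;
}
```

Exiting on the first error is a design choice of this sketch; in larger applications it may be preferable to return the error code to the caller instead.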
Check + use available hardware features
printf statements are possible within kernels (since Fermi)
Use double precision floating point operations (since GT200)
Enable ECC and check whether single or double bit errors occurred using nvidia-smi -q (since Fermi)
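As an illustration of the first item, the following minimal sketch prints from inside a kernel. The names print_kernel and run_print_demo are hypothetical, and limiting the output to the first four threads is an assumption made to keep the output readable. Device-side printf needs a Fermi-class target (e.g. nvcc -arch=sm_20) or newer.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Prints the input value seen by the first few threads.
__global__ void print_kernel(int n, const float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Restrict output to a few threads; printing from all threads floods the output.
    if (i < n && i < 4) {
        printf("block %d, thread %d: x[%d] = %f\n",
               blockIdx.x, threadIdx.x, i, x[i]);
    }
}

// Hypothetical helper: returns 0 if the kernel ran without a CUDA error.
int run_print_demo()
{
    const int n = 256;
    float h_x[n];
    for (int i = 0; i < n; ++i) {
        h_x[i] = 0.5f * i;
    }

    float *d_x = NULL;
    if (cudaMalloc(&d_x, sizeof(h_x)) != cudaSuccess) return 1;
    cudaMemcpy(d_x, h_x, sizeof(h_x), cudaMemcpyHostToDevice);

    print_kernel<<<n / 128, 128>>>(n, d_x);
    // The device-side printf buffer is flushed at synchronization points.
    cudaError_t err = cudaDeviceSynchronize();

    cudaFree(d_x);
    return (err == cudaSuccess) ? 0 : 1;
}
```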
Check final numerical results on host
While porting, it is recommended to compare all computed GPU results with host results
1. Compute checksums of the GPU and host array values
2. If not sufficient, compare the arrays element-wise
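The two comparison steps can be sketched for the SAXPY kernel as follows. The helper names (saxpy_host, checksum, first_mismatch, compare_saxpy), the input values and the relative tolerance of 1e-5 are assumptions made for this example. A tolerance is used because GPU and host results may differ in the last bits, for instance when the GPU fuses the multiply and add.

```cuda
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void saxpy_parallel(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Host reference implementation of the same computation.
void saxpy_host(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; ++i) y[i] = a * x[i] + y[i];
}

// Step 1: checksum, accumulated in double to limit rounding effects.
double checksum(const float *v, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; ++i) sum += v[i];
    return sum;
}

// Step 2: element-wise comparison with a relative tolerance.
// Returns the index of the first mismatch, or -1 if all elements agree.
int first_mismatch(const float *gpu, const float *ref, int n, float tol)
{
    for (int i = 0; i < n; ++i) {
        if (fabsf(gpu[i] - ref[i]) > tol * fmaxf(1.0f, fabsf(ref[i]))) return i;
    }
    return -1;
}

// Hypothetical driver: runs SAXPY on GPU and host and compares the results.
int compare_saxpy()
{
    const int n = 10240;
    const size_t bytes = n * sizeof(float);
    float *h_x   = (float*) malloc(bytes);
    float *h_ref = (float*) malloc(bytes);   // host result
    float *h_gpu = (float*) malloc(bytes);   // GPU result copied back
    for (int i = 0; i < n; ++i) {
        h_x[i] = 0.5f * i;
        h_ref[i] = h_gpu[i] = 1.0f;
    }

    float *d_x = NULL, *d_y = NULL;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_gpu, bytes, cudaMemcpyHostToDevice);
    saxpy_parallel<<<(n + 127) / 128, 128>>>(n, 2.0f, d_x, d_y);
    cudaMemcpy(h_gpu, d_y, bytes, cudaMemcpyDeviceToHost);

    saxpy_host(n, 2.0f, h_x, h_ref);

    printf("checksum GPU: %f, host: %f\n", checksum(h_gpu, n), checksum(h_ref, n));
    int bad = first_mismatch(h_gpu, h_ref, n, 1e-5f);
    if (bad >= 0) {
        printf("first mismatch at %d: GPU %f, host %f\n", bad, h_gpu[bad], h_ref[bad]);
    }

    cudaFree(d_x); cudaFree(d_y);
    free(h_x); free(h_ref); free(h_gpu);
    return bad;
}
```

A checksum can hide errors that cancel each other out, which is why the element-wise pass is kept as the second step.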
CUDA Debugging - Tips
Check intermediate results
If results are directly stored in global memory: dive on result array
If results are stored in on-chip memory (e.g. registers): tedious debugging
TotalView: view of variables across CUDA threads not possible yet
1. Create an additional array on the host for intermediate results with size #threads * #results * sizeof(result)
Use an array on the GPU: each thread stores its result at a unique index
Transfer the array back to the host and examine the results
2. If the number of thread blocks is limited: create an additional array in shared memory within the kernel function: __shared__ <type> myarray[size]
Use defines to exchange access to the on-chip variable with array access
Examine the results by diving on the array and switching between blocks
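Option 1 can be sketched as follows for the SAXPY kernel. The names saxpy_debug, RESULTS_PER_THREAD and check_intermediate_results are hypothetical, and recording two intermediate values per thread is an assumption for illustration. Each thread writes its register values to a unique slot of a debug array in global memory, which the host then copies back and inspects.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical constant: number of intermediate values recorded per thread.
#define RESULTS_PER_THREAD 2

__global__ void saxpy_debug(int n, float a, const float *x, float *y, float *dbg)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float prod = a * x[i];     // intermediate results held in registers
        float sum  = prod + y[i];
        // Unique index per thread and per result: i * #results + result number.
        dbg[i * RESULTS_PER_THREAD + 0] = prod;
        dbg[i * RESULTS_PER_THREAD + 1] = sum;
        y[i] = sum;
    }
}

// Hypothetical driver: returns the number of recorded values that differ
// from the values expected on the host (0 if everything matches).
int check_intermediate_results()
{
    const int n = 256;
    float h_x[n], h_y[n];
    float h_dbg[n * RESULTS_PER_THREAD];   // size: #threads * #results
    for (int i = 0; i < n; ++i) {
        h_x[i] = (float) i;
        h_y[i] = 1.0f;
    }

    float *d_x = NULL, *d_y = NULL, *d_dbg = NULL;
    cudaMalloc(&d_x, sizeof(h_x));
    cudaMalloc(&d_y, sizeof(h_y));
    cudaMalloc(&d_dbg, sizeof(h_dbg));
    cudaMemcpy(d_x, h_x, sizeof(h_x), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, sizeof(h_y), cudaMemcpyHostToDevice);

    saxpy_debug<<<n / 128, 128>>>(n, 2.0f, d_x, d_y, d_dbg);

    // Transfer the debug array back to the host and examine it.
    cudaMemcpy(h_dbg, d_dbg, sizeof(h_dbg), cudaMemcpyDeviceToHost);
    cudaFree(d_x); cudaFree(d_y); cudaFree(d_dbg);

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        float prod = 2.0f * h_x[i];
        if (h_dbg[i * RESULTS_PER_THREAD + 0] != prod) ++mismatches;
        if (h_dbg[i * RESULTS_PER_THREAD + 1] != prod + h_y[i]) ++mismatches;
    }
    return mismatches;
}
```

The exact comparison works here only because the chosen inputs are small integers that are represented exactly in single precision; for general data a tolerance would be needed.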