
CUDA Debugging. GPGPU Workshop, August 2012. Sandra Wienke, Center for Computing and Communication, RWTH Aachen University


Rechen- und Kommunikationszentrum (RZ)

CUDA Debugging

GPGPU Workshop, August 2012

Sandra Wienke

Center for Computing and Communication, RWTH Aachen University

Nikolay Piskun, Chris Gottbrath


CUDA Debugging

S. Wienke| Rechen- und Kommunikationszentrum

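[Figure: GPU architecture and execution model. The host (CPU with CPU memory) is connected to the device (GPU) via PCIe. The device contains streaming multiprocessors SM-1 to SM-n, each with registers, shared memory and an L1 cache, plus an L2 cache and global memory. Execution hierarchy labels: thread / core, block / streaming multiprocessor (SM), grid (kernel) / device. Per-thread example code: float x = input[threadID]; float y = func(x); output[threadID] = y; © NVIDIA Corporation 2010]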


CUDA in a Nutshell

CUDA Runtime API

    #include <cstdlib>
    #include <cuda_runtime.h>

    // __global__ indicates kernel execution on the GPU
    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        // Compute thread ID
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // Compute SAXPY
        if (i < n) {
            y[i] = a * x[i] + y[i];
        }
    }

    int main(int argc, char* argv[])
    {
        int n = 10240;
        float *h_x, *h_y;  // Pointers to CPU memory
        // Allocate and initialize h_x and h_y (placeholder values)
        h_x = (float*) malloc(n * sizeof(float));
        h_y = (float*) malloc(n * sizeof(float));
        for (int i = 0; i < n; i++) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

        // Allocate data on GPU (cudaMalloc)
        float *d_x, *d_y;  // Pointers to GPU memory
        cudaMalloc(&d_x, n * sizeof(float));
        cudaMalloc(&d_y, n * sizeof(float));

        // Copy/transfer data to GPU (cudaMemcpy)
        cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);

        // Invoke parallel SAXPY kernel on GPU
        // (n is a multiple of the block size, so the division is exact)
        dim3 threadsPerBlock(128);
        dim3 blocksPerGrid(n / threadsPerBlock.x);
        saxpy_parallel<<<blocksPerGrid, threadsPerBlock>>>(n, 2.0f, d_x, d_y);

        // Copy/transfer data to CPU (cudaMemcpy)
        cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);

        // Free data on GPU (cudaFree)
        cudaFree(d_x); cudaFree(d_y);
        free(h_x); free(h_y);
        return 0;
    }


CUDA Toolkit
  • Developer kit: libraries, headers, profiler, compiler, …

Compiling CUDA applications
  • nvcc [-arch=sm_20] myKernel.cu
  • Debugging flags: -g (host debug information) and -G (device debug information)

CUDA command line tools
  • Debugger: cuda-gdb
  • Detecting memory access errors: cuda-memcheck

CUDA GUI-based debugger: TotalView
  • Debugging host and device code in the same session
  • Thread navigation by logical or physical coordinates
  • Displaying hierarchical memory, …


CUDA Debugging

Setting breakpoints in CUDA kernels
  • Start debugging (e.g. “Go”)
  • A message box appears when the kernel is loaded


Debugger thread IDs in a Linux CUDA process
  • Host thread: positive number
  • CUDA thread: negative number

GPU thread navigation
  • Logical coordinates: blocks (3 dimensions), threads (3 dimensions)
  • Physical coordinates: device, SM, warp, core/lane
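The relation between logical coordinates and the warp/lane part of the physical coordinates can be sketched with a small kernel. This is an illustration only, not TotalView functionality: the names `ThreadCoords`, `record_coords` and `collect_coords` are hypothetical, and the sketch assumes a one-dimensional block, in which consecutive `threadIdx.x` values are grouped into warps of `warpSize` threads. The SM a block runs on is chosen by the hardware scheduler and is not derived here.

```cuda
#include <cuda_runtime.h>

// Hypothetical record of one thread's coordinates (names are illustrative).
struct ThreadCoords {
    int block;   // logical: blockIdx.x
    int thread;  // logical: threadIdx.x
    int warp;    // warp index within the block
    int lane;    // lane index within the warp
};

__global__ void record_coords(ThreadCoords *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i].block  = blockIdx.x;
        out[i].thread = threadIdx.x;
        out[i].warp   = threadIdx.x / warpSize;  // assumes a 1D block
        out[i].lane   = threadIdx.x % warpSize;
    }
}

// Launches the kernel and copies one record per thread back to h_out,
// which must hold blocks * threadsPerBlock entries.
cudaError_t collect_coords(ThreadCoords *h_out, int blocks, int threadsPerBlock)
{
    int n = blocks * threadsPerBlock;
    ThreadCoords *d_out = NULL;
    cudaError_t err = cudaMalloc(&d_out, n * sizeof(ThreadCoords));
    if (err != cudaSuccess) return err;

    record_coords<<<blocks, threadsPerBlock>>>(d_out, n);
    err = cudaGetLastError();
    if (err == cudaSuccess) {
        err = cudaMemcpy(h_out, d_out, n * sizeof(ThreadCoords),
                         cudaMemcpyDeviceToHost);
    }
    cudaFree(d_out);
    return err;
}
```

Calling `collect_coords(h, 2, 64)` with a host array of 128 records fills one entry per thread; for example, the thread with logical coordinates block 0, thread 40 belongs to warp 1 and lane 8 of its block.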


Warp: group of 32 threads
  • Share one program counter (PC)
  • Advance synchronously

Single stepping
  • Advances all GPU hardware threads within the same warp
  • Stepping over a __syncthreads() call advances all threads within the block

Advancing more than just one warp
  • “Run To” a selected line number in the source pane
  • Set a breakpoint and “Continue” the process

Halt
  • Stops all host and device threads

Problem: diverging threads
    if (threadIdx.x > 2) {...} else {...}
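A minimal sketch of such a diverging branch, with the hypothetical names `diverge` and `run_diverge`: each thread records which branch it took. Within the first warp, threads 0 to 2 take the else branch and threads 3 to 31 take the if branch, so a debugger stopped inside the branch would be expected to show different PCs for threads of the same warp.

```cuda
#include <cuda_runtime.h>

// Each thread records the branch it executed (1 = if branch, 0 = else branch).
__global__ void diverge(int *branch_taken)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x > 2) {
        branch_taken[i] = 1;  // threads 3 and above of each block
    } else {
        branch_taken[i] = 0;  // threads 0 to 2 of each block
    }
}

// Launches a single block and copies the per-thread flags to h_branch,
// which must hold `threads` entries.
cudaError_t run_diverge(int *h_branch, int threads)
{
    int *d_branch = NULL;
    cudaError_t err = cudaMalloc(&d_branch, threads * sizeof(int));
    if (err != cudaSuccess) return err;

    diverge<<<1, threads>>>(d_branch);
    err = cudaGetLastError();
    if (err == cudaSuccess) {
        err = cudaMemcpy(h_branch, d_branch, threads * sizeof(int),
                         cudaMemcpyDeviceToHost);
    }
    cudaFree(d_branch);
    return err;
}
```

Setting a breakpoint on each assignment and running one block of 32 threads is one way to observe the two groups of the warp separately.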


Displaying CUDA device properties
  • “Tools” - “CUDA Devices”
  • Helps mapping between logical and physical coordinates
  • Shows PCs across SMs, warps and lanes
  • GPU thread divergence? Different PCs within a warp indicate diverging threads


Displaying GPU data
  • “Dive” into a variable, or watch “Type” in the “Expression List”
  • Device memory spaces: “@” notation

  Storage qualifier   Meaning of address
  @global             Offset within global storage
  @shared             Offset within shared storage
  @local              Offset within local storage
  @register           PTX register name
  @generic            Offset within generic address space (e.g. pointer to global, local or shared memory)
  @constant           Offset within constant storage
  @texture            Offset within texture storage
  @parameter          Offset within parameter storage


Checking GPU memory
  • Enable “CUDA Memory checking” during startup or in the “Debug” menu
  • Detects global memory addressing violations and misaligned global memory accesses
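A typical source of a global memory addressing violation is a thread that indexes past the end of an array. The following sketch uses the hypothetical names `scale` and `scale_on_gpu` and assumes an array length that need not be a multiple of the block size: the grid is rounded up, so the last block may contain threads with `i >= n`. With the guard in place these threads do nothing; without it they would write outside the allocation, which is the kind of access a memory checker is meant to report.

```cuda
#include <cuda_runtime.h>

__global__ void scale(int n, float a, float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // without this guard, threads with i >= n write out of bounds
        x[i] = a * x[i];
    }
}

// Scales the n elements of h_x in place on the GPU.
cudaError_t scale_on_gpu(int n, float a, float *h_x)
{
    const int threadsPerBlock = 128;
    // Round up so that every element is covered by a thread.
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;

    float *d_x = NULL;
    cudaError_t err = cudaMalloc(&d_x, n * sizeof(float));
    if (err != cudaSuccess) return err;

    err = cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        scale<<<blocksPerGrid, threadsPerBlock>>>(n, a, d_x);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) {
        err = cudaMemcpy(h_x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost);
    }
    cudaFree(d_x);
    return err;
}
```

With `n = 1000`, eight blocks of 128 threads are launched, so 24 threads of the last block rely on the guard.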

Further features
  • Multi-device support
  • Host-pinned memory support
  • MPI-CUDA applications


CUDA Debugging - Tips

Check CUDA API calls
  • All CUDA API routines return an error code (cudaError_t)
  • Alternatively, cudaGetLastError() returns the last error from a CUDA runtime call
  • cudaGetErrorString(cudaError_t) returns the corresponding message

  1. Write a macro to check CUDA API return codes, or use the SafeCall and CheckError macros from cutil.h (NVIDIA GPU Computing SDK)
  2. Use TotalView to examine the return code
     • Evaluate the CUDA API call in the expression list
     • If needed, dive on the error value and typecast it to a cudaError_t type
     • You can also surround the API call by cudaGetErrorString() in the
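The first option, a self-written checking macro, could look like the following sketch. The macro name `CUDA_CHECK` and the helper names `fill` and `fill_on_gpu` are hypothetical and not taken from cutil.h; the runtime calls used (`cudaGetLastError`, `cudaGetErrorString`, `cudaDeviceSynchronize`) are part of the CUDA runtime API. A kernel launch itself returns no error code, so the sketch queries `cudaGetLastError()` right after the launch.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical macro: prints the error message with file and line, then exits.
#define CUDA_CHECK(call)                                             \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",             \
                    cudaGetErrorString(err_), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

__global__ void fill(float *x, int n, float value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        x[i] = value;
    }
}

// Fills the n elements of h_x with `value` on the GPU, checking every call.
void fill_on_gpu(float *h_x, int n, float value)
{
    float *d_x = NULL;
    CUDA_CHECK(cudaMalloc(&d_x, n * sizeof(float)));

    fill<<<(n + 127) / 128, 128>>>(d_x, n, value);
    CUDA_CHECK(cudaGetLastError());       // errors from the launch itself
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran

    CUDA_CHECK(cudaMemcpy(h_x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_x));
}
```

Exiting on the first error is one possible policy for a debugging build; returning the error code to the caller would be an equally valid choice.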


Check and use available hardware features
  • printf statements are possible within kernels (since Fermi)
  • Use double precision floating point operations (since GT200)
  • Enable ECC and check whether single or double bit errors occurred using nvidia-smi -q (since Fermi)

Check final numerical results on host
  • While porting, it is recommended to compare all computed GPU results with host results
  1. Compute check sums of GPU and host array values
  2. If not sufficient, compare arrays element-wise
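The two comparison steps can be sketched for the SAXPY kernel as follows. The names `saxpy_host` and `compare_saxpy`, the test data and the tolerance parameter are assumptions made for this illustration. A tolerance is used because GPU and host results are not guaranteed to be bitwise identical, for example when the GPU fuses the multiply and the add.

```cuda
#include <cmath>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + y[i];
    }
}

// Host reference implementation of the same computation.
void saxpy_host(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}

// Hypothetical helper: returns the number of mismatching elements
// (0 if the check sums already agree), or -1 on a CUDA error.
int compare_saxpy(int n, float a, float tol)
{
    size_t bytes = n * sizeof(float);
    float *h_x   = (float*) malloc(bytes);
    float *h_gpu = (float*) malloc(bytes);
    float *h_ref = (float*) malloc(bytes);
    for (int i = 0; i < n; i++) {  // arbitrary test data
        h_x[i]   = (float)(i % 7);
        h_gpu[i] = 1.0f;
        h_ref[i] = 1.0f;
    }

    float *d_x = NULL, *d_y = NULL;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_gpu, bytes, cudaMemcpyHostToDevice);
    saxpy_parallel<<<(n + 127) / 128, 128>>>(n, a, d_x, d_y);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        err = cudaMemcpy(h_gpu, d_y, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(d_x);
    cudaFree(d_y);

    int mismatches = -1;
    if (err == cudaSuccess) {
        saxpy_host(n, a, h_x, h_ref);

        // Step 1: compare check sums of GPU and host results.
        double sum_gpu = 0.0, sum_ref = 0.0;
        for (int i = 0; i < n; i++) {
            sum_gpu += h_gpu[i];
            sum_ref += h_ref[i];
        }
        mismatches = 0;

        // Step 2: if the check sums differ, compare element-wise.
        if (fabs(sum_gpu - sum_ref) > tol) {
            for (int i = 0; i < n; i++) {
                if (fabs(h_gpu[i] - h_ref[i]) > tol) mismatches++;
            }
        }
    }
    free(h_x); free(h_gpu); free(h_ref);
    return mismatches;
}
```

Matching check sums do not prove that every element matches, since errors can cancel; running the element-wise comparison unconditionally is the stricter variant.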


Check intermediate results
  • If results are directly stored in global memory: dive on the result array
  • If results are stored in on-chip memory (e.g. registers): tedious debugging
  • TotalView: view of variables across CUDA threads not possible yet

  1. Create an additional array on the host for intermediate results with size
     #threads * #results * sizeof(result)
     • Use a corresponding array on the GPU: each thread stores its result at a unique index
     • Transfer the array back to the host and examine the results
  2. If there is only a limited number of thread blocks: create an additional array in shared memory within the kernel function: __shared__ <type> myarray[size]
     • Use defines to exchange access to the on-chip variable with array access
     • Examine results by diving on the array and switching between blocks
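A minimal sketch of the first approach, combined with a define that switches the extra store on and off. All names (`DEBUG_INTERMEDIATE`, `STORE_INTERMEDIATE`, `saxpy_debug`, `run_saxpy_debug`) are hypothetical, and the example assumes one intermediate result per thread, so the debug array has #threads * 1 * sizeof(float) bytes.

```cuda
#include <cuda_runtime.h>

#define DEBUG_INTERMEDIATE 1  // set to 0 to compile the debug store away

#if DEBUG_INTERMEDIATE
// Each thread stores its intermediate value at a unique index.
#define STORE_INTERMEDIATE(buf, idx, val) ((buf)[(idx)] = (val))
#else
#define STORE_INTERMEDIATE(buf, idx, val) ((void)0)
#endif

__global__ void saxpy_debug(int n, float a, const float *x, float *y, float *dbg)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float t = a * x[i];             // intermediate result held on chip
        STORE_INTERMEDIATE(dbg, i, t);  // copy it to global memory for inspection
        y[i] = t + y[i];
    }
}

// Runs the kernel; h_y receives the final results, h_dbg the intermediate ones.
cudaError_t run_saxpy_debug(int n, float a, const float *h_x, float *h_y,
                            float *h_dbg)
{
    size_t bytes = n * sizeof(float);  // #threads * #results (1) * sizeof(result)
    float *d_x = NULL, *d_y = NULL, *d_dbg = NULL;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMalloc(&d_dbg, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

    saxpy_debug<<<(n + 127) / 128, 128>>>(n, a, d_x, d_y, d_dbg);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        err = cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);
    }
    if (err == cudaSuccess) {
        err = cudaMemcpy(h_dbg, d_dbg, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(d_x);
    cudaFree(d_y);
    cudaFree(d_dbg);
    return err;
}
```

After the call, `h_dbg[i]` holds the value of `a * x[i]` that thread `i` computed, which can be examined on the host or by diving on the array in the debugger.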
