In a recent post, I illustrated Six Ways to SAXPY, which includes a CUDA C version. SAXPY stands for "Single-precision A*X Plus Y", and it is a good "hello world" example for parallel computation. In this post I will dissect a more complete version of the CUDA C SAXPY, explaining in detail what is done and why.

A CUDA C program follows a simple sequence of operations: declare and allocate host and device memory, initialize host data, transfer data from the host to the device, execute one or more kernels, and transfer the results from the device back to the host. Keeping this sequence of operations in mind, let's look at a CUDA C example.
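Below is a complete listing. It is a sketch reconstructed around the three fragments dissected in this post (the saxpy kernel signature, the thread-index computation, and the device-to-host copy); the array size of 1<<20 elements, the initial values of x and y, the scalar a = 2.0f, and the error check are illustrative choices.

```cpp
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

__global__
void saxpy(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;              // 1M elements (illustrative choice)
  float *x, *y, *d_x, *d_y;

  // Allocate host memory ...
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));

  // ... and device memory
  cudaMalloc(&d_x, N*sizeof(float));
  cudaMalloc(&d_y, N*sizeof(float));

  // Initialize host data (illustrative values)
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Transfer data from host to device
  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

  // Launch the SAXPY kernel on N elements
  saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);

  // Transfer the result back to the host
  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

  // With these inputs, every element should be 2.0f*1.0f + 2.0f = 4.0f
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmaxf(maxError, fabsf(y[i] - 4.0f));
  printf("Max error: %f\n", maxError);

  // Clean up
  cudaFree(d_x);
  cudaFree(d_y);
  free(x);
  free(y);
  return 0;
}
```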
Launching a Kernel

The kernel is launched from the host with the statement saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y). The information between the triple chevrons is the execution configuration, which dictates how many device threads execute the kernel in parallel. In CUDA there is a hierarchy of threads in software which mimics how thread processors are grouped on the GPU. In the CUDA programming model we speak of launching a kernel with a grid of thread blocks. The first argument in the execution configuration specifies the number of thread blocks in the grid, and the second specifies the number of threads in a thread block.

Thread blocks and grids can be made one-, two-, or three-dimensional by passing dim3 (a simple struct defined by CUDA with x, y, and z members) values for these arguments, but for this simple example we only need one dimension, so we pass integers instead (a hypothetical two-dimensional configuration is sketched at the end of this section). In this case we launch the kernel with thread blocks containing 256 threads, and use integer arithmetic to determine the number of thread blocks required to process all N elements of the arrays ((N+255)/256). For cases where the number of elements in the arrays is not evenly divisible by the thread block size, the kernel code must check for out-of-bounds memory accesses.
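To make the dim3 option concrete, here is a hypothetical two-dimensional kernel and launch; the kernel name, the nx-by-ny domain, and the 16x16 block size are made up for illustration and are not part of the SAXPY example.

```cpp
// Hypothetical 2D kernel: scale each element of an nx-by-ny array.
// Each thread derives its (ix, iy) coordinates from the block and
// thread indices, mirroring the 1D index computation in saxpy.
__global__
void scale2D(int nx, int ny, float a, float *data)
{
  int ix = blockIdx.x*blockDim.x + threadIdx.x;
  int iy = blockIdx.y*blockDim.y + threadIdx.y;
  if (ix < nx && iy < ny)             // 2D out-of-bounds check
    data[iy*nx + ix] *= a;
}

// Launch with dim3 execution configuration arguments, rounding the
// grid dimensions up so the whole domain is covered:
//   dim3 threadsPerBlock(16, 16);
//   dim3 numBlocks((nx + 15)/16, (ny + 15)/16);
//   scale2D<<<numBlocks, threadsPerBlock>>>(nx, ny, 2.0f, d_data);
```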
After the kernel completes, we transfer the results back to the host with cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost); the fourth argument specifies the direction of the copy.

Cleaning Up

After we are finished, we should free any allocated memory. For device memory allocated with cudaMalloc(), simply call cudaFree(); for host memory allocated with malloc(), call free() as usual.

The Device Code

In CUDA, we define kernels such as saxpy using the __global__ declaration specifier. Variables defined within device code do not need to be specified as device variables, because they are assumed to reside on the device. In this case the n, a, and i variables will be stored by each thread in a register, and the pointers x and y must be pointers to the device memory address space. This is indeed true, because we passed d_x and d_y to the kernel when we launched it from the host code. The first two arguments, n and a, however, were not explicitly transferred to the device in host code. Because function arguments are passed by value by default in C/C++, the CUDA runtime can automatically handle the transfer of these values to the device.
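For reference, here is the saxpy kernel again (repeated from the listing above), with comments calling out the index computation and the out-of-bounds check discussed earlier:

```cpp
__global__
void saxpy(int n, float a, float *x, float *y)
{
  // Each thread computes one unique global element index from its
  // block index, the block size, and its thread index within the block.
  int i = blockIdx.x*blockDim.x + threadIdx.x;

  // Guard against out-of-bounds accesses when n is not an exact
  // multiple of the thread block size.
  if (i < n) y[i] = a*x[i] + y[i];
}
```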