You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
for_num_elements defaults to block size 1024 but this is often suboptimal for performance. See NVIDIA's article on optimal number of blocks and threads.
cudaOccupancyMaxPotentialBlockSize makes it possible to compute a reasonably efficient execution configuration for a kernel without having to directly query the kernel’s attributes or the device properties, regardless of what device is present or any compilation details
The CUDA Toolkit version 6.5 also provides a self-documenting, standalone occupancy calculator and launch configurator implementation in <CUDA_Toolkit_Path>/include/cuda_occupancy.h for any use cases that cannot depend on the CUDA software stack. A spreadsheet version of the occupancy calculator is also included (and has been for many CUDA releases). The spreadsheet version is particularly useful as a learning tool that visualizes the impact of changes to the parameters that affect occupancy (block size, registers per thread, and shared memory per thread). You can find more information in the CUDA C Programming Guide and CUDA Runtime API Reference.
For example in C++ CUDA:
#include"stdio.h"
__global__ voidMyKernel(int *array, int arrayCount)
{
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < arrayCount)
{
array[idx] *= array[idx];
}
}
voidlaunchMyKernel(int *array, int arrayCount)
{
int blockSize; // The launch configurator returned block size int minGridSize; // The minimum grid size needed to achieve the // maximum occupancy for a full device launch int gridSize; // The actual grid size needed, based on input size cudaOccupancyMaxPotentialBlockSize( &minGridSize, &blockSize,
MyKernel, 0, 0);
// Round up according to array size
gridSize = (arrayCount + blockSize - 1) / blockSize;
MyKernel<<< gridSize, blockSize >>>(array, arrayCount);
cudaDeviceSynchronize();
The text was updated successfully, but these errors were encountered:
for_num_elements defaults to block size 1024 but this is often suboptimal for performance. See NVIDIA's article on optimal number of blocks and threads.
For example in C++ CUDA:
The text was updated successfully, but these errors were encountered: