
use cudaOccupancyMaxPotentialBlockSize in LaunchConfig::for_num_elems #149

Open
ariasanovsky opened this issue May 15, 2023 · 2 comments

@ariasanovsky

for_num_elems defaults to a block size of 1024, but this is often suboptimal for performance. See NVIDIA's article on choosing the number of blocks and threads.
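For reference, the current fixed-block-size behavior can be sketched like this (plain Rust with tuples rather than cudarc's actual types, which are not reproduced here; the helper name is hypothetical):

```rust
// Sketch of a fixed-block-size launch configuration in the style of
// for_num_elems: always use BLOCK = 1024 threads per block and round the
// grid size up so every element is covered.
const BLOCK: u32 = 1024;

fn config_for_num_elems(n: u32) -> (u32, u32) {
    // Round-up division: number of blocks needed to cover n elements.
    let grid = (n + BLOCK - 1) / BLOCK;
    (grid, BLOCK)
}

fn main() {
    assert_eq!(config_for_num_elems(1), (1, 1024));
    assert_eq!(config_for_num_elems(1024), (1, 1024));
    assert_eq!(config_for_num_elems(1025), (2, 1024));
    println!("ok");
}
```

The grid size adapts to the element count, but the 1024-thread block size is hard-coded regardless of the kernel's register or shared-memory usage, which is what the occupancy API would fix.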

cudaOccupancyMaxPotentialBlockSize makes it possible to compute a reasonably efficient execution configuration for a kernel without having to directly query the kernel’s attributes or the device properties, regardless of what device is present or any compilation details.

The CUDA Toolkit version 6.5 also provides a self-documenting, standalone occupancy calculator and launch configurator implementation in <CUDA_Toolkit_Path>/include/cuda_occupancy.h for any use cases that cannot depend on the CUDA software stack. A spreadsheet version of the occupancy calculator is also included (and has been for many CUDA releases). The spreadsheet version is particularly useful as a learning tool that visualizes the impact of changes to the parameters that affect occupancy (block size, registers per thread, and shared memory per thread). You can find more information in the CUDA C Programming Guide and CUDA Runtime API Reference.

For example in C++ CUDA:

#include <stdio.h>

__global__ void MyKernel(int *array, int arrayCount) 
{ 
  int idx = threadIdx.x + blockIdx.x * blockDim.x; 
  if (idx < arrayCount) 
  { 
    array[idx] *= array[idx]; 
  } 
} 

void launchMyKernel(int *array, int arrayCount) 
{ 
  int blockSize;   // The launch configurator returned block size 
  int minGridSize; // The minimum grid size needed to achieve the 
                   // maximum occupancy for a full device launch 
  int gridSize;    // The actual grid size needed, based on input size 

  cudaOccupancyMaxPotentialBlockSize( &minGridSize, &blockSize, 
                                      MyKernel, 0, 0); 
  // Round up according to array size 
  gridSize = (arrayCount + blockSize - 1) / blockSize; 

  MyKernel<<< gridSize, blockSize >>>(array, arrayCount); 

  cudaDeviceSynchronize();
}

@coreylowman
Owner

Good idea! We can use this in dfdx, it will be much more efficient 😄

@brandonros
Contributor

brandonros commented Jun 12, 2024

Example in case anybody else lands on this:

extern "C" fn block_size_to_dynamic_smem_size(_block_size: std::ffi::c_int) -> usize {
    0
}

// get kernel
let my_kernel = device.get_func("my_module", "my_kernel").unwrap();

// get size
let (min_grid_size, min_block_size) = my_kernel.occupancy_max_potential_block_size(block_size_to_dynamic_smem_size, 0, 0, None)?;
log::info!("min_grid_size = {min_grid_size} min_block_size = {min_block_size}");
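To actually launch with the suggested block size, the grid dimension still has to be rounded up from the element count, just as in the C++ example above. A minimal sketch (the struct below only mirrors the shape of a launch configuration so the example is self-contained; it is not cudarc's own type):

```rust
// Turn an occupancy-suggested block size into a full launch configuration
// by rounding the grid up to cover n elements.
#[derive(Debug, PartialEq)]
struct LaunchConfig {
    grid_dim: (u32, u32, u32),
    block_dim: (u32, u32, u32),
    shared_mem_bytes: u32,
}

fn config_from_occupancy(n: u32, min_block_size: u32) -> LaunchConfig {
    // Round-up division, matching the gridSize computation in the C++ example.
    let grid = (n + min_block_size - 1) / min_block_size;
    LaunchConfig {
        grid_dim: (grid, 1, 1),
        block_dim: (min_block_size, 1, 1),
        shared_mem_bytes: 0,
    }
}

fn main() {
    // Suppose the occupancy API suggested 256 threads per block for 1000 elements.
    let cfg = config_from_occupancy(1000, 256);
    assert_eq!(cfg.grid_dim, (4, 1, 1));
    assert_eq!(cfg.block_dim, (256, 1, 1));
    println!("ok");
}
```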
