
use cudaOccupancyMaxPotentialBlockSize in LaunchConfig::for_num_elems #149

Open
ariasanovsky opened this issue May 15, 2023 · 2 comments

@ariasanovsky

for_num_elems defaults to a block size of 1024, but this is often suboptimal for performance. See NVIDIA's article on choosing the number of blocks and threads.
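For reference, the current fixed-block-size behavior can be sketched like this (plain Rust with tuples rather than cudarc's actual types, which are not reproduced here; the helper name is hypothetical):

```rust
// Sketch of a fixed-block-size launch configuration in the style of
// for_num_elems: always use BLOCK = 1024 threads per block and round the
// grid size up so every element is covered.
const BLOCK: u32 = 1024;

fn config_for_num_elems(n: u32) -> (u32, u32) {
    // Round-up division: number of blocks needed to cover n elements.
    let grid = (n + BLOCK - 1) / BLOCK;
    (grid, BLOCK)
}

fn main() {
    assert_eq!(config_for_num_elems(1), (1, 1024));
    assert_eq!(config_for_num_elems(1024), (1, 1024));
    assert_eq!(config_for_num_elems(1025), (2, 1024));
    println!("ok");
}
```

The grid size adapts to the element count, but the 1024-thread block size is hard-coded regardless of the kernel's register or shared-memory usage, which is what the occupancy API would fix.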

cudaOccupancyMaxPotentialBlockSize makes it possible to compute a reasonably efficient execution configuration for a kernel without having to directly query the kernel’s attributes or the device properties, regardless of what device is present or any compilation details.

The CUDA Toolkit version 6.5 also provides a self-documenting, standalone occupancy calculator and launch configurator implementation in <CUDA_Toolkit_Path>/include/cuda_occupancy.h for any use cases that cannot depend on the CUDA software stack. A spreadsheet version of the occupancy calculator is also included (and has been for many CUDA releases). The spreadsheet version is particularly useful as a learning tool that visualizes the impact of changes to the parameters that affect occupancy (block size, registers per thread, and shared memory per thread). You can find more information in the CUDA C Programming Guide and CUDA Runtime API Reference.

For example in C++ CUDA:

#include <stdio.h>

__global__ void MyKernel(int *array, int arrayCount) 
{ 
  int idx = threadIdx.x + blockIdx.x * blockDim.x; 
  if (idx < arrayCount) 
  { 
    array[idx] *= array[idx]; 
  } 
} 

void launchMyKernel(int *array, int arrayCount) 
{ 
  int blockSize;   // The launch configurator returned block size 
  int minGridSize; // The minimum grid size needed to achieve the 
                   // maximum occupancy for a full device launch 
  int gridSize;    // The actual grid size needed, based on input size 

  cudaOccupancyMaxPotentialBlockSize( &minGridSize, &blockSize, 
                                      MyKernel, 0, 0); 
  // Round up according to array size 
  gridSize = (arrayCount + blockSize - 1) / blockSize; 

  MyKernel<<< gridSize, blockSize >>>(array, arrayCount); 

  cudaDeviceSynchronize();
}

@coreylowman
Owner

Good idea! We can use this in dfdx, it will be much more efficient 😄

@brandonros
Contributor

brandonros commented Jun 12, 2024

Example in case anybody else lands on this:

extern "C" fn block_size_to_dynamic_smem_size(_block_size: std::ffi::c_int) -> usize {
    0
}

// get kernel
let my_kernel = device.get_func("my_module", "my_kernel").unwrap();

// get size
let (min_grid_size, min_block_size) = my_kernel.occupancy_max_potential_block_size(block_size_to_dynamic_smem_size, 0, 0, None)?;
log::info!("min_grid_size = {min_grid_size} min_block_size = {min_block_size}");
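To actually launch with the suggested block size, the grid dimension still has to be rounded up from the element count, just as in the C++ example above. A minimal sketch (the struct below only mirrors the shape of a launch configuration so the example is self-contained; it is not cudarc's own type):

```rust
// Turn an occupancy-suggested block size into a full launch configuration
// by rounding the grid up to cover n elements.
#[derive(Debug, PartialEq)]
struct LaunchConfig {
    grid_dim: (u32, u32, u32),
    block_dim: (u32, u32, u32),
    shared_mem_bytes: u32,
}

fn config_from_occupancy(n: u32, min_block_size: u32) -> LaunchConfig {
    // Round-up division, matching the gridSize computation in the C++ example.
    let grid = (n + min_block_size - 1) / min_block_size;
    LaunchConfig {
        grid_dim: (grid, 1, 1),
        block_dim: (min_block_size, 1, 1),
        shared_mem_bytes: 0,
    }
}

fn main() {
    // Suppose the occupancy API suggested 256 threads per block for 1000 elements.
    let cfg = config_from_occupancy(1000, 256);
    assert_eq!(cfg.grid_dim, (4, 1, 1));
    assert_eq!(cfg.block_dim, (256, 1, 1));
    println!("ok");
}
```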
