Add reduction_unroll_factor to autotuning script #3487

Open
wants to merge 7 commits into
base: main

rdspring1
Collaborator

This PR renames unroll_factor to iteration_unroll_factor and adds reduction_unroll_factor. The new reduction_unroll_factor applies an unroll factor on top of the vectorization factor for the inner reduction domain.

@rdspring1 rdspring1 added the Autotune Generate heuristics through machine learning models. label Nov 27, 2024
@rdspring1 rdspring1 requested a review from liqiangxl November 27, 2024 01:48
@rdspring1 rdspring1 force-pushed the autotune_inner_reduction_2d branch from 7817368 to e7ffb29 Compare December 1, 2024 17:30
@rdspring1 rdspring1 force-pushed the autotune_inner_reduction_2d_update branch from 15bc05e to fdcf6a5 Compare December 1, 2024 17:31
)

# number of reduction elements not handled by a CTA
remaining_reduction = ceil_div(
num_reductions,
- (scheduler_config.bdimx * scheduler_config.vectorize_factor),
+ (scheduler_config.bdimx * vectorize_factor * reduction_unroll_factor),
Collaborator

This should be ceil_div(ceil_div(num_reductions / vectorize_factor, bdimx), reduction_unroll_factor).
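
For reference, a minimal Python sketch of the nested form suggested above next to the flat form from the diff. The function names remaining_reduction_nested and remaining_reduction_flat are hypothetical and not part of the autotuning script, and the sketch assumes vectorize_factor divides num_reductions evenly, so the inner division is exact.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for positive integers."""
    return (a + b - 1) // b


def remaining_reduction_nested(
    num_reductions: int,
    vectorize_factor: int,
    bdimx: int,
    reduction_unroll_factor: int,
) -> int:
    # Hypothetical helper mirroring the reviewer's expression.
    # Assumption: vectorize_factor divides num_reductions evenly.
    vectorized = num_reductions // vectorize_factor
    # Vectorized elements left per thread after splitting across bdimx threads.
    per_thread = ceil_div(vectorized, bdimx)
    # Iterations left after unrolling the per-thread work.
    return ceil_div(per_thread, reduction_unroll_factor)


def remaining_reduction_flat(
    num_reductions: int,
    vectorize_factor: int,
    bdimx: int,
    reduction_unroll_factor: int,
) -> int:
    # Hypothetical helper mirroring the single ceil_div in the diff.
    return ceil_div(
        num_reductions, bdimx * vectorize_factor * reduction_unroll_factor
    )
```

Under the divisibility assumption, both helpers return the same value for positive integers, since nested ceiling divisions collapse into one ceiling division by the product. The nested form mainly makes the order of the splits (vectorize, then bdimx, then unroll) explicit.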

)

- if unroll_factor == 1 and remaining_reduction > 1:
+ if iteration_unroll_factor == 1 and remaining_reduction > 1:
Collaborator

This looks strange to me. Why is grdim set to remaining_reduction? We can do a serial reduction instead of a grid reduction.

Collaborator

nvFuser's default heuristic does:

  // When iteration dim is small, may have unused SMs, to increase SM usage
  // needs to shift from block reduction to grid reduction.
  int64_t grdim = 1;
  while (godim * grdim * 2 <= sm_count && getInnerRemainder() / grdim >= 2) {
    grdim *= 2;
  }
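
A rough Python port of the loop quoted above, for use on the autotuning side. The function name grid_reduction_dim and its parameters are hypothetical, and the sketch assumes the value returned by getInnerRemainder() does not change while the loop runs, so it is passed in as a plain integer.

```python
def grid_reduction_dim(godim: int, sm_count: int, inner_remainder: int) -> int:
    """Double grdim while SMs are still unused and each grid slice keeps
    at least two remaining reduction steps.

    Hypothetical helper; inner_remainder stands in for getInnerRemainder()
    and is assumed constant for the duration of the loop.
    """
    grdim = 1
    # Integer floor division mirrors the integer division in the C++ loop.
    while godim * grdim * 2 <= sm_count and inner_remainder // grdim >= 2:
        grdim *= 2
    return grdim
```

For example, with godim = 8, sm_count = 132 and inner_remainder = 64 the loop stops at grdim = 16, because doubling once more would need 256 blocks. With godim = 200 the iteration dimension already covers the SMs, so grdim stays at 1 and the reduction is not split across the grid.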

Base automatically changed from autotune_inner_reduction_2d to main December 11, 2024 19:43
@rdspring1
Collaborator Author

!build
