Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Polaris example config #1154

Merged
merged 1 commit into from
May 23, 2023
Merged

Conversation

chris-janidlo
Copy link
Contributor

Description

As part of some PR spring cleaning, addressing #951 by including the suggested update from SingleNodeLauncher to MpiExecLauncher, and showing off the ability to pin workers to GPUs via an optional user-opt.

Type of change

  • Documentation update

* Switch to MpiExecLauncher over SingleNodeLauncher, as the former is
  more useful
* Demonstrate capability to pin each worker to a unique GPU
@chris-janidlo chris-janidlo added the no-news-is-good-news This change does not require a news file label May 23, 2023
Copy link
Contributor

@WardLT WardLT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. My suggestion about opt out is, appropriately, optional.

I would also suggest updating the description of the configuration in the documentation to explain what it does (multiple nodes per job, multiple tasks per GPU)

@@ -15,17 +15,21 @@
'scheduler_options': '#PBS -l filesystems=home:grand:eagle\n#PBS -k doe',
# ALCF allocation to use
'account': '',
# Un-comment to give each worker exclusive access to a single GPU
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why make GPU pinning opt-in?

I would vote for opt out for this option because it's rare for apps to fairly share GPUs amongst themselves (I don't know if any), so this default would lead to bad performance.

The balancing point to my suggestion is that restarting workers or reassigning them to different types is broken with accelerator pinning. (sorry, I don't recall the number of the issue I made about it)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WardLT If we have a bug here, (is this the one -> #722) I think it would be better to leave this as an opt-in with a note pointing at the bug.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep! That's the one.

@chris-janidlo chris-janidlo merged commit 4b006ec into main May 23, 2023
@chris-janidlo chris-janidlo deleted the update-polaris-example-config branch May 23, 2023 18:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no-news-is-good-news This change does not require a news file
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants