Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix job resume on non-preemption type failures #803

Merged
merged 16 commits into from
Aug 13, 2024

Conversation

rayg1234
Copy link
Collaborator

@rayg1234 rayg1234 commented Aug 12, 2024

See T192775876,

Slurm jobs that are pre-empted currently call a checkpoint callback and get requeued automatically by slurm. Given a timestamp_id (unique job id) and a checkpoint, the job is able to resume correctly AND resume the same run on wandb.

However, if jobs experience a "node failure" or other methods of failure which does not trigger the "checkpoint" callback. The slurm job is missing the timestamp_id and checkpoint info to be able to resume properly. So it ends up starting a new job (and a new wandb run) instead.

Failure example

Example runs that have failed and restarted with new job from scratch when they should've resumed from last checkpoint:

This PR modifies

  • the task checkpoint loading logic to check for a previously saved checkpoint in the checkpoint dir. This will allow the job to pick up from last checkpoint automatically even if there was a catastrophic failure
  • writes the timestamp id in the slurm submission pickle manually. This allows failed jobs to pickup the timestamp_id

Testing

We can use scontrol requeue <job_id> to simulate a "node failure":

Single-node job requeue

Example job that has been requeued where it did not trigger the checkpoint callback:
https://fairwandb.org/fairchem/fm_testing/runs/2024-08-13-00-08-32-test_resume_requeue

Multi-node job requeue

https://fairwandb.org/fairchem/fm_testing/runs/2024-08-13-20-20-16-test_resume_requeue_5

**Note in this case if the last checkpoint was saved at step N and the most recent step written to wandb is N+K, then the job will resume from step N and overwrite K steps in the logs, it will look like the following:
198 wandb: WARNING (User provided step: 544 is less than current step: 676. Dropping entry: {'train/grad_norm': 10.882038116455078, '_timestamp': 1723508783.0449212}).

It is very important to have deterministic training so that when we repeat these steps the jobs end up in the same future state

Pre-emption

Manual preemptions can be triggered by starting a job on a low priority partition (ie: scavenge) and then start a new job targeting the node that the first job was running (ie: using the srun -w <node_id> ...)

Example job that was premepted where the checkpoint callback was triggered:
https://fairwandb.org/fairchem/fm_testing/runs/2024-08-12-23-40-48-test_resume_node_preemption?nw=nwuserrgao

Copy link

codecov bot commented Aug 12, 2024

Codecov Report

Attention: Patch coverage is 41.66667% with 14 lines in your changes missing coverage. Please review.

Files Patch % Lines
src/fairchem/core/common/slurm.py 35.71% 9 Missing ⚠️
src/fairchem/core/tasks/task.py 66.66% 2 Missing ⚠️
src/fairchem/core/trainers/base_trainer.py 33.33% 2 Missing ⚠️
src/fairchem/core/_cli.py 0.00% 1 Missing ⚠️
Files Coverage Δ
src/fairchem/core/_cli.py 63.49% <0.00%> (-1.03%) ⬇️
src/fairchem/core/tasks/task.py 66.03% <66.66%> (-1.31%) ⬇️
src/fairchem/core/trainers/base_trainer.py 89.17% <33.33%> (-0.39%) ⬇️
src/fairchem/core/common/slurm.py 35.71% <35.71%> (ø)

@rayg1234 rayg1234 marked this pull request as ready for review August 13, 2024 00:39
@rayg1234 rayg1234 requested review from misko, mshuaibii and wood-b August 13, 2024 00:39
@rayg1234 rayg1234 added bug Something isn't working patch Patch version release labels Aug 13, 2024
mshuaibii
mshuaibii previously approved these changes Aug 13, 2024
@rayg1234 rayg1234 enabled auto-merge August 13, 2024 20:42
@rayg1234 rayg1234 disabled auto-merge August 13, 2024 20:56
@rayg1234 rayg1234 enabled auto-merge August 13, 2024 20:58
@rayg1234 rayg1234 disabled auto-merge August 13, 2024 21:04
@rayg1234 rayg1234 enabled auto-merge August 13, 2024 21:35
@rayg1234 rayg1234 added this pull request to the merge queue Aug 13, 2024
Merged via the queue into main with commit 8143ccb Aug 13, 2024
8 checks passed
@rayg1234 rayg1234 deleted the rgao_resume_on_node_failure_2 branch August 13, 2024 23:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working patch Patch version release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants