Fix job resume on non-preemption type failures #803

rayg1234 · 2024-08-12T20:40:39Z

See T192775876,

Slurm jobs that are pre-empted currently call a checkpoint callback and get requeued automatically by slurm. Given a timestamp_id (unique job id) and a checkpoint, the job is able to resume correctly AND resume the same run on wandb.

However, if jobs experience a "node failure" or other methods of failure which does not trigger the "checkpoint" callback. The slurm job is missing the timestamp_id and checkpoint info to be able to resume properly. So it ends up starting a new job (and a new wandb run) instead.

Failure example

Example runs that have failed and restarted with new job from scratch when they should've resumed from last checkpoint:

This PR modifies

the task checkpoint loading logic to check for a previously saved checkpoint in the checkpoint dir. This will allow the job to pick up from last checkpoint automatically even if there was a catastrophic failure
writes the timestamp id in the slurm submission pickle manually. This allows failed jobs to pickup the timestamp_id

Testing

We can use scontrol requeue <job_id> to simulate a "node failure":

Single-node job requeue

Example job that has been requeued where it did not trigger the checkpoint callback:
https://fairwandb.org/fairchem/fm_testing/runs/2024-08-13-00-08-32-test_resume_requeue

Multi-node job requeue

https://fairwandb.org/fairchem/fm_testing/runs/2024-08-13-20-20-16-test_resume_requeue_5

**Note in this case if the last checkpoint was saved at step N and the most recent step written to wandb is N+K, then the job will resume from step N and overwrite K steps in the logs, it will look like the following:
198 wandb: WARNING (User provided step: 544 is less than current step: 676. Dropping entry: {'train/grad_norm': 10.882038116455078, '_timestamp': 1723508783.0449212}).

It is very important to have deterministic training so that when we repeat these steps the jobs end up in the same future state

Pre-emption

Manual preemptions can be triggered by starting a job on a low priority partition (ie: scavenge) and then start a new job targeting the node that the first job was running (ie: using the srun -w <node_id> ...)

Example job that was premepted where the checkpoint callback was triggered:
https://fairwandb.org/fairchem/fm_testing/runs/2024-08-12-23-40-48-test_resume_node_preemption?nw=nwuserrgao

codecov · 2024-08-12T21:17:16Z

Codecov Report

Attention: Patch coverage is 41.66667% with 14 lines in your changes missing coverage. Please review.

Files	Patch %	Lines
src/fairchem/core/common/slurm.py	35.71%	9 Missing ⚠️
src/fairchem/core/tasks/task.py	66.66%	2 Missing ⚠️
src/fairchem/core/trainers/base_trainer.py	33.33%	2 Missing ⚠️
src/fairchem/core/_cli.py	0.00%	1 Missing ⚠️

Files	Coverage Δ
src/fairchem/core/_cli.py	`63.49% <0.00%> (-1.03%)`	⬇️
src/fairchem/core/tasks/task.py	`66.03% <66.66%> (-1.31%)`	⬇️
src/fairchem/core/trainers/base_trainer.py	`89.17% <33.33%> (-0.39%)`	⬇️
src/fairchem/core/common/slurm.py	`35.71% <35.71%> (ø)`

rayg1234 added 2 commits August 12, 2024 20:38

simplest way to fix resume on node failure

8a2e8d0

lint

f1e51db

rayg1234 added 4 commits August 12, 2024 23:06

add slurm util [y

a417dbd

add slurm.py

0b773f0

fix typo

7a2e160

add print

3b0c49c

rayg1234 mentioned this pull request Aug 13, 2024

Allow proper resuming of slurm jobs on job failure #777

Closed

only modify pickle on master

a795ddd

rayg1234 marked this pull request as ready for review August 13, 2024 00:39

rayg1234 requested review from misko, mshuaibii and wood-b August 13, 2024 00:39

rayg1234 and others added 3 commits August 13, 2024 00:44

fix comment

e545e8d

use submit to get pickle path

64e8e6d

Merge branch 'main' into rgao_resume_on_node_failure_2

19e2869

rayg1234 added bug Something isn't working patch Patch version release labels Aug 13, 2024

rayg1234 added 2 commits August 13, 2024 16:50

str paths

6372fe3

typo

41543e2

mshuaibii previously approved these changes Aug 13, 2024

View reviewed changes

typo

f47eadb

rayg1234 dismissed mshuaibii’s stale review via f47eadb August 13, 2024 18:26

rayg1234 added 2 commits August 13, 2024 11:30

Merge branch 'main' into rgao_resume_on_node_failure_2

a584dea

Merge branch 'main' into rgao_resume_on_node_failure_2

1084d7b

rayg1234 enabled auto-merge August 13, 2024 20:42

rayg1234 disabled auto-merge August 13, 2024 20:56

timestamp_id was already in cmd

896efd6

rayg1234 enabled auto-merge August 13, 2024 20:58

rayg1234 disabled auto-merge August 13, 2024 21:04

mshuaibii approved these changes Aug 13, 2024

View reviewed changes

rayg1234 enabled auto-merge August 13, 2024 21:35

rayg1234 added this pull request to the merge queue Aug 13, 2024

Merged via the queue into main with commit 8143ccb Aug 13, 2024
8 checks passed

rayg1234 deleted the rgao_resume_on_node_failure_2 branch August 13, 2024 23:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix job resume on non-preemption type failures #803

Fix job resume on non-preemption type failures #803

rayg1234 commented Aug 12, 2024 •

edited

Loading

codecov bot commented Aug 12, 2024 •

edited

Loading

Fix job resume on non-preemption type failures #803

Fix job resume on non-preemption type failures #803

Conversation

rayg1234 commented Aug 12, 2024 • edited Loading

Failure example

This PR modifies

Testing

Single-node job requeue

Multi-node job requeue

Pre-emption

codecov bot commented Aug 12, 2024 • edited Loading

Codecov Report

rayg1234 commented Aug 12, 2024 •

edited

Loading

codecov bot commented Aug 12, 2024 •

edited

Loading