-
Notifications
You must be signed in to change notification settings - Fork 5.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[tune/train] Restore Tuner and results properly from moved storage path #40647
[tune/train] Restore Tuner and results properly from moved storage path #40647
Conversation
Signed-off-by: Justin Yu <[email protected]>
Signed-off-by: Justin Yu <[email protected]>
Signed-off-by: Justin Yu <[email protected]>
Signed-off-by: Justin Yu <[email protected]>
Signed-off-by: Justin Yu <[email protected]>
Signed-off-by: Justin Yu <[email protected]>
Signed-off-by: Justin Yu <[email protected]>
Signed-off-by: Justin Yu <[email protected]>
path=checkpoint_result.checkpoint.path.replace( | ||
original_storage.trial_fs_path, new_storage.trial_fs_path, 1 | ||
), | ||
filesystem=new_storage.storage_filesystem, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updating the storage filesystem means that we require resuming from an experiment directory that contains everything -- it's not possible to start training on S3, download everything except for checkpoints to local, then restore from local.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this makes sense... let's document this somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Given that checkpoints are stored in trial directory (also experiment directory), generally the users will download everything together?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so. However, I think the reason people do this is because loading results from cloud is not so easy.
path=checkpoint_result.checkpoint.path.replace( | ||
original_storage.trial_fs_path, new_storage.trial_fs_path, 1 | ||
), | ||
filesystem=new_storage.storage_filesystem, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this makes sense... let's document this somewhere?
# TODO(justinvyu): [populate_exception] for storage_path != None | ||
# assert len(results.errors) == 1 | ||
training_iteration = results[0].metrics["training_iteration"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't work right now because it's looking for the error locally at ~/ray_results
rather than the storage path. (Has always been a gap in functionality)
…le_moved_storage_path
Signed-off-by: Justin Yu <[email protected]>
Signed-off-by: Justin Yu <[email protected]>
…le_moved_storage_path
…le_moved_storage_path
trials = [] | ||
trial_states = experiment_state["trial_data"] | ||
for trial_json_state, trial_runtime_metadata in trial_states: | ||
trial = Trial.from_json_state(trial_json_state, stub=True) | ||
trial.restore_run_metadata(trial_runtime_metadata) | ||
# TODO(justinvyu): [handle_moved_storage_path] | ||
|
||
new_storage = copy.copy(trial.storage) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would happen if a experiment directory originally on S3, then moved to disk. Will we automatically detect the and build a new file system object?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tuner.restore("local_path")
would pass in a local storage path and would be resolved into a local filesystem. I should add a test for this.
def test_result_grid_moved_experiment_path(ray_start_2_cpus, tmpdir): | ||
# TODO(justinvyu): [handle_moved_storage_path] | ||
pytest.skip("Not implemented yet.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this test got merged with the tuner restore test
Why are these changes needed?
This is a known regression introduced in 2.7: moving the path of the experiment directory and attempting to restore the experiment and/or the experiment results doesn't work due to the absolute paths saved in the trial metadata.
This PR implements a fix similar to #31669 -- replacing the root of the tracked checkpoint paths with the new storage path, and updating on experiment restoration / result loading from a path.
Related issue number
Closes #40585
Closes #40484
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.