[tune/train] Restore Tuner and results properly from moved storage path #40647

Merged: justinvyu merged 16 commits into ray-project:master from the handle_moved_storage_path branch on Oct 30, 2023

justinvyu
Contributor

@justinvyu justinvyu commented Oct 24, 2023

Why are these changes needed?

This is a known regression introduced in 2.7: moving the path of the experiment directory and attempting to restore the experiment and/or the experiment results doesn't work due to the absolute paths saved in the trial metadata.

This PR implements a fix similar to #31669 -- replacing the root of the tracked checkpoint paths with the new storage path, and updating on experiment restoration / result loading from a path.
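The root replacement described above can be sketched in isolation. This is a minimal illustration, not the code from this PR: the helper name `reroot_checkpoint_path` and the example paths are hypothetical, and the prefix check is an extra safeguard added for the sketch.

```python
def reroot_checkpoint_path(checkpoint_path: str, old_root: str, new_root: str) -> str:
    """Swap the leading old_root of checkpoint_path for new_root."""
    if not checkpoint_path.startswith(old_root):
        raise ValueError(f"{checkpoint_path!r} is not under {old_root!r}")
    # count=1 so that only the leading root is replaced, even if the same
    # substring shows up again deeper in the path.
    return checkpoint_path.replace(old_root, new_root, 1)


# Hypothetical paths: the experiment was moved from /mnt/old to /mnt/new.
old_trial_dir = "/mnt/old/exp/trial_0"
new_trial_dir = "/mnt/new/exp/trial_0"
moved = reroot_checkpoint_path(
    "/mnt/old/exp/trial_0/checkpoint_000003", old_trial_dir, new_trial_dir
)
print(moved)  # /mnt/new/exp/trial_0/checkpoint_000003
```

Limiting the replacement to the first occurrence matters when the old root string happens to repeat inside the path; replacing every occurrence would corrupt the rest of the path.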

Related issue number

Closes #40585
Closes #40484

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@justinvyu justinvyu changed the title from "[tune/train] Restore properly from moved storage path" to "[tune/train] Restore Tuner and results properly from moved storage path" on Oct 24, 2023
path=checkpoint_result.checkpoint.path.replace(
    original_storage.trial_fs_path, new_storage.trial_fs_path, 1
),
filesystem=new_storage.storage_filesystem,
Contributor Author

Updating the storage filesystem means that we require resuming from an experiment directory that contains everything -- it's not possible to start training on S3, download everything except for checkpoints to local, then restore from local.

Contributor

I think this makes sense... let's document this somewhere?

Member

@woshiyyya woshiyyya Oct 28, 2023

Makes sense. Given that checkpoints are stored in the trial directory (which is inside the experiment directory), users will generally download everything together, right?

Contributor Author

I think so. However, I think the reason people do this is that loading results from the cloud is not so easy.
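
The constraint discussed in this thread (the restored experiment directory has to contain the checkpoints too) could be checked up front before restoring. A minimal sketch, assuming checkpoint locations are tracked as paths relative to the experiment directory; the helper `missing_checkpoints` and the directory names are hypothetical and not part of Ray.

```python
import tempfile
from pathlib import Path
from typing import List


def missing_checkpoints(experiment_dir: Path, checkpoint_rel_paths: List[str]) -> List[str]:
    """Return the tracked checkpoint paths that do not exist under experiment_dir."""
    return [rel for rel in checkpoint_rel_paths if not (experiment_dir / rel).is_dir()]


with tempfile.TemporaryDirectory() as tmp:
    exp_dir = Path(tmp) / "exp"
    # Simulate a partial download: only one of the two checkpoints was copied.
    (exp_dir / "trial_0" / "checkpoint_000000").mkdir(parents=True)
    tracked = ["trial_0/checkpoint_000000", "trial_0/checkpoint_000001"]
    missing = missing_checkpoints(exp_dir, tracked)

print(missing)  # ['trial_0/checkpoint_000001']
```

A non-empty result means the local copy is incomplete, which is the case where restoring from the local path would not find every checkpoint.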

Comment on lines +793 to 795
# TODO(justinvyu): [populate_exception] for storage_path != None
# assert len(results.errors) == 1
training_iteration = results[0].metrics["training_iteration"]
Contributor Author

This doesn't work right now because it's looking for the error locally at ~/ray_results rather than at the storage path. (This has always been a gap in functionality.)

trials = []
trial_states = experiment_state["trial_data"]
for trial_json_state, trial_runtime_metadata in trial_states:
    trial = Trial.from_json_state(trial_json_state, stub=True)
    trial.restore_run_metadata(trial_runtime_metadata)
    # TODO(justinvyu): [handle_moved_storage_path]

    new_storage = copy.copy(trial.storage)
Member

@woshiyyya woshiyyya Oct 27, 2023

What would happen if an experiment directory was originally on S3 and then moved to local disk? Will we automatically detect this and build a new filesystem object?

Contributor Author

Tuner.restore("local_path") would pass in a local storage path and would be resolved into a local filesystem. I should add a test for this.
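
To illustrate the idea of a storage path string deciding which filesystem gets used, here is a small standard-library sketch. It is not Ray's resolver, which is not shown in this thread; the function `storage_kind` and the example paths are hypothetical.

```python
from urllib.parse import urlparse


def storage_kind(storage_path: str) -> str:
    """Return the URI scheme of storage_path, or "local" when there is none."""
    scheme = urlparse(storage_path).scheme
    # Treat bare paths and explicit file:// URIs as local storage.
    # Note: Windows drive paths such as "C:\\data" are not handled here.
    return scheme if scheme not in ("", "file") else "local"


print(storage_kind("s3://bucket/exp"))             # s3
print(storage_kind("/home/user/ray_results/exp"))  # local
```

Under this reading, an experiment that started on S3 and was later copied to disk is treated as local simply because the path passed at restore time has no remote scheme.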

Comment on lines -212 to -214
def test_result_grid_moved_experiment_path(ray_start_2_cpus, tmpdir):
    # TODO(justinvyu): [handle_moved_storage_path]
    pytest.skip("Not implemented yet.")
Contributor Author

This test got merged with the Tuner restore test.

@justinvyu justinvyu merged commit 0ec69f0 into ray-project:master Oct 30, 2023
2 checks passed
@justinvyu justinvyu deleted the handle_moved_storage_path branch October 30, 2023 20:32
3 participants