[tune/train] Restore Tuner and results properly from moved storage path #40647

justinvyu · 2023-10-24T22:49:29Z

Why are these changes needed?

This is a known regression introduced in 2.7: moving the path of the experiment directory and attempting to restore the experiment and/or the experiment results doesn't work due to the absolute paths saved in the trial metadata.

This PR implements a fix similar to #31669 -- replacing the root of the tracked checkpoint paths with the new storage path, and updating on experiment restoration / result loading from a path.
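As a rough illustration of the root-replacement idea described above, here is a minimal sketch. It is not the code from this PR: `rebase_path` and the example directories are hypothetical names chosen for illustration. Unlike a bare `str.replace(old, new, 1)`, this version checks that the old root really is a prefix before swapping it.

```python
def rebase_path(path: str, old_root: str, new_root: str) -> str:
    """Swap the leading `old_root` of `path` for `new_root`.

    Hypothetical helper for illustration; not part of Ray's API.
    """
    old_root = old_root.rstrip("/")
    new_root = new_root.rstrip("/")
    # Only rewrite paths that actually live under the old root.
    if path != old_root and not path.startswith(old_root + "/"):
        raise ValueError(f"{path!r} is not under {old_root!r}")
    return new_root + path[len(old_root):]


# Example values are made up; real trial directories will differ.
old_trial_dir = "/mnt/old/exp/trial_0"
new_trial_dir = "/mnt/new/exp/trial_0"
checkpoint_path = "/mnt/old/exp/trial_0/checkpoint_000001"

print(rebase_path(checkpoint_path, old_trial_dir, new_trial_dir))
```

The prefix check is a design choice of this sketch: it fails loudly when metadata points outside the old experiment directory instead of silently leaving a stale path in place.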

Related issue number

Closes #40585
Closes #40484

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Justin Yu <[email protected]>

justinvyu · 2023-10-24T23:33:51Z

python/ray/tune/experiment/trial.py

+                path=checkpoint_result.checkpoint.path.replace(
+                    original_storage.trial_fs_path, new_storage.trial_fs_path, 1
+                ),
+                filesystem=new_storage.storage_filesystem,


Updating the storage filesystem means that we require resuming from an experiment directory that contains everything -- it's not possible to start training on S3, download everything except for checkpoints to local, then restore from local.

I think this makes sense... let's document this somewhere?

Makes sense. Given that checkpoints are stored in the trial directory (which is inside the experiment directory), will users generally download everything together?

I think so. However, I think the reason people do this is because loading results from cloud is not so easy.
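To make the "experiment directory must contain everything" requirement from this thread concrete, here is a small sketch of a pre-flight check a user could run before restoring from a copied directory. It is an assumption-laden illustration, not something this PR adds: `missing_checkpoints` is a hypothetical helper, and the `checkpoint_000000`-style directory names are example values.

```python
import tempfile
from pathlib import Path


def missing_checkpoints(trial_dir, checkpoint_dir_names):
    """Return the tracked checkpoint directory names absent from trial_dir.

    Hypothetical helper for illustration; not part of Ray's API.
    """
    root = Path(trial_dir)
    return [name for name in checkpoint_dir_names if not (root / name).is_dir()]


# Simulate a partial download: the metadata tracks two checkpoints,
# but only one was copied to the new location.
with tempfile.TemporaryDirectory() as tmp:
    trial_dir = Path(tmp) / "trial_0"
    (trial_dir / "checkpoint_000000").mkdir(parents=True)
    tracked = ["checkpoint_000000", "checkpoint_000001"]
    print(missing_checkpoints(trial_dir, tracked))
```

A non-empty result means the copy left out checkpoints that the trial metadata still references, which is the case this thread says is not supported.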

matthewdeng · 2023-10-25T23:40:38Z

python/ray/tune/experiment/trial.py

+                path=checkpoint_result.checkpoint.path.replace(
+                    original_storage.trial_fs_path, new_storage.trial_fs_path, 1
+                ),
+                filesystem=new_storage.storage_filesystem,


I think this makes sense... let's document this somewhere?

python/ray/tune/analysis/experiment_analysis.py

justinvyu · 2023-10-25T23:50:33Z

python/ray/tune/tests/test_tuner_restore.py

+    # TODO(justinvyu): [populate_exception] for storage_path != None
+    # assert len(results.errors) == 1
    training_iteration = results[0].metrics["training_iteration"]


This doesn't work right now because it looks for the error locally at ~/ray_results rather than at the storage path. (This has always been a gap in functionality.)

python/ray/train/_internal/storage.py

woshiyyya · 2023-10-27T18:10:53Z

python/ray/tune/analysis/experiment_analysis.py

        trials = []
        trial_states = experiment_state["trial_data"]
        for trial_json_state, trial_runtime_metadata in trial_states:
            trial = Trial.from_json_state(trial_json_state, stub=True)
            trial.restore_run_metadata(trial_runtime_metadata)
-            # TODO(justinvyu): [handle_moved_storage_path]
+
+            new_storage = copy.copy(trial.storage)


What would happen if an experiment directory was originally on S3 and then moved to local disk? Will we automatically detect that and build a new filesystem object?

Tuner.restore("local_path") would pass in a local storage path, which would be resolved into a local filesystem. I should add a test for this.
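The idea that the filesystem follows from the path passed at restore time can be sketched with the standard library alone. This is only an illustration of the concept and makes no claim about how Ray resolves filesystems internally: `split_storage_path` is a hypothetical helper, and the bucket and directory names are made up.

```python
from urllib.parse import urlparse


def split_storage_path(storage_path: str):
    """Classify a storage path as local or remote based on its URI scheme.

    Hypothetical helper for illustration; not part of Ray's API.
    Note: Windows drive paths such as "C:\\data" are not handled here.
    """
    parsed = urlparse(storage_path)
    if parsed.scheme in ("", "file"):
        # Plain paths and file:// URIs both refer to the local filesystem.
        return "local", parsed.path
    # For remote URIs, keep the bucket/host together with the path.
    return parsed.scheme, parsed.netloc + parsed.path


for candidate in ("/tmp/ray_results/exp", "file:///tmp/exp", "s3://bucket/exp"):
    print(candidate, "->", split_storage_path(candidate))
```

In this sketch, an experiment first written to an `s3://` URI and later restored from a plain directory path is classified as local purely from the new path, with no reference to where it was originally stored.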

justinvyu · 2023-10-27T23:18:01Z

python/ray/tune/tests/test_result_grid.py

-def test_result_grid_moved_experiment_path(ray_start_2_cpus, tmpdir):
-    # TODO(justinvyu): [handle_moved_storage_path]
-    pytest.skip("Not implemented yet.")


this test got merged with the tuner restore test

justinvyu added 8 commits October 24, 2023 14:24

Add set_storage method for updating absolute paths in trial

fafe238

Signed-off-by: Justin Yu <[email protected]>

type annotation for checkpoint manager

e538749

Signed-off-by: Justin Yu <[email protected]>

use set_storage in exp analysis and tune controller restore

315fd84

Signed-off-by: Justin Yu <[email protected]>

update test

d1216f3

Signed-off-by: Justin Yu <[email protected]>

mark the broken exception population

589e6ad

Signed-off-by: Justin Yu <[email protected]>

add unit test

c92ab1d

Signed-off-by: Justin Yu <[email protected]>

rename

e97516b

Signed-off-by: Justin Yu <[email protected]>

rename 2

bce55d9

Signed-off-by: Justin Yu <[email protected]>

justinvyu changed the title ~~[tune/train] Restore properly from moved storage path~~ [tune/train] Restore Tuner and results properly from moved storage path Oct 24, 2023

justinvyu requested review from matthewdeng and woshiyyya October 24, 2023 22:52

justinvyu assigned matthewdeng and woshiyyya Oct 24, 2023

justinvyu commented Oct 24, 2023

matthewdeng approved these changes Oct 25, 2023

justinvyu commented Oct 25, 2023

justinvyu added 4 commits October 26, 2023 12:25

Merge branch 'master' of https://github.com/ray-project/ray into handle_moved_storage_path

22eb216

remove the original storage_path as an attribute of StorageContext

67eb6d3

Signed-off-by: Justin Yu <[email protected]>

add some info to docstrings

c499622

Signed-off-by: Justin Yu <[email protected]>

Merge branch 'master' of https://github.com/ray-project/ray into handle_moved_storage_path

c2d1b25

justinvyu mentioned this pull request Oct 27, 2023

[Tune] FileNotFoundError on params.json when restoring Tune experiment from remote storage #40484

Closed

Merge branch 'master' of https://github.com/ray-project/ray into handle_moved_storage_path

b05dbbe

woshiyyya reviewed Oct 27, 2023

python/ray/train/_internal/storage.py

woshiyyya reviewed Oct 27, 2023

woshiyyya approved these changes Oct 27, 2023

justinvyu commented Oct 27, 2023

justinvyu added 3 commits October 27, 2023 16:20

Merge branch 'master' of https://github.com/ray-project/ray into handle_moved_storage_path

7157d22

Merge branch 'master' of https://github.com/ray-project/ray into handle_moved_storage_path

babb1bf

fix experiment sync down to create missing dirs

d7ff5df

Signed-off-by: Justin Yu <[email protected]>

justinvyu merged commit 0ec69f0 into ray-project:master Oct 30, 2023
2 checks passed

justinvyu deleted the handle_moved_storage_path branch October 30, 2023 20:32
