[Tune] FileNotFoundError on params.json when restoring Tune experiment from remote storage #40484
Comments
I have the same issue. I trained my model on my PC and cloned the results directory into Colab. Now when I try to restore the trained agent, it seems like it can't read the trials correctly!
Reproduction script:

```python
# Imports below are assumed from the Ray Tune / RLlib APIs used in this class
# (on older Ray versions, RunConfig/FailureConfig/CheckpointConfig live under ray.air).
from typing import Any

from ray import tune
from ray.rllib.algorithms.algorithm import Algorithm
from ray.train import CheckpointConfig, FailureConfig, RunConfig
from ray.tune import TuneConfig
from ray.tune.registry import register_env
from ray.tune.search import ConcurrencyLimiter


class DRLlibv2:
    def __init__(
        self,
        trainable: str | Any,
        params: dict,
        train_env=None,
        run_name: str = "tune_run",
        local_dir: str = "tune_results",
        search_alg=None,
        concurrent_trials: int = 0,
        num_samples: int = 0,
        scheduler_=None,
        # num_cpus: float | int = 2,
        dataframe_save: str = "tune.csv",
        metric: str = "episode_reward_mean",
        mode: str | list[str] = "max",
        max_failures: int = 0,
        training_iterations: int = 100,
        checkpoint_num_to_keep: None | int = None,
        checkpoint_freq: int = 0,
        reuse_actors: bool = True,
    ):
        self.params = params
        # if train_env is not None:
        #     register_env(self.params['env'], lambda env_config: train_env(env_config))
        self.train_env = train_env
        self.run_name = run_name
        self.local_dir = local_dir
        self.search_alg = search_alg
        if concurrent_trials != 0:
            self.search_alg = ConcurrencyLimiter(
                self.search_alg, max_concurrent=concurrent_trials
            )
        self.scheduler_ = scheduler_
        self.num_samples = num_samples
        self.trainable = trainable
        if isinstance(self.trainable, str):
            self.trainable = self.trainable.upper()
        # self.num_cpus = num_cpus
        self.dataframe_save = dataframe_save
        self.metric = metric
        self.mode = mode
        self.max_failures = max_failures
        self.training_iterations = training_iterations
        self.checkpoint_freq = checkpoint_freq
        self.checkpoint_num_to_keep = checkpoint_num_to_keep
        self.reuse_actors = reuse_actors

    def train_tune_model(self):
        # if ray.is_initialized():
        #     ray.shutdown()
        # ray.init(num_cpus=self.num_cpus, num_gpus=self.params['num_gpus'], ignore_reinit_error=True)
        if self.train_env is not None:
            register_env(self.params['env'], lambda env_config: self.train_env)
        tuner = tune.Tuner(
            self.trainable,
            param_space=self.params,
            tune_config=TuneConfig(
                search_alg=self.search_alg,
                scheduler=self.scheduler_,
                num_samples=self.num_samples,
                # Pass metric/mode here only when no scheduler is set, to avoid
                # specifying them in two places.
                **({'metric': self.metric, 'mode': self.mode} if self.scheduler_ is None else {}),
                reuse_actors=self.reuse_actors,
            ),
            run_config=RunConfig(
                name=self.run_name,
                storage_path=self.local_dir,
                failure_config=FailureConfig(
                    max_failures=self.max_failures, fail_fast=False
                ),
                stop={"training_iteration": self.training_iterations},
                checkpoint_config=CheckpointConfig(
                    num_to_keep=self.checkpoint_num_to_keep,
                    checkpoint_score_attribute=self.metric,
                    checkpoint_score_order=self.mode,
                    checkpoint_frequency=self.checkpoint_freq,
                    checkpoint_at_end=True,
                ),
                verbose=3,  # Verbosity mode. 0 = silent, 1 = default, 2 = verbose, 3 = detailed
            ),
        )
        self.results = tuner.fit()
        if self.search_alg is not None:
            self.search_alg.save_to_dir(self.local_dir)
        # ray.shutdown()
        return self.results

    def infer_results(self, to_dataframe: str = None, mode: str = "a"):
        results_df = self.results.get_dataframe()
        if to_dataframe is None:
            to_dataframe = self.dataframe_save
        results_df.to_csv(to_dataframe, mode=mode)
        best_result = self.results.get_best_result()
        return results_df, best_result

    def restore_agent(
        self,
        checkpoint_path: str = "",
        restore_search: bool = False,
        resume_unfinished: bool = True,
        resume_errored: bool = False,
        restart_errored: bool = False,
    ):
        # if restore_search:
        #     self.search_alg = self.search_alg.restore_from_dir(self.local_dir)
        if checkpoint_path == "":
            checkpoint_path = self.results.get_best_result().checkpoint._local_path
        restored_agent = tune.Tuner.restore(
            checkpoint_path,
            trainable=self.trainable,
            param_space=self.params,
            restart_errored=restart_errored,
            resume_unfinished=resume_unfinished,
            resume_errored=resume_errored,
        )
        print(restored_agent)
        self.results = restored_agent.get_results()
        if self.search_alg is not None:
            self.search_alg.save_to_dir(self.local_dir)
        return self.results

    def get_test_agent(self, test_env_name: str = None, test_env=None, checkpoint=None):
        # if test_env is not None:
        #     register_env(test_env_name, lambda config: [test_env])
        if checkpoint is None:
            checkpoint = self.results.get_best_result().checkpoint
        testing_agent = Algorithm.from_checkpoint(checkpoint)
        # testing_agent.config['env'] = test_env_name
        return testing_agent


# local_dir, train_config, and search_alg come from the surrounding notebook/script.
drl_agent = DRLlibv2(
    trainable="TD3",
    # train_env=RankingEnv,
    # num_cpus=num_cpus,
    run_name="TD3_TRAIN",
    local_dir=local_dir,
    params=train_config.to_dict(),
    num_samples=1,  # Number of hyperparameter configurations to sample
    # training_iterations=5,
    checkpoint_freq=5,
    # scheduler_=scheduler_,
    search_alg=search_alg,
    metric="episode_reward_mean",
    mode="max",
    # callbacks=[wandb_callback]
)

results = drl_agent.restore_agent((local_dir / "TD3_TRAIN").as_posix())
```
@aloysius-lim Thanks for filing this issue. This has highlighted some problematic sync-down logic that happens on restoration, but it actually only turned up now because of the custom `fsspec` filesystem (`adlfs`) being used.
The `fsspec` implementation in `adlfs` behaves differently here. See: https://github.com/fsspec/adlfs/blob/f15c37a43afd87a04f01b61cd90294dd57181e1d/adlfs/spec.py#L1128

Compare this to the `s3fs` implementation: https://github.com/fsspec/s3fs/blob/2c074502c2d6a9be0d3f05eb678f4cc5add2e7e5/s3fs/core.py#L787

I can put up a fix on our end to generally make the sync-down logic more robust, but this is actually something that should be fixed upstream in `adlfs`.

Edit: I've posted an issue on their repo with a recommended fix -- maybe you can continue from that? fsspec/adlfs#435
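For context, sync-down logic of this general shape depends on the storage backend's `fsspec` file listing behaving consistently across backends. The sketch below is purely illustrative (it is not Ray's actual implementation); `sync_down` and the paths are made-up names used only to show why an inconsistent `find()` can make files like `params.json` go missing locally:

```python
# Illustrative sketch only -- not Ray's actual sync-down code.
import os
import fsspec


def sync_down(remote_uri: str, local_dir: str) -> None:
    fs, remote_path = fsspec.core.url_to_fs(remote_uri)
    # If find() on this backend omits some files, or returns paths in a
    # different form than s3fs does, those files are never copied locally,
    # and later reads (e.g. of params.json) fail with FileNotFoundError.
    for remote_file in fs.find(remote_path):
        relative = remote_file[len(remote_path):].lstrip("/")
        local_path = os.path.join(local_dir, relative)
        os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
        fs.get(remote_file, local_path)
```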
@fardinabbasi I believe your issue is a different one that will be solved by #40647
@aloysius-lim On Ray nightly, this logic has been updated so that it should no longer error for you. Let me know if it works out for you. @fardinabbasi Your issue should also be solved by the same PR.
Hey @aloysius-lim, were you able to get a successful run with the nightly?
I'm sorry for the long radio silence. My issue is now resolved, thank you! I did not encounter any pickling / serialization issues.
Hey @aloysius-lim. I'm still experiencing a pickling/serialization issue with the following dependencies:
Are your dependencies the same as mentioned above? #40484 (comment)
@grizzlybearg they were the same, except I updated Ray to the latest version.
@aloysius-lim thanks for the update. I'll try bumping down some requirements and test.
What happened + What you expected to happen
Given a Ray Tune experiment has been run with `fsspec`-backed remote storage (specifically using `adlfs` for Azure Blob Storage),
When the tuner is restored using `tune.Tuner.restore()` and `tuner.fit()` is run,
Then a `FileNotFoundError` is thrown, stating that `params.json` cannot be found in the root folder of the experiment.

Here is the error log, with private details (paths) redacted:
A look in the remote storage folder confirms that `params.json` is not present at the root of the checkpoint folder. However, it is present in the subfolders of the individual trials:

This error does not happen when using a local storage path for the checkpoints.
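A minimal sketch of the pattern that triggers this is shown below. The trainable, container name, and experiment name are placeholders for illustration and are not taken from the original report:

```python
# Minimal sketch of the failure pattern described above (placeholder names).
from ray import train, tune


def my_trainable(config):
    # Placeholder trainable; reports a single dummy metric.
    return {"episode_reward_mean": 0.0}


storage_path = "az://my-container/ray-results"  # adlfs-backed Azure Blob Storage path (placeholder)

# Original run, writing results to remote storage:
tuner = tune.Tuner(
    my_trainable,
    run_config=train.RunConfig(name="my_experiment", storage_path=storage_path),
)
tuner.fit()

# Later restoration -- this is where the FileNotFoundError on params.json appears:
restored = tune.Tuner.restore(f"{storage_path}/my_experiment", trainable=my_trainable)
restored.fit()
```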
Versions / Dependencies
OS: macOS 14.0
python: 3.10.12
adlfs: 2023.9.0
fsspec: 2023.9.2
pyarrow: 13.0.0
ray: 2.7.0
torch: 2.0.1
Reproduction script
This example uses `adlfs` to access Azure Blob Storage. I do not have access to other remote storage services to test this elsewhere.

Issue Severity
High: It blocks me from completing my task.