Cannot load checkpoint that contains ':' in filename with new release of fsspec #19124

Closed
tlpss opened this issue Dec 7, 2023 · 2 comments
Labels
3rd party, bug, ver: 1.9.x

tlpss commented Dec 7, 2023

Bug description

With the latest release (2023.12.1) of fsspec, a checkpoint whose filename contains a : can no longer be saved or loaded the way the docs suggest.

Wandb, for example, uses such filenames for its checkpoint artifacts.

What version are you seeing the problem on?

v1.9

How to reproduce the bug

import pytorch_lightning as pl

class DummyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.test = "test"


    
model = DummyModel()
trainer = pl.Trainer()
trainer.strategy._lightning_module = model
trainer._checkpoint_connector.save_checkpoint("model:test.ckpt")
print("loading checkpoint")
model.load_from_checkpoint("model:test.ckpt")

Error messages and logs

GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/tlips/miniconda3/envs/irm/lib/python3.10/site-packages/pytorch_lightning/trainer/setup.py:176: PossibleUserWarning: GPU available but not used. Set `accelerator` and `devices` using `Trainer(accelerator='gpu', devices=1)`.
  rank_zero_warn(
/home/tlips/miniconda3/envs/irm/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:67: UserWarning: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
  warning_cache.warn(
Traceback (most recent call last):
  File "/home/tlips/Documents/intelligent_robot_manipulation/practicals/5_keypoint_detection/test.py", line 13, in <module>
    trainer._checkpoint_connector.save_checkpoint("model:test.ckpt")
  File "/home/tlips/miniconda3/envs/irm/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 511, in save_checkpoint
    self.trainer.strategy.save_checkpoint(_checkpoint, filepath, storage_options=storage_options)
  File "/home/tlips/miniconda3/envs/irm/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 466, in save_checkpoint
    self.checkpoint_io.save_checkpoint(checkpoint, filepath, storage_options=storage_options)
  File "/home/tlips/miniconda3/envs/irm/lib/python3.10/site-packages/lightning_fabric/plugins/io/torch_io.py", line 50, in save_checkpoint
    fs = get_filesystem(path)
  File "/home/tlips/miniconda3/envs/irm/lib/python3.10/site-packages/lightning_fabric/utilities/cloud_io.py", line 52, in get_filesystem
    fs, _ = url_to_fs(str(path), **kwargs)
  File "/home/tlips/miniconda3/envs/irm/lib/python3.10/site-packages/fsspec/core.py", line 383, in url_to_fs
    chain = _un_chain(url, kwargs)
  File "/home/tlips/miniconda3/envs/irm/lib/python3.10/site-packages/fsspec/core.py", line 332, in _un_chain
    cls = get_filesystem_class(protocol)
  File "/home/tlips/miniconda3/envs/irm/lib/python3.10/site-packages/fsspec/registry.py", line 233, in get_filesystem_class
    raise ValueError(f"Protocol not known: {protocol}")
ValueError: Protocol not known: model
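To illustrate why the traceback ends in "Protocol not known: model", here is a deliberately simplified sketch. It is not fsspec's actual implementation: `naive_split_protocol`, `resolve`, and `KNOWN_PROTOCOLS` are hypothetical names, and the sketch only assumes that the text before the first colon is being read as a protocol, which is what the error message suggests.

```python
# Illustration only: NOT fsspec's real code. All names here are hypothetical.
KNOWN_PROTOCOLS = {"file", "s3", "gs", "http", "https"}  # illustrative subset


def naive_split_protocol(path: str):
    """Treat everything before the first ':' as a protocol (assumed behaviour)."""
    if ":" in path:
        protocol, rest = path.split(":", 1)
        return protocol, rest
    return None, path


def resolve(path: str):
    """Return (protocol, remainder), defaulting to the local 'file' protocol."""
    protocol, rest = naive_split_protocol(path)
    if protocol is None:
        return "file", rest
    if protocol not in KNOWN_PROTOCOLS:
        # Same message as in the traceback above.
        raise ValueError(f"Protocol not known: {protocol}")
    return protocol, rest


if __name__ == "__main__":
    print(resolve("model_test.ckpt"))
    try:
        resolve("model:test.ckpt")
    except ValueError as exc:
        print(exc)
```

Under this assumption, a plain local filename such as `model:test.ckpt` is indistinguishable from a URL with an unregistered protocol named `model`.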

Environment

Current environment
- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
- PyTorch Lightning Version (e.g., 1.5.0):  1.9.4
- Lightning App Version (e.g., 0.5.2):/
- PyTorch Version (e.g., 2.0): 2.1.1
- Python version (e.g., 3.9): 3.10
- OS (e.g., Linux): Ubuntu 20.04
- How you installed Lightning(`conda`, `pip`, source): pip

More info

I believe the commit that caused the change in fsspec is this one: fsspec/filesystem_spec@cf13c41 - fsspec/filesystem_spec#1415

One workaround is to provide the protocol explicitly:

trainer._checkpoint_connector.save_checkpoint("file://model:test.ckpt")
print("loading checkpoint")
model.load_from_checkpoint("file://model:test.ckpt")

However, this causes an error when the same string is passed to native torch.load("file://model:test.ckpt").
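A small sketch of how the two spellings could be kept apart, assuming the behaviour described above: Lightning accepts the `file://` form, while native `torch.load` needs the plain path. The helper names `to_lightning_path` and `to_torch_path` are hypothetical and not part of Lightning, fsspec, or PyTorch.

```python
FILE_PREFIX = "file://"


def to_lightning_path(path: str) -> str:
    """Add an explicit file:// protocol unless the path already names one."""
    if "://" in path:
        return path
    return FILE_PREFIX + path


def to_torch_path(path: str) -> str:
    """Strip a leading file:// so the plain path can be given to torch.load."""
    if path.startswith(FILE_PREFIX):
        return path[len(FILE_PREFIX):]
    return path


if __name__ == "__main__":
    ckpt = "model:test.ckpt"
    print(to_lightning_path(ckpt))             # form used in the workaround above
    print(to_torch_path("file://" + ckpt))     # plain form for native torch.load
```

With such helpers, user code keeps a single plain filename and converts it only at the call site that needs the prefixed form.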

I'll leave it up to you to decide on the best way forward, but I don't think users should have to navigate this themselves. (It took me a while to figure out, so I would love to save others some time.)

@tlpss added the bug and needs triage labels Dec 7, 2023

tlpss commented Dec 7, 2023

I've just seen that this issue has already been reported in the fsspec repository as well.

Because I did not know whether this change was intentional, I opened an issue here too. Feel free to close it if this turns out to be an fsspec issue.

awaelchli (Contributor) commented

@tlpss Thanks for bringing attention to this. It looks like fsspec has already merged a fix, so the problem will be solved with their next release; all you will need to do then is upgrade. As a workaround for now, you'd need to temporarily downgrade fsspec to a version from before the breakage.
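To decide whether the downgrade (or a later upgrade) applies to a given environment, the installed fsspec version can be checked with the standard library. This is a minimal sketch: `REPORTED_AFFECTED` lists only the version named in this report (2023.12.1), other releases may also be affected, and both function names are hypothetical.

```python
from importlib import metadata

# Assumption: only the version named in this issue is listed here.
REPORTED_AFFECTED = {"2023.12.1"}


def installed_fsspec_version():
    """Return the installed fsspec version string, or None if it is not installed."""
    try:
        return metadata.version("fsspec")
    except metadata.PackageNotFoundError:
        return None


def is_reported_affected(version) -> bool:
    """True if the version matches one reported as broken in this thread."""
    return version in REPORTED_AFFECTED


if __name__ == "__main__":
    version = installed_fsspec_version()
    if version is None:
        print("fsspec is not installed")
    elif is_reported_affected(version):
        print(f"fsspec {version} is reported affected; pin a different version")
    else:
        print(f"fsspec {version} is not in the reported-affected set")
```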

Based on this, I'm closing the issue. Cheers!

@awaelchli closed this as not planned Dec 7, 2023
@awaelchli added the 3rd party label and removed the needs triage label Dec 7, 2023