Cannot load checkpoint that contains ':' in filename with new release of fsspec #19124

tlpss · 2023-12-07T14:34:47Z

Bug description

With the latest release (2023.12.1) of fsspec, a checkpoint whose filename contains a ':' can no longer be loaded the way the docs suggest.

Wandb, for example, uses such filenames for its checkpoint artifacts.

What version are you seeing the problem on?

v1.9

How to reproduce the bug

import pytorch_lightning as pl


class DummyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.test = "test"


model = DummyModel()
trainer = pl.Trainer()
# Attach the model manually so a checkpoint can be saved without calling fit()
trainer.strategy._lightning_module = model
# The ':' in the filename is what triggers the error
trainer._checkpoint_connector.save_checkpoint("model:test.ckpt")
print("loading checkpoint")
model.load_from_checkpoint("model:test.ckpt")

Error messages and logs

GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/tlips/miniconda3/envs/irm/lib/python3.10/site-packages/pytorch_lightning/trainer/setup.py:176: PossibleUserWarning: GPU available but not used. Set `accelerator` and `devices` using `Trainer(accelerator='gpu', devices=1)`.
  rank_zero_warn(
/home/tlips/miniconda3/envs/irm/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:67: UserWarning: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
  warning_cache.warn(
Traceback (most recent call last):
  File "/home/tlips/Documents/intelligent_robot_manipulation/practicals/5_keypoint_detection/test.py", line 13, in <module>
    trainer._checkpoint_connector.save_checkpoint("model:test.ckpt")
  File "/home/tlips/miniconda3/envs/irm/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 511, in save_checkpoint
    self.trainer.strategy.save_checkpoint(_checkpoint, filepath, storage_options=storage_options)
  File "/home/tlips/miniconda3/envs/irm/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 466, in save_checkpoint
    self.checkpoint_io.save_checkpoint(checkpoint, filepath, storage_options=storage_options)
  File "/home/tlips/miniconda3/envs/irm/lib/python3.10/site-packages/lightning_fabric/plugins/io/torch_io.py", line 50, in save_checkpoint
    fs = get_filesystem(path)
  File "/home/tlips/miniconda3/envs/irm/lib/python3.10/site-packages/lightning_fabric/utilities/cloud_io.py", line 52, in get_filesystem
    fs, _ = url_to_fs(str(path), **kwargs)
  File "/home/tlips/miniconda3/envs/irm/lib/python3.10/site-packages/fsspec/core.py", line 383, in url_to_fs
    chain = _un_chain(url, kwargs)
  File "/home/tlips/miniconda3/envs/irm/lib/python3.10/site-packages/fsspec/core.py", line 332, in _un_chain
    cls = get_filesystem_class(protocol)
  File "/home/tlips/miniconda3/envs/irm/lib/python3.10/site-packages/fsspec/registry.py", line 233, in get_filesystem_class
    raise ValueError(f"Protocol not known: {protocol}")
ValueError: Protocol not known: model

Environment

Current environment

- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
- PyTorch Lightning Version (e.g., 1.5.0):  1.9.4
- Lightning App Version (e.g., 0.5.2):/
- PyTorch Version (e.g., 2.0): 2.1.1
- Python version (e.g., 3.9): 3.10
- OS (e.g., Linux): Ubuntu 20.04
- How you installed Lightning(`conda`, `pip`, source): pip

More info

I believe the commit that caused the change in fsspec is this one: fsspec/filesystem_spec@cf13c41 - fsspec/filesystem_spec#1415
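To make the error message easier to read, here is a toy sketch of how a filename such as "model:test.ckpt" can end up being treated as a URL with the protocol "model". This is not fsspec's actual parsing code: the function names and the `KNOWN_PROTOCOLS` set are hypothetical and only mirror the behaviour visible in the traceback above.

```python
# Toy illustration only: this is NOT fsspec's real parsing code.
KNOWN_PROTOCOLS = {"file", "s3", "gcs", "http", "https"}  # hypothetical registry


def naive_split_protocol(path: str):
    """Treat everything before the first ':' as a protocol name."""
    if ":" in path:
        protocol, rest = path.split(":", 1)
        return protocol, rest
    return None, path


def resolve_protocol(path: str) -> str:
    """Return the protocol a path is routed to, or raise like the traceback."""
    protocol, _ = naive_split_protocol(path)
    if protocol is None:
        return "file"  # plain paths fall back to the local filesystem
    if protocol not in KNOWN_PROTOCOLS:
        raise ValueError(f"Protocol not known: {protocol}")
    return protocol
```

Under this simplified model, "model:test.ckpt" raises `ValueError: Protocol not known: model`, while "file://model:test.ckpt" resolves to the "file" protocol.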

One solution is to provide the protocol manually:

trainer._checkpoint_connector.save_checkpoint("file://model:test.ckpt")
print("loading checkpoint")
model.load_from_checkpoint("file://model:test.ckpt")

However, the prefixed path then causes an error with native torch.load("file://model:test.ckpt").

I leave it up to you to decide on the best way forward, but I don't think this is something users should have to navigate themselves. It took me a while to figure it out, so I would love to save others some time.
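Since the prefixed form works for the Lightning calls but not for native torch.load, one way to juggle both is a pair of small string helpers. This is a minimal sketch; `as_fsspec_url` and `as_local_path` are hypothetical names, not part of Lightning, fsspec, or PyTorch, and whether the "file://" prefix is accepted depends on the installed fsspec version.

```python
FILE_PREFIX = "file://"


def as_fsspec_url(path: str) -> str:
    """Add an explicit file:// protocol unless the path already names one."""
    if "://" in path:
        return path
    return FILE_PREFIX + path


def as_local_path(path_or_url: str) -> str:
    """Strip a leading file:// so the result is a plain local path again."""
    if path_or_url.startswith(FILE_PREFIX):
        return path_or_url[len(FILE_PREFIX):]
    return path_or_url
```

With these, `as_fsspec_url("model:test.ckpt")` gives "file://model:test.ckpt" for the Lightning save and load calls, and `as_local_path(...)` turns it back into "model:test.ckpt" for torch.load.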


tlpss · 2023-12-07T14:45:05Z

I've just seen that this issue has already been reported in the fsspec library here.

Because I did not know whether this was intentional or not, I opened an issue here. Feel free to close this issue if the problem lies with fsspec.

awaelchli · 2023-12-07T21:05:04Z

@tlpss Thanks for bringing attention to this. It looks like fsspec has already merged a fix, so the problem will be solved with their next release and all you need to do is upgrade. As a workaround until then, you can temporarily downgrade fsspec to a version from before the breakage.

Based on this, I'm closing the issue. Cheers!
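For anyone who wants to detect the problematic release before hitting the error, a small guard along these lines may help. It is a sketch under stated assumptions: `fsspec_is_affected` is a hypothetical helper, and the set of affected versions contains only the release named in this issue, so it may need extending.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumption: only the release named in this issue is listed; extend as needed.
AFFECTED_FSSPEC_VERSIONS = {"2023.12.1"}


def fsspec_is_affected(installed=None):
    """Return True if the given (or installed) fsspec version is listed as affected."""
    if installed is None:
        try:
            installed = version("fsspec")
        except PackageNotFoundError:
            return False  # fsspec is not installed at all
    return installed in AFFECTED_FSSPEC_VERSIONS
```

Calling `fsspec_is_affected()` at startup and warning the user to upgrade or downgrade fsspec is one possible use.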

tlpss added the bug and needs triage labels Dec 7, 2023

github-actions bot added the ver: 1.9.x label Dec 7, 2023

awaelchli closed this as not planned Dec 7, 2023

awaelchli added the 3rd party label and removed the needs triage label Dec 7, 2023
