zlib.error: Error -3 while decompressing data: incorrect header check #97

Open
RachitBansal opened this issue Oct 29, 2024 · 1 comment

RachitBansal commented Oct 29, 2024

I'm running into the following deserialization error in src/nemo_run/core/runners/fdl_runner.py:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/NeMo-Run/src/nemo_run/core/runners/fdl_runner.py", line 66, in <module>
    fdl_runner_app()
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 326, in __call__
    raise e
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 309, in __call__
    return get_command(self)(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 661, in main
    return _main(
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 193, in _main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 692, in wrapper
    return callback(**use_params)
  File "/opt/NeMo-Run/src/nemo_run/core/runners/fdl_runner.py", line 47, in fdl_direct_run
    fdl_deser_package: fdl.Buildable = ZlibJSONSerializer().deserialize(fdl_package_cfg)
  File "/opt/NeMo-Run/src/nemo_run/core/serialization/zlib_json.py", line 42, in deserialize
    zlib.decompress(base64.urlsafe_b64decode(serialized)).decode(),
zlib.error: Error -3 while decompressing data: incorrect header check

Context:
I am running a pretraining example job:
  python scripts/llm/llama3_pretraining.py --size=8b --slurm
The srun command being launched looks like this:
  srun --output /path/to/logfile.out python -m nemo_run.core.runners.fdl_runner -n llama3-8b \
    -p /nemo_run/configs/llama3-8b_packager /nemo_run/configs/llama3-8b_fn_or_script

A probable cause is that the config path passed to this srun command, /nemo_run/configs/llama3-8b_fn_or_script, does not exist: I can't find it anywhere on my filesystem. However, I don't know what I need to change to fix this.

Any pointers would be useful!

CC: @hemildesai
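
For anyone trying to reproduce the symptom outside the cluster, here is a minimal sketch of how the decode step in the traceback behaves when it is handed a path string instead of a serialized payload. It assumes that the runner ends up passing the raw argument to the deserializer when the file is missing. That is a guess based on the traceback, not something confirmed in this thread. The encode, decode, and try_decode helpers are hypothetical stand-ins that use plain JSON; only the zlib.decompress(base64.urlsafe_b64decode(...)) call is taken from the traceback.

```python
import base64
import json
import zlib


def encode(obj) -> str:
    """Hypothetical stand-in for the serializer: JSON -> zlib -> URL-safe base64."""
    raw = json.dumps(obj).encode()
    return base64.urlsafe_b64encode(zlib.compress(raw)).decode()


def decode(serialized: str):
    """Mirrors the decode call shown in the traceback (zlib_json.py, line 42)."""
    return json.loads(zlib.decompress(base64.urlsafe_b64decode(serialized)).decode())


def try_decode(serialized: str) -> str:
    """Return 'ok' or the error text so the two cases can be compared."""
    try:
        decode(serialized)
        return "ok"
    except (zlib.error, ValueError) as exc:
        # ValueError also covers binascii.Error and JSON/Unicode decode errors.
        return f"{type(exc).__name__}: {exc}"


payload = encode({"name": "llama3-8b"})
path = "/nemo_run/configs/llama3-8b_fn_or_script"

print(try_decode(payload))  # a real payload round-trips
print(try_decode(path))     # a bare path string is not a zlib stream
```

The path string is made up entirely of URL-safe base64 characters and its length is a multiple of four, so the base64 step accepts it and the failure only surfaces in zlib.decompress. That would be consistent with the error above, but it does not prove that this is what happens inside the runner.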

@hemildesai
Collaborator

Hi, for Slurm, we currently only support clusters that have pyxis (https://github.com/NVIDIA/pyxis) enabled. With pyxis, we automatically mount the configs at /nemo_run/configs/ using the --container-mounts option of srun. Does your cluster have pyxis enabled?
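
One rough way to answer that question from a login node is to look for the container options in the output of srun --help, since Slurm plugins such as pyxis normally list the options they add there. The sketch below assumes that --container-image and --container-mounts appear in that help text when pyxis is loaded. The helper names has_pyxis_flags and srun_help are hypothetical and not part of NeMo-Run or pyxis.

```python
import subprocess

# Options that pyxis is expected to add to srun (assumption: both show up in --help).
PYXIS_FLAGS = ("--container-image", "--container-mounts")


def has_pyxis_flags(help_text: str) -> bool:
    """True if every expected pyxis option is mentioned in the given help text."""
    return all(flag in help_text for flag in PYXIS_FLAGS)


def srun_help() -> str:
    """Return the text of `srun --help`, or an empty string if srun cannot be run."""
    try:
        result = subprocess.run(
            ["srun", "--help"],
            capture_output=True,
            text=True,
            timeout=30,
            check=False,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return ""
    return result.stdout + result.stderr


if __name__ == "__main__":
    found = has_pyxis_flags(srun_help())
    print("pyxis flags found" if found else "pyxis flags not found")
```

A negative result only means the options were not listed on the node where the check ran. The cluster administrators remain the authoritative source on whether pyxis is enabled.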
