Running into this deserialization issue in src/nemo_run/core/runners/fdl_runner.py.
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/NeMo-Run/src/nemo_run/core/runners/fdl_runner.py", line 66, in <module>
    fdl_runner_app()
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 326, in __call__
    raise e
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 309, in __call__
    return get_command(self)(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 661, in main
    return _main(
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 193, in _main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 692, in wrapper
    return callback(**use_params)
  File "/opt/NeMo-Run/src/nemo_run/core/runners/fdl_runner.py", line 47, in fdl_direct_run
    fdl_deser_package: fdl.Buildable = ZlibJSONSerializer().deserialize(fdl_package_cfg)
  File "/opt/NeMo-Run/src/nemo_run/core/serialization/zlib_json.py", line 42, in deserialize
    zlib.decompress(base64.urlsafe_b64decode(serialized)).decode(),
zlib.error: Error -3 while decompressing data: incorrect header check
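For anyone trying to reproduce this class of error outside the cluster, here is a minimal sketch of the decode chain shown on the last traceback frame (`zlib.decompress(base64.urlsafe_b64decode(serialized)).decode()`). The helper names `encode_payload` and `try_decode_payload` are hypothetical, and the encode side is only an assumed inverse of that line, not code taken from NeMo-Run. It suggests that handing the decoder a plain string such as a path, rather than a compressed payload, is one way to end up with a zlib failure; whether that is what happens here is an assumption.

```python
import base64
import binascii
import zlib
from typing import Optional


def encode_payload(text: str) -> str:
    # Assumed inverse of the decode step in the traceback (hypothetical helper).
    return base64.urlsafe_b64encode(zlib.compress(text.encode())).decode()


def try_decode_payload(serialized: str) -> Optional[str]:
    # Same call chain as the failing line in zlib_json.py.
    try:
        return zlib.decompress(base64.urlsafe_b64decode(serialized)).decode()
    except (zlib.error, binascii.Error, ValueError) as exc:
        # Anything that is not base64-encoded zlib data ends up here.
        print(f"not a valid payload: {exc}")
        return None


# A real payload round-trips cleanly.
print(try_decode_payload(encode_payload('{"hello": "world"}')))

# A path string is not a payload, so decoding it fails and returns None.
print(try_decode_payload("/nemo_run/configs/llama3-8b_fn_or_script"))
```

If the second call fails in the same way as the traceback, that would be consistent with the runner receiving the literal path instead of the file contents.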
Context:
I am running a pretraining example job: python scripts/llm/llama3_pretraining.py --size=8b --slurm
The srun command being launched looks like this: srun --output /path/to/logfile.out python -m nemo_run.core.runners.fdl_runner -n llama3-8b -p /nemo_run/configs/llama3-8b_packager /nemo_run/configs/llama3-8b_fn_or_script
A probable cause is that the config file passed to this srun command, /nemo_run/configs/llama3-8b_fn_or_script, does not exist: I can't find it anywhere on my filesystem. However, I don't understand what to change.
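A quick way to confirm which of the paths in the srun command are visible from a given environment is a small existence check. This is only a sketch: `missing_paths` is a hypothetical helper, and the result depends on where it runs, since a path that is absent on the host could still be present inside a job if it is mounted there.

```python
from pathlib import Path
from typing import Iterable, List


def missing_paths(paths: Iterable[str]) -> List[str]:
    # Return the subset of paths that do not exist in the current filesystem view.
    return [p for p in paths if not Path(p).exists()]


# The two config paths taken from the srun command in this issue.
config_paths = [
    "/nemo_run/configs/llama3-8b_packager",
    "/nemo_run/configs/llama3-8b_fn_or_script",
]
print("missing:", missing_paths(config_paths))
```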
Hi, for Slurm, we currently only support clusters that have https://github.com/NVIDIA/pyxis enabled. With pyxis, we automatically mount the configs at /nemo_run/configs/ using the --container-mounts option in srun. Does your cluster have pyxis enabled?
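One rough way to answer that question from a login node is to look for the container options in srun's help output. The sketch below assumes that pyxis registers `--container-image` and `--container-mounts` as srun options and that they show up in `srun --help`; the function names are hypothetical, and a cluster admin remains the authoritative source.

```python
import shutil
import subprocess


def help_mentions_pyxis(help_text: str) -> bool:
    # Assumption: pyxis adds these two options to srun's help text.
    return "--container-image" in help_text and "--container-mounts" in help_text


def srun_has_pyxis() -> bool:
    # Locate srun on PATH; without it there is nothing to inspect.
    srun = shutil.which("srun")
    if srun is None:
        return False
    result = subprocess.run(
        [srun, "--help"], capture_output=True, text=True, check=False
    )
    return help_mentions_pyxis(result.stdout + result.stderr)


if __name__ == "__main__":
    print("pyxis options found:", srun_has_pyxis())
```

If this prints `False` on the cluster, the configs would not be mounted at /nemo_run/configs/ in the way described above.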
Any pointers would be useful!
CC: @hemildesai