@2015aroras, I am using the exact configs/official/OLMo-7B.yaml and only changed the training data path. So are you suggesting adding `compile: null` to the official OLMo-7B.yaml?
🐛 Describe the bug
h100-196-003:0 err: wandb: Synced 6 W&B file(s), 0 media file(s), 2 artifact file(s) and 0 other file(s)
h100-196-003:0 err: Traceback (most recent call last):
h100-196-003:0 err: File "/Users/H100/OLMo/scripts/train.py", line 347, in <module>
h100-196-003:0 err: main(cfg)
h100-196-003:0 err: File "/Users/H100/OLMo/scripts/train.py", line 319, in main
h100-196-003:0 err: trainer.fit()
h100-196-003:0 err: File "/Users/H100/OLMo/olmo/train.py", line 1152, in fit
h100-196-003:0 err: metrics = self.train_step(batch, reduce_global_loss=should_log_this_step)
h100-196-003:0 err: File "/Users/H100/OLMo/olmo/train.py", line 781, in train_step
h100-196-003:0 err: ce_batch_loss, z_batch_loss = self.train_batch(batch)
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 451, in _fn
h100-196-003:0 err: return fn(*args, **kwargs)
h100-196-003:0 err: File "/Users/H100/OLMo/olmo/train.py", line 758, in train_batch
h100-196-003:0 err: loss.backward()
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
h100-196-003:0 err: torch.autograd.backward(
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
h100-196-003:0 err: _engine_run_backward(
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
h100-196-003:0 err: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply
h100-196-003:0 err: return user_fn(self, *args)
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 882, in backward
h100-196-003:0 err: out = call_compiled_backward()
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 831, in call_compiled_backward
h100-196-003:0 err: out = call_func_at_runtime_with_args(
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/utils.py", line 113, in call_func_at_runtime_with_args
h100-196-003:0 err: out = normalize_as_list(f(args))
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 451, in _fn
h100-196-003:0 err: return fn(*args, **kwargs)
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 36, in inner
h100-196-003:0 err: return fn(*args, **kwargs)
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 906, in __call__
h100-196-003:0 err: return self.get_current_callable()(inputs)
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 784, in run
h100-196-003:0 err: return model(new_inputs)
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 934, in _run_from_cache
h100-196-003:0 err: return compiled_graph.compiled_artifact(inputs)
h100-196-003:0 err: File "/tmp/torchinductor_dejasu/yz/cyzj56loyzqxbsmpxbkpn2snn62qzjk6zvqc7nhgbi262jwngmlr.py", line 80, in call
h100-196-003:0 err: triton_poi_fused_div_0.run(tangents_1, buf0, 1, grid=grid(1), stream=stream0)
h100-196-003:0 err: File "/Users/H100/miniconda3/envs/olmo/lib/python3.10/site-packages/torch/_inductor/triton_heuristics.py", line 670, in run
h100-196-003:0 err: return launcher(
h100-196-003:0 err: File "<string>", line 7, in launcher
h100-196-003:0 err: RuntimeError: Triton Error [CUDA]: invalid device context
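To check quickly whether a failure like this one sits in compiled code, you can scan the log for frames under torch/_dynamo, torch/_functorch, and torch/_inductor. The following is a minimal sketch and not part of OLMo or PyTorch: the helper name `compiled_frames` and the `<host>:<rank> err: ` prefix pattern are assumptions based only on the log format shown in this report.

```python
import re

# Assumed log prefix, taken from the traceback in this report: "<host>:<rank> err: "
PREFIX = re.compile(r"^\S+:\d+ err: ")
# Standard Python traceback frame line: File "<path>", line <n>, in <function>
FRAME = re.compile(r'File "([^"]*)", line (\d+), in\s*(\S*)')
# Package directories used by torch.compile (Dynamo, AOTAutograd, Inductor)
COMPILE_DIRS = ("torch/_dynamo", "torch/_functorch", "torch/_inductor")


def compiled_frames(log_text):
    """Return (path, line, function) for every frame that lies inside a torch.compile package."""
    frames = []
    for raw in log_text.splitlines():
        line = PREFIX.sub("", raw.strip())  # drop the per-rank prefix if present
        match = FRAME.search(line)
        if match and any(d in match.group(1) for d in COMPILE_DIRS):
            frames.append((match.group(1), int(match.group(2)), match.group(3)))
    return frames
```

Applied to the traceback in this report, the helper would list the eval_frame.py, jit_compile_runtime_wrappers.py, codecache.py, compile_fx.py, and triton_heuristics.py frames, which suggests the backward pass that failed was running through torch.compile.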
Versions
Python 3.10.14
-e git+https://github.com/allenai/OLMo.git@4332c3224030a321c5894df18f97049b10a56582#egg=ai2_olmo
ai2-olmo-core==0.1.0
aiohappyeyeballs==2.3.5
aiohttp==3.10.2
aiosignal==1.3.1
annotated-types==0.7.0
antlr4-python3-runtime==4.9.3
async-timeout==4.0.3
attrs==24.2.0
backports.tarfile==1.2.0
beaker-gantry==1.8.3
beaker-py==1.31.2
black==23.12.1
boltons==24.0.0
boto3==1.34.158
botocore==1.34.158
build==1.2.1
cached_path==1.6.3
cachetools==5.4.0
certifi==2024.7.4
cffi==1.17.0
charset-normalizer==3.3.2
click==8.1.7
click-help-colors==0.9.4
cryptography==43.0.0
datasets==2.20.0
dill==0.3.8
docker==7.1.0
docker-pycreds==0.4.0
docutils==0.21.2
exceptiongroup==1.2.2
face==20.1.1
filelock==3.13.4
frozenlist==1.4.1
fsspec==2024.5.0
ftfy==6.2.3
gitdb==4.0.11
GitPython==3.1.43
glom==23.5.0
google-api-core==2.19.1
google-auth==2.33.0
google-cloud-core==2.4.1
google-cloud-storage==2.18.2
google-crc32c==1.5.0
google-resumable-media==2.7.2
googleapis-common-protos==1.63.2
huggingface-hub==0.23.5
idna==3.7
importlib_metadata==8.2.0
importlib_resources==6.4.0
iniconfig==2.0.0
isort==5.12.0
jaraco.classes==3.4.0
jaraco.context==5.3.0
jaraco.functools==4.0.2
jeepney==0.8.0
Jinja2==3.1.4
jmespath==1.0.1
joblib==1.4.2
keyring==25.3.0
lightning-utilities==0.11.6
markdown-it-py==3.0.0
MarkupSafe==2.1.5
mdurl==0.1.2
more-itertools==10.4.0
mpmath==1.3.0
msgspec==0.18.6
multidict==6.0.5
multiprocess==0.70.16
mypy==1.3.0
mypy-extensions==1.0.0
necessary==0.4.3
networkx==3.3
nh3==0.2.18
numpy==2.0.1
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.6.20
nvidia-nvtx-cu12==12.1.105
omegaconf==2.3.0
packaging==24.1
pandas==2.2.2
pathspec==0.12.1
petname==2.6
pkginfo==1.10.0
platformdirs==4.2.2
pluggy==1.5.0
proto-plus==1.24.0
protobuf==5.27.3
psutil==6.0.0
pyarrow==17.0.0
pyarrow-hotfix==0.6
pyasn1==0.6.0
pyasn1_modules==0.4.0
pycparser==2.22
pydantic==2.8.2
pydantic_core==2.20.1
Pygments==2.18.0
pyproject_hooks==1.1.0
pytest==8.3.2
pytest-sphinx==0.6.3
python-dateutil==2.9.0.post0
pytz==2024.1
PyYAML==6.0.2
readme_renderer==44.0
regex==2024.7.24
requests==2.32.3
requests-toolbelt==1.0.0
requirements-parser==0.10.2
rfc3986==2.0.0
rich==13.7.1
rsa==4.9
ruff==0.5.7
s3transfer==0.10.2
safetensors==0.4.4
scikit-learn==1.5.1
scipy==1.14.0
SecretStorage==3.3.3
sentry-sdk==2.12.0
setproctitle==1.3.3
six==1.16.0
smart-open==7.0.4
smashed==0.21.5
smmap==5.0.1
sympy==1.13.1
threadpoolctl==3.5.0
tokenizers==0.19.1
tomli==2.0.1
torch==2.3.1
torchmetrics==1.4.1
tqdm==4.66.5
transformers==4.44.0
triton==2.3.1
trouting==0.3.3
twine==5.1.1
types-setuptools==71.1.0.20240806
typing_extensions==4.12.2
tzdata==2024.1
urllib3==2.2.2
wandb==0.17.6
wcwidth==0.2.13
wrapt==1.16.0
xxhash==3.4.1
yarl==1.9.4
zipp==3.19.2