
Resource table #663

Merged: 6 commits into main from resource-table, Oct 27, 2023

Conversation

@rasbt (Collaborator) commented on Oct 20, 2023

I made a resource table that I think will be helpful for people who have questions about the resource requirements. It could also be a useful reference for spot-checking select new PRs to see whether they improve or regress performance and memory requirements.

One question though is regarding the runtime of the multi-GPU cases. If I run max_iters = 1000 on 1 GPU, it will do 1000 iterations. If I run the same code on 2 GPUs, is it actually iterating through 2000 examples (even though it prints 1000 iterations)?

EDIT: I want to add 8-GPU tests at some point too, but that's currently not possible because the 4 GPUs are occupied with other things.
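
For context on the multi-GPU question above: with a data-parallel launch (which is what a multi-device Fabric run typically does), each process runs its own copy of the training loop, so `max_iters` counts iterations per rank and every rank draws its own batches. A minimal sketch of the accounting, using a hypothetical `micro_batch_size` rather than the PR's actual config:

```python
# Back-of-the-envelope accounting for a DDP-style data-parallel run.
# Assumption: each of `devices` processes runs the same loop on its own
# batches, which is how torch.distributed/Fabric data parallelism works.
devices = 2           # number of GPUs / processes
max_iters = 1000      # iterations printed *per rank*
micro_batch_size = 4  # hypothetical per-device batch size

# Every rank steps through max_iters batches of its own, so the run as a
# whole consumes devices * max_iters batches even though each rank only
# reports 1000 iterations.
total_examples = devices * max_iters * micro_batch_size
print(total_examples)  # 8000
```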

@rasbt (Collaborator, Author) commented on Oct 27, 2023

Added some additional info about the hardware and relevant libraries, @carmocca.

Review comments (all resolved): tutorials/finetune_lora.md, tutorials/resource-tables.md
@rasbt (Collaborator, Author) commented on Oct 27, 2023

> Interesting. Can you paste the stacktrace in a comment here for future reference?

@carmocca It's a weird one; I haven't seen it with other models, and I can reproduce it:

../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [623,0,0], thread: [28,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [623,0,0], thread: [29,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [623,0,0], thread: [30,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
...
...
...
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [666,0,0], thread: [125,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [666,0,0], thread: [126,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [666,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  File "/data/sebastian/resource-table/lit-gpt/finetune/lora.py", line 335, in <module>
    CLI(setup)
  File "/home/sebastian/.local/lib/python3.10/site-packages/jsonargparse/_cli.py", line 85, in CLI
    return _run_component(component, cfg_init)
  File "/home/sebastian/.local/lib/python3.10/site-packages/jsonargparse/_cli.py", line 147, in _run_component
    return component(**cfg)
  File "/data/sebastian/resource-table/lit-gpt/finetune/lora.py", line 96, in setup
    fabric.launch(main, data_dir, checkpoint_dir, out_dir, quantize)
  File "/home/sebastian/.local/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 834, in launch
    return self._wrap_and_launch(function, self, *args, **kwargs)
  File "/home/sebastian/.local/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 920, in _wrap_and_launch
    return to_run(*args, **kwargs)
  File "/home/sebastian/.local/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 925, in _wrap_with_setup
    return to_run(*args, **kwargs)
  File "/data/sebastian/resource-table/lit-gpt/finetune/lora.py", line 153, in main
    train(fabric, model, optimizer, scheduler, train_data, val_data, checkpoint_dir, out_dir, speed_monitor)
  File "/data/sebastian/resource-table/lit-gpt/finetune/lora.py", line 182, in train
    validate(fabric, model, val_data, tokenizer)  # sanity check
  File "/home/sebastian/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data/sebastian/resource-table/lit-gpt/finetune/lora.py", line 264, in validate
    logits = model(input_ids)
  File "/home/sebastian/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sebastian/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sebastian/.local/lib/python3.10/site-packages/lightning/fabric/wrappers.py", line 121, in forward
    output = self._forward_module(*args, **kwargs)
  File "/home/sebastian/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sebastian/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/sebastian/resource-table/lit-gpt/lit_gpt/lora.py", line 498, in forward
    x = block(x, cos, sin, mask, input_pos)
  File "/home/sebastian/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sebastian/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/sebastian/resource-table/lit-gpt/lit_gpt/model.py", line 158, in forward
    h = self.attn(n_1, cos, sin, mask, input_pos)
  File "/home/sebastian/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sebastian/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/sebastian/resource-table/lit-gpt/lit_gpt/model.py", line 196, in forward
    qkv = self.attn(x)
  File "/home/sebastian/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sebastian/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/sebastian/resource-table/lit-gpt/lit_gpt/lora.py", line 372, in forward
    pretrained = self.linear(x)
  File "/home/sebastian/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sebastian/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sebastian/.local/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 248, in forward
    out = bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state)
  File "/home/sebastian/.local/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 579, in matmul_4bit
    return MatMul4Bit.apply(A, B, out, bias, quant_state)
  File "/home/sebastian/.local/lib/python3.10/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/sebastian/.local/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 516, in forward
    output = torch.nn.functional.linear(A, F.dequantize_4bit(B, state).to(A.dtype).t(), bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
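
For future readers, a note on what this class of error usually means (an educated guess, not a confirmed diagnosis of this particular failure): the `srcIndex < srcSelectDimSize` device-side assert fires when a CUDA index/embedding lookup receives an index that is out of range for the source tensor, e.g. a token id greater than or equal to the embedding's vocabulary size. The later `CUBLAS_STATUS_NOT_INITIALIZED` is often just the next CUDA call tripping over the already-poisoned context, and running with `CUDA_LAUNCH_BLOCKING=1` makes the actual failing op report synchronously. A minimal repro of the assert:

```python
# Hypothetical minimal repro of the indexSelectLargeIndex assert above.
# An out-of-range id into a CUDA embedding triggers the same device-side
# assertion; this is *not* necessarily what happened in this PR's run.
import torch

emb = torch.nn.Embedding(num_embeddings=100, embedding_dim=8).cuda()
ids = torch.tensor([[1, 2, 150]], device="cuda")  # 150 >= 100: out of range
out = emb(ids)            # the kernel launch is asynchronous
torch.cuda.synchronize()  # the assert surfaces here or in a later CUDA call
```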

@carmocca (Contributor) commented

Thanks. Since it's an error in bitsandbytes, I wouldn't worry too much about it.

@rasbt (Collaborator, Author) commented on Oct 27, 2023

> Thanks. Since it's an error in bitsandbytes, I wouldn't worry too much about it.

Agreed. It was just odd that it occurred only with one of the models.

@carmocca merged commit 0cc02af into main on Oct 27, 2023 (5 checks passed)
@carmocca deleted the resource-table branch on October 27, 2023 at 18:18