
Resource table #663

Merged: 6 commits into main from resource-table, Oct 27, 2023

Conversation

@rasbt (Collaborator) commented on Oct 20, 2023

I made a resource table that I think will be helpful for people who have questions about the resource requirements. It could also be a useful reference for spot-checking select new PRs to see whether they improve or regress performance and memory requirements.

One question though is regarding the runtime of the multi-GPU cases. If I run max_iters = 1000 on 1 GPU, it will do 1000 iterations. If I run the same code on 2 GPUs, is it actually iterating through 2000 examples (even though it prints 1000 iterations)?

EDIT: I want to add 8-GPU tests at some point too, but that's currently not possible because the 4 GPUs are occupied with other things.
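
For context on the multi-GPU question above: with a data-parallel launch (which is what a multi-device Fabric run typically does), each process runs its own copy of the training loop, so `max_iters` counts iterations per rank and every rank draws its own batches. A minimal sketch of the accounting, using a hypothetical `micro_batch_size` rather than the PR's actual config:

```python
# Back-of-the-envelope accounting for a DDP-style data-parallel run.
# Assumption: each of `devices` processes runs the same loop on its own
# batches, which is how torch.distributed/Fabric data parallelism works.
devices = 2           # number of GPUs / processes
max_iters = 1000      # iterations printed *per rank*
micro_batch_size = 4  # hypothetical per-device batch size

# Every rank steps through max_iters batches of its own, so the run as a
# whole consumes devices * max_iters batches even though each rank only
# reports 1000 iterations.
total_examples = devices * max_iters * micro_batch_size
print(total_examples)  # 8000
```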

@rasbt (Collaborator, Author) commented on Oct 27, 2023

Added some additional info about the hardware and relevant libraries, @carmocca.

Review comments (all resolved): tutorials/finetune_lora.md, tutorials/resource-tables.md
@rasbt (Collaborator, Author) commented on Oct 27, 2023

> Interesting. Can you paste the stacktrace in a comment here for future reference?

@carmocca It's a weird one; I haven't seen it with other models, and I can reproduce it:

../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [623,0,0], thread: [28,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [623,0,0], thread: [29,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [623,0,0], thread: [30,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
...
...
...
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [666,0,0], thread: [125,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [666,0,0], thread: [126,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [666,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  File "/data/sebastian/resource-table/lit-gpt/finetune/lora.py", line 335, in <module>
    CLI(setup)
  File "/home/sebastian/.local/lib/python3.10/site-packages/jsonargparse/_cli.py", line 85, in CLI
    return _run_component(component, cfg_init)
  File "/home/sebastian/.local/lib/python3.10/site-packages/jsonargparse/_cli.py", line 147, in _run_component
    return component(**cfg)
  File "/data/sebastian/resource-table/lit-gpt/finetune/lora.py", line 96, in setup
    fabric.launch(main, data_dir, checkpoint_dir, out_dir, quantize)
  File "/home/sebastian/.local/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 834, in launch
    return self._wrap_and_launch(function, self, *args, **kwargs)
  File "/home/sebastian/.local/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 920, in _wrap_and_launch
    return to_run(*args, **kwargs)
  File "/home/sebastian/.local/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 925, in _wrap_with_setup
    return to_run(*args, **kwargs)
  File "/data/sebastian/resource-table/lit-gpt/finetune/lora.py", line 153, in main
    train(fabric, model, optimizer, scheduler, train_data, val_data, checkpoint_dir, out_dir, speed_monitor)
  File "/data/sebastian/resource-table/lit-gpt/finetune/lora.py", line 182, in train
    validate(fabric, model, val_data, tokenizer)  # sanity check
  File "/home/sebastian/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data/sebastian/resource-table/lit-gpt/finetune/lora.py", line 264, in validate
    logits = model(input_ids)
  File "/home/sebastian/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sebastian/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sebastian/.local/lib/python3.10/site-packages/lightning/fabric/wrappers.py", line 121, in forward
    output = self._forward_module(*args, **kwargs)
  File "/home/sebastian/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sebastian/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/sebastian/resource-table/lit-gpt/lit_gpt/lora.py", line 498, in forward
    x = block(x, cos, sin, mask, input_pos)
  File "/home/sebastian/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sebastian/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/sebastian/resource-table/lit-gpt/lit_gpt/model.py", line 158, in forward
    h = self.attn(n_1, cos, sin, mask, input_pos)
  File "/home/sebastian/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sebastian/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/sebastian/resource-table/lit-gpt/lit_gpt/model.py", line 196, in forward
    qkv = self.attn(x)
  File "/home/sebastian/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sebastian/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/sebastian/resource-table/lit-gpt/lit_gpt/lora.py", line 372, in forward
    pretrained = self.linear(x)
  File "/home/sebastian/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sebastian/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sebastian/.local/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 248, in forward
    out = bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state)
  File "/home/sebastian/.local/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 579, in matmul_4bit
    return MatMul4Bit.apply(A, B, out, bias, quant_state)
  File "/home/sebastian/.local/lib/python3.10/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/sebastian/.local/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 516, in forward
    output = torch.nn.functional.linear(A, F.dequantize_4bit(B, state).to(A.dtype).t(), bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
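
For future readers, a note on what this class of error usually means (an educated guess, not a confirmed diagnosis of this particular failure): the `srcIndex < srcSelectDimSize` device-side assert fires when a CUDA index/embedding lookup receives an index that is out of range for the source tensor, e.g. a token id greater than or equal to the embedding's vocabulary size. The later `CUBLAS_STATUS_NOT_INITIALIZED` is often just the next CUDA call tripping over the already-poisoned context, and running with `CUDA_LAUNCH_BLOCKING=1` makes the actual failing op report synchronously. A minimal repro of the assert:

```python
# Hypothetical minimal repro of the indexSelectLargeIndex assert above.
# An out-of-range id into a CUDA embedding triggers the same device-side
# assertion; this is *not* necessarily what happened in this PR's run.
import torch

emb = torch.nn.Embedding(num_embeddings=100, embedding_dim=8).cuda()
ids = torch.tensor([[1, 2, 150]], device="cuda")  # 150 >= 100: out of range
out = emb(ids)            # the kernel launch is asynchronous
torch.cuda.synchronize()  # the assert surfaces here or in a later CUDA call
```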

@carmocca (Contributor) commented

Thanks. Since it's an error in bitsandbytes, I wouldn't worry too much about it.

@rasbt (Collaborator, Author) commented on Oct 27, 2023

> Thanks. Since it's an error in bitsandbytes, I wouldn't worry too much about it.

Agreed. It was just odd that it occurred only with one of the models.

@carmocca merged commit 0cc02af into main on Oct 27, 2023 (5 checks passed)
@carmocca deleted the resource-table branch on October 27, 2023 at 18:18