Cannot copy out of meta tensor; no data! #1378

Closed

Gooooooogo opened this issue May 1, 2024 · 2 comments
Labels
bug Something isn't working

Comments


Gooooooogo commented May 1, 2024

When I run litgpt finetune lora --data Alpaca, I get the following error:

{'checkpoint_dir': PosixPath('checkpoints/TinyLlama/TinyLlama-1.1B-Chat-v1.0'),
 'data': Alpaca(mask_prompt=False, val_split_fraction=0.03865, prompt_style=<litgpt.prompts.Alpaca object at 0x7f1976ff0d00>, ignore_index=-100, seed=42, num_workers=4, download_dir=PosixPath('data/alpaca')),
 'devices': 3,
 'eval': EvalArgs(interval=100, max_new_tokens=100, max_iters=100, initial_validation=False),
 'logger_name': 'csv',
 'lora_alpha': 16,
 'lora_dropout': 0.05,
 'lora_head': False,
 'lora_key': False,
 'lora_mlp': False,
 'lora_projection': False,
 'lora_query': True,
 'lora_r': 8,
 'lora_value': True,
 'out_dir': PosixPath('out/finetune/lora'),
 'precision': None,
 'quantize': None,
 'seed': 1337,
 'train': TrainArgs(save_interval=1000, log_interval=1, global_batch_size=16, micro_batch_size=1, lr_warmup_steps=100, lr_warmup_fraction=None, epochs=1, max_tokens=None, max_steps=None, max_seq_length=None, tie_embeddings=None, learning_rate=0.0003, weight_decay=0.02, beta1=0.9, beta2=0.95, max_norm=None, min_lr=6e-05)}
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/3
{'checkpoint_dir': PosixPath('checkpoints/TinyLlama/TinyLlama-1.1B-Chat-v1.0'),
 'data': Alpaca(mask_prompt=False, val_split_fraction=0.03865, prompt_style=<litgpt.prompts.Alpaca object at 0x7f6b4bd8fd60>, ignore_index=-100, seed=42, num_workers=4, download_dir=PosixPath('data/alpaca')),
 'devices': 3,
 'eval': EvalArgs(interval=100, max_new_tokens=100, max_iters=100, initial_validation=False),
 'logger_name': 'csv',
 'lora_alpha': 16,
 'lora_dropout': 0.05,
 'lora_head': False,
 'lora_key': False,
 'lora_mlp': False,
 'lora_projection': False,
 'lora_query': True,
 'lora_r': 8,
 'lora_value': True,
 'out_dir': PosixPath('out/finetune/lora'),
 'precision': None,
 'quantize': None,
 'seed': 1337,
 'train': TrainArgs(save_interval=1000, log_interval=1, global_batch_size=16, micro_batch_size=1, lr_warmup_steps=100, lr_warmup_fraction=None, epochs=1, max_tokens=None, max_steps=None, max_seq_length=None, tie_embeddings=None, learning_rate=0.0003, weight_decay=0.02, beta1=0.9, beta2=0.95, max_norm=None, min_lr=6e-05)}
{'checkpoint_dir': PosixPath('checkpoints/TinyLlama/TinyLlama-1.1B-Chat-v1.0'),
 'data': Alpaca(mask_prompt=False, val_split_fraction=0.03865, prompt_style=<litgpt.prompts.Alpaca object at 0x7fca4ffcbb20>, ignore_index=-100, seed=42, num_workers=4, download_dir=PosixPath('data/alpaca')),
 'devices': 3,
 'eval': EvalArgs(interval=100, max_new_tokens=100, max_iters=100, initial_validation=False),
 'logger_name': 'csv',
 'lora_alpha': 16,
...
 'precision': None,
 'quantize': None,
 'seed': 1337,
 'train': TrainArgs(save_interval=1000, log_interval=1, global_batch_size=16, micro_batch_size=1, lr_warmup_steps=100, lr_warmup_fraction=None, epochs=1, max_tokens=None, max_steps=None, max_seq_length=None, tie_embeddings=None, learning_rate=0.0003, weight_decay=0.02, beta1=0.9, beta2=0.95, max_norm=None, min_lr=6e-05)}
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/3
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/3
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 3 processes
----------------------------------------------------------------------------------------------------

[rank: 0] Seed set to 1337
[rank: 2] Seed set to 1337
[rank: 1] Seed set to 1337
Number of trainable parameters: 1,126,400
Number of non-trainable parameters: 1,100,048,384
The longest sequence length in the train data is 1305, the model's maximum sequence length is 1305 and context length is 2048
Validating ...
Traceback (most recent call last):
  File "/home/jwan3704/litgpt-venv/bin/litgpt", line 8, in <module>
    sys.exit(main())
  File "/share/home/jwan3704/litgpt-venv/lib/python3.9/site-packages/litgpt/__main__.py", line 143, in main
    fn(**kwargs)
  File "/share/home/jwan3704/litgpt-venv/lib/python3.9/site-packages/litgpt/finetune/lora.py", line 144, in setup
    fabric.launch(main, devices, seed, config, data, checkpoint_dir, out_dir, train, eval)
  File "/share/home/jwan3704/litgpt-venv/lib/python3.9/site-packages/lightning/fabric/fabric.py", line 845, in launch
    return self._wrap_and_launch(function, self, *args, **kwargs)
  File "/share/home/jwan3704/litgpt-venv/lib/python3.9/site-packages/lightning/fabric/fabric.py", line 931, in _wrap_and_launch
    return to_run(*args, **kwargs)
  File "/share/home/jwan3704/litgpt-venv/lib/python3.9/site-packages/lightning/fabric/fabric.py", line 936, in _wrap_with_setup
    return to_run(*args, **kwargs)
  File "/share/home/jwan3704/litgpt-venv/lib/python3.9/site-packages/litgpt/finetune/lora.py", line 197, in main
    fit(
  File "/share/home/jwan3704/litgpt-venv/lib/python3.9/site-packages/litgpt/finetune/lora.py", line 259, in fit
    validate(fabric, model, val_dataloader, dataclasses.replace(eval, max_iters=2))  # sanity check
  File "/share/home/jwan3704/litgpt-venv/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/share/home/jwan3704/litgpt-venv/lib/python3.9/site-packages/litgpt/finetune/lora.py", line 354, in validate
    logits = model(input_ids)
  File "/share/home/jwan3704/litgpt-venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/share/home/jwan3704/litgpt-venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
...
    lora = self.zero_pad(after_B) * self.scaling  # (64, 64, 256) after zero_pad (64, 64, 384)
  File "/share/home/jwan3704/litgpt-venv/lib/python3.9/site-packages/litgpt/lora.py", line 345, in zero_pad
    self._lora_ind_cache[result.device] = lora_ind = self._lora_ind.to(result.device)
NotImplementedError: Cannot copy out of meta tensor; no data!
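For what it's worth, the error itself can be reproduced in isolation: a tensor on PyTorch's meta device only records shape and dtype and has no backing storage, so any attempt to copy it to a real device fails. A minimal sketch, independent of litgpt and assuming only stock PyTorch:

import torch

# Meta tensors carry shape/dtype metadata only; there is no data to copy.
ind = torch.empty(6, device="meta")

try:
    ind.to("cpu")  # same call shape as `self._lora_ind.to(result.device)`
except NotImplementedError as err:
    # Prints: Cannot copy out of meta tensor; no data! ...
    print(err)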
rasbt (Collaborator) commented May 1, 2024

I haven't had a chance to test it yet, but this looks familiar @robieta, re #1374:

self._lora_ind_cache[result.device] = lora_ind = self._lora_ind.to(result.device)

NotImplementedError: Cannot copy out of meta tensor; no data!

It may or may not be related, but I'm curious: when you implemented #1374, did you test it on multi-GPU?
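For illustration only (this is not litgpt's implementation, nor necessarily the eventual fix): the failure mode is a per-device cache whose source tensor was created while the module still lived on the meta device, e.g. under deferred/sharded initialization across several GPUs. One way to sidestep it is to rebuild the index from plain Python data on first use instead of copying a possibly-meta tensor; registering the index as a buffer and materializing it via torch.nn.Module.to_empty() would be another option. The class and method names below are made up for the sketch:

import torch

class ZeroPadSketch(torch.nn.Module):
    # Hypothetical stand-in for a LoRA zero-pad helper, not litgpt's code.
    def __init__(self, enabled_ind):
        super().__init__()
        self.enabled_ind = list(enabled_ind)  # keep plain ints; nothing lives on "meta"
        self._ind_cache = {}                  # device -> materialized index tensor

    def lora_ind(self, device):
        ind = self._ind_cache.get(device)
        if ind is None:
            # Built directly on the target device, so there is no
            # "copy out of meta tensor" step to fail.
            ind = torch.tensor(self.enabled_ind, dtype=torch.long, device=device)
            self._ind_cache[device] = ind
        return ind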

carmocca (Contributor) commented May 6, 2024

Should be fixed by #770

carmocca closed this as completed May 6, 2024