System Info
Who can help?
@muellerz @SunMarc @ArthurZucker
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)
Reproduction
I'm running into the following error while trying to use the SFTTrainer with FSDP and the accelerate library (full stack trace provided at the very bottom of this post).
torch.utils.checkpoint.CheckpointError: torch.utils.checkpoint: Recomputed values for the following tensors have different metadata than during the forward pass
This occurs when I set gradient_checkpointing: false and activation_checkpointing: true. Curiously, training does seem to work if I instead set gradient_checkpointing: true and activation_checkpointing: false, but that combination produces the following warning:
When using FSDP full shard, instead of using `gradient_checkpointing` in TrainingArguments, please use `activation_checkpointing` in `fsdp_config`. The former introduces a redundant AllGather operation in backward pass. Reference: https://github.com/huggingface/transformers/issues/30404
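For reference, here is a minimal sketch of how I understand these two flags map onto the Trainer API (illustrative only; the output directory and the other FSDP options are placeholders, not my exact config):

```python
from transformers import TrainingArguments

# Failing combination: gradient checkpointing disabled in TrainingArguments,
# activation checkpointing enabled inside the FSDP config.
args = TrainingArguments(
    output_dir="/opt/ml/model",  # placeholder path
    bf16=True,
    gradient_checkpointing=False,
    fsdp="full_shard auto_wrap",
    fsdp_config={
        "activation_checkpointing": True,
        # other FSDP options omitted
    },
)
```

Flipping the two booleans above gives the combination that runs but emits the warning quoted earlier.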
There are a few related GitHub issues that touch on this problem.
One of these suggested setting use_reentrant: true, but that doesn't resolve the issue for me.
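For clarity, here is how I'm applying that suggestion (a sketch; in my actual run the flag comes from my config file rather than being hard-coded):

```python
from transformers import TrainingArguments

# Workaround suggested in the linked issues: force the reentrant
# checkpointing implementation. It did not resolve the error for me.
args = TrainingArguments(
    output_dir="/opt/ml/model",  # placeholder path
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": True},
)
```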
I'm attempting to run this as a SageMaker training job using the official HuggingFace estimator (this amounts to the following command: torchrun --nnodes 1 --nproc_per_node 8 train.py). My training script is essentially a lightly adapted version of the official examples. Below is how I'm instantiating the HuggingFace estimator object:
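A representative sketch of the estimator setup (the instance type, framework versions, and hyperparameters below are placeholders, not my exact values):

```python
from sagemaker.huggingface import HuggingFace

# Sketch of the SageMaker HuggingFace estimator setup. Everything marked
# "placeholder" is illustrative rather than my exact configuration.
estimator = HuggingFace(
    entry_point="train.py",
    source_dir="./code",                # placeholder
    role=role,                          # IAM role, defined elsewhere
    instance_type="ml.p4d.24xlarge",    # placeholder 8-GPU instance
    instance_count=1,
    transformers_version="4.36",        # placeholder
    pytorch_version="2.1",              # placeholder
    py_version="py310",
    # Launches the entry point with torchrun, which is what produces
    # `torchrun --nnodes 1 --nproc_per_node 8 train.py` on the instance.
    distribution={"torch_distributed": {"enabled": True}},
    hyperparameters={"config": "config.yaml"},  # placeholder
)
estimator.fit()  # placeholder: input channels not shown
```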
Full Stack Trace
Traceback (most recent call last):
  File "/opt/ml/code/train.py", line 224, in <module>
    main(cfg)
  File "/opt/ml/code/train.py", line 207, in main
    trainer.train()
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2123, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2481, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 3612, in training_step
    self.accelerator.backward(loss, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 2241, in backward
    loss.backward(**kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/opt/conda/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 1075, in unpack_hook
    frame.check_recomputed_tensors_match(gid)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 850, in check_recomputed_tensors_match
    raise CheckpointError(
torch.utils.checkpoint.CheckpointError: torch.utils.checkpoint: Recomputed values for the following tensors have different metadata than during the forward pass.
tensor at position 18:
saved metadata: {'shape': torch.Size([2, 1024, 28, 128]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
recomputed metadata: {'shape': torch.Size([2, 2048, 28, 128]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
tensor at position 19:
saved metadata: {'shape': torch.Size([2, 1024, 28, 128]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
recomputed metadata: {'shape': torch.Size([2, 2048, 28, 128]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
(The same CheckpointError is raised on the other ranks as well; their interleaved tracebacks are identical apart from the CUDA device index.)
0%| | 0/100 [00:13<?, ?it/s]
[E ProcessGroupGloo.cpp:138] Rank 5 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
[E ProcessGroupGloo.cpp:138] Rank 4 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
[E ProcessGroupGloo.cpp:138] Rank 7 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
[2024-11-25 18:39:43,758] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 69 closing signal SIGTERM
[2024-11-25 18:39:43,758] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 73 closing signal SIGTERM
[2024-11-25 18:39:43,758] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 74 closing signal SIGTERM
[2024-11-25 18:39:43,758] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 76 closing signal SIGTERM
[2024-11-25 18:39:47,931] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 70) of binary: /opt/conda/bin/python
Expected behavior
The expected behavior is for the SFTTrainer's train() method to run without errors.