Recomputed tensor size does not match when using activation checkpointing when using FSDP and accelerate #34928

jjbuck opened this issue Nov 25, 2024 · 0 comments
jjbuck commented Nov 25, 2024

System Info

- `transformers` version: 4.46.3
- Platform: Linux-6.8.0-1015-aws-x86_64-with-glibc2.35
- Python version: 3.12.6
- Huggingface_hub version: 0.26.2
- Safetensors version: 0.4.5
- Accelerate version: 1.1.1
- Accelerate config:    not found
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: distributed (`accelerate`)
- Using GPU in script?: Yes
- GPU type: NVIDIA A100-SXM4-40GB

Who can help?

@muellerz @SunMarc @ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I'm running into the following error while trying to use the SFTTrainer with FSDP and the accelerate library (full stack trace provided at the very bottom of this post).

torch.utils.checkpoint.CheckpointError: torch.utils.checkpoint: Recomputed values for the following tensors have different metadata than during the forward pass

This occurs when I set gradient_checkpointing: false and activation_checkpointing: true. Curiously, it actually seems to work if I set gradient_checkpointing: true and activation_checkpointing: false, but that produces the following warning message:

    When using FSDP full shard, instead of using `gradient_checkpointing` in TrainingArguments, please use `activation_checkpointing` in `fsdp_config`. The former introduces a redundant AllGather operation in backward pass. Reference: https://github.com/huggingface/transformers/issues/30404
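
For reference, the two combinations being contrasted can be expressed directly through `TrainingArguments`, roughly as in the sketch below. This is a minimal illustration, not my exact setup: `out_dir` is a placeholder, and the `fsdp_config` dict simply mirrors the YAML config further down.

```python
from transformers import TrainingArguments

# Failing combination: checkpointing delegated to the FSDP wrapper.
args_fsdp_ac = TrainingArguments(
    output_dir="out_dir",  # placeholder
    bf16=True,
    gradient_checkpointing=False,
    fsdp="full_shard auto_wrap offload",
    fsdp_config={"activation_checkpointing": True},
)

# Working-but-warning combination: checkpointing handled by the Trainer itself.
args_trainer_gc = TrainingArguments(
    output_dir="out_dir",  # placeholder
    bf16=True,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": True},
    fsdp="full_shard auto_wrap offload",
    fsdp_config={"activation_checkpointing": False},
)
```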

There are a few related GitHub issues that touch on this problem:

  1. Recomputed tensor size does not match when using activation checkpointing in FSDP strategy Lightning-AI/pytorch-lightning#19267
  2. activation_checkpointing error when using --fsdp #28499
  3. torch.nn.checkpoint.checkpoint ignores default device in backward() call pytorch/pytorch#124788
  4. cannot use activation_checkpoint in torch native fsdp #32073

One of these suggested setting use_reentrant: true, but that doesn't resolve the issue for me.
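
For context, here is a standalone illustration of the two `torch.utils.checkpoint` modes (purely illustrative; `block` is a made-up stand-in for a layer's forward pass). As far as I understand, the non-reentrant path (`use_reentrant=False`) is the one that performs the saved-vs-recomputed metadata check that appears in the stack trace below.

```python
import torch
from torch.utils.checkpoint import checkpoint

def block(x):
    # Stand-in for a transformer layer's forward pass.
    return torch.nn.functional.gelu(x @ x.t())

x = torch.randn(8, 8, requires_grad=True)

# Non-reentrant checkpointing: recomputed tensors are validated against the
# metadata saved during the forward pass (the check raising CheckpointError above).
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()

# Reentrant checkpointing: the legacy implementation, which does not run that check.
x2 = torch.randn(8, 8, requires_grad=True)
y2 = checkpoint(block, x2, use_reentrant=True)
y2.sum().backward()
```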

I'm attempting to run this as a SageMaker training job using the official HuggingFace estimator (this amounts to the following command: `torchrun --nnodes 1 --nproc_per_node 8 train.py`). My training script is essentially a lightly adapted version of the official examples. Below is how I'm instantiating the HuggingFace estimator object:

huggingface_estimator = HuggingFace(
    entry_point          = 'train.py',        # train script
    #entry_point          = 'launch.py',        # train script
    dependencies=['requirements.txt'],         
    source_dir           = './',            
    instance_type        = 'ml.p4d.24xlarge',
    instance_count       = 1,               
    max_run              = 2*24*60*60,     
    base_job_name        = job_name,          
    role                 = role,            
    volume_size          = 1024,              
    transformers_version = '4.36.0',     
    pytorch_version      = '2.1.0',          
    py_version           = 'py310',          
    hyperparameters      =  {
        "config_s3_uri": "s3://<foo>
    },
    #metric_definitions=metric_definitions,
    disable_output_compression = True,  
    distribution={"torch_distributed": {"enabled": True}},   # enables torchrun
    environment  = {
        "HUGGINGFACE_HUB_CACHE": "/tmp/.cache", 
        "HF_TOKEN": HfFolder.get_token(),      
        "ACCELERATE_USE_FSDP": "1",             # enable FSDP
        "FSDP_CPU_RAM_EFFICIENT_LOADING": "0",   # enable CPU RAM efficient loading
        "FSDP_AUTO_WRAP_POLICY": "TRANSFORMER_BASED_WRAP",
        "FSDP_BACKWARD_PREFETCH": "BACKWARD_PRE",
        "FSDP_STATE_DICT_TYPE": "FULL_STATE_DICT",
        "NCCL_TIMEOUT": "3600",  # 1 hour timeout
        "NCCL_DEBUG": "WARN",    
        "NCCL_IB_TIMEOUT": "3600",
        "NCCL_SOCKET_TIMEOUT": "3600",
        "NCCL_ASYNC_ERROR_HANDLING": "1",
        "NCCL_P2P_LEVEL": "NVL",
        "CUDA_DEVICE_MAX_CONNECTIONS": "1",        
        "MAX_JOBS": "1",                           
        "PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:512",
        "TORCH_DISTRIBUTED_DEBUG": "DETAIL",     
    },
    checkpoint_s3_uri=f's3://<foo>'
)

Below are some of the relevant parameters from my input config.

gradient_checkpointing: false 
gradient_checkpointing_kwargs:
  use_reentrant: true
attn_implementation: "flash_attention_2"
packing: false
bf16: "auto"
fsdp: "full_shard auto_wrap offload"
fsdp_config:
  limit_all_gathers: true
  backward_prefetch: "backward_pre"
  forward_prefetch: "false"
  use_orig_params: "false"
  min_num_params: 0
  activation_checkpointing: "true"
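
For additional context, activation checkpointing under FSDP is typically applied with torch's `checkpoint_wrapper` utilities, roughly as sketched below. This is an illustration of the mechanism rather than the exact accelerate code path; `fsdp_model` and the `LlamaDecoderLayer` wrap target are assumptions for the example.

```python
import functools
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    apply_activation_checkpointing,
    checkpoint_wrapper,
)
from transformers.models.llama.modeling_llama import LlamaDecoderLayer  # example wrap target

# Wrap each decoder layer in a non-reentrant checkpoint wrapper. The non-reentrant
# implementation is the one that validates recomputed tensor metadata against the
# metadata saved during the forward pass.
non_reentrant_wrapper = functools.partial(
    checkpoint_wrapper,
    checkpoint_impl=CheckpointImpl.NO_REENTRANT,
)
apply_activation_checkpointing(
    fsdp_model,  # the FSDP-wrapped model (assumed to exist)
    checkpoint_wrapper_fn=non_reentrant_wrapper,
    check_fn=lambda module: isinstance(module, LlamaDecoderLayer),
)
```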

Full Stack Trace

Traceback (most recent call last):
  File "/opt/ml/code/train.py", line 224, in <module>
    main(cfg)
  File "/opt/ml/code/train.py", line 207, in main
    trainer.train()
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2123, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2481, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 3612, in training_step
    self.accelerator.backward(loss, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 2241, in backward
    loss.backward(**kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/opt/conda/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 1075, in unpack_hook
    frame.check_recomputed_tensors_match(gid)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 850, in check_recomputed_tensors_match
    raise CheckpointError(
torch.utils.checkpoint.CheckpointError: torch.utils.checkpoint: Recomputed values for the following tensors have different metadata than during the forward pass.
tensor at position 18:
saved metadata: {'shape': torch.Size([2, 1024, 28, 128]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
recomputed metadata: {'shape': torch.Size([2, 2048, 28, 128]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
tensor at position 19:
saved metadata: {'shape': torch.Size([2, 1024, 28, 128]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
recomputed metadata: {'shape': torch.Size([2, 2048, 28, 128]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}

(The same traceback and CheckpointError are raised on the other ranks, differing only in the CUDA device index, e.g. index=1, index=2, index=3, index=6.)
0%|          | 0/100 [00:13<?, ?it/s]
[E ProcessGroupGloo.cpp:138] Rank 5 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
[E ProcessGroupGloo.cpp:138] Rank 4 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
[E ProcessGroupGloo.cpp:138] Rank 7 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
[2024-11-25 18:39:43,758] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 69 closing signal SIGTERM
[2024-11-25 18:39:43,758] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 73 closing signal SIGTERM
[2024-11-25 18:39:43,758] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 74 closing signal SIGTERM
[2024-11-25 18:39:43,758] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 76 closing signal SIGTERM
[2024-11-25 18:39:47,931] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 70) of binary: /opt/conda/bin/python

Expected behavior

The expected behavior is for the SFTTrainer's train() method to run without errors.
