Recomputed tensor size does not match when using activation checkpointing when using FSDP and accelerate #34928

jjbuck opened this issue Nov 25, 2024 · 0 comments
jjbuck commented Nov 25, 2024

System Info

- `transformers` version: 4.46.3
- Platform: Linux-6.8.0-1015-aws-x86_64-with-glibc2.35
- Python version: 3.12.6
- Huggingface_hub version: 0.26.2
- Safetensors version: 0.4.5
- Accelerate version: 1.1.1
- Accelerate config:    not found
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: distributed (`accelerate`)
- Using GPU in script?: Yes
- GPU type: NVIDIA A100-SXM4-40GB

Who can help?

@muellerz @SunMarc @ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I'm running into the following error while trying to use the SFTTrainer with FSDP and the accelerate library (full stack trace provided at the very bottom of this post).

torch.utils.checkpoint.CheckpointError: torch.utils.checkpoint: Recomputed values for the following tensors have different metadata than during the forward pass

This occurs when I set gradient_checkpointing: false and activation_checkpointing: true. Curiously, it actually seems to work if I set gradient_checkpointing: true and activation_checkpointing: false, but that produces the following warning message:

    When using FSDP full shard, instead of using `gradient_checkpointing` in TrainingArguments, please use `activation_checkpointing` in `fsdp_config`. The former introduces a redundant AllGather operation in backward pass. Reference: https://github.com/huggingface/transformers/issues/30404
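
For reference, the two combinations being contrasted can be expressed directly through `TrainingArguments`, roughly as in the sketch below. This is a minimal illustration, not my exact setup: `out_dir` is a placeholder, and the `fsdp_config` dict simply mirrors the YAML config further down.

```python
from transformers import TrainingArguments

# Failing combination: checkpointing delegated to the FSDP wrapper.
args_fsdp_ac = TrainingArguments(
    output_dir="out_dir",  # placeholder
    bf16=True,
    gradient_checkpointing=False,
    fsdp="full_shard auto_wrap offload",
    fsdp_config={"activation_checkpointing": True},
)

# Working-but-warning combination: checkpointing handled by the Trainer itself.
args_trainer_gc = TrainingArguments(
    output_dir="out_dir",  # placeholder
    bf16=True,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": True},
    fsdp="full_shard auto_wrap offload",
    fsdp_config={"activation_checkpointing": False},
)
```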

There are a few related GitHub issues that touch on this problem:

  1. Recomputed tensor size does not match when using activation checkpointing in FSDP strategy Lightning-AI/pytorch-lightning#19267
  2. activation_checkpointing error when using --fsdp #28499
  3. torch.nn.checkpoint.checkpoint ignores default device in backward() call pytorch/pytorch#124788
  4. cannot use activation_checkpoint in torch native fsdp #32073

One of these suggested setting use_reentrant: true, but that doesn't resolve the issue for me.
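
For context, here is a standalone illustration of the two `torch.utils.checkpoint` modes (purely illustrative; `block` is a made-up stand-in for a layer's forward pass). As far as I understand, the non-reentrant path (`use_reentrant=False`) is the one that performs the saved-vs-recomputed metadata check that appears in the stack trace below.

```python
import torch
from torch.utils.checkpoint import checkpoint

def block(x):
    # Stand-in for a transformer layer's forward pass.
    return torch.nn.functional.gelu(x @ x.t())

x = torch.randn(8, 8, requires_grad=True)

# Non-reentrant checkpointing: recomputed tensors are validated against the
# metadata saved during the forward pass (the check raising CheckpointError above).
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()

# Reentrant checkpointing: the legacy implementation, which does not run that check.
x2 = torch.randn(8, 8, requires_grad=True)
y2 = checkpoint(block, x2, use_reentrant=True)
y2.sum().backward()
```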

I'm attempting to run this as a SageMaker training job using the official HuggingFace estimator (this amounts to the following command: `torchrun --nnodes 1 --nproc_per_node 8 train.py`). My training script is essentially a lightly adapted version of the official examples. Below is how I'm instantiating the HuggingFace estimator object:

huggingface_estimator = HuggingFace(
    entry_point          = 'train.py',        # train script
    #entry_point          = 'launch.py',        # train script
    dependencies=['requirements.txt'],         
    source_dir           = './',            
    instance_type        = 'ml.p4d.24xlarge',
    instance_count       = 1,               
    max_run              = 2*24*60*60,     
    base_job_name        = job_name,          
    role                 = role,            
    volume_size          = 1024,              
    transformers_version = '4.36.0',     
    pytorch_version      = '2.1.0',          
    py_version           = 'py310',          
    hyperparameters      =  {
        "config_s3_uri": "s3://<foo>
    },
    #metric_definitions=metric_definitions,
    disable_output_compression = True,  
    distribution={"torch_distributed": {"enabled": True}},   # enables torchrun
    environment  = {
        "HUGGINGFACE_HUB_CACHE": "/tmp/.cache", 
        "HF_TOKEN": HfFolder.get_token(),      
        "ACCELERATE_USE_FSDP": "1",             # enable FSDP
        "FSDP_CPU_RAM_EFFICIENT_LOADING": "0",   # enable CPU RAM efficient loading
        "FSDP_AUTO_WRAP_POLICY": "TRANSFORMER_BASED_WRAP",
        "FSDP_BACKWARD_PREFETCH": "BACKWARD_PRE",
        "FSDP_STATE_DICT_TYPE": "FULL_STATE_DICT",
        "NCCL_TIMEOUT": "3600",  # 1 hour timeout
        "NCCL_DEBUG": "WARN",    
        "NCCL_IB_TIMEOUT": "3600",
        "NCCL_SOCKET_TIMEOUT": "3600",
        "NCCL_ASYNC_ERROR_HANDLING": "1",
        "NCCL_P2P_LEVEL": "NVL",
        "CUDA_DEVICE_MAX_CONNECTIONS": "1",        
        "MAX_JOBS": "1",                           
        "PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:512",
        "TORCH_DISTRIBUTED_DEBUG": "DETAIL",     
    },
    checkpoint_s3_uri=f's3://<foo>'
)

Below are some of the relevant parameters from my input config.

gradient_checkpointing: false 
gradient_checkpointing_kwargs:
  use_reentrant: true
attn_implementation: "flash_attention_2"
packing: false
bf16: "auto"
fsdp: "full_shard auto_wrap offload"
fsdp_config:
  limit_all_gathers: true
  backward_prefetch: "backward_pre"
  forward_prefetch: "false"
  use_orig_params: "false"
  min_num_params: 0
  activation_checkpointing: "true"
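
For additional context, activation checkpointing under FSDP is typically applied with torch's `checkpoint_wrapper` utilities, roughly as sketched below. This is an illustration of the mechanism rather than the exact accelerate code path; `fsdp_model` and the `LlamaDecoderLayer` wrap target are assumptions for the example.

```python
import functools
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    apply_activation_checkpointing,
    checkpoint_wrapper,
)
from transformers.models.llama.modeling_llama import LlamaDecoderLayer  # example wrap target

# Wrap each decoder layer in a non-reentrant checkpoint wrapper. The non-reentrant
# implementation is the one that validates recomputed tensor metadata against the
# metadata saved during the forward pass.
non_reentrant_wrapper = functools.partial(
    checkpoint_wrapper,
    checkpoint_impl=CheckpointImpl.NO_REENTRANT,
)
apply_activation_checkpointing(
    fsdp_model,  # the FSDP-wrapped model (assumed to exist)
    checkpoint_wrapper_fn=non_reentrant_wrapper,
    check_fn=lambda module: isinstance(module, LlamaDecoderLayer),
)
```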

Full Stack Trace

Traceback (most recent call last):
  File "/opt/ml/code/train.py", line 224, in <module>
    main(cfg)
  File "/opt/ml/code/train.py", line 207, in main
    trainer.train()
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2123, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2481, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 3612, in training_step
    self.accelerator.backward(loss, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 2241, in backward
    loss.backward(**kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/opt/conda/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 1075, in unpack_hook
    frame.check_recomputed_tensors_match(gid)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 850, in check_recomputed_tensors_match
    raise CheckpointError(
torch.utils.checkpoint.CheckpointError: torch.utils.checkpoint: Recomputed values for the following tensors have different metadata than during the forward pass.
tensor at position 18:
saved metadata: {'shape': torch.Size([2, 1024, 28, 128]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
recomputed metadata: {'shape': torch.Size([2, 2048, 28, 128]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
tensor at position 19:
saved metadata: {'shape': torch.Size([2, 1024, 28, 128]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
recomputed metadata: {'shape': torch.Size([2, 2048, 28, 128]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}

(The same traceback and CheckpointError are raised on the other ranks, differing only in the CUDA device index, e.g. index=1, index=2, index=3, index=6.)
0%|          | 0/100 [00:13<?, ?it/s]
[E ProcessGroupGloo.cpp:138] Rank 5 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
[E ProcessGroupGloo.cpp:138] Rank 4 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
[E ProcessGroupGloo.cpp:138] Rank 7 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
[2024-11-25 18:39:43,758] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 69 closing signal SIGTERM
[2024-11-25 18:39:43,758] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 73 closing signal SIGTERM
[2024-11-25 18:39:43,758] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 74 closing signal SIGTERM
[2024-11-25 18:39:43,758] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 76 closing signal SIGTERM
[2024-11-25 18:39:47,931] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 70) of binary: /opt/conda/bin/python

Expected behavior

The expected behavior is for the SFTTrainer's train() method to run without errors.
