[BUG][IMPORTANT] zero_to_fp32.py consolidated weights are all zero after this commit #6791

Open
npuichigo opened this issue Nov 26, 2024 · 5 comments · May be fixed by #6792
Labels: bug, compression

Comments


npuichigo commented Nov 26, 2024

Describe the bug
After commit dd40269, specifying max_shard_size when running zero_to_fp32.py generates empty weights.

To Reproduce
Use DeepSpeed v0.16.0 and run zero_to_fp32.py with max_shard_size specified.
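
A minimal reproduction sketch, assuming the post-dd40269 Python API in deepspeed.utils.zero_to_fp32 (paths and the shard size below are placeholders):

from deepspeed.utils.zero_to_fp32 import convert_zero_checkpoint_to_fp32_state_dict

# Placeholder paths; any ZeRO checkpoint directory should reproduce the problem.
convert_zero_checkpoint_to_fp32_state_dict(
    "path/to/checkpoint_dir",  # directory containing the ZeRO checkpoint
    "path/to/output_dir",      # where the consolidated shards are written
    max_shard_size="5GB",      # enabling sharding triggers the bug
)
# The saved shards contain empty (uninitialized) tensors instead of the real weights.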

Reason
According to the code below:

def to_torch_tensor(state_dict, return_empty_tensor=False):
    """
    Convert state_dict of GatheredTensor to torch tensor
    """
    converted_tensors = {}
    for name, tensor in state_dict.items():
        tensor_id = id(tensor)
        if tensor_id in converted_tensors:
            shared_tensor = state_dict[converted_tensors[tensor_id]]
            state_dict[name] = shared_tensor
        else:
            converted_tensors[tensor_id] = name
            if return_empty_tensor:
                # BUG: this writes into the caller's dict, discarding the
                # gathered weights.
                state_dict[name] = torch.empty(tensor.shape, dtype=tensor.dtype)
            else:
                state_dict[name] = tensor.contiguous()
    return state_dict
The state_dict is overwritten in place instead of being copied, which is why the original state_dict ends up empty.
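
For reference, a minimal sketch of one possible fix (not necessarily the exact change in #6792): build a fresh dict instead of writing back into the caller's state_dict:

import torch

def to_torch_tensor(state_dict, return_empty_tensor=False):
    """
    Convert state_dict of GatheredTensor to torch tensor
    """
    torch_state_dict = {}  # new dict; the input mapping is left untouched
    converted_tensors = {}
    for name, tensor in state_dict.items():
        tensor_id = id(tensor)
        if tensor_id in converted_tensors:
            # Reuse the tensor already converted for a shared parameter.
            torch_state_dict[name] = torch_state_dict[converted_tensors[tensor_id]]
        else:
            converted_tensors[tensor_id] = name
            if return_empty_tensor:
                torch_state_dict[name] = torch.empty(tensor.shape, dtype=tensor.dtype)
            else:
                torch_state_dict[name] = tensor.contiguous()
    return torch_state_dict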

npuichigo added the bug and compression labels Nov 26, 2024
npuichigo changed the title from "[BUG] zero_to_fp32.py consolidated weights are all zero after this commit" to "[BUG][IMPORTANT] zero_to_fp32.py consolidated weights are all zero after this commit" Nov 26, 2024
npuichigo (Author) commented:

@xu-song @tjruwase @loadams

xu-song (Contributor) commented Nov 26, 2024

Can you provide more details to reproduce your issue? ZeRO stage, model, world_size, conversion script, etc.

npuichigo (Author) commented:

I think it will affect many models. Just look at the code here: the state_dict passed into to_torch_tensor is only a shallow copy, so state_dict[name] = torch.empty(tensor.shape, dtype=tensor.dtype) overwrites the original weights read from the checkpoint.
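
A toy demonstration of the destructive write, using the buggy to_torch_tensor quoted above (the tensor name here is made up):

import torch

state_dict = {"layer.weight": torch.ones(2, 2)}
result = to_torch_tensor(state_dict, return_empty_tensor=True)
assert result is state_dict        # the same object is returned, not a copy
print(state_dict["layer.weight"])  # uninitialized values; the ones() are gone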

xu-song (Contributor) commented Nov 26, 2024

I see, thanks for reporting this issue.

xu-song linked pull request #6792 on Nov 26, 2024 that will close this issue
loadams (Contributor) commented Nov 26, 2024

Thanks for identifying the bug @npuichigo and thanks for the quick fix @xu-song
