[BUG][IMPORTANT] zero_to_fp32.py consolidated weights are all zero after this commit #6791

Open
npuichigo opened this issue Nov 26, 2024 · 5 comments · May be fixed by #6792
Labels: bug, compression

Comments


npuichigo commented Nov 26, 2024

Describe the bug
After commit dd40269, specifying max_shard_size when running zero_to_fp32.py generates empty weights.

To Reproduce
Use DeepSpeed v0.16.0 and run zero_to_fp32.py with max_shard_size specified.
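
A minimal reproduction sketch, assuming the post-dd40269 Python API in deepspeed.utils.zero_to_fp32 (paths and the shard size below are placeholders):

from deepspeed.utils.zero_to_fp32 import convert_zero_checkpoint_to_fp32_state_dict

# Placeholder paths; any ZeRO checkpoint directory should reproduce the problem.
convert_zero_checkpoint_to_fp32_state_dict(
    "path/to/checkpoint_dir",  # directory containing the ZeRO checkpoint
    "path/to/output_dir",      # where the consolidated shards are written
    max_shard_size="5GB",      # enabling sharding triggers the bug
)
# The saved shards contain empty (uninitialized) tensors instead of the real weights.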

Reason
According to the code below:

def to_torch_tensor(state_dict, return_empty_tensor=False):
    """
    Convert state_dict of GatheredTensor to torch tensor
    """
    converted_tensors = {}
    for name, tensor in state_dict.items():
        tensor_id = id(tensor)
        if tensor_id in converted_tensors:
            shared_tensor = state_dict[converted_tensors[tensor_id]]
            state_dict[name] = shared_tensor
        else:
            converted_tensors[tensor_id] = name
            if return_empty_tensor:
                # BUG: this writes into the caller's dict, discarding the
                # gathered weights.
                state_dict[name] = torch.empty(tensor.shape, dtype=tensor.dtype)
            else:
                state_dict[name] = tensor.contiguous()
    return state_dict
The state_dict is overwritten in place instead of being copied, which is why the original state_dict ends up empty.
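
For reference, a minimal sketch of one possible fix (not necessarily the exact change in #6792): build a fresh dict instead of writing back into the caller's state_dict:

import torch

def to_torch_tensor(state_dict, return_empty_tensor=False):
    """
    Convert state_dict of GatheredTensor to torch tensor
    """
    torch_state_dict = {}  # new dict; the input mapping is left untouched
    converted_tensors = {}
    for name, tensor in state_dict.items():
        tensor_id = id(tensor)
        if tensor_id in converted_tensors:
            # Reuse the tensor already converted for a shared parameter.
            torch_state_dict[name] = torch_state_dict[converted_tensors[tensor_id]]
        else:
            converted_tensors[tensor_id] = name
            if return_empty_tensor:
                torch_state_dict[name] = torch.empty(tensor.shape, dtype=tensor.dtype)
            else:
                torch_state_dict[name] = tensor.contiguous()
    return torch_state_dict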

npuichigo added the bug and compression labels Nov 26, 2024
npuichigo changed the title from "[BUG] zero_to_fp32.py consolidated weights are all zero after this commit" to "[BUG][IMPORTANT] zero_to_fp32.py consolidated weights are all zero after this commit" Nov 26, 2024
npuichigo (Author) commented:

@xu-song @tjruwase @loadams

xu-song (Contributor) commented Nov 26, 2024

Can you provide more details to reproduce your issue? ZeRO stage, model, world_size, conversion script, etc.

npuichigo (Author) commented:

I think it will affect many models. Just look at the code here: the state_dict passed into to_torch_tensor is only a shallow copy, so state_dict[name] = torch.empty(tensor.shape, dtype=tensor.dtype) overwrites the original weights read from the checkpoint.
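
A toy demonstration of the destructive write, using the buggy to_torch_tensor quoted above (the tensor name here is made up):

import torch

state_dict = {"layer.weight": torch.ones(2, 2)}
result = to_torch_tensor(state_dict, return_empty_tensor=True)
assert result is state_dict        # the same object is returned, not a copy
print(state_dict["layer.weight"])  # uninitialized values; the ones() are gone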

xu-song (Contributor) commented Nov 26, 2024

I see, thanks for reporting this issue.

xu-song linked pull request #6792 on Nov 26, 2024 that will close this issue
loadams (Contributor) commented Nov 26, 2024

Thanks for identifying the bug @npuichigo and thanks for the quick fix @xu-song
