Why does ZeRO-Offload update parameters after the model backward pass? Can the two be pipelined? #5478

Zijie-Tian · 2024-04-29T13:15:12Z

After reading the ZeRO-offload paper, I noticed the following figure:

So why can't model parameter updates be divided into multiple blocks, and why can't the parameter update and backward process be executed in a pipelined manner?

Below is what I expect it to look like.

Can someone explain it to me? I'm very confused.

Answered by GuanhuaWang

GuanhuaWang (Collaborator) · 2024-04-29T18:33:18Z

Hi @Zijie-Tian, what is U1234 here? I assume it is the optimizer step on the CPU side.

The main reason is that CPU compute is far slower than GPU compute. In your pipelined schedule, the parameters needed first by the next forward pass (F1/P1) are the last ones to be updated on the CPU (they have to wait until U4321 have all finished), so they incur the longest delay. With such a pipeline, the CPU would therefore become the bottleneck of the whole training pipeline.
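To make the argument above concrete, here is a toy timing model in Python. It is only a sketch under assumptions that are not from the thread: 4 layers, a fixed per-layer backward time on the GPU, a fixed per-layer update time on the CPU that is 4x slower, and no transfer cost. The function name `next_forward_start` and all numbers are hypothetical.

```python
def next_forward_start(num_layers, bwd_time, upd_time, pipelined):
    """Time at which layer 1's updated params are ready for the next forward.

    Backward runs over layers num_layers..1 on the GPU. The CPU updates one
    layer at a time, in the same order, so layer 1 is always updated last.
    """
    gpu_time = 0.0   # when the GPU finishes the current layer's backward
    cpu_free = 0.0   # when the CPU finishes its current update
    for _layer in range(num_layers, 0, -1):
        gpu_time += bwd_time
        if pipelined:
            # The update starts once the gradient exists and the CPU is idle.
            start = max(gpu_time, cpu_free)
            cpu_free = start + upd_time
    if not pipelined:
        # All updates run only after the whole backward pass has finished.
        cpu_free = gpu_time + num_layers * upd_time
    return cpu_free


sequential = next_forward_start(4, bwd_time=1.0, upd_time=4.0, pipelined=False)
overlapped = next_forward_start(4, bwd_time=1.0, upd_time=4.0, pipelined=True)
print(sequential, overlapped)  # 20.0 17.0
```

With these made-up numbers, pipelining only hides part of the short backward pass. The next forward still waits for the full chain of CPU updates, so the CPU stays on the critical path.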

Because of this, we also added an optimization that delays parameter updates by one iteration, as described in Section 5 of the paper (https://arxiv.org/pdf/2101.06840):

Second, we develop a one-step delayed parameter update schedule that overlaps the CPU parameter update computation with the forward and backward computation on the GPU, hiding the CPU execution time when enabled.
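The data dependence behind a one-step delayed update can be sketched in plain Python. This is not DeepSpeed code. It is a toy gradient descent on an assumed quadratic loss `(w - 3)^2`, where the gradient computed at one step is applied only during the following step. In a real system the two commented stages would run concurrently on GPU and CPU; here they run one after the other, but each gradient is still computed from parameters that lack the previous step's update. The names `grad` and `train_delayed` are hypothetical.

```python
def grad(w):
    # Gradient of the assumed toy loss (w - 3)^2.
    return 2.0 * (w - 3.0)


def train_delayed(steps, lr=0.1, w0=0.0):
    w = w0
    pending = None  # gradient from the previous step, not yet applied
    for _ in range(steps):
        # "GPU" stage: backward on the current, one-step-stale parameters.
        g = grad(w)
        # "CPU" stage: apply the gradient produced one step earlier.
        if pending is not None:
            w = w - lr * pending
        pending = g
    return w


print(train_delayed(60))  # approaches 3.0 for this toy problem
```

On this toy problem the delayed schedule still converges. That says nothing about convergence on a real model.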

Hope it helps

2 replies

Zijie-Tian May 24, 2024
Author

Sorry for the late response, @GuanhuaWang. Setting aside the one-step delayed update and its possible impact on model convergence, if I understand correctly, the current ZeRO-Offload schedule should look like the following diagram, where P stands for "Param Update" and C stands for "Communication".

Did I understand it correctly? (PS: At runtime, the CPU-side optimizer might take a bit longer.)

fzyzcjy Nov 23, 2024

Hi, sorry, but I am still confused. What I observe is shown below. It seems that the GPU and CPU are each idle about half the time, so it would be great if they could run in parallel.
