Staged batch size training #80
Labels
core: Improves the core model while keeping the core idea intact
engineering: Software engineering problems that don't require ML expertise
research: Creative project that might fail but could give high returns
Papers such as "Don't Decay the Learning Rate, Increase the Batch Size" have shown that training with progressively larger batch sizes instead of progressively lower learning rates helps models find a better local minimum by improving stability in the final stages of training. It also speeds up training, since throughput (in tokens/s) increases with batch size.
Intuitively, this lets the model take many small updates early on, when the per-sample gradients within a batch point in a similar direction. During later stages of training, the gradients start to point in different directions, so larger batches (or lower learning rates) are required. A minimal sketch of such a staged schedule is given below.
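The following is a minimal PyTorch sketch of the idea, not a proposed implementation for this repo: the learning rate stays fixed while the batch size is enlarged at stage boundaries. The toy model, dataset, and the `stages` schedule (epochs per stage, batch size) are illustrative placeholders, not tuned values.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and data; in practice these would be the real model and corpus.
model = nn.Linear(32, 1)
train_dataset = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

# Each stage keeps the learning rate constant and increases the batch size,
# in the spirit of "Don't Decay the Learning Rate, Increase the Batch Size".
stages = [  # (num_epochs, batch_size) -- illustrative schedule
    (2, 32),
    (2, 128),
    (2, 512),
]

for num_epochs, batch_size in stages:
    # Rebuild the DataLoader at each stage boundary with the larger batch size.
    loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    for _ in range(num_epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
```

In a large-scale setup the same effect could instead be achieved by increasing the gradient-accumulation factor at stage boundaries, which avoids reallocating data pipelines and keeps the per-device batch size constant.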