grouping according to similar lengths #2
Hello @homelifes, I also received your email, but I will answer here in case it is helpful for others. I think you'd definitely need to batch sequences such that each batch contains about the same number of tokens, because the predictions are at the token level, and therefore losses are averaged over tokens. A batch with fewer tokens would grant a disproportionate amount of influence to those tokens. As for whether you need to group sequences of the same or similar lengths in a batch: I don't think you need to do this. Grouping sequences by length mainly serves two purposes: it minimizes the computation wasted on pad tokens (speeding up training), and it reduces memory usage.
So, if this is not a concern, then I think you'd be fine having sequences of varying lengths in your batches, as long as approximately the same number of tokens is being predicted in each batch. In fact, it might even be better, because your batching can then be much more random each epoch, whereas trying to ensure source and target sequences are of the same length reduces the variation in batches' contents. (That is, trying to minimize padding in batches like I did may result in the same batches each time, with only the order of the batches shuffled for different epochs.)
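The token-budget idea described above could be sketched as follows. This is a hypothetical helper (`batch_by_token_budget` is not from this repo): sequences are shuffled freely, then packed into batches so that each batch contributes roughly the same number of predicted tokens, regardless of individual sequence lengths.

```python
import random

def batch_by_token_budget(sequences, max_tokens_per_batch=64):
    """Group variable-length sequences so each batch predicts
    roughly the same number of tokens (a sketch, not the repo's code)."""
    order = list(range(len(sequences)))
    random.shuffle(order)  # lengths may vary freely within a batch

    batches, current, current_tokens = [], [], 0
    for i in order:
        n = len(sequences[i])
        # Start a new batch once adding this sequence would exceed the budget.
        if current and current_tokens + n > max_tokens_per_batch:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(sequences[i])
        current_tokens += n
    if current:
        batches.append(current)
    return batches
```

Because the shuffle happens before packing, the batch contents differ from epoch to epoch, unlike length-sorted batching where mostly the batch order changes.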
Thanks for your answer, @sgrvinod.
But according to your code, you aren't computing the loss across the pads. Therefore the loss is calculated only over the non-padded tokens, as written here:
So why would there be an imbalance?
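A small numeric sketch may clarify the imbalance being discussed (the loss values below are made up for illustration): even with pads masked out, if the loss is averaged over non-pad tokens *within* each batch and those batch losses are then averaged (or each drives one optimizer step), every batch carries equal weight, so each token in a batch that predicts fewer tokens exerts more influence than a token-level pooled average would give it.

```python
# Hypothetical per-token losses for two batches (pads already excluded).
batch_a = [0.9, 1.1, 1.0, 0.8, 1.2, 1.0]  # 6 predicted tokens
batch_b = [2.0, 2.0]                      # 2 predicted tokens

# Per-batch mean, then mean over batches: each *batch* gets equal weight.
per_batch = (sum(batch_a) / len(batch_a) + sum(batch_b) / len(batch_b)) / 2

# Pooled mean over all tokens: each *token* gets equal weight.
pooled = (sum(batch_a) + sum(batch_b)) / (len(batch_a) + len(batch_b))

# The two disagree whenever batches predict different numbers of tokens,
# which is why equal token counts per batch matter.
print(per_batch, pooled)
```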
Hi @sgrvinod
Thank you for your tutorial on "Attention Is All You Need." I have a small question and would appreciate an answer.
In dataloader.py you've grouped the batches according to their lengths, so that each batch contains sequences of similar lengths. Is that necessary? I do understand that it speeds up training and reduces memory. But my question is: does it have any effect on performance if I don't group the data according to length?
Thanks