grouping according to similar lengths #2

Open
homelifes opened this issue Feb 26, 2020 · 2 comments

homelifes commented Feb 26, 2020

Hi @sgrvinod
Thank you for your tutorial on Attention Is All You Need. I have a small question and would appreciate an answer.

In data loader.py you've grouped the batches according to their lengths, so that each batch contains sequences of similar length. Is that necessary? I do understand that it speeds up training and reduces memory, but my question is: does it have any effect on performance if I don't group the data by length?

Thanks

sgrvinod (Owner) commented Feb 27, 2020

Hello @homelifes

I also received your email but I will answer here in case it is helpful for others.

I think you'd definitely need to batch sequences such that each batch contains the same number of tokens, because the predictions are at the token level, and therefore losses are averaged over tokens. A batch with fewer tokens will grant a disproportionate amount of influence to those tokens.
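
A toy sketch of that weighting argument (the token counts below are made up, and this is not code from the tutorial; it only puts numbers on the averaging effect):

    # Toy numbers: the loss is mean-reduced over the (non-pad) tokens
    # *within* each batch, and each batch then drives one optimizer step.
    tokens_in_small_batch = 20
    tokens_in_large_batch = 500

    # Weight of a single token inside its own batch's averaged loss.
    weight_in_small_batch = 1.0 / tokens_in_small_batch   # 0.05
    weight_in_large_batch = 1.0 / tokens_in_large_batch   # 0.002

    # A token that happens to land in the small batch pulls ~25x harder
    # on that step's gradients than a token in the large batch, even
    # though pads are excluded from the loss in both cases.
    print(weight_in_small_batch / weight_in_large_batch)  # ~25.0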

About whether you need to group sequences of the same or similar lengths in a batch - I don't think you need to do this.

Grouping sequences by length is mainly for two reasons:

  • There are fewer or no pads that you're computing over.
  • For a fixed number of tokens in the batch, a batch tensor containing sequences of varying lengths will take much more memory than a batch tensor containing sequences of the same length. For example, if most sequences in a batch are of an average length of 50, but you have just one sequence that is of length 95, that's a lot of padding on every sequence in that batch, resulting in a much larger tensor (a rough back-of-the-envelope calculation is sketched below).
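
A rough back-of-the-envelope version of that second point (the batch size, model width, and dtype are assumed here, not taken from the tutorial):

    # Assumed dimensions, just to put numbers on the padding cost:
    batch_size = 100        # sequences per batch
    d_model = 512           # model width
    bytes_per_float = 4     # float32

    # All sequences of similar length: pad everything to ~50 tokens.
    similar_lengths = batch_size * 50 * d_model * bytes_per_float

    # One length-95 outlier forces every sequence to be padded to 95 tokens.
    mixed_lengths = batch_size * 95 * d_model * bytes_per_float

    # Roughly 9.8 MiB vs 18.6 MiB per activation tensor of this shape,
    # i.e. nearly double the memory for the same number of real tokens.
    print(similar_lengths / 2**20, mixed_lengths / 2**20)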

So, if this is not a concern, then I think you'd be fine having sequences of varying lengths in your batches as long as there are approximately the same number of tokens being predicted in the batch. In fact, it might even be better because now your batching can be much more random each epoch, whereas trying to ensure source and target sequences are of the same length will reduce the variation in batches' contents. (That is, trying to minimize padding in batches like I did may result in the same batches each time, and it is just the order of the batches that is shuffled for different epochs.)
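
A minimal sketch of that alternative, using a hypothetical make_token_batches helper (not part of the tutorial's dataloader): shuffle the data each epoch and greedily fill each batch up to a target-token budget, without sorting by length.

    import random

    def make_token_batches(pairs, tokens_per_batch=2000, seed=0):
        """Hypothetical helper: pack shuffled (source, target) pairs into
        batches of roughly `tokens_per_batch` target tokens, without
        sorting by length, so batch composition stays random each epoch."""
        rng = random.Random(seed)
        order = list(range(len(pairs)))
        rng.shuffle(order)  # a fresh, fully random order each call/epoch

        batches, batch, n_tokens = [], [], 0
        for i in order:
            target_len = len(pairs[i][1])  # tokens that will be predicted
            if batch and n_tokens + target_len > tokens_per_batch:
                batches.append(batch)
                batch, n_tokens = [], 0
            batch.append(i)
            n_tokens += target_len
        if batch:
            batches.append(batch)
        return batches  # lists of dataset indices, each ~tokens_per_batch target tokens

Padding within each batch then goes up to the longest sequence that happens to land in it, which is exactly the memory trade-off described above.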

homelifes (Author) commented Feb 27, 2020

Thanks for your answer, @sgrvinod.

You said: "I think you'd definitely need to batch sequences such that each batch contains the same number of tokens, because the predictions are at the token level, and therefore losses are averaged over tokens. A batch with fewer tokens will grant a disproportionate amount of influence to those tokens."

But according to your code, you aren't computing the loss over the pads, so the calculated loss covers only the non-padded tokens, as written here:

        inputs, _, _, _ = pack_padded_sequence(input=inputs,
                                               lengths=lengths,
                                               batch_first=True,
                                               enforce_sorted=False)  # (sum(lengths), vocab_size)
        targets, _, _, _ = pack_padded_sequence(input=targets,
                                                lengths=lengths,
                                                batch_first=True,
                                                enforce_sorted=False)  # (sum(lengths))

So why would there be an imbalance?
