Commit
Provide safeguards during training (#168)
* fix: add safeguards during data processing

Signed-off-by: Oleg S <[email protected]>

* fix: add a safeguard for max_batch_len & max_seq_len in training

We currently have certain values that need to be validated against others, but no logic to ensure that this validation happens. This commit adds a pre-training check that errors out if the value of max_batch_len is smaller than max_seq_len, since that condition breaks our ability to generate training batches.

Signed-off-by: Oleg S <[email protected]>

* fix: add fallback logic to use the distributed sampler

When we use the multipack sampler, it requires the dataset to have a certain shape relative to the number of GPUs so that all of the samples can be distributed across the nodes. When the dataset does not meet that requirement, the train loader becomes empty, which prevents us from being able to train. This commit resolves the issue by falling back to the distributed sampler when the multipack sampler fails.

Signed-off-by: Oleg S <[email protected]>

---------

Signed-off-by: Oleg S <[email protected]>
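The max_batch_len / max_seq_len safeguard described above amounts to a single comparison before training starts. Below is a minimal sketch of such a check, assuming a hypothetical `TrainingArgs` container and `validate_batch_limits` function for illustration; these names are not taken from the repository.

```python
# Minimal sketch of a pre-training safeguard; TrainingArgs and
# validate_batch_limits are illustrative names, not the project's API.
from dataclasses import dataclass


@dataclass
class TrainingArgs:
    max_batch_len: int  # token budget for one packed batch
    max_seq_len: int    # longest single sample allowed


def validate_batch_limits(args: TrainingArgs) -> None:
    # A single sample must fit inside one batch; otherwise no valid
    # training batch can ever be generated.
    if args.max_batch_len < args.max_seq_len:
        raise ValueError(
            f"max_batch_len ({args.max_batch_len}) must be at least "
            f"max_seq_len ({args.max_seq_len})"
        )
```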
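The sampler fallback can likewise be sketched as a small decision around the train loader. The `choose_train_loader` helper below is hypothetical and only illustrates the idea; `DistributedSampler` is the standard sampler from `torch.utils.data`.

```python
# Hedged sketch of the fallback: if the multipack-based loader came out
# empty, rebuild the loader with the plain DistributedSampler instead.
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler


def choose_train_loader(
    dataset: Dataset, multipack_loader: DataLoader, batch_size: int
) -> DataLoader:
    # An empty loader means the multipack sampler could not spread the
    # samples across all ranks; fall back so training can still proceed.
    if len(multipack_loader) == 0:
        sampler = DistributedSampler(dataset)
        return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
    return multipack_loader
```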