[NeMo UX] Support generating datasets using different train/valid/test distributions #9771

ashors1 · 2024-07-17T18:31:10Z

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Collection: [Note which collection this PR will affect]

Changelog

Add specific line by line info of high level changes in this PR.

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

Related to # (issue)

Signed-off-by: ashors1 <[email protected]>

athitten · 2024-07-18T23:12:58Z

nemo/collections/llm/gpt/data/pre_training.py

+            build_kwargs["blend_per_split"] = [
+                get_blend_from_list(paths["train"]),
+                get_blend_from_list(paths["validation"]),
+                get_blend_from_list(paths["test"]),


@ashors1 does get_blend_from_list(paths["train"]) work even when you have multiple data files? For example: {"train": [/datafile1/, /datafile2/]}. Also, in this case, if weights are ignored, is the dataset built with all samples from both /datafile1/ and /datafile2/?

ashors1 · 2024-07-19

Yes, get_blend_from_list(paths["train"]) works when you have multiple paths. You can also pass in weights by interleaving them with the paths. For example, the following would work:

paths = {
    "train": [25, PATH1, 75, PATH2],
    "validation": [PATH3, PATH4],
    "test": ['1', PATH5],
}

The only time the weights are not used is when limit_val_batches <= 1.0, in which case we want to return the full validation dataset. In this case, users are expected not to provide weights for the paths.
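The interleaving convention described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Megatron `get_blend_from_list` implementation: `split_blend` and `check_validation_blend` are hypothetical helpers, the `/data/...` prefixes are placeholders, and the rule "an even-length list whose even-indexed entries all parse as numbers carries weights" is inferred from the example rather than taken from the library.

```python
from typing import List, Optional, Sequence, Tuple, Union

BlendItem = Union[str, int, float]


def split_blend(blend: Sequence[BlendItem]) -> Tuple[List[str], Optional[List[float]]]:
    """Hypothetical helper: split a blend list into (paths, weights).

    Assumption: weights are present only when the list has even length and
    every even-indexed entry can be converted to a float. Otherwise every
    entry is treated as a path and the weights are None.
    """
    items = list(blend)
    if items and len(items) % 2 == 0:
        try:
            weights = [float(w) for w in items[0::2]]
        except (TypeError, ValueError):
            weights = None  # at least one candidate is a path, not a number
        if weights is not None:
            paths = [str(p).strip() for p in items[1::2]]
            return paths, weights
    return [str(p).strip() for p in items], None


def check_validation_blend(limit_val_batches: float, validation_blend: Sequence[BlendItem]) -> None:
    """Hypothetical check mirroring the comment above: when
    limit_val_batches <= 1.0 the full validation set is used, so weights
    on the validation paths are rejected."""
    _, weights = split_blend(validation_blend)
    if limit_val_batches <= 1.0 and weights is not None:
        raise ValueError(
            "validation paths must not carry weights when limit_val_batches <= 1.0"
        )


if __name__ == "__main__":
    print(split_blend([25, "/data/a", 75, "/data/b"]))
    print(split_blend(["/data/c", "/data/d"]))
    print(split_blend(["1", "/data/e"]))
```

One limitation of this sketch: a path made up only of digits would be misread as a weight, so it is only suitable for illustrating the convention.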

athitten

LGTM, thanks @ashors1! Also, for future reference, it might be good to document which behaviors of the Megatron datasets are supported and which are not.

…t distributions (#9771)

* support building train/valid/test datasets from separate distributions
  Signed-off-by: ashors1 <[email protected]>
* add minimal test
  Signed-off-by: ashors1 <[email protected]>
* Apply isort and black reformatting
  Signed-off-by: ashors1 <[email protected]>
* set limit_val_batches for nemo 2 example
  Signed-off-by: ashors1 <[email protected]>
* improve assert statement
  Signed-off-by: ashors1 <[email protected]>
* Apply isort and black reformatting
  Signed-off-by: ashors1 <[email protected]>

---------

Signed-off-by: ashors1 <[email protected]>
Signed-off-by: ashors1 <[email protected]>
Co-authored-by: ashors1 <[email protected]>

…t distributions (#9771) (#9841)

* support building train/valid/test datasets from separate distributions
* add minimal test
* Apply isort and black reformatting
* set limit_val_batches for nemo 2 example
* improve assert statement
* Apply isort and black reformatting

---------

Signed-off-by: ashors1 <[email protected]>
Signed-off-by: ashors1 <[email protected]>
Co-authored-by: Anna Shors <[email protected]>
Co-authored-by: ashors1 <[email protected]>

ashors1 added 2 commits July 17, 2024 10:59

support building train/valid/test datasets from separate distributions

4158a85

Signed-off-by: ashors1 <[email protected]>

add minimal test

a4fac11

Signed-off-by: ashors1 <[email protected]>

ashors1 requested a review from cuichenx July 17, 2024 18:31

ashors1 added the Run CICD label Jul 17, 2024

Apply isort and black reformatting

b7f6c2c

Signed-off-by: ashors1 <[email protected]>

ashors1 added Run CICD and removed Run CICD labels Jul 17, 2024

ashors1 requested a review from athitten July 18, 2024 00:55

set limit_val_batches for nemo 2 example

7274c24

Signed-off-by: ashors1 <[email protected]>

ashors1 added Run CICD and removed Run CICD labels Jul 18, 2024

athitten reviewed Jul 18, 2024

nemo/collections/llm/gpt/data/pre_training.py

athitten reviewed Jul 18, 2024

ashors1 and others added 3 commits July 18, 2024 21:22

improve assert statement

8764a4a

Signed-off-by: ashors1 <[email protected]>

Apply isort and black reformatting

5ae2531

Signed-off-by: ashors1 <[email protected]>

Merge branch 'r2.0.0rc1' into ashors/nemo-ux-dataloader-distributions

069a03d

ashors1 added Run CICD and removed Run CICD labels Jul 19, 2024

ericharper added the 2.0.0rc1 and bug labels Jul 19, 2024

pablo-garay added Run CICD and removed Run CICD labels Jul 19, 2024

ashors1 requested a review from athitten July 19, 2024 23:08

athitten approved these changes Jul 22, 2024

ashors1 merged commit 5143065 into r2.0.0rc1 Jul 23, 2024
332 checks passed

ashors1 deleted the ashors/nemo-ux-dataloader-distributions branch July 23, 2024 03:13
