
Context Parallel SFT Support for dataset in THD format #10688

Open · wants to merge 29 commits into base: main

Conversation

@tomlifu (Contributor) commented Sep 30, 2024

What does this PR do ?

This PR adds context parallelism (CP) support for the THD data format and is compatible with cu_seqlen_padded in the latest cuDNN fused attention.
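To illustrate the metadata involved, here is a minimal, hypothetical sketch of THD-style sequence offsets. The lengths, the `cu_seqlens` naming, and the pad-to-a-multiple-of-`2 * cp_size` rule are illustrative assumptions, not the PR's actual implementation:

```python
import numpy as np

# Three example sequences of lengths 5, 3, and 6 are concatenated
# ("packed") into one token stream, as in the THD layout.
seq_lens = [5, 3, 6]

# cu_seqlens holds the cumulative offsets delimiting each sequence.
cu_seqlens = np.cumsum([0] + seq_lens)  # [0, 5, 8, 14]

# Assumption for illustration: with context parallelism, each sequence is
# padded up to a multiple of 2 * cp_size so it can be split evenly across
# CP ranks; a second offset array tracks positions after padding.
cp_size = 2
pad_to = 2 * cp_size
padded_lens = [((l + pad_to - 1) // pad_to) * pad_to for l in seq_lens]
cu_seqlens_padded = np.cumsum([0] + padded_lens)  # [0, 8, 12, 20]

print(cu_seqlens.tolist(), cu_seqlens_padded.tolist())
```

The fused-attention kernel needs both arrays so it can locate real tokens inside the padded stream.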

Steps to run SFT + CP + THD format:

  1. Prepare a packed dataset in THD format: run scripts/nlp_language_modeling/prepare_packed_ft_dataset.py to pack the dataset into THD format at the desired sequence length. For example:
python <NeMo_top_dir>/scripts/nlp_language_modeling/prepare_packed_ft_dataset.py \
        model.data.train_ds.file_names=[<dataset_top_dir>/squad/1_squad_train.jsonl] \
        model.data.train_ds.max_seq_length=4096 \
        +model.context_parallel_size=2 \
        +tokenizer_path=<tokenizer_path> \
        +output_dir=<output_dir> +pack_sizes=[4096]
  2. Run SFT on the packed dataset in THD format with the same CP size specified in the previous step.
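For intuition on why the pack step must know the CP size, here is a hedged sketch of the interleaved chunking commonly used by Megatron-style context parallelism for causal attention. The `cp_shard` helper and its chunk assignment are illustrative assumptions; the actual NeMo/Transformer Engine code may differ:

```python
def cp_shard(tokens, cp_size, rank):
    """Hypothetical sketch: split one (padded) sequence into 2 * cp_size
    chunks and give rank i the chunks i and (2 * cp_size - 1 - i), which
    balances causal-attention work across CP ranks."""
    num_chunks = 2 * cp_size
    chunk = len(tokens) // num_chunks  # requires len divisible by 2 * cp_size
    first = tokens[rank * chunk:(rank + 1) * chunk]
    mirror = num_chunks - 1 - rank
    second = tokens[mirror * chunk:(mirror + 1) * chunk]
    return first + second

tokens = list(range(8))        # one padded sequence of length 8, cp_size = 2
print(cp_shard(tokens, 2, 0))  # rank 0 gets chunks 0 and 3 -> [0, 1, 6, 7]
print(cp_shard(tokens, 2, 1))  # rank 1 gets chunks 1 and 2 -> [2, 3, 4, 5]
```

Because each packed sequence must divide evenly into 2 * cp_size chunks, the packing script takes +model.context_parallel_size so padding comes out right, and SFT must then run with that same CP size.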

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

@github-actions github-actions bot added the NLP label Sep 30, 2024
@tomlifu tomlifu changed the title Draft: Context Parallel SFT Support for dataset in THD format Context Parallel SFT Support for dataset in THD format Oct 1, 2024
@xrennvidia xrennvidia self-requested a review October 2, 2024 17:44
@xrennvidia (Collaborator)

Please fix DCO also.

@xrennvidia xrennvidia removed the audio label Oct 25, 2024
@switiz commented Nov 20, 2024

Could you let me know when this will be completed? I've been really looking forward to this feature. It works in pretraining, but it's strange that it doesn't work in SFT.

Contributor

beep boop 🤖: 🙏 The following files have warnings. In case you are familiar with these, please try helping us to improve the code base.


Your code was analyzed with PyLint. The following annotations have been identified:

************* Module nemo.collections.nlp.data.language_modeling.megatron.gpt_sft_dataset
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:70:0: C0301: Line too long (353/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:72:0: C0301: Line too long (173/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:73:0: C0301: Line too long (156/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:79:0: C0301: Line too long (157/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:82:0: C0301: Line too long (147/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:83:0: C0301: Line too long (178/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:84:0: C0301: Line too long (138/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:85:0: C0301: Line too long (121/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:87:0: C0301: Line too long (144/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:90:0: C0301: Line too long (247/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:165:0: C0301: Line too long (125/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:174:0: C0301: Line too long (121/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:244:0: C0301: Line too long (137/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:247:0: C0301: Line too long (133/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:272:0: C0301: Line too long (146/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:277:0: C0301: Line too long (153/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:278:0: C0301: Line too long (155/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:300:0: C0301: Line too long (127/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:389:0: C0301: Line too long (120/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:655:0: C0301: Line too long (120/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:36:0: C0115: Missing class docstring (missing-class-docstring)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:526:0: C0115: Missing class docstring (missing-class-docstring)
************* Module nemo.collections.nlp.models.language_modeling.megatron_gpt_model
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:328:0: C0301: Line too long (149/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:368:0: C0301: Line too long (136/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:414:0: C0301: Line too long (126/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:461:0: C0301: Line too long (122/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:792:0: C0301: Line too long (131/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1107:0: C0301: Line too long (146/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1128:0: C0301: Line too long (168/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1372:0: C0301: Line too long (122/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1429:0: C0301: Line too long (140/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1610:0: C0301: Line too long (132/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1611:0: C0301: Line too long (136/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1613:0: C0301: Line too long (159/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1784:0: C0301: Line too long (128/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1804:0: C0301: Line too long (140/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1812:0: C0301: Line too long (155/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1833:0: C0301: Line too long (141/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1903:0: C0301: Line too long (125/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1930:0: C0301: Line too long (134/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:141:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:154:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:180:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:199:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:246:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:284:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:288:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:300:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:308:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:470:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:473:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:702:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:706:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:788:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1105:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1178:12: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1227:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1256:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1442:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1577:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1585:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1594:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1878:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:2031:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:2038:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:2044:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:20:0: W0611: Unused fields imported from dataclasses (unused-import)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:29:0: W0611: Unused _DataFetcherWrapper imported from lightning.pytorch.loops.fetchers (unused-import)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:31:0: W0611: Unused OmegaConf imported from omegaconf (unused-import)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:73:0: W0611: Unused activation_to_func imported from nemo.collections.nlp.parts.utils_funcs (unused-import)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:82:4: W0611: Unused megatron.core imported as core (unused-import)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:83:4: W0611: Unused tensor_parallel imported from megatron.core (unused-import)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:102:4: W0611: Unused init_method_normal imported from megatron.core.utils (unused-import)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:102:4: W0611: Unused scaled_init_method_normal imported from megatron.core.utils (unused-import)
************* Module nemo.utils.sequence_packing_utils
nemo/utils/sequence_packing_utils.py:53:0: C0301: Line too long (125/119) (line-too-long)
nemo/utils/sequence_packing_utils.py:112:0: C0301: Line too long (127/119) (line-too-long)
nemo/utils/sequence_packing_utils.py:121:0: C0301: Line too long (122/119) (line-too-long)
nemo/utils/sequence_packing_utils.py:122:0: C0301: Line too long (139/119) (line-too-long)
************* Module scripts.nlp_language_modeling.prepare_packed_ft_dataset
scripts/nlp_language_modeling/prepare_packed_ft_dataset.py:206:0: C0301: Line too long (157/119) (line-too-long)
scripts/nlp_language_modeling/prepare_packed_ft_dataset.py:169:0: C0115: Missing class docstring (missing-class-docstring)
scripts/nlp_language_modeling/prepare_packed_ft_dataset.py:175:4: C0116: Missing function or method docstring (missing-function-docstring)
scripts/nlp_language_modeling/prepare_packed_ft_dataset.py:188:0: C0116: Missing function or method docstring (missing-function-docstring)

-----------------------------------
Your code has been rated at 9.46/10

Thank you for improving NeMo's documentation!
