Commit
Merge branch 'main' into huiyingl/nemo2_peftmerge
Signed-off-by: HuiyingLi <[email protected]>
HuiyingLi committed Oct 22, 2024
2 parents c8b3a40 + c20e892 commit 31e0c0a
Showing 131 changed files with 8,912 additions and 2,718 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/cherry-pick-release-commit.yml
@@ -120,7 +120,7 @@ jobs:
"type": "section",
"text": {
"type": "mrkdwn",
"text": ":alert: Cherrypick bot 🤖: Hey <@'$USERNAME'>: Cherry-pick of <'$URL'|#'$PR_ID'> failed (3-way merge impossible). Please resolve manually and create a PR.\n\ncc: <!subteam^{{ secrets.SLACK_WEBHOOK_ADMIN }}>"
"text": ":alert: Cherrypick bot 🤖: Hey <@'$USERNAME'>: Cherry-pick of <'$URL'|#'$PR_ID'> failed (3-way merge impossible). Please resolve manually and create a PR.\n\ncc: <!subteam^${{ secrets.SLACK_WEBHOOK_ADMIN }}>"
}
}
]
1,332 changes: 87 additions & 1,245 deletions .github/workflows/cicd-main.yml

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion Dockerfile.ci
@@ -53,7 +53,7 @@ RUN pip install nemo_run@git+https://github.com/NVIDIA/NeMo-Run.git@${NEMO_RUN_T
# Install NeMo requirements
ARG TE_TAG=7d576ed25266a17a7b651f2c12e8498f67e0baea
ARG MODELOPT_VERSION=0.17.0
ARG MCORE_TAG=0d89fc4c0d4394f915fffff11212d6957652337f
ARG MCORE_TAG=563d5d1726012e8077895b732d5bc81b6e975e8d

ARG APEX_TAG=810ffae374a2b9cb4b5c5e28eaeca7d7998fca0c
RUN \
134 changes: 134 additions & 0 deletions docs/source/performance/performance_long_sequence.md
@@ -0,0 +1,134 @@
# Long Sequence Performance

## LLAMA2-7B (FP8)

- The table below shows the pre-training performance of LLAMA2-7B with context parallelism (CP) and compares it against results without CP at various input sequence lengths. The detailed model-parallel configurations and the achieved performance are shown in the training results with CP. For the runs without CP, we use the most performant model- and data-parallel configuration that fits within the memory capacity of the H100 GPU system.

- Container: [NeMo24.03.01.framework](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo/tags)
- System: DGX-H100


<table>
<thead>
<tr>
<th rowspan="2" class="top-border">SeqLen (K)</th>
<th rowspan="2" class="top-border"># of GPUs</th>
<th rowspan="1" class="top-border">Without CP</th>
<th colspan="5" class="top-border">With CP</th>
<th rowspan="2" class="top-border">Speedup (with CP / without CP)</th>
</tr>
<tr>
<th>TFLOPS / GPU</th>
<th>TP</th>
<th>PP</th>
<th>DP</th>
<th>CP</th>
<th>TFLOPS / GPU</th>
</tr>
</thead>
<tbody>
<tr>
<td>4</td>
<td>4</td>
<td>768</td>
<td>1</td>
<td>1</td>
<td>4</td>
<td>1</td>
<td>768</td>
<td class="speedup">1.00</td>
</tr>
<tr>
<td>8</td>
<td>8</td>
<td>730</td>
<td>1</td>
<td>2</td>
<td>4</td>
<td>1</td>
<td>730</td>
<td class="speedup">1.00</td>
</tr>
<tr>
<td>16</td>
<td>16</td>
<td>660</td>
<td>2</td>
<td>1</td>
<td>8</td>
<td>1</td>
<td>660</td>
<td class="speedup">1.00</td>
</tr>
<tr>
<td>32</td>
<td>32</td>
<td>595</td>
<td>2</td>
<td>1</td>
<td>8</td>
<td>2</td>
<td>610</td>
<td class="speedup">1.03</td>
</tr>
<tr>
<td>64</td>
<td>64</td>
<td>534</td>
<td>4</td>
<td>1</td>
<td>8</td>
<td>2</td>
<td>574</td>
<td class="speedup">1.07</td>
</tr>
<tr>
<td>128</td>
<td>128</td>
<td>424</td>
<td>4</td>
<td>1</td>
<td>8</td>
<td>4</td>
<td>555</td>
<td class="speedup">1.31</td>
</tr>
<tr>
<td>256</td>
<td>256</td>
<td>392</td>
<td>4</td>
<td>1</td>
<td>8</td>
<td>8</td>
<td>549</td>
<td class="speedup">1.40</td>
</tr>
<tr>
<td>512</td>
<td>512</td>
<td>104</td>
<td>8</td>
<td>1</td>
<td>4</td>
<td>16</td>
<td>549</td>
<td class="speedup">5.28</td>
</tr>
<tr>
<td>1024</td>
<td>1024</td>
<td>26.5</td>
<td>8</td>
<td>1</td>
<td>4</td>
<td>32</td>
<td>536</td>
<td class="speedup">20.23</td>
</tr>
</tbody>
</table>
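To make the table concrete, here is a minimal sketch of how a row such as the 128K-sequence configuration (TP=4, PP=1, DP=8, CP=4 on 128 GPUs) might be expressed with NeMo's Megatron strategy. The recipe entry point and exact field paths are assumptions for illustration, not taken from this commit.

```python
from nemo.collections import llm

# A sketch, assuming a llama2_7b pretraining recipe is available:
# reproduce the 128K-sequence row (TP=4, PP=1, CP=4 on 128 GPUs).
recipe = llm.llama2_7b.pretrain_recipe(num_nodes=16, num_gpus_per_node=8)
recipe.trainer.strategy.tensor_model_parallel_size = 4
recipe.trainer.strategy.pipeline_model_parallel_size = 1
recipe.trainer.strategy.context_parallel_size = 4  # DP = 128 / (TP * PP * CP) = 8
recipe.model.config.seq_length = 128 * 1024  # 128K tokens per sequence
```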


### Speedup of LLAMA2-7B training with CP over training without CP
![cp_speedup_figure](https://github.com/NVIDIA/NeMo/releases/download/r2.0.0rc1/tutorial_cp_speedup_figure.png)
4 changes: 2 additions & 2 deletions examples/audio/process_audio.py
@@ -159,8 +159,8 @@ def main(cfg: ProcessConfig) -> ProcessConfig:
audio_to_audio_model.set_trainer(trainer)
audio_to_audio_model = audio_to_audio_model.eval()

# override sampler
if cfg.sampler is not None:
# override sampler if necessary
if cfg.sampler:
logging.info('Overriding sampler with %s', cfg.sampler)

if hasattr(audio_to_audio_model, 'sampler'):
10 changes: 5 additions & 5 deletions examples/llm/pretrain/README.md
@@ -3,7 +3,7 @@
### Listing the available recipes for pretraining

```bash
nemorun llm pretrain --help
nemo llm pretrain --help
```

![recipe-listing](https://github.com/NVIDIA/NeMo/releases/download/v2.0.0rc0/list-recipes.png)
@@ -12,15 +12,15 @@ nemorun llm pretrain --help
### Run pre-training with a default recipe

```bash
nemorun llm pretrain --factory llama3_8b
nemo llm pretrain --factory llama3_8b
```

![llama3_70b](https://github.com/NVIDIA/NeMo/releases/download/v2.0.0rc0/llama3_70b.png)

We can also call the factory function with custom parameters:

```bash
nemorun llm pretrain --factory "llama3_70b(num_nodes=128)"
nemo llm pretrain --factory "llama3_70b(num_nodes=128)"
```

![llama3_70b-128-nodes](https://github.com/NVIDIA/NeMo/releases/download/v2.0.0rc0/llama3_70b_128nodes.png)
@@ -29,13 +29,13 @@ nemorun llm pretrain --factory "llama3_70b(num_nodes=128)"
The CLI allows you to overwrite any parameter. For example, to run the recipe with 2000 steps:

```bash
nemorun llm pretrain --factory llama3_70b trainer.max_steps=2000
nemo llm pretrain --factory llama3_70b trainer.max_steps=2000
```

The CLI syntax mirrors the Python code, which is convenient, but in some cases you may want to inspect and edit a recipe interactively. An easy way to do this from the CLI is to use the `--repl` flag.

```bash
nemorun llm pretrain --factory llama3_70b --repl
nemo llm pretrain --factory llama3_70b --repl
```

![repl](https://github.com/NVIDIA/NeMo/releases/download/v2.0.0rc0/repl.gif)
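For comparison, the same flow can be driven from Python. Here is a minimal sketch; the recipe module path and executor are assumptions based on other NeMo 2.0 examples, not part of this commit:

```python
import nemo_run as run
from nemo.collections import llm

# Mirror "nemo llm pretrain --factory llama3_70b trainer.max_steps=2000".
recipe = llm.llama3_70b.pretrain_recipe(num_nodes=128, num_gpus_per_node=8)
recipe.trainer.max_steps = 2000  # same override as on the CLI
run.run(recipe, executor=run.LocalExecutor())
```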
2 changes: 2 additions & 0 deletions examples/nlp/duplex_text_normalization/README.md
@@ -0,0 +1,2 @@
> [!IMPORTANT]
> This section is no longer supported in NeMo and is scheduled for removal in the 23.11 release.
10 changes: 6 additions & 4 deletions examples/nlp/language_modeling/megatron_ckpt_to_nemo.py
@@ -21,7 +21,9 @@
--checkpoint_name <checkpoint_name> \
--nemo_file_path <path_to_output_nemo_file> \
--tensor_model_parallel_size <tensor_model_parallel_size> \
--pipeline_model_parallel_size <pipeline_model_parallel_size>
--pipeline_model_parallel_size <pipeline_model_parallel_size> \
--gpus_per_node <gpus_per_node> \
--model_type <model_type>
"""

import dis
@@ -100,7 +102,7 @@ def get_args():
default="gpt",
choices=["gpt", "sft", "t5", "bert", "nmt", "bart", "retro"],
)
parser.add_argument("--local_rank", type=int, required=False, default=os.getenv('LOCAL_RANK', -1))
parser.add_argument("--local-rank", type=int, required=False, default=os.getenv('LOCAL_RANK', -1))
parser.add_argument("--bcp", action="store_true", help="Whether on BCP platform")
parser.add_argument(
"--precision",
@@ -134,15 +136,15 @@ def convert(local_rank, rank, world_size, args):
'accelerator': 'gpu',
'precision': args.precision,
},
'model': {'native_amp_init_scale': 2 ** 32, 'native_amp_growth_interval': 1000, 'hysteresis': 2},
'model': {'native_amp_init_scale': 2**32, 'native_amp_growth_interval': 1000, 'hysteresis': 2},
}
cfg = OmegaConf.create(cfg)

scaler = None
# If FP16 create a GradScaler as the build_model_parallel_config of MegatronBaseModel expects it
if cfg.trainer.precision == '16-mixed':
scaler = GradScaler(
init_scale=cfg.model.get('native_amp_init_scale', 2 ** 32),
init_scale=cfg.model.get('native_amp_init_scale', 2**32),
growth_interval=cfg.model.get('native_amp_growth_interval', 1000),
hysteresis=cfg.model.get('hysteresis', 2),
)
@@ -126,6 +126,13 @@ model:
tunable_base_param_names: ["self_attention", "word_embeddings"] # TODO: regex support @adithyre

data:
chat: False # whether use chatbot data or not
chat_prompt_tokens: # special tokens for the chat prompts, a dictionary of {token_type: token}. Note that some tokenizers may merge the characters at the junction between {end_of_turn}{turn_start}, e.g. in '<im end><im start>' the '><' is sometimes merged into a single token. This is not supported; avoid such token pairs.
system_turn_start: "\x00"
turn_start: "\x11"
label_start: "\x12"
end_of_turn: "\x0A" # \0x0A is '\n'
end_of_name: "\x0A" # \0x0A is '\n'
train_ds:
# Example of how to specify paths to multiple datasets
# file_names:
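As a rough illustration (not part of this commit) of how these tokens delimit a serialized conversation, consider the following sketch; the token values mirror the config above and the formatting helper is hypothetical:

```python
# Hypothetical sketch of how the chat special tokens delimit turns.
SYSTEM_TURN_START = "\x00"
TURN_START = "\x11"
END_OF_TURN = "\x0a"  # '\n'
END_OF_NAME = "\x0a"  # '\n'

def format_turn(name: str, text: str) -> str:
    # Each turn: turn-start marker, speaker name, end-of-name, text, end-of-turn.
    return f"{TURN_START}{name}{END_OF_NAME}{text}{END_OF_TURN}"

prompt = (
    f"{SYSTEM_TURN_START}System{END_OF_NAME}You are a helpful assistant.{END_OF_TURN}"
    + format_turn("User", "Hello!")
    + format_turn("Assistant", "Hi! How can I help?")
)
```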
2 changes: 2 additions & 0 deletions examples/nlp/token_classification/README.md
@@ -0,0 +1,2 @@
> [!IMPORTANT]
> This section is no longer supported in NeMo and is scheduled for removal in the 23.11 release.
@@ -293,6 +293,13 @@ def __call__(
device: torch.device,
partial_hypotheses: Optional[List[rnnt_utils.Hypothesis]] = None,
):
if x.device.type != "cuda":
# If CUDA graphs are enabled and the "frame-looping" algorithm is requested, this class
# cannot handle non-CUDA inputs, so we pass them back to the original caller.
return self.caller._greedy_decode_blank_as_pad_loop_frames(
x=x, out_len=out_len, device=device, partial_hypotheses=partial_hypotheses
)

if partial_hypotheses is not None:
raise NotImplementedError(
"`partial_hypotheses` support is not available "
2 changes: 1 addition & 1 deletion nemo/collections/asr/parts/utils/transcribe_utils.py
@@ -314,7 +314,7 @@ def prepare_audio_data(cfg: DictConfig) -> Tuple[List[str], bool]:
with NamedTemporaryFile(mode='w', suffix='.json', delete=False) as f:
for item in read_and_maybe_sort_manifest(cfg.dataset_manifest, try_sort=cfg.presort_manifest):
audio_file = get_full_path(audio_file=item[audio_key], manifest_file=cfg.dataset_manifest)
item[audio_key] = audio_file
item['audio_filepath'] = audio_file
filepaths.append(audio_file)
f.write(json.dumps(item) + "\n")
sorted_manifest_path = f.name
11 changes: 10 additions & 1 deletion nemo/collections/common/tokenizers/huggingface/auto_tokenizer.py
@@ -13,7 +13,7 @@
# limitations under the License.

from collections import OrderedDict
from typing import Optional
from typing import List, Optional

from transformers import AutoTokenizer as AUTOTOKENIZER

@@ -43,6 +43,7 @@ def __init__(
sep_token: Optional[str] = None,
cls_token: Optional[str] = None,
unk_token: Optional[str] = None,
additional_special_tokens: Optional[List] = [],
use_fast: Optional[bool] = False,
trust_remote_code: Optional[bool] = False,
):
@@ -60,6 +61,7 @@ def __init__(
sep_token: token used for separating sequences
cls_token: class token. Usually equal to bos_token
unk_token: token to use for unknown tokens
additional_special_tokens: list of other tokens beside standard special tokens (bos, eos, pad, etc.). For example, sentinel tokens for T5 (<extra_id_0>, <extra_id_1>, etc.)
use_fast: whether to use fast HuggingFace tokenizer
"""
try:
@@ -124,10 +126,17 @@ def __init__(
elif self.tokenizer.cls_token is None and self.tokenizer.bos_token:
special_tokens_dict["cls_token"] = self.tokenizer.bos_token

# add additional special tokens (not standard special tokens such as bos, eod, sep)
if additional_special_tokens is not None:
special_tokens_dict["additional_special_tokens"] = additional_special_tokens

new_tokens_in_vocab = []
for token in [mask_token, bos_token, eos_token, pad_token, sep_token, cls_token, unk_token]:
if token is not None and token not in self.tokenizer.get_vocab():
new_tokens_in_vocab.append(token)
for token in additional_special_tokens:
if token is not None and token not in self.tokenizer.get_vocab():
new_tokens_in_vocab.append(token)

if len(new_tokens_in_vocab) > 0:
"""
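A usage sketch for the new parameter (the model name and sentinel strings are illustrative, echoing the T5 example in the docstring):

```python
from nemo.collections.common.tokenizers.huggingface.auto_tokenizer import AutoTokenizer

# Register T5-style sentinel tokens as additional special tokens.
tokenizer = AutoTokenizer(
    pretrained_model_name="t5-small",
    additional_special_tokens=[f"<extra_id_{i}>" for i in range(3)],
)
# Sentinels missing from the vocab are added and marked as special.
ids = tokenizer.text_to_ids("summarize: <extra_id_0> hello")
```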
13 changes: 13 additions & 0 deletions nemo/collections/diffusion/encoders/__init__.py
@@ -0,0 +1,13 @@
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.