# About the InstructLab Training Library

<!-- update this whenever we update this document -->
*Date Written: August 27th, 2024*

We've written this document to help the reader understand the InstructLab training library:
what the code is actually doing, and how to make changes to it as needed.

This document was originally written on August 27th, 2024, and will most likely drift out of sync
with the code over time. However, the core ideas and the philosophy behind the design decisions
should remain constant.

<!-- - very good to have a well documented codebase that we can talk about in the paper
- high level documentation that gives people a gist of what the code is doing
- highlight that the code doesn't have any abstraction at all
- want to see examples, something like a script or jupyter notebook -->

## Structure of the Training Library

The training library is very simple and exposes one primary function in its public API: `instructlab.training:run_training`.

The `run_training` function serves as the primary gateway through which users are able to
run a multitude of different training scenarios with ease. Its argument definition is as follows:

```py
from instructlab.training.config import TorchrunArgs, TrainingArgs

def run_training(torch_args: TorchrunArgs, train_args: TrainingArgs) -> None:
    # implementation here
    ...
```

Through this simple API, one can go from training a quantized LoRA on a constrained system to a vast
multi-node distributed full fine-tune by changing only a few parameters.

The parameters themselves are straightforward:

- `torch_args` contains the arguments which are passed when `torchrun` is invoked on the training loop itself. This is where you would specify how many GPUs to utilize, which rendezvous endpoint each node should connect to, etc.
- `train_args` controls the parameters of the training script itself. This is where you would specify values such as the learning rate, number of epochs to run, whether to use LoRA, quantization, CPU-offloading, etc.
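
To make this concrete, here's a minimal sketch of a single-node, multi-GPU invocation. The field names below are illustrative assumptions; the authoritative definitions live in the [config file](./src/instructlab/training/config.py).

```py
from instructlab.training import run_training
from instructlab.training.config import TorchrunArgs, TrainingArgs

# Illustrative sketch only -- field names and values are assumptions;
# consult config.py for the real TorchrunArgs/TrainingArgs definitions.
torch_args = TorchrunArgs(
    nnodes=1,                         # single machine
    nproc_per_node=4,                 # one process per GPU
    node_rank=0,
    rdzv_id=123,
    rdzv_endpoint="127.0.0.1:12345",  # only matters for multi-node runs
)

train_args = TrainingArgs(
    model_path="/path/to/model",
    data_path="/path/to/train_data.jsonl",
    data_output_dir="/dev/shm",       # the processed dataset lands here
    ckpt_output_dir="/path/to/checkpoints",
    max_seq_len=4096,
    max_batch_len=20000,
    num_epochs=1,
    effective_batch_size=128,
    learning_rate=2e-5,
    warmup_steps=25,
    save_samples=250000,
)

run_training(torch_args=torch_args, train_args=train_args)
```

Scaling the same job out to multiple nodes is then mostly a matter of raising `nnodes` and pointing every node's `rdzv_endpoint` at the primary runner.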

We've documented these arguments elsewhere, so we won't cover them here; repeating them risks misinforming the
reader with dated information. If you are interested, please see the [README.md](./README.md) or the [config file](./src/instructlab/training/config.py) where they're implemented.

## How `run_training` works

The implementation of `run_training` itself is fairly simple: the [main_ds.py](./src/instructlab/training/main_ds.py) file is actually a script that can be invoked directly with `python`. It receives a number of flags and arguments which control its parameters. By default, the script is designed to run on only a single GPU.

When `run_training` is called, it uses the values supplied by `torch_args` to invoke the `main_ds.py` script
via PyTorch's [`torchrun`](https://pytorch.org/docs/stable/elastic/run.html) CLI utility, while the `train_args` are passed as values to the training script itself. PyTorch's elastic-run utility then launches multiple copies of `main_ds.py`, each running on its own GPU.
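
Conceptually, the launch step boils down to assembling a `torchrun` command and streaming its output. The snippet below is a simplified illustration of that idea, not the library's exact code; the script flag names are placeholders.

```py
import subprocess

# Simplified illustration -- not the library's exact code.
# torch_args become torchrun flags; train_args become flags on the
# training script itself (the script flag names here are placeholders).
command = [
    "torchrun",
    "--nnodes=1",
    "--nproc_per_node=4",
    "--rdzv_endpoint=127.0.0.1:12345",
    "src/instructlab/training/main_ds.py",
    "--num_epochs=1",
]
with subprocess.Popen(command, stdout=subprocess.PIPE) as proc:
    for line in proc.stdout:  # stream the training script's output
        print(line.decode(), end="")
```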

In a multi-node scenario, the rendezvous endpoint is also used to allow multiple nodes to network with the
primary runner node. The current implementation of training uses NCCL as the distributed backend.
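
Inside each launched copy of the training script, joining that distributed group is a one-time setup call along these lines:

```py
import os

import torch
import torch.distributed as dist

# torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in each process's
# environment; init_process_group reads them through its default
# "env://" init method. NCCL handles the GPU-to-GPU communication.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
```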

### Overall structure of `run_training`

Here's a general view of what happens when `run_training` is invoked:

1. The training dataset provided through `train_args.data_path` is processed into a copy in which every sample is tokenized, and any sample whose token count exceeds `train_args.max_seq_len` is dropped. The resultant dataset is written to an intermediate location, ideally shared memory (`/dev/shm`), so that it's simple for all of the subsequent processes to load.
1. The `train_args` are processed and formatted into their corresponding CLI args for `main_ds.py`.
1. The main thread invokes the training script by calling `torchrun main_ds.py` with the `torch_args` and streams the output of the script.
1. Within the training script, we perform setup such as initializing the distributed networking and loading datasets, tokenizers, etc.
1. With the dataset loaded, we analyze the state of our system (number of GPUs, max batch length, etc.) and determine which sampler we can use (multipack vs. distributed).
1. We load the model with whatever parameters we decided on, e.g. whether to quantize with BitsAndBytes, use flash-attention, or load as a padding-free transformer.
1. Depending on whether we use a padding-free transformer or not, we define the appropriate loss function.
    - Padding-free requires us to adjust the tokens so that the last token of one example doesn't predict the first token of the next example (see the first sketch after this list).
    - The loss needs to be reduced across all of the devices for proper backpropagation.
1. Noised embeddings are also added to the model.
1. LoRA is applied and the model is wrapped as a PEFT model, if specified.
1. Enable gradient checkpointing.
1. [DeepSpeed] Select the optimizer to be either `FusedAdam` or `DeepSpeedCPUAdam`, depending on whether CPU offloading is being used.
1. Create a learning-rate scheduler with our provided warmup steps, training states, and selected optimizer.
1. Wrap the model with DeepSpeed using the configured optimizer, scheduler, and DeepSpeed options.
1. If we can resume training from a given checkpoint, do so.
1. Adjust the rate of saving samples depending on our effective batch size (EBS).
1. Begin the training loop (a sketch of its loss handling follows this list):
    1. Iterate through each batch to compute the outputs and calculate the loss.
    1. Broadcast the loss across all of the nodes running the training.
    1. Resize the loss by dividing it by the number of non-padding tokens and scaling it by the world size.
    1. Each node then backpropagates the loss throughout the model.
    1. Set a distributed barrier so all nodes finish training before continuing.
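
Two of the steps above deserve a concrete sketch. First, the padding-free label adjustment: packed batches concatenate examples into one long sequence, so an unadjusted causal-LM loss would train the last token of one example to predict the first token of the next. Masking the label at each example's first token prevents that. The tensors below are hypothetical stand-ins:

```py
import torch

# Hypothetical packed batch: two examples concatenated into one sequence.
input_ids = torch.tensor([5, 6, 7, 8, 9, 10])  # example A: [5, 6, 7]; example B: [8, 9, 10]
example_starts = torch.tensor([0, 3])          # index of each example's first token

# With Hugging Face-style causal-LM labels, a label of -100 is ignored by
# the cross-entropy loss, so masking each example's first token severs the
# cross-example prediction at the boundary.
labels = input_ids.clone()
labels[example_starts] = -100
```

Second, the loss handling inside the loop. This sketch uses plain PyTorch collectives to show the reduce-and-rescale logic; the real loop in [main_ds.py](./src/instructlab/training/main_ds.py) runs through the DeepSpeed engine and also handles gradient accumulation, logging, and checkpointing. `model`, `optimizer`, and `dataloader` are assumed to already exist on each rank:

```py
import torch.distributed as dist

world_size = dist.get_world_size()

for batch in dataloader:
    # Forward pass; assumes a Hugging Face-style model that returns .loss
    # when labels are part of the batch.
    loss = model(**batch).loss

    # Every rank contributes its non-padding token count so that all ranks
    # normalize by the same global denominator.
    num_tokens = batch["attention_mask"].sum()
    dist.all_reduce(num_tokens, op=dist.ReduceOp.SUM)

    # Dividing by the global token count and scaling by the world size
    # makes the gradient averaging work out to a per-token mean loss.
    loss = loss * world_size / num_tokens

    loss.backward()  # each rank backpropagates its rescaled loss
    optimizer.step()
    optimizer.zero_grad()

dist.barrier()  # all ranks finish before continuing
```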