Merge pull request #90 from epfLLM/mistral_docs
added mistral docs by @AleHD
martinjaggi authored Dec 3, 2023
2 parents 162a0d7 + 62de0b2 commit 1b06b12
Showing 8 changed files with 25 additions and 14 deletions.
AUTHORS (1 addition, 0 deletions)

@@ -11,5 +11,6 @@ Kyle Matoba, Idiap Research Institute and EPFL
 Amirkeivan Mohtashami, EPFL
 Matteo Pagliardini, EPFL
 Francesco Salvi,
+Xingyao Wang

README.md (1 addition, 1 deletion)

@@ -60,7 +60,7 @@ If you use this software please cite it:
 Francesco Salvi and
 Antoine Bosselut and
 Martin Jaggi},
-title = {epfLLM Megatron-LM},
+title = {epfLLM Megatron-LLM},
 year = 2023,
 url = {https://github.com/epfLLM/Megatron-LLM}
 }

docs/guide/faq.md (3 additions, 1 deletion)

@@ -68,14 +68,16 @@ In order to launch training on multiple nodes, you will set the appropriate argu

 ## What are the basic hardware requirements?

-In this section we give a brief overview on the minimal hardware requirements we observed during our experiments.
+A brief overview on the minimal training hardware requirements we observed during our experiments.

 | Model | min VRAM | tp | pp |
 | :--------- | :------: | :-: | :-: |
 | LLaMa2-7B | 2x 80GB | 2 | 1 |
+| Mistral-7B | 4x 80GB | 4 | 1 |
 | Falcon-40B | 16x 80GB | 8 | 2 |
 | LLaMa2-70B | 32x 80GB | 8 | 4 |

+Note that you might observe different values depending on the sequence length, batch size and other configurations.

 (shard)=
 ## How to shard and merge models?

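The `tp` and `pp` columns in the table above map onto the `--tp` and `--pp` flags of `examples/finetune.sh`, whose help string appears later in this diff. As a rough illustration only, here is a hedged sketch of what the LLaMa2-70B row could look like as a launch command; the flag names come from the help string, the values from the table, and everything else (data and checkpoint paths, the multi-node rank handling, any other required arguments) is omitted or assumed:

```bash
# Hedged sketch, not a verified command: LLaMa2-70B per the FAQ table needs 32x 80GB
# GPUs with tensor parallelism 8 and pipeline parallelism 4, i.e. 4 nodes of 8 GPUs.
# Assumes the script is launched once per node with an adjusted --rank; other
# required arguments are not shown in this diff.
bash examples/finetune.sh llama2 --size=70 --tp=8 --pp=4 --gpus=8 --rank=0
```
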
docs/guide/weights_conversion.md (4 additions, 3 deletions)

@@ -7,12 +7,13 @@ Convert weights from models in other formats (primarily huggingface) to megatron
 This script supports converting Falcon, LLaMa and LLaMa 2 weights to megatron checkpoints.
 Depending on the model to convert, the inputs might differ.

-- **Falcon**:
+- **Falcon**/**Mistral**:
 Weights are automatically retrieved from the official implementation hosted in huggingface.
 Thus, the `--cache-dir` argument is optional, if specified it should point to
-the huggingface cache directory where the huggingface Falcon weights will be stored.
+the huggingface cache directory where the huggingface Falcon/Mistral weights will be stored.
 You will need to specify the `--size` argument to determine which version to download
 (i.e. Falcon 7B or 40B).
+Note that mistral only has 7B weights available.

 - **LLaMa**, **LLaMa 2** and **CodeLlama**:
 Converting llama weights can be done either fetching the weights hosted

@@ -44,7 +45,7 @@ More information about the arguments:

 ```
 positional arguments:
-{llama2,falcon,codellama,llama}
+{llama2,falcon,codellama,llama,mistral}
 options:
 -h, --help show this help message and exit

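To make the converter description above concrete, a hedged example invocation follows. The `mistral` choice and the `--size` and `--cache-dir` arguments are documented above; the value format for `--size` and the cache path are assumptions, and any output-directory argument the script may additionally require is not shown in this diff:

```bash
# Hedged sketch: fetch the Mistral weights from huggingface (only 7B is available,
# per the note above) and convert them to a megatron checkpoint.
# --cache-dir is optional and /path/to/hf_cache is a placeholder; the "7" value
# format for --size is an assumption.
python weights_conversion/hf_to_megatron.py mistral --size=7 --cache-dir=/path/to/hf_cache
```
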
docs/index.rst (5 additions, 4 deletions)

@@ -8,14 +8,14 @@ Our repository is a modification of the `original Megatron-LM codebase <https://

 Added key features include:

-- `LLaMa <https://arxiv.org/abs/2302.13971>`_, `LLaMa 2 <https://arxiv.org/abs/2307.09288>`_, `Falcon <https://huggingface.co/tiiuae>`_, and `Code Llama <https://together.ai/blog/llama-2-7b-32k>`_ support.
-- support training of large models (70B Llama 2, 65B Llama 1, 34B Code Llama, and 40B Falcon) on commodity hardware on multiple nodes
+- architectures supported: `LLaMa <https://arxiv.org/abs/2302.13971>`_, `LLaMa 2 <https://arxiv.org/abs/2307.09288>`_, `Falcon <https://huggingface.co/tiiuae>`_, `Code Llama <https://together.ai/blog/llama-2-7b-32k>`_ and `Mistral https://arxiv.org/abs/2310.06825`_.
+- support training of large models (70B Llama 2, 65B Llama 1, 34B Code Llama, 40B Falcon and Mistral) on commodity hardware on multiple nodes
 - 3-way parallelism: tensor parallel, pipeline parallel and data parallel training (inherited from Megatron)
 - full pretraining, finetuning and instruct tuning support
 - Support for special tokens & tokenizers
 - grouped-query attention (GQA) and multi-query attention (MQA)
 - Rotary Position Embeddings (RoPE), RMS layer norm, Lima dropout
-- `ROPE scaling <https://together.ai/blog/llama-2-7b-32k>`_ for longer attention context support
+- `RoPE scaling <https://together.ai/blog/llama-2-7b-32k>`_ for longer attention context support
 - FlashAttention 2
 - BF16 / FP16 training
 - WandB integration

@@ -61,6 +61,7 @@ If you use this software please cite it:
 Andreas Köpf and
 Kyle Matoba and
 Amirkeivan Mohtashami and
+Xingyao Wang and
 Olivia Simin Fan and
 Axel Marmet and
 Deniz Bayazit and

@@ -69,7 +70,7 @@ If you use this software please cite it:
 Francesco Salvi and
 Antoine Bosselut and
 Martin Jaggi},
-title = {epfLLM Megatron-LM},
+title = {epfLLM Megatron-LLM},
 year = 2023,
 url = {https://github.com/epfLLM/Megatron-LLM}
 }

examples/finetune.sh (7 additions, 2 deletions)

@@ -35,7 +35,7 @@ HELP_STR="[--rank=$RANK] [--size=$SIZE] [--tp=$TP] [--pp=$PP] [--gpus=$GPUS_PER_

 # define help function
 help () {
-    echo "Usage: $0 <gpt/llama/llama2/codellama/falcon> $HELP_STR"
+    echo "Usage: $0 <gpt/llama/llama2/codellama/falcon/mistral> $HELP_STR"
 }


@@ -159,8 +159,13 @@ elif [[ $MODEL = gpt ]]; then
     if [[ $SEQ_LEN = none ]]; then
         SEQ_LEN=2048
     fi
+elif [[ $MODEL = mistral ]]; then
+    TOKENIZER=SentencePieceTokenizer
+    if [[ $SEQ_LEN = none ]]; then
+        SEQ_LEN=8192
+    fi
 else
-    echo "Model should be either gpt, llama or falcon, not $MODEL"
+    echo "Model should be either gpt, llama, llama2, codellama, mistral, or falcon, not $MODEL"
     help
     exit 1
 fi

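Combining the new usage string with the `mistral` branch added above, a hedged example invocation might look as follows; the flags are those listed in `HELP_STR`, the parallelism values follow the FAQ table (tp=4, pp=1 on 4x 80GB GPUs), and any further required arguments omitted from this diff would still need to be supplied:

```bash
# Hedged sketch, not a verified command: fine-tune Mistral-7B. The new branch above
# selects the SentencePieceTokenizer and defaults SEQ_LEN to 8192; other required
# arguments are not shown in this diff.
bash examples/finetune.sh mistral --size=7 --tp=4 --pp=1 --gpus=4
```
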
weights_conversion/hf_to_megatron.py (3 additions, 2 deletions)

@@ -3,12 +3,13 @@
 This script supports converting Falcon, LLaMa and LLaMa 2 weights to megatron checkpoints.
 Depending on the model to convert, the inputs might differ.
-- Falcon:
+- Falcon/Mistral:
 Weights are automatically retrieved from the official implementation hosted in huggingface.
 Thus, the `--cache-dir` argument is optional, if specified it should point to
-the huggingface cache directory where the huggingface Falcon weights will be stored.
+the huggingface cache directory where the huggingface Falcon/Mistral weights will be stored.
 You will need to specify the `--size` argument to determine which version to download
 (i.e. Falcon 7B or 40B).
+Note that mistral only has 7B weights available.
 - LLaMa, LLaMa 2 and CodeLlama:
 Converting llama weights can be done either fetching the weights hosted
 in huggingface (recommended as it is the easier method) or directly from the

weights_conversion/megatron_to_hf.py (1 addition, 1 deletion)

@@ -314,7 +314,7 @@ def write_mistral_model(
     # update config
     config.vocab_size = vocab_size

-    print("Loading the checkpoint in a Llama model...")
+    print("Loading the checkpoint in a Mistral model...")
     model = MistralForCausalLM.from_pretrained(
         tmp_model_path,
         torch_dtype=torch_dtype
