⚡️ Nanotron


Pretraining models made easy

Nanotron is a library for pretraining transformer models. It provides a simple and flexible API to pretrain models on custom datasets, and it is designed to be easy to use, fast, and scalable. It is built with the following principles in mind:

  • Simplicity: an easy-to-use, flexible API for pretraining models on custom datasets.
  • Performance: optimized for speed and scalability, using the latest techniques to train models faster and more efficiently.

Installation

# Requirements: Python>=3.10
git clone https://github.com/huggingface/nanotron
cd nanotron
pip install --upgrade pip
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121
pip install -e .

# Install dependencies if you want to use the example scripts
pip install datasets transformers
pip install triton "flash-attn>=2.5.0" --no-build-isolation
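
As a quick sanity check after installation, you can verify that nanotron imports and that a CUDA-enabled PyTorch build is visible. This one-liner is just a suggestion, not part of the official setup:

# Optional: verify the install (assumes at least one visible CUDA GPU)
python -c "import torch, nanotron; print(torch.__version__, torch.cuda.is_available())"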

Note

If you get an undefined symbol: ncclCommRegister error, install torch 2.1.2 instead: pip install torch==2.1.2 --index-url https://download.pytorch.org/whl/cu121

Tip

We log to wandb automatically if it is installed (pip install wandb). If you don't want to use wandb, run wandb disabled.

Quick Start

Training a tiny Llama model

The following command will train a tiny Llama model on a single node with 8 GPUs. The model will be saved in the checkpoints directory as specified in the config file.

CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=8 run_train.py --config-file examples/config_tiny_llama.yaml
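
To go beyond a single node, the same script can be launched with torchrun's standard multi-node flags. The sketch below assumes two nodes with 8 GPUs each; the rendezvous address, port, and node rank are placeholders to adapt to your cluster, and you would also need to adjust the parallelism settings in the config so that they match the larger world size.

# Hypothetical 2-node x 8-GPU launch; run once per node with the matching --node_rank
# (--master_addr / --master_port are placeholders for your rendezvous host)
CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun \
    --nnodes=2 --node_rank=0 \
    --master_addr=10.0.0.1 --master_port=29500 \
    --nproc_per_node=8 \
    run_train.py --config-file examples/config_tiny_llama.yaml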

Run generation from your checkpoint

torchrun --nproc_per_node=1 run_generate.py --ckpt-path checkpoints/10/ --tp 1 --pp 1
# You can increase TP for faster generation, and increase PP for very large models.
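
For example, assuming the checkpoint from the quick-start run above, tensor-parallel generation across 2 GPUs could look like the sketch below; the number of launched processes is assumed to need to match tp * pp.

# Sketch: generate with TP=2 across 2 GPUs (--nproc_per_node is assumed to match tp * pp)
torchrun --nproc_per_node=2 run_generate.py --ckpt-path checkpoints/10/ --tp 2 --pp 1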

Custom examples

You can find more examples in the /examples directory:

Example             Description
custom-dataloader   Plug a custom dataloader into nanotron
datatrove           Use the datatrove library to load data
doremi              Use DoReMi to speed up training
mamba               Train an example Mamba model
moe                 Train an example Mixture-of-Experts (MoE) model
mup                 Use spectral µTransfer to scale up your model

We're working on adding more examples soon! Feel free to open a PR to add your own example. 🚀

Features

We currently support the following features:

  • 3D parallelism (DP+TP+PP); see the launch sketch after this list
  • Expert parallelism for MoEs
  • AFAB and 1F1B schedules for PP
  • Explicit APIs for TP and PP which enable easy debugging
  • ZeRO-1 optimizer
  • FP32 gradient accumulation
  • Parameter tying/sharding
  • Custom module checkpointing for large models
  • Spectral µTransfer parametrization for scaling up neural networks
  • Mamba example
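
A note on how the 3D-parallel degrees relate to the launch command: the total number of processes must equal dp * tp * pp. The sketch below assumes the degrees are set in the YAML config's parallelism section (as in the provided example configs) and only illustrates the arithmetic; the config file name is a placeholder.

# Illustration: with dp=2, tp=2, pp=2 configured in the config's parallelism
# section (assumed layout), the launch must provide 2 * 2 * 2 = 8 processes
CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=8 run_train.py --config-file your_config.yaml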

And we have on our roadmap:

  • FP8 training
  • ZeRO-3 optimizer (a.k.a. FSDP)
  • torch.compile support
  • Ring attention
  • Interleaved 1F1B schedule

Credits

We would like to thank everyone working on LLMs, especially those sharing their work openly, from which we took great inspiration: Nvidia for Megatron-LM/apex, Microsoft for DeepSpeed, HazyResearch for flash-attn.
