Changelog

NVIDIA Megatron Core 0.8.0

Multimodal
- Added initial support for training vision language models using the LLaVA architecture
- Added initial support for inference with multimodal inputs
- End-to-end multimodal example from data collection to training to evaluation is provided in examples/multimodal
MoE
- Context Parallel support.
- Distributed checkpoint support for grouped GEMM.
Mamba

NVIDIA Megatron Core 0.7.0

MoE
- Token drop support
- Several efficiency optimizations
- Improved model parallelism
- Memory optimizations
Distributed checkpointing
- Enabled for Retro
- Asynchronous checkpoint saving
Several minor bug fixes, speed improvements, and memory optimizations

NVIDIA Megatron Core 0.6.0

MoE (Mixture of Experts)
- Performance optimization
  - Communication optimization for multi GPU and Single GPU
  - 23% improvement (323 TFLOPS/GPU) over MCore 0.5.0 on Mixtral with Hopper BF16
  - GroupedMLP enhancement for Hopper
  - DP Overlapping. Support overlapping computation with gradient reduction and parameter gathering.
- All-to-All based Token Dispatcher
- Layer-wise logging for load balancing loss.
- Improved expert parallel support including distributed optimizer.
Distributed optimizer
RETRO
- Data processing
BERT
- Distributed checkpointing
Dist checkpointing
- PyTorch native distributed backend
- Improved saving/loading speed
TensorRT-LLM Export
- Integration with TensorRT Model Optimizer Post-training quantization (PTQ)
- Text generation driver to perform PTQ in Megatron-LM
- Llama2 and Nemotron3-8b examples to use TensorRT-LLM unified build API to build engine after training.
Several minor enhancements, bug fixes, and documentation updates

NVIDIA Megatron Core 0.5.0

Key Features and Enhancements

Megatron core documentation is now live!

Model Features

MoE (Mixture of Experts)
- Support for Z-loss, Load balancing and Sinkhorn
- Layer and communications refactor
- Richer parallelism mappings and EP can be combined with other model parallel techniques for larger MoE variants, e.g. EP + TP + DP + SP + PP
- Token dropless architecture with Top-K routing
- Performance optimization with with GroupedGEMM when number of local experts is > 1
- Distributed checkpointing
Interleaved rotary embedding

Datasets

Masked WordPiece datasets for BERT and T5
Raw and mock datasets

Parallelism

Performance

Activation offloading to CPU
Rope and Swiglu fusion
Sliding window attention (via Transformer Engine)

General Improvements

Timers

NVIDIA Megatron Core 0.4.0

Key Features and Enhancements

Models

BERT
RETRO
T5

Parallelism

Mixture of Experts support for GPT
Model parallel efficient Distributed Data Parallel (DDP)
Context Parallel (2D Tensor Parallel) support

Datasets

GPT Dataset
Blended Dataset