ML Systems Onboarding Reading List

This is a reading list of papers, videos, and repos I personally found useful while ramping up on ML Systems, and that I wish more people would just sit down and study carefully during their work hours. If you're looking for more recommendations, go through the citations of the papers below and enjoy!

Attention Mechanism

Attention is all you need: Start here. Still one of the best intros
Online normalizer calculation for softmax: A must-read before the Flash Attention papers. It will help you get the main "trick"
Self Attention does not need O(n^2) memory
Flash Attention 2: The diagrams here do a better job of explaining flash attention 1 as well
Llama 2 paper: Skim it for the model details
gpt-fast: A great repo to come back to for minimal yet performant code
Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation: There are tons of papers on long context lengths, but I found this to be among the clearest
Google the different kinds of attention: cosine, dot product, cross, local, sparse, convolutional
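
The "trick" mentioned for the online normalizer paper can be sketched in a few lines. This is a minimal pure-Python illustration, not code from the paper or from any Flash Attention implementation; the function names are made up for this example and inputs are assumed to be finite floats. The running sum is rescaled whenever a new maximum appears, so the max and the normalizer are computed in a single pass.

```python
import math

def online_softmax(xs):
    """Softmax whose max and normalizer are computed in one pass over xs."""
    running_max = float("-inf")
    normalizer = 0.0
    for x in xs:
        new_max = max(running_max, x)
        # Rescale the sum accumulated so far to the new max, then add the new term.
        normalizer = normalizer * math.exp(running_max - new_max) + math.exp(x - new_max)
        running_max = new_max
    return [math.exp(x - running_max) / normalizer for x in xs]

def reference_softmax(xs):
    """Standard numerically stable softmax: one pass for the max, one for the sum."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]
```

Both functions should agree up to floating point error. The same rescaling idea is what lets attention be computed block by block without materializing the full score matrix.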

Performance Optimizations

Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems: Wonderful survey, start here
Efficiently Scaling Transformer Inference: Introduced many ideas, most notably KV caches
Making Deep Learning go Brrr from First Principles: One of the best intros to fusions and overhead
Fast Inference from Transformers via Speculative Decoding: This is the paper that helped me grok the difference in performance characteristics between prefill and autoregressive decoding
Group Query Attention: KV caches can be chunky; this is how you fix it
Orca: A Distributed Serving System for Transformer-Based Generative Models: introduced continuous batching (great pre-read for the PagedAttention paper).
Efficient Memory Management for Large Language Model Serving with PagedAttention: The most crucial optimization for high-throughput batch inference
Colfax Research Blog: Excellent blog if you're interested in learning more about CUTLASS and modern GPU programming
Sarathi LLM: Introduces chunked prefill to make workloads more balanced between prefill and decode
Epilogue Visitor Tree: Fuse custom epilogues by adding more epilogues to the same class (visitor design pattern) and represent the whole epilogue as a tree
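
To make "KV caches can be chunky" concrete, here is a back-of-the-envelope sketch of KV cache size. The helper name and the model configuration are hypothetical and chosen only for illustration; the formula simply counts one key and one value vector per layer, per KV head, per token.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """Bytes needed to cache keys and values (bytes_per_elem=2 assumes FP16/BF16)."""
    # Leading factor of 2: one tensor for keys and one for values.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Hypothetical config: 32 layers, 32 query heads, head_dim 128, 4096 tokens, batch 1.
mha_bytes = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096, batch_size=1)
# Same model, but the 32 query heads share 8 KV heads (grouped-query attention).
gqa_bytes = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096, batch_size=1)

print(f"MHA: {mha_bytes / 1024**3:.2f} GiB, GQA: {gqa_bytes / 1024**3:.2f} GiB")
```

The cache scales linearly with batch size and sequence length, which is why sharing KV heads and managing cache memory carefully both matter for serving throughput.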

Quantization

A White Paper on Neural Network Quantization: Start here. This will give you the foundation to quickly skim all the other papers
LLM.int8: All of Dettmers' papers are great, but this is a natural intro
FP8 formats for deep learning: For a first-hand look at how new number formats come about
Smoothquant: Balancing rounding errors between weights and activations
Mixed precision training: The OG paper describing mixed precision training strategies for half precision
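
As a warm-up for the papers above, here is a minimal sketch of symmetric absmax INT8 quantization in pure Python. It assumes a single per-tensor scale and is not the method of any one paper in this list (LLM.int8, for example, adds vector-wise scales and outlier handling on top); the function names are illustrative.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using one scale for the whole tensor."""
    absmax = max(abs(v) for v in values)
    if absmax == 0.0:
        return [0] * len(values), 1.0
    scale = absmax / 127.0
    # Round to the nearest integer and clamp to the symmetric INT8 range.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the quantized integers."""
    return [q * scale for q in quantized]

weights = [0.1, -0.5, 0.25, 1.0]
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The rounding error per element is bounded by half the scale, so a single large outlier inflates the error for every other value. That observation motivates much of the outlier-aware work in this section.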

Long context length

RoFormer: Enhanced Transformer with Rotary Position Embedding: The paper that introduced rotary positional embeddings
YaRN: Efficient Context Window Extension of Large Language Models: Extend base model context lengths with finetuning
Ring Attention with Blockwise Transformers for Near-Infinite Context: Scale to infinite context lengths as long as you can stack more GPUs
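
The core idea behind rotary position embeddings can be sketched without any framework. This is a minimal illustration assuming the adjacent-pair formulation and a base of 10000; some implementations pair dimensions differently, and the function names here are made up for the example.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each adjacent pair of dimensions by a position-dependent angle."""
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even number of dimensions"
    out = []
    for i in range(0, d, 2):
        # Pair i/2 rotates with frequency base^(-i/d); lower pairs rotate faster.
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out.extend([x0 * c - x1 * s, x0 * s + x1 * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]
# Both pairs of positions have the same offset (2), so the scores should match.
score_near = dot(rope(q, 5), rope(k, 3))
score_far = dot(rope(q, 12), rope(k, 10))
```

Because the attention score depends only on the relative offset between positions, context extension methods can work by rescaling the rotation frequencies.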

Sparsity

Venom: Vectorized N:M Format for sparse tensor cores when hardware only supports 2:4
Megablocks: Efficient Sparse training with mixture of experts
ReLu Strikes Back: Really enjoyed this paper as an example of doing model surgery for more efficient inference
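
For readers new to the 2:4 pattern mentioned above, here is a minimal magnitude-based pruning sketch in pure Python. It assumes the simplest possible policy (keep the two largest-magnitude values in every group of four) and says nothing about how Venom or real sparse tensor core kernels store the result; the function name is illustrative.

```python
def prune_2_of_4(row):
    """Zero out the two smallest-magnitude values in each group of four."""
    assert len(row) % 4 == 0, "row length must be a multiple of 4"
    out = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        out.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return out

pruned = prune_2_of_4([0.1, -0.9, 0.3, 0.05, 1.0, 2.0, -3.0, 0.5])
```

Every group of four ends up with at most two nonzeros, which is the structure the hardware can exploit.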

Distributed

Singularity: Shows how to make jobs preemptible, migratable and elastic
Local SGD: So hot right now
OpenDiloco: Asynchronous training for decentralized training
torchtitan: Minimal repository showing how to implement 4D parallelism in pure PyTorch
pipedream: The pipeline parallel paper
jit checkpointing: a very clever alternative to periodic checkpointing
Reducing Activation Recomputation in Large Transformer Models: The paper that introduced selective activation checkpointing and goes over activation recomputation strategies
Breaking the computation and communication abstraction barrier: God tier paper that goes over research at the intersection of distributed computing and compilers to maximize comms overlap
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models: The ZeRO algorithm behind FSDP and DeepSpeed, which intelligently reduces memory usage for data parallelism.
Megatron-LM: For an introduction to Tensor Parallelism
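
The memory savings from ZeRO are easy to estimate by hand. The sketch below follows the accounting used in the ZeRO paper for mixed-precision Adam (2 bytes per parameter for FP16 weights, 2 for FP16 gradients, 12 for FP32 optimizer state) and ignores activations and buffers; the function name is made up for this example.

```python
def zero_bytes_per_gpu(num_params, num_gpus, stage):
    """Approximate model-state bytes per GPU for ZeRO stages 0 (plain DP) to 3."""
    params = 2 * num_params      # FP16 weights
    grads = 2 * num_params       # FP16 gradients
    optimizer = 12 * num_params  # FP32 weights, momentum, and variance for Adam
    if stage >= 1:
        optimizer /= num_gpus    # stage 1 shards optimizer state
    if stage >= 2:
        grads /= num_gpus        # stage 2 also shards gradients
    if stage >= 3:
        params /= num_gpus       # stage 3 also shards parameters
    return params + grads + optimizer

# 7.5B parameters on 64 GPUs.
for stage in range(4):
    gb = zero_bytes_per_gpu(7_500_000_000, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

Stage 3 divides all model state by the number of GPUs, at the cost of extra communication to gather parameters when they are needed.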
