Figure: A chronological overview of representative Mixture-of-Experts (MoE) models released in recent years. The timeline is organized primarily by model release date. MoE models above the arrow are open-source; those below it are closed-source. Models from different domains are color-coded: Natural Language Processing (NLP) in green, Computer Vision in yellow, Multimodal in pink, and Recommender Systems (RecSys) in cyan.
Important
A curated collection of papers and resources on Mixture of Experts with Large Language Models.
Please refer to our survey "A Survey on Mixture of Experts" for the detailed contents.
Please let us know if you discover any mistakes or have suggestions by emailing us at [email protected].
If you find our survey beneficial for your research, please consider citing the following paper:
@article{cai2024survey,
  title={A Survey on Mixture of Experts},
  author={Cai, Weilin and Jiang, Juyong and Wang, Fan and Tang, Jing and Kim, Sunghun and Huang, Jiayi},
  journal={arXiv preprint arXiv:2407.06204},
  year={2024}
}
Table of Contents
- Taxonomy
- Paper List (Organized Chronologically and Categorically)
- Contributors
- Star History
- M4oE: A Foundation Model for Medical Multimodal Image Segmentation with Mixture of Experts, [MICCAI 2024], 2024-5-15
- MoH: Multi-Head Attention as Mixture-of-Head Attention, [ArXiv 2024], 2024-10-15
- Mixture of A Million Experts, [ArXiv 2024], 2024-7-4
- Flextron: Many-in-One Flexible Large Language Model, [ICML 2024], 2024-6-11
- Demystifying the Compression of Mixture-of-Experts Through a Unified Framework, [ArXiv 2024], 2024-6-4
- Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models, [ArXiv 2024], 2024-6-3
- Yuan 2.0-M32: Mixture of Experts with Attention Router, [ArXiv 2024], 2024-5-28
- MoGU: A Framework for Enhancing Safety of Open-Sourced LLMs While Preserving Their Usability, [ArXiv 2024], 2024-5-23
- Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models, [ArXiv 2024], 2024-5-23
- Unchosen Experts Can Contribute Too: Unleashing MoE Models' Power by Self-Contrast, [ArXiv 2024], 2024-5-23
- MeteoRA: Multiple-tasks Embedded LoRA for Large Language Models, [ArXiv 2024], 2024-5-19
- Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts, [ArXiv 2024], 2024-5-18
- Optimizing Distributed ML Communication with Fused Computation-Collective Operations, [ArXiv 2023], 2023-5-11
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model, [ArXiv 2024], 2024-5-7
- Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training, [ArXiv 2024], 2024-5-6
- Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping, [ArXiv 2024], 2024-4-30
- M3oE: Multi-Domain Multi-Task Mixture-of-Experts Recommendation Framework, [SIGIR 2024], 2024-4-29
- Multi-Head Mixture-of-Experts, [ArXiv 2024], 2024-4-23
- ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling, [EuroSys 2024], 2024-4-22
- MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts, [ArXiv 2024], 2024-4-22
- Intuition-aware Mixture-of-Rank-1-Experts for Parameter Efficient Finetuning, [ArXiv 2024], 2024-4-13
- JetMoE: Reaching Llama2 Performance with 0.1M Dollars, [ArXiv 2024], 2024-4-11
- Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models, [ArXiv 2024], 2024-4-8
- Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts, [ArXiv 2024], 2024-4-7
- Mixture-of-Depths: Dynamically allocating compute in transformer-based language models, [ArXiv 2024], 2024-4-2
- MTLoRA: A Low-Rank Adaptation Approach for Efficient Multi-Task Learning, [ArXiv 2024], 2024-3-29
- Jamba: A Hybrid Transformer-Mamba Language Model, [ArXiv 2024], 2024-3-28
- Scattered Mixture-of-Experts Implementation, [ArXiv 2024], 2024-3-13
- Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM, [ArXiv 2024], 2024-3-12
- HyperMoE: Towards Better Mixture of Experts via Transferring Among Experts, [ACL 2024], 2024-2-20
- Higher Layers Need More LoRA Experts, [ArXiv 2024], 2024-2-13
- FuseMoE: Mixture-of-Experts Transformers for Fleximodal Fusion, [ArXiv 2024], 2024-2-5
- MoE-LLaVA: Mixture of Experts for Large Vision-Language Models, [ArXiv 2024], 2024-1-29
- OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models, [ArXiv 2024], 2024-1-29
- LLaVA-MoLE: Sparse Mixture of LoRA Experts for Mitigating Data Conflicts in Instruction Finetuning MLLMs, [ArXiv 2024], 2024-1-29
- Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference, [ArXiv 2024], 2024-1-16
- MoLE: Mixture of LoRA Experts, [ICLR 2024], 2024-1-16
- DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models, [ArXiv 2024], 2024-1-11
- LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training, [GitHub 2023], 2023-12
- Mixture of Cluster-conditional LoRA Experts for Vision-language Instruction Tuning, [ArXiv 2023], 2023-12-19
- LoRAMoE: Alleviate World Knowledge Forgetting in Large Language Models via MoE-Style Plugin, [ArXiv 2023], 2023-12-15
- Mixtral of Experts, [ArXiv 2024], 2023-12-11
- Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts, [ArXiv 2023], 2023-12-1
- HOMOE: A Memory-Based and Composition-Aware Framework for Zero-Shot Learning with Hopfield Network and Soft Mixture of Experts, [ArXiv 2023], 2023-11-23
- SiRA: Sparse Mixture of Low Rank Adaptation, [ArXiv 2023], 2023-11-15
- When MOE Meets LLMs: Parameter Efficient Fine-tuning for Multi-task Medical Applications, [SIGIR 2024], 2023-10-21
- Unlocking Emergent Modularity in Large Language Models, [NAACL 2024], 2023-10-17
- Merging Experts into One: Improving Computational Efficiency of Mixture of Experts, [EMNLP 2023], 2023-10-15
- Sparse Universal Transformer, [EMNLP 2023], 2023-10-11
- SMoP: Towards Efficient and Effective Prompt Tuning with Sparse Mixture-of-Prompts, [EMNLP 2023], 2023-10-8
- Fusing Models with Complementary Expertise, [ICLR 2024], 2023-10-2
- Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning, [ICLR 2024], 2023-9-11
- EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models, [ArXiv 2023], 2023-8-28
- Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference, [ArXiv 2023], 2023-8-23
- Robust Mixture-of-Expert Training for Convolutional Neural Networks, [ICCV 2023], 2023-8-19
- Experts Weights Averaging: A New General Training Scheme for Vision Transformers, [ArXiv 2023], 2023-8-11
- From Sparse to Soft Mixtures of Experts, [ICLR 2024], 2023-8-2
- SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization, [USENIX ATC 2023], 2023-7-10
- Mixture-of-Domain-Adapters: Decoupling and Injecting Domain Knowledge to Pre-trained Language Models’ Memories, [ACL 2023], 2023-6-8
- ModuleFormer: Learning Modular Large Language Models from Uncurated Data, [ArXiv 2023], 2023-6-7
- Patch-level Routing in Mixture-of-Experts is Provably Sample-efficient for Convolutional Neural Networks, [ICML 2023], 2023-6-7
- Soft Merging of Experts with Adaptive Routing, [TMLR 2024], 2023-6-6
- Brainformers: Trading Simplicity for Efficiency, [ICML 2023], 2023-5-29
- Emergent Modularity in Pre-trained Transformers, [ACL 2023], 2023-5-28
- PaCE: Unified Multi-modal Dialogue Pre-training with Progressive and Compositional Experts, [ACL 2023], 2023-5-24
- Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models, [ICLR 2024], 2023-5-24
- PipeMoE: Accelerating Mixture-of-Experts through Adaptive Pipelining, [INFOCOM 2023], 2023-5-17
- MPipeMoE: Memory Efficient MoE for Pre-trained Models with Adaptive Pipeline Parallelism, [IPDPS 2023], 2023-5-15
- FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement, [Proc. ACM Manag. Data 2023], 2023-4-8
- PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing, [ArXiv 2023], 2023-3-20
- Scaling Vision-Language Models with Sparse Mixture of Experts, [EMNLP (Findings) 2023], 2023-3-13
- A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training, [ICS 2023], 2023-3-11
- Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers, [ICLR 2023], 2023-3-2
- TA-MoE: Topology-Aware Large Scale Mixture-of-Expert Training, [NIPS 2022], 2023-2-20
- PIT: Optimization of Dynamic Sparse Deep Learning Models via Permutation Invariant Transformation, [SOSP 2023], 2023-1-26
- Mod-Squad: Designing Mixture of Experts As Modular Multi-Task Learners, [CVPR 2023], 2022-12-15
- Hetu: A Highly Efficient Automatic Parallel Distributed Deep Learning System, [Sci. China Inf. Sci. 2023], 2022-12
- MegaBlocks: Efficient Sparse Training with Mixture-of-Experts, [MLSys 2023], 2022-11-29
- PAD-Net: An Efficient Framework for Dynamic Networks, [ACL 2023], 2022-11-10
- Mixture of Attention Heads: Selecting Attention Heads Per Token, [EMNLP 2022], 2022-10-11
- Sparsity-Constrained Optimal Transport, [ICLR 2023], 2022-9-30
- A Review of Sparse Expert Models in Deep Learning, [ArXiv 2022], 2022-9-4
- A Theoretical View on Sparsely Activated Networks, [NIPS 2022], 2022-8-8
- Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models, [NIPS 2022], 2022-8-5
- Towards Understanding Mixture of Experts in Deep Learning, [ArXiv 2022], 2022-8-4
- No Language Left Behind: Scaling Human-Centered Machine Translation, [ArXiv 2022], 2022-7-11
- Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs, [NIPS 2022], 2022-6-9
- Tutel: Adaptive Mixture-of-Experts at Scale, [MLSys 2023], 2022-6-7
- Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts, [NIPS 2022], 2022-6-6
- Task-Specific Expert Pruning for Sparse Mixture-of-Experts, [ArXiv 2022], 2022-6-1
- Eliciting and Understanding Cross-Task Skills with Task-Level Mixture-of-Experts, [EMNLP 2022], 2022-5-25
- AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning, [EMNLP 2022], 2022-5-24
- SE-MoE: A Scalable and Efficient Mixture-of-Experts Distributed Training and Inference System, [ArXiv 2022], 2022-5-20
- On the Representation Collapse of Sparse Mixture of Experts, [NIPS 2022], 2022-4-20
- Residual Mixture of Experts, [ArXiv 2022], 2022-4-20
- StableMoE: Stable Routing Strategy for Mixture of Experts, [ACL 2022], 2022-4-18
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation, [NAACL 2022], 2022-4-15
- BaGuaLu: Targeting Brain Scale Pretrained Models with over 37 Million Cores, [PPoPP 2022], 2022-3-28
- FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models, [PPoPP 2022], 2022-3-28
- HetuMoE: An Efficient Trillion-scale Mixture-of-Expert Distributed Training System, [ArXiv 2022], 2022-3-28
- Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained Language Models, [COLING 2022], 2022-3-2
- Mixture-of-Experts with Expert Choice Routing, [NIPS 2022], 2022-2-18
- ST-MoE: Designing Stable and Transferable Sparse Expert Models, [ArXiv 2022], 2022-2-17
- Unified Scaling Laws for Routed Language Models, [ICML 2022], 2022-2-2
- One Student Knows All Experts Know: From Sparse to Dense, [ArXiv 2022], 2022-1-26
- DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale, [ICML 2022], 2022-1-14
- EvoMoE: An Evolutional Mixture-of-Experts Training Framework via Dense-To-Sparse Gate, [ArXiv 2021], 2021-12-29
- Efficient Large Scale Language Modeling with Mixtures of Experts, [EMNLP 2022], 2021-12-20
- GLaM: Efficient Scaling of Language Models with Mixture-of-Experts, [ICML 2022], 2021-12-13
- DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning, [NIPS 2021], 2021-12-6
- Tricks for Training Sparse Translation Models, [NAACL 2022], 2021-10-15
- Taming Sparsely Activated Transformer with Stochastic Experts, [ICLR 2022], 2021-10-8
- MoEfication: Transformer Feed-forward Layers are Mixtures of Experts, [ACL 2022], 2021-10-5
- Beyond Distillation: Task-Level Mixture-of-Experts for Efficient Inference, [EMNLP 2021], 2021-9-24
- Scalable and Efficient MoE Training for Multitask Multilingual Models, [ArXiv 2021], 2021-9-22
- DEMix Layers: Disentangling Domains for Modular Language Modeling, [NAACL 2022], 2021-8-11
- Go Wider Instead of Deeper, [AAAI 2022], 2021-7-25
- Scaling Vision with Sparse Mixture of Experts, [NIPS 2021], 2021-6-10
- Hash Layers For Large Sparse Models, [NIPS 2021], 2021-6-8
- M6-T: Exploring Sparse Expert Models and Beyond, [ArXiv 2021], 2021-5-31
- BASE Layers: Simplifying Training of Large, Sparse Models, [ICML 2021], 2021-5-30
- FastMoE: A Fast Mixture-of-Expert Training System, [ArXiv 2021], 2021-5-21
- CPM-2: Large-scale Cost-effective Pre-trained Language Models, [AI Open 2021], 2021-1-20
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, [ArXiv 2022], 2021-1-11
- Beyond English-Centric Multilingual Machine Translation, [JMLR 2021], 2020-10-21
- GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, [ICLR 2021], 2020-6-30
- Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts, [KDD 2018], 2018-7-19
- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, [ICLR 2017], 2017-1-23
This repository is actively maintained, and we welcome your contributions! If you have any questions about this list of resources, please feel free to contact me at [email protected].