- Introducing NVIDIA DGX A100: the Universal AI System for Enterprise S21702
- Under the Hood of the new DGX A100 System Architecture S21884
- NVIDIA Selene: Leadership-Class AI Supercomputing Infrastructure S31844
- Scheduling, Resource-Managing, and Monitoring Selene, a Supercomputer for Large-Scale DL and HPC S31700
- Accelerating AI at-scale with Selene DGXA100 SuperPOD and Parallel Filesystem * Storage S31522
- Advanced containerized workloads in HPC environment: the Selene example S31704
Hot Chips
- Hot Chips Tutorial - Scale Out Training Experiences – Megatron Language Model (YouTube)
- Part I: Scale Out Systems
- DGX A100 SuperPOD, Michael Houston, NVIDIA
- Google TPU Pod, Sameer Kumar and Dehao Chen, Google
- Cerebras System, Natalia Vassilieva, Cerebras
- Part II: Scale Out Training Experiences
- Megatron Language Model, Mohammad Shoeybi, NVIDIA
- Distributed Parameter Server for Massive Recommender System, Weijie Zhao, Baidu
- GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, Zhifeng Chen, Google
- Hot Chips Session - NVIDIA’s A100 GPU: Performance and Innovation for GPU Computing
Distributed HPC Applications with Unprivileged Containers