NVIDIA DGX Servers and Supercomputers

Introducing NVIDIA DGX A100: the Universal AI System for Enterprise S21702
Under the Hood of the new DGX A100 System Architecture S21884
NVIDIA Selene: Leadership-Class AI Supercomputing Infrastructure S31844
Scheduling, Resource-Managing, and Monitoring Selene, a Supercomputer for Large-Scale DL and HPC S31700
Accelerating AI at-scale with Selene DGXA100 SuperPOD and Parallel Filesystem Storage S31522
Advanced containerized workloads in HPC environment: the Selene example S31704

Hot Chips

Hot Chips Tutorial - Scale Out Training Experiences - Megatron Language Model (YouTube)
- Part I: Scale Out Systems
  - DGX A100 SuperPOD, Michael Houston, NVIDIA
  - Google TPU Pod, Sameer Kumar and Dehao Chen, Google
  - Cerebras System, Natalia Vassilieva, Cerebras
- Part II: Scale Out Training Experiences
  - Megatron Language Model, Mohammad Shoeybi, NVIDIA
  - Distributed Parameter Server for Massive Recommender System, Weijie Zhao, Baidu
  - GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, Zhifeng Chen, Google
Hot Chips Session - NVIDIA’s A100 GPU: Performance and Innovation for GPU Computing

Distributed HPC Applications with Unprivileged Containers
