Schedule

Module 1 - Introduction

Module 2 - Systems implications of the Transformer architecture

Module 3 - Hardware Infrastructure for Machine Learning

Sep 8
Multi-GPU servers and interconnects
GPU architecture, NVLinks, NVSwitches
Sep 10
ML-centric Datacenters
Datacenter Clos, TPU torus, and rail-optimized datacenters for ML
Sep 15
Training an LLM (hands-on activity)
Bring your laptops to class
Sep 17
Communication infrastructure
RDMA, InfiniBand (IB)

Module 4 - Distributed Training with Data Parallelism

Module 5 - Distributed Training with Tensor/Pipeline/Sequence/Expert Parallelisms

Module 6 - Advanced Topics

Nov 10
Parameter-Efficient Fine-Tuning
Nov 12
Nov 17
LLM Agents
Nov 19
Nov 24

Module 7 - Group Mid-point Presentations

Dec 1
Dec 3
Dec 8