Schedule
Module 1 - Introduction
- Aug 25
- Introduction to the course
- Post-lecture activity: Make an account on the Perlmutter supercomputer
- Aug 27
- Introduction to transformers
- Optional reading: (1) "Attention Is All You Need" (2) "The Illustrated Transformer"
Module 2 - Systems implications of the Transformer architecture
- Sep 1
- No Class (Labor Day)
- Sep 3
- Systems implications of the Transformer architecture
Module 3 - Hardware Infrastructure for Machine Learning
- Sep 8
- Multi-GPU servers and interconnects
- GPU architecture, NVLink, and NVSwitch
- Sep 10
- ML-centric Datacenters
- Datacenter Clos topologies, TPU torus, and rail-optimized networks for ML
- Sep 15
- Training an LLM (hands-on activity)
- Bring your laptops to class
- Sep 17
- Communication infrastructure
- RDMA and InfiniBand (IB)
Module 4 - Distributed Training with Data Parallelism
- Sep 22
- Class cancelled
- Sep 24
- Introduction to distributed training
- Sep 29
- Data Parallelism with ZeRO
- Oct 6
- Data Parallelism with ZeRO-3 or FSDP
- Oct 8
- Hands-on session on supervised fine-tuning
- Oct 13
- Fall Break
- Oct 15
- Introduction to CUDA programming
Module 5 - Distributed Training with Tensor/Pipeline/Sequence/Expert Parallelisms
Module 6 - Advanced Topics
- Nov 10
- Parameter-Efficient Fine-Tuning
- Nov 12
- Nov 17
- LLM Agents
- Nov 19
- Nov 24