Cornell CS5470: Systems for Large-scale ML

This course explores the systems challenges of training and serving large-scale ML models such as GPT, LLaMA, and DeepSeek. You will learn how to design and operate distributed training and inference on multi-accelerator hardware, with attention to performance, memory, communication, and fault tolerance. The emphasis is on both theory and practice: lectures are combined with hands-on programming sessions, assignments, and projects. By the end, you will have practical experience tackling the core bottlenecks of modern ML systems.

Acknowledgement: this course is supported by a NERSC Education Allocation Award.

Access to GPU Compute Resources

If you are enrolled in the class or on the waitlist, create a NERSC account by following the instructions here. Each account undergoes vetting, so please do this as soon as you can.

Course Policies

You are allowed to use generative AI tools of your choice for graded components of the class. Submissions may ask which tools you used and for the corresponding prompts.

Academic Honesty Policies

You are not allowed to share any code or text (including reports, summaries, and prompts) that you use to complete projects and assignments.

Grading Policy

  • Class Participation (10%)
  • Paper Presentations (15%)
  • Programming Assignments (30%)
  • Course Project (40%)
  • End of semester survey (5%)