
WaferLLM - Large Language Model Inference at Wafer Scale

USENIX via YouTube

Overview

Learn about WaferLLM, the first wafer-scale large language model inference system, designed to fully exploit emerging AI accelerators built with wafer-scale manufacturing technologies, in this 17-minute conference presentation from OSDI '25. Discover how researchers from the University of Edinburgh and Microsoft Research developed the PLMR model, which captures the distinctive hardware characteristics of wafer-scale architectures: hundreds of thousands of AI cores arranged in a mesh, with distributed on-chip memory and ultra-high interconnect bandwidth.

Explore the wafer-scale LLM parallelism approach that maximizes utilization across this massive number of on-chip cores, along with MeshGEMM and MeshGEMV, matrix-multiplication and matrix-vector primitives designed specifically for wafer-scale accelerators. Examine the reported performance results: up to 200× higher accelerator utilization than state-of-the-art methods, GEMV operations running 606× faster with 16× better energy efficiency than an NVIDIA A100 GPU, and full LLM inference achieving 10-20× speedups over A100 GPU clusters running SGLang and vLLM.

Understand how this open-source system addresses the limitations of current LLM inference systems, which are optimized for shared-memory architectures such as GPUs, and represents a significant step toward scaling AI inference on next-generation wafer-scale hardware.
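The core idea behind a mesh-parallel GEMV, tiling the matrix across a grid of cores so each core multiplies its tile by a slice of the vector and partial results are reduced along each mesh row, can be sketched in plain Python. This is an illustrative simulation only, not WaferLLM's actual MeshGEMV implementation; the function name, grid shape, and tiling scheme are hypothetical.

```python
def mesh_gemv(A, x, grid=(2, 2)):
    """Simulate y = A @ x partitioned across a (rows x cols) grid of cores.

    Each "core" (i, j) owns one tile of A and the matching slice of x.
    Summing the per-core partial products models the reduction that a
    mesh accelerator would perform along each row of cores.
    """
    R, C = grid
    n, m = len(A), len(x)
    row_step, col_step = n // R, m // C
    y = [0.0] * n
    for i in range(R):
        for j in range(C):
            # Core (i, j): local tile times local vector slice.
            for r in range(i * row_step, (i + 1) * row_step):
                partial = 0.0
                for c in range(j * col_step, (j + 1) * col_step):
                    partial += A[r][c] * x[c]
                # Accumulating partials mimics on-mesh row reduction.
                y[r] += partial
    return y

A = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
x = [1.0, 1.0, 1.0, 1.0]
print(mesh_gemv(A, x))  # matches the row sums of A: [6.0, 22.0, 38.0, 54.0]
```

Splitting both the rows and the columns of the matrix is what distinguishes this layout from the row-only sharding common on GPU clusters: it keeps every tile small enough to fit in a core's local memory, at the cost of an extra reduction step.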

Syllabus

OSDI '25 - WaferLLM: Large Language Model Inference at Wafer Scale

Taught by

USENIX

