Overview
Learn about WaferLLM, the first wafer-scale large language model inference system, designed to fully exploit emerging AI accelerators built with wafer-scale manufacturing technologies, in this 17-minute conference presentation from OSDI '25.

Discover how researchers from the University of Edinburgh and Microsoft Research developed a novel PLMR model that captures the unique hardware characteristics of wafer-scale architectures: hundreds of thousands of AI cores arranged in mesh configurations with distributed on-chip memory and ultra-high bandwidth.

Explore the wafer-scale LLM parallelism approach that optimizes utilization across massive numbers of on-chip cores, along with the MeshGEMM and MeshGEMV implementations designed specifically for wafer-scale accelerators.

Examine the performance results, which show up to 200× higher accelerator utilization than state-of-the-art methods, GEMV operations running 606× faster and 16× more energy-efficiently than on NVIDIA A100 GPUs, and full LLM inference achieving 10-20× speedups over A100 GPU clusters running SGLang and vLLM.

Understand how this open-source system addresses the limitations of current LLM inference systems, which are optimized for shared-memory architectures like GPUs, and represents a significant advancement in scaling AI inference on next-generation wafer-scale hardware.
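To make the mesh-parallel GEMV idea concrete, here is a minimal illustrative sketch (not the WaferLLM implementation, whose actual algorithms are described in the talk and paper): the matrix is tiled across a hypothetical R×C grid of cores, each core computes a local partial product on its tile, and partial sums are reduced along mesh rows. All names and the mesh layout are assumptions for illustration only.

```python
# Illustrative sketch of a mesh-partitioned GEMV y = A @ x.
# An R x C grid of "cores" each owns one tile of A and the matching
# slice of x (mimicking distributed on-chip memory); partial results
# are then reduced along each mesh row (modeled here as an in-place +=).
# This is NOT WaferLLM's MeshGEMV, just a conceptual analogue.

def mesh_gemv(A, x, mesh_rows, mesh_cols):
    n = len(A)  # A is n x n; assume n divisible by both mesh dims
    tile_r, tile_c = n // mesh_rows, n // mesh_cols
    y = [0.0] * n
    for r in range(mesh_rows):          # mesh row index
        for c in range(mesh_cols):      # mesh column index
            # Core (r, c) multiplies its tile of A by its slice of x.
            for i in range(r * tile_r, (r + 1) * tile_r):
                acc = 0.0
                for j in range(c * tile_c, (c + 1) * tile_c):
                    acc += A[i][j] * x[j]
                y[i] += acc             # row-wise reduction of partials
    return y
```

On real wafer-scale hardware the per-core loops run concurrently and the reduction is a communication step over the mesh interconnect; the sequential loops here only show how the work and data are partitioned.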
Syllabus
OSDI '25 - WaferLLM: Large Language Model Inference at Wafer Scale
Taught by
USENIX