Overview
This conference talk by Intel Fellow Mohan J Kumar and Principal Engineer Chuan Song introduces MESA, an online platform for evaluating LLM inference performance across different hardware configurations. Discover how the tool addresses the growing market need for AI inference performance prediction by letting users evaluate a range of models on hardware from multiple vendors. Learn how MESA breaks down inference latency by operation type (GEMM, MatMul) and by phase (prefill and autoregressive decode), presenting detailed graphs for performance analysis. The presentation demonstrates, through an intuitive web UI, how adjusting the context length affects inference latency. The speakers also discuss their plans to contribute the tool to open source through the Open Compute Project, aiming to bring transparency to, and foster growth in, the critical area of AI inference performance projection.
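To make the phase and latency breakdown concrete, here is a minimal roofline-style sketch of how a projection tool of this kind might estimate per-phase latency from hardware specs and model size. The formulas, model size, and hardware numbers below are illustrative assumptions, not MESA's actual methodology.

```python
# Roofline-style latency sketch: each phase is bounded by whichever is
# slower, compute (FLOPs / peak throughput) or memory (bytes / bandwidth).
# All constants here are hypothetical, chosen only for illustration.

def phase_latency(flops, bytes_moved, peak_tflops, bw_gbs):
    """Return estimated latency in seconds for one inference phase."""
    compute_s = flops / (peak_tflops * 1e12)   # compute-bound time
    memory_s = bytes_moved / (bw_gbs * 1e9)    # memory-bound time
    return max(compute_s, memory_s)

# Hypothetical 7B-parameter model with fp16 weights (2 bytes/parameter).
params = 7e9
ctx = 2048  # context length (number of prompt tokens)

# Prefill: all prompt tokens processed at once -> roughly
# 2 * params FLOPs per token, weights read once; usually compute-bound.
prefill_s = phase_latency(2 * params * ctx, 2 * params,
                          peak_tflops=300, bw_gbs=2000)

# Decode: one token at a time -> 2 * params FLOPs per step, but the full
# weight set is streamed every step, so it is typically memory-bound.
decode_s = phase_latency(2 * params, 2 * params,
                         peak_tflops=300, bw_gbs=2000)

print(f"prefill: {prefill_s * 1e3:.1f} ms, "
      f"per-token decode: {decode_s * 1e3:.2f} ms")
```

Under these assumed numbers, prefill latency grows with context length while per-token decode latency is dominated by memory bandwidth, which is consistent with the talk's point that context-length adjustments visibly shift the latency breakdown.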
Syllabus
LLM Inference Performance Projection
Taught by
Open Compute Project