Overview
Learn how to accelerate generative AI workloads on Arm® Neoverse™ in this 23-minute talk, which presents an end-to-end solution combining Arm's software-level AI acceleration with KleidiAI's optimizations. Discover how KleidiAI's highly optimized 4-bit weight-only kernels with dynamic activation quantization are integrated directly into PyTorch, making advanced quantization techniques available through the official PyTorch distribution. Explore the new TorchAO quantizer API, which provides a standardized way to quantize any PyTorch model, including large language models and other GenAI models. Coupled with TorchChat for LLM serving, this approach lets developers deploy resource-efficient, high-performance LLMs at scale. The presentation reports generation speeds of over 66 tokens per second on models like Llama 2 (7B), compared to 12 tokens per second without quantization, a speedup of roughly 5.5x. The quantized figure far exceeds the human reading speed of 5-7 tokens per second. This performance boost makes running GenAI models on Arm not just viable but highly competitive for cloud applications: it reduces computational cost and energy consumption while enabling real-time, interactive AI applications that can efficiently serve multiple requests in large-scale deployments.
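To make "4-bit weight-only kernels with dynamic activation quantization" more concrete, here is a minimal pure-Python sketch of the idea. It is not the KleidiAI kernel or the TorchAO API: the function names, the symmetric rounding scheme, and the group size of 32 are all assumptions chosen for illustration. Weights are quantized once to 4-bit integers with one scale per group, activations are quantized to 8-bit integers at run time from their observed range, and the dot product is accumulated in integers before being rescaled.

```python
# Illustrative sketch only: hypothetical helper names, not KleidiAI or TorchAO code.

def quantize_weights_int4(row, group_size=32):
    """Symmetric per-group 4-bit quantization (assumed scheme).

    Returns integers in [-8, 7] and one float scale per group.
    In a real deployment this step runs once, offline.
    """
    q, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        q.extend(max(-8, min(7, round(w / scale))) for w in group)
    return q, scales


def quantize_activations_int8(x):
    """Dynamic 8-bit quantization: the scale is computed from the live input."""
    max_abs = max(abs(v) for v in x)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [max(-128, min(127, round(v / scale))) for v in x], scale


def quantized_dot(x, w_row, group_size=32):
    """Approximate dot(x, w_row) using int8 activations and int4 weights."""
    qw, w_scales = quantize_weights_int4(w_row, group_size)
    qx, x_scale = quantize_activations_int8(x)
    total = 0.0
    for g, w_scale in enumerate(w_scales):
        start = g * group_size
        # Integer multiply-accumulate within one weight group.
        acc = sum(a * b for a, b in zip(qx[start:start + group_size],
                                        qw[start:start + group_size]))
        # Rescale the integer accumulator back to floating point.
        total += acc * w_scale * x_scale
    return total


if __name__ == "__main__":
    x = [127.0, -64.0, 0.0, 1.0]
    w = [7.0, -3.0, 0.0, 1.0]
    print(quantized_dot(x, w, group_size=4))
```

The example inputs are chosen so that both scales equal 1.0 and the quantized result matches the exact dot product; with arbitrary floating-point data the result is an approximation whose error depends on the group size and value distribution. Optimized kernels gain their speed by packing two 4-bit weights per byte and using hardware integer dot-product instructions, neither of which this sketch attempts.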
Syllabus
LIS25 117 Supercharging Generative AI: KleidiAI, PyTorch and Arm Neoverse
Taught by
LinaroOrg