Supercharging Generative AI with PyTorch and Arm Neoverse

Learn how to accelerate Generative AI workloads on Arm® Neoverse™ in this 23-minute talk, which presents an end-to-end solution combining Arm's software-level AI acceleration with KleidiAI's optimizations. Discover how KleidiAI's highly optimized 4-bit weight-only kernels with dynamic activation quantization are integrated directly into PyTorch, making advanced quantization techniques available through the official PyTorch distribution. Explore the new TorchAO quantizer API, which provides a standardized way to quantize any PyTorch model, including large language models and other GenAI models. Coupled with TorchChat for LLM serving, this approach lets developers deploy resource-efficient, high-performance LLMs at scale. The presentation demonstrates significant performance improvements: generation speeds of over 66 tokens per second on models like Llama 2 (7B), compared with 12 tokens per second for the same models without quantization, and far above a human reading speed of 5 to 7 tokens per second. This performance boost makes running GenAI models on Arm not just viable but highly competitive for cloud applications, reducing computational cost and energy consumption while enabling real-time, interactive AI applications that can efficiently serve multiple requests in large-scale deployments.
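
To make the idea of 4-bit weight-only quantization more concrete, here is a minimal pure-Python sketch of symmetric, groupwise 4-bit quantization, plus the speedup arithmetic implied by the figures quoted above. It is an illustration under stated assumptions, not the KleidiAI kernels or the TorchAO API: the function names, the group size of 32, and the symmetric rounding scheme are all hypothetical choices made for this example, and the dynamic quantization of activations is not shown.

```python
# Illustrative sketch only. Names, group size, and rounding scheme are
# assumptions for this example; they are not taken from KleidiAI or TorchAO.

def quantize_int4_groupwise(weights, group_size=32):
    """Quantize a flat list of floats to signed 4-bit codes, one scale per group."""
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Signed 4-bit integers span [-8, 7]; map the largest magnitude to 7.
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        # Round to the nearest code and clamp into the 4-bit range.
        codes.extend(max(-8, min(7, round(w / scale))) for w in group)
    return codes, scales


def dequantize_int4_groupwise(codes, scales, group_size=32):
    """Reconstruct approximate float weights from codes and per-group scales."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


def speedup(quantized_tps, baseline_tps):
    """Ratio of quantized to baseline generation speed (tokens per second)."""
    return quantized_tps / baseline_tps


if __name__ == "__main__":
    w = [7.0, -3.0, 0.0, 1.0]
    codes, scales = quantize_int4_groupwise(w, group_size=4)
    print(codes, scales)
    print(dequantize_int4_groupwise(codes, scales, group_size=4))
    # Figures quoted in the talk description: 66 vs 12 tokens per second.
    print(speedup(66, 12))
```

Each weight is stored as a 4-bit code plus a shared per-group scale, which is why the memory traffic per generated token drops sharply; the quoted 66 versus 12 tokens per second corresponds to a ratio of at least 5.5.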