

Taming LLMs on Mobile SoCs - Disaggregated NPU-GPU Inference for Generative Edge AI

EDGE AI FOUNDATION via YouTube

Overview

Explore a practical approach to running large language models efficiently on mobile devices by disaggregating inference across Apple Silicon's Neural Engine (NPU) and GPU. Learn how to split LLM workloads strategically, running prefill on the Neural Engine via Core ML and token generation on the GPU via MLX, to achieve faster time-to-first-token and consistent decode performance without relying on cloud infrastructure. Discover the benefits of on-device AI, including enhanced privacy, reduced costs, and app responsiveness that holds up under network instability.

Examine the architectural advantages of MLX for dynamic token-by-token decoding through flexible scheduling, and understand how to optimize prefill on the Neural Engine using Core ML models pre-compiled for fixed sequence-length buckets. Analyze benchmark results from Llama 3.2B and Qwen 3.6B models demonstrating stabilized time-to-first-token on the NPU alongside high decode throughput on GPU-backed MLX.

Master the concept of disaggregated inference, adapted from data-center deployments to mobile environments, and understand how prefill's compute-bound nature and decode's memory-bound character can be optimized separately on Apple Silicon's unified memory architecture. Learn about the Yetter engine's orchestration capabilities: model conversion, multi-level quantization across weights, activations, and key-value caches, and seamless coordination between the Core ML and MLX frameworks. Gain insights from a real-world deployment in a large-scale messaging application, where on-device AI agents provide background suggestions and search functionality with predictable latency. Come away with a blueprint for edge AI optimization built on bucketed prefill strategies, flexible decode operations, unified memory utilization, and proper measurement of time-to-first-token (TTFT) and time-per-output-token (TPOT) as experienced by the end application.
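The split described above is straightforward to express in code. Below is a minimal Python sketch of the orchestration loop, assuming hypothetical `prefill_npu` and `decode_gpu_step` callables in place of the actual Core ML and MLX invocations; the Yetter engine itself is not public, so the names, bucket sizes, and signatures here are purely illustrative.

```python
import time
from typing import Callable, List, Sequence

# Pre-compiled sequence-length buckets for the NPU prefill path.
# Hypothetical values; a real deployment would match the buckets
# to the Core ML models it actually compiled.
PREFILL_BUCKETS = [128, 256, 512, 1024]

def pick_bucket(prompt_len: int, buckets: Sequence[int]) -> int:
    """Smallest pre-compiled bucket that fits the prompt (pad up to it)."""
    for b in buckets:
        if prompt_len <= b:
            return b
    raise ValueError(f"prompt of {prompt_len} tokens exceeds largest bucket")

def generate(prompt_tokens: List[int],
             prefill_npu: Callable[[List[int]], object],
             decode_gpu_step: Callable[[object], int],
             max_new_tokens: int,
             eos_id: int) -> dict:
    """Disaggregated generation loop:
    - prefill runs once on the Neural Engine (compute-bound, bucketed),
    - decode runs token by token on the GPU (memory-bound, flexible shapes).
    `prefill_npu` and `decode_gpu_step` stand in for the Core ML and MLX
    calls; on Apple Silicon the KV cache (`state`) lives in unified memory,
    so no copy is needed when handing off between the two processors.
    """
    bucket = pick_bucket(len(prompt_tokens), PREFILL_BUCKETS)
    padded = prompt_tokens + [0] * (bucket - len(prompt_tokens))

    t0 = time.perf_counter()
    state = prefill_npu(padded)           # Core ML / Neural Engine prefill
    out = [decode_gpu_step(state)]        # first generated token
    ttft = time.perf_counter() - t0       # time-to-first-token

    t1 = time.perf_counter()
    while len(out) < max_new_tokens and out[-1] != eos_id:
        out.append(decode_gpu_step(state))  # MLX / GPU decode step
    # Time-per-output-token over the steady-state decode phase.
    tpot = (time.perf_counter() - t1) / max(len(out) - 1, 1)
    return {"tokens": out, "ttft_s": ttft, "tpot_s": tpot}
```

With stub callables in place of the real Core ML and MLX calls, the loop can be exercised end to end; note that TTFT and TPOT are measured around the full handoff, which is where the host application actually observes them.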

Syllabus

Taming LLMs on Mobile SoCs: Disaggregated NPU-GPU Inference for Generative Edge AI

Taught by

EDGE AI FOUNDATION

