Overview
Learn how Apple engineers built an enterprise-grade serverless model-hosting platform using Ray Serve and vLLM for scalable LLM inference across internal teams. Discover the design principles behind Apple's self-service platform, which abstracts operational complexity while enabling seamless model deployment and management.

Explore critical capabilities, including:

- Robust multi-tenancy for workload isolation
- Dynamic autoscaling for unpredictable traffic patterns
- Token-level budgeting and metering for usage constraints and cost transparency
- Deep request-level observability for debugging and performance tuning
- Fine-grained resource controls for optimal cluster utilization

Understand the architectural challenges faced during development and the solutions implemented to ensure reliable, efficient, and secure LLM inference in enterprise environments. Gain practical patterns for combining Ray Serve and vLLM to build production-grade model serving platforms suitable for both internal developers and external customers, along with actionable strategies for operating LLM inference at scale.
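The talk does not publish Apple's code, but the "token-level budgeting and metering" capability can be illustrated with a minimal, stdlib-only Python sketch. All names here (`TokenMeter`, `try_consume`, etc.) are hypothetical and not part of Ray Serve, vLLM, or Apple's platform:

```python
from dataclasses import dataclass


@dataclass
class TenantBudget:
    """Hypothetical per-tenant token budget for one accounting window."""
    limit: int       # max tokens allowed in the current window
    used: int = 0    # tokens consumed so far


class TokenMeter:
    """Tracks token usage per tenant and rejects requests over budget.

    A simplified sketch of token-level budgeting and metering as
    described in the talk; not Apple's implementation.
    """

    def __init__(self) -> None:
        self._budgets: dict[str, TenantBudget] = {}

    def register(self, tenant: str, limit: int) -> None:
        self._budgets[tenant] = TenantBudget(limit=limit)

    def try_consume(self, tenant: str, tokens: int) -> bool:
        """Check and record usage in one step; False means 'over budget'."""
        budget = self._budgets[tenant]
        if budget.used + tokens > budget.limit:
            return False
        budget.used += tokens
        return True

    def usage(self, tenant: str) -> tuple[int, int]:
        """Return (used, limit) for cost-transparency reporting."""
        budget = self._budgets[tenant]
        return budget.used, budget.limit


meter = TokenMeter()
meter.register("team-a", limit=1000)
print(meter.try_consume("team-a", 600))   # True: within budget
print(meter.try_consume("team-a", 500))   # False: would exceed 1000
print(meter.usage("team-a"))              # (600, 1000)
```

In a real serving path, a check like `try_consume` would run before a request is admitted to the inference engine, with the final usage reconciled after generation once the actual output token count is known.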
Syllabus
Scaling LLMs at Apple: Ray Serve + vLLM Deep Dive | Ray Summit 2025
Taught by
Anyscale