Overview
Learn how Apple engineers built an enterprise-grade serverless model-hosting platform using Ray Serve and vLLM for scalable LLM inference across internal teams. Discover the design principles behind Apple's self-service platform, which abstracts operational complexity while enabling seamless model deployment and management.

Explore critical capabilities, including:

- Robust multi-tenancy for workload isolation
- Dynamic autoscaling for unpredictable traffic patterns
- Token-level budgeting and metering for usage constraints and cost transparency
- Deep request-level observability for debugging and performance tuning
- Fine-grained resource controls for optimal cluster utilization

Understand the architectural challenges faced during development and the solutions implemented to ensure reliable, efficient, and secure LLM inference in enterprise environments. Gain practical patterns for combining Ray Serve and vLLM to build production-grade model serving platforms suitable for both internal developers and external customers, along with actionable strategies for operating LLM inference at scale.
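The talk does not publish Apple's code, but the "token-level budgeting and metering" capability can be illustrated with a minimal, stdlib-only Python sketch. All names here (`TokenMeter`, `try_consume`, etc.) are hypothetical and not part of Ray Serve, vLLM, or Apple's platform:

```python
from dataclasses import dataclass


@dataclass
class TenantBudget:
    """Hypothetical per-tenant token budget for one accounting window."""
    limit: int       # max tokens allowed in the current window
    used: int = 0    # tokens consumed so far


class TokenMeter:
    """Tracks token usage per tenant and rejects requests over budget.

    A simplified sketch of token-level budgeting and metering as
    described in the talk; not Apple's implementation.
    """

    def __init__(self) -> None:
        self._budgets: dict[str, TenantBudget] = {}

    def register(self, tenant: str, limit: int) -> None:
        self._budgets[tenant] = TenantBudget(limit=limit)

    def try_consume(self, tenant: str, tokens: int) -> bool:
        """Check and record usage in one step; False means 'over budget'."""
        budget = self._budgets[tenant]
        if budget.used + tokens > budget.limit:
            return False
        budget.used += tokens
        return True

    def usage(self, tenant: str) -> tuple[int, int]:
        """Return (used, limit) for cost-transparency reporting."""
        budget = self._budgets[tenant]
        return budget.used, budget.limit


meter = TokenMeter()
meter.register("team-a", limit=1000)
print(meter.try_consume("team-a", 600))   # True: within budget
print(meter.try_consume("team-a", 500))   # False: would exceed 1000
print(meter.usage("team-a"))              # (600, 1000)
```

In a real serving path, a check like `try_consume` would run before a request is admitted to the inference engine, with the final usage reconciled after generation once the actual output token count is known.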
Syllabus
Scaling LLMs at Apple: Ray Serve + vLLM Deep Dive | Ray Summit 2025
Taught by
Anyscale