Overview
Watch this 29-minute conference talk from Ray Summit 2025, where Anyscale engineers Abrar Sheikh and Alexander Yang present the major advancements in Ray Serve, one of the most widely adopted libraries for powering modern AI applications across industries. Discover why Ray Serve distinguishes itself from traditional online inference frameworks through its native multi-model serving capabilities, universal hardware and accelerator support, and seamless integration with any inference engine, including vLLM, TensorRT-LLM, and custom model runtimes.

Explore the most significant improvements delivered over the past year: greater flexibility for complex inference patterns through expanded APIs and routing capabilities that simplify serving multi-stage pipelines, ensemble models, agentic systems, and inference graphs; performance enhancements at scale through under-the-hood optimizations, improved scheduling, and faster data movement that let Ray Serve handle massive request volumes with lower latency and higher throughput; and new multi-cloud inference support that facilitates deploying Ray Serve clusters across multiple cloud providers, enabling hybrid inference, failover strategies, and portable deployment architectures.

See demonstrations of how Ray Serve continues evolving to meet the demands of cutting-edge AI systems, from large language models to multimodal workloads, and get insights into the framework's future roadmap. Gain practical knowledge on leveraging Ray Serve's newest capabilities to build flexible, high-performance, cloud-agnostic inference platforms at scale for your AI applications.
Syllabus
Ray Serve: Advancing scalability and flexibility | Ray Summit 2025
Taught by
Anyscale