Overview
Learn how ByteDance engineers are rethinking large-scale LLM inference infrastructure with their open-source AIBrix control plane and DeerFlow framework in this 31-minute conference talk from Ray Summit 2025. The talk surveys the infrastructure challenges facing production-grade language model deployments, where performance, scalability, and cost efficiency must be optimized simultaneously for real-world agentic systems.

Explore AIBrix's suite of LLM-focused capabilities, developed in collaboration with the vLLM community: workload-aware autoscaling for efficient resource management, KVCache management that uses multi-level caching and prefix-aware reuse to reduce memory pressure, and cache-aware load balancing that adapts traffic distribution to varying load patterns. The talk also covers dynamic LoRA orchestration and heterogeneous hardware support, both aimed at maximizing cost effectiveness across diverse cluster environments.

Finally, see how AIBrix enables advanced agentic workloads through DeerFlow, an open-source deep research framework, with real-world applications such as building a personal research assistant on open-source LLMs. The session closes with an overview of AIBrix's architecture, its performance results, and its role in enterprise-grade LLM infrastructure for AI agents that require reliable, scalable, low-latency execution.
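To make the cache-aware routing idea concrete, here is a minimal, hypothetical sketch (not AIBrix's actual implementation): requests whose prompts share a prefix with earlier traffic are routed to the worker that likely already holds that prefix's KV cache, while cache misses fall back to least-loaded selection. The class and method names are illustrative assumptions.

```python
from collections import defaultdict


class PrefixAwareRouter:
    """Illustrative sketch of cache-aware routing (hypothetical, not AIBrix code):
    route requests with a previously seen prompt prefix to the worker that
    likely holds that prefix's KV cache; otherwise pick the least-loaded worker."""

    def __init__(self, workers, prefix_len=32):
        self.workers = list(workers)
        self.prefix_len = prefix_len      # number of leading characters used as the cache key
        self.prefix_owner = {}            # prefix -> worker believed to hold its KV cache
        self.load = defaultdict(int)      # worker -> in-flight request count

    def route(self, prompt: str) -> str:
        key = prompt[: self.prefix_len]
        worker = self.prefix_owner.get(key)
        if worker is None:
            # Cache miss: balance by current load, then remember the owner.
            worker = min(self.workers, key=lambda w: self.load[w])
            self.prefix_owner[key] = worker
        self.load[worker] += 1
        return worker

    def complete(self, worker: str) -> None:
        # Call when a request finishes, so load reflects in-flight work.
        self.load[worker] -= 1


router = PrefixAwareRouter(["gpu-0", "gpu-1"])
# Two prompts sharing a system-prompt prefix land on the same worker,
# so its cached prefix KV entries can be reused.
w1 = router.route("You are a research assistant. Summarize paper A.")
w2 = router.route("You are a research assistant. Summarize paper B.")
```

A production router would key on token IDs rather than raw characters, track cache evictions, and weigh cache affinity against load imbalance; this sketch only shows the core routing decision.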
Syllabus
Agentic Workload Inference at Scale: ByteDance’s AIBrix & DeerFlow | Ray Summit 2025
Taught by
Anyscale