Overview
Learn how to build a scalable, efficient, and enterprise-ready AI platform in this 27-minute conference talk from Ray Summit 2025. Discover Cohere's approach to combining vLLM with state-of-the-art optimizations in quantization, kernel performance, and data communication to deliver low-latency, high-throughput inference at minimal compute cost.

Explore Cohere's mission of providing private, secure, and high-performance AI solutions for enterprises through an inference stack that preserves model accuracy while dramatically reducing hardware requirements, especially as context lengths grow and enterprise workloads demand predictable, low-latency behavior. Examine core innovations, including accuracy-preserving low-bit quantization techniques that cut memory footprint and compute overhead without degrading output quality; extensive kernel optimizations built on top of vLLM that accelerate attention, sampling, and IO-heavy inference operations; and high-efficiency data communication paths that reduce inter-GPU overhead and latency for large-context inference.

Understand how to engineer a serving pipeline for both low cost and high reliability in enterprise environments, and see the real-world impact through Cohere's Command A model series, which can be served on a single H100 GPU while supporting context lengths beyond 128K tokens with latency suitable for production applications such as retrieval-augmented generation, agentic workflows, and enterprise assistants. Gain detailed insights into combining quantization, kernel engineering, and vLLM-based optimization to deliver secure, cost-efficient, production-grade LLM inference at scale.
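To make the serving setup described above concrete, the following is a minimal sketch using vLLM's public Python API with FP8 quantization and a long context window. It is an illustration under assumed settings, not Cohere's actual CoServe stack or the exact configuration shown in the talk; the model identifier and all parameter values are hypothetical placeholders.

```python
# Minimal sketch (not Cohere's CoServe stack): serving a quantized model with
# vLLM's public Python API, illustrating how quantization plus a vLLM-based
# serving path targets long-context, low-latency inference on limited hardware.
from vllm import LLM, SamplingParams

llm = LLM(
    model="CohereLabs/c4ai-command-a-03-2025",  # hypothetical example checkpoint; substitute your own
    quantization="fp8",           # low-bit quantization to cut memory footprint and compute
    max_model_len=131072,         # long-context serving (128K-token class workloads)
    tensor_parallel_size=1,       # single-GPU deployment target
    gpu_memory_utilization=0.90,  # leave headroom for KV-cache and activation spikes
)

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Summarize the key ideas of low-bit quantization."], params)
print(outputs[0].outputs[0].text)
```

In practice, the quantization method, parallelism degree, and memory settings would be tuned per model and GPU; the talk covers Cohere's own kernel- and communication-level optimizations layered on top of this kind of vLLM serving path.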
Syllabus
CoServe: Max Performance, Minimal Compute | Ray Summit 2025
Taught by
Anyscale