Overview
Explore RDMA-based peer-to-peer communication strategies for modern Large Language Model systems in this 31-minute conference talk from Ray Summit 2025. Learn how current LLM systems rely on collective communication through torch.distributed and NCCL APIs, which create unnecessary constraints on peer-to-peer data movement as models scale and Mixture-of-Experts architectures become more prevalent, and how inter-node communication becomes a critical bottleneck as workloads diversify across training and inference boundaries.

Examine essential RDMA primitives and the design of Perplexity's new communication library API through three high-impact use cases: KvCache transfer for disaggregated inference, weight transfer between training and inference nodes during reinforcement learning rollouts, and MoE dispatch-combine all-to-all kernels.

Gain insights into alternative communication strategies that can unlock new efficiencies in next-generation LLM and MoE systems, moving beyond traditional SPMD-based approaches to enable more flexible and efficient data movement patterns.
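To make the contrast concrete, the sketch below models RDMA-style one-sided operations in plain Python. This is purely conceptual and not Perplexity's library or the ibverbs API: the `Peer`, `MemoryRegion`, `rdma_write`, and `rdma_read` names are hypothetical. It illustrates the key semantic difference from collective or two-sided send/recv communication — the initiator writes directly into a remote peer's registered buffer without the target posting a matching receive.

```python
# Conceptual sketch only (hypothetical names, not a real RDMA binding):
# RDMA one-sided semantics modeled in plain Python. A peer registers a
# memory region; a remote peer then writes into it directly, without
# the target participating in the transfer.

class MemoryRegion:
    """A registered buffer that remote peers may access directly."""
    def __init__(self, size):
        self.buf = bytearray(size)

class Peer:
    def __init__(self, name):
        self.name = name
        self.regions = {}  # rkey -> MemoryRegion

    def register(self, rkey, size):
        """Register a buffer and hand out its remote key."""
        self.regions[rkey] = MemoryRegion(size)
        return rkey

    def rdma_write(self, target, rkey, offset, data):
        """One-sided WRITE: push bytes into the target's region."""
        region = target.regions[rkey]
        region.buf[offset:offset + len(data)] = data

    def rdma_read(self, target, rkey, offset, length):
        """One-sided READ: pull bytes from the target's region."""
        return bytes(target.regions[rkey].buf[offset:offset + length])

# Example: a "prefill" node pushes a KV-cache block to a "decode" node,
# mirroring (at a very high level) disaggregated-inference KvCache transfer.
prefill, decode = Peer("prefill"), Peer("decode")
rkey = decode.register("kv_block_0", size=16)
prefill.rdma_write(decode, rkey, offset=0, data=b"kvcache-bytes")
```

Unlike a collective (all ranks call the same operation in lockstep) or two-sided send/recv (both ends must participate), only the initiator acts here — which is what makes one-sided RDMA attractive for irregular transfers such as KvCache movement, RL weight updates, and MoE dispatch-combine.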
Syllabus
RDMA P2P Deep Dive: KvCache Transfer, Weight Updates & MoE Routing at Perplexity | Ray Summit 2025
Taught by
Anyscale