Learn how Meta scaled BPF (Berkeley Packet Filter) technology to detect RDMA (Remote Direct Memory Access) device driver bugs in real time in production environments in this 30-minute conference talk from the Linux Plumbers Conference. Discover the critical challenge Meta faced: 17% of large-scale model training jobs failed due to RDMA-related syscall errors caused by driver bugs, and these failures significantly prolonged training times for resource-intensive GPU workloads. Explore how the opacity of RDMA syscalls creates mismatched views of hardware resources between applications and the kernel, making traditional observability tools inadequate for effective troubleshooting. Understand why direct approaches like kernel call tracing proved prohibitively expensive for production use. Examine the specific optimizations and map-based systems Meta developed to efficiently track kernel state and export relevant debugging information without impacting production workload performance. See how this gives DevOps teams the visibility needed to triage RDMA-related failures in large-scale distributed training environments.