Lessons from Scaling BPF to Detect RDMA Device Driver Bugs in Real Time
Linux Plumbers Conference via YouTube
Overview
Learn how Meta scaled BPF (Berkeley Packet Filter) to detect RDMA (Remote Direct Memory Access) device driver bugs in real time in production in this 30-minute talk from the Linux Plumbers Conference. Discover the challenge Meta faced: 17% of large-scale model training jobs failed due to RDMA-related syscall errors caused by driver bugs, and these failures significantly prolonged training times for resource-intensive GPU workloads. Explore how the opacity of RDMA syscalls leaves applications and the kernel with mismatched views of hardware resources, making traditional observability tools inadequate for troubleshooting. Understand why direct approaches such as kernel call tracing proved prohibitively expensive in production. Examine the optimizations and map-based systems Meta developed to track kernel state efficiently and export relevant debugging information without degrading production workload performance, giving DevOps teams the visibility they need to triage RDMA-related failures in large-scale distributed training environments.
Syllabus
Lessons from scaling BPF to detect RDMA Device Drivers Bugs in real time - Prankur Gupta (Meta)
Taught by
Linux Plumbers Conference