Overview
Explore how to enhance AI workload observability by combining traditional GPU telemetry with packet-level network insight in this 18-minute conference presentation. Learn why conventional monitoring tools struggle with high-performance, low-latency GPU clusters, and how to correlate job scheduling, retransmissions, queue depth, and tensor-core utilization in real time (a correlation step sketched below).

Understand the challenges that emerge as AI factories move into enterprise settings, where the traffic patterns of inference workloads differ markedly from the training flows of hyperscale data centers. Examine how inference workloads are shaped by user interactions, varying query-response ratios, and KV cache management policies that demand high GPU utilization without compromising latency. Discover the importance of north-south network visibility connecting AI clusters to enterprise infrastructure, enabling precise identification of latency sources, whether in clusters, switches, or storage systems.

Master techniques for detecting microbursts that internal switch telemetry might miss and for understanding the session-level characteristics that shape AI performance. Learn how to establish performance baselines, implement auto-triggered mitigations, integrate with SRE dashboards, and continuously tune network topologies for maximum AI throughput and resource efficiency. Gain insight into proactive anomaly identification and into folding packet insights, session metrics, and AI-driven analytics into existing NetOps workflows to minimize costly AI downtime and get the most from enterprise GPU investments.
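The presentation describes correlating GPU telemetry, such as tensor-core utilization, with network signals, such as TCP retransmissions. As a minimal sketch of that correlation step, the snippet below joins two hypothetical time-series feeds on a shared timestamp and flags windows where a retransmission spike coincides with a utilization drop. The feed names, schemas, sample values, and thresholds are illustrative assumptions, not anything specified in the talk.

```python
"""Minimal sketch: correlating GPU telemetry with network counters.

Assumes two hypothetical feeds sampled on a common clock: one for
tensor-core utilization (e.g., from a DCGM-style collector) and one
for per-link TCP retransmission counts. All values are illustrative.
"""
import pandas as pd

# Hypothetical samples at 1-second resolution.
gpu = pd.DataFrame({
    "ts": pd.to_datetime(["2024-01-01 00:00:00", "2024-01-01 00:00:01",
                          "2024-01-01 00:00:02", "2024-01-01 00:00:03"]),
    "tensor_core_util": [0.92, 0.95, 0.41, 0.38],  # dip during a network stall
})
net = pd.DataFrame({
    "ts": pd.to_datetime(["2024-01-01 00:00:00", "2024-01-01 00:00:01",
                          "2024-01-01 00:00:02", "2024-01-01 00:00:03"]),
    "retransmits": [0, 1, 57, 63],  # burst of retransmissions
})

# Align the two feeds on timestamp and flag windows where a
# retransmission spike coincides with a utilization drop.
joined = pd.merge_asof(gpu.sort_values("ts"), net.sort_values("ts"), on="ts")
suspect = joined[(joined["retransmits"] > 10) &
                 (joined["tensor_core_util"] < 0.5)]
print(suspect)
```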
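Switch counters averaged over seconds can hide sub-millisecond bursts, which is why the talk stresses packet-level visibility. The sketch below shows one common way to surface such microbursts: bucket per-packet records into 100-microsecond windows and flag any window whose implied rate approaches line rate, even though the whole-capture average looks idle. The capture format, window size, link speed, and threshold are all assumptions for illustration.

```python
"""Minimal sketch: detecting microbursts from per-packet records.

Buckets (timestamp, bytes) records into short windows and flags
windows whose implied rate exceeds a share of line rate. The
1-second-scale average would never show these bursts.
"""

WINDOW_US = 100          # detection window, microseconds (assumed)
LINE_RATE_BPS = 10e9     # assumed 10 Gb/s link
BURST_THRESHOLD = 0.8    # flag windows above 80% of line rate

# Hypothetical capture: a tight burst of jumbo frames, then sparse
# background traffic. Timestamps are in microseconds.
packets = [(i * 5.0, 9000) for i in range(15)]                    # burst
packets += [(200_000.0 + i * 50_000, 1500) for i in range(5)]     # background

buckets: dict[int, int] = {}
for ts_us, size in packets:
    key = int(ts_us // WINDOW_US)
    buckets[key] = buckets.get(key, 0) + size

for bucket, total_bytes in sorted(buckets.items()):
    rate_bps = total_bytes * 8 / (WINDOW_US / 1e6)  # rate within the window
    if rate_bps > BURST_THRESHOLD * LINE_RATE_BPS:
        print(f"microburst at t={bucket * WINDOW_US}us: {rate_bps / 1e9:.1f} Gb/s")

# Contrast: the average over the whole capture looks nearly idle.
span_s = max(ts for ts, _ in packets) / 1e6
avg_bps = sum(size for _, size in packets) * 8 / span_s
print(f"capture average: {avg_bps / 1e6:.2f} Mb/s")
```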
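The overview also mentions establishing performance baselines and auto-triggered mitigations. Below is a minimal sketch of that pattern, assuming a rolling statistical baseline over a latency metric and a stub mitigation hook; in a real NetOps workflow the hook might call an automation API or open an incident in an SRE dashboard. The metric, window, sigma limit, and samples are all illustrative assumptions.

```python
"""Minimal sketch: rolling baseline with an auto-triggered mitigation.

Keeps a window of recent latency samples, derives a mean/stddev
baseline, and fires a stub mitigation when a sample deviates sharply.
"""
import statistics
from collections import deque

WINDOW = 30        # samples kept in the rolling baseline (assumed)
SIGMA_LIMIT = 3.0  # trigger when a sample exceeds mean + 3 sigma

history: deque = deque(maxlen=WINDOW)

def mitigate(sample_ms: float, baseline_ms: float) -> None:
    # Stub: a real handler might shift traffic to another path or
    # notify the team's dashboard; here it only reports the event.
    print(f"mitigation triggered: {sample_ms:.1f}ms vs baseline {baseline_ms:.1f}ms")

def observe(sample_ms: float) -> None:
    if len(history) >= 10:  # wait for enough history to form a baseline
        mean = statistics.mean(history)
        stdev = statistics.pstdev(history) or 0.001  # avoid zero stddev
        if sample_ms > mean + SIGMA_LIMIT * stdev:
            mitigate(sample_ms, mean)
    history.append(sample_ms)

# Steady traffic around 2 ms, then a sudden latency excursion.
for s in [2.0, 2.1, 1.9, 2.05, 2.0, 2.1, 1.95, 2.0, 2.1, 1.9, 2.0, 9.5]:
    observe(s)
```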
Syllabus
cPacket Observability for AI
Taught by
Tech Field Day