Overview
Learn how Cisco's comprehensive reference architectures enable scalable AI networking from 96-GPU clusters to massive 32,000-GPU deployments through this 18-minute conference presentation. Discover vendor-agnostic designs supporting Nvidia, AMD, and Intel hardware that prioritize operational simplicity and automation.

Explore how the Nexus Dashboard platform streamlines AI infrastructure management by enabling rapid AI fabric creation with routed or VXLAN EVPN options, automatic application of best-practice configurations including QoS, ECN, and PFC for lossless fabrics, and easy activation of advanced features like Dynamic Load Balancing. Master switch discovery and onboarding processes, learn to organize infrastructure into scalable units with built-in guardrails against misconfigurations, and understand unified management capabilities for AI clusters alongside traditional data center and storage fabrics.

Gain insights into critical visibility features including integration with workload managers like Slurm for AI job monitoring, correlation of network performance with GPU and NIC issues, and comprehensive analytics covering Ethernet interface drops, CRC errors, and GPU-specific metrics such as temperature, utilization, and power consumption. Examine job-specific topology generation, anomaly identification capabilities down to individual links and GPUs, and actionable insights for root-cause analysis, plus explore API integration options for multi-vendor environments and custom automation workflows within Cisco's broader AI Canvas ecosystem.
Syllabus
Cisco Reference Architectures for AI Networking with the Nexus Dashboard
Taught by
Tech Field Day