Overview
Explore how to scale observability strategies for AI workloads in this AWS re:Invent 2025 conference talk, where spiky inference patterns, large GPU fleets, and complex orchestration pipelines generate massive amounts of telemetry data. Discover why traditional cloud-native observability approaches often fail under AI workload demands, and learn practical strategies to avoid costly trade-offs between data fidelity, system performance, and operational costs. Examine how observability requirements evolve across different segments of the AI ecosystem, including GPU providers, large language model builders, AI-native platforms, and organizations adopting AI features. Gain actionable insights for ensuring system reliability, controlling observability expenses, and maintaining comprehensive visibility across diverse AI infrastructure environments. Learn from real-world examples and best practices for implementing observability solutions that can handle the unique challenges of modern AI applications and services.
Syllabus
AWS re:Invent 2025 - Scaling Observability for the AI Era: From GPUs to LLMs (AIM121)
Taught by
AWS Events