Overview
Discover essential strategies for scaling generative AI inference from research to production in this 25-minute conference talk. Learn about model-level optimization techniques, including quantization, batching, caching, and hardware-aware optimizations, that bridge the performance gap between experimental results and real-world deployment. Explore system-level practices such as redundancy, automated failover, and multi-cloud operations that strengthen infrastructure reliability and keep the service available through hardware failures, network fluctuations, and sudden traffic spikes. Gain insights into building a resilient, production-ready foundation for AI systems that handle enterprise-level demand while maintaining consistent performance.
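To make one of the model-level techniques concrete, here is a minimal sketch of symmetric int8 weight quantization, the general idea behind the quantization the talk covers. This is an illustrative example, not the speaker's actual method; the function names are ours.

```python
def quantize_int8(weights):
    # Symmetric per-tensor int8 quantization (illustrative sketch):
    # pick a scale so the largest-magnitude weight maps to 127,
    # then round every weight onto the integer grid [-127, 127].
    scale = max(abs(w) for w in weights) / 127.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize_int8(quantized, scale):
    # Recover approximate float weights; the error per weight is
    # bounded by half the scale (the rounding step size).
    return [q * scale for q in quantized]
```

Storing int8 values instead of 32-bit floats cuts weight memory roughly 4x, which is why quantization is a standard lever for closing the gap between research benchmarks and production serving cost.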
Syllabus
Scaling Inference for Generative AI by Byung-Gon Chun
Taught by
Open Data Science