Overview
Learn practical strategies for taking generative AI models from research prototypes to production-grade systems in this 16-minute conference talk. The session covers the core obstacles in GenAI inference and how to reduce latency and control costs without sacrificing model performance. Optimization techniques discussed include batching strategies, model quantization, parallelism, KV cache management, and speculative decoding. Drawing on real-world experience, the talk unpacks the trade-offs, common pitfalls, and lessons learned from scaling inference systems in production.
Syllabus
Scaling GenAI inference: Techniques, optimizations, and real-world lessons
Taught by
Weights & Biases