Architecting an AI Inference Stack

This course is designed for developers who want to build an optimized AI inference stack on Google Cloud. Whether you’re working with GPUs or TPUs, you’ll examine the fundamental components of an inference stack, learn design principles for maximizing performance and reliability, and practice techniques for taking your workloads from zero to a first working deployment.

Syllabus

Foundational concepts

Introduction: Architecting an AI inference stack (with GPUs or TPUs)
What is inference?
Differentiate between popular AI/ML frameworks and understand their roles in defining, training, and serving models
Identify the four common performance bottlenecks in AI workloads and understand how they apply to different model architectures
Compare the available orchestration options, including Kubernetes and Slurm
Exploring your orchestration options
Quiz

Inference concepts

Use vLLM to increase throughput and reduce latency when serving large AI models
Reviewing important inference concepts
Deploy scalable and reliable AI inference workloads on Google Cloud by applying principles like multi-region support and by using the GKE Inference Gateway
Reviewing the best practices for architecting an inference stack on GKE
Guided tutorial: GKE Inference Quickstart
Conclusion
Quiz