Overview
Explore the infrastructure powering next-generation AI agents in this 23-minute podcast episode, which dives deep into reinforcement learning and fine-tuning on Google's TPU architecture. Learn when to choose fine-tuning over prompt engineering, focusing on specialization, privacy, and cost considerations.

Discover the complete model lifecycle with clear breakdowns of pre-training versus post-training, including supervised fine-tuning (SFT) and reinforcement learning, illustrated through Andrej Karpathy's chemistry-textbook analogy. Understand when and why to implement reinforcement learning, its added value in model alignment and safety, and the advancements driving 2025 as the year of RL, with examples from DeepSeek-R1, Grok 4, and Gemini 3.

Examine how TPU pods and Inter-Chip Interconnect (ICI) solve critical bottlenecks in large-scale fine-tuning, addressing the challenges of infrastructure, algorithms, and orchestration in RL implementations. Watch a hands-on demonstration of MaxText 2.0 running a GRPO (Group Relative Policy Optimization) job on TPU infrastructure, showcasing practical reinforcement learning deployment.

Gain insights into scaling to 1,000+ chips and into how Google's TPU architecture delivers efficiency for complex AI workloads, with expert commentary from Google TPU Training Team Product Manager Kyle Meggs alongside hosts Shir Meir Lador and Don McCasland.
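For background on the GRPO demo mentioned above: the core idea of Group Relative Policy Optimization is to score each sampled completion against the other completions for the same prompt, rather than against a learned value function. The following is a minimal sketch of that group-relative advantage computation, not the MaxText 2.0 implementation; function and variable names are illustrative.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages, as used in GRPO-style training.

    Each completion's reward is normalized against the mean and
    standard deviation of its own group (one group per prompt).

    rewards: array of shape (num_groups, group_size).
    """
    rewards = np.asarray(rewards, dtype=float)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# Example: two prompts, four sampled completions each.
rewards = [[1.0, 0.0, 0.0, 1.0],   # mixed rewards -> informative signal
           [0.5, 0.5, 0.5, 0.5]]   # uniform rewards -> no signal
adv = grpo_advantages(rewards)
```

Completions that beat their group's mean reward receive a positive advantage and are reinforced; a group with uniform rewards yields near-zero advantages, so it contributes no gradient. This avoids training a separate critic model, which is one reason GRPO-style RL scales well across large accelerator pods.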
Syllabus
- Introduction: Gemini 3 and the rise of TPUs
- Why fine-tune? Specialization and privacy
- What is fine-tuning? SFT and RL explained
- What is RL and why do we need it?
- The added value in RL
- Industry pulse: Why 2025 is the year of RL (DeepSeek-R1, Grok 4, Gemini 3)
- The challenges of RL: Infrastructure, algorithms, and orchestration
- Factory floor: How TPUs are designed for scale
- [Demo] Reinforcement Learning GRPO with MaxText 2.0 on TPUs
- Scaling to 1000+ chips and season wrap-up
Taught by
Google Cloud Tech