

Benchmarking Google's TPUs vs Nvidia GPUs for AI Inference

Trelis Research via YouTube

Overview

This 34-minute video from Trelis Research provides a comprehensive comparison between Google's TPUs and Nvidia GPUs for AI inference workloads. Learn about the hardware specifications of H100 SXM, H200 SXM, and v6e TPUs, and understand the benchmarking methodology using vLLM and llmperf. Explore the differences between Tensor Parallel and Pipeline Parallel approaches, their respective advantages and disadvantages, and discover where to test these different accelerators. Follow along with practical demonstrations of running inference on both Nvidia GPUs and Google TPUs, and examine detailed benchmarking results that compare performance and cost-efficiency. The video also mentions upcoming content on Blackwell B200 and Amazon Trainium, and concludes with resources and workshop information. Repository access is available at Trelis.com/ADVANCED-inference.
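The Tensor Parallel vs Pipeline Parallel distinction covered in the video can be illustrated with a small sketch. This is a simplified, hypothetical model (plain Python lists standing in for weight matrices, not vLLM's actual implementation): tensor parallelism slices every layer's weights across all devices, while pipeline parallelism assigns contiguous whole layers to each device.

```python
def tensor_parallel_shards(weight, num_devices):
    """Tensor parallelism: split a layer's weight matrix column-wise so
    every device holds a vertical slice of every layer. Each forward
    pass then needs cross-device communication (an all-reduce) to
    combine the partial results."""
    cols = len(weight[0]) // num_devices
    return [[row[d * cols:(d + 1) * cols] for row in weight]
            for d in range(num_devices)]

def pipeline_parallel_stages(layers, num_devices):
    """Pipeline parallelism: assign contiguous blocks of whole layers
    to each device; activations flow device-to-device like a pipeline,
    with only point-to-point transfers between stages."""
    per_stage = len(layers) // num_devices
    return [layers[i * per_stage:(i + 1) * per_stage]
            for i in range(num_devices)]

# Toy model: 8 layers, each a 16x16 weight matrix of ones.
layers = [[[1.0] * 16 for _ in range(16)] for _ in range(8)]

# Tensor parallel over 4 devices: each device gets a 16x4 slice of a layer.
shards = tensor_parallel_shards(layers[0], 4)
print(len(shards), len(shards[0]), len(shards[0][0]))  # 4 devices, 16 rows, 4 cols

# Pipeline parallel over 4 devices: each device gets 2 whole layers.
stages = pipeline_parallel_stages(layers, 4)
print([len(stage) for stage in stages])  # [2, 2, 2, 2]
```

The trade-off the video discusses follows from this shape: tensor parallelism keeps all devices busy on every token but pays per-layer communication cost, while pipeline parallelism communicates less but can leave stages idle (pipeline bubbles).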

Syllabus

0:00 Benchmarking Google’s TPUs vs Nvidia GPUs
0:33 Video Overview
1:12 H100 SXM, H200 SXM and v6e hardware specs
4:47 Benchmarking Design with vLLM and llmperf
7:42 Price assumptions per hour
8:47 Tensor Parallel vs Pipeline Parallel
13:45 Pros and Cons of Tensor vs Pipeline Parallel
14:42 Where to test TPUs and GPUs
15:45 Future videos: Blackwell B200 and Amazon Trainium
16:15 Running inference on Nvidia GPUs
19:17 Running inference on Google TPUs
25:51 Running benchmarking with llmperf
28:23 Benchmarking Results: TPU vs GPU
33:21 Conclusion, Resources and Workshop
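The benchmarking step in the syllabus measures time-to-first-token and token throughput per request. As a rough illustration of that methodology (this is a toy sketch in the spirit of llmperf, not its actual CLI or API; `generate` and `fake_generate` are hypothetical stand-ins for a streaming inference endpoint):

```python
import time

def benchmark(generate, prompt, num_requests=5):
    """Toy inference benchmark: for each request, record time to the
    first streamed token (TTFT) and overall tokens per second."""
    ttfts, rates = [], []
    for _ in range(num_requests):
        start = time.perf_counter()
        first_token_time = None
        token_count = 0
        for _tok in generate(prompt):
            if first_token_time is None:
                first_token_time = time.perf_counter() - start
            token_count += 1
        elapsed = time.perf_counter() - start
        ttfts.append(first_token_time)
        rates.append(token_count / elapsed)
    return {
        "mean_ttft_s": sum(ttfts) / len(ttfts),
        "mean_tokens_per_s": sum(rates) / len(rates),
    }

# Stand-in generator that streams 20 tokens with a small fixed delay,
# simulating a model server's token stream.
def fake_generate(prompt):
    for i in range(20):
        time.sleep(0.001)
        yield f"tok{i}"

print(benchmark(fake_generate, "hello"))
```

Comparing these two numbers per accelerator, at the hourly prices assumed in the video, is what yields the cost-efficiency comparison in the results section.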

Taught by

Trelis Research
