Comprehensive GPU Benchmarking and FP8 Format Analysis for LLM Performance

Review GPU benchmarking results and optimization techniques in this 47-minute technical video, which covers testing across multiple GPU models. Explore performance metrics including time to first token (TTFT), token generation throughput, and price per million tokens. Learn why the FP8 format is beneficial, how different floating-point formats compare, and which GPU architectures support FP8. Watch a live demonstration on an RTX 3090, see how the SGLang and vLLM frameworks differ, and learn how to create and evaluate FP8 models using LLM Compressor for quantization. Accompanying resources, including slides, code repositories, and documentation, help you apply these optimization techniques in your own projects.
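The three headline metrics are related by simple arithmetic. The description does not say exactly how the video computes them, so the following is a minimal sketch under stated assumptions rather than the video's benchmarking code. It assumes a flat hourly GPU rental price, a sustained aggregate throughput, and per-token arrival timestamps. The function names are hypothetical.

```python
def price_per_million_tokens(hourly_price_usd: float, tokens_per_second: float) -> float:
    """USD cost to generate one million tokens.

    Assumption: the GPU is rented at a flat hourly price and sustains the
    given aggregate throughput for the whole hour.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_usd / tokens_per_hour * 1_000_000


def ttft_and_throughput(request_start: float, token_times: list[float]) -> tuple[float, float]:
    """Return (time to first token in seconds, decode throughput in tokens/s).

    token_times holds the wall-clock arrival time of each generated token.
    Assumed convention: throughput counts the tokens after the first one,
    divided by the time between the first and last token.
    """
    if not token_times:
        raise ValueError("need at least one token")
    ttft = token_times[0] - request_start
    if len(token_times) < 2:
        return ttft, 0.0
    decode_time = token_times[-1] - token_times[0]
    if decode_time <= 0:
        raise ValueError("token timestamps must increase")
    return ttft, (len(token_times) - 1) / decode_time


if __name__ == "__main__":
    # Hypothetical inputs, not results from the video.
    print(f"{price_per_million_tokens(0.50, 1000):.3f}")  # prints 0.139
    print(ttft_and_throughput(0.0, [0.5, 0.6, 0.7, 0.8, 0.9]))
```

For example, a hypothetical GPU rented at $0.50 per hour that sustains 1,000 tokens/s works out to about $0.139 per million tokens. These numbers are illustrative only. Doubling throughput at the same hourly price halves the cost per token, which is why throughput and price rankings can differ from TTFT rankings.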

Syllabus

- Introduction to the GPU benchmarking test
- Test methodology and parameters
- Results: Time to first token
- Results: Token generation throughput
- Results: Price per million tokens
- Analysis of results ordering and trends
- Detailed discussion topics overview
- Why the FP8 format is beneficial
- Technical explanation of FP8
- Comparison of different floating-point formats
- GPU architecture support for FP8
- How FP8 works on different GPU architectures
- Live demonstration with an RTX 3090
- SGLang vs vLLM comparison
- Caveats and limitations of benchmarking
- How to create FP8 models
- Using LLM Compressor for quantization
- Evaluating quantized models
- Conclusion and resources
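To make the floating-point format comparison concrete, here is a minimal decoder for the two 8-bit layouts defined in the OCP FP8 specification. E4M3 has 4 exponent bits, 3 mantissa bits, a bias of 7, and no infinities. E5M2 has 5 exponent bits, 2 mantissa bits, a bias of 15, and IEEE-style infinities and NaNs. This is an illustrative sketch that assumes those two encodings. `decode_fp8` and `max_finite` are hypothetical helpers, not part of LLM Compressor, vLLM, or SGLang.

```python
import math


def decode_fp8(byte: int, fmt: str = "e4m3") -> float:
    """Decode one FP8 byte (assumed OCP E4M3 or E5M2 layout) to a Python float."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte must be in [0, 255]")
    if fmt == "e4m3":
        exp_bits, man_bits, bias = 4, 3, 7
    elif fmt == "e5m2":
        exp_bits, man_bits, bias = 5, 2, 15
    else:
        raise ValueError(f"unknown format: {fmt}")

    sign = -1.0 if byte >> 7 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    exp_max = (1 << exp_bits) - 1

    # E4M3 reserves only the all-ones pattern (S.1111.111) for NaN; no infinities.
    if fmt == "e4m3" and exp == exp_max and man == (1 << man_bits) - 1:
        return math.nan
    # E5M2 follows IEEE 754: all-ones exponent means infinity or NaN.
    if fmt == "e5m2" and exp == exp_max:
        return sign * math.inf if man == 0 else math.nan
    # Subnormals: no implicit leading 1, exponent fixed at 1 - bias.
    if exp == 0:
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    # Normal numbers: implicit leading 1.
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


def max_finite(fmt: str) -> float:
    """Largest finite value representable in the given FP8 format."""
    values = (decode_fp8(b, fmt) for b in range(256))
    return max(v for v in values if math.isfinite(v))


if __name__ == "__main__":
    print(max_finite("e4m3"))  # prints 448.0
    print(max_finite("e5m2"))  # prints 57344.0
```

Running `max_finite` shows the trade-off between the two layouts. E4M3 tops out at 448.0 but keeps an extra mantissa bit of precision. E5M2 reaches 57344.0 with coarser steps between neighboring values.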