Overview
Learn to set up and run comprehensive LLM inference performance benchmarks on NVIDIA GPUs using a fully open-source toolchain in this hands-on tutorial from DevConf.US 2025. The session walks through the entire benchmarking pipeline:

- GPU foundation: enable RPM Fusion, install the akmod-nvidia driver, and validate the hardware with nvidia-smi.
- Containerized GPU access: configure Podman 5.x with the NVIDIA Container Toolkit's Container Device Interface (CDI) for secure rootless operation.
- Inference serving: deploy the lightweight vLLM engine with locally cached Hugging Face models, exposing an OpenAI-compatible HTTP endpoint for standardized API access.
- Load generation: use GuideLLM to sweep request rates automatically, capture detailed latency distributions, measure throughput ceilings, and collect token-per-second statistics as structured JSON for analysis.
- Troubleshooting: live demonstrations highlight common configuration pitfalls and provide actionable checklists applicable across Red Hat-derived distributions.

The talk also explains how to scale these benchmarks to larger language models and multi-GPU configurations, and how architectural decisions affect measurement accuracy. Ready-to-use scripts, configuration templates, and resource links enable immediate implementation regardless of prior experience with containers, CUDA programming, or performance benchmarking.
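The GPU foundation step described above can be sketched as the following commands, assuming a Fedora system. The repository URLs and package names follow RPM Fusion's published install instructions; verify them against your distribution and release before running.

```shell
# Enable the RPM Fusion free and nonfree repositories
# (URLs per RPM Fusion's documented layout; $(rpm -E %fedora) expands
# to the running Fedora release number)
sudo dnf install \
  https://mirrors.rpmfusion.org/free/fedora/rpmfusion-free-release-$(rpm -E %fedora).noarch.rpm \
  https://mirrors.rpmfusion.org/nonfree/fedora/rpmfusion-nonfree-release-$(rpm -E %fedora).noarch.rpm

# Install the akmod-packaged NVIDIA driver plus the CUDA userspace libraries
sudo dnf install akmod-nvidia xorg-x11-drv-nvidia-cuda

# After the kernel module finishes building (this can take a few minutes),
# validate the install: nvidia-smi should report the GPU, driver version,
# and supported CUDA version
nvidia-smi
```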
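The rootless container setup might look roughly like this. The `nvidia-ctk cdi generate` command and the `--device nvidia.com/gpu=all` CDI syntax come from the NVIDIA Container Toolkit documentation; the container images and model name are illustrative choices, not taken from the talk.

```shell
# Generate a CDI specification describing the installed GPU so that
# rootless Podman can pass the device through
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# Sanity check: the GPU should be visible from inside a rootless container
podman run --rm --device nvidia.com/gpu=all \
  docker.io/nvidia/cuda:12.4.1-base-ubi9 nvidia-smi

# Serve a locally cached Hugging Face model with vLLM's OpenAI-compatible
# HTTP API on port 8000 (model choice is an illustrative assumption)
podman run --rm --device nvidia.com/gpu=all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  docker.io/vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-0.5B-Instruct
```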
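A GuideLLM sweep against that endpoint could then be launched along these lines. Flag names are based on GuideLLM's documented CLI, but the tool is evolving; confirm the exact options with `guidellm benchmark --help` for the version you install.

```shell
# Install GuideLLM into the current Python environment
pip install guidellm

# Sweep request rates against the local vLLM endpoint, collecting latency
# and token-throughput statistics into a structured JSON file
guidellm benchmark \
  --target "http://localhost:8000" \
  --rate-type sweep \
  --max-seconds 30 \
  --output-path results.json
```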
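Once the structured JSON is on disk, post-processing is straightforward. The record layout below is a simplified, hypothetical stand-in for GuideLLM's actual output schema, used only to show the shape of the analysis (latency percentiles and aggregate tokens per second); adapt the field names to what your GuideLLM version emits.

```python
import json
import statistics

def summarize(records):
    """Compute latency percentiles and aggregate token throughput
    from a list of per-request result records."""
    latencies = sorted(r["request_latency_s"] for r in records)
    total_tokens = sum(r["output_tokens"] for r in records)
    duration = (max(r["end_time_s"] for r in records)
                - min(r["start_time_s"] for r in records))
    return {
        "p50_latency_s": statistics.median(latencies),
        # nearest-rank p95 over the sorted latencies
        "p95_latency_s": latencies[int(0.95 * (len(latencies) - 1))],
        "tokens_per_second": total_tokens / duration,
    }

# Synthetic records standing in for one sweep point's results
# (hypothetical schema, not GuideLLM's real field names)
records = [
    {"request_latency_s": 0.8 + 0.05 * i,
     "output_tokens": 128,
     "start_time_s": i * 0.5,
     "end_time_s": i * 0.5 + 0.8 + 0.05 * i}
    for i in range(20)
]

summary = summarize(records)
print(json.dumps(summary, indent=2))
```

The same reduction works unchanged whether the records come from a single run or are concatenated across an entire rate sweep.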
Syllabus
Learn How to Run an LLM Inference Performance Benchmark on NVIDIA GPUs - DevConf.US 2025
Taught by
DevConf