Overview
Learn to set up and run comprehensive LLM inference performance benchmarks on NVIDIA GPUs using a fully open-source toolchain in this hands-on tutorial from DevConf.US 2025. The session walks through the entire benchmarking pipeline:

- GPU foundation: enable RPM Fusion, install the akmod-nvidia driver, and validate the hardware with nvidia-smi.
- Containerized GPU access: configure Podman 5.x with the NVIDIA Container Toolkit's Container Device Interface (CDI) for secure rootless operation.
- Inference serving: deploy the lightweight vLLM engine with locally cached Hugging Face models, exposing an OpenAI-compatible HTTP endpoint for standardized API access.
- Load generation: use GuideLLM to sweep request rates automatically, capture detailed latency distributions, measure throughput ceilings, and collect token-per-second statistics as structured JSON for analysis.
- Troubleshooting: live demonstrations highlight common configuration pitfalls and provide actionable checklists applicable across Red Hat-derived distributions.

The talk also explains how to scale these benchmarks to larger language models and multi-GPU configurations, and how architectural decisions affect measurement accuracy. Ready-to-use scripts, configuration templates, and resource links enable immediate implementation regardless of prior experience with containers, CUDA programming, or performance benchmarking.
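The GPU foundation step described above can be sketched as the following commands, assuming a Fedora system. The repository URLs and package names follow RPM Fusion's published install instructions; verify them against your distribution and release before running.

```shell
# Enable the RPM Fusion free and nonfree repositories
# (URLs per RPM Fusion's documented layout; $(rpm -E %fedora) expands
# to the running Fedora release number)
sudo dnf install \
  https://mirrors.rpmfusion.org/free/fedora/rpmfusion-free-release-$(rpm -E %fedora).noarch.rpm \
  https://mirrors.rpmfusion.org/nonfree/fedora/rpmfusion-nonfree-release-$(rpm -E %fedora).noarch.rpm

# Install the akmod-packaged NVIDIA driver plus the CUDA userspace libraries
sudo dnf install akmod-nvidia xorg-x11-drv-nvidia-cuda

# After the kernel module finishes building (this can take a few minutes),
# validate the install: nvidia-smi should report the GPU, driver version,
# and supported CUDA version
nvidia-smi
```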
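The rootless container setup might look roughly like this. The `nvidia-ctk cdi generate` command and the `--device nvidia.com/gpu=all` CDI syntax come from the NVIDIA Container Toolkit documentation; the container images and model name are illustrative choices, not taken from the talk.

```shell
# Generate a CDI specification describing the installed GPU so that
# rootless Podman can pass the device through
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# Sanity check: the GPU should be visible from inside a rootless container
podman run --rm --device nvidia.com/gpu=all \
  docker.io/nvidia/cuda:12.4.1-base-ubi9 nvidia-smi

# Serve a locally cached Hugging Face model with vLLM's OpenAI-compatible
# HTTP API on port 8000 (model choice is an illustrative assumption)
podman run --rm --device nvidia.com/gpu=all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  docker.io/vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-0.5B-Instruct
```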
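A GuideLLM sweep against that endpoint could then be launched along these lines. Flag names are based on GuideLLM's documented CLI, but the tool is evolving; confirm the exact options with `guidellm benchmark --help` for the version you install.

```shell
# Install GuideLLM into the current Python environment
pip install guidellm

# Sweep request rates against the local vLLM endpoint, collecting latency
# and token-throughput statistics into a structured JSON file
guidellm benchmark \
  --target "http://localhost:8000" \
  --rate-type sweep \
  --max-seconds 30 \
  --output-path results.json
```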
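Once the structured JSON is on disk, post-processing is straightforward. The record layout below is a simplified, hypothetical stand-in for GuideLLM's actual output schema, used only to show the shape of the analysis (latency percentiles and aggregate tokens per second); adapt the field names to what your GuideLLM version emits.

```python
import json
import statistics

def summarize(records):
    """Compute latency percentiles and aggregate token throughput
    from a list of per-request result records."""
    latencies = sorted(r["request_latency_s"] for r in records)
    total_tokens = sum(r["output_tokens"] for r in records)
    duration = (max(r["end_time_s"] for r in records)
                - min(r["start_time_s"] for r in records))
    return {
        "p50_latency_s": statistics.median(latencies),
        # nearest-rank p95 over the sorted latencies
        "p95_latency_s": latencies[int(0.95 * (len(latencies) - 1))],
        "tokens_per_second": total_tokens / duration,
    }

# Synthetic records standing in for one sweep point's results
# (hypothetical schema, not GuideLLM's real field names)
records = [
    {"request_latency_s": 0.8 + 0.05 * i,
     "output_tokens": 128,
     "start_time_s": i * 0.5,
     "end_time_s": i * 0.5 + 0.8 + 0.05 * i}
    for i in range(20)
]

summary = summarize(records)
print(json.dumps(summary, indent=2))
```

The same reduction works unchanged whether the records come from a single run or are concatenated across an entire rate sweep.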
Syllabus
Learn How to Run an LLM Inference Performance Benchmark on NVIDIA GPUs - DevConf.US 2025
Taught by
DevConf