Overview
Explore energy optimization strategies for large language model workloads in this technical seminar, which bridges the gap between micro-level GPU operations and system-scale training efficiency. Learn about energy-to-completion as a crucial metric for evaluating LLM workloads, moving beyond the traditional "race-to-idle" approach, which may be insufficient for transformer architectures.

Discover how GPU Dynamic Voltage and Frequency Scaling (DVFS) techniques can be applied to GPT- and LLaMA-style decoder layers, revealing characteristic U-shaped energy-frequency curves that vary systematically with layer configuration and sequence length. Examine Pareto analysis results showing potential energy reductions of 10-20% with corresponding runtime trade-offs, and understand why maximum boost clocks may be energetically inefficient.

Investigate large-scale training time modeling approaches, including the impact of data, tensor, and pipeline parallelism on iteration time, along with pipeline bubbles and communication overhead. Analyze current capabilities and limitations of graph-based simulators for predicting system performance. Conclude with a discussion of open research problems in developing DVFS-aware LLM runtimes and achieving comprehensive end-to-end energy optimization for modern machine learning workloads.
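To make the energy-to-completion idea concrete, the sketch below (not from the seminar; all constants are assumed) models GPU power as static plus a roughly cubic dynamic term in clock frequency, and runtime as a compute-bound part that scales with 1/f plus a frequency-independent memory-bound part. Multiplying the two yields the U-shaped energy-frequency curve described above, with the minimum below the maximum boost clock.

```python
# Illustrative sketch: why energy-to-completion E(f) = P(f) * T(f) is
# often U-shaped in GPU clock frequency f. All constants below are
# assumptions for illustration, not measurements from the seminar.

P_STATIC = 80.0       # W, static/idle power (assumed)
C_DYN = 3.75e-26      # W/Hz^3, dynamic-power coefficient (assumed)
WORK = 2e12           # FLOPs of compute-bound work (assumed)
FLOPS_PER_CYCLE = 128.0  # FLOPs delivered per clock cycle (assumed)
T_MEM = 0.4           # s, frequency-independent memory-bound time (assumed)

def runtime(f_hz):
    """Runtime: the compute part scales with 1/f; the memory part does not."""
    return WORK / (FLOPS_PER_CYCLE * f_hz) + T_MEM

def power(f_hz):
    """Simple static + cubic dynamic power model."""
    return P_STATIC + C_DYN * f_hz ** 3

def energy_to_completion(f_hz):
    """Joules consumed to finish the workload at a fixed clock f_hz."""
    return power(f_hz) * runtime(f_hz)

if __name__ == "__main__":
    freqs = [f * 1e6 for f in range(600, 2001, 100)]  # 600 MHz .. 2.0 GHz
    best = min(freqs, key=energy_to_completion)
    for f in freqs:
        print(f"{f / 1e9:3.1f} GHz  E = {energy_to_completion(f):7.1f} J")
    print(f"energy-optimal clock: {best / 1e9:.1f} GHz, "
          f"below the 2.0 GHz boost clock")
```

With these assumed constants, the minimum falls at an intermediate clock: below it, static power accrues over a longer runtime; above it, cubic dynamic power outweighs the shrinking runtime. This is the "race-to-idle may be insufficient" argument in miniature.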
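The pipeline-parallel contribution to iteration time can be sketched with the standard fill/drain ("bubble") decomposition: with p pipeline stages and m microbatches per iteration, roughly (p - 1) extra stage-steps are spent filling and draining the pipeline. The function names and numbers below are illustrative assumptions, not the seminar's model.

```python
# Illustrative sketch of a pipeline-parallel iteration-time model using
# the classic fill/drain bubble decomposition. All parameters are assumed.

def iteration_time(t_stage, p_stages, n_microbatches, t_comm=0.0):
    """Time for one training iteration of a p-stage pipeline.

    t_stage        : time for one microbatch on one stage (fwd+bwd), seconds
    p_stages       : pipeline depth (number of stages)
    n_microbatches : microbatches per iteration
    t_comm         : per-microbatch inter-stage communication time, seconds
    """
    # n_microbatches steady-state steps plus (p - 1) fill/drain steps
    steps = n_microbatches + p_stages - 1
    return steps * (t_stage + t_comm)

def bubble_fraction(p_stages, n_microbatches):
    """Fraction of the iteration lost to pipeline fill/drain (the bubble)."""
    return (p_stages - 1) / (n_microbatches + p_stages - 1)

if __name__ == "__main__":
    t = iteration_time(t_stage=0.05, p_stages=8, n_microbatches=32,
                       t_comm=0.002)
    print(f"iteration time: {t:.3f} s, "
          f"bubble fraction: {bubble_fraction(8, 32):.1%}")
```

The bubble fraction (p - 1) / (m + p - 1) shrinks as microbatch count m grows, which is why iteration-time models must track the interplay of parallelism degree, microbatching, and communication overhead rather than any one factor alone.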
Syllabus
From Picojoules to Gigawatt-hours: Energy-to-Completion and GPU DVFS for LLM Workloads
Taught by
NHR@FAU