
Obscura - Concealing Recomputation Overhead in Training of Large Language Models with Bubble-filling Pipeline Transformation

USENIX via YouTube

Overview

Explore a 20-minute conference presentation introducing Obscura, a computationally efficient pipeline training system designed to conceal recomputation overhead in large language model training. Learn how pipeline parallelism distributes computational workloads across multiple nodes but faces memory bottlenecks at early pipeline stages, and discover how recomputation can relieve that memory pressure at the cost of extra computation. Understand the key observation that the bubbles following backward passes can hide recomputation overhead in pipeline parallelism, and examine Obscura's pipeline transformation approach for improving this concealment. Delve into the integration of swapping techniques into the pipeline and the modeling of execution time as an optimization problem to identify the best recomputation strategy. Review the partition adjustment algorithm that rebalances computation across stages under the transformation, and analyze evaluation results on Llama-2 and GPT-3 models of various sizes, demonstrating throughput improvements of up to 1.33× over widely used recomputation baselines.
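The core idea described above can be illustrated with a toy cost model (a minimal sketch with made-up uniform per-microbatch times, not the paper's actual execution-time model): if recomputation runs serially before each backward pass it lengthens the critical path, but if it overlaps with the idle bubble after the previous backward pass, only the portion that does not fit in the bubble stays exposed.

```python
# Toy per-stage cost model for hiding recomputation in pipeline bubbles.
# All names and timing values here are illustrative assumptions, not
# Obscura's real model: f = forward time, b = backward time,
# r = recomputation time, bubble = idle time after each backward pass.

def stage_time_naive(f: float, b: float, r: float, microbatches: int) -> float:
    """Recomputation runs serially before each backward pass,
    adding r to the critical path for every microbatch."""
    return microbatches * (f + r + b)

def stage_time_concealed(f: float, b: float, r: float,
                         bubble: float, microbatches: int) -> float:
    """Bubble-filling: recomputation overlaps with the bubble;
    only the part exceeding the bubble remains on the critical path."""
    exposed = max(0.0, r - bubble)
    return microbatches * (f + exposed + b)

f, b, r, bubble, m = 1.0, 2.0, 0.8, 1.0, 8
naive = stage_time_naive(f, b, r, m)
hidden = stage_time_concealed(f, b, r, bubble, m)
print(f"speedup from concealment: {naive / hidden:.2f}x")
```

With these illustrative numbers the recomputation fits entirely inside the bubble, so the concealed schedule removes the overhead from the critical path; a larger r (or smaller bubble) leaves part of it exposed, which is why a transformation that enlarges usable bubbles helps.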

Syllabus

USENIX ATC '25 - Obscura: Concealing Recomputation Overhead in Training of Large Language Models...

Taught by

USENIX

