Overview
In this 15-minute conference presentation, learn about JENGA, a novel fine-tuning system that addresses the high activation memory footprints that arise when extending LLM context windows for long-context applications. Discover how researchers from Tsinghua University and Microsoft Research tackle this critical limitation through Contextual Token Sparsity, a new token-level sparsity mechanism that minimizes redundant token involvement while preserving model accuracy. Understand the three key techniques implemented in JENGA: Token Elimination, which dynamically identifies and excludes redundant tokens across varying inputs and layers; Pattern Prediction, which uses well-trained predictors to approximate token sparsity patterns with minimal overhead; and Kernel Optimization, which employs permutation-free and segment-based strategies to improve system performance. Examine comprehensive evaluation results showing that JENGA reduces memory consumption by up to 1.93× and achieves speedups of up to 1.36× over state-of-the-art fine-tuning systems, while remaining compatible with various LLM architectures and other optimization techniques.
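To make the idea of token elimination more concrete, the sketch below shows one possible PyTorch layer in which a small learned predictor scores tokens and only the top-scoring fraction is routed through the expensive feed-forward computation, so activation memory for that block scales with the number of kept tokens rather than the full sequence length. This is a minimal illustration of the general concept, not the authors' implementation; all names (TokenEliminationLayer, keep_ratio, predictor) and the top-k scoring scheme are assumptions for demonstration purposes.

```python
import torch
import torch.nn as nn


class TokenEliminationLayer(nn.Module):
    """Illustrative sketch: a lightweight predictor scores each token, and only
    the top-k tokens are passed through the expensive feed-forward block;
    eliminated tokens are carried through unchanged. Hypothetical names and
    structure, not taken from the JENGA paper."""

    def __init__(self, hidden_dim: int, keep_ratio: float = 0.5):
        super().__init__()
        self.keep_ratio = keep_ratio
        # Small predictor approximating which tokens matter in this layer.
        self.predictor = nn.Linear(hidden_dim, 1)
        # Stand-in for the expensive per-token computation (e.g., an MLP block).
        self.ffn = nn.Sequential(
            nn.Linear(hidden_dim, 4 * hidden_dim),
            nn.GELU(),
            nn.Linear(4 * hidden_dim, hidden_dim),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim)
        batch, seq_len, dim = hidden_states.shape
        k = max(1, int(seq_len * self.keep_ratio))

        # Score tokens and keep the top-k per sequence.
        scores = self.predictor(hidden_states).squeeze(-1)       # (batch, seq_len)
        keep_idx = scores.topk(k, dim=-1).indices                # (batch, k)

        # Gather only the selected tokens, so the FFN's activation memory
        # scales with k instead of seq_len.
        gather_idx = keep_idx.unsqueeze(-1).expand(-1, -1, dim)  # (batch, k, dim)
        selected = hidden_states.gather(1, gather_idx)           # (batch, k, dim)

        # Run the expensive block only on selected tokens and scatter the
        # results back; eliminated tokens pass through unchanged.
        updated = selected + self.ffn(selected)
        output = hidden_states.clone()
        output.scatter_(1, gather_idx, updated)
        return output


if __name__ == "__main__":
    layer = TokenEliminationLayer(hidden_dim=64, keep_ratio=0.25)
    x = torch.randn(2, 128, 64)
    print(layer(x).shape)  # torch.Size([2, 128, 64])
```

Keeping the hidden states of eliminated tokens unchanged, rather than shortening the sequence, is one simple way to preserve sequence length for subsequent layers; the actual JENGA system relies on permutation-free, segment-based kernel optimizations that this sketch does not attempt to reproduce.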
Syllabus
USENIX ATC '25 - JENGA: Enhancing LLM Long-Context Fine-tuning with Contextual Token Sparsity
Taught by
USENIX