Overview
Explore aggressive optimization techniques for deploying large language models on resource-constrained devices in this 25-minute conference talk from Conf42 ML 2025. Learn how to tackle the fundamental challenge of making LLMs run efficiently on tiny devices, using GPT-2 as a case-study model. Discover theoretical optimization approaches including quantization, pruning, and knowledge distillation before diving into practical research methodology and experimental results. Understand how to combine multiple optimization methods for maximum efficiency, and explore the trade-offs between model performance and computational requirements. Master the essential strategies for bringing powerful language models to edge devices and embedded systems where memory and processing power are severely limited.
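To make the first of these techniques concrete, below is a minimal sketch of post-training affine int8 quantization — the core arithmetic behind the quantization approach the talk surveys. This is not the speaker's implementation; it is a toy pure-Python illustration, and all function names and the example weight values are invented for illustration.

```python
# Toy sketch of post-training affine int8 quantization.
# Maps float weights onto the signed 8-bit range via a scale and zero point,
# then dequantizes to show the (bounded) approximation error.

def quantize_params(values, num_bits=8):
    """Compute scale and zero point mapping [min, max] onto the int range."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    lo, hi = min(values), max(values)
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # the range must include zero
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid zero scale for all-zero input
    zero_point = round(qmin - lo / scale)
    return scale, zero_point, qmin, qmax

def quantize(values, scale, zero_point, qmin, qmax):
    # Round to the nearest integer level and clamp to the representable range.
    return [max(qmin, min(qmax, round(v / scale + zero_point))) for v in values]

def dequantize(q, scale, zero_point):
    return [(qi - zero_point) * scale for qi in q]

weights = [0.5, -1.2, 0.03, 2.7, -0.8]  # toy "layer" weights (illustrative)
scale, zp, qmin, qmax = quantize_params(weights)
q = quantize(weights, scale, zp, qmin, qmax)
recovered = dequantize(q, scale, zp)
max_err = max(abs(w - r) for w, r in zip(weights, recovered))
assert max_err <= scale  # per-weight error stays within one quantization step
```

Each 32-bit float weight is stored as one int8 plus a shared scale and zero point, roughly a 4x memory reduction per tensor; frameworks such as PyTorch expose the production version of this idea (e.g. dynamic quantization of `nn.Linear` layers), which is how it would typically be applied to a model like GPT-2.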
Syllabus
00:00 Introduction and Speaker Introduction
00:31 The Challenge of Optimizing Large Language Models
01:47 Choosing the Right Model: GPT-2
03:48 Optimization Techniques: Theory
09:56 Practical Research and Experiments
18:42 Combining Optimization Methods
21:14 Conclusions and Final Takeaways
Taught by
Conf42