Overview
This course provides a comprehensive overview of techniques for improving the inference performance of large language models (LLMs). It begins with the principles of LLM inference optimization, covering the transformer architecture and the main optimization strategies. Participants then explore advanced methods, including quantization and speculative decoding, that reduce model size and improve execution speed. The course also covers model parallelism and sharding techniques for effective deployment in real-world applications. Finally, learners complete a project on accelerating news headline generation with LLM optimization, putting the concepts into practice.
Syllabus
- Introduction to LLM Inference Optimization
- Discover why LLM inference optimization is crucial, identify key performance bottlenecks, learn how to measure optimization success, and explore the main optimization categories along with hands-on profiling techniques.
- Transformer Architecture Optimizations
- Discover transformer optimizations: KV caching for faster generation, grouped-query attention (GQA) for a smaller memory footprint, and Flash Attention for faster, more memory-efficient attention computation in large language models.
- Quantization and Speculative Decoding
- Learn how quantization, pruning, and speculative decoding shrink LLMs and speed up text generation while preserving output quality for more efficient AI deployments.
- Model Parallelism, Sharding, and Deployment
- Learn how to scale and deploy large language models using model parallelism, sharding (DeepSpeed ZeRO/FSDP), and optimized serving tools like TensorRT-LLM, Triton, vLLM, and llama.cpp.
- UdaciHeadline: Accelerating News Generation with LLM Optimization
- This project focuses on optimizing the inference performance of a pre-trained LLM fine-tuned for generating news headlines from article summaries.
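To give a feel for the profiling work in the first lesson, here is a minimal sketch of measuring decode throughput in tokens per second. The `generate` function is a hypothetical stand-in for a model's autoregressive decode loop, not part of the course materials.

```python
import time

def generate(prompt_tokens, n_new_tokens):
    # Hypothetical stand-in for an LLM decode loop: one token per step.
    out = list(prompt_tokens)
    for _ in range(n_new_tokens):
        time.sleep(0.001)          # simulate per-token compute
        out.append(out[-1] + 1)    # dummy "next token"
    return out

start = time.perf_counter()
tokens = generate([1, 2, 3], n_new_tokens=50)
elapsed = time.perf_counter() - start
tokens_per_sec = 50 / elapsed
print(f"generated 50 tokens in {elapsed:.3f}s ({tokens_per_sec:.1f} tok/s)")
```

In real profiling you would time a model's `generate` call the same way, and separately track time-to-first-token versus steady-state decode throughput, since the two bottlenecks differ.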
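The KV-caching idea from the transformer-optimizations lesson can be sketched with a toy scalar attention: each decode step appends the new token's key and value to a cache and attends over it, instead of recomputing keys and values for the whole prefix. This is an illustrative sketch, not the course's implementation.

```python
import math

def attend(q, ks, vs):
    # Softmax-weighted average of cached values (scalar toy version).
    scores = [q * k for k in ks]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e / total * v for e, v in zip(exps, vs))

def decode_with_cache(token_embs):
    k_cache, v_cache, outputs = [], [], []
    for x in token_embs:
        # Only the NEW token is "projected" and appended to the cache;
        # prior keys/values are reused, which is the whole point of KV caching.
        k_cache.append(x)
        v_cache.append(x)
        outputs.append(attend(x, k_cache, v_cache))
    return outputs

outs = decode_with_cache([0.5, 1.0, -0.5, 2.0])
```

In a real transformer the cache holds per-layer, per-head key/value tensors, and its size is what GQA reduces by sharing key/value heads across query heads.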
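As a taste of the quantization lesson, here is a minimal sketch of symmetric per-tensor int8 quantization: weights are mapped to integers in [-127, 127] with a single scale factor, trading a small rounding error for a 4x size reduction versus float32. The helper names are illustrative, not from the course.

```python
def quantize_int8(weights):
    # Symmetric per-tensor quantization: w ~= scale * q, with q in [-127, 127].
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [scale * v for v in q]

w = [0.12, -0.5, 0.33, 1.0, -0.07]
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
```

The worst-case rounding error is half the scale step, which is why quality is usually preserved; production schemes refine this with per-channel scales and calibration data.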
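The speculative-decoding loop can likewise be sketched with toy greedy "models": a cheap draft model proposes several tokens, the target model verifies them left to right, and the step keeps the agreed prefix plus one token from the target. Both lambda "models" below are hypothetical stand-ins.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    # Draft proposes k tokens autoregressively.
    proposed = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)
    # Target verifies: accept while it agrees, then emit its own correction.
    accepted = []
    ctx = list(prefix)
    for t in proposed:
        want = target_next(ctx)
        if want == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(want)   # target's correction ends the step
            break
    else:
        accepted.append(target_next(ctx))  # all accepted: one bonus token
    return accepted

# Hypothetical greedy models: both continue a +1 sequence,
# but the draft is wrong whenever the next value is a multiple of 3.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + (2 if (ctx[-1] + 1) % 3 == 0 else 1)

out = speculative_step([0, 1], draft, target, k=4)
```

Because verification scores all drafted tokens in one target pass, each accepted token beyond the first is nearly free, which is where the speedup comes from; real implementations accept or reject probabilistically so the output distribution matches the target model exactly.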
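Finally, the core idea behind tensor-parallel sharding from the deployment module can be sketched in a few lines: split a weight matrix column-wise across "devices", let each compute its slice of the matmul independently, and concatenate the results. This is a pure-Python illustration of the math, not how DeepSpeed ZeRO or FSDP are invoked.

```python
def matmul(x, W):
    # x: input vector; W: matrix as list of rows -> output vector.
    cols = len(W[0])
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(cols)]

def shard_columns(W, n_shards):
    # Column-wise (tensor-parallel) split: each "device" holds a slice of W.
    cols = len(W[0])
    per = cols // n_shards
    return [[row[s * per:(s + 1) * per] for row in W] for s in range(n_shards)]

x = [1.0, 2.0]
W = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
shards = shard_columns(W, 2)
# Each shard's matmul runs independently; outputs are concatenated,
# so the sharded result matches the unsharded one exactly.
partials = [matmul(x, Ws) for Ws in shards]
y_parallel = [v for p in partials for v in p]
y_full = matmul(x, W)
```

ZeRO and FSDP shard parameters, gradients, and optimizer state across devices in a similar spirit to fit models that exceed one GPU's memory, while serving stacks like vLLM and TensorRT-LLM apply tensor parallelism at inference time.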
Taught by
Rishabh Misra