Overview
This course provides a comprehensive overview of techniques for improving the inference performance of large language models (LLMs). It begins with the principles of LLM inference optimization, covering the transformer architecture and the main optimization strategies. Participants then explore advanced methods, including quantization and speculative decoding, that reduce model size and improve execution speed. The course also covers model parallelism and sharding techniques for effective deployment in real-world applications. Finally, learners complete a project on accelerating news headline generation with LLM optimization, putting the concepts into practice.
Syllabus
- Introduction to LLM Inference Optimization
- Discover why LLM inference optimization is crucial, identify key performance bottlenecks, learn how to measure optimization success, and explore the main optimization categories along with hands-on profiling techniques.
- Transformer Architecture Optimizations
- Discover transformer optimizations: KV caching for faster generation, grouped-query attention (GQA) for a smaller memory footprint, and Flash Attention for faster, more memory-efficient attention computation in large language models.
- Quantization and Speculative Decoding
- Learn how quantization, pruning, and speculative decoding shrink LLMs and speed up text generation while preserving output quality for more efficient AI deployments.
- Model Parallelism, Sharding, and Deployment
- Learn how to scale and deploy large language models using model parallelism, sharding (DeepSpeed ZeRO/FSDP), and optimized serving tools like TensorRT-LLM, Triton, vLLM, and llama.cpp.
- UdaciHeadline: Accelerating News Generation with LLM Optimization
- This project focuses on optimizing the inference performance of a pre-trained LLM fine-tuned for generating news headlines from article summaries.
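To give a feel for the profiling work in the first lesson, here is a minimal sketch of measuring decode throughput in tokens per second. The `generate` function is a hypothetical stand-in for a model's autoregressive decode loop, not part of the course materials.

```python
import time

def generate(prompt_tokens, n_new_tokens):
    # Hypothetical stand-in for an LLM decode loop: one token per step.
    out = list(prompt_tokens)
    for _ in range(n_new_tokens):
        time.sleep(0.001)          # simulate per-token compute
        out.append(out[-1] + 1)    # dummy "next token"
    return out

start = time.perf_counter()
tokens = generate([1, 2, 3], n_new_tokens=50)
elapsed = time.perf_counter() - start
tokens_per_sec = 50 / elapsed
print(f"generated 50 tokens in {elapsed:.3f}s ({tokens_per_sec:.1f} tok/s)")
```

In real profiling you would time a model's `generate` call the same way, and separately track time-to-first-token versus steady-state decode throughput, since the two bottlenecks differ.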
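The KV-caching idea from the transformer-optimizations lesson can be sketched with a toy scalar attention: each decode step appends the new token's key and value to a cache and attends over it, instead of recomputing keys and values for the whole prefix. This is an illustrative sketch, not the course's implementation.

```python
import math

def attend(q, ks, vs):
    # Softmax-weighted average of cached values (scalar toy version).
    scores = [q * k for k in ks]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e / total * v for e, v in zip(exps, vs))

def decode_with_cache(token_embs):
    k_cache, v_cache, outputs = [], [], []
    for x in token_embs:
        # Only the NEW token is "projected" and appended to the cache;
        # prior keys/values are reused, which is the whole point of KV caching.
        k_cache.append(x)
        v_cache.append(x)
        outputs.append(attend(x, k_cache, v_cache))
    return outputs

outs = decode_with_cache([0.5, 1.0, -0.5, 2.0])
```

In a real transformer the cache holds per-layer, per-head key/value tensors, and its size is what GQA reduces by sharing key/value heads across query heads.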
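As a taste of the quantization lesson, here is a minimal sketch of symmetric per-tensor int8 quantization: weights are mapped to integers in [-127, 127] with a single scale factor, trading a small rounding error for a 4x size reduction versus float32. The helper names are illustrative, not from the course.

```python
def quantize_int8(weights):
    # Symmetric per-tensor quantization: w ~= scale * q, with q in [-127, 127].
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [scale * v for v in q]

w = [0.12, -0.5, 0.33, 1.0, -0.07]
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
```

The worst-case rounding error is half the scale step, which is why quality is usually preserved; production schemes refine this with per-channel scales and calibration data.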
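The speculative-decoding loop can likewise be sketched with toy greedy "models": a cheap draft model proposes several tokens, the target model verifies them left to right, and the step keeps the agreed prefix plus one token from the target. Both lambda "models" below are hypothetical stand-ins.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    # Draft proposes k tokens autoregressively.
    proposed = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)
    # Target verifies: accept while it agrees, then emit its own correction.
    accepted = []
    ctx = list(prefix)
    for t in proposed:
        want = target_next(ctx)
        if want == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(want)   # target's correction ends the step
            break
    else:
        accepted.append(target_next(ctx))  # all accepted: one bonus token
    return accepted

# Hypothetical greedy models: both continue a +1 sequence,
# but the draft is wrong whenever the next value is a multiple of 3.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + (2 if (ctx[-1] + 1) % 3 == 0 else 1)

out = speculative_step([0, 1], draft, target, k=4)
```

Because verification scores all drafted tokens in one target pass, each accepted token beyond the first is nearly free, which is where the speedup comes from; real implementations accept or reject probabilistically so the output distribution matches the target model exactly.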
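Finally, the core idea behind tensor-parallel sharding from the deployment module can be sketched in a few lines: split a weight matrix column-wise across "devices", let each compute its slice of the matmul independently, and concatenate the results. This is a pure-Python illustration of the math, not how DeepSpeed ZeRO or FSDP are invoked.

```python
def matmul(x, W):
    # x: input vector; W: matrix as list of rows -> output vector.
    cols = len(W[0])
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(cols)]

def shard_columns(W, n_shards):
    # Column-wise (tensor-parallel) split: each "device" holds a slice of W.
    cols = len(W[0])
    per = cols // n_shards
    return [[row[s * per:(s + 1) * per] for row in W] for s in range(n_shards)]

x = [1.0, 2.0]
W = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
shards = shard_columns(W, 2)
# Each shard's matmul runs independently; outputs are concatenated,
# so the sharded result matches the unsharded one exactly.
partials = [matmul(x, Ws) for Ws in shards]
y_parallel = [v for p in partials for v in p]
y_full = matmul(x, W)
```

ZeRO and FSDP shard parameters, gradients, and optimizer state across devices in a similar spirit to fit models that exceed one GPU's memory, while serving stacks like vLLM and TensorRT-LLM apply tensor parallelism at inference time.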
Taught by
Rishabh Misra