Overview

This Nanodegree program equips learners with advanced skills and practical experience in optimizing machine learning models for performance, scalability, and real-world deployment. Students explore foundational principles and techniques such as quantization, pruning, and profiling, and apply them to both traditional machine learning models and large language models (LLMs). The program then covers advanced model compression methods, including low-rank compression and knowledge distillation, as well as the design of efficient architectures for hardware acceleration using tools such as TensorRT and ONNX. By the end, graduates will be able to optimize LLM inference pipelines and deploy efficient models that meet specific performance and deployment requirements.
Syllabus
- Model Optimization Foundational Principles
- This course covers essential techniques for optimizing machine learning models. After an introductory overview, it presents key optimization strategies, including quantization techniques that reduce model size and improve efficiency (a minimal quantization sketch appears after this syllabus). Students explore pruning and sparsity methods that eliminate redundancy in models, and the course emphasizes profiling tools and performance analysis so students can assess and refine their models effectively. It culminates in hands-on practice optimizing and deploying the GPT-2 model, giving students a solid foundation in optimizing state-of-the-art models for real-world applications.
- Advanced Model Compression Techniques
- This course teaches methodologies for reducing the size of machine learning models without significantly impacting performance. Starting with an introduction to the main techniques, tools, and real-world applications, it delves into post-training and training-time compression methods such as knowledge distillation (a distillation-loss sketch appears after this syllabus). Participants explore how to build compression pipelines that combine multiple techniques to improve model efficiency. In the project "UdaciSense - Optimized Mobile Object Recognition," learners apply their knowledge to develop a practical, optimized solution for mobile devices. The course suits AI practitioners seeking to advance their model-optimization skills.
- Efficient Architectural Design and Hardware Acceleration
- This course explores the intersection of innovative model design and hardware acceleration. It begins with an introduction to efficient model architectures, focusing on optimization techniques for various applications. Participants learn to develop mobile-friendly networks that deploy smoothly on resource-constrained devices. The course emphasizes practical skills with hardware-acceleration tools and libraries such as TensorRT and ONNX (an ONNX export sketch appears after this syllabus), followed by strategies for integrating these tools with efficient architectures. The project applies hardware-aware model optimization to efficient medical diagnostics, tying together the full design-to-deployment workflow.
- LLM Inference Optimization
- This course provides a comprehensive overview of techniques for improving the inference performance of large language models (LLMs). It begins with the principles of LLM inference optimization, focusing on the transformer architecture and core optimization strategies. Participants explore advanced methods, including quantization and speculative decoding, to reduce model complexity and improve execution speed (a speculative-decoding sketch appears after this syllabus). The course also covers model parallelism and sharding techniques for deployment in real-world applications. Learners complete a project on accelerating news headline generation, putting the discussed optimizations into practice.
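To give a concrete flavor of the foundational course's material, here is a minimal sketch of post-training dynamic quantization in PyTorch. The toy two-layer network is an illustrative assumption, not code from the course.

```python
import torch
import torch.nn as nn

# Toy stand-in for a larger network; any module containing nn.Linear layers works.
model = nn.Sequential(nn.Linear(768, 3072), nn.ReLU(), nn.Linear(3072, 768))

# Dynamic quantization stores Linear weights as int8 and quantizes
# activations on the fly at inference time (runs on CPU).
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 768)
print(quantized(x).shape)  # torch.Size([1, 768])
```

Because weights drop from 32-bit floats to 8-bit integers, Linear-heavy models shrink roughly fourfold, typically with little accuracy loss.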
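The compression course's training-time methods include knowledge distillation. Below is a minimal sketch of the standard distillation objective: the student matches the teacher's softened logits via KL divergence, blended with ordinary cross-entropy on the true labels. The temperature T and weight alpha are illustrative hyperparameters, not course-specified values.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft targets: KL divergence between temperature-scaled distributions,
    # multiplied by T^2 to keep gradient magnitudes comparable.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy on the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```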
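For the hardware-acceleration course, here is a minimal sketch of exporting a PyTorch model to ONNX, the interchange format that runtimes such as TensorRT consume. The tiny CNN and file name are placeholders, not course assets.

```python
import torch
import torch.nn as nn

# Placeholder network; in practice this would be the trained model.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),
).eval()

dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["input"], output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}},  # allow variable batch size at runtime
)
```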
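Finally, the LLM course names speculative decoding among its inference optimizations. The sketch below shows a simplified greedy variant: a small draft model proposes several tokens cheaply, and a larger target model verifies them all in one forward pass. The gpt2/gpt2-large pairing and the speculative_step helper are illustrative assumptions, not the course's implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
draft = AutoModelForCausalLM.from_pretrained("gpt2").eval()         # small, fast proposer
target = AutoModelForCausalLM.from_pretrained("gpt2-large").eval()  # large verifier

@torch.no_grad()
def speculative_step(ids, k=4):
    # 1) The draft model proposes k tokens greedily, one at a time.
    proposal = ids
    for _ in range(k):
        logits = draft(proposal).logits[:, -1, :]
        proposal = torch.cat([proposal, logits.argmax(-1, keepdim=True)], dim=-1)
    # 2) The target model scores all proposed tokens in a single forward pass.
    tlogits = target(proposal).logits
    drafted = proposal[:, ids.shape[1]:]
    preds = tlogits[:, ids.shape[1] - 1:-1, :].argmax(-1)  # target's greedy choices
    # 3) Accept the longest prefix on which draft and target agree.
    n_accept = int((preds == drafted).long().cumprod(-1).sum())
    accepted = drafted[:, :n_accept]
    if n_accept < k:
        next_tok = preds[:, n_accept:n_accept + 1]             # target's token at the mismatch
    else:
        next_tok = tlogits[:, -1, :].argmax(-1, keepdim=True)  # all accepted: one bonus token
    return torch.cat([ids, accepted, next_tok], dim=-1)

ids = tok("Breaking news:", return_tensors="pt").input_ids
for _ in range(5):
    ids = speculative_step(ids)
print(tok.decode(ids[0]))
```

Production implementations sample rather than take the argmax and reuse key/value caches across steps; this greedy version keeps the control flow easy to follow.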
Taught by
Darryl Fernandes, Samantha Guerriero and Rishabh Misra