Efficient Inference of Extremely Large Transformer Models
Toronto Machine Learning Series (TMLS) via YouTube
Overview
This 28-minute Toronto Machine Learning Series (TMLS) talk covers the challenges of running inference on massive transformer-based language models and the techniques used to address them. Bharat Venkitesh, Senior Machine Learning Engineer at Cohere, explains how multi-billion-parameter models are optimized for production environments and made faster, smaller, and more cost-effective through model compression, efficient attention mechanisms, and model parallelism strategies. He also discusses the establishment of the inference tech stack and the latest advancements in handling extremely large transformer models.
Syllabus
Efficient Inference of Extremely Large Transformer Models
Taught by
Toronto Machine Learning Series (TMLS)