Implementing Large Language Models Inference in Pure C++ - A Llama 2 Case Study
code::dive conference via YouTube
Overview
Explore a comprehensive conference talk that demonstrates how to implement Llama 2 model inference in pure C++, delivered by GPU modeling engineer Filipe Mulonde at the code::dive conference. Learn the fundamentals of Llama 2, a state-of-the-art language model, and discover practical techniques for implementing inference without external dependencies. Understand the model's architecture through a streamlined, educational approach inspired by the llama.cpp and llama2.c projects, starting with a PyTorch-trained model and progressing to a dependency-free C++ implementation. Gain insights into optimization techniques for fast model inference and into practical applications ranging from chatbots to content creation. Benefit from Mulonde's extensive experience as an ARM Holdings engineer, his academic background in Software Engineering and Artificial Intelligence, and his research work at ETH Zurich in computer architecture and bioinformatics. The hour-long presentation draws on his expertise gained through working on autonomous train development and speaking at major C++ conferences such as CppCon, Meeting C++, and Embo++.
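For a flavor of what a dependency-free C++ inference implementation involves, below is a minimal sketch of one building block, RMSNorm, the normalization Llama 2 applies before each attention and feed-forward layer. The function name, epsilon value, and memory layout are illustrative assumptions in the spirit of llama2.c, not code from the talk.

#include <cmath>
#include <cstdio>
#include <vector>

// Illustrative sketch (not code from the talk): RMSNorm scales each
// element by the reciprocal root-mean-square of the vector, then by a
// learned per-channel weight.
void rmsnorm(float* out, const float* x, const float* weight, int n) {
    float ss = 0.0f;
    for (int i = 0; i < n; i++) ss += x[i] * x[i];
    const float scale = 1.0f / std::sqrt(ss / n + 1e-5f); // epsilon of 1e-5 assumed
    for (int i = 0; i < n; i++) out[i] = x[i] * scale * weight[i];
}

int main() {
    std::vector<float> x = {1.0f, 2.0f, 3.0f, 4.0f};
    std::vector<float> w(4, 1.0f); // identity weights for the demo
    std::vector<float> y(4);
    rmsnorm(y.data(), x.data(), w.data(), 4);
    for (float v : y) std::printf("%f\n", v);
    return 0;
}

Production-grade implementations such as llama.cpp layer quantization, SIMD, and cache-friendly memory layouts on top of simple kernels like this one, which is where much of the optimization work discussed in the talk comes in.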
Syllabus
Filipe Mulonde - Implementing Large Language Models (LLMs) Inference in Pure C++
Taught by
code::dive conference