Implementing Large Language Models Inference in Pure C++ - A Llama 2 Case Study
code::dive conference via YouTube
Overview
Explore a comprehensive conference talk that demonstrates how to implement Llama 2 model inference in pure C++, delivered by GPU modeling engineer Filipe Mulonde at the code::dive conference. Learn the fundamentals of Llama 2, a state-of-the-art language model, and discover practical techniques for implementing inference without external dependencies. Understand the model's architecture through a streamlined, educational approach inspired by the llama.cpp and llama2.c projects, starting with a PyTorch-trained model and progressing to a dependency-free C++ implementation. Gain insights into optimization techniques for fast model inference and into practical applications ranging from chatbots to content creation. Benefit from Mulonde's extensive experience as an ARM Holdings engineer, his academic background in Software Engineering and Artificial Intelligence, and his research work at ETH Zurich in computer architecture and bioinformatics. The hour-long presentation draws on his expertise gained through working on autonomous train development and speaking at major C++ conferences such as CppCon, Meeting C++, and Embo++.
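For a flavor of what a dependency-free C++ inference implementation involves, below is a minimal sketch of one building block, RMSNorm, the normalization Llama 2 applies before each attention and feed-forward layer. The function name, epsilon value, and memory layout are illustrative assumptions in the spirit of llama2.c, not code from the talk.

#include <cmath>
#include <cstdio>
#include <vector>

// Illustrative sketch (not code from the talk): RMSNorm scales each
// element by the reciprocal root-mean-square of the vector, then by a
// learned per-channel weight.
void rmsnorm(float* out, const float* x, const float* weight, int n) {
    float ss = 0.0f;
    for (int i = 0; i < n; i++) ss += x[i] * x[i];
    const float scale = 1.0f / std::sqrt(ss / n + 1e-5f); // epsilon of 1e-5 assumed
    for (int i = 0; i < n; i++) out[i] = x[i] * scale * weight[i];
}

int main() {
    std::vector<float> x = {1.0f, 2.0f, 3.0f, 4.0f};
    std::vector<float> w(4, 1.0f); // identity weights for the demo
    std::vector<float> y(4);
    rmsnorm(y.data(), x.data(), w.data(), 4);
    for (float v : y) std::printf("%f\n", v);
    return 0;
}

Production-grade implementations such as llama.cpp layer quantization, SIMD, and cache-friendly memory layouts on top of simple kernels like this one, which is where much of the optimization work discussed in the talk comes in.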
Syllabus
Filipe Mulonde - Implementing Large Language Models (LLMs) Inference in Pure C++
Taught by
code::dive conference