YouTube

Running Llama 2 with Extended Context Length - Up to 32k Tokens

Trelis Research via YouTube

Overview

Learn how to scale Llama 2 to a 32k context length in this comprehensive 22-minute tutorial video. Discover techniques for reaching up to 16k tokens on a 40 GB Colab GPU and 32k tokens on an 80 GB A100 rented through platforms like RunPod, AWS, or Azure. Explore the use of Flash Attention, BetterTransformer, and GPTQ quantization to optimize performance. Gain insights into running GPTQ models in Colab, streaming Llama 2 13B at various context lengths, and adjusting parameters such as maximum token output and temperature. Access a free Jupyter notebook for implementation, or consider the PRO version for advanced features like conversation saving and document analysis. Delve into the theory behind extending context length, compare different models, and gather practical tips for working with long contexts in language models.
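The techniques described above (GPTQ quantization plus linear RoPE scaling to stretch the context window) can be sketched with the Hugging Face `transformers` API. This is a minimal illustration, not the tutorial's actual notebook: the repo name `TheBloke/Llama-2-13B-GPTQ` and the 4096-token trained context are assumptions, and actually loading the model requires a GPU with the `auto-gptq` package installed.

```python
def rope_factor(target_ctx: int, trained_ctx: int = 4096) -> float:
    """Linear RoPE scaling factor needed to stretch a model trained on
    trained_ctx tokens (4096 for Llama 2) out to target_ctx tokens."""
    return target_ctx / trained_ctx


def load_long_context_llama(target_ctx: int = 32768):
    """Sketch only: needs a CUDA GPU and the transformers + auto-gptq
    packages; the model repo name below is an assumption."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "TheBloke/Llama-2-13B-GPTQ"  # assumed GPTQ checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",  # place layers on the available GPU(s)
        # Stretch the RoPE positions linearly, e.g. 8x for 32k context
        rope_scaling={"type": "linear", "factor": rope_factor(target_ctx)},
    )
    return tokenizer, model
```

A 16k run on a 40 GB GPU would use `rope_factor(16384)` (a factor of 4) instead; the quantized weights are what make the 13B model fit in that memory budget at all.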

Syllabus

How to run Llama 2 with longer context length
Run Llama 2 with 16k context in Google Colab
How to run a GPTQ model in Colab
Run Llama 2 7B with 32k context length using RunPod
Run Llama 2 13B with 16k context length for better performance
Streaming Llama 2 13B on 16k context length
Adjusting max token output and temperature
Streaming Llama 2 13B on 16k context length at temperature 0
Streaming Llama 2 13B on 32k context length
PRO notebook: save chats and files, easily adjust context length
Theory bonus: how to get longer context length?
How does GPTQ work?
How does Flash attention work?
What is the best model for long context length?
Which is better: Llama 2, Code Llama, or YaRN?
Tips for long context lengths
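The theory chapter on getting longer context length centers on linear RoPE scaling (position interpolation): positions are divided by a scale factor so that the largest target position produces the same rotary angles the model saw during training. A minimal numeric sketch, using the standard Llama rotary base of 10000 (the head dimension of 128 is illustrative):

```python
def rope_angles(pos, dim=128, base=10000.0, scale=1.0):
    """Rotary angles for each frequency pair at a (possibly scaled) position."""
    return [(pos / scale) * base ** (-2 * i / dim) for i in range(dim // 2)]


train_len, target_len = 4096, 32768
scale = target_len / train_len  # 8.0 for a 4k -> 32k stretch

# With linear interpolation, position 32768 produces exactly the angles
# that position 4096 produced during training, so no rotary angle ever
# exceeds the range the model was trained on.
assert rope_angles(target_len, scale=scale) == rope_angles(train_len)
```

The trade-off, as the video's comparison of Llama 2, Code Llama, and YaRN suggests, is that squeezing more positions into the trained angle range reduces positional resolution, which is why interpolated models are typically fine-tuned briefly on long sequences.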

Taught by

Trelis Research
