Overview
Learn to implement DeepSeek V3 from scratch in this comprehensive 3-hour 47-minute Python course. Follow along as instructor @vukrosic pairs theoretical explanations with hands-on coding sessions for building this cutting-edge deep learning model. Master key concepts including attention mechanisms, Query-Key-Value operations, the KV Cache, Multihead Latent Attention (MLA), Rotary Position Embedding (RoPE), Mixture of Experts (MoE), gating mechanisms, and transformer blocks. The course references the DeepSeek V3 paper and provides access to inference code that can be modified for training purposes. The curriculum starts from fundamental attention concepts and builds step by step toward a complete transformer architecture, with a practical coding session for each component.
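To give a flavor of the attention and Query-Key-Value material the course opens with, here is a minimal NumPy sketch of scaled dot-product attention. This is an illustrative example, not the course's own code; the shapes and variable names are assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Basic attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V are (seq_len, d_k) arrays; each output row is a
    weighted average of the value vectors.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                               # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 tokens, 8-dim heads (illustrative sizes)
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8): one attended vector per query token
```

Techniques covered later in the course, such as the KV Cache and MLA, build on this same operation by reusing or compressing the `K` and `V` tensors.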
Syllabus
⌨️ 0:00:00 Intro
⌨️ 0:01:40 Attention Mechanism
⌨️ 0:13:34 Query, Key, Value
⌨️ 0:34:11 KV Cache
⌨️ 0:39:06 Multihead Latent Attention (MLA)
⌨️ 0:58:53 Coding MLA
⌨️ 1:28:41 RoPE
⌨️ 1:55:44 Coding KV Cache
⌨️ 2:00:25 MLA forward
⌨️ 2:28:24 MoE, Gate
⌨️ 2:49:25 Gate code
⌨️ 3:09:10 MoE code
⌨️ 3:28:36 Transformer Blocks
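As a companion to the MoE and gating sections of the syllabus (2:28:24 onward), here is a minimal NumPy sketch of top-k expert routing. It is a simplified illustration under assumed names and shapes, not the DeepSeek V3 implementation built in the course.

```python
import numpy as np

def top_k_gate(x, W_gate, k=2):
    """Score all experts for token x and keep the top-k, softmax-normalized."""
    logits = x @ W_gate                      # one score per expert
    top = np.argsort(logits)[-k:]            # indices of the k highest scores
    w = np.exp(logits[top] - logits[top].max())
    return top, w / w.sum()                  # expert ids, normalized weights

def moe_forward(x, W_gate, experts, k=2):
    """Combine only the selected experts' outputs, weighted by the gate."""
    ids, weights = top_k_gate(x, W_gate, k)
    return sum(w * experts[i](x) for i, w in zip(ids, weights))

rng = np.random.default_rng(0)
d, n_experts = 8, 4                          # illustrative sizes
W_gate = rng.normal(size=(d, n_experts))
# each "expert" here is just a linear map for illustration
expert_mats = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda x, M=M: x @ M for M in expert_mats]
x = rng.normal(size=d)
y = moe_forward(x, W_gate, experts)
print(y.shape)  # (8,)
```

Only `k` of the `n_experts` expert networks run per token, which is what lets MoE models scale parameter count without a proportional increase in compute.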
Taught by
freeCodeCamp.org