
YouTube

How AI Videos Actually Work - Diffusion Models, CLIP, and the Math of Turning Text into Images

3Blue1Brown via YouTube

Overview

Explore the mathematical foundations and technical mechanisms behind AI-generated videos and images in this comprehensive 40-minute educational video. Dive deep into the core technologies that power modern AI image and video generation, starting with CLIP (Contrastive Language-Image Pre-training) and its role in creating shared embedding spaces that connect text descriptions with visual content. Learn how diffusion models work, particularly the Denoising Diffusion Probabilistic Models (DDPM) approach, and understand the mathematical concept of learning vector fields that guide the image generation process. Discover the improvements offered by DDIM (Denoising Diffusion Implicit Models) and examine how DALL-E 2 implements these concepts in practice. Master the crucial concepts of conditioning, which allows models to generate images based on text prompts, and understand guidance techniques that improve output quality and adherence to prompts. Explore how negative prompts work to exclude unwanted elements from generated content. The video features detailed mathematical explanations, visual animations created with manim, and references to key research papers including the original DDPM and Classifier Free Guidance papers. Technical implementation details are supported by code examples and the smalldiffusion library, with additional resources including tutorials from MIT courses and comprehensive blog posts on diffusion models.
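
As a rough illustration of the shared embedding space idea described above (not CLIP's actual API; the encoders below are random stand-ins), matching a caption to an image comes down to cosine similarity between two vectors living in the same space:

    import numpy as np

    def cosine_similarity(a, b):
        # L2-normalize both embeddings so the dot product is the cosine similarity.
        a = a / np.linalg.norm(a)
        b = b / np.linalg.norm(b)
        return float(a @ b)

    # Hypothetical stand-ins for CLIP's text and image encoders; in the real model these
    # are trained networks that project both modalities into one shared embedding space.
    rng = np.random.default_rng(0)
    text_embedding = rng.normal(size=512)   # e.g. embedding of "a photo of a dog"
    image_embedding = rng.normal(size=512)  # e.g. embedding of a dog photo

    print(cosine_similarity(text_embedding, image_embedding))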

Syllabus

0:00 - Intro
3:37 - CLIP
6:25 - Shared Embedding Space
8:16 - Diffusion Models & DDPM
11:44 - Learning Vector Fields
22:00 - DDIM
25:25 - DALL-E 2
26:37 - Conditioning
30:02 - Guidance
33:39 - Negative Prompts
34:27 - Outro
35:32 - About guest videos + Grant’s Reaction
6:15 CLIP: Although directly minimizing cosine similarity would push our vectors 180 degrees apart on a single batch, in practice CLIP needs to maximize the uniformity of concepts over the hypersphere it operates on. For this reason, we animated these vectors as orthogonal-ish. See: https://proceedings.mlr.press/v119/wang20k/wang20k.pdf
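
For readers curious about the uniformity property in the linked Wang & Isola paper, here is a minimal numpy sketch of their uniformity loss (the batch of unit vectors is random and t=2 is the paper's default temperature; this is an illustration, not the video's code). Minimizing this quantity spreads embeddings evenly over the hypersphere rather than pushing every pair 180 degrees apart:

    import numpy as np

    def uniformity_loss(embeddings, t=2.0):
        # embeddings: (n, d) array of L2-normalized vectors on the unit hypersphere.
        # Uniformity loss: log of the mean Gaussian potential over all distinct pairs.
        sq_dists = np.sum((embeddings[:, None, :] - embeddings[None, :, :]) ** 2, axis=-1)
        iu = np.triu_indices(len(embeddings), k=1)
        return float(np.log(np.mean(np.exp(-t * sq_dists[iu]))))

    rng = np.random.default_rng(0)
    x = rng.normal(size=(64, 128))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    print(uniformity_loss(x))  # lower = more uniformly spread over the sphere
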
Per Chenyang Yuan: at 10:15, the blurry image that results when removing random noise in DDPM is probably due to a mismatch in noise levels when calling the denoiser. When the denoiser is called on x_{t-1} during DDPM sampling, it expects a certain noise level, call it sigma_{t-1}. If you generate x_{t-1} from x_t without adding noise, then the noise present in x_{t-1} is always smaller than sigma_{t-1}. This causes the denoiser to remove too much noise, thus pointing towards the mean of the dataset.
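
To make that concrete, here is a minimal sketch of one reverse (sampling) step of DDPM; eps_model is a hypothetical noise-prediction network, and passing add_noise=False is exactly the "removing random noise" case discussed above:

    import numpy as np

    def ddpm_step(x_t, t, eps_model, betas, add_noise=True, rng=np.random.default_rng()):
        # One step of DDPM ancestral sampling: x_t -> x_{t-1}.
        alphas = 1.0 - betas
        alpha_bar = np.cumprod(alphas)
        eps = eps_model(x_t, t)  # predicted noise in x_t
        mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        if t == 0 or not add_noise:
            # Without the added noise, x_{t-1} carries less noise than the sigma_{t-1}
            # the denoiser expects at the next step, which pulls samples toward the
            # dataset mean and produces the blurry images mentioned above.
            return mean
        sigma_t = np.sqrt(betas[t])  # one common choice for sigma_t
        return mean + sigma_t * rng.standard_normal(x_t.shape)

    # Toy usage with a dummy noise predictor (a real model would be a trained network).
    betas = np.linspace(1e-4, 0.02, 1000)
    x_t = np.random.default_rng(1).standard_normal((8, 8))
    x_prev = ddpm_step(x_t, t=500, eps_model=lambda x, t: np.zeros_like(x), betas=betas)
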
For the vectors at 31:40 - Some implementations use f(x, t, cat) + alpha * (f(x, t, cat) - f(x, t)), and some do f(x, t) + alpha * (f(x, t, cat) - f(x, t)), where an alpha value of 1 corresponds to no guidance. I chose the second format here to keep things simpler.
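
In code, the two conventions read roughly as follows (f_cond and f_uncond are illustrative stand-ins for the model's conditional and unconditional outputs at one step):

    import numpy as np

    # Toy stand-ins for f(x, t, cat) and f(x, t).
    f_cond = np.array([1.0, 0.0])
    f_uncond = np.array([0.5, 0.5])
    alpha = 2.0

    # First convention: alpha = 0 already gives the plain conditional prediction (no guidance).
    guided_first = f_cond + alpha * (f_cond - f_uncond)

    # Second convention (used in the video): alpha = 1 corresponds to no guidance,
    # since the expression collapses to f_cond.
    guided_second = f_uncond + alpha * (f_cond - f_uncond)
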
At 30:30, the unconditional t=1 vector field looks a bit different from what it did at the 17:15 mark. This is the result of different models trained for different parts of the video, and likely a result of different random initializations.

Taught by

3Blue1Brown
