Overview

This is Part 2 of a two-part graduate sequence in deep learning. Building on the foundations from Part 1, it focuses on advanced generative modeling. You will study autoregressive models, diffusion models, energy-based models, and normalizing flows; see how these techniques converge in multimodal text-to-image systems such as CLIP, DALL-E 2, Imagen, and Stable Diffusion; and apply generative methods to creative domains such as music generation. The course concludes by synthesizing the full arc—from discriminative foundations to advanced generative AI—and examining the ethical and societal implications of deploying these systems.

Syllabus

Autoregressive Models

Autoregressive models are built on a deceptively simple principle: the joint probability of a sequence is the product of conditional probabilities of each element given all preceding elements. You will see this chain-rule factorization applied across three concrete systems—an LSTM recipe generator, PixelCNN for image synthesis, and the path from GPT to ChatGPT through reinforcement learning from human feedback.
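
The chain-rule factorization can be illustrated with a minimal sketch. The conditional table `COND` and the start marker `<s>` are hypothetical, and the toy conditions only on the previous token, whereas the systems named above condition on the full prefix.

```python
import math

# Hypothetical toy table of p(next | previous); "<s>" marks the sequence start.
# Real autoregressive models (LSTM, PixelCNN, GPT) condition on the whole prefix;
# this sketch uses only the last token to stay small.
COND = {
    "<s>": {"a": 0.5, "b": 0.5},
    "a": {"a": 0.1, "b": 0.9},
    "b": {"a": 0.6, "b": 0.4},
}

def sequence_log_prob(tokens):
    """Chain rule in log space: sum of log p(x_t | x_{t-1}) over the sequence."""
    log_p = 0.0
    prev = "<s>"
    for tok in tokens:
        log_p += math.log(COND[prev][tok])
        prev = tok
    return log_p

# Joint probability of "a b a" = p(a | <s>) * p(b | a) * p(a | b) = 0.5 * 0.9 * 0.6
joint = math.exp(sequence_log_prob(["a", "b", "a"]))
print(joint)
```

Working in log space avoids numerical underflow when many conditional probabilities are multiplied together.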

Diffusion Models

Diffusion models have become the dominant paradigm for high-quality image generation, powering DALL-E 2, Imagen, and Stable Diffusion, systems you will encounter later in this course. You will work through the full framework: forward diffusion as a Markov chain, the noise schedule and the closed-form expression for sampling any noised step directly, the DDPM reverse process, and the U-Net architecture used for denoising.
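
As a rough sketch of the forward process, the noised sample at step t can be drawn in one shot as x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, where alpha_bar_t is the cumulative product of (1 - beta). The linear schedule from 1e-4 to 0.02 over 1000 steps follows the DDPM paper's defaults. The function names and the three-element `x0` are assumptions made for illustration.

```python
import math
import random

def alpha_bar_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) directly, without simulating the Markov chain."""
    signal = math.sqrt(alpha_bars[t])
    noise = math.sqrt(1.0 - alpha_bars[t])
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in x0]

alpha_bars = alpha_bar_schedule()
rng = random.Random(0)
x0 = [1.0, -1.0, 0.5]                          # hypothetical "clean" data point
x_early = q_sample(x0, 0, alpha_bars, rng)     # almost unchanged
x_late = q_sample(x0, 999, alpha_bars, rng)    # almost pure Gaussian noise
print(alpha_bars[0], alpha_bars[-1])
```

The signal coefficient shrinks toward zero as t grows, which is why the final step is close to a standard Gaussian regardless of x_0.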

Energy-Based Models

Energy-based models offer a unified probabilistic framework rooted in statistical physics: assign a scalar energy to every configuration of variables, with low energy indicating high probability, and train a neural network to shape that landscape. You will study Langevin dynamics and contrastive divergence as approaches to training under intractable normalization, and see the framework applied to image generation.
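
The sampling side of this framework can be sketched with unadjusted Langevin dynamics on a toy energy. The energy E(x) = x^2 / 2 is an assumption chosen because its Gibbs distribution is a standard Gaussian, which makes the result easy to check. In an actual energy-based model the gradient would come from a neural network, and samples like these would serve as the negative examples in contrastive-divergence-style training.

```python
import math
import random

def energy_grad(x):
    # Gradient of the assumed toy energy E(x) = x^2 / 2.
    return x

def langevin_sample(x_init, step_size, num_steps, rng):
    """Unadjusted Langevin dynamics: gradient step on the energy plus Gaussian noise."""
    x = x_init
    for _ in range(num_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x - 0.5 * step_size * energy_grad(x) + math.sqrt(step_size) * noise
    return x

rng = random.Random(0)
# Every chain starts far from the mode at x = 3.0 and drifts toward low energy.
samples = [langevin_sample(3.0, 0.1, 200, rng) for _ in range(2000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)  # should land near 0 and 1, up to discretization and sampling error
```

Note that the update needs only the gradient of the energy, never the normalizing constant, which is what makes it usable when normalization is intractable.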

Normalizing Flow Models

Normalizing flows complete the generative model taxonomy introduced earlier in this course. Unlike VAEs—which optimize a variational lower bound—or GANs—which use implicit density estimation—flows enable exact likelihood computation through invertible mappings between the data distribution and a simple base distribution. You will work through the change-of-variables formula, Jacobian determinants, and the RealNVP architecture, with GLOW and FFJORD surveyed as key extensions.
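
A minimal sketch of exact likelihood through an invertible map, using one RealNVP-style affine coupling layer in two dimensions. The `scale` and `shift` functions are hypothetical stand-ins for the learned networks, and the base distribution is assumed to be a standard Gaussian.

```python
import math

def scale(x1):
    # Hypothetical stand-in for the learned scale network s(x1).
    return math.tanh(x1)

def shift(x1):
    # Hypothetical stand-in for the learned translation network t(x1).
    return 2.0 * x1

def coupling_forward(x1, x2):
    """Map data (x1, x2) to latent (z1, z2) and return log|det J| as well."""
    s = scale(x1)
    z1 = x1                                # first half passes through unchanged
    z2 = x2 * math.exp(s) + shift(x1)      # second half is affinely transformed
    return z1, z2, s                       # triangular Jacobian with diagonal (1, exp(s))

def coupling_inverse(z1, z2):
    """Exact inverse; s and t are only ever evaluated, never inverted."""
    x1 = z1
    x2 = (z2 - shift(z1)) * math.exp(-scale(z1))
    return x1, x2

def standard_normal_log_pdf(z):
    return -0.5 * (z * z + math.log(2.0 * math.pi))

def log_likelihood(x1, x2):
    """Change of variables: log p_x(x) = log p_z(f(x)) + log|det df/dx|."""
    z1, z2, log_det = coupling_forward(x1, x2)
    return standard_normal_log_pdf(z1) + standard_normal_log_pdf(z2) + log_det

x = (0.3, -1.2)
z1, z2, log_det = coupling_forward(*x)
x_back = coupling_inverse(z1, z2)
print(x_back, log_likelihood(*x))
```

Because the Jacobian is triangular, its log-determinant is just the sum of the scale outputs, which keeps the likelihood cheap to compute even in high dimensions.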

Multimodal Models

Multimodal models process and generate across more than one modality—text, images, audio, video—and represent the current frontier of generative AI deployment. Everything you have studied in this course converges here: Transformer-based encoders, contrastive learning objectives, and diffusion decoders combine inside systems like DALL-E 2, Imagen, and Stable Diffusion, each of which you will examine in depth.
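
The contrastive objective mentioned above can be sketched as a symmetric cross-entropy over an image-text similarity matrix, in the style of CLIP. The two-dimensional embeddings are made up for illustration, and the fixed temperature of 0.07 is an assumption; CLIP treats the temperature as a learned parameter.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[i]
    return total / len(logits)

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric loss: match each image to its text and each text to its image."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Hypothetical embeddings: pair 0 and pair 1 each point in similar directions.
images = [[1.0, 0.1], [0.0, 1.0]]
texts = [[0.9, 0.0], [0.1, 1.0]]
aligned = contrastive_loss(images, texts)
shuffled = contrastive_loss(images, texts[::-1])  # deliberately mismatched pairs
print(aligned, shuffled)
```

Correctly paired embeddings yield a much lower loss than the mismatched ordering, which is the signal that pulls matching image and text representations together during training.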

Music Generation

Music is a domain where the generative architectures you have studied throughout this course find an unexpectedly rich application: it is sequential like text, spatially structured like images, and polyphonic in ways that challenge single-stream models. You will explore how Transformer-based autoregressive models generate symbolic music token-by-token, and how MuseGAN extends adversarial training to multi-track polyphonic generation in piano-roll format.

Conclusion

There are no new technical lessons here—instead, you will synthesize the full arc of the course, from discriminative foundations through the generative landscape, and engage with the ethical dimensions of deploying these systems at scale: deepfakes, non-consensual generation, copyright, bias, and the governance challenges that accompany generative AI in the real world.