Overview

AI, Data Science & Cloud Certificates from Google, IBM & Meta — 40% Off

One plan covers every Professional Certificate on Coursera. 40% off Coursera Plus Annual.

By the end of this course, you will be able to: • Explain how CLIP aligns image and text in a shared embedding space, use VLMs to perform visual question answering, image captioning, and document understanding, and navigate the Hub for multimodal models. • Build a pipeline that transcribes audio with Whisper and generates images with Diffusers, and describe how LoRA fine-tuning and multimodal RAG extend VLM capabilities. • Build an agentic workflow using smolagents with VLM support and MCP tool integration to automate multi-step tasks requiring vision and reasoning. • Apply ShieldGemma 2 to filter inputs and outputs of a VLM pipeline, test against adversarial inputs, and document failure modes for responsible deployment. AI that can only read text is already behind. This intermediate course assumes you're comfortable with the HF Transformers library and basic Gradio development. It opens with a practical challenge: 2,000 products with photos but no descriptions, and a stack of invoice PDFs that need structured data extraction. You’ll learn how CLIP aligned images and text in a shared space, then use modern vision-language models to caption products, answer questions about charts, and pull fields from invoices. Go wider: transcribe customer calls with Whisper, generate images from text briefs with Diffusers, and learn when to fine-tune a model versus when to give it better context through retrieval. Build agent workflows that can see screenshots, reason about what’s on screen, and connect to external tools through the Model Context Protocol (MCP) to act on what they find. The course closes with a deployment readiness review: your CTO wants to launch the AI pipeline next week, and you need to decide whether it’s safe to ship — with safety filtering, adversarial testing, and documented failure modes backing your recommendation.

Syllabus

Multimodal Foundations and VLMs

Most AI models see one thing at a time — text or images, never both. Vision-language models change that, and the key insight starts with CLIP: images and text can live in the same embedding space. This module builds your multimodal mental model from CLIP to modern VLMs, then puts them to work on real tasks: visual question answering, image captioning, and document AI.

Audio, Generation, and Adaptation Strategies

Multimodal AI isn’t limited to vision — audio transcription and image generation are equally practical capabilities that HF makes accessible through Whisper and Diffusers. This module covers both, then introduces the strategic decision every practitioner faces: when to fine-tune a model with LoRA versus when to use retrieval-augmented generation to give the model better context.

Agents, MCP, and Tool Use

Running a single model is useful. Building a system where a model can see, reason, pick tools, act, and iterate — that’s an agent. This module teaches you to build agentic workflows with HF smolagents, connect agents to external tools via MCP (Model Context Protocol), and give agents vision capabilities so they can reason over screenshots and visual inputs.

Responsible Deployment

A multimodal system that works in a notebook can still fail catastrophically in production — generating harmful images, misreading sensitive documents, or amplifying biases across modalities. This module teaches you to wrap VLM pipelines with safety filtering, test against adversarial inputs, and document failure modes before anyone else finds them.