
IBM

Build Multimodal Generative AI Applications

IBM via Coursera

Overview

Ready to level up your GenAI skills? Step into the exciting world of multimodal AI, where language, images, and speech come together to build smarter, more interactive applications. In this hands-on course, you’ll learn how to build systems that work across multiple modalities, from creating AI-powered storytellers and meeting assistants to developing image captioning tools and video generation apps.

You’ll gain experience with real-world tools like IBM’s Granite, OpenAI’s Whisper, Sora, and DALL·E, Meta’s Llama, Mistral’s Mixtral, and Gradio. Plus, you'll explore multimodal search, question answering, and retrieval systems that combine text, speech, and visual data.

By the end of the course, you’ll be able to design and build full-stack multimodal AI solutions using Python and frameworks like Flask and Gradio. If you’re looking to gain in-demand skills for building the next generation of AI applications, enroll today and power up your AI career!

Syllabus

  • Foundations of Multimodal AI
    • This module provides an in-depth introduction to multimodal AI, focusing on how AI systems process and integrate multiple data types, including text, speech, and images. You will explore core concepts and some of the challenges you will face in multimodal AI, gaining foundational skills with text and speech processing techniques. Through hands-on labs, you will apply AI-powered storytelling, speech-to-text transcription, and text-to-speech synthesis to real-world applications, such as AI-generated audiobooks and automated meeting assistants. 
  • Integrating Visual and Video Modalities
    • This module explores how AI processes and generates visual data by integrating images and videos with text. You will examine text-to-image/image-to-text and text-to-video/video-to-text models, image captioning, and the fusion techniques necessary for effective multimodal AI systems. Through hands-on labs, you will apply state-of-the-art models like DALL·E and Sora to generate images and videos from text prompts. Additionally, you will implement an image captioning system using Meta’s Llama 4, gaining practical experience in combining vision and language models for real-world applications.
  • Advanced Multimodal Applications
    • The final module explores advanced multimodal AI applications, integrating image, text, and retrieval-based systems to build innovative solutions. You will dive into multimodal retrieval and search, multimodal Question Answering (QA), and chatbots, learning how cross-modal retrieval techniques enhance search engines and recommendation systems. Additionally, you will learn how integrating visual and textual data improves chatbot interactions. Through hands-on labs, you will build fully functional web applications with multimodal capabilities using Flask, applying state-of-the-art models and frameworks. 

Taught by

Hailey Quach and Ricky Shi

Reviews

4.8 rating at Coursera based on 49 ratings

