Developing Multimodal Generative AI Applications

Overview

Unlock the power of multimodal AI and learn how modern systems combine text, images, speech, and video to create intelligent applications. This course teaches the foundational concepts behind multimodal GenAI applications, the challenges of integrating diverse data types, and the techniques used to build advanced, interactive systems. You’ll develop core skills in transcription, text-to-speech, image generation, video synthesis, and multimodal reasoning.

Through hands-on labs, you’ll work with generative AI models such as IBM Granite, OpenAI Whisper, DALL·E, Sora, Meta’s Llama, and Mixtral, along with vision-language architectures, to apply multimodal AI in practical scenarios. You’ll build tools such as captioning systems, text-to-video generators, and AI-powered assistants that can process and respond across multiple data streams.

The course includes full-stack projects using Python, Flask, and Gradio, where you’ll design and deploy complete multimodal AI applications. By the end, you’ll have the technical skills needed to create next-generation AI systems used in search engines, chatbots, creative tools, and enterprise applications.

Syllabus

Gain the job-ready skills you need to build multimodal generative AI applications in just a few hours
Understand the fundamental concepts and challenges in multimodal AI, including the integration of text, speech, images, and video
Build multimodal AI applications using state-of-the-art models and frameworks such as IBM Granite, Meta’s Llama, OpenAI Whisper, DALL·E, and Sora
Develop multimodal AI solutions, including chatbots and image/video generation models, using IBM watsonx.ai, Hugging Face, Flask, and Gradio
Apply multimodal search, retrieval, and question answering techniques to solve practical problems
Design and deploy full-stack multimodal systems that combine audio, vision, and language models