The Marvelous Magic of Multimodal AI - Understanding Text, Images, Audio, and Video Generation

Explore a conference talk on multimodal AI, where models understand and generate text, images, audio, and video. Learn the fundamental differences between Large Language Models (LLMs) and Large Multimodal Models (LMMs), and see how models such as Molmo by Ai2 extend what AI systems can do. Understand the role of data representation, the challenges of natural language processing, and the importance of context in AI systems. Gain insights into the inner workings of multimodal AI, with a focus on text-to-image generation and future applications. Through practical examples and analysis from former INDYCAR engineer and data scientist Alex Castrounis, discover how multimodal AI is changing human-computer interaction, including the potential to autonomously create complete explainer videos with generated scripts, visuals, music, and animations.

Syllabus

Intro
LLM vs LMM
What is multimodal AI?
Molmo by Ai2
Data ≈ Representation
Natural language is hard
What about context?
The inner workings
Text-to-images
The AI of tomorrow
Outro