Google

Gemini's Multimodal Capabilities - Deep Dive into Native Multimodality and AI Vision

Google via YouTube

Overview

In this 44-minute technical discussion, host Logan Kilpatrick talks with Ani Baddepudi, Gemini Model Behavior Product Lead, about Gemini's native multimodal capabilities. The conversation covers why Gemini was designed as a multimodal model from the start and the technology underlying multimodal models. It then turns to video understanding in Gemini 2.5, including how video processing differs from image processing, how video is represented as tokens, and sampling video at higher frames per second, along with usability improvements from variable FPS and frame tokenization. Further topics include how the team decides what to build next, the product experiences that multimodal AI makes possible, Google's vision for proactive assistants, and the idea of moving toward a world where "everything is vision." Later segments cover Gemini's document understanding capabilities, the teamwork behind Gemini's development, and what comes next for multimodal development and model behavior.

Syllabus

0:00 - Intro
1:12 - Why Gemini is natively multimodal
2:23 - The technology behind multimodal models
5:15 - Video understanding with Gemini 2.5
9:25 - Deciding what to build next
13:23 - Building new product experiences with multimodal AI
17:15 - The vision for proactive assistants
24:13 - Improving video usability with variable FPS and frame tokenization
27:35 - What’s next for Gemini’s multimodal development
31:47 - Deep dive on Gemini’s document understanding capabilities
37:56 - The teamwork and collaboration behind Gemini
40:56 - What’s next with model behavior

Taught by

Google Developers
