Multimodal AI Applications

via Udacity

Overview

Learn how computers process and understand image data, then harness the power of the latest Generative AI models to create new images.

Syllabus

  • Introduction
  • Multimodal AI Fundamentals
    • Discover multimodal AI fundamentals and technologies, including models and use cases that process and generate text, images, audio, and video for richer, real-world applications.
  • Using Multimodal AI Technologies
    • Explore practical applications of multimodal AI by using APIs and open-source models for image captioning and audio transcription, with hands-on exercises and secure credential handling.
  • Transformers & Multimodal Processing
    • Explore how transformers unify text, images, audio, and video through attention, embeddings, and fusion strategies, powering state-of-the-art multimodal understanding and generation.
  • Multimodal AI Tooling
    • Explore practical tools for building multimodal AI apps, compare commercial and open-source options, and use Pydantic AI to create reliable, structured, vendor-agnostic workflows.
  • Introduction to Enterprise Visual Content Processing
    • Explore enterprise visual content processing: core computer vision tasks, digital image representation, and real-world applications for efficiency, safety, and automation.
  • Vision Pre-processing Pipelines with HuggingFace
    • Explore vision data pipelines using HuggingFace, from dataset loading to resizing and normalization, with demos and hands-on exercises for effective image pre-processing.
  • Understanding Embeddings in Computer Vision
    • Learn how embeddings convert images into compact vectors for efficient search, enable cross-modal tasks with models like CLIP, and power large-scale, robust computer vision systems.
  • Image Search Using CLIP Embeddings
    • Explore how to build text-to-image and image-to-image search using CLIP embeddings, combining theory, real-world demos, hands-on practice, and solution walkthroughs.
  • Using Multimodal Model APIs for Vision
    • Explore multimodal vision APIs: prompt design, parameter tuning, structured outputs, cost control, integration, and best practices for robust, efficient image analysis.
  • Gemini Vision API Basics
    • Explore Gemini Vision API basics by practicing image moderation, learning to analyze images and implement moderation workflows using real-world examples and guided hands-on exercises.
  • Vision Transformer Models & Architectures
    • Explore Vision Transformer models: core architecture, image tokenization, self- and cross-attention, and top models (SAM, RT-DETR, DINOv2) for segmentation, detection, and enterprise use.
  • Using Vision Transformers
    • Explore vision transformers with hands-on demos: extract image embeddings using DINOv2 and perform object detection and segmentation using RT-DETR and SAM2.1 models.
  • Vision-Language Models
    • Learn how vision-language models align images and text for tasks like search, captioning, and VQA, with focus on architectures, applications, data needs, and deploying for enterprise use.
  • Multimodal Vision Applications with CLIP
    • Explore zero-shot image classification and auto-labeling for driving scenes using CLIP, enabling efficient, scalable multimodal vision applications.
  • Diffusion Models & Image Generation
    • Explore how diffusion models generate images by reversing noise through iterative denoising, inspired by physical diffusion processes and key to modern generative AI developments.
  • Introduction to Enterprise Audio Processing
    • Discover enterprise audio processing, core speech tasks (transcription, diarization, sentiment, TTS), key use cases, and strategies for value and integration in modern businesses.
  • Audio Data Representation
    • Explore how audio is digitized for AI: sample rate, bit depth, channels, formats, and mel spectrograms for speech, plus challenges and best practices in audio preprocessing and analysis.
  • Audio Processing with librosa
    • Explore audio processing with librosa: load, resample, convert, and analyze audio files; visualize with mel spectrograms and apply techniques through hands-on exercises.
  • Sound Retrieval and Classification
    • Explore audio embeddings for efficient sound classification and retrieval, using models like CLAP to enable semantic search and robust text-based audio analysis at scale.
  • Sound Retrieval and Classification with CLAP
    • Explore using CLAP for sound retrieval, similarity, and zero-shot classification, then apply these skills to detect fan on/off states in real audio data.
  • Speech Processing
    • Discover automatic speech recognition with Whisper: a robust, multilingual, open-source model for accurate transcription, translation, and speech processing in real-world audio.
  • Implementing Speech Processing with Whisper & Gemini
    • Explore real-world speech transcription and translation with Whisper and Gemini, using Python to process, segment, and align audio with text, including multilingual support.
  • Audio Intelligence
    • Explore advances in Audio Intelligence: multimodal systems, speech recognition, TTS, enterprise controls, creative workflows, and ethics for robust, secure, and accessible audio solutions.
  • Audio Sentiment Analysis with Gemini
    • Explore audio sentiment and command analysis using Pydantic AI and Gemini; learn to extract emotions and recognize spoken commands from audio with real-world datasets and hands-on exercises.
  • Audio Classification and Moderation
    • Explore voice content moderation: real-time and batch pipelines, compliance, privacy, layered detection, and operational excellence for secure and fair audio classification.
  • Building a Basic Voice Moderation System with Gemini
    • Learn to build a voice moderation system using Gemini to transcribe audio, detect personal data disclosures, and flag policy violations in customer service recordings.
  • Introduction to Enterprise Video Processing
    • Discover how enterprise video AI overcomes temporal complexity using smart frame selection for efficient understanding, search, classification, moderation, and generation at scale.
  • AI Models for Video Understanding
    • Explore key AI models for video, including YOLO for real-time detection and CoTracker and TimeSformer for motion and temporal understanding, enabling advanced, scalable enterprise video analytics.
  • Implementing Object Recognition & Tracking
    • Learn how to detect and track objects in videos using YOLOv9, apply multi-object tracking, handle small objects, and count items crossing boundaries in practical scenarios.
  • Video Understanding & Search
    • Explore methods for video analysis and search using foundation models and CLIP4Clip, balancing temporal understanding, cost, and retrieval accuracy for enterprise applications.
  • Video Understanding & Search with Gemini & CLIP4Clip
    • Explore video understanding with Gemini and CLIP4Clip: learn automated video description, key moment detection, and natural language video search using AI models and structured outputs.
  • Video Classification & Moderation
    • Learn to classify and moderate video by modeling temporal patterns, handling real-world challenges, and combining automation with human oversight for scale, accuracy, and compliance.
  • Video Classification & Moderation with Gemini
    • Learn to build automated systems for video classification and moderation with Gemini and Pydantic AI, including action recognition and safety compliance in real-world scenarios.
  • Video Generation
    • Explore generative video AI tools and workflows that turn text, images, or footage into dynamic content for marketing, training, and creative use while ensuring quality and compliance.
  • Video Generation with Veo 3
    • Learn to generate marketing videos with Veo 3 using both text-to-video and image-to-video workflows, and understand their strengths, limitations, and real-world applications.
  • Multimodal AI Deployment
    • Explore deployment of multimodal AI systems for text, images, audio, video via unified APIs, multi-API orchestration, and custom solutions, balancing speed, cost, and control.
  • Implementation Tools and Serving Strategies
    • Explore tools and strategies for implementing, serving, and monitoring AI solutions, from rapid prototyping to production, including unified APIs, orchestration, and managed platforms.
  • Using Gradio and Pydantic AI
    • Learn to build multimodal chatbots and analysis apps using Gradio and Pydantic AI, covering async programming, media inputs, rate limiting, and interface customization.
  • Multimodal AI Performance Monitoring and Logging
    • Learn to monitor and log multimodal AI systems, tracking performance, costs, and failures across modalities for optimized, reliable, and coherent production deployments.
  • Logging and Performance Monitoring with Gradio and Arize Phoenix
    • Learn to implement logging and performance monitoring for multimodal AI chatbots using Gradio and Arize Phoenix, enabling robust analytics, debugging, and cost tracking.
  • Evaluating Multimodal Applications
    • Learn how to evaluate multimodal AI apps using user feedback systems and testing methods, blending human review, automated metrics, and continuous monitoring for quality improvement.
  • Testing Multimodal Apps with Pydantic AI Evals
    • Learn to build robust testing frameworks for multimodal AI apps using Pydantic Evals, covering structured outputs, semantic evaluation, custom evaluators, and hands-on exercises.
  • Scaling Multimodal AI Architecture
    • Learn strategies to scale multimodal AI: unified APIs, multi-API pipelines, and custom deployments, focusing on performance, cost, reliability, and architectural trade-offs.
  • OmniTrainer: Multimodal Customer Service Trainer
    • In this project, students will create an AI agent that simulates customer service scenarios and specialized monitoring agents that analyze communications across text, images, videos, and audio.
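
A theme that recurs throughout the syllabus — CLIP for images, CLAP for audio, CLIP4Clip for video — is embedding-based retrieval: encode everything as vectors, then rank by similarity. A minimal sketch in plain Python, with toy hand-made vectors standing in for real model embeddings (in practice these would come from an encoder such as CLIP):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def search(query_vec, index):
    # Rank indexed items by similarity to the query embedding.
    return sorted(index, key=lambda item: cosine_similarity(query_vec, item[1]),
                  reverse=True)

# Toy "embeddings": in a real system these come from a trained encoder.
index = [
    ("dog photo", [0.9, 0.1, 0.0]),
    ("cat photo", [0.8, 0.3, 0.1]),
    ("car photo", [0.0, 0.2, 0.9]),
]
query = [0.9, 0.1, 0.0]  # embedding of, say, the text "a dog"
print(search(query, index)[0][0])  # -> dog photo
```

Text-to-image and image-to-image search differ only in which encoder produces `query`; the ranking step is identical.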
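
The audio modules build on how sound is digitized: sample rate, bit depth, and channels. A stdlib-only sketch of those concepts, synthesizing a one-second 440 Hz test tone as 16-bit mono PCM and reading its parameters back (the file name and tone are arbitrary choices for illustration):

```python
import math
import struct
import wave

SAMPLE_RATE = 16_000   # samples per second (Hz); common for speech models
BIT_DEPTH = 16         # bits per sample -> amplitudes in [-32768, 32767]
DURATION_S = 1.0
FREQ_HZ = 440.0        # A4 test tone

# Synthesize one second of a sine wave as signed 16-bit PCM samples.
n_samples = int(SAMPLE_RATE * DURATION_S)
samples = [
    int(32767 * math.sin(2 * math.pi * FREQ_HZ * n / SAMPLE_RATE))
    for n in range(n_samples)
]

with wave.open("tone.wav", "wb") as wf:
    wf.setnchannels(1)               # mono
    wf.setsampwidth(BIT_DEPTH // 8)  # 2 bytes per sample
    wf.setframerate(SAMPLE_RATE)
    wf.writeframes(struct.pack(f"<{n_samples}h", *samples))

with wave.open("tone.wav", "rb") as wf:
    print(wf.getframerate(), wf.getsampwidth() * 8, wf.getnframes())
    # -> 16000 16 16000
```

A library such as librosa layers resampling, format conversion, and mel-spectrogram computation on top of exactly this raw representation.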
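
Several modules pair Gemini with Pydantic AI for "structured outputs": validating an LLM's free-form response against a typed schema. A stdlib-only sketch of the idea using a dataclass in place of a Pydantic model; the `ModerationResult` schema and field names here are hypothetical, chosen to echo the voice-moderation project:

```python
from dataclasses import dataclass

@dataclass
class ModerationResult:
    # Hypothetical schema for a moderation verdict; a real app would
    # define this as a Pydantic model and let the library validate it.
    flagged: bool
    reason: str
    confidence: float

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")

def parse_model_output(raw: dict) -> ModerationResult:
    # Coerce an untyped LLM response (e.g. parsed JSON) into the schema,
    # failing loudly on missing keys or out-of-range values.
    return ModerationResult(
        flagged=bool(raw["flagged"]),
        reason=str(raw["reason"]),
        confidence=float(raw["confidence"]),
    )

result = parse_model_output(
    {"flagged": True, "reason": "PII disclosed", "confidence": 0.93}
)
print(result.flagged, result.reason)  # -> True PII disclosed
```

The payoff is that downstream code consumes typed fields rather than re-parsing prose, and malformed model output fails at the boundary instead of deep in the pipeline.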
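
The video modules stress "smart frame selection" as the way to tame temporal complexity. The simplest baseline is uniform sampling: pick a handful of evenly spaced frames to represent the whole clip. A minimal sketch (the function name and centering heuristic are illustrative, not from the course):

```python
def sample_frame_indices(n_frames: int, k: int) -> list[int]:
    # Evenly spaced frame indices covering the whole clip, centered in
    # each segment; a cheap stand-in for smarter keyframe selection.
    if k >= n_frames:
        return list(range(n_frames))
    step = n_frames / k
    return [int(i * step + step / 2) for i in range(k)]

# A 10-second clip at 30 fps reduced to 5 representative frames:
print(sample_frame_indices(300, 5))  # -> [30, 90, 150, 210, 270]
```

Only the selected frames are then passed to per-frame models (detection, captioning, embedding), which is what makes video search and moderation tractable at enterprise scale.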

Taught by

Giacomo Vianello
