Learn how computers process and understand image data, then harness the latest generative AI models to create new images.
Syllabus
- Introduction
- Multimodal AI Fundamentals
- Discover multimodal AI fundamentals and technologies, including models and use cases that process and generate text, images, audio, and video for richer, real-world applications.
- Using Multimodal AI Technologies
- Explore practical applications of multimodal AI by using APIs and open-source models for image captioning and audio transcription, with hands-on exercises and secure credential handling.
- Transformers & Multimodal Processing
- Explore how transformers unify text, images, audio, and video through attention, embeddings, and fusion strategies, powering state-of-the-art multimodal understanding and generation.
- Multimodal AI Tooling
- Explore practical tools for building multimodal AI apps, compare commercial and open-source options, and use Pydantic AI to create reliable, structured, vendor-agnostic workflows.
- Introduction to Enterprise Visual Content Processing
- Explore enterprise visual content processing: core computer vision tasks, digital image representation, and real-world applications for efficiency, safety, and automation.
- Vision Pre-processing Pipelines with HuggingFace
- Explore vision data pipelines using HuggingFace, from dataset loading to resizing and normalization, with demos and hands-on exercises for effective image pre-processing.
- Understanding Embeddings in Computer Vision
- Learn how embeddings convert images into compact vectors for efficient search, enable cross-modal tasks with models like CLIP, and power large-scale, robust computer vision systems.
- Image Search Using CLIP Embeddings
- Explore how to build text-to-image and image-to-image search using CLIP embeddings, combining theory, real-world demos, hands-on practice, and solution walkthroughs.
- Using Multimodal Model APIs for Vision
- Explore multimodal vision APIs: prompt design, parameter tuning, structured outputs, cost control, integration, and best practices for robust, efficient image analysis.
- Gemini Vision API Basics
- Explore Gemini Vision API basics by practicing image moderation, learning to analyze images and implement moderation workflows using real-world examples and guided hands-on exercises.
- Vision Transformer Models & Architectures
- Explore Vision Transformer models: core architecture, image tokenization, self- and cross-attention, and top models (SAM, RT-DETR, DINOv2) for segmentation, detection, and enterprise use.
- Using Vision Transformers
- Explore vision transformers with hands-on demos: extract image embeddings using DINOv2 and perform object detection and segmentation using RT-DETR and SAM2.1 models.
- Vision-Language Models
- Learn how vision-language models align images and text for tasks like search, captioning, and VQA, with focus on architectures, applications, data needs, and deploying for enterprise use.
- Multimodal Vision Applications with CLIP
- Explore zero-shot image classification and auto-labeling for driving scenes using CLIP, enabling efficient, scalable multimodal vision applications.
- Diffusion Models & Image Generation
- Explore how diffusion models generate images by reversing noise through iterative denoising, inspired by physical diffusion processes and key to modern generative AI developments.
- Introduction to Enterprise Audio Processing
- Discover enterprise audio processing, core speech tasks (transcription, diarization, sentiment, TTS), key use cases, and strategies for value and integration in modern businesses.
- Audio Data Representation
- Explore how audio is digitized for AI: sample rate, bit depth, channels, formats, and mel spectrograms for speech, plus challenges and best practices in audio preprocessing and analysis.
- Audio Processing with librosa
- Explore audio processing with librosa: load, resample, convert, and analyze audio files; visualize with mel spectrograms and apply techniques through hands-on exercises.
- Sound Retrieval and Classification
- Explore audio embeddings for efficient sound classification and retrieval, using models like CLAP to enable semantic search and robust text-based audio analysis at scale.
- Sound Retrieval and Classification with CLAP
- Explore using CLAP for sound retrieval, similarity, and zero-shot classification, then apply these skills to detect fan on/off states in real audio data.
- Speech Processing
- Discover automatic speech recognition with Whisper: a robust, multilingual, open-source model for accurate transcription, translation, and speech processing in real-world audio.
- Implementing Speech Processing with Whisper & Gemini
- Explore real-world speech transcription and translation with Whisper and Gemini, using Python to process, segment, and align audio with text, including multilingual support.
- Audio Intelligence
- Explore advances in Audio Intelligence: multimodal systems, speech recognition, TTS, enterprise controls, creative workflows, and ethics for robust, secure, and accessible audio solutions.
- Audio Sentiment Analysis with Gemini
- Explore audio sentiment and command analysis using Pydantic AI and Gemini; learn to extract emotions and recognize spoken commands from audio with real-world datasets and hands-on exercises.
- Audio Classification and Moderation
- Explore voice content moderation: real-time and batch pipelines, compliance, privacy, layered detection, and operational excellence for secure and fair audio classification.
- Building a Basic Voice Moderation System with Gemini
- Learn to build a voice moderation system using Gemini to transcribe audio, detect personal data disclosures, and flag policy violations in customer service recordings.
- Introduction to Enterprise Video Processing
- Discover how enterprise video AI overcomes temporal complexity using smart frame selection for efficient understanding, search, classification, moderation, and generation at scale.
- AI Models for Video Understanding
- Explore key AI models for video: YOLO for real-time detection, and CoTracker and TimeSformer for motion and temporal understanding, enabling advanced, scalable enterprise video analytics.
- Implementing Object Recognition & Tracking
- Learn how to detect and track objects in videos using YOLOv9, apply multi-object tracking, handle small objects, and count items crossing boundaries in practical scenarios.
- Video Understanding & Search
- Explore methods for video analysis and search using foundation models and CLIP4Clip, balancing temporal understanding, cost, and retrieval accuracy for enterprise applications.
- Video Understanding & Search with Gemini & CLIP4Clip
- Explore video understanding with Gemini and CLIP4Clip: learn automated video description, key moment detection, and natural language video search using AI models and structured outputs.
- Video Classification & Moderation
- Learn to classify and moderate video by modeling temporal patterns, handling real-world challenges, and combining automation with human oversight for scale, accuracy, and compliance.
- Video Classification & Moderation with Gemini
- Learn to build automated systems for video classification and moderation with Gemini and Pydantic AI, including action recognition and safety compliance in real-world scenarios.
- Video Generation
- Explore generative video AI tools and workflows that turn text, images, or footage into dynamic content for marketing, training, and creative use while ensuring quality and compliance.
- Video Generation with Veo 3
- Learn to generate marketing videos with Veo 3 using both text-to-video and image-to-video workflows, and understand their strengths, limitations, and real-world applications.
- Multimodal AI Deployment
- Explore deployment of multimodal AI systems for text, images, audio, video via unified APIs, multi-API orchestration, and custom solutions, balancing speed, cost, and control.
- Implementation Tools and Serving Strategies
- Explore tools and strategies for implementing, serving, and monitoring AI solutions, from rapid prototyping to production, including unified APIs, orchestration, and managed platforms.
- Using Gradio and Pydantic AI
- Learn to build multimodal chatbots and analysis apps using Gradio and Pydantic AI, covering async programming, media inputs, rate limiting, and interface customization.
- Multimodal AI Performance Monitoring and Logging
- Learn to monitor and log multimodal AI systems, tracking performance, costs, and failures across modalities for optimized, reliable, and coherent production deployments.
- Logging and Performance Monitoring with Gradio and Arize Phoenix
- Learn to implement logging and performance monitoring for multimodal AI chatbots using Gradio and Arize Phoenix, enabling robust analytics, debugging, and cost tracking.
- Evaluating Multimodal Applications
- Learn how to evaluate multimodal AI apps using user feedback systems and testing methods, blending human review, automated metrics, and continuous monitoring for quality improvement.
- Testing Multimodal Apps with Pydantic AI Evals
- Learn to build robust testing frameworks for multimodal AI apps using Pydantic Evals, covering structured outputs, semantic evaluation, custom evaluators, and hands-on exercises.
- Scaling Multimodal AI Architecture
- Learn strategies to scale multimodal AI: unified APIs, multi-API pipelines, and custom deployments, focusing on performance, cost, reliability, and architectural trade-offs.
- OmniTrainer: Multimodal Customer Service Trainer
- In this project, build an AI agent that simulates customer service scenarios, along with specialized monitoring agents that analyze communications across text, images, video, and audio.
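Several of the modules above (image embeddings, CLIP-based search, zero-shot classification) rest on one core operation: comparing embedding vectors by cosine similarity. A minimal sketch with hand-made toy vectors standing in for real model outputs (the vector values and filenames here are illustrative, not produced by CLIP):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def rank_images(text_embedding, image_embeddings):
    """Return image ids sorted most-similar-first to the text query."""
    scored = [(cosine_similarity(text_embedding, emb), img_id)
              for img_id, emb in image_embeddings.items()]
    return [img_id for _, img_id in sorted(scored, reverse=True)]

# Toy 3-d embeddings; a real system would use e.g. 512-d CLIP vectors
# from the same shared text/image embedding space.
text_query = [0.9, 0.1, 0.0]  # stands in for an embedded caption
images = {
    "dog.jpg": [0.8, 0.2, 0.1],
    "car.jpg": [0.1, 0.9, 0.3],
    "cat.jpg": [0.5, 0.4, 0.2],
}
print(rank_images(text_query, images))  # ['dog.jpg', 'cat.jpg', 'car.jpg']
```

Text-to-image search, image-to-image search, and zero-shot classification all reduce to this ranking step once everything lives in one embedding space.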
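The audio representation module mentions mel spectrograms; the mel scale they are built on is a simple frequency warp that compresses high frequencies to roughly match human pitch perception. A sketch using the standard HTK conversion formula (mel = 2595 · log10(1 + f/700)):

```python
import math

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to mels (HTK formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse conversion: mels back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Equal mel steps span ever-wider Hz ranges as frequency grows,
# which is why mel spectrogram bins are dense at low frequencies.
for f_hz in (100, 1000, 8000):
    print(f_hz, "Hz ->", round(hz_to_mel(f_hz), 1), "mel")
```

Libraries such as librosa apply this warp when building mel filter banks, so a mel spectrogram is just a short-time spectrum re-binned onto this perceptual axis.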
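The enterprise video modules lean on frame selection to tame temporal complexity. The simplest baseline is uniform sampling; a hypothetical helper sketching the idea (real pipelines may instead weight frames by scene changes or motion):

```python
def sample_frame_indices(total_frames, num_samples):
    """Pick num_samples frame indices spread evenly across a video.

    Uniform-sampling baseline: take the midpoint of each of
    num_samples equal segments, so coverage is even end to end.
    """
    if num_samples >= total_frames:
        return list(range(total_frames))
    step = total_frames / num_samples
    return [int(step * (i + 0.5)) for i in range(num_samples)]

# A 10-second clip at 30 fps, budgeted to 6 frames for the model.
print(sample_frame_indices(300, 6))  # [25, 75, 125, 175, 225, 275]
```

Capping the frame budget this way is what keeps per-video inference cost flat regardless of clip length, at the price of possibly missing brief events between samples.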
Taught by
Giacomo Vianello