MoCha - Towards Movie-Grade Talking Character Synthesis

This 28-minute conference talk presents MoCha, a model that generates full-body talking character animations directly from speech and text. Where traditional talking-head generation is limited to the face, MoCha produces complete character portraits, which automated film and animation production needs for character-driven storytelling. The talk covers the speech-video window attention mechanism used to keep audio and visual elements precisely synchronized, and the joint training strategy that draws on both speech-labeled and text-labeled video datasets to improve generalization across diverse character actions. It also describes the structured prompt templates with character tags that enable multi-character conversations with turn-based dialogue, so that AI-generated characters can interact in context with cinematic coherence. The qualitative and quantitative evaluations presented, including human preference studies and benchmark comparisons, are reported to show that MoCha improves realism, expressiveness, controllability, and generalization in AI-generated cinematic storytelling.