Orchestrating Real-Time Multimodal AI Agents with Rust

Explore how to build high-performance, real-time multimodal AI agent systems in this conference talk on server-side architecture in Rust. See how systems capable of natural, real-time conversation can be built on open-source AI models, through a detailed case study of a Rust-based server component that orchestrates communication between edge devices and AI service clusters. Learn about a modular approach that uses distinct, swappable services for Voice Activity Detection (VAD), Automatic Speech Recognition (ASR), Large Language Models (LLMs), and Text-to-Speech (TTS). Understand the core orchestration patterns for managing real-time audio streams and API calls to services such as Whisper and various open-source LLMs. Examine why Rust was selected for its safety and high-throughput performance, particularly when handling many concurrent WebSocket and HTTP/HTTPS connections. Investigate the architectural flexibility that allows locally hosted models for privacy (such as LlamaEdge) to be mixed with powerful cloud APIs (such as Google Gemini Live). Discover agentic extensibility through tool-call integration using the Model Context Protocol (MCP), which gives agents access to live internet search, online APIs, and other devices. Gain insights valuable for engineers and developers building practical AI applications that require real-time voice interaction, flexibility, modularity, custom tools, private knowledge, and agentic capabilities.
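
As a rough illustration of the modular, swappable-service idea described above, the following is a minimal Rust sketch, not code from the talk. Every name in it (the `Vad`, `Asr`, `Llm`, and `Tts` traits, the `Pipeline` struct, and the stub implementations) is hypothetical, and it assumes a synchronous, single-turn flow for brevity. A production server would more likely be asynchronous and would stream audio over WebSocket connections to remote services rather than call in-process stubs.

```rust
// Hypothetical interfaces: one trait per stage, so each service can be swapped
// (for example, a locally hosted model versus a cloud API) without touching the rest.
trait Vad { fn is_speech(&self, frame: &[i16]) -> bool; }
trait Asr { fn transcribe(&self, audio: &[i16]) -> String; }
trait Llm { fn reply(&self, prompt: &str) -> String; }
trait Tts { fn synthesize(&self, text: &str) -> Vec<i16>; }

/// Orchestrator that owns one implementation of each stage.
struct Pipeline {
    vad: Box<dyn Vad>,
    asr: Box<dyn Asr>,
    llm: Box<dyn Llm>,
    tts: Box<dyn Tts>,
}

impl Pipeline {
    /// Runs one conversational turn: VAD -> ASR -> LLM -> TTS.
    /// Returns `None` when the VAD stage detects no speech.
    fn handle_turn(&self, audio: &[i16]) -> Option<Vec<i16>> {
        if !self.vad.is_speech(audio) {
            return None;
        }
        let text = self.asr.transcribe(audio);
        let answer = self.llm.reply(&text);
        Some(self.tts.synthesize(&answer))
    }
}

// Stub stages standing in for real services (assumptions for illustration only).

/// Treats a frame as speech when its mean absolute amplitude exceeds a threshold.
struct EnergyVad { threshold: i64 }
impl Vad for EnergyVad {
    fn is_speech(&self, frame: &[i16]) -> bool {
        if frame.is_empty() {
            return false;
        }
        let sum: i64 = frame.iter().map(|s| (*s as i64).abs()).sum();
        sum / frame.len() as i64 > self.threshold
    }
}

/// Placeholder for a speech recognizer; it only reports the sample count.
struct StubAsr;
impl Asr for StubAsr {
    fn transcribe(&self, audio: &[i16]) -> String {
        format!("<{} samples>", audio.len())
    }
}

/// Placeholder for a language model; it echoes the prompt.
struct EchoLlm;
impl Llm for EchoLlm {
    fn reply(&self, prompt: &str) -> String {
        format!("You said: {}", prompt)
    }
}

/// Placeholder for a synthesizer; it maps each byte of text to one sample.
struct StubTts;
impl Tts for StubTts {
    fn synthesize(&self, text: &str) -> Vec<i16> {
        text.bytes().map(i16::from).collect()
    }
}

/// Wires the stub stages together.
fn demo_pipeline() -> Pipeline {
    Pipeline {
        vad: Box::new(EnergyVad { threshold: 100 }),
        asr: Box::new(StubAsr),
        llm: Box::new(EchoLlm),
        tts: Box::new(StubTts),
    }
}

fn main() {
    let pipeline = demo_pipeline();
    let silence = [0i16; 160];
    let loud = [1000i16; 160];
    println!("silence produced audio: {}", pipeline.handle_turn(&silence).is_some());
    println!("speech produced audio: {}", pipeline.handle_turn(&loud).is_some());
}
```

Because `Pipeline` depends only on the traits, replacing `EchoLlm` with a client for a local or cloud model would, under these assumptions, require changing only the line that constructs it.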