AppWorld: Reliable Evaluation of Interactive Agents in a World of Apps and People

Watch a research talk on AppWorld, a simulation environment for evaluating how well AI agents perform everyday digital tasks. See how the team built a high-fidelity simulated world featuring nine common applications like Amazon, Gmail, and Venmo, where AI assistants must handle complex scenarios, such as splitting bills with roommates, by writing code interactively and making API calls. Learn why reliable evaluation is hard when a complex task has multiple valid solution paths, and find out how current leading language models like GPT-4 perform on these realistic tasks. Explore future research directions for multimodal, collaborative, and socially intelligent AI agents that learn from environmental feedback and adapt to new situations. Presented by Harsh Trivedi, a PhD candidate at Stony Brook University, whose work on AppWorld earned a Best Resource Paper award at ACL'24.