A Bit of Freedom Goes a Long Way - Quantum and Classical Algorithms for Online Learning of MDPs under a Generative Model

Explore a conference talk presenting novel classical and quantum online algorithms for learning Markov Decision Processes (MDPs) in both the finite-horizon and infinite-horizon average-reward settings. Discover how researchers Andris Ambainis, Joao F. Doriguello, and Debbie Huey Chih Lim developed algorithms based on a hybrid exploration-generative reinforcement learning model, in which agents can interact with the environment through generative sampling, that is, through access to a "simulator". Learn how these algorithms avoid standard reinforcement learning paradigms such as "optimism in the face of uncertainty" and "posterior sampling" by computing and using optimal policies directly, which yields better regret bounds than previous work. Understand how the quantum algorithm for finite-horizon MDPs achieves regret that depends only logarithmically on the number of time steps T, breaking the classical O(√T) barrier, while also improving the dependence on the state space size S and the action space size A. Examine the infinite-horizon results, where both the classical and quantum bounds keep the Õ(√T) dependence but with improved S and A factors, and discover a novel regret measure for infinite-horizon MDPs under which the quantum algorithm achieves regret poly-logarithmic in T, exponentially better than classical algorithms. Gain insights into how these results extend to compact continuous state spaces. The talk was presented at the Quantum Techniques in Machine Learning (QTML) 2025 conference in Singapore.
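
To give a feel for the gap between a √T regret bound and a poly-logarithmic one, here is a minimal Python sketch that compares only the growth shapes in T. It is not taken from the talk: the function names are hypothetical, all constants and the S and A factors are dropped, and the exponent 2 on the logarithm is an illustrative assumption rather than the bound the authors prove.

```python
import math


def sqrt_regret(T: int) -> float:
    """Growth shape of a sqrt(T)-type bound (constants and S, A factors dropped)."""
    return math.sqrt(T)


def polylog_regret(T: int, power: int = 2) -> float:
    """Growth shape of a poly-logarithmic bound.

    The exponent `power` is an illustrative assumption, not a value from the talk.
    """
    return math.log(T) ** power


if __name__ == "__main__":
    # Print both shapes side by side for increasing numbers of time steps T.
    for T in (10**2, 10**4, 10**6, 10**8):
        print(
            f"T={T:>10}  sqrt(T)={sqrt_regret(T):>10.1f}  "
            f"log^{2}(T)={polylog_regret(T):>8.1f}"
        )
```

With these assumed shapes, the poly-logarithmic curve can be the larger of the two for small T, and the gap in its favor only opens up as T grows. A real comparison would also depend on the constants and on the S and A factors omitted here.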