Zoom: https://oregonstate.zoom.us/j/91611213801?pwd=Wm9JSkN1eW84RUpiS2JEd0E5T…
‘RL as Inference’ is a popular but flawed perspective. In this talk, we give it a principled Bayesian treatment that yields efficient exploration. We first clarify how control and statistical inference, the two facets of RL, can be distilled into a single quantity: PΓ*, the posterior probability of each state-action pair being visited by the optimal policy. Previous approaches approximate PΓ* in ways that can be arbitrarily poor and do not perform well on challenging problems. We prove that PΓ*, although intractable to compute, can be used to generate a policy that explores efficiently, as measured by regret. We therefore derive a new variational Bayesian approximation that yields a tractable convex optimization problem, and we establish that the resulting policy also explores efficiently. We call our approach VAPOR and show that it has strong connections to Thompson sampling, K-learning, and maximum-entropy exploration. We conclude with experiments demonstrating the performance advantage of a deep RL version of VAPOR.
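To make the central quantity concrete, here is a minimal illustrative sketch, not the VAPOR algorithm from the talk: in the multi-armed bandit special case, the posterior probability of a state-action being visited by the optimal policy reduces to the posterior probability that an arm has the highest mean reward. The sketch below estimates that probability by Monte Carlo over posterior samples, assuming Gaussian rewards with known noise and a Gaussian prior; the function name, prior, and toy data are all assumptions for illustration.

# Illustrative bandit analogue of PΓ* (not the talk's VAPOR method):
# estimate P(arm a is optimal | data) by sampling from the posterior,
# assuming Gaussian rewards with known noise variance and a Gaussian prior.
import numpy as np

rng = np.random.default_rng(0)

def posterior_prob_optimal(counts, reward_sums, prior_var=1.0, noise_var=1.0,
                           n_samples=10_000):
    """Monte Carlo estimate of P(arm is optimal | observed data), per arm."""
    counts = np.asarray(counts, dtype=float)
    reward_sums = np.asarray(reward_sums, dtype=float)
    # Conjugate Gaussian posterior over each arm's mean (zero-mean prior).
    post_var = 1.0 / (1.0 / prior_var + counts / noise_var)
    post_mean = post_var * reward_sums / noise_var
    # Sample all arm means jointly and count how often each arm is best.
    samples = rng.normal(post_mean, np.sqrt(post_var),
                         size=(n_samples, len(counts)))
    best = np.argmax(samples, axis=1)
    return np.bincount(best, minlength=len(counts)) / n_samples

# Probability matching: act according to the posterior probability of
# optimality (averaged over many posterior samples, unlike a single
# Thompson-sampling draw).
counts = [10, 10, 2]           # times each arm was pulled (toy data)
reward_sums = [6.0, 4.5, 1.8]  # summed observed rewards per arm
p_opt = posterior_prob_optimal(counts, reward_sums)
action = rng.choice(len(p_opt), p=p_opt)
print("P(arm optimal):", np.round(p_opt, 3), "-> chosen arm:", action)

In the full MDP setting the talk targets, this quantity is intractable, which is what motivates the variational, convex-optimization approximation described in the abstract.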
Jean Tarbouriech is a Research Scientist at Google DeepMind, London. His main research interest is reinforcement learning, with a focus on efficient exploration to improve RL agents and large language models. He obtained his PhD, carried out jointly at Inria Lille and Meta AI Paris.