AI Seminar: Hybrid RL: Using Offline Data for Efficient Online Reinforcement Learning

Event Speaker
Wen Sun
Event Speaker Description
Assistant Professor
Computer Science Department
Cornell University
Event Type
Artificial Intelligence
Date
Event Location
KEC 1001 and Zoom
Event Description

Zoom: https://oregonstate.zoom.us/j/96491555190?pwd=azJHSXZ0TFQwTFFJdkZCWFhnT…

Reinforcement Learning (RL) is a framework for learning to make sequential decisions from feedback. While many of today's most powerful AI systems (e.g., AlphaGo and ChatGPT) are trained via RL, RL is still considered sample-inefficient, i.e., it learns very slowly. In this talk, we consider hybrid RL, a new framework that integrates offline data into online RL with the goal of vastly improving RL's learning efficiency.

In the first part of the talk, we will present a simple approach to integrating offline data into a Q-learning-style framework. This simple approach achieves strong theoretical guarantees showing that, with offline data, RL can be as easy as supervised learning. Empirically, when equipped with average-quality offline data, this approach can learn 10x faster than pure online RL baselines on a notoriously hard video game that requires deep strategic exploration.

In the second part of the talk, we show another way of integrating offline data into online RL via resets: the ability of the learning system to reset to any state it has visited before, which holds in important RL applications such as training language models. We propose an algorithmic framework called dataset reset policy optimization (DrPO), which simply resets the RL agent to states from the offline data when collecting online data. We show that when DrPO is integrated into a Reinforcement Learning from Human Feedback (RLHF) framework, where the reward model is learned from offline human preference data, it is guaranteed to discover the best policy covered by that preference data. Empirically, when fine-tuning a 3-billion-parameter language model on a standard RLHF benchmark, the Reddit TL;DR summarization task, DrPO learns to generate summaries that, as judged by GPT-4, are better than human-written summaries and better than those produced by a pure online RL baseline, Proximal Policy Optimization (PPO), and a pure offline RL baseline, Direct Preference Optimization (DPO).
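To make the two ideas more concrete, below is a minimal, illustrative Python sketch; it is not the speaker's implementation, and all names (HybridReplayBuffer, mix_ratio, env.reset_to, collect_rollout, the environment interface) are hypothetical. The first snippet mixes a fixed offline dataset with freshly collected online transitions in every Q-learning-style update; the second sketches the dataset-reset idea, where online rollouts sometimes start from states drawn from the offline data rather than the environment's initial state.

import random
from collections import deque

# Sketch 1: hybrid replay. Each update samples a minibatch that mixes
# transitions from a fixed offline dataset with online transitions.
class HybridReplayBuffer:
    def __init__(self, offline_transitions, capacity=100_000, mix_ratio=0.5):
        self.offline = list(offline_transitions)  # fixed offline data
        self.online = deque(maxlen=capacity)      # data collected online
        self.mix_ratio = mix_ratio                # fraction of each batch drawn from offline data

    def add(self, transition):
        self.online.append(transition)

    def sample(self, batch_size):
        n_off = int(batch_size * self.mix_ratio)
        batch = random.sample(self.offline, min(n_off, len(self.offline)))
        if self.online:
            batch += random.choices(self.online, k=batch_size - len(batch))
        return batch

# Sketch 2: dataset resets. With some probability, start an online rollout
# from a state drawn from the offline data instead of the initial state
# (assumes the environment exposes a reset_to(state) method).
def collect_rollout(env, policy, offline_states, reset_prob=0.5, horizon=128):
    if offline_states and random.random() < reset_prob:
        state = env.reset_to(random.choice(offline_states))
    else:
        state = env.reset()
    trajectory = []
    for _ in range(horizon):
        action = policy(state)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward, next_state))
        if done:
            break
        state = next_state
    return trajectory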

Speaker Biography

Wen Sun is an Assistant Professor in the Computer Science Department at Cornell University. Before that, he was a postdoctoral researcher at Microsoft Research NYC, and he completed his Ph.D. in 2019 at the Robotics Institute at Carnegie Mellon University. He is broadly interested in machine learning, especially Reinforcement Learning. Much of his current research focuses on designing algorithms for efficient sequential decision-making and on leveraging human interaction for better and more efficient Reinforcement Learning.