Corruption-Robust Exploration in Episodic Reinforcement Learning

Abstract: We initiate the study of multi-stage episodic reinforcement learning under adversarial manipulations in both the rewards and the transition probabilities of the underlying system. Existing efficient algorithms heavily rely on the "optimism under uncertainty" principle which dictates their behavior and does not allow flexibility to perform corruption-robust exploration. We address this by (i) departing from the optimistic behavior, and (ii) creating a general framework that incorporates the principle of action-elimination. (This principle has been essential for corruption-robust exploration in multi-armed bandits, a degenerate special case of episodic reinforcement learning.) Despite constructing a lower bound for a straightforward implementation of action-elimination, we provide a clean and modular way to transfer it to episodic reinforcement learning. Our algorithm enjoys near-optimal guarantees in the absence of adversarial manipulations, has performance that degrades gracefully as the amount of corruption increases, and does not need to know this amount. Our results shed new light on the broader question of robust exploration, and suggest a way to address a rather daunting mismatch between optimistic algorithms and algorithms with higher flexibility. To demonstrate the applicability of our framework, we provide a second instantiation thereof, showing how it can provide efficient guarantees for the stochastic setting, despite doing almost uniform exploration across plausibly optimal actions.

Joint work with Max Simchowitz, Aleksandrs Slivkins, and Wen Sun.

Bio: Thodoris Lykouris is a postdoctoral researcher in the machine learning group of Microsoft Research NYC. His research focus is on online decision-making spanning across the disciplines of machine learning, theoretical computer science, operations research, and economics. He completed his Ph.D. in 2019 from Cornell University where he was privileged to be advised by Éva Tardos. During his Ph.D. years, his research has been generously supported by a Google Ph.D. Fellowship and a Cornell University Fellowship. He was also a finalist in the INFORMS Nicholson and Applied Probability Society best student paper competitions.