Offline Policy Evaluation for Reinforcement Learning under Unmeasured Confounding (via Zoom)
Abstract: In the context of reinforcement learning (RL), offline policy evaluation (OPE) is the problem of evaluating the value of a candidate policy using data that was previously collected from some existing logging policy. This is of crucial importance in many application areas such as medicine, healthcare, or robotics, where the cost of actually executing a potentially bad policy could be catastrophic. Unfortunately, in many of the applications that inspire OPE, we may reasonably expect the available logged data to be affected by unmeasured confounding, in which case standard OPE methods may be arbitrarily biased.
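To make the setup concrete, below is a minimal sketch of the standard (unconfounded) importance-sampling OPE estimator, which reweights logged returns by the ratio of evaluation-policy to logging-policy action probabilities; the function name, trajectory format, and policy interfaces are illustrative assumptions, not code from the talk. When the logged actions also depended on unmeasured variables, the recorded propensities are the wrong denominators and this estimator can be arbitrarily biased.

```python
import numpy as np

def importance_sampling_ope(trajectories, pi_e, pi_b, gamma=0.99):
    """Per-trajectory importance-sampling estimate of the value of pi_e.

    trajectories: iterable of lists of (state, action, reward) tuples
                  logged under the behavior (logging) policy pi_b.
    pi_e, pi_b:   callables returning the probability of taking `action`
                  in `state` under the evaluation / logging policy.
    Assumes every confounder is captured in `state`; under unmeasured
    confounding the weights below are misspecified.
    """
    estimates = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for t, (state, action, reward) in enumerate(traj):
            weight *= pi_e(action, state) / pi_b(action, state)  # reweight logged action
            ret += (gamma ** t) * reward                          # discounted return
        estimates.append(weight * ret)
    return float(np.mean(estimates))
```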
In this talk I will present some of my recent work on OPE under unmeasured confounding. First, I will discuss an infinite-horizon stationary setting, where the confounding occurs i.i.d. at each time step. In this setting, we may correct for the effects of confounding as long as we can infer an accurate latent variable model of the confounders. Then, I will discuss an episodic setting, where the confounding may be modeled using a Partially Observable Markov Decision Process (POMDP). Even in this more challenging setting, we may still account for confounding via a sequential reduction to contextual-bandit-style policy evaluation, using the recently proposed proximal causal inference framework. Finally, I will provide a high-level discussion of the open challenges surrounding RL with unmeasured confounders.
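As a rough illustration of the i.i.d. per-step confounding setting mentioned above (the notation is mine, not the speaker's): an unobserved variable influences both the logged action and the dynamics, while only the observed state is recorded.

```latex
% Illustrative notation, not the speaker's: i.i.d. per-step confounding.
% u_t is unobserved and affects both the logged action and the dynamics,
% so the recorded propensity \pi_b(a_t \mid s_t) is only a marginal quantity.
\[
  u_t \overset{\text{i.i.d.}}{\sim} p(u), \qquad
  a_t \sim \pi_b(\,\cdot \mid s_t, u_t), \qquad
  (r_t, s_{t+1}) \sim P(\,\cdot \mid s_t, a_t, u_t),
\]
\[
  \pi_b(a_t \mid s_t) \;=\; \mathbb{E}_{u \sim p(u)}\!\bigl[\pi_b(a_t \mid s_t, u)\bigr].
\]
% Because u_t also drives r_t and s_{t+1}, dividing by this marginal
% propensity does not remove the confounding bias.
```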
The talk is based on joint work with Nathan Kallus, Lihong Li, and Ali Mousavi.
Bio: Andrew is a fifth-year PhD student in the Computer Science department at Cornell University, supervised by Nathan Kallus. His current research focuses on the intersection of causal inference, machine learning, and econometrics, with particular interest in causal inference under unmeasured confounding, reinforcement learning, and efficiently solving high-dimensional conditional moment problems. Previously, during his Master's at the University of Melbourne, Andrew conducted research in Natural Language Processing and Computational Linguistics.