CISE Seminar: Sayak Ray Chowdhury, BU SE Postdoctoral Associate

3:00PM-4:00PM

 

Online Reinforcement Learning in Large and Structured Environments

In a reinforcement learning (RL) problem, a learner takes actions to control the state of
an initially unknown environment to maximize the sum of the rewards he obtains. This has several applications in many practical sequential decision-making systems like recommendation systems, sequential investment and portfolio allocation, dynamic resource allocation in communication systems, targeted advertising, and pricing, to name a few. In these online learning settings, there is no separate time available to purely exploring the unknown environment; rather, one must carefully trade-off exploration and exploitation. The learner seeks to minimize the regret or shortfall in total reward earned from all its decisions, compared to an optimal decision. In this talk, we will discuss online RL algorithms for large and structured environments, where generalization across unseen states and actions is crucial to achieving significantly low regret.

In the first part of this talk, we will discuss non-parametric approaches to model reward structure in the multi-armed bandit problem — a stateless i.i.d. non-Markovian environment — with a very large (potentially infinite) number of arms. This problem can be formulated as learning to optimize a fixed but unknown function over a continuous domain, with the function evaluations being noisy and expensive. This is a generic problem appearing in several applications like hyper-parameter tuning in machine learning models, sensor selection, experimental design, to name a few. We model the function’s uncertainty using the reproducing kernel Hilbert space (RKHS) compatible with a kernel on the function’s domain. Adapting techniques from Gaussian process regression, we develop upper confidence bound and Thompson sampling-based algorithms for continuous bandit optimization and derive corresponding regret guarantees assuming the reward distribution to be sub-Gaussian.

In the second part of this talk, we will discuss regret minimization strategies for episodic, average-reward Markov decision processes (MDPs) with infinite states and actions. We consider both parametric and non-parametric MDPs. In the non-parametric setting, taking inspiration from kernelized bandits, we model uncertainty in mean reward and transitions using RKHSs compatible with kernels defined jointly on state-action spaces. In the parametric setting, we introduce an MDP transition model defined by bilinear exponential families with an unknown finite-dimensional parameter. For both settings, we design variants of upper confidence RL and posterior sampling RL strategies. We prove finite-time regret guarantees for the proposed algorithms that depend on the structures of the underlying models.

Sayak Ray Chowdhury is a Postdoctoral Associate in the Division of Systems Engineering at Boston University. Previously, he was a Google PhD fellow at Indian Institute of Science, Bangalore. His research interest lies in sequential decision making, specifically in multi-armed bandits and reinforcement learning. He is also interested multi-agent learning problems and learning under privacy and fairness constraints.

Faculty Host: Venkatesh Saligrama
Student Host: Salomón Wollenstein Betech