Lectures in Active Sequential Hypothesis Testing and Adaptive Exploration in Reinforcement Learning

In this lecture series we study the pure exploration problem. This problem is also more commonly known as “Sequential Design of Experiments” or “Active Sequential Hypothesis Testing”, and it is an active field of study in Reinforcement Learning problems.

Studying this problem allows us to understand the mechanism of information gathering: the learner’s sole objective is to reduce uncertainty to a decision, disentangling exploration from exploitation. For example, the decision might be about reaching a conclusion about a certain phenomenon. Studying this problem has several applications, ranging from medical diagnosis; faulty sensor detection; A/B testing and scientific discovery.

In this course we develop an information-theoretic view of pure exploration in Multi-Armed Bandit problems and tabular Reinforcement Learning. Students will learn to (i) formalize fixed-confidence Best Arm Identification (BAI) problems; (ii) derive instance-dependent lower bounds on the sample complexity via change of measure techniques; (iii) design provably correct algorithms that are instance-optimal.

Learning objectives

By the end, students will be able to:

Specify fixed-confidence BAI problem statements and correctness criteria.
Derive KL-based change-of-measure lower bounds and interpret instance complexity terms.
Construct sequential tests (stopping + recommendation) and prove correctness.
Analyze sample complexity (upper bounds) for canonical algorithms and compare to lower bounds.

Attendance to all lectures is not mandatory, but highly suggested. Lecture notes will be provided before each lecture.

Schedule

The course will run in person over the last weeks of November.

Alessio Russo

Prerequisites: knowledge of probability theory (e.g., having taken a class on multi-armed bandit problems).

About the Organizer: Alessio Russo is a Postdoc at Boston University, where he works with Prof. Aldo Pacchiano (plaia.ai) . He has a Ph.D. in Electrical Engineering from KTH Royal Insitute of Technology (Stockholm), where he was supervised by Prof. Proutiere Alexandre. In his research, he studies problems of adaptive exploration in RL/Deep-RL and Bandit problems, Statistical Learning in sequential decision making problems and Robust Control/RL. Website: alessi orusso.net

Lecture 1: Sample complexity lower bounds for iid models

Date: Fri., Nov. 14
Time and room: 4-6pm, CDS 164
Registration form: https://forms.gle/ohC8KJPPbBt6jrQH8

Abstract: We formalize the pure exploration problem for Multi-Armed bandit models. We formalize the fixed-confidence Best Arm Identification (BAI) problem and derive information-theoretic lower bounds on the sample complexity via change-of-measure arguments. You’ll see how KL divergences between candidate instances yield the characteristic time and the optimal allocation vector. The characteristic time characterizes the complexity of learning, and we will see how to compute it for simple models.
Lecture notes will be provided in advance.

Lecture 2: Stopping rules and design of optimal algorithms

Date: Mon., Nov. 17
Time and room: 4-6pm, PHO 117
Registration form: https://forms.gle/ohC8KJPPbBt6jrQH8

Abstract: This lecture focuses on the design of optimal algorithms for the BAI problem that attain the sample complexity lower bound. We will see how to derive anytime confidence intervals, and design algorithms that are asymptotically optimal in the confidence parameter. We will derive guarantees that hold in high probability, and in expectation.
Lecture notes will be provided in advance.

Lecture 3: Stopping rules and design of optimal algorithms

Date: Tues., Nov. 18
Time and room: 4-6pm, STH 319
Registration form: https://forms.gle/ohC8KJPPbBt6jrQH8

Lecture 4: Instance-dependent lower bounds for Markov Decision Processes (MDPs) and algorithm design

Date: Wed., Nov. 19
Time and room: 4-6pm, PHO 117
Registration form: https://forms.gle/ohC8KJPPbBt6jrQH8

Abstract: In this lecture we extend the BAI problem to tabular RL: identifying a target property (e.g., the best policy) in a controlled Markov chain. Using change-of-measure over transition kernels, we derive KL-based bounds that depend on state-action occupancy measures rather than arm pulls. The resulting characteristic time shows how mixing, reachability, and visitation constraints shape fundamental sample complexity.
Lecture notes will be provided in advance.

Lecture 5: Instance-dependent lower bounds for Markov Decision Processes (MDPs) and algorithm design

Date: Mon., Nov. 24
Time and room: 4-6pm, PHO 117
Registration form: https://forms.gle/ohC8KJPPbBt6jrQH8

Abstract: Building on the sample complexity bounds derived in the previous lecture, we will design an algorithm that is asymptotically optimal in the confidence parameter. We will see how the proof for Markov Decision Processes is more challenging than classical Bandit models, and what are the critical aspects in the proof of the sample complexity upper bound.
Lecture notes will be provided in advance.

Lecture 6: Bayesian Active Sequential Hypothesis Testing

Date: Tues., Nov. 25
Time and room: 4-6pm, CAS 424
Registration form: https://forms.gle/ohC8KJPPbBt6jrQH8

Abstract: In this last lecture we study the Bayesian Active Sequential Hypothesis Testing problem. This is a pure exploration problem where the learner has a prior over the model of the environment. We will see how to derive exploration policies that are optimal in a Bayesian sense, and how this methodology can be translated in practice to implement adaptive exploration strategies using Transformer models.
Lecture notes will be provided in advance.