arxiv:2010.03531

Episodic Reinforcement Learning in Finite MDPs: Minimax Lower Bounds Revisited

Published on Oct 7, 2020
Authors: Omar Darwiche Domingues, Pierre Ménard, Emilie Kaufmann, Michal Valko
AI-generated summary

New minimax lower bounds on the sample complexity of best-policy identification and on the regret in non-stationary episodic MDPs are established, based on a construction of "hard MDPs" that differs from those used in previous work.

Abstract

In this paper, we propose new problem-independent lower bounds on the sample complexity and regret in episodic MDPs, with a particular focus on the non-stationary case in which the transition kernel is allowed to change in each stage of the episode. Our main contribution is a novel lower bound of Ω((H^3 S A / ε^2) log(1/δ)) on the sample complexity of an (ε, δ)-PAC algorithm for best policy identification in a non-stationary MDP. This lower bound relies on a construction of "hard MDPs" which is different from the ones previously used in the literature. Using this same class of MDPs, we also provide a rigorous proof of the Ω(√(H^3 S A T)) regret bound for non-stationary MDPs. Finally, we discuss connections to PAC-MDP lower bounds.
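
As a rough numerical illustration (not from the paper), the sketch below plugs hypothetical values of H, S, A, ε, δ, and T into the two stated lower bounds; the universal constant hidden by the Ω(·) notation is unspecified in the statements and is simply set to 1 here.

```python
import math

# Minimal sketch: evaluate the stated lower bounds for illustrative parameters.
# The universal constant hidden by Omega(.) is unknown; c = 1 is an assumption.

def pac_lower_bound(H, S, A, eps, delta, c=1.0):
    # Omega((H^3 S A / eps^2) * log(1/delta)) sample-complexity lower bound
    return c * (H ** 3) * S * A / (eps ** 2) * math.log(1.0 / delta)

def regret_lower_bound(H, S, A, T, c=1.0):
    # Omega(sqrt(H^3 S A T)) regret lower bound
    return c * math.sqrt((H ** 3) * S * A * T)

# Hypothetical problem sizes, chosen only to make the scaling concrete.
H, S, A = 10, 20, 5
print(pac_lower_bound(H, S, A, eps=0.1, delta=0.05))  # ~3.0e7 samples
print(regret_lower_bound(H, S, A, T=1_000_000))       # ~3.2e5 regret
```

The point of the sketch is only the scaling these formulas imply: cubic in the horizon H, linear in the numbers of states S and actions A.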
