Model-free Posterior Sampling via Learning Rate Randomization
Abstract
Randomized Q-learning (RandQL) is a novel model-free algorithm for regret minimization in episodic MDPs that achieves regret guarantees in both tabular and metric state-action spaces without using exploration bonuses, relying instead on learning rate randomization.
In this paper, we introduce Randomized Q-learning (RandQL), a novel randomized model-free algorithm for regret minimization in episodic Markov Decision Processes (MDPs). To the best of our knowledge, RandQL is the first tractable model-free posterior sampling-based algorithm. We analyze the performance of RandQL in both tabular and non-tabular metric space settings. In tabular MDPs, RandQL achieves a regret bound of order O(\sqrt{H^5 S A T}), where H is the planning horizon, S is the number of states, A is the number of actions, and T is the number of episodes. For a metric state-action space, RandQL enjoys a regret bound of order O(H^{5/2} T^{(d_z+1)/(d_z+2)}), where d_z denotes the zooming dimension. Notably, RandQL achieves optimistic exploration without using bonuses, relying instead on a novel idea of learning rate randomization. Our empirical study shows that RandQL outperforms existing approaches on baseline exploration environments.
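The central idea in the abstract, replacing exploration bonuses with randomized learning rates, can be illustrated with a minimal tabular sketch. The Python snippet below is not the paper's exact RandQL (or Staged-RandQL) procedure: the Beta parameterization of the step size, the `kappa` prior-strength parameter, and the single Q-table (the paper maintains an ensemble of temporary Q-functions) are simplifying assumptions made for illustration.

```python
import numpy as np

# Minimal sketch of Q-learning with a randomized learning rate (assumption-based,
# not the paper's exact RandQL update).  Instead of a deterministic step size
# plus an exploration bonus, each update draws its step size from a Beta
# distribution whose parameters depend on the visit count, so the Q-values
# carry posterior-sampling-like noise that shrinks as data accumulates.

class RandomizedLRQLearning:
    def __init__(self, n_states, n_actions, horizon, kappa=1.0, seed=0):
        self.H = horizon                      # planning horizon H
        self.kappa = kappa                    # prior-strength hyperparameter (assumed)
        self.rng = np.random.default_rng(seed)
        # optimistic initialization at the maximal return H
        self.Q = np.full((horizon, n_states, n_actions), float(horizon))
        self.counts = np.zeros((horizon, n_states, n_actions), dtype=int)

    def act(self, h, s):
        # greedy action w.r.t. the current (randomized) Q-values
        return int(np.argmax(self.Q[h, s]))

    def update(self, h, s, a, reward, next_s):
        n = self.counts[h, s, a]
        self.counts[h, s, a] += 1
        # Randomized learning rate: Beta(kappa*(H+1), n+1) has mean roughly
        # (H+1)/(H+n), so it concentrates as the visit count n grows,
        # mimicking a posterior that sharpens with more data.
        lr = self.rng.beta(self.kappa * (self.H + 1), n + 1)
        target = reward
        if h + 1 < self.H:
            target += np.max(self.Q[h + 1, next_s])
        self.Q[h, s, a] = (1.0 - lr) * self.Q[h, s, a] + lr * target
```

In an episode loop, the agent calls act(h, s) at each step and update(h, s, a, r, s') on the observed transition; because the step size is random, repeated visits to a state-action pair yield Q-values that fluctuate around the empirical Bellman target, which plays the exploratory role that a bonus term plays in optimistic Q-learning.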