Gradient-based Reinforcement Planning in Policy-Search Methods
Author: Ivo Kwee, Marcus Hutter, Juergen Schmidhuber (2001)
Comments: Extended version: 9 pages
Subj-class: Artificial Intelligence; Learning
ACM-class: I.2; I.2.6; I.2.8
Reference: Proceedings of the 5th European Workshop on Reinforcement Learning
(EWRL-5) 27-29, Onderwijsinstituut CKI
Keywords: Artificial intelligence, reinforcement learning,
direct policy search, planning, gradient descent.
Abstract: We introduce a learning method called "gradient-based reinforcement
planning" (GREP). Unlike traditional DP methods that improve their
policy backwards in time, GREP is a gradient-based method that plans
ahead and improves its policy before it actually acts in the
environment. We derive formulas for the exact policy gradient that
maximizes the expected future reward and confirm our ideas
with numerical experiments.
Table of Contents
- Introduction
- Derivation of the policy gradient
- Computation of the optimal policy
- Numerical experiments
- Conclusions
- Implicit policies
- Monte Carlo gradient sampling
BibTeX Entry
@InProceedings{Hutter:01grep,
author = "Ivo Kwee and Marcus Hutter and Juergen Schmidhuber",
institution = "Istituto Dalle Molle di Studi sull'Intelligenza Artificiale (IDSIA)",
title = "Gradient-based Reinforcement Planning in Policy-Search Methods",
month = oct,
year = "2001",
pages = "27--29",
address = "Manno (Lugano), CH",
booktitle = "Proceedings of the 5th European Workshop on Reinforcement Learning (EWRL-5)",
number = "27",
editor = "Marco A. Wiering",
publisher = "Onderwijsinstituut CKI - Utrecht University",
series = "Cognitieve Kunstmatige Intelligentie",
ISBN = "90-393-2874-9",
ISSN = "1389-5184",
keywords = "Artificial intelligence, reinforcement learning, direct policy search,
planning, gradient descent.",
url = "http://www.hutter1.net/ai/pgrep.htm",
categories = "I.2. [Artificial Intelligence],
I.2.6. [Learning],
I.2.8. [Problem Solving, Control Methods and Search]",
abstract = "We introduce a learning method called ``gradient-based reinforcement
planning'' (GREP). Unlike traditional DP methods that improve their
policy backwards in time, GREP is a gradient-based method that plans
ahead and improves its policy {\em before} it actually acts in the
environment. We derive formulas for the exact policy gradient that
maximizes the expected future reward and confirm our ideas
with numerical experiments.",
}