%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Self-Optimizing and Pareto-Optimal Policies in %%
%% General Environments based on Bayes-Mixtures %%
%% Marcus Hutter: Start: 11.10.01 LastEdit: 10.04.02 %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%-------------------------------%
% Document-Style %
%-------------------------------%
\documentclass[12pt,twoside]{article}
\usepackage{latexsym}
\parskip=1.5ex plus 1ex minus 1ex \parindent=0ex
\pagestyle{myheadings}
\markboth{\sc Marcus Hutter, TR IDSIA-04-02 \& COLT-2002
}{\sc Self-Optimizing and Pareto-Optimal Bayes-Policies}
\topmargin=0cm \oddsidemargin=0cm \evensidemargin=0cm
\textwidth=16cm \textheight=23cm
\def\baselinestretch{0.98}
%-------------------------------%
% My Math-Environments %
%-------------------------------%
\def\,{\mskip 3mu} \def\>{\mskip 4mu plus 2mu minus 4mu} \def\;{\mskip 5mu plus 5mu} \def\!{\mskip-3mu}
\def\dispmuskip{\thinmuskip= 3mu plus 0mu minus 2mu \medmuskip= 4mu plus 2mu minus 2mu \thickmuskip=5mu plus 5mu minus 2mu}
\def\textmuskip{\thinmuskip= 0mu \medmuskip= 1mu plus 1mu minus 1mu \thickmuskip=2mu plus 3mu minus 1mu}
\textmuskip
\def\beq{\dispmuskip\begin{equation}} \def\eeq{\end{equation}\textmuskip}
\def\beqn{\dispmuskip\begin{displaymath}}\def\eeqn{\end{displaymath}\textmuskip}
\def\bqa{\dispmuskip\begin{eqnarray}} \def\eqa{\end{eqnarray}\textmuskip}
\def\bqan{\dispmuskip\begin{eqnarray*}} \def\eqan{\end{eqnarray*}\textmuskip}
%-------------------------------%
% Macro-Definitions %
%-------------------------------%
\newenvironment{keywords}{\centerline{\small\bf
Keywords}\vspace{0.5ex}\begin{quote}\small}{\par\end{quote}\vskip
1ex}
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}{Corollary}
\newtheorem{lemma}{Lemma}
\newtheorem{definition}{Definition}
\def\ftheorem#1#2#3{\begin{theorem}[#2]\label{#1} #3 \end{theorem} }
\def\fcorollary#1#2#3{\begin{corollary}[#2]\label{#1} #3 \end{corollary} }
\def\flemma#1#2#3{\begin{lemma}[#2]\label{#1} #3 \end{lemma} }
\def\fdefinition#1#2#3{\begin{definition}[#2]\label{#1} #3 \end{definition} }
\def\toinfty#1{\stackrel{#1\to\infty}{\longrightarrow}}
\def\qed{\sqcap\!\!\!\!\sqcup}
\def\odm{{\textstyle{1\over m}}}
\def\eps{\varepsilon}
\def\pb#1{\def\pb{\underline}\underline{#1}} % probability notation
\def\best{*} % or {best}
\def\xiai{{\xi}}
\def\muai{{\mu}}
\def\rhoai{{\rho}}
\def\M{{\cal M}}
\def\X{{\cal X}}
\def\Y{{\cal Y}}
\def\E{{\bf E}}
\begin{document}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% T i t l e - P a g e %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{titlepage}
\title{\vspace*{-2cm}
{\small Technical Report IDSIA-04-02 \hfil$-$\hfil
17 April 2002 \hfil$-$\hfil
Proceedings of COLT-2002\\[5mm]}
\Large\sc\hrule height1pt \vskip 2mm
Self-Optimizing and Pareto-Optimal Policies
in General Environments based on Bayes-Mixtures
\vskip 5mm \hrule height1pt}
\author{Marcus Hutter
\\[3ex]
\small IDSIA, Galleria 2, CH-6928 Manno-Lugano, Switzerland%
\footnote{This work was supported by SNF grant 2000-61847.00 to J\"urgen
Schmidhuber.} \\
\small marcus@idsia.ch $\quad-\quad$ http://www.idsia.ch/$^{_{_\sim}}\!$marcus
}
\date{}
\maketitle % typeset the title of the contribution
\hfill
\begin{keywords}
Rational agents, sequential decision theory, reinforcement
learning, value function, Bayes mixtures, self-optimizing
policies, Pareto-optimality, unbounded effective horizon,
(non) Markov decision processes.
\end{keywords}
\begin{abstract}
The problem of making sequential decisions in unknown
pro\-ba\-bi\-listic environments is studied. In cycle $t$ action
$y_t$ results in perception $x_t$ and reward $r_t$, where all
quantities in general may depend on the complete history. The
perception $x_t$ and reward $r_t$ are sampled from the (reactive)
environmental probability distribution $\mu$. This very general
setting includes, but is not limited to, (partially observable, $k$-th
order) Markov decision processes. Sequential decision theory tells
us how to act in order to maximize the total expected reward,
called value, if $\mu$ is known. Reinforcement learning is usually
used if $\mu$ is unknown. In the Bayesian approach one defines a
mixture distribution $\xi$ as a weighted sum of distributions
$\nu\in\M$, where $\M$ is any class of distributions including the
true environment $\mu$. We show that the Bayes-optimal policy
$p^\xi$ based on the mixture $\xi$ is self-optimizing in the sense
that the average value converges asymptotically for all $\mu\in\M$
to the optimal value achieved by the (infeasible) Bayes-optimal
policy $p^\mu$ which knows $\mu$ in advance. We show that the
necessary condition that $\M$ admits self-optimizing policies at
all is also sufficient. No other structural assumptions are made
on $\M$. As an example application, we discuss ergodic Markov
decision processes, which allow for self-optimizing policies.
Furthermore, we show that $p^\xi$ is Pareto-optimal in the sense
that there is no other policy yielding higher or equal value in
{\em all} environments $\nu\in\M$ and a strictly higher value in
at least one.
\end{abstract}
\end{titlepage}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}\label{secInt}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%------------------------------%
\paragraph{Reinforcement learning:}
%------------------------------%
There exists a well-developed theory for reinforcement learning
agents in known probabilistic environments (like Blackjack) called
sequential decision theory \cite{Bellman:57,Bertsekas:95}.
The optimal agent is the one which maximizes the future expected
reward sum. This setup also includes deterministic environments
(like static mazes). Even adversarial environments (like Chess or
Backgammon) may be seen as special cases in some sense
\cite[ch.6]{Hutter:00kcunai} (the reverse is also true
\cite{Brafman:00}). Sequential decision theory deals with a wide
range of problems, and provides a general formal solution in the
sense that it is mathematically rigorous and (uniquely) specifies
the optimal solution (leaving aside computational issues).
%
The theory breaks down when the environment is unknown (like when
driving a car in the real world). Reinforcement learning
algorithms exist for unknown Markov decision processes ({\sc
mdp}$s$) with small state space, and for other restricted classes
\cite{Kaelbling:96,Sutton:98,Bertsekas:95,Kumar:86}, but even in
these cases their learning rate is usually far from optimal.
%------------------------------%
\paragraph{Performance measures:}
%------------------------------%
In this work we are interested in general (probabilistic)
environmental classes $\M$. We assume $\M$ is given, and that the
true environment $\mu$ is in $\M$, but is otherwise unknown.
%
The expected reward sum (value) $V_\mu^p$ when following policy
$p$ is of central interest. We are interested in policies $\tilde
p$ which perform well (have high value) independent of what the
true environment $\mu\in\M$ is. A natural demand from an optimal
policy is that there is no other policy yielding higher or equal
value in {\em all} environments $\nu\in\M$ and a strictly higher
value in at least one $\nu\in\M$. We call such a property {\em
Pareto-optimality}. The other quantity of interest is how close
$V_\mu^{\tilde p}$ is to the value $V_\mu^\best$ of the optimal
(but infeasible) policy $p^\mu$ which knows $\mu$ in advance. We
call a policy {\em self-optimizing} if its average value converges
asymptotically to the optimal value $V_\mu^\best$ for every true
environment $\mu\in\M$.
%------------------------------%
\paragraph{Main new results for Bayes-mixtures:}
%------------------------------%
%Specifically, we consider probabilistic environments $\nu\in\M$.
We define the Bayes-mixture $\xi$ as a weighted average of the
environments $\nu\in\M$ and analyze the properties of the
Bayes-optimal policy $p^\xi$ which maximizes the mixture value
$V_\xi$.
%
One can show that not all environmental classes $\M$ admit
self-optimizing policies. One way to proceed is to search for and
prove weaker properties than self-optimizingness
\cite{Hutter:00kcunai}. Here we follow a different approach:
Obviously, the least we must demand from $\M$ to have a chance of
finding a self-optimizing policy is that there exists some
self-optimizing policy $\tilde p$ at all.
%------------------------------%
%\paragraph{What's new?:}
%------------------------------%
The main new result of this work is that this necessary condition
is also sufficient for $p^\xi$ to be self-optimizing. No other
properties need to be imposed on $\M$. The other new result is
that $p^\xi$ is always Pareto-optimal, with no conditions at
all imposed on $\M$.
%------------------------------%
\paragraph{Contents:}
%------------------------------%
Section \ref{secSetup} defines the model of agents acting in
general probabilistic environments and defines the finite horizon
value of a policy and the optimal value-maximizing policy.
Furthermore, the mixture-distribution is introduced and the
fundamental linearity and convexity properties of the
mixture value are stated.
%
Section \ref{secPareto} defines and proves
Pareto-optimality of $p^\xi$. The concept is refined to balanced
Pareto-optimality, showing that a small increase of the value for
some environments only leaves room for a small decrease in others.
%
Section \ref{secSelfOptA} shows that $p^\xi$ is self-optimizing if
$\M$ admits self-optimizing policies, and also gives the speed of
convergence in the case of finite $\M$.
%
The finite horizon model has several disadvantages. For this
reason Section \ref{secDiscount} defines the discounted (infinite
horizon) future value function, and the corresponding optimal
value-maximizing policy. Pareto-optimality and
self-optimizingness of $p^\xi$ are shown for this model.
%
As an application we show in Section \ref{secMDP} that the class
of ergodic {\sc mdp}$s$ admits self-optimizing policies w.r.t.\ the
undiscounted model and w.r.t.\ the discounted model if the
effective horizon tends to infinity. Together with the results
from the previous sections this shows that $p^\xi$ is
self-optimizing for ergodic {\sc mdp}$s$.
%
Conclusions and outlook can be found in Section \ref{secConc}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Rational Agents in Probabilistic Environments}\label{secSetup}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%------------------------------%
\paragraph{The agent model:}
%------------------------------%
A very general framework for intelligent systems is that of
rational agents \cite{Russell:95}. In cycle $k$, an agent performs
{\em action} $y_k\!\in\!\Y$ (output) which results in a {\em
perception} or {\em observation} $x_k\!\in\!\X$ (input), followed
by cycle $k\!+\!1$ and so on. We assume that the action and
perception spaces $\X$ and $\Y$ are finite. We write
$p(x_{<k})=y_{1:k}$ if policy $p$ outputs the actions $y_{1:k}$ when
fed the perceptions $x_{<k}$, where $y\!x_{<k}:=y_1x_1...y_{k-1}x_{k-1}$
denotes the history up to cycle $k$. Each perception $x_k$ contains a
reward $r_k$, which we assume to be bounded, $0\leq r_k\leq r_{max}$.
The environment samples the perceptions according to a probability
distribution $\rho(y\!\pb x_{1:m})$, where the underline indicates
that the probability refers to the $x$-arguments, conditioned on the
$y$-arguments.
%------------------------------%
\fdefinition{defValp}{Value function and optimal policy}{
%------------------------------%
The $\rho$-expected reward sum ({\em value}) of policy $p$ in
environment $\rho$ in cycles $k$ to $m$, the optimal value, and the
$\rho$-optimal policy are defined as
\beq\label{defAIrhoF}
  V_{km}^{p\rho} := \sum_{x_{k:m}}(r_k+\ldots+r_m)\,
  \rho(y\!x_{<k}y\!\pb x_{k:m}),
  \quad
  V_{km}^{\best\rho} := \max_p V_{km}^{p\rho},
  \quad
  p^\rho := \arg\max_p V_{1m}^{p\rho},
\eeq
where the actions $y_i=p(x_{<i})$ are the ones selected by policy
$p$. We often abbreviate $V_{1m}$ by $V$.
}%------------------------------%
By definition, the optimal value dominates the value of every policy,
\beq\label{Vdom}
  V_{km}^{\best\rho} \;\geq\; V_{km}^{p\rho} \quad\forall p.
\eeq
%------------------------------%
\paragraph{The Bayes-mixture $\xi$:}
%------------------------------%
We assume that the true environment $\mu$ is unknown, but that it
belongs to a known (countable) class $\M$ of environments. The
Bayes-mixture $\xi$ is defined as
\beq\label{defxi}
  \xi(y\!\pb x_{1:m}) \;:=\; \sum_{\nu\in\M}w_\nu\,\nu(y\!\pb x_{1:m})
  \quad\mbox{with}\quad
  \sum_{\nu\in\M}w_\nu = 1
  \quad\mbox{and}\quad
  w_\nu > 0\quad\forall\nu\in\M
\eeq
The weights $w_\nu$ may be interpreted as the prior degree of
belief that the true environment is $\nu$. Then $\xi(y\!\pb
x_{1:m})$ could be interpreted as the prior subjective belief
probability in observing $x_{1:m}$, given actions $y_{1:m}$. It
is, hence, natural to follow the policy $p^\xi$ which maximizes
$V_\xi^p$. If $\mu$ is the true environment the expected reward
when following policy $p^\xi$ will be $V_\mu^{p^\xi}$. The optimal
(but infeasible) policy $p^\mu$ yields reward $V_\mu^{p^\mu}\equiv
V_\mu^{\best}$. It is now of interest $(a)$ whether there are
policies with uniformly larger value than $V_\mu^{p^\xi}$ and
$(b)$ how close $V_\mu^{p^\xi}$ is to $V_\mu^{\best}$. These are
the main issues of the remainder of this work.
%The side remark in the following paragraph may be skipped.
%------------------------------%
\paragraph{A universal choice of $\xi$ and $\M$:}
%------------------------------%
One may also ask what the most general class $\M$ and weights
$w_\nu$ could be. Without any prior knowledge we should include
{\em all} environments in $\M$. In this generality this approach
leads at best to negative results. More useful is the assumption
that the environment possesses some structure; we just do not know
which. From a computational point of view we can only unravel
effective structures which are describable by (semi)computable
probability distributions. So we may include {\em all}
(semi)computable (semi)distribu\-tions in $\M$. Occam's razor
tells us to assign high prior belief to simple environments. Using
Kolmogorov's universal complexity measure $K(\nu)$ for
environments $\nu$ one should set $w_\nu\sim 2^{-K(\nu)}$, where
$K(\nu)$ is the length of the shortest program on a universal
Turing machine computing $\nu$. The resulting policy $p^\xi$ has
been developed and intensively discussed in
\cite{Hutter:00kcunai}. It is a unification of sequential decision
theory \cite{Bellman:57,Bertsekas:95} and Solomonoff's celebrated
universal induction scheme \cite{Solomonoff:78,Li:97}. In the
following we consider generic $\M$ and $w_\nu$.
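
For instance, for a countable enumeration $\M=\{\nu_1,\nu_2,...\}$, a
simple alternative to the Occam weights (chosen here purely for
illustration) is
\beqn
  w_{\nu_i} \;=\; {1\over i(i\!+\!1)}
  \quad\Rightarrow\quad
  \sum_{i=1}^\infty w_{\nu_i}
  \;=\; \sum_{i=1}^\infty\Big({1\over i}-{1\over i\!+\!1}\Big) \;=\; 1
  \quad\mbox{and}\quad
  w_{\nu_i}>0 \;\;\forall i,
\eeqn
so any summable, strictly positive weighting qualifies as a prior.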
%
The following property of $V_\rho$ is crucial.
%------------------------------%
\ftheorem{thVlinconv}{Linearity and convexity of $V_\rho$ in $\rho$}{
%------------------------------%
$V_\rho^p$ is a linear function in $\rho$ and $V_\rho^\best$
is a convex function in $\rho$ in the sense that
\beqn%\label{thV1lin}\label{thV1conv}
V_\xi^p \;= \sum_{\nu\in\M}w_\nu V_\nu^p
\quad\mbox{and}\quad
V_\xi^\best \;\leq \sum_{\nu\in\M}w_\nu V_\nu^\best
\quad\mbox{where}\quad
\xi(y\!\pb x_{1:m})= \sum_{\nu\in\M}w_\nu \nu(y\!\pb x_{1:m})
\eeqn\vspace{-1ex}
}%------------------------------%
\noindent{\bf Proof:} Linearity is obvious from the definition
of $V_\rho^p$. Convexity follows from
$V_\xi^\best\equiv V_\xi^{p^\xi}=\sum_\nu w_\nu V_\nu^{p^\xi}\leq
\sum_\nu w_\nu V_\nu^\best$, where the identity is definition
(\ref{defAIrhoF}), the equality uses linearity of $V_\rho^{p^\xi}$
just proven, and the last inequality follows from the dominance
(\ref{Vdom}) and non-negativity of the weights $w_\nu$. $\qed$

One loose interpretation of the convexity is that a mixture can
never increase performance. In the remainder of this work $\mu$
denotes the true environment, $\rho$ any distribution, and $\xi$
the Bayes-mixture of distributions $\nu\in\M$.
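
A simple one-cycle toy example illustrates that the convexity
inequality can be strict: take $m=1$, $\Y=\{a,b\}$, and two
deterministic environments $\nu_a$ and $\nu_b$, where $\nu_y$ gives
reward $r_1=1$ iff action $y$ is taken, with weights
$w_{\nu_a}=w_{\nu_b}={\textstyle{1\over 2}}$. Then
\beqn
  V_{\nu_a}^\best \;=\; V_{\nu_b}^\best \;=\; 1
  \quad\mbox{but}\quad
  V_\xi^\best \;=\; {\textstyle{1\over 2}}
  \;<\; 1 \;=\; \sum_\nu w_\nu V_\nu^\best,
\eeqn
since under $\xi$ either action yields expected reward
${\textstyle{1\over 2}}$: not knowing the environment costs half the
achievable reward.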
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Pareto Optimality of policy $p^\xi$}\label{secPareto}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
The total $\mu$-expected reward $V_\mu^{p^\xi}$ of policy $p^\xi$
is of central interest in judging the
performance of policy $p^\xi$. We know that there {\em are} policies
(e.g.\ $p^\mu$) with higher $\mu$-value
($V_\mu^\best\geq V_\mu^{p^\xi}$). In general, every policy based
on an estimate $\rho$ of $\mu$ that is closer to $\mu$ than $\xi$
outperforms $p^\xi$ in environment $\mu$, simply because it is
more tailored toward $\mu$. On the other hand, such a policy
probably performs worse than $p^\xi$ in other environments.
%
Since we do not know $\mu$ in advance we may ask
whether there exists a policy $p$ with better or equal performance
than $p^\xi$ in {\em all} environments $\nu \in\M$ and a strictly
better performance for one $\nu \in\M$. This would clearly
render $p^\xi$ suboptimal. We show that there is no such $p$.
%------------------------------%
\ftheorem{thaiPareto}{Pareto optimality}{
%------------------------------%
Policy $p^\xi$ is Pareto-optimal in the sense that there is no
other policy $p$ with $V_{\nu}^p\geq V_{\nu}^{p^\xi}$ for all $\nu
\in \M$ and strict inequality for at least one $\nu$.
}%------------------------------%
\noindent{\bf Proof:} We want to arrive at a contradiction by
assuming that $p^\xi$ is not Pareto-optimal, i.e.\ by assuming the
existence of a policy $p$ with $V_\nu^p\geq V_\nu^{p^\xi}$ for all
$\nu \in \M$ and strict inequality for at least one $\nu$:
\beqn
V_\xi^p \;=\;
\sum_\nu w_\nu V_\nu^p \;>\;
\sum_\nu w_\nu V_\nu^{p^\xi} \;=\;
V_\xi^{p^\xi} \;\equiv\;
V_\xi^\best \;\geq\;
V_\xi^p
\eeqn
The two equalities follow from linearity of $V_\rho$ (Theorem
\ref{thVlinconv}). The strict inequality follows from the
assumption and from $w_\nu>0$. The identity is just Definition
\ref{defValp}(\ref{defAIrhoF}). The last inequality follows from
the fact that $p^\xi$ maximizes by definition the universal value
(\ref{Vdom}). The contradiction $V_\xi^p > V_\xi^p$ proves
Pareto-optimality of policy $p^\xi$. $\qed$

Pareto-optimality should be regarded as a necessary condition for
an agent aiming to be optimal. From a practical point
of view a significant increase of $V$ for many environments
$\nu$ may be desirable even if this causes a small decrease of
$V$ for a few other $\nu$. The impossibility of such a
``balanced'' improvement is a more demanding condition on $p^\xi$
than pure Pareto-optimality. The next theorem
shows that $p^\xi$ is also balanced-Pareto-optimal in
the following sense:
%------------------------------%
\ftheorem{thaiElPareto}{Balanced Pareto optimality}{
%------------------------------%
\beqn
\Delta_\nu:=V_\nu^{p^\xi}-V_\nu^{\tilde p} ,\quad
\Delta:=\sum_{\nu\in\M}w_\nu\Delta_\nu \quad\Rightarrow\quad
\Delta\geq 0.
\eeqn
This implies the following: Assume $\tilde p$ has lower value than
$p^\xi$ on environments $\cal L$ by a
total weighted amount of $\Delta_{\cal
L} := \sum_{\lambda\in\cal L}w_\lambda\Delta_\lambda$. Then
$\tilde p$ can have higher value on $\eta \in {\cal
H} := \M \setminus \cal L$, but the improvement is bounded by
$\Delta_{\cal H} := |\sum_{\eta\in\cal
H}w_\eta\Delta_\eta| \leq \Delta_{\cal L}$. Especially
$|\Delta_\eta| \leq w_\eta^{-1}\max_{\lambda\in\cal
L}\Delta_\lambda$.
}%------------------------------%
%
This means that a weighted value increase $\Delta_{\cal H}$ by
using $\tilde p$ instead of $p^\xi$ is compensated by an at least
as large weighted decrease $\Delta_{\cal L}$ on other
environments. If the decrease is small, the increase can also only
be small. In the special case of only a single environment with
decreased value $\Delta_\lambda$, the increase is bounded by
$\Delta_\eta \leq {w_\lambda\over w_\eta}|\Delta_\lambda|$, i.e.\
a decrease by an amount $\Delta_\lambda$ can only cause an
increase by at most the same amount times a factor
${w_\lambda\over w_\eta}$. For the choice of the weights
$w_\nu\sim 2^{-K(\nu)}$, a decrease can only cause a smaller
increase in simpler environments, but a scaled increase in more
complex environments. Finally note that pure Pareto-optimality
(Theorem \ref{thaiPareto}) follows from balanced Pareto-optimality
in the special case of no decrease $\Delta_{\cal L} \equiv 0$.
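
For a concrete illustration with hypothetical numbers, take
$\M=\{\lambda,\eta\}$ with $w_\lambda={\textstyle{2\over 3}}$ and
$w_\eta={\textstyle{1\over 3}}$. If $\tilde p$ loses
$\Delta_\lambda=0.01$ in environment $\lambda$, then
$0\leq\Delta=w_\lambda\Delta_\lambda+w_\eta\Delta_\eta$ implies
\beqn
  |\Delta_\eta| \;\leq\; {w_\lambda\over w_\eta}\,\Delta_\lambda
  \;=\; 2\cdot 0.01 \;=\; 0.02,
\eeqn
i.e.\ the gain in $\eta$ can be at most twice the loss in $\lambda$.
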
\noindent{\bf Proof:} $\Delta \geq 0$ follows from
$\Delta=\sum_\nu w_\nu[V_\nu^{p^\xi} - V_\nu^{\tilde
p}]=V_\xi^{p^\xi} - V_\xi^{\tilde p}\geq 0$, where we have used
linearity of $V_\rho$ (Theorem \ref{thVlinconv}) and dominance
$V_\xi^{p^\xi} \geq V_\xi^p$ (\ref{Vdom}). The remainder of
Theorem \ref{thaiElPareto} is obvious from
$0 \leq \Delta = \Delta_{\cal L} - \Delta_{\cal H}$ and by
bounding the weighted average $\Delta_\eta$ by its maximum.$\qed$
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Self-optimizing Policy $p^\xi$ w.r.t.\ Average Value}\label{secSelfOptA}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
In the following we study under which circumstances\footnote{%
Here and elsewhere we interpret $a_m\to b_m$ as an abbreviation
for $a_m-b_m\to 0$. $\lim_{m\to\infty}b_m$ may not exist.}
\beq\label{VxitoV}
\odm V_{1m}^{p^\xi\nu}\to\odm V_{1m}^{\best\nu}
\quad\mbox{for}\quad m\to\infty
\quad\mbox{for {\em all}}\quad \nu\in\M.
\eeq
The least we must demand from $\M$ to have a chance that
(\ref{VxitoV}) is true is that there exists some policy $\tilde p$ at all with this
property, i.e.\
\beq\label{VptoV}
\exists\tilde p:\;\odm V_{1m}^{\tilde p\nu}\to\odm V_{1m}^{\best\nu}
\quad\mbox{for}\quad m\to\infty
\quad\mbox{for {\em all}}\quad \nu\in\M.
\eeq
Luckily, this necessary condition will also be sufficient. This is
another (asymptotic) optimality property of policy $p^\xi$. {\em
If} universal convergence in the sense of (\ref{VptoV}) is
possible at all in a class of environments $\M$, then policy
$p^\xi$ converges in the sense of (\ref{VxitoV}). We will call
policies $\tilde p$ with a property like (\ref{VptoV}) {\em
self-optimizing} \cite{Kumar:86}.
%
The following two Lemmas pave the way for proving the convergence
Theorem.
%------------------------------%
\flemma{lemVDconv}{Value difference relation}{
%------------------------------%
\beqn
0 \;\leq\; V_\nu^\best-V_\nu^{\tilde p} \;=:\; \Delta_\nu
\quad\Rightarrow\quad
0 \;\leq\; V_\nu^\best-V_\nu^{p^\xi} \;\leq\; {\textstyle{1\over w_\nu}}\Delta
\quad\mbox{with}\quad
\Delta:=\sum_{\nu \in\M}w_\nu\Delta_\nu
\eeqn\vspace{-3ex}
}%------------------------------%
\noindent{\bf Proof:} The following sequence of inequalities proves the
lemma:
\beqn\textstyle
0 \;\leq\; w_\nu[V_\nu^\best\!-\!V_\nu^{p^\xi}] \;\leq\;
% \sum_\nu \underbrace{w_\nu}_{\geq 0}
% [\underbrace{V_\nu^\best\!-\!V_\nu^{p^\xi}}_{\geq 0}] \;\leq\;
\sum_\nu w_\nu [V_\nu^\best\!-\!V_\nu^{p^\xi}] \;\leq\;
\sum_\nu w_\nu[V_\nu^\best\!-\!V_\nu^{\tilde p}] \;=\;
\sum_\nu w_\nu\Delta_\nu \;\equiv\; \Delta
\eeqn
In the first and second inequality we used $w_\nu \geq 0$ and
$V_\nu^\best - V_\nu^{p^\xi} \geq 0$. The last inequality
follows from
$
\sum_\nu w_\nu V_\nu^{p^\xi} =
V_\xi^{p^\xi} \equiv V_\xi^\best \geq V_\xi^{\tilde p} =
\sum_\nu w_\nu V_\nu^{\tilde p}
$. $\qed$

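
For uniform weights over a finite class the bound of Lemma
\ref{lemVDconv} takes a particularly transparent form: with $|\M|=N$
and $w_\nu={\textstyle{1\over N}}$ for all $\nu$ we get
\beqn
  V_\nu^\best-V_\nu^{p^\xi} \;\leq\; {\textstyle{1\over w_\nu}}\Delta
  \;=\; N\sum_{\nu'\in\M}{\textstyle{1\over N}}\Delta_{\nu'}
  \;=\; \sum_{\nu'\in\M}\Delta_{\nu'},
\eeqn
i.e.\ the value gap of $p^\xi$ in any single environment is at most
the summed gaps of $\tilde p$ over all environments.
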
We also need some results for averages of functions
$\delta_\nu(m)\geq 0$ converging to zero.
%------------------------------%
\flemma{lemVdconv}{Convergence of averages}{
%------------------------------%
For $\delta(m):=\sum_{\nu\in\M}w_\nu\delta_\nu(m)$ the following
holds (we only need $\sum_\nu w_\nu\leq 1$):
\beqn
\begin{array}{rllllll}
i) & \displaystyle
\delta_\nu(m)\leq f(m)
& \forall\nu
& \mbox{implies}
& \delta(m) \leq f(m). \\[1ex]
ii) & \displaystyle
\delta_\nu(m)\toinfty{m} 0
& \forall\nu
& \mbox{implies}
& \delta(m)\toinfty{m} 0
& \mbox{if}
& 0\leq\delta_\nu(m)\leq c. \\[1ex]
% iii) & \displaystyle
% \delta(m)\leq\max_\nu\delta_\nu(m).
\end{array}
\eeqn\vspace{-3ex}
}%------------------------------%
\noindent{\bf Proof:} $(i)$ immediately follows from
$\delta(m)=\sum_\nu w_\nu\delta_\nu(m) \leq
\sum_\nu w_\nu f(m) \leq f(m)$.
For $(ii)$ we choose some order on $\M$ and some $\nu_0 \in \M$
large enough such that $\sum_{\nu\geq\nu_0}w_\nu\leq{\eps\over c}$.
Using $\delta_\nu(m) \leq c$ this implies
\beqn
\sum_{\nu\geq\nu_0}w_\nu\delta_\nu(m) \;\leq\;
\sum_{\nu\geq\nu_0}w_\nu c \;\leq\; \eps.
\eeqn
Furthermore, the assumption $\delta_\nu(m) \to 0$ means
that there is an $m_{\nu\eps}$ depending on $\nu$ and $\eps$ such
that $\delta_\nu(m)\leq\eps$ for all $m \geq m_{\nu\eps}$. This
implies
\beqn
\sum_{\nu\leq\nu_0}w_\nu\delta_\nu(m) \;\leq\;
\sum_{\nu\leq\nu_0}w_\nu \eps \;\leq\; \eps
\quad\mbox{for all}\quad
m \;\geq\; \max_{\nu\leq\nu_0}\{m_{\nu\eps}\} \;=:\; m_\eps.
\vspace{-1ex}
\eeqn
$m_\eps < \infty$, since the maximum is over a finite set.
Together we have
\beqn
\delta(m) \;\equiv\;
\sum_{\nu\!\in\!\M}w_\nu\delta_\nu(m) \;\leq\; 2\eps
\quad\mbox{for}\quad
m \;\geq\; m_\eps
\quad\Rightarrow\quad
\delta(m)\to 0
\quad\mbox{for}\quad
m\to\infty\vspace{-1ex}
\eeqn
since $\eps$ was arbitrary and $\delta(m) \geq 0$. $\qed$
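
The boundedness condition $0\leq\delta_\nu(m)\leq c$ in $(ii)$ cannot
be dropped, as the following artificial example with
$\M=\{\nu_1,\nu_2,...\}$ and $w_{\nu_i}=2^{-i}$ shows:
\beqn
  \delta_{\nu_i}(m) \;:=\;
  \left\{\begin{array}{ll} 2^i & \mbox{if } i=m \\
                           0   & \mbox{otherwise} \end{array}\right.
  \quad\Rightarrow\quad
  \delta_{\nu_i}(m)\toinfty{m}0 \;\;\forall i,
  \quad\mbox{but}\quad
  \delta(m) \;=\; 2^{-m}\cdot 2^m \;=\; 1 \;\;\forall m.
\eeqn
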
%------------------------------%
\ftheorem{thVcconv}{Self-optimizing policy $p^\xi$ w.r.t.\ average value}{
%------------------------------%
If there exists a sequence of policies $\tilde p_m$, $m=1,2,3,...$
with value within $\Delta(m)$ of the optimum for all environments
$\nu\in\M$, then, up to a constant factor, this also holds for
the sequence of universal policies $p^\xi_m$, i.e.\
\beqn
i)\quad\mbox{If}\quad
\exists\tilde p_m\forall\nu:
V_{1m}^{\best\nu}-V_{1m}^{\tilde p_m\nu} \leq \Delta(m)
\quad\Longrightarrow\quad
V_{1m}^{\best\mu}-V_{1m}^{p^\xi_m\mu} \leq
\textstyle{1\over w_\mu}\Delta(m).
\eeqn
If there exists a sequence of self-optimizing
policies $\tilde p_m$ in the sense that their expected average
reward $\odm V_{1m}^{\tilde p_m\nu}$ converges to the optimal
average $\odm V_{1m}^{\best\nu}$ for all environments $\nu\in\M$,
then this also holds for the sequence of universal policies
$p^\xi_m$, i.e.\
\beqn
ii) \quad\mbox{If}\quad
\exists\tilde p_m\forall\nu:
\odm V_{1m}^{\tilde p_m\nu} \;\toinfty{m}\; \odm V_{1m}^{\best\nu}
\quad\Longrightarrow\quad
\odm V_{1m}^{p^\xi_m\mu} \;\toinfty{m}\; \odm V_{1m}^{\best\mu}.
\rule{2ex}{0ex}
\eeqn\vspace{-3ex}
}%------------------------------%
The beauty of this theorem is that if universal convergence in the
sense of (\ref{VptoV}) is possible at all in a class of
environments $\M$, then policy $p^\xi$ converges (in the sense of
(\ref{VxitoV})). The necessary condition of convergence is also
sufficient. The unattractive point is that this is not an
asymptotic convergence statement for $V_{km}^{p^\xi \mu}$ of a
single policy $p^\xi$ for $k\to\infty$ for some fixed $m$, and in
fact no such theorem could be true, since always $k\leq m$. The
theorem merely says that under the stated conditions the average
value of $p^\xi_m$ can be arbitrarily close to optimum for
sufficiently large (pre-chosen) horizon $m$. This weakness will be
resolved in the next section.
\noindent{\bf Proof:} $(i)$ $\Delta_\nu(m)\leq f(m)$ implies
$\Delta(m)\leq f(m)$ by Lemma \ref{lemVdconv}$(i)$.
Inserting this in Lemma \ref{lemVDconv} proves Theorem
\ref{thVcconv}$(i)$ (recovering the $m$ dependence and finally
renaming $f \leadsto \Delta$).
$(ii)$ We define $\delta_\nu(m) := \odm\Delta_\nu(m) = \odm
[V_\nu^\best-V_\nu^{\tilde p}]$. Since we assumed bounded rewards
$0 \leq r \leq r_{max}$ we have
\beqn
V_\nu^* \leq m r_{max}
\quad\mbox{and}\quad
V_\nu^{\tilde p} \geq 0
\quad\Rightarrow\quad
\Delta_\nu \leq m r_{max}
\quad\Rightarrow\quad
0 \leq \delta_\nu(m) \leq c:=r_{max}.
\eeqn
The premise in Theorem \ref{thVcconv}$(ii)$ is that
$
\delta_\nu(m)=\odm[V_{1m}^{\best\nu} - V_{1m}^{\tilde
p\nu}]\to 0
$
which implies
\beqn\textstyle
0 \;\leq\; \odm[V_{1m}^{\best\nu}-V_{1m}^{p^\xi\nu}]
\;\leq\;
{1\over w_\nu}{\Delta(m)\over m} \;=\; {1\over w_\nu}\delta(m)
\;\to\; 0.
\eeqn
The inequalities follow from Lemma \ref{lemVDconv} and convergence
to zero from Lemma \ref{lemVdconv}$(ii)$.
This proves Theorem \ref{thVcconv}$(ii)$. $\qed$

In Section \ref{secMDP} we show that a converging $\tilde p$
exists for ergodic {\sc mdp}$s$, and hence $p^\xi$ converges in
this environmental class too (in the sense of Theorem
\ref{thVcconv}).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Discounted Future Value Function}\label{secDiscount}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We now shift our focus from the total value $V_{1m}$, $m\to\infty$
to the future value (value-to-go) $V_{k?}$, $k\to\infty$. The main
reason is that we want to get rid of the horizon parameter $m$. In
the last section we have shown a convergence theorem for
$m\to\infty$, but a specific policy $p^\xi$ is defined for all
times relative to a fixed horizon $m$. Current time $k$ is moving,
but $m$ is fixed\footnote{A dynamic horizon like $m\leadsto
m_k=k^2$ can lead to policies with very poor performance
\cite[Ch.4]{Hutter:00kcunai}.}. Actually, to use $k\to\infty$
arguments we {\em have} to get rid of $m$, since $k\leq m$. This
is the reason for the question mark in $V_{k?}$ above.
We eliminate the horizon by discounting the rewards
$r_k\leadsto\gamma_k r_k$ with $\sum_{i=1}^\infty\gamma_i<\infty$
and letting $m\to\infty$. The analogue of $m$ is now an effective
horizon $h_k^{e\!f\!f}$ which may be defined by
$\sum_{i=k}^{k+h_k^{e\!f\!f}}\!\!\gamma_i \sim
\sum_{i=k+h_k^{e\!f\!f}}^\infty\; \gamma_i$.
See \cite[Ch.4]{Hutter:00kcunai} for a detailed discussion of the
horizon problem. Furthermore, we renormalize $V_{k\infty}$ by
$\sum_{i=k}^\infty\gamma_i$ and denote it by $V_{k\gamma}$. It can
be interpreted as a future expected weighted-average reward.
Furthermore we extend the definition to probabilistic policies
$\pi$.
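
For instance, for geometric discounting $\gamma_k=\gamma^k$ with
$0<\gamma<1$, the defining relation for the effective horizon becomes
\beqn
  \gamma^k\,{1-\gamma^{h+1}\over 1-\gamma}
  \;\sim\; {\gamma^{k+h}\over 1-\gamma}
  \quad\Rightarrow\quad
  \gamma^h \;\approx\; {\textstyle{1\over 2}}
  \quad\Rightarrow\quad
  h_k^{e\!f\!f} \;\approx\; {\ln 2\over\ln\gamma^{-1}},
\eeqn
independent of $k$, i.e.\ geometric discounting corresponds to a
constant effective horizon.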
%------------------------------%
\fdefinition{defAIrhoDisc}{Discounted value function and optimal policy}{
%------------------------------%
We define the $\gamma$-discounted weighted-average future {\em
value} of (probabilistic) policy $\pi$ in environment $\rho$ given
history $y\!x_{<k}$ as
\beqn
  V_{k\gamma}^{\pi\rho} \;:=\; {1\over\Gamma_k}\,
  \E\Big[\sum_{i=k}^\infty \gamma_i r_i\Big]
  \quad\mbox{with}\quad
  \Gamma_k \;:=\; \sum_{i=k}^\infty\gamma_i,
\eeqn
where the expectation $\E$ is taken over the perceptions sampled from
$\rho$ and the actions sampled from $\pi$, both conditioned on the
history $y\!x_{<k}$. The optimal value and policy are
$V_{k\gamma}^{\best\rho}:=\max_\pi V_{k\gamma}^{\pi\rho}$ and
$p^\rho:=\arg\max_\pi V_{k\gamma}^{\pi\rho}$.
}%------------------------------%
Note that the choice $\gamma_i=1$ for $i\leq m$ and $\gamma_i=0$ for
$i>m$ gives back the undiscounted model (\ref{defValp}) with
$V_{1\gamma}^{p\rho}={1\over m}V_{1m}^{p\rho}$. Furthermore,
$V_{k\gamma}$ (and $w_k^\nu$ defined below) depend on the realized
history $y\!x_{<k}$.