%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% General Loss Bounds for Universal Sequence Prediction %%
%% %%
%% Marcus Hutter: Start: 11.10.00 LastEdit: 10.04.01 %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%-------------------------------%
% icml Document-Style %
%-------------------------------%
\documentclass[10pt,twocolumn]{article}
\setlength\headheight{0pt} \setlength\headsep{0pt}
\topmargin=0cm \oddsidemargin=-1cm \evensidemargin=-1cm
\textwidth=17.8cm \textheight=22cm %\unitlength=1mm \sloppy
%-------------------------------%
% Macro-Definitions %
%-------------------------------%
\renewenvironment{abstract}{\centerline{\bf
Abstract}\vspace{0.5ex}\begin{quote}\small}{\par\end{quote}\vskip
1ex}
\newenvironment{keywords}{\centerline{\bf
Keywords}\vspace{0.5ex}\begin{quote}\small}{\par\end{quote}\vskip
1ex}
\newtheorem{theorem}{Theorem}
\def\nq{\hspace{-1em}}
\def\qed{\sqcap\!\!\!\!\sqcup}
\def\odt{{\textstyle{1\over 2}}}
\def\odf{{\textstyle{1\over 4}}}
\def\odA{{\textstyle{1\over A}}}
\def\eps{\varepsilon}
\def\beq{\begin{equation}}
\def\eeq{\end{equation}}
\def\beqn{\begin{displaymath}}
\def\eeqn{\end{displaymath}}
\def\bqa{\begin{equation}\begin{array}{c}}
\def\eqa{\end{array}\end{equation}}
\def\bqan{\begin{displaymath}\begin{array}{c}}
\def\eqan{\end{array}\end{displaymath}}
\def\vec#1{{\bf #1}}
\title{
\vskip -.25in \hrule height1pt \vskip .25in \bf \LARGE General
Loss Bounds for Universal Sequence Prediction \vskip .22in \hrule
height1pt \vskip .3in}
\author{ {\bf Marcus Hutter} \\[2mm]
{\small IDSIA, Galleria 2, CH-6928 Manno-Lugano, Switzerland} \\
{\small marcus@idsia.ch \qquad http://www.idsia.ch/$^{_{_\sim}}\!$marcus} \\
{\small Technical Report IDSIA-03-01, 10 April 2001}
}
\date{}
\begin{document}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% T i t l e - P a g e %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\maketitle
\begin{keywords}
Bayesian and deterministic prediction; general loss function;
Solomonoff induction; Kolmogorov complexity; learning; universal
probability; loss bounds; games of chance; partial and delayed
prediction; classification.
\end{keywords}
\begin{abstract}
The Bayesian framework is ideally suited for induction problems.
The probability of observing $x_t$ at time $t$, given past
observations $x_1...x_{t-1}$, can be computed with Bayes' rule if
the true distribution $\mu$ of the sequences $x_1x_2x_3...$ is
known. The problem, however, is that in many cases one does not
even have a reasonable estimate of the true distribution. To
overcome this problem, a universal distribution $\xi$ is defined
as a weighted sum of distributions $\mu_i\!\in\!M$, where $M$ is
any countable set of distributions including $\mu$. This is a
generalization of Solomonoff induction, in which $M$ is the set of
all enumerable semi-measures. Systems are considered which predict
$y_t$, given $x_1...x_{t-1}$, and which receive loss $l_{x_t y_t}$
if $x_t$ is the true next symbol of the sequence. It is proven
that using the universal $\xi$ as a prior is nearly as good as
using the unknown true distribution $\mu$. Furthermore, games of
chance, defined as a sequence of bets, observations, and rewards,
are studied. The time needed to reach the winning zone is bounded
in terms of the relative entropy of $\mu$ and $\xi$. Extensions to
arbitrary alphabets, partial and delayed prediction, and more
active systems are discussed.
\end{abstract}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}\label{secInt}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%------------------------------%
\subsection{Induction}
%------------------------------%
Many problems are of induction type, in which statements about the
future have to be made, based on past observations. What is the
probability of rain tomorrow, given the weather observations of
the last few days? Is the Dow Jones likely to rise tomorrow, given
the chart of the last years and possibly additional newspaper
information? Can we reasonably doubt that the sun will rise
tomorrow? Indeed, one definition of science is to predict the
future, where, as an intermediate step, one tries to understand
the past by developing theories and, as a consequence of
prediction, one tries to manipulate the future. All induction
problems may be studied in the Bayesian framework. The probability
of observing $x_t$ at time $t$, given the observations
$x_1...x_{t-1}$, can be computed with Bayes' rule, if we know the
true probability distribution of observation sequences
$x_1x_2x_3...$. The problem is that in many cases we do not even
have a reasonable guess of the true distribution $\mu$. What is
the true probability of weather sequences, stock charts, or
sunrises?
%------------------------------%
\subsection{Universal Sequence Prediction}
%------------------------------%
Solomonoff \cite{Solomonoff:64} had the idea to define a universal
probability distribution\footnote{We use the term {\it
distribution} somewhat imprecisely for a {\it probability
measure}.} $\xi$ as a weighted average over all possible
computable probability distributions. Lower weights were assigned
to more complex distributions. He unified Epicurus' principle of
multiple explanations, Occam's razor, and Bayes' rule into an
elegant formal theory. For a binary alphabet, the universal
conditional probability used for predicting $x_t$ converges to the
true conditional probability for $t\!\to\!\infty$ with probability
1. This convergence justifies using $\xi$ as a substitute for the
usually unknown $\mu$. The framework can
easily be generalized to other probability classes and weights
\cite{Solomonoff:78}.
%------------------------------%
\subsection{Contents}
%------------------------------%
The main aim of this work is to prove expected loss bounds for
general loss functions which measure the performance of $\xi$
relative to $\mu$, and to apply the results to games of chance.
Details and proofs can be found in \cite{Hutter:01op}.
%
There are good introductions and surveys of Solomonoff sequence
prediction \cite{Li:97}, inductive inference
\cite{Angluin:83,Solomonoff:97}, reasoning under uncertainty
\cite{Gruenwald:98}, and competitive online statistics
\cite{Vovk:99} with interesting relations to this work. See
\cite{Hutter:01op} and subsection \ref{subsecWM} for details.
{\bf Section \ref{secSetup}} explains notation and defines the
generalized universal distribution $\xi$ as the $w_{\mu_i}$-weighted
sum of the probability distributions $\mu_i$ in a set $M$,
which must include the true distribution $\mu$. This
generalization is straightforward and causes no problems. $\xi$
multiplicatively dominates all $\mu_i\!\in\!M$, and the relative
entropy between $\mu$ and $\xi$ is bounded by $\ln{1\over w_\mu}$.
Convergence of $\xi$ to $\mu$ is shown in Theorem \ref{thConv}.
{\bf Section \ref{secLoss}} considers the case where a prediction
or action $y_t\!\in\!\cal Y$ results in a loss $l_{x_t y_t}$ if
$x_t$ is the next symbol of the sequence. Optimal universal
$\Lambda_\xi$ and optimal informed $\Lambda_\mu$ prediction
schemes are defined for this case and loss bounds are proved.
Theorems \ref{thULoss} and \ref{thGLoss} bound the total loss
$L_\xi$ of $\Lambda_\xi$ by the total loss $L_\mu$ of
$\Lambda_\mu$ {\em plus} $O(\sqrt{L_\mu})$ terms.
{\bf Section \ref{secGames}} applies Theorem \ref{thGLoss} to
games of chance, defined as a sequence of bets, observations, and
rewards. The average profit $\bar p_{n\Lambda_\xi}$ achieved by the
$\Lambda_\xi$ scheme rapidly converges to the best possible
average profit $\bar p_{n\Lambda_\mu}$ achieved by the $\Lambda_\mu$
scheme ($\bar p_{n\Lambda_\xi}\!-\!\bar p_{n\Lambda_\mu}\!=\!O(n^{-1/2})$).
If there is a profitable scheme at all, asymptotically the universal
$\Lambda_\xi$ scheme will also become profitable. Theorem
\ref{thWin} lower bounds the time needed to reach the winning
zone in terms of the relative entropy of $\mu$ and $\xi$.
An attempt is made to give an information-theoretic
interpretation of the result.
{\bf Section \ref{secOut}} outlines possible extensions of the
presented theory and results. They include arbitrary alphabets,
partial, delayed and probabilistic prediction, classification,
even more general loss functions, active systems influencing the
environment, learning aspects, and a comparison to the weighted
majority algorithm(s) and loss bounds.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Setup and Convergence}\label{secSetup}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%------------------------------%
\subsection{Strings and Probability Distributions}
%------------------------------%
We denote binary strings by $x_1x_2...x_n$ with
$x_t\!\in\!\{0,1\}$. We further use the abbreviations
$x_{n:m}:=x_nx_{n+1}...x_{m-1}x_m$ and $x_{<n}:=x_{1:n-1}$. We use
Greek letters for probability distributions; $\mu(x_{1:n})$ denotes
the probability that a sequence starts with $x_{1:n}$, and
$\mu(x_t|x_{<t})=\mu(x_{1:t})/\mu(x_{<t})$ is the conditional
probability of the next symbol.
%------------------------------%
\subsection{Universal Prior Probability Distribution}
%------------------------------%
Let $M=\{\mu_1,\mu_2,...\}$ be a countable set of probability
distributions with weights $w_{\mu_i}$. The universal distribution
$\xi$ is defined as the weighted sum
\bqa\label{xidef}\displaystyle
\xi(x_{1:n}) \;:=\; \sum_{\mu_i\in M}w_{\mu_i}\!\cdot\!\mu_i(x_{1:n}),
\quad \sum_{\mu_i\in M}w_{\mu_i}=1, \quad w_{\mu_i}>0.
\eqa
It is easy to see that $\xi$ is a probability distribution as the
weights $w_{\mu_i}$ are positive and normalized to 1 and the $\mu_i\!\in\!M$ are
probabilities. For finite $M$, a possible choice for the weights is
to give all $\mu_i$ equal weight ($w_{\mu_i}={1\over|M|}$). We call $\xi$
universal relative to $M$, as it multiplicatively dominates all
distributions in $M$:
\beq\label{unixi}
\xi( x_{1:n}) \;\geq\;
w_{\mu_i}\!\cdot\!\mu_i( x_{1:n}) \quad\mbox{for all}\quad
\mu_i\in M.
\eeq
In the following, we assume that $M$ is known and contains
the true distribution, i.e. $\mu\!\in\!M$.
%
This is not a serious constraint if we include {\it all}
computable probability distributions in $M$ with a high weight
assigned to simple $\mu_i$. Solomonoff's universal semi-measure is
obtained if we include all enumerable semi-measures in $M$ with
weights $w_{\mu_i}\!\sim\!2^{-K(\mu_i)}$, where $K(\mu_i)$ is the
length of the shortest program for $\mu_i$
\cite{Solomonoff:64,Solomonoff:78,Li:97}. A detailed discussion of
various general purpose choices for $M$ is given in
\cite{Hutter:01op}.
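The dominance property (\ref{unixi}) and the convergence of $\xi$ to $\mu$ can be illustrated with a small computation. The Bernoulli class and uniform weights below are illustrative choices only, not the general-purpose choices for $M$ discussed above:

```python
# Toy numerical check of the dominance property (unixi) and of the
# convergence of xi to mu; the Bernoulli class and uniform weights
# are illustrative, not the general-purpose M discussed in the text.
import random

random.seed(1)
thetas = [0.2, 0.5, 0.8]          # M: three Bernoulli(theta) distributions
weights = [1.0 / 3] * 3           # w_{mu_i}: positive, summing to 1
true_theta = 0.8                  # the true distribution mu is in M

def mu_i(xs, theta):
    """Probability that a Bernoulli(theta) sequence starts with xs."""
    p = 1.0
    for x in xs:
        p *= theta if x == 1 else 1.0 - theta
    return p

def xi(xs):
    """Universal mixture xi(x_{1:n}) = sum_i w_{mu_i} mu_i(x_{1:n})."""
    return sum(w * mu_i(xs, th) for w, th in zip(weights, thetas))

xs = [1 if random.random() < true_theta else 0 for _ in range(200)]

# Multiplicative dominance (unixi): xi >= w_mu * mu on every prefix.
dominates = all(
    xi(xs[:n]) >= weights[2] * mu_i(xs[:n], true_theta)
    for n in range(1, 201)
)

# Predictive probability xi(1|x_{<t}) after 200 observations; it
# approaches the true conditional probability mu(1|x_{<t}) = 0.8.
xi_pred = xi(xs + [1]) / xi(xs)
print(dominates, xi_pred)
```

Dominance holds on every prefix because $\xi$ is a sum of nonnegative terms, one of which is $w_\mu\mu$; the predictive probability concentrates on the true parameter as the mixture posterior does.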
Furthermore, we need the relative entropy between $\mu$ and $\xi$:
\beq\label{hn}
h_t(x_{<t}) \;:=\; \sum_{x_t}\mu(x_t|x_{<t})
\ln{\mu(x_t|x_{<t})\over\xi(x_t|x_{<t})},
\eeq
and its sum over the first $n$ steps, which is bounded by the
weight of $\mu$ in $\xi$:
\beq\label{entropy}
H_n \;:=\; \sum_{t=1}^n\sum_{x_{<t}}\mu(x_{<t})\,h_t(x_{<t})
\;\leq\; \ln{\textstyle{1\over w_\mu}} \;=:\; d_\mu.
\eeq
\begin{theorem}[Convergence]\label{thConv}
$\xi(x_t|x_{<t})$ converges to $\mu(x_t|x_{<t})$ for
$t\!\to\!\infty$ with $\mu$-probability 1; more precisely,
$\sum_{t=1}^n\sum_{x_{1:t}}\mu(x_{<t})\big(\mu(x_t|x_{<t})-
\xi(x_t|x_{<t})\big)^2 \;\leq\; H_n \;\leq\; d_\mu \;<\; \infty$.
\end{theorem}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Loss Bounds}\label{secLoss}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We now consider the case where a prediction or action
$y_t\!\in\!\cal Y$ results in a loss $l_{x_t y_t}\!\in\![0,1]$ if
$x_t$ is the next symbol of the sequence. The $\Lambda_\rho$
scheme predicts the $y_t$ that minimizes the $\rho$-expected loss.
For binary alphabet and binary prediction this amounts to
predicting $y_t\!=\!1$ iff $\rho(1|x_{<t})\!>\,\gamma$, where
$\gamma\!:=\!{l_{01}-l_{00}\over l_{01}-l_{00}+l_{10}-l_{11}}$. As
the true distribution is $\mu$, the actual $\mu$-expected loss
when $\Lambda_\rho$ predicts the $t^{th}$ symbol and the total
$\mu$-expected loss in the first $n$ predictions are
\bqa\label{rholoss}\displaystyle
l_{t\Lambda_\rho}(x_{<t}) \;:=\; \sum_{x_t}\mu(x_t|x_{<t})\,
l_{x_t y_t^{\Lambda_\rho}}, \qquad
L_{n\Lambda_\rho} \;:=\; \sum_{t=1}^n\sum_{x_{<t}}
\mu(x_{<t})\,l_{t\Lambda_\rho}(x_{<t}).
\eqa
Theorem \ref{thULoss} bounds the total loss $L_{n\Lambda_\xi}$ of
the universal scheme by the total loss $L_{n\Lambda_\mu}$ of the
informed scheme plus terms of order $\sqrt{L_{n\Lambda_\mu}}$:
\begin{theorem}[Universal Loss Bound]\label{thULoss}
$\;0 \;\leq\; L_{n\Lambda_\xi}-L_{n\Lambda_\mu} \;\leq\;
H_n+\sqrt{4L_{n\Lambda_\mu}H_n+H_n^2}$.
\end{theorem}
{\it Proof sketch.} We try to find constants $A\!>\!0$ and
$B\!>\!0$ that satisfy the linear
inequality
\beq\label{Eineq3}
L_{n\Lambda_\xi} \;\leq\; (A+1)L_{n\Lambda_\mu} + (B+1)H_n.
\eeq
If we could show
\beq\label{eineq3}
l_{t\Lambda_\xi}(x_{<t}) \;\leq\; (A+1)\,l_{t\Lambda_\mu}(x_{<t})
+ (B+1)\,h_t(x_{<t})
\eeq
for all histories $x_{<t}$, then (\ref{Eineq3}) would follow by
summation over $t$ and averaging over the histories. Let
$m\!:=\!y_t^{\Lambda_\mu}$ and $s\!:=\!y_t^{\Lambda_\xi}$ be the
predictions of the informed and the universal scheme, and
abbreviate $y_i\!:=\!\mu(i|x_{<t})$ and $z_i\!:=\!\xi(i|x_{<t})$.
With $A'\!:=\!A\!+\!1$ and $B'\!:=\!B\!+\!1$, (\ref{eineq3}) reads
\beq\label{lossineqf2}\textstyle
B'h_t(x_{<t}) + A'\sum_i y_i l_{im} - \sum_i y_i l_{is}
\;\stackrel?\geq\; 0.
\eeq
Since $m$/$s$ minimizes the $\mu$/$\xi$-expected loss, we have the
constraints
\beq\label{lcnstr}
\textstyle\sum_i y_i l_{im} \;\leq\; \sum_i y_i l_{is}
\quad\mbox{and}\quad
\sum_i z_i l_{is} \;\leq\; \sum_i z_i l_{im}.
\eeq
The cases $l_{im}\!>\!l_{is}\,\forall i$ and
$l_{is}\!>\!l_{im}\,\forall
i$ contradict the first/second inequality (\ref{lcnstr}).
Hence we can assume $l_{0m}\!\geq\!l_{0s}$ and
$l_{1m}\!\leq\!l_{1s}$. The symmetric case $l_{0m}\!\leq\!l_{0s}$ and
$l_{1m}\!\geq\!l_{1s}$ is proved analogously or can be reduced to the
first case by renumbering the indices ($0\leftrightarrow 1$).
Using the abbreviations $a\!:=\!l_{0m}\!-\!l_{0s}$, $b\!:=\!l_{1s}\!-\!l_{1m}$,
$c\!:=\!y_1l_{1m}\!+\!y_0l_{0s}$, $y\!=\!y_1\!=\!1\!-\!y_0$ and
$z\!=\!z_1\!=\!1\!-\!z_0$ we can write (\ref{lossineqf2}) as
\beq\label{lossineqf3}\textstyle
f(y,z) \;:=\;
B'[y\ln{y\over z}+(1\!-\!y)\ln{1-y\over 1-z}]
+ A'(1\!-\!y)a-yb+Ac \;\stackrel?\geq\; 0
\eeq
for $zb\!\leq\!(1-z)a$ and $0\!\leq\!a,b,c,y,z\!\leq\!1$. The
constraint (\ref{lcnstr}) on $y$ has been dropped since
(\ref{lossineqf3}) will turn out to be true for all $y$.
Furthermore, we can assume that $d\!:=\!A'(1-y)a-yb\!\leq\!0$
since for $d\!>\!0$, $f$ is trivially positive ($h_t\!\geq\!0$).
Multiplying $d$ by a constant $\geq\!1$ will decrease $f$. Let
us first consider the case $z\!\leq\!\odt$. We multiply the $d$
term by $1/b\!\geq 1$, i.e. replace it with $A'(1-y){a\over b}-y$.
From the constraint on $z$ we know that ${a\over
b}\!\geq\!{z\over 1-z}$. We can decrease $f$ further by replacing
${a\over b}$ by ${z\over 1-z}$ and by dropping $Ac$. Hence,
(\ref{lossineqf3}) is proved for $z\!\leq\!\odt$ if we can prove
\beq\label{lossineq1}\textstyle
B'[y\ln{y\over z}+(1\!-\!y)\ln{1-y\over 1-z}]
+ A'(1\!-\!y){z\over 1-z}-y \;\stackrel?\geq\; 0 \quad\mbox{for}\quad
z\leq\odt.
\eeq
The case $z\!\geq\!\odt$ is treated similarly.
We scale $d$ by $1/a\!\geq 1$, i.e. replace it with $A'(1-y)-y{b\over a}$.
From the constraint on $z$ we know that ${b\over
a}\!\leq\!{1-z\over z}$. We decrease $f$ further by replacing
${b\over a}$ by ${1-z\over z}$ and by dropping $Ac$.
Hence (\ref{lossineqf3}) is proved for $z\!\geq\!\odt$ if we can prove
\beq\label{lossineq2}\textstyle
B'[y\ln{y\over z}+(1\!-\!y)\ln{1-y\over 1-z}]
+ A'(1\!-\!y)-y{1-z\over z} \;\stackrel?\geq\; 0 \quad\mbox{for}\quad
z\geq\odt.
\eeq
In \cite{Hutter:01op} we prove that (\ref{lossineq1}) and
(\ref{lossineq2}) indeed hold for $B\!\geq\!\odf A+\odA$. The
cautious reader may check the inequalities numerically. So in
summary we proved that (\ref{Eineq3}) holds for $B\!\geq\!\odf
A+\odA$. Inserting $B\!=\!\odf A+\odA$ into (\ref{Eineq3}) and
minimizing the r.h.s.\ with respect to $A$ leads to the bound of
Theorem \ref{thULoss} (with
$A^2\!=\!H_n/(L_{n\Lambda_\mu}\!+\!\odf H_n)$) $\qed$.
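As noted above, the cautious reader may check inequalities (\ref{lossineq1}) and (\ref{lossineq2}) numerically. A minimal grid check in Python; it assumes $A'\!=\!A\!+\!1$ and $B'\!=\!B\!+\!1$ (an identification consistent with how the primed constants relate to (\ref{eineq3}) here), and the grid resolution and sampled values of $A$ are arbitrary choices:

```python
# Grid check of inequalities (lossineq1) and (lossineq2) for
# B = A/4 + 1/A.  We take A' = A+1 and B' = B+1 (an assumption about
# the primed constants, consistent with the proof sketch above) and
# scan interior grid points (y, z); a sanity check, not a proof.
import math

def kl(y, z):
    """Binary relative entropy y ln(y/z) + (1-y) ln((1-y)/(1-z))."""
    return y * math.log(y / z) + (1 - y) * math.log((1 - y) / (1 - z))

def worst_case(A, steps=200):
    B = A / 4.0 + 1.0 / A
    Ap, Bp = A + 1.0, B + 1.0
    worst = float("inf")
    for i in range(1, steps):
        y = i / steps
        for j in range(1, steps):
            z = j / steps
            if z <= 0.5:   # inequality (lossineq1)
                f = Bp * kl(y, z) + Ap * (1 - y) * z / (1 - z) - y
            else:          # inequality (lossineq2)
                f = Bp * kl(y, z) + Ap * (1 - y) - y * (1 - z) / z
            worst = min(worst, f)
    return worst

worst = min(worst_case(A) for A in (0.25, 0.5, 1.0, 2.0, 4.0))
print(worst)   # should not be (significantly) negative
```

Only interior grid points are scanned; the boundary cases $y,z\!\in\!\{0,1\}$ follow by taking limits.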
%------------------------------%
\subsection{General Loss}
%------------------------------%
Only a few restrictions are imposed on the loss
$l_{x_t y_t}$ in Theorem \ref{thULoss}, namely that it is static
and takes values in the unit interval $[0,1]$. If we look at the proof of
Theorem \ref{thULoss}, we see that the time-independence has not
been used at all. The proof is still valid for an individual loss
function $l_{x_t y_t}^t\!\in\![0,1]$ for each step $t$. The loss
might even depend on the actual history $x_{<t}$. Theorem
\ref{thGLoss} states the resulting bound for such general loss
functions:
\begin{theorem}[General Loss Bound]\label{thGLoss}
For any loss function $l^t_{x_t y_t}(x_{<t})\!\in\![0,1]$, which may
depend on the time $t$ and the history $x_{<t}$,
$\;0 \;\leq\; L_{n\Lambda_\xi}-L_{n\Lambda_\mu} \;\leq\;
H_n+\sqrt{4L_{n\Lambda_\mu}H_n+H_n^2}$.
\end{theorem}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Games of Chance}\label{secGames}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We consider games of chance, defined as a sequence of bets,
observations, and rewards. In round $t$ we place a bet
$y_t\!\in\!\cal Y$, depending on the history $x_{<t}$, observe the
outcome $x_t$, and receive a profit
$p_{x_t y_t}\!\in\![p_{min},p_{max}]$, where
$p_\Delta\!:=\!p_{max}\!-\!p_{min}$ is the profit range. With the
rescaled loss $l_{x_t y_t}\!:=\!(p_{max}\!-\!p_{x_t y_t})/p_\Delta$,
the total $\mu$-expected profit of a scheme $\Lambda_\rho$ is
$P_{n\Lambda_\rho}\!=\!np_{max}\!-\!p_\Delta L_{n\Lambda_\rho}$. In
the following we assume $P_{n\Lambda_\mu}\!>\!0$ (otherwise there
is no winning strategy at all, since
$P_{n\Lambda_\mu}\!\geq\!P_{n\Lambda_\rho}\,\forall\rho$). Often
we are not in the favorable position of knowing $\mu$, but we know
(or assume) that $\mu\!\in\!M$ for some $M$, for instance that
$\mu$ is a computable probability distribution. From Theorem
\ref{thGLoss} we see that the average profit per round $\bar
p_{n\Lambda_\xi}\!:=\!{1\over n}P_{n\Lambda_\xi}$ of the universal
$\Lambda_\xi$ scheme converges to the average profit per round
$\bar p_{n\Lambda_\mu}\!:=\!{1\over n}P_{n\Lambda_\mu}$ of the
optimal informed scheme, i.e.\ asymptotically we can make the same
money even without knowing $\mu$, by just using the universal
$\Lambda_\xi$ scheme. Theorem \ref{thGLoss} allows us to lower
bound the universal profit $P_{n\Lambda_\xi}$:
\beq\label{pbnd}
P_{n\Lambda_\xi} \!\geq\! P_{n\Lambda_\mu} \!-\!
p_\Delta H_n\!-\!\sqrt{4(n p_{max}\!-\!P_{n\Lambda_\mu})
p_\Delta H_n\!+\!p_\Delta^2H_n^2}
\eeq
where $p_{max}$ is the maximal profit per round and
$p_\Delta$ the profit range. The time needed for
$\Lambda_\xi$ to perform well can also be estimated. An interesting
quantity is the expected number of rounds needed to reach the
winning zone. Using $P_{n\Lambda_\mu}\!>\!0$ one can show that the
r.h.s.\ of (\ref{pbnd}) is positive if and only if
\beq\label{pwin}
n \;>\; {2p_\Delta(2p_{max}\!-\!\bar p_{n\Lambda_\mu}) \over
\bar p_{n\Lambda_\mu}^2} \!\cdot\! H_n.
\eeq
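To get a feeling for the magnitudes involved in (\ref{pwin}), consider a small numeric example; all game parameters below are invented for illustration:

```python
# Numeric illustration of the winning-zone bound (pwin); the game
# parameters below are invented for this example.
import math

p_max, p_min = 1.0, -1.0          # profit per round in [p_min, p_max]
p_delta = p_max - p_min           # profit range p_Delta = 2
p_bar_mu = 0.1                    # average profit of the informed scheme
d_mu = 16 * math.log(2)           # d_mu = ln(1/w_mu) for w_mu = 2^-16

# Rounds needed before the r.h.s. of (pbnd) can become positive,
# bounding H_n by d_mu:
n_win = 2 * p_delta * (2 * p_max - p_bar_mu) / p_bar_mu**2 * d_mu
print(math.ceil(n_win))
```

Since $\bar p_{n\Lambda_\mu}$ enters squared in the denominator of (\ref{pwin}), halving the informed scheme's average profit roughly quadruples the number of rounds needed to reach the winning zone.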
%------------------------------%
\addcontentsline{toc}{paragraph}{Theorem \ref{thWin} (Time to Win)}
%------------------------------%
\begin{theorem}[Time to Win]\label{thWin}
Let there be binary sequences $x_1x_2...$ drawn with probability
$\mu( x_{1:n})$ for the first $n$ symbols. In step $t$ we make
a bet, depending on the history $x_{<t}$, observe the $t^{th}$
symbol $x_t$, and receive a profit in $[p_{min},p_{max}]$ with
range $p_\Delta\!=\!p_{max}\!-\!p_{min}$. Then
$$
\begin{array}{c}
\mbox{\sl(i)}\quad \bar p_{n\Lambda_\mu}\!-\!\bar p_{n\Lambda_\xi}
\;=\; O(\sqrt{d_\mu/n\,}) \\[2mm]
\mbox{\sl(ii)}\quad n \,>\, \Big({2p_\Delta\over
\bar p_{n\Lambda_\mu}}\Big)^{\!2}
\!\cdot\!d_\mu \quad\!\mbox{and}\quad\! \bar p_{n\Lambda_\mu}\!>\!0
\,\Longrightarrow\, \bar p_{n\Lambda_\xi}\!>\!0
\end{array}
$$
where $w_\mu=e^{-d_\mu}$ is the weight (\ref{xidef}) of $\mu$ in $\xi$.
\end{theorem}
By dividing (\ref{pbnd}) by $n$ and using $H_n\!\leq\!d_\mu$
(\ref{entropy}) we see that the leading order of
$\bar p_{n\Lambda_\xi}\!-\!\bar p_{n\Lambda_\mu}$ is bounded by
$\sqrt{4p_\Delta p_{max}d_\mu/n}$, which proves {\sl$(i)$}. The
condition in {\sl$(ii)$} is actually a weakening of (\ref{pwin}).
$P_{n\Lambda_\xi}$ is trivially positive for $p_{min}\!>\!0$, since
in this wonderful case {\it all} profits are positive. For
negative $p_{min}$ the condition of {\sl$(ii)$} implies
(\ref{pwin}), since $p_\Delta\!>\!p_{max}$, and (\ref{pwin}) implies
positive (\ref{pbnd}), i.e. $P_{n\Lambda_\xi}\!>\!0$, which proves
{\sl$(ii)$}.
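One way to spell out the arithmetic behind this implication (a reconstruction of the omitted step, using $H_n\!\leq\!d_\mu$, $p_{max}\!\leq\!p_\Delta$ for $p_{min}\!\leq\!0$, and $\bar p_{n\Lambda_\mu}\!>\!0$):

```latex
\beqn
{2p_\Delta(2p_{max}\!-\!\bar p_{n\Lambda_\mu}) \over
 \bar p_{n\Lambda_\mu}^2}\!\cdot\!H_n
\;\leq\;
{2p_\Delta\!\cdot\!2p_\Delta \over \bar p_{n\Lambda_\mu}^2}\!\cdot\!d_\mu
\;=\;
\Big({2p_\Delta\over\bar p_{n\Lambda_\mu}}\Big)^{\!2}\!\cdot\!d_\mu,
\eeqn
```

so any $n$ exceeding the threshold in {\sl$(ii)$} also satisfies (\ref{pwin}).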
If a winning strategy $\Lambda_\rho$ with
$\bar p_{n\Lambda_\rho}\!>\!\eps\!>\!0$ exists, then $\Lambda_\xi$ is
asymptotically also a winning strategy with the same average profit.
%------------------------------%
\subsection{Information-Theoretic Interpretation}
%------------------------------%
We try to give an intuitive explanation of Theorem
\ref{thWin}{\sl$(ii)$}. We know that $\xi(x_t|x_{<t})$ converges
to $\mu(x_t|x_{<t})$.