%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Convergence and Error Bounds for Universal Prediction %%
%% of Nonbinary Sequences %%
%% %%
%% Marcus Hutter: Start: 11.10.00 LastEdit: 14.06.01 %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%-------------------------------%
% Document-Style %
%-------------------------------%
\documentclass[12pt]{article}
\topmargin=0cm \oddsidemargin=0cm \evensidemargin=0cm
\textwidth=16cm \textheight=23cm \unitlength=1mm \sloppy
%-------------------------------%
% Macro-Definitions %
%-------------------------------%
\renewenvironment{abstract}{\centerline{\bf
Abstract}\vspace{0.5ex}\begin{quote}\small}{\par\end{quote}\vskip
1ex}
\newenvironment{keywords}{\centerline{\bf
Keywords}\vspace{0.5ex}\begin{quote}\small}{\par\end{quote}\vskip
1ex}
\newtheorem{theorem}{Theorem}
\def\eqd{\stackrel{\bullet}{=}}
\def\ff{\Longrightarrow}
\def\gdw{\Longleftrightarrow}
\def\toinfty#1{\stackrel{#1\to\infty}{\longrightarrow}}
\def\gtapprox{\buildrel{\lower.7ex\hbox{$>$}}\over
{\lower.7ex\hbox{$\sim$}}}
\def\nq{\hspace{-1em}}
\def\look{\(\uparrow\)}
\def\ignore#1{}
\def\deltabar{{\delta\!\!\!^{-}}}
\def\qed{\sqcap\!\!\!\!\sqcup}
\def\odt{{\textstyle{1\over 2}}}
\def\odf{{\textstyle{1\over 4}}}
\def\odA{{\textstyle{1\over A}}}
\def\hbar{h\!\!\!\!^{-}\,}
\def\dbar{d\!\!^{-}\!}
\def\eps{\varepsilon}
\def\beq{\begin{equation}}
\def\eeq{\end{equation}}
\def\beqn{\begin{displaymath}}
\def\eeqn{\end{displaymath}}
\def\bqa{\begin{equation}\begin{array}{c}}
\def\eqa{\end{array}\end{equation}}
\def\bqan{\begin{displaymath}\begin{array}{c}}
\def\eqan{\end{array}\end{displaymath}}
\def\pb#1{\underline{#1}} % probability notation (underline the argument)
\def\blank{{\,_\sqcup\,}} % blank position
\def\maxarg{\mathop{\rm maxarg}} % maxarg
\def\minarg{\mathop{\rm minarg}} % minarg
\def\hh#1{{\dot{#1}}} % historic I/O
\def\best{*} % or {best}
\def\vec#1{{\bf #1}}
\begin{document}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% T i t l e - P a g e %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{titlepage}
%\hfill Manno, 18.10.2000
%\hfill{\Large Preliminary Version 1}
\begin{center}\vspace*{-1cm}
{\small Technical Report IDSIA-07-01, 26 February 2001} \\[1cm]
{\LARGE\sc\hrule height1pt \vskip 4mm Convergence and Error Bounds \\
for Universal Prediction \\
of Nonbinary Sequences\vskip 3mm \hrule height1pt} \vspace{1.5cm}
{\bf Marcus Hutter} \\[1cm]
{\rm IDSIA, Galleria 2, CH-6928 Manno-Lugano, Switzerland} \\
{\rm\footnotesize marcus@idsia.ch
\footnote{This work was supported by SNF grant 2000-61847.00 to J\"urgen Schmidhuber.}
\qquad http://www.idsia.ch/$^{_{_\sim}}\!$marcus} \\[1cm]
\end{center}
%?? class: Artificial Intelligence; Learning
%ACM class: F.2.3.; ???
\begin{keywords}
Bayesian sequence prediction; Solomonoff induction; Kolmogorov
complexity; learning; universal probability; finite non-binary
alphabet; convergence; error bounds;
games of chance; partial and delayed prediction; classification.
\end{keywords}
\begin{abstract}
Solomonoff's uncomputable universal prediction scheme $\xi$ allows
one to predict the next symbol $x_k$ of a sequence $x_1...x_{k-1}$ for
any Turing computable, but otherwise unknown, probabilistic
environment $\mu$. This scheme will be generalized to arbitrary
environmental classes, which, among other things, allows the
construction of computable universal prediction schemes $\xi$.
Convergence of $\xi$ to $\mu$ in a conditional mean squared sense
and with $\mu$ probability $1$ is proven. It is shown that the
average number of prediction errors made by the universal $\xi$
scheme rapidly converges to that of the best possible
informed $\mu$ scheme. The schemes, theorems and proofs are given
for general finite alphabet, which results in additional
complications as compared to the binary case. Several extensions
of the presented theory and results are outlined. They include
general loss functions and bounds, games of chance, infinite
alphabet, partial and delayed prediction, classification, and more
active systems.
\end{abstract}
\end{titlepage}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}\label{secInt}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%------------------------------%
%\paragraph{Induction}
%------------------------------%
The Bayesian framework is ideally suited for studying induction
problems. The probability of observing $x_k$ at time $k$, given
past observations $x_1...x_{k-1}$, can be computed with Bayes'
rule if the generating probability distribution $\mu$, from which
sequences $x_1x_2x_3...$ are drawn, is known. The problem,
however, is that in many cases one does not even have a reasonable
estimate of the true generating distribution. What is the true
probability of weather sequences or stock charts?
%------------------------------%
%\paragraph{Universal Sequence Prediction}
%------------------------------%
In order to overcome this problem we define a universal
distribution $\xi$ as a weighted sum of distributions
$\mu_i\!\in\!M$, where $M$ is any finite or countable set of
distributions including $\mu$. This is a generalization of
Solomonoff induction, in which $M$ is the set of all enumerable
semi-measures \cite{Solomonoff:64,Solomonoff:78}.
%------------------------------%
%\paragraph{Optimality: Convergence and Error Bounds}
%------------------------------%
We show that using the universal $\xi$ as a prior is nearly as
good as using the unknown generating distribution $\mu$. In a sense,
this solves, in a universal way, the problem that the generating
distribution $\mu$ is not known. All results are obtained for a general
finite alphabet.
%
Convergence of $\xi$ to $\mu$ in a conditional mean squared sense
and with $\mu$ probability $1$ is proven. The number of
errors $E_{\Theta_\xi}$ made by the universal prediction
scheme $\Theta_\xi$ based on $\xi$ minus the number of errors
$E_{\Theta_\mu}$ of the optimal informed prediction scheme
$\Theta_\mu$ based on $\mu$ is proven to be bounded by
$O(\sqrt{E_{\Theta_\mu}})$.
%------------------------------%
%\paragraph{Contents}
%------------------------------%
Extensions to arbitrary loss functions, games of chance, infinite
alphabet, partial and delayed prediction, classification, and
more active systems are discussed (Section \ref{secOut}).
%
The main new results are a generalization of the universal
probability $\xi$ \cite{Solomonoff:64} to arbitrary probability
classes and weights (Section \ref{secSetup}), a generalization of
the convergence \cite{Solomonoff:78} $\xi\to\mu$ (Section
\ref{secConv}) and the error bounds \cite{Hutter:99} to arbitrary
alphabet (Section \ref{secErr}). The non-binary setting causes
substantial additional complications. Non-binary prediction cannot
be (easily) reduced to the binary case. One might consider a
binary coding of the symbols $x_k$ in the sequence
$x_1x_2...$, but this makes it necessary to predict the whole block of
bits representing $x_k$ before receiving the true block, which
differs from the bit-by-bit prediction considered in
\cite{Solomonoff:78,Li:97,Hutter:99}.
%------------------------------%
%\paragraph{Introductory references}
%------------------------------%
For an excellent introduction to Kolmogorov complexity and
Solomonoff induction one should consult the book of Li and
Vit\'anyi \cite{Li:97} or the article \cite{Li:92b} for a short
course. Historical surveys of inductive reasoning and inference
can be found in \cite{Angluin:83,Solomonoff:97}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Setup}\label{secSetup}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%------------------------------%
\subsection{Strings and Probability Distributions}
%------------------------------%
We denote strings over a finite alphabet $\cal A$ by
$x_1x_2...x_n$ with $x_k\!\in\!\cal A$. We further use the
abbreviations $x_{n:m}:=x_nx_{n+1}...x_{m-1}x_m$ and
$x_{<n}:=x_1...x_{n-1}$. We use Greek letters for probability
distributions, and we underline their arguments to indicate that
they are probability arguments. Let $\rho(\pb x_{1:n})$ be the
probability that a sequence starts with $x_{1:n}$. The conditional
probability that a string $x_{<n}$ is followed by $x_n$ is
$\rho(\pb x_n|x_{<n}):=\rho(\pb x_{1:n})/\rho(\pb x_{<n})$.
%------------------------------%
\subsection{Universal Prior Probability Distribution}
%------------------------------%
We assume that the sequences are drawn from a probability
distribution $\mu$, which is unknown. Let $M=\{\mu_1,\mu_2,...\}$
be a finite or countable set of candidate distributions. We define
the universal prior probability $\xi$ as the $w$-weighted average
of the distributions $\mu_i\!\in\!M$:
%
\beq\label{xidef}
\xi(\pb x_{1:n}) \;:=\; \sum_{\mu_i\in M}
w_{\mu_i}\!\cdot\!\mu_i(\pb x_{1:n}),
\qquad
\sum_{\mu_i\in M}w_{\mu_i}=1,\quad w_{\mu_i}>0.
\eeq
%
It is easy to see that $\xi$ is a probability distribution as
the weights $w_{\mu_i}$ are positive and normalized to 1 and the
$\mu_i\!\in\!M$ are probabilities. For finite $M$ a possible
choice for the $w$ is to give all $\mu_i$ equal weight
($w_{\mu_i}={1\over|M|}$). We call $\xi$ universal relative to
$M$, as it multiplicatively dominates all distributions in $M$:
%
\beq\label{unixi}
\xi(\pb x_{1:n}) \;\geq\;
w_{\mu_i}\!\cdot\!\mu_i(\pb x_{1:n}) \quad\mbox{for all}\quad
\mu_i\in M.
\eeq
%
In the following, we assume that $M$ is known and
contains\footnote{Actually all theorems remain valid for $\mu$ being a finite linear
combination of $\mu_i\in L\subseteq M$ and
$w_\mu:=\min_{\mu_i\in L}w_{\mu_i}$ \cite{Hutter:01op}.}
the true generating distribution, i.e. $\mu\!\in\!M$. We will see that this
is not a serious constraint, as we can always choose $M$ to be
sufficiently large. In the next section we show the important
property that $\xi$ converges to the generating distribution $\mu$
in a mean squared sense and, hence, may serve as a substitute for
the true generating, but in general unknown, distribution $\mu$.
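The mixture definition of $\xi$ and the dominance property (\ref{unixi}) are easily illustrated numerically. The following sketch is not part of the formal development; the class $M$ of three i.i.d.\ categorical distributions over a ternary alphabet is a hypothetical example. It verifies that $\xi$ is normalized over ${\cal A}^n$ and multiplicatively dominates every $\mu_i\in M$:

```python
# Illustrative sketch (hypothetical example class, not from the paper):
# xi as a w-weighted mixture of three i.i.d. categorical distributions
# mu_i over the ternary alphabet A = {0, 1, 2}.
from itertools import product

A = [0, 1, 2]
M = [(0.8, 0.1, 0.1), (0.1, 0.8, 0.1), (1/3, 1/3, 1/3)]
w = [1/3, 1/3, 1/3]                  # equal weights w_i = 1/|M| for finite M

def mu(i, x):                        # mu_i(x_1...x_n) for an i.i.d. source
    p = 1.0
    for s in x:
        p *= M[i][s]
    return p

def xi(x):                           # xi(x_1...x_n) = sum_i w_i mu_i(x_1...x_n)
    return sum(wi * mu(i, x) for i, wi in enumerate(w))

n = 4
seqs = list(product(A, repeat=n))
total = sum(xi(x) for x in seqs)     # should be 1: xi is a distribution on A^n
dominates = all(xi(x) >= w[i] * mu(i, x)   # xi(x) >= w_i mu_i(x) for all x, i
                for x in seqs for i in range(len(M)))
```

Since every summand of the mixture is nonnegative, dropping all terms but one immediately gives the dominance (\ref{unixi}); the check above merely confirms this numerically.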
%------------------------------%
\subsection{Probability Classes}
%------------------------------%
We get a rather wide class $M$ if we include {\it all} computable
probability distributions in $M$. In this case, the assumption
$\mu\!\in\!M$ is very weak, as it only assumes that the strings
are drawn from {\it any computable} distribution; and all valid
physical theories (and, hence, all environments) {\it are}
computable (in a probabilistic sense).
We will see that it is favorable to assign high weights
$w_{\mu_i}$ to the $\mu_i$. Simplicity should be favored over
complexity, according to Occam's razor. In our context this means
that a high weight should be assigned to simple $\mu_i$. The
prefix Kolmogorov complexity $K(\mu_i)$ is a universal complexity
measure \cite{Kolmogorov:65,Zvonkin:70,Li:97}. It is defined as
the length of the shortest self-delimiting program (on a universal
Turing machine) computing $\mu_i(x_{1:n})$ given $x_{1:n}$. If we
define \beqn
w_{\mu_i} \;:=\; {1\over\Omega}2^{-K(\mu_i)} \quad,\quad
\Omega \;:=\; \sum_{\mu_i\in M}2^{-K(\mu_i)}
\eeqn then distributions which can be calculated by short
programs have high weights. Besides ensuring correct
normalization, $\Omega$ (sometimes called the number of wisdom)
has interesting properties in itself
\cite{Calude:98,Chaitin:91}. If we enlarge $M$ to
include all enumerable semi-measures, we attain Solomonoff's
universal probability, apart from normalization, which has to be
treated differently in this case
\cite{Solomonoff:64,Solomonoff:78}. Recently, $M$ has been further
enlarged to include all cumulatively enumerable semi-measures
\cite{Schmidhuber:01}. In all cases, $\xi$ is not finitely
computable, but can still be approximated to arbitrary but not
pre-specifiable precision. If we consider {\it all} approximable
(i.e.\ asymptotically computable) distributions, then the
universal distribution $\xi$, although still well defined, is not
even approximable \cite{Schmidhuber:01}. An interesting and
quickly approximable distribution is the Speed prior $S$ defined
in \cite{Schmidhuber:01}. It is related to Levin complexity and
Levin search \cite{Levin:73,Levin:84}, but it is unclear for now
which distributions are dominated by $S$. If one considers only
finite-state automata instead of general Turing machines, one can
attain a quickly computable, universal finite-state prediction
scheme related to that of Feder et al. \cite{Feder:92}, which
itself is related to the famous Lempel-Ziv data compression
algorithm. If one has extra knowledge on the source generating the
sequence, one might further reduce $M$ and increase $w$. A
detailed analysis of these and other specific classes $M$ will be
given elsewhere. Note that $\xi\!\in\!M$ in the enumerable and
cumulatively enumerable case, but $\xi\!\not\in\!M$ in the
computable, approximable and finite-state case. If $\xi$ is itself
in $M$, it is called a universal element of $M$ \cite{Li:97}. As
we do not need this property here, $M$ may be {\it any} finite or
countable set of distributions. In the following we consider
generic $M$ and $w$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Convergence}\label{secConv}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%------------------------------%
\subsection{Upper Bound for the Relative Entropy}
%------------------------------%
Let us define the relative entropy (also called Kullback Leibler
divergence \cite{Kullback:59}) between $\mu$ and $\xi$:
%
\beq\label{hn}
h_k(x_{<k}) \;:=\; \sum_{x_k\in\cal A}
\mu(\pb x_k|x_{<k})\ln{\mu(\pb x_k|x_{<k})\over\xi(\pb x_k|x_{<k})}.
\eeq
%
Summing over $k$ and taking the $\mu$ expectation ${\bf E}[\cdot]$,
the conditional probabilities telescope to joint probabilities, and
the dominance (\ref{unixi}) bounds the total expected relative
entropy by a constant:
%
\beqn
H_n \;:=\; \sum_{k=1}^n{\bf E}[h_k] \;=\;
\sum_{x_{1:n}\in{\cal A}^n}\mu(\pb x_{1:n})
\ln{\mu(\pb x_{1:n})\over\xi(\pb x_{1:n})} \;\leq\;
\ln w_\mu^{-1}.
\eeqn
%------------------------------%
\subsection{Lower Bound for the Relative Entropy}
%------------------------------%
The central inequality is a lower bound for the relative entropy
in terms of the squared Euclidean distance,
%
\beq\label{entroineqn}
\sum_{i=1}^N(y_i\!-\!z_i)^2 \;\leq\;
\sum_{i=1}^N y_i\ln{y_i\over z_i}
\quad\mbox{for}\quad
y_i\geq 0,\quad z_i\geq 0,\quad
\sum_{i=1}^N y_i=1=\sum_{i=1}^N z_i.
\eeq
%
For $N\!=\!2$, with $y\equiv y_1=1\!-\!y_2$ and
$z\equiv z_1=1\!-\!z_2$, inequality (\ref{entroineqn}) reads
%
\beq\label{entroineq2}
2(y\!-\!z)^2 \;\leq\;
y\ln{y\over z}+(1\!-\!y)\ln{1\!-\!y\over 1\!-\!z},
\eeq
%
which can be shown by elementary calculus (see \cite{Hutter:99}).
In the following, we reduce the case $N\!>\!2$
to the case $N\!=\!2$. We do this by a partition
$\{1,...,N\}=G^+\cup G^-$, $G^+\cap G^-=\{\}$, and define
$\displaystyle y^\pm\!:=\!\sum_{i\in G^\pm}y_i$ and $\displaystyle
z^\pm\!:=\!\sum_{i\in G^\pm}z_i$. It is well known that the
relative entropy is positive, i.e.
%
\beq\label{entropos}
\sum_{i\in G^\pm}p_i\ln{p_i\over q_i}\;\geq\; 0 \quad\mbox{for}\quad
p_i\geq 0, \quad q_i\geq 0, \quad
\sum_{i\in G^\pm} p_i=1=\sum_{i\in G^\pm} q_i.
\eeq
%but for $y_i$ and $z_i$ the constraints are not satisfied on $G^\pm$.
Note that there are 4 probability distributions ($p_i$ and $q_i$
for $i\!\in\!G^+$ and $i\!\in\!G^-$). For $i\!\in\!G^\pm$,
$p_i:=y_i/y^\pm$ and $q_i:=z_i/z^\pm$ satisfy the conditions on
$p$ and $q$. Inserting this into (\ref{entropos}) and rearranging
the terms we get
$\sum_{i\in G^\pm}y_i\ln{y_i\over z_i}\!\geq\!y^\pm\ln{y^\pm\over
z^\pm}.$
If we sum this over $\pm$ and define $y\equiv y^+=1\!-\!y^-$ and
$z\equiv z^+=1\!-\!z^-$ we get
%
\beq\label{eirb}
\sum_{i=1}^N y_i\ln{y_i\over z_i}\;\geq\;
\sum_\pm y^\pm\ln{y^\pm\over z^\pm} \;=\;
y\ln{y\over z}+(1\!-\!y)\ln{1\!-\!y\over 1\!-\!z}.
\eeq
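The grouping inequality above can be spot-checked numerically. The following sketch uses one assumed pair of distributions over $N\!=\!4$ outcomes with the partition $G^+=\{1,2\}$, $G^-=\{3,4\}$ (zero-based indices in the code), and is illustrative only:

```python
# Illustrative check of the grouping inequality (assumed example data):
# sum_i y_i ln(y_i/z_i) >= y ln(y/z) + (1-y) ln((1-y)/(1-z)),
# where y = y_1 + y_2, z = z_1 + z_2 (partition G+ = {1,2}, G- = {3,4}).
from math import log

y = [0.4, 0.3, 0.2, 0.1]             # hypothetical distribution y_i
z = [0.25, 0.25, 0.25, 0.25]         # hypothetical distribution z_i

lhs = sum(yi * log(yi / zi) for yi, zi in zip(y, z))
yp, zp = y[0] + y[1], z[0] + z[1]    # y^+ and z^+
rhs = yp * log(yp / zp) + (1 - yp) * log((1 - yp) / (1 - zp))
holds = lhs >= rhs - 1e-12           # grouping never increases relative entropy
```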
%
For the special choice $G^\pm\!:=\!\{i:y_i{>\atop\leq}z_i\}$,
we can upper bound the quadratic term by
%
\beqn
\sum_{i\in G^\pm}(y_i\!-\!z_i)^2 \;\leq\;
\Big(\sum_{i\in G^\pm}|y_i\!-\!z_i|\Big)^2 \;=\;
\Big(\sum_{i\in G^\pm}y_i\!-\!z_i\Big)^2 \;=\;
(y^\pm\!-\!z^\pm)^2.
\eeqn
%
The first equality is true, since all $y_i\!\!-\!\!z_i$ are
positive/negative for $i\!\in\!G^\pm$ due to the special choice of
$G^\pm$. Summation over $\pm$ gives
%
\beq\label{sqineq}
\sum_{i=1}^N (y_i\!-\!z_i)^2 \;\leq\;
\sum_\pm (y^\pm\!-\!z^\pm)^2 \;=\;
2(y\!-\!z)^2.
\eeq
%
Chaining the inequalities (\ref{sqineq}), (\ref{entroineq2})
and (\ref{eirb}) proves (\ref{entroineqn}).
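The inequality (\ref{entroineqn}) can also be tested numerically on random distribution pairs. The following sketch is illustrative only; the distributions are pseudo-random with a fixed seed:

```python
# Illustrative spot check of sum_i (y_i - z_i)^2 <= sum_i y_i ln(y_i/z_i)
# on pseudo-random distribution pairs over N = 5 outcomes.
from math import log
import random

random.seed(0)                       # fixed seed: deterministic check

def random_dist(n):
    v = [random.random() + 1e-6 for _ in range(n)]
    s = sum(v)
    return [x / s for x in v]

holds = True
for _ in range(1000):
    y, z = random_dist(5), random_dist(5)
    sq = sum((yi - zi) ** 2 for yi, zi in zip(y, z))       # squared distance
    re = sum(yi * log(yi / zi) for yi, zi in zip(y, z))    # relative entropy
    holds = holds and sq <= re + 1e-12
```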
If we identify
\beq\label{xydef}
{\cal A}=\{1,...,N\},\quad
N=|{\cal A}|, \quad
i=x_k, \quad
y_i=\mu(\pb x_k|x_{<k}), \quad
z_i=\xi(\pb x_k|x_{<k})
\eeq
in (\ref{entroineqn}), we obtain an upper bound for the squared
Euclidean distance between the conditional probabilities in terms
of the relative entropy (\ref{hn}):
%
\beqn
\sum_{x_k\in\cal A}
\Big(\mu(\pb x_k|x_{<k})-\xi(\pb x_k|x_{<k})\Big)^2
\;\leq\; h_k(x_{<k}).
\eeqn
%
Summing over $k$ and taking the expectation, the r.h.s.\ is
bounded by $H_n\leq\ln w_\mu^{-1}<\infty$ for all $n$. Hence the
l.h.s.\ must converge to zero for $k\to\infty$ with $\mu$
probability $1$, i.e.\ $\xi(\pb x_k|x_{<k})\to\mu(\pb
x_k|x_{<k})$. This proves convergence of $\xi$ to $\mu$ in a
conditional mean squared sense and with $\mu$ probability $1$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Error Bounds}\label{secErr}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We now compare the universal prediction scheme $\Theta_\xi$ with
the optimal informed scheme $\Theta_\mu$. The scheme $\Theta_\rho$
predicts the symbol of maximal $\rho$ probability,
$x_k^{\Theta_\rho}:=\maxarg_{x_k\in\cal A}\rho(\pb x_k|x_{<k})$,
and makes an error with $\mu$ probability
$e_{k\Theta_\rho}(x_{<k}):=1-\mu(\pb x_k^{\Theta_\rho}|x_{<k})$.
Let $E_{n\Theta_\rho}:=\sum_{k=1}^n{\bf E}[e_{k\Theta_\rho}]$ be
the expected total number of prediction errors in the first $n$
predictions.
%
\begin{theorem}\label{thErrBnd}
Let there be sequences $x_1x_2...$ over a finite alphabet $\cal
A$, drawn with probability $\mu(\pb x_{1:n})$ for the first $n$
symbols, where $\mu\in M$ and $\xi$ is defined as in Section
\ref{secSetup}. Then the expected number of prediction errors
$E_{n\Theta_\xi}$ of the universal scheme $\Theta_\xi$ exceeds
the expected number of errors $E_{n\Theta_\mu}$ of the optimal
informed scheme $\Theta_\mu$ by at most
\beqn
0 \;\leq\; E_{n\Theta_\xi}-E_{n\Theta_\mu} \;\leq\;
H_n+\sqrt{4E_{n\Theta_\mu}H_n+H_n^2} \;=\;
O\big(\sqrt{E_{n\Theta_\mu}}\big).
\eeqn
\end{theorem}
%
{\it Proof.} The first inequality holds, since $\Theta_\mu$
minimizes the $\mu$-expected error in every single prediction.
For the second inequality, a somewhat lengthy but elementary chain
of estimates, analogous to the binary case \cite{Hutter:99} but
based on the general inequality (\ref{entroineqn}), bounds the
per-step error difference by $e_{k\Theta_\xi}-e_{k\Theta_\mu}\leq
A\!\cdot\!e_{k\Theta_\mu}+(1\!+\!B)\!\cdot\!h_k$. Summing over
$k$ and taking the expectation yields
%
\beq\label{Eineq2}
E_{n\Theta_\xi}-E_{n\Theta_\mu} \;\leq\;
A\!\cdot\!E_{n\Theta_\mu}+(1\!+\!B)\!\cdot\!H_n,
\eeq
%
valid for all $A$ and $B$ with
%
\beq\label{ABcond}
B \;\geq\; \odf A+\odA \quad\mbox{for}\quad
A>0,\quad (\Rightarrow B\geq 1).
\eeq
Inequality (\ref{Eineq2}) therefore holds for any $A\!>\!0$,
provided we insert $B\!=\!{1\over 4}A+\odA$. Thus we might
minimize the r.h.s.\ of (\ref{Eineq2}) w.r.t.\ $A$ leading to the
upper bound
$$ E_{n\Theta_\xi} \;\leq\; E_{n\Theta_\mu} +
H_n+\sqrt{4E_{n\Theta_\mu}H_n+H_n^2}
\qquad\mbox{for}\qquad A^2={H_n\over E_{n\Theta_\mu}+\odf H_n}, $$
which completes the proof of Theorem \ref{thErrBnd}
$\qed$.
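Theorem \ref{thErrBnd} can be verified exactly in small cases by exhaustive enumeration. The following sketch is illustrative only: the class $M$ of three i.i.d.\ sources over a ternary alphabet and the horizon $n=6$ are hypothetical choices. It computes $E_{n\Theta_\xi}$, $E_{n\Theta_\mu}$ and $H_n$ exactly and checks the error bound as well as $H_n\leq\ln w_\mu^{-1}$:

```python
# Illustrative exact check of the error bound (hypothetical example class):
# E_xi - E_mu <= H_n + sqrt(4 E_mu H_n + H_n^2) and H_n <= ln(1/w_mu),
# computed by exhaustive enumeration for i.i.d. sources over A = {0,1,2}.
from itertools import product
from math import log, sqrt

A = [0, 1, 2]
M = [(0.6, 0.3, 0.1), (0.1, 0.6, 0.3), (0.3, 0.1, 0.6)]
w = [1/3, 1/3, 1/3]
mu = M[0]                                    # true generating distribution in M

def p_iid(q, x):                             # q(x_1...x_k) for an i.i.d. source q
    p = 1.0
    for s in x:
        p *= q[s]
    return p

def xi(x):                                   # universal mixture probability
    return sum(wi * p_iid(q, x) for wi, q in zip(w, M))

n = 6
E_mu = E_xi = H_n = 0.0
for k in range(1, n + 1):
    for prefix in product(A, repeat=k - 1):
        pw = p_iid(mu, prefix)               # mu-probability of the prefix
        xi_cond = [xi(prefix + (s,)) / xi(prefix) for s in A]
        pred_xi = max(A, key=lambda s: xi_cond[s])   # Theta_xi prediction
        pred_mu = max(A, key=lambda s: mu[s])        # Theta_mu prediction
        E_xi += pw * (1 - mu[pred_xi])       # expected instantaneous error
        E_mu += pw * (1 - mu[pred_mu])
        H_n += pw * sum(mu[s] * log(mu[s] / xi_cond[s]) for s in A)

error_bound = 0 <= E_xi - E_mu <= H_n + sqrt(4 * E_mu * H_n + H_n**2) + 1e-12
entropy_bound = H_n <= log(1 / w[0]) + 1e-12
```

The quantities $E_{n\Theta_\xi}$, $E_{n\Theta_\mu}$ and $H_n$ are computed exactly (up to floating point) by summing over all prefixes, so the check involves no sampling noise.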
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Generalizations}\label{secOut}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
In the following we discuss several directions in which the
findings of this work may be extended.
%------------------------------%
\subsection{General Loss Function}\label{ssecGLoss}
%------------------------------%
A prediction is very often the basis for some decision. The
decision results in an action, which itself leads to some reward
or loss. To stay in the framework of (passive) prediction we have
to assume that the action itself does not influence the
environment. Let
$l^k_{x_ky_k}(x_{