%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% New Error Bounds for Solomonoff Prediction %%
%% Marcus Hutter: Start: 01.10.99 LastEdit: 14.11.00 %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%-------------------------------%
% Document-Style %
%-------------------------------%
\documentclass[12pt]{article}
\parskip=1.5ex plus 1ex minus 1ex \parindent=0ex
\pagestyle{myheadings}
\markboth{MARCUS HUTTER, TECHNICAL REPORT, IDSIA-11-00
}{MARCUS HUTTER, TECHNICAL REPORT, IDSIA-11-00}
\setcounter{tocdepth}{4} \setcounter{secnumdepth}{1}
\topmargin=0cm \oddsidemargin=0cm \evensidemargin=0cm
\textwidth=16cm \textheight=23cm \unitlength=1mm \sloppy
\makeindex
%-------------------------------%
% Macro-Definitions %
%-------------------------------%
\def\keywords#1{\small\centerline{\bf Keywords}\vspace{5mm}\centerline{\parbox{14cm}{#1}}}
\def\eqd{\stackrel{\bullet}{=}}
\def\ff{\Longrightarrow}
\def\gdw{\Longleftrightarrow}
\def\toinfty#1{\stackrel{#1\to\infty}{\longrightarrow}}
\def\gtapprox{\buildrel{\lower.7ex\hbox{$>$}}\over
{\lower.7ex\hbox{$\sim$}}}
\def\nq{\hspace{-1em}}
\def\look{\(\uparrow\)}
\def\ignore#1{}
\def\deltabar{{\delta\!\!\!^{-}}}
\def\qed{\sqcap\!\!\!\!\sqcup}
\def\odt{{\textstyle{1\over 2}}}
\def\odf{{\textstyle{1\over 4}}}
\def\odA{{\textstyle{1\over A}}}
\def\hbar{h\!\!\!\!^{-}\,}
\def\dbar{d\!\!^{-}\!}
\def\eps{\varepsilon}
\def\beq{\begin{equation}}
\def\eeq{\end{equation}}
\def\beqn{\begin{displaymath}}
\def\eeqn{\end{displaymath}}
\def\bqa{\begin{equation}\begin{array}{c}}
\def\eqa{\end{array}\end{equation}}
\def\bqan{\begin{displaymath}\begin{array}{c}}
\def\eqan{\end{array}\end{displaymath}}
\def\pb#1{\underline{#1}} % probability notation
\def\blank{{\,_\sqcup\,}} % blank position
\def\maxarg{\mathop{\rm maxarg}} % maxarg
\def\minarg{\mathop{\rm minarg}} % minarg
\def\hh#1{{\dot{#1}}} % historic I/O
\def\best{*} % or {best}
\def\vec#1{{\bf #1}}
\begin{document}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% T i t l e - P a g e %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{titlepage}
\begin{center} %\vspace*{-2cm}
{\small Technical Report IDSIA-11-00, 14. November 2000}\\[2cm]
{\Large\sc\hrule height1pt \vskip 2mm
New Error Bounds for Solomonoff Prediction
\vskip 5mm \hrule height1pt } \vspace{2cm}
{\bf Marcus Hutter} \\[1cm]
{\rm IDSIA, Galleria 2, CH-6928 Manno-Lugano, Switzerland} \\
{\rm\footnotesize marcus@idsia.ch \qquad http://www.idsia.ch/$^{_{_\sim}}\!$marcus} \\[2cm]
\end{center}
%arXiv class: Artificial Intelligence; Learning
%ACM class: F.2.3.; I.2.6;
\keywords{Induction; Solomonoff, Bayesian, deterministic prediction;
algorithmic probability, Kolmogorov complexity.}
\begin{abstract}
Solomonoff sequence prediction is a scheme to predict
digits of binary strings without knowing the underlying
probability distribution. We call a prediction scheme informed
when it knows the true probability distribution of the sequence.
Several new relations between universal Solomonoff sequence
prediction and informed prediction and general probabilistic
prediction schemes will be proved. Among others, they show that
the number of errors in Solomonoff prediction is finite for
computable distributions, if finite in the informed case.
Deterministic variants will also be studied. The most interesting
result is that the deterministic variant of Solomonoff prediction
is optimal compared to any other probabilistic or deterministic
prediction scheme apart from additive square root corrections
only. This makes it well suited even for difficult prediction
problems, where a scheme whose number of errors is minimal only to
within some factor greater than one would not suffice. Solomonoff's
original bound and the ones presented here complement each other
in a useful way.
\end{abstract}
\end{titlepage}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}\label{int}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%------------------------------%
%\paragraph{History:}
%------------------------------%
Induction is the process of predicting the future from the past
or, more precisely, it is the process of finding rules in (past)
data and using these rules to guess future data. The induction
principle has been subject to long philosophical controversies.
Highlights are Epicurus' principle of multiple explanations,
Occam's razor (simplicity) principle and Bayes' rule for
conditional probabilities \cite{Bay63}. In 1964, Solomonoff
\cite{Sol64} elegantly unified all these aspects into one formal
theory of inductive inference. The theory allows the prediction of digits
of binary sequences without knowing their true probability
distribution in contrast to what we call an informed scheme, where
the true distribution is known. A first error estimate was also
given by Solomonoff 14 years later in \cite{Sol78}. It states that
the total mean squared distance between the prediction probabilities
of Solomonoff and informed prediction is bounded by the Kolmogorov
complexity of the true distribution. As a corollary, this theorem
ensures that Solomonoff prediction converges to informed
prediction for computable sequences in the limit. This is the key
result justifying the use of Solomonoff prediction for long
sequences of low complexity.
Another natural question is to ask for relations between the total
number of expected errors $E_\xi$ in Solomonoff prediction and the
total number of prediction errors $E_\mu$ in the informed scheme.
Unfortunately \cite{Sol78} does not bound $E_\xi$ in terms of
$E_\mu$ in a satisfactory way. For example it does not exclude the
possibility of an infinite $E_\xi$ even if $E_\mu$ is finite. Here
we want to prove upper bounds on $E_\xi$ in terms of $E_\mu$,
ensuring as a corollary that the above case cannot happen. On the
other hand, our theorem does not say much about the convergence of
Solomonoff to informed prediction. So Solomonoff's and our bounds
complement each other in a nice way.
%------------------------------%
%\paragraph{Contents:}
%------------------------------%
In the preliminary Section \ref{pre} we give some notations for
strings and conditional probability distributions on strings.
Furthermore, we introduce Kolmogorov complexity and the universal
probability, where we take care to make the latter a true
probability measure.
In Section \ref{ProbSP} we define the general probabilistic
prediction scheme ($\rho$) and Solo\-mo\-noff ($\xi$) and informed
($\mu$) prediction as special cases. We will give several error
relations between these prediction schemes. A bound for the error
difference $|E_\xi\!-\!E_\mu|$ between Solomonoff and informed
prediction is the central result. All other relations are then
simple, but interesting consequences or known results such as the
Euclidean bound.
In Section \ref{DetSP} we study deterministic variants of
Solomonoff ($\Theta_\xi$) and informed ($\Theta_\mu$) prediction.
We will give similar error relations as in the probabilistic case
between these prediction schemes. The most interesting consequence
is that the $\Theta_\xi$ system is optimal compared to any
other probabilistic or deterministic prediction scheme apart from
additive square root corrections only.
In the Appendices \ref{AppBasic2}, \ref{AppBasic1} and
\ref{AppBasic4} we prove the inequalities (\ref{basic2}),
(\ref{basic1}) and (\ref{basic4}), which are the central parts for
the proofs of Theorems 1 and 2.
For an excellent introduction to Kolmogorov complexity and
Solomonoff induction one should consult the book of Li and
Vit\'anyi \cite{LiVi97} or the article \cite{LiVi92} for a short
course. Historical surveys of inductive reasoning/inference can be
found in \cite{Ang83,Sol97}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Preliminaries}\label{pre}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%------------------------------%
%\paragraph{Strings and probability distributions:}
%------------------------------%
Throughout the paper we will consider binary sequences/strings and
conditional probability measures on strings.
We will denote strings over the binary alphabet $\{0,1\}$ by
$s\!=\!x_1x_2...x_n$ with $x_k\!\in\!\{0,1\}$ and their lengths
with $l(s)\!=\!n$. $\epsilon$ is the empty string,
$x_{n:m}:=x_nx_{n+1}...x_{m-1}x_m$ for $n\leq m$ and $\epsilon$
for $n>m$. Furthermore, $x_{<n}:=x_1...x_{n-1}$.
The universal probability $\xi$ \cite{Sol64,ZvLe70,Lev74,Wil70} is
defined as
\beqn
\xi(x_{1:n}) \;:=\; \sum_{p\,:\,U(p)=x_{1:n}*}2^{-l(p)},
\eeqn
the sum over all programs $p$ for which the universal Turing machine
$U$ outputs a string starting with $x_{1:n}$, suitably normalized such
that $\xi$ becomes a true probability measure. The Kolmogorov
complexity of a computable probability measure $\mu$ is the length of
the shortest program computing it \cite{Kol65},
\beq\label{kolmodef}
K(\mu) \;:=\; \min\{l(p)\,:\,U(p)\mbox{ computes }\mu\}.
\eeq
Universality of $\xi$ ensures the dominance
$\xi(x_{1:n})\geq 2^{-K(\mu)-O(1)}\mu(x_{1:n})$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Probabilistic Sequence Prediction}\label{ProbSP}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We draw binary strings $x_1x_2...$ randomly with probability
distribution $\mu$. Conditional probabilities are defined via Bayes'
rule
\beq\label{bayes}
\rho(x_{<n}\pb{x_n}) \;:=\; \rho(x_{1:n})/\rho(x_{<n}),
\eeq
the $\rho$-probability that a string $x_{<n}$ is continued by $x_n$. A
probabilistic prediction scheme $\rho$ predicts $x_n$ with probability
$\rho(x_{<n}\pb{x_n})$. The $\mu$-expected probability of an erroneous
prediction in the $n^{th}$ step and the total expected number of
errors in the first $n$ predictions are
\beq\label{rhoerr}
e_{n\rho}(x_{<n}) \;:=\; \sum_{x_n}\mu(x_{<n}\pb{x_n})
(1-\rho(x_{<n}\pb{x_n}))
\quad,\quad
E_{n\rho} \;:=\; \sum_{k=1}^n\sum_{x_{<k}}\mu(x_{<k})\,e_{k\rho}(x_{<k}).
\eeq
Solomonoff prediction is the case $\rho\!=\!\xi$ and informed
prediction the case $\rho\!=\!\mu$. We further need the relative
entropy between $\mu$ and $\xi$ in the $n^{th}$ step and its total
expectation,
\beq\label{entropy}
h_n(x_{<n}) \;:=\; \sum_{x_n}\mu(x_{<n}\pb{x_n})
\ln{\mu(x_{<n}\pb{x_n})\over\xi(x_{<n}\pb{x_n})}
\quad,\quad
H_n \;:=\; \sum_{k=1}^n\sum_{x_{<k}}\mu(x_{<k})\,h_k(x_{<k}),
\eeq
as well as the abbreviations
\beq\label{abbr}\label{yzdef}
y \;:=\; \mu(x_{<k}\pb{1}) \quad,\quad
z \;:=\; \xi(x_{<k}\pb{1}) \quad,\quad
r \;:=\; \rho(x_{<k}\pb{1}),
\eeq
with which $e_{k\mu}\!=\!2y(1-y)$, $e_{k\xi}\!=\!y(1-z)+(1-y)z$ and
$e_{k\rho}\!=\!y(1-r)+(1-y)r$. Finally, let
$\Delta_n^{(1)}:=E_{n\xi}-E_{n\mu}$ be the total error difference and
$\Delta_n^{(2)}:=\sum_{k=1}^n\sum_{x_{<k}}\mu(x_{<k})(y-z)^2$ the
total expected squared Euclidean distance between $\xi$ and $\mu$.\\
%------------------------------%
{\sc Theorem 1.}
%------------------------------%
{\sl Let there be binary sequences $x_1x_2...$ drawn with probability
$\mu(x_{1:n})$ for the first $n$ digits. The following relations hold
between the errors of Solomonoff ($\xi$), informed ($\mu$) and
arbitrary ($\rho$) probabilistic predictions:
\beqn
\begin{array}{rl}
i) & \Delta_n^{(1)} \quad<\quad H_n+\sqrt{2E_{n\mu}H_n} \\
ii) & \Delta_n^{(2)} \quad\leq\quad \odt H_n \\
iii) & E_{n\xi} \quad>\quad \Delta_n^{(2)}+ \odt E_{n\mu} \\
iv) & E_{n\xi} \quad>\quad E_{n\mu}+H_n-\sqrt{2E_{n\mu}H_n}
\quad>\; H_n \quad\quad\mbox{for}\quad E_{n\mu}
\!>\!2H_n \\
v) & E_{n\mu} \quad\leq\quad 2E_{n\rho} \quad,\quad
e_{n\mu} \;\leq\; 2e_{n\rho}
\quad\;\mbox{for any }\rho \\
vi) & E_{n\xi} \quad<\quad 2E_{n\rho}+H_n+\sqrt{4E_{n\rho}H_n}
\quad\mbox{for any }\rho,
\end{array}
\eeqn
where $H_n\!<\!\ln2\!\cdot\!K(\mu)\!+\!O(1)$ is the relative entropy
(\ref{entropy}) and $K(\mu)$ is the Kolmogorov complexity of $\mu$
(\ref{kolmodef}).}\\
%------------------------------%
{\sc Corollary 1.}
%------------------------------%
{\sl For computable $\mu$, i.e. for $K(\mu)<\infty$, the following
statements immediately follow from Theorem 1:
$$
\begin{array}{rl}
vii) & \mbox{if $E_{\infty\mu}$ is finite, then
$E_{\infty\xi}$ is finite}
\qquad\qquad\qquad\qquad\qquad\qquad\quad \\
viii) & E_{n\xi}/E_{n\mu} \quad\!=\; 1+O(E_{n\mu}^{-1/2})
\stackrel{E_{n\mu}\to\infty}
{\longrightarrow} 1 \\
ix) & E_{n\xi}-E_{n\mu} = \qquad O(\sqrt{E_{n\mu}}) \\
% x) & \mbox{if } E_{n\mu}\sim n, \mbox{ then }
% E_{n\xi}-E_{n\mu}=O(\sqrt{n})
x) & E_{n\xi}/E_{n\rho} \quad\!\leq\; 2+O(E_{n\rho}^{-1/2}).
\end{array}
$$
}
%------------------------------%
%\paragraph{Comments:}
%------------------------------%
Relation {\sl$(i)$} is the central new result. It is best illustrated for
computable $\mu$ by the corollary. Statements {\sl$(vii)$}, {\sl$(viii)$} and
{\sl$(ix)$} follow directly from {\sl$(i)$} and the finiteness of
$H_\infty$. Statement {\sl$(x)$} follows from {\sl$(vi)$}.
First of all, {\sl$(vii)$} ensures finiteness of the number of
errors of Solomonoff prediction, if the informed prediction makes
only a finite number of errors. This is especially the case for
deterministic $\mu$, as $E_{n\mu}\!=\!0$ in this case\footnote{We
call a probability measure deterministic if it is 1 for
exactly one sequence and 0 for all others.}. Solomonoff prediction
makes only a finite number of errors on computable sequences. For
more complicated probabilistic environments, where even the ideal
informed system makes an infinite number of errors, {\sl$(ix)$}
ensures that the error excess of Solomonoff prediction is only of
order $\sqrt{E_{n\mu}}$. This ensures that the error densities
$E_n/n$ of both systems converge to each other, but
{\sl$(ix)$} actually says more than this. It ensures that the quotient
converges to 1 and also gives the speed of convergence
{\sl$(viii)$}.
Relation {\sl$(ii)$} is the well-known Euclidean bound
\cite{Sol78}. It is the only upper bound in Theorem 1 which
remains finite for $E_{n\mu/\rho}\!\to\!\infty$. It ensures
convergence of the individual prediction probabilities
$\xi(x_{<n}\pb{x_n})\to\mu(x_{<n}\pb{x_n})$ with $\mu$-probability 1.
%------------------------------%
{\it Proof of Theorem 1.}
%------------------------------%
To prove {\sl$(i)$} we start with the linear inequality
\beq\label{Eineq2}
\Delta_n^{(1)} \;=\; E_{n\xi}-E_{n\mu} \;<\; A\!\cdot\!E_{n\mu}+B\!\cdot\!H_n.
\eeq
As $E_n$ and $H_n$ are sums over $k$ of $\mu$-expectations of $e_k$
and $h_k$, it is sufficient to prove
$e_{k\xi}-e_{k\mu}\!<\!A e_{k\mu}+B h_k$ in every step $k$. Since
$e_{k\xi}-e_{k\mu}\!=\!(2y-1)(y-z)\!\leq\!|y-z|$ with the
abbreviations (\ref{abbr}), it is enough to show
\beq\label{basic2}
|y-z| \;<\; A\!\cdot\!2y(1-y)+
B\!\cdot\!\left[y\ln{y\over z}+(1-y)\ln{1-y\over 1-z}\right].
\eeq
In Appendix \ref{AppBasic2} this inequality is shown to hold for
$B\!\geq\!{1\over 2A}\!+\!1$ and $A\!>\!0$.
Inequality (\ref{Eineq2}) therefore holds for any $A\!>\!0$,
provided we insert $B\!=\!{1\over 2A}\!+\!1$. Thus we might
minimize the r.h.s of (\ref{Eineq2}) w.r.t $A$. The minimum is at
$A=\sqrt{H_n/2E_{n\mu}}$ leading to the upper bound
$$
\Delta_n^{(1)} \;<\; H_n+\sqrt{2E_{n\mu}H_n}
$$
which completes the proof of {\sl$(i)$}.
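The minimization just performed is easy to spot-check numerically. The following Python sketch (not part of the original proof; $E$ and $H$ are arbitrary positive example values) verifies that the r.h.s.\ of the linear bound with $B={1\over 2A}+1$ is minimized at $A^*=\sqrt{H_n/2E_{n\mu}}$ with minimum $H_n+\sqrt{2E_{n\mu}H_n}$:

```python
import math

# Minimize A*E + B*H with B = 1/(2A) + 1 over A > 0. Calculus gives
# A* = sqrt(H/(2E)) and minimum value H + sqrt(2*E*H).
def rhs(A, E, H):
    B = 1.0 / (2.0 * A) + 1.0
    return A * E + B * H

E, H = 7.0, 3.0
A_star = math.sqrt(H / (2.0 * E))
closed_form = H + math.sqrt(2.0 * E * H)
grid_min = min(rhs(a / 1000.0, E, H) for a in range(1, 10000))

print(abs(rhs(A_star, E, H) - closed_form) < 1e-9)  # True
print(grid_min >= closed_form - 1e-9)               # True
```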
Bound {\sl$(ii)$} is well known \cite{Sol78}. It is already linear and
is proved by showing $d_n^2\!\leq\!\odt h_n$. Inserting the
abbreviations (\ref{abbr}) we get
%
\beq\label{kulbound}
2(y-z)^2\;\leq\;y\ln{y\over z}+(1-y)\ln{1-y\over 1-z}
\eeq
%
This lower bound for the Kullback Leibler distance is well known
\cite{Kul59}.
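The Pinsker-type bound (\ref{kulbound}) can be checked numerically; a small Python sketch (not part of the original proof) tests it on an interior grid:

```python
import math

# Check 2*(y-z)^2 <= KL(y||z) for the binary Kullback-Leibler
# distance, with natural logarithms as in the text.
def kl(y, z):
    return y * math.log(y / z) + (1 - y) * math.log((1 - y) / (1 - z))

grid = [i / 100 for i in range(1, 100)]
ok = all(2 * (y - z) ** 2 <= kl(y, z) + 1e-12 for y in grid for z in grid)
print(ok)  # True
```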
Relation {\sl$(iii)$} does not involve $H_n$ at all and is elementary. It
is reduced to $e_{n\xi}\!>\!d_n^2\!+\!\odt e_{n\mu}$, equivalent
to $z(1-y)+y(1-z)>(y-z)^2+y(1-y)$, equivalent to $z(1-z)>0$, which
is obviously true.
The second inequality of {\sl$(iv)$} is trivial and the first is
proved similarly to {\sl$(i)$}. Again we start with a linear
inequality $ -E_{n\xi}\!<\!(A- 1)E_{n\mu}+(B- 1)H_n $, which is
further reduced to $-e_{k\xi}\!<\!(A-1)e_{k\mu}+(B-1)h_k$.
Inserting the abbreviations (\ref{abbr}) we get
\beq\label{basic1}
-y(1-z)-z(1-y) \;<\; (A-1)2y(1-y)+
(B-1)\left[y\ln{y\over z}+(1-y)\ln{1-y\over 1-z}\right].
\eeq
In Appendix \ref{AppBasic1} this
inequality is shown to hold for $2AB\!\geq\!1$, when $B\!>\!1$.
If we insert $B=1/2A$ and minimize w.r.t.\ $A$, the
minimum is again at $A=\sqrt{H_n/2E_{n\mu}}$ leading to the upper
bound
$
-E_{n\xi} \!\leq\! -E_{n\mu}-H_n+\sqrt{2E_{n\mu}H_n}
$
restricted to $E_{n\mu}\!>\!2H_n$, which
completes the proof of {\sl$(iv)$}.
Statement {\sl$(v)$} is satisfied because $2y(1-y)\leq
2[y(1-r)+(1-y)r]$. Statement {\sl$(vi)$} is a direct consequence of
{\sl$(i)$} and {\sl$(v)$}. This completes the proof of Theorem 1.
$\qed$
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Deterministic Sequence Prediction}\label{DetSP}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%------------------------------%
%\paragraph{Definitions:}
%------------------------------%
In the last section several relations were derived between the
number of errors of the universal $\xi$-system, the informed
$\mu$-system and arbitrary $\rho$-systems. All of them were
probabilistic predictors in the sense that given $x_{<n}$ they output
the probability $\rho(x_{<n}\pb{1})$ that the next digit will be 1. A
deterministic predictor, in contrast, outputs the bit itself. To every
probabilistic predictor $\rho$ we associate a deterministic predictor
$\Theta_\rho$, which always predicts the bit of higher
$\rho$-probability:
$$
\Theta_\rho(x_{<n}\pb{1}) \;:=\; \left\{ \begin{array}{cl}
1 & \mbox{if }\; \rho(x_{<n}\pb{1}) \;>\; \odt \\
0 & \mbox{if }\; \rho(x_{<n}\pb{1}) \;\leq\; \odt.
\end{array} \right.
$$
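The thresholding rule can be sketched in a few lines of Python (a minimal illustration, not part of the original text): given the $\rho$-probability $p$ that the next bit is 1, the deterministic variant predicts the more probable bit.

```python
# Deterministic variant Theta_rho: predict 1 iff the probability p
# that the next bit is 1 exceeds 1/2, otherwise predict 0.
def theta(p: float) -> int:
    return 1 if p > 0.5 else 0

print(theta(0.7), theta(0.5), theta(0.2))  # 1 0 0
```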
Note that every deterministic predictor can be written in the form
$\Theta_\rho$ for some $\rho$ and that although
$\Theta_\rho(\pb{x_1...x_n})$, defined via Bayes' rule
(\ref{bayes}), takes only values in $\{0,1\}$, it may still be
interpreted as a probability measure. Deterministic
prediction is just a special case of probabilistic prediction. The
two models $\Theta_\mu$ and $\Theta_\xi$ will be studied now.
Analogously to the last section we draw binary strings randomly with distribution $\mu$
and define the probability that the $\Theta_\rho$ system makes an erroneous
prediction in the $n^{th}$ step and the total $\mu$-expected number of
errors in the first $n$ predictions as
\bqa\label{dtoterr}
\displaystyle{
e_{n\Theta_\rho}(x_{<n}) \;:=\; \sum_{x_n}\mu(x_{<n}\pb{x_n})
(1-\Theta_\rho(x_{<n}\pb{x_n}))
\quad,\quad
E_{n\Theta_\rho} \;:=\; \sum_{k=1}^n\sum_{x_{<k}}
\mu(x_{<k})\,e_{k\Theta_\rho}(x_{<k}). }
\eqa
The $\Theta_\xi$ system predicts $x_n\!=\!1$ if
$\xi(x_{<n}\pb{1})\!>\!\odt$ and 0 otherwise; analogously for
$\Theta_\mu$. The following theorem and corollary parallel those of
the probabilistic case.\\
%------------------------------%
{\sc Theorem 2.}
%------------------------------%
{\sl Let there be binary sequences drawn with probability $\mu$. The
$\Theta_\mu/\Theta_\xi$ system predicts the digit of highest
informed/universal probability. Then the following error relations
hold:
\beqn
\begin{array}{rl}
i) & E_{n\Theta_\mu} \quad\leq\quad E_{n\Theta_\xi} \quad<\quad
E_{n\Theta_\mu}+H_n+\sqrt{4E_{n\Theta_\mu}H_n+H_n^2} \\
ii) & E_{n\Theta_\mu} \quad\leq\quad E_{n\rho} \quad,\quad
e_{n\Theta_\mu} \;\leq\; e_{n\rho}
\quad\;\mbox{for any }\rho \\
iii) & E_{n\Theta_\xi} \quad<\quad
E_{n\rho}+H_n+\sqrt{4E_{n\rho}H_n+H_n^2}
\quad\mbox{for any }\rho,
\end{array}
\eeqn
where $H_n$ is the relative entropy (\ref{entropy}).}\\
%------------------------------%
{\sc Corollary 2.}
%------------------------------%
{\sl For computable $\mu$, i.e.\ for $K(\mu)<\infty$, the following
statements immediately follow from Theorem 2:
$$
\begin{array}{rl}
vii) & \mbox{if $E_{\infty\Theta_\mu}$ is finite, then
$E_{\infty\Theta_\xi}$ is finite}
\qquad\qquad\qquad\qquad\qquad\quad \\
viii) & E_{n\Theta_\xi}/E_{n\Theta_\mu} \quad\!=\;
1+O(E_{n\Theta_\mu}^{-1/2})
\stackrel{E_{n\Theta_\mu}\to\infty}{\longrightarrow} 1 \\
ix) & E_{n\Theta_\xi}-E_{n\Theta_\mu} = O(\sqrt{E_{n\Theta_\mu}}).
\end{array}
$$
}
Relation {\sl$(iii)$} shows that the $\Theta_\xi$ system is optimal
compared to any other probabilistic or deterministic prediction scheme
apart from additive square root corrections: no multiplicative factor
2 occurs, unlike in bound {\sl$(vi)$} of Theorem 1.
To illustrate the usefulness of these bounds, consider the following
game: in each round a loaded die is thrown, which shows the digit 1
with probability ${2\over 3}$ or ${1\over 3}$, depending on a
computable die selection strategy reflected in $\mu$. In every round
we pay a stake of $s=3\$$ and receive a return of $r=5\$$ if we
predict the thrown digit correctly. The total expected profit after
$n$ rounds of a prediction scheme $\rho$ is
\beq\label{profit}
P_{n\rho} \;:=\; (r-s)(n-E_{n\rho})-s\!\cdot\!E_{n\rho}
\;=\; (1-s/r-E_{n\rho}/n)\!\cdot\!rn.
\eeq
A positive profit $P_{n\rho}\!>\!0$ is reached if
$E_{n\rho}/n\!<\!1\!-\!s/r\!=\!{2\over 5}$.
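The profit formula and its winning threshold can be checked with a few lines of Python (a sketch, assuming stake $s=3\$$ and return $r=5\$$ per round, consistent with $s/r=3/5$):

```python
# Total-profit check: P_n = (r-s)*(n-E) - s*E = (1-s/r-E/n)*r*n, so
# the profit is positive iff the error density E/n stays below
# 1 - s/r = 2/5. Integer arithmetic keeps the threshold exact.
r, s, n = 5, 3, 1000

def profit(E):
    return (r - s) * (n - E) - s * E

threshold = (r - s) * n // r   # 400 expected errors in 1000 rounds
print(profit(threshold))                 # 0
print(profit(399) > 0, profit(401) < 0)  # True True
```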
If we knew $\mu$, we could use the best possible prediction scheme
$\Theta_\mu$. The error (\ref{dtoterr}) and profit (\ref{profit})
expectations per round in this case are
\beq\label{exethmu}
e_{\Theta_\mu} \;:=\; e_{n\Theta_\mu}(x_{<n}) \;=\; {1\over 3}
\quad,\quad
{P_{n\Theta_\mu}\over n} \;=\; {1\over 3}\,\$ \;>\; 0
\eeq
so we can make money from this game. If we predict according
to the probabilistic $\mu$ prediction scheme (\ref{rhoerr})
we would lose money in the long run:
\beqn
e_{n\mu}(x_{<n}) \;=\; {4\over 9} \;>\; {2\over 5}
\quad,\quad
{P_{n\mu}\over n} \;=\; -{2\over 9}\$ \;<\; 0.
\eeqn
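The per-round figures above can be reproduced exactly with rational arithmetic; a small Python sketch (assuming, consistently with the quoted figures, that the favoured digit has $\mu$-probability $2/3$, stake $s=3\$$, return $r=5\$$):

```python
from fractions import Fraction as F

# Per-round expectations: y = 2/3 is the probability of the favoured
# digit; e_mu is the error of the probabilistic informed scheme,
# e_theta that of the deterministic informed scheme.
y, r, s = F(2, 3), F(5), F(3)

def profit_per_round(e):           # expected profit per round
    return r * (1 - s / r - e)

e_mu = 2 * y * (1 - y)
e_theta = min(y, 1 - y)

print(e_mu, profit_per_round(e_mu))        # 4/9 -2/9
print(e_theta, profit_per_round(e_theta))  # 1/3 1/3
```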
In the more interesting case where we do not know $\mu$ we can use
Solomonoff prediction $\xi$ or its deterministic variant
$\Theta_\xi$. From {\sl$(viii)$} of
Corollaries 1 and 2 we know that
\beqn
P_{n\xi}/P_{n\mu} \;=\; 1+O(n^{-1/2}) \;=\;
P_{n\Theta_\xi}/P_{n\Theta_\mu},
\eeqn
so asymptotically the $\xi$ system provides the same profit as the $\mu$ system and
the $\Theta_\xi$ system the same as the $\Theta_\mu$ system. Using the $\xi$
system is a losing strategy, while using the $\Theta_\xi$ system is a winning strategy.
Let us estimate the number of rounds we have to play before reaching the winning
zone with the $\Theta_\xi$ system. $P_{n\Theta_\xi}\!>\!0$ if
$E_{n\Theta_\xi}\!<\!(1\!-\!s/r)n$ if
\beqn
E_{n\Theta_\mu} + H_n + \sqrt{4E_{n\Theta_\mu}H_n+H_n^2}
\;<\; (1-s/r)\!\cdot\!n
\eeqn
by Theorem 2 {\sl$(i)$}. Solving w.r.t.\ $H_n$ we get
\beqn
H_n \;<\; {(1-s/r-E_{n\Theta_\mu}/n)^2 \over
2\!\cdot\!\!(1-s/r+E_{n\Theta_\mu}/n)} \cdot n.
\eeqn
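The equivalence of the last two displays, obtained by solving the quadratic inequality for $H_n$, can be verified numerically (a sketch with arbitrary example values, writing $c$ for $1\!-\!s/r$):

```python
import math

# The winning condition E + H + sqrt(4*E*H + H^2) < c*n is equivalent
# to H < (c - E/n)^2 / (2*(c + E/n)) * n for E/n < c.
def winning(E, H, c, n):
    return E + H + math.sqrt(4 * E * H + H * H) < c * n

def h_bound(E, c, n):
    e = E / n
    return (c - e) ** 2 / (2 * (c + e)) * n

c, n, E = 0.4, 10000, 2500.0
Hb = h_bound(E, c, n)
print(winning(E, 0.999 * Hb, c, n), winning(E, 1.001 * Hb, c, n))  # True False
```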
Using $H_n\!<\!\ln 2\!\cdot\!K(\mu)\!+\!O(1)$ and
(\ref{exethmu}) we expect to be in the winning zone for
\beqn
n \;>\; {2\!\cdot\!\!(1-s/r+e_{\Theta_\mu}) \over
(1-s/r-e_{\Theta_\mu})^2} \cdot\!\ln 2\!\cdot\!K(\mu)+O(1)
\;=\; 330\ln 2\!\cdot\!K(\mu)+O(1).
\eeqn
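The constant 330 follows from the quoted values by elementary arithmetic, which the following Python sketch reproduces (the value $K(\mu)=100$ bits is a hypothetical example):

```python
import math

# Winning-zone constant 2*(c+e)/(c-e)^2 with c = 1 - s/r = 2/5 and
# e = e_Theta_mu = 1/3 gives 330, as quoted in the text.
c, e = 2 / 5, 1 / 3
const = 2 * (c + e) / (c - e) ** 2
print(round(const))  # 330

# For, say, K(mu) = 100 bits the turnaround bound is roughly
n_min = const * math.log(2) * 100
print(20000 < n_min < 25000)  # True
```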
If the die selection strategy reflected in $\mu$ is not too
complicated, the $\Theta_\xi$ prediction system reaches the
winning zone after a few thousand rounds. The number of rounds is
not really small because the expected profit per round is one
order of magnitude smaller than the return. This leads to a
constant of two orders of magnitude size in front of $K(\mu)$.
Stated otherwise, it is due to the large stochastic noise, which
makes it difficult to extract the signal, i.e. the structure of
the rule $\mu$. Furthermore, this is only a bound for the turnaround
value of $n$. The true expected turnaround $n$ might be smaller.
However, in every game for which there exists a winning strategy $\rho$ with
$P_{n\rho}\!\sim\!n$, $\Theta_\xi$ is guaranteed to enter
the winning zone for some $n\!\sim\!K(\mu)$, i.e.
$P_{n\Theta_\xi}\!>\!0$ for sufficiently large $n$. This is {\sl
not} guaranteed for the $\xi$-system, due to the factor $2$ in the
bound {\sl$(x)$} of Corollary 1.
%------------------------------%
{\it Proof of Theorem 2.}
%------------------------------%
The method of proof is the same as in the previous section, so
we will keep it short. With the abbreviations
(\ref{yzdef}) we can write $e_{k\Theta_\xi}$ and $e_{k\Theta_\mu}$
in the forms
\beq\label{dabbr}
\begin{array}{rcl}
e_{k\Theta_\xi} & = & y(1-\Theta(z-\odt))+(1-y)\Theta(z-\odt)
\;=\; |y-\Theta(z-\odt)| \\
e_{k\Theta_\mu} & = & y(1-\Theta(y-\odt))+(1-y)\Theta(y-\odt)
\;=\; \min\{y,1-y\}. \\
\end{array}
\eeq
With these abbreviations,
{\sl$(ii)$} is equivalent to $\min\{y,1-y\}\!\leq\!y(1-r)+(1-y)r$,
which is true, because the minimum of two numbers is never
larger than their weighted average.
The first inequality of {\sl$(i)$} follows directly
from {\sl$(ii)$}. To prove the last inequality, we start once
again with a linear model
\beq\label{eineq4}
E_{n\Theta_\xi} \;<\; (A+1)E_{n\Theta_\mu} + (B+1)H_n.
\eeq
Inserting the definition of $E_n$ and $H_n$, using (\ref{dabbr}),
and omitting the sums we have to find $A$ and $B$, which satisfy
\beq\label{basic4}
|y-\Theta(z-\odt)| \;<\; (A+1)\min\{y,1-y\}
+(B+1)\left[y\ln{y\over z}+(1-y)\ln{1-y\over 1-z}\right].
\eeq
In Appendix \ref{AppBasic4}
we will show that the inequality is
satisfied for $B\!\geq\!{1\over 4}A+{1\over A}$ and $A>0$.
Inserting $B\!=\!{1\over 4}A+{1\over A}$ into (\ref{eineq4})
and minimizing the r.h.s. w.r.t.\ $A$, we get the upper bound
$$
E_{n\Theta_\xi} \;<\; E_{n\Theta_\mu} +
H_n+\sqrt{4E_{n\Theta_\mu}H_n+H_n^2}
\qquad\mbox{for}\qquad A^2={H_n\over E_{n\Theta_\mu}+ {1\over 4}H_n}.
$$
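This minimization can also be spot-checked numerically (a sketch with arbitrary positive example values for $E_{n\Theta_\mu}$ and $H_n$, not part of the original proof):

```python
import math

# Minimize (A+1)*E + (B+1)*H with B = A/4 + 1/A over A > 0; the
# minimum E + H + sqrt(4*E*H + H^2) is attained at A^2 = H/(E + H/4).
def rhs(A, E, H):
    B = A / 4 + 1 / A
    return (A + 1) * E + (B + 1) * H

E, H = 7.0, 3.0
A_star = math.sqrt(H / (E + H / 4))
closed_form = E + H + math.sqrt(4 * E * H + H * H)
grid_min = min(rhs(a / 1000, E, H) for a in range(1, 10000))

print(abs(rhs(A_star, E, H) - closed_form) < 1e-9)  # True
print(grid_min >= closed_form - 1e-9)               # True
```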
Statement {\sl$(iii)$} is a direct consequence of {\sl$(i)$} and
{\sl$(ii)$}. This completes the proof of Theorem 2. $\qed$
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Conclusions}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We have proved several new error bounds for Solomonoff prediction
in terms of informed prediction and in terms of general prediction
schemes. Theorem 1 and Corollary 1 summarize the results in the
probabilistic case and Theorem 2 and Corollary 2 for the
deterministic case. We have shown that in the probabilistic case
$E_{n\xi}$ is asymptotically bounded by twice the number of errors
of any other prediction scheme.
In the deterministic variant of Solomonoff prediction this factor 2 is absent.
It is well suited even for difficult prediction problems, as the error
density $E_{n\Theta_\xi}/n$ converges rapidly to the minimal possible
error density $E_{n\Theta_\mu}/n$.
%------------------------------%
\paragraph{Acknowledgments:}
%------------------------------%
I thank Ray Solomonoff and J{\"u}rgen Schmidhuber for
proofreading this work and for numerous discussions.
\appendix
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Proof of Inequality (\ref{basic2})}\label{AppBasic2}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\footnote{The proofs are a bit sketchy. We will be a little sloppy
about boundary values $y=0/1$, $z=\odt$, $\Theta(0)$, $\geq$
versus $>$, and {\it approaching} versus {\it at} the boundary. All subtleties
have been checked and do not spoil the results. Since $0\!<\!\xi\!<\!1$,
the bounds $0\!<\!z\!<\!1$ are strict.}With the definition
\beqn
f(y,z;A,B) \;:=\;
A\!\cdot\!2y(1-y) +
B\!\cdot\!\left[y\ln{y\over z}+(1-y)\ln{1-y\over 1-z}\right] -
|y-z|
\eeqn
we have to show $f(y,z;A,B)\!>\!0$ for $0\!<\!y\!<\!1$,
$0\!<\!z\!<\!1$ and suitable $A$ and $B$. We do this by showing
that $f\!>\!0$ at all extremal values, `at' boundaries and at
non-analytical points. $f\!\to\!+\infty$ for $z\to 0/1$, if we
choose $B>0$. Moreover, at the non-analytic point $z=y$ we have
$f(y,y;A,B)=2Ay(1-y)\!\geq\!0$ for $A\!\geq\!0$. The extremal
condition $\partial f/\partial z\!=\!0$ for $z\!\neq\!y$
(keeping $y$ fixed) leads to
$$
y \;=\; y^* \;:=\; z\!\cdot\![1-{s\over B}(1-z)],
\quad s\;:=\;\mbox{sign}(z-y)\;=\;\pm 1.
$$
Inserting $y^*$ into the definition of $f$ and omitting the
positive term $B[\ldots]$, we get
\beqn
f(y^*,z;A,B) \;>\; 2Ay^*(1-y^*)-|z-y^*| \;=\;
\textstyle{{1\over B^2}z(1-z)}\!\cdot\!g(z;A,B)
\eeqn
\beqn
g(z;A,B) \;:=\; 2A(B-s(1-z))(B+sz)-B.
\eeqn
We have reduced the problem to showing $g\!\geq\!0$. Since $s=\pm
1$, we have $g(z;A,B)>2A(B-1+z)(B-z)-B$ for $B>1$. The latter is
quadratic in $z$ and symmetric in $z\!\leftrightarrow\!1-z$ with a
maximum at $\odt$. Thus it is sufficient to check the boundary
values $g(0;A,B)=g(1;A,B)=2A(B-1)B-B$. They are non-negative for
$2A(B-1)\geq 1$. Putting everything together, we have proved that
$f\!>\!0$ for $B\!\geq\!{1\over 2A}+1$ and $A\!>\!0$. $\qed$
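As a numerical sanity check (not part of the original proof), a Python sketch can test positivity of $f$ on an interior grid for one admissible parameter pair, here $A=1$ and $B={1\over 2A}+1={3\over 2}$:

```python
import math

# Grid check of Appendix A: f(y,z;A,B) = A*2*y*(1-y) + B*KL(y,z)
# - |y-z| is positive on the interior of the unit square.
def kl(y, z):
    return y * math.log(y / z) + (1 - y) * math.log((1 - y) / (1 - z))

def f(y, z, A, B):
    return A * 2 * y * (1 - y) + B * kl(y, z) - abs(y - z)

A = 1.0
B = 1 / (2 * A) + 1
grid = [i / 200 for i in range(1, 200)]
ok = all(f(y, z, A, B) > 0 for y in grid for z in grid)
print(ok)  # True
```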
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Proof of Inequality (\ref{basic1})}\label{AppBasic1}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
The proof of this inequality is similar to the previous one. With the definition
\beqn
f(y,z;A,B) \;:=\;
(A- 1)2y(1-y) +
(B- 1)\left[y\ln{y\over z}+(1-y)\ln{1-y\over 1-z}\right] +
y(1-z)+ z(1-y)
\eeqn
we have to show $f(y,z;A,B)\!>\!0$ for $0\!<\!y\!<\!1$,
$0\!<\!z\!<\!1$ and suitable $A$ and $B$. Again, we do this by showing
that $f\!>\!0$ at all extremal values and `at' the boundary.
$f\!\to\!+\infty$ for $z\to 0,1$, if we choose $B>1$. The
extremal condition $\partial f/\partial z\!=\!0$ (keeping $y$
fixed) leads to
$$
y \;=\; y^* \;:=\; z\!\cdot\!{z- B\over 1- B-2z(1-z)},
\quad 0\;<\;y^*\;<\;1.
$$
Inserting $y^*$ into the definition of $f$ and omitting the
positive term $(B-1)[\ldots]$, we get
\beqn
f(y^*,z;A,B) \;>\; 2Ay^*(1-y^*)- (2y^*-1)(z-y^*) \;=\;
\textstyle{{z(1-z)\over[1- B-2z(1-z)]^2}}\!\cdot\!g(z;A,B)
\eeqn
\beqn
g(z;A,B) \;:=\; 2A(z- B)(1-z- B)-(B- 1)(2z-1)^2.
\eeqn
We have reduced the problem to showing $g\!\geq\!0$. This is easy,
since $g$ is quadratic in $z$ and symmetric in
$z\!\leftrightarrow\!1-z$. The extremal value
$g(\odt;A,B)=2A(B-\odt)^2$ is positive for $A\!>\!0$. The
boundary values $g(0;A,B)=g(1;A,B)=(2AB-1)(B-1)$ are
$\geq 0$ for $2AB\geq 1$. Putting everything together, we have
proved that $f\!>\!0$ for $2AB\!\geq\! 1$ and $B\!>\!1$. $\qed$
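Again, positivity of this $f$ can be spot-checked on an interior grid (a sketch, not part of the original proof), here for the marginal parameter pair $A={1\over 4}$, $B=2$ with $2AB=1$:

```python
import math

# Grid check of Appendix B: f(y,z;A,B) = (A-1)*2*y*(1-y)
# + (B-1)*KL(y,z) + y*(1-z) + z*(1-y) is positive on the interior
# of the unit square for 2*A*B >= 1 and B > 1.
def kl(y, z):
    return y * math.log(y / z) + (1 - y) * math.log((1 - y) / (1 - z))

def f(y, z, A, B):
    return ((A - 1) * 2 * y * (1 - y) + (B - 1) * kl(y, z)
            + y * (1 - z) + z * (1 - y))

A, B = 0.25, 2.0
grid = [i / 200 for i in range(1, 200)]
ok = all(f(y, z, A, B) > 0 for y in grid for z in grid)
print(ok)  # True
```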
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Proof of Inequality (\ref{basic4})}\label{AppBasic4}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We want to show that
$$
\textstyle
|y-\Theta(z-\odt)| \;<\;
(A+1)\min\{y,1-y\} +
(B+1)\left[y\ln{y\over z}+(1-y)\ln{1-y\over 1-z}\right]
$$
The formula is symmetric w.r.t. $y\!\leftrightarrow\!1\!-\!y$ and
$z\!\leftrightarrow\!1\!-\!z$ simultaneously, so we can restrict
ourselves to $0\!<\!y\!<\!1$ and $0\!<\!z\!<\!\odt$. Furthermore, let
$B\!>\!-1$. Using (\ref{kulbound}), it is enough to prove
$$
f(y,z;A,B) \;:=\; (A+1)\min\{y,1-y\}+(B+1)2(y-z)^2-y \;>\; 0
$$
$f$ is quadratic in $z$; thus for $y\!<\!\odt$ it takes its
minimum at $z=y$. Since $f(y,y;A,B)=Ay>0$ for $A>0$, we can
concentrate on the case $y\geq\odt$. In this case, the
minimum is reached at the boundary $z=\odt$.
$$
f(y,\odt;A,B) \;=\; (A+1)(1-y)+(B+1)2(y-\odt)^2-y
$$
This is now quadratic in $y$ with minimum at
$$
y^* \;=\; {A+2B+4\over 4(B+1)}, \quad
f(y^*,\odt;A,B) \;=\; {4AB-A^2-4\over 8(B+1)} \;\geq\; 0
$$
for $B\geq{1\over 4}A+{1\over A}$, $A>0$, $(\Rightarrow B\geq 1)$. $\qed$
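A final numerical spot check (not part of the original proof) tests positivity of this $f$ for the equality case $A=2$, $B={1\over 4}A+{1\over A}=1$ (where $4AB-A^2-4=0$), restricting $z$ to $(0,\odt]$ by the symmetry noted above:

```python
# Grid check of Appendix C: f(y,z;A,B) = (A+1)*min(y,1-y)
# + (B+1)*2*(y-z)^2 - y is positive on the interior grid.
def f(y, z, A, B):
    return (A + 1) * min(y, 1 - y) + (B + 1) * 2 * (y - z) ** 2 - y

A = 2.0
B = A / 4 + 1 / A
grid = [i / 200 for i in range(1, 200)]
ok = all(f(y, z, A, B) > 0 for y in grid for z in grid if z <= 0.5)
print(ok)  # True
```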
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Bibliography %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\addcontentsline{toc}{section}{References}
\parskip=0ex plus 1ex minus 1ex
\begin{thebibliography}{9}
\bibitem{Ang83}
{ D. Angluin, C. H. Smith}:
{\it Inductive inference: Theory and methods};
{ Comput. Surveys, 15:3, (1983), 237--269 }.
\bibitem{Bay63}
{ T. Bayes}:
{\it An essay towards solving a problem in the doctrine of chances};
{ Philos. Trans. Roy. Soc., 53 (1763) 376--398}.
\bibitem{Kol65}
{ A.N. Kolmogorov}:
{\it Three approaches to the quantitative definition of information};
{ Problems Inform. Transmission, 1:1(1965), 1--7}.
\bibitem{Kul59}
{ S. Kullback}:
{\it Information Theory and Statistics};
{ Wiley, New York (1959)}.
\bibitem{Lev74}
{ L.A. Levin}:
{\it Laws of information conservation (non-growth) and
aspects of the foundation of probability theory};
{\rm Problems Inform. Transmission, 10 (1974), 206--210}.
\bibitem{LiVi92}
{ M. Li and P.M.B. Vit\'anyi}:
{\it Inductive reasoning and Kolmogorov complexity};
{ J. Comput. System Sci., 44:2 (1992), 343--384}.
\bibitem{LiVi97}
{ M. Li and P.M.B. Vit\'anyi}:
{\it An Introduction to Kolmogorov Complexity and its Applications};
{ Springer-Verlag, New York, 2nd Edition, 1997}.
\bibitem{Sol64}
{ R.J. Solomonoff}:
{\it A formal theory of inductive inference, Part 1 and 2};
{ Inform. Control, 7 (1964), 1--22, 224--254}.
\bibitem{Sol78}
{ R.J. Solomonoff}:
{\it Complexity-based induction systems: comparisons and convergence theorems};
{ IEEE Trans. Inform. Theory, IT-24:4, (1978), 422--432}.
\bibitem{Sol97}
{ R.J. Solomonoff}:
{\it The discovery of algorithmic probability};
{\rm J. Comput. System Sci. 55 (1997), 73--88}.
\bibitem{Wil70}
{ D.G. Willis}:
{\it Computational complexity and probability constructions};
{ J. Assoc. Comput. Mach., 17:2 (1970), 241--259}.
\bibitem{ZvLe70}
{ A.K. Zvonkin and L.A. Levin}:
{\it The complexity of finite objects and the development of the concepts
of information and randomness by means of the theory of algorithms};
{ Russian Math. Surveys, 25:6 (1970), 83--124}.
\end{thebibliography}
\end{document}
%---------------------------------------------------------------