%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Optimality of Universal Bayesian Sequence %%
%% Prediction for General Loss and Alphabet %%
%% (c) 2000-2003 by Marcus Hutter %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%-------------------------------%
% Document-Style %
%-------------------------------%
\documentclass[12pt,twoside]{article}
\usepackage{latexsym}
\pagestyle{myheadings}
\markboth{\sc Marcus Hutter, Technical Report IDSIA-02-02
}{\sc Optimality of Universal Bayesian Sequence Prediction}
\setcounter{tocdepth}{3}
\setcounter{secnumdepth}{2}
\topmargin=0cm \oddsidemargin=5mm \evensidemargin=5mm
\textwidth=15cm \textheight=22cm
\unitlength=1mm \sloppy
\makeindex
%-------------------------------%
% My Math-Spacings %
%-------------------------------%
\def\,{\mskip 3mu} \def\>{\mskip 4mu plus 2mu minus 4mu} \def\;{\mskip 5mu plus 5mu} \def\!{\mskip-3mu}
\def\dispmuskip{\thinmuskip= 3mu plus 0mu minus 2mu \medmuskip= 4mu plus 2mu minus 2mu \thickmuskip=5mu plus 5mu minus 2mu}
\def\textmuskip{\thinmuskip= 0mu \medmuskip= 1mu plus 1mu minus 1mu \thickmuskip=2mu plus 3mu minus 1mu}
\textmuskip
\def\beq{\dispmuskip\begin{equation}} \def\eeq{\end{equation}\textmuskip}
\def\beqn{\dispmuskip\begin{displaymath}}\def\eeqn{\end{displaymath}\textmuskip}
\def\bqa{\dispmuskip\begin{eqnarray}} \def\eqa{\end{eqnarray}\textmuskip}
\def\bqan{\dispmuskip\begin{eqnarray*}} \def\eqan{\end{eqnarray*}\textmuskip}
%-------------------------------%
% Macro-Definitions %
%-------------------------------%
\newenvironment{keywords}{\centerline{\bf\small
Keywords}\vspace{-2ex}\begin{quote}\small}{\par\end{quote}\vskip 1ex}
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{definition}[theorem]{Definition}
\def\lessgtr{{\stackrel{\displaystyle<}>}}
\def\ftheorem#1#2#3{\begin{theorem}[#2]\label{#1} #3 \end{theorem} }
\def\fcorollary#1#2#3{\begin{corollary}[#2]\label{#1} #3 \end{corollary} }
\def\flemma#1#2#3{\begin{lemma}[#2]\label{#1} #3 \end{lemma} }
\def\fdefinition#1#2#3{\begin{definition}[#2]\label{#1} #3 \end{definition} }
\def\idx#1{\index{#1}#1} %\idx{name} for also in text
\def\indxs#1#2{\index{#1!#2}\index{#2!#1}} %\idx{name} for also in text
\def\tcite{\cite}
\def\citey{\cite}
\def\citein#1{in \cite{#1}}
\def\citealt{\cite}
\def\paradot#1{\vspace{1ex}\noindent{\bf #1.}}
\def\paranodot#1{\vspace{1ex}\noindent{\bf #1}}
\def\toinfty#1{\stackrel{#1\to\infty}{\longrightarrow}}
\def\nq{\hspace{-1em}}
\def\qed{\hspace*{\fill}$\Box\quad$}
\def\odt{{\textstyle{1\over 2}}}
\def\eps{\varepsilon} % for small positive number
\def\epstr{\epsilon} % for empty string
\def\v#1{{\bf #1}}
\def\approxleq{\mbox{\raisebox{-0.8ex}{$\stackrel{\displaystyle<}\sim$}}} %% make nicer
\def\approxgeq{\mbox{\raisebox{-0.8ex}{$\stackrel{\displaystyle>}\sim$}}} %% make nicer
\def\M{{\cal M}}
\def\X{{\cal X}}
\def\Y{{\cal Y}}
\def\F{{\cal F}}
\def\E{{\bf E}}
\def\P{{\bf P}}
\def\Km{K\!m}
\def\Set#1{\if#1Q{I\!\!\!#1}\else\if#1Z{Z\!\!\!Z}\else{I\!\!#1}\fi\fi}
\def\qmbox#1{{\quad\mbox{#1}\quad}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% T i t l e - P a g e %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{document}
\begin{titlepage}
\title{\vskip -10mm \normalsize\sc Technical Report IDSIA-02-02 \hfill February 2002 -- January 2003
\vskip 2mm\bf\LARGE\hrule height5pt \vskip 5mm
\Large\sc Optimality of Universal Bayesian Sequence \\
Prediction for General Loss and Alphabet
\vskip 6mm \hrule height2pt \vskip 4mm}
\author{{\bf Marcus Hutter}\\[3mm]
\normalsize IDSIA, Galleria 2, CH-6928\ Manno-Lugano, Switzerland\\
\normalsize marcus@idsia.ch \hspace{8.5ex} http://www.idsia.ch/$^{_{_\sim}}\!$marcus}
\date{}
\maketitle
\begin{abstract}
\noindent Various optimality properties of universal sequence
predictors based on Bayes-mixtures in general, and Solomonoff's
prediction scheme in particular, will be studied.
%
The probability of observing $x_t$ at time $t$, given past
observations $x_1...x_{t-1}$ can be computed with the chain rule
if the true generating distribution $\mu$ of the sequences
$x_1x_2x_3...$ is known. If $\mu$ is unknown, but known to belong
to a countable or continuous class $\M$, one can base one's
prediction on the Bayes-mixture $\xi$ defined as a
$w_\nu$-weighted sum or integral of distributions $\nu\in\M$. The
cumulative expected loss of the Bayes-optimal universal prediction
scheme based on $\xi$ is shown to be close to the loss of the
Bayes-optimal, but infeasible prediction scheme based on $\mu$. We
show that the bounds are tight and that no other predictor can
lead to significantly smaller bounds.
%
Furthermore, for various performance measures, we show
Pareto-optimality of $\xi$ and give an Occam's razor argument that
the choice $w_\nu\sim 2^{-K(\nu)}$ for the weights is optimal,
where $K(\nu)$ is the length of the shortest program describing
$\nu$.
%
The results are applied to games of chance, defined as a sequence
of bets, observations, and rewards.
%
The prediction schemes (and bounds) are compared to the popular
predictors based on expert advice.
%
Extensions to infinite alphabets, partial, delayed and
probabilistic prediction, classification, and more active systems
are briefly discussed.
\end{abstract}
\begin{keywords}%
Bayesian sequence prediction; mixture distributions; Solomonoff
induction; Kolmogorov complexity; learning; universal probability;
tight loss and error bounds; Pareto-optimality; games of chance;
classification.
\end{keywords}
\end{titlepage}
%------------------------------%
% Table of Contents %
%------------------------------%
\parskip=0ex\tableofcontents\newpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}\label{secInt}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%------------------------------%
%\subsection{Induction}\index{induction}
%------------------------------%
Many problems are of the induction type in which statements about
the future have to be made, based on past observations. What is
the probability of rain tomorrow, given the weather observations
of the last few days? Is the Dow Jones likely to rise tomorrow,
given the chart of the last years and possibly additional
newspaper information? Can we reasonably doubt that the sun will
rise tomorrow? Indeed, one definition of science is the prediction
of the future: as an intermediate step one tries to understand the
past by developing theories, and finally one uses the predictions
as the basis for decisions. Most induction problems can be
studied in the Bayesian framework. The probability of observing
$x_t$ at time $t$, given the observations $x_1...x_{t-1}$ can be
computed with the chain rule, if we know the true probability
distribution, which generates the observed sequence
$x_1x_2x_3...$. The problem is that in many cases we do not even
have a reasonable guess of the true distribution $\mu$. What is
the true probability of weather sequences, stock charts, or
sunrises?
%------------------------------%
%\subsection{Universal Sequence Prediction}
%------------------------------%
\index{sequence prediction!universal}
In order to overcome the problem of the unknown true distribution,
one can define a mixture distribution $\xi$ as a weighted
sum or integral over distributions $\nu\in\cal M$, where $\cal M$
is any discrete or continuous (hypothesis) set. $\M$ is assumed to
be known and to contain the true distribution, i.e.\ $\mu\in\M$.
%
Since the probability $\xi$ can be shown to converge rapidly to
the true probability $\mu$ in a conditional sense, making
decisions based on $\xi$ is often nearly as good as the infeasible
optimal decision based on the unknown $\mu$ \cite{Merhav:98}.
%
Solomonoff \citey{Solomonoff:64} had the idea to define a
universal mixture as a weighted average over deterministic
programs. Lower weights were assigned to longer programs. He
unified Epicurus' principle of multiple explanations and Occam's
razor [simplicity] principle into one formal theory (see
\citealt{Li:97} for this interpretation of
\citealt{Solomonoff:64}). Inspired by Solomonoff's idea, Levin
\citey{Zvonkin:70} defined the closely related universal prior
$\xi_U$ as a weighted average over {\em all} semi-computable
probability distributions. If the environment possesses some
effective structure at all, Solomonoff-Levin's posterior ``finds''
this structure \cite{Solomonoff:78}, and allows for a good
prediction. In a sense, this solves the induction problem in a
universal way, i.e.\ without making problem specific assumptions.
%------------------------------%
%\subsection{Contents}
%------------------------------%
\paranodot{Section \ref{secSetup}} explains notation and defines the
{\em universal or mixture distribution} $\xi$ as the
$w_\nu$-weighted sum of probability distributions $\nu$ of a set
$\M$, which includes the true distribution $\mu$. No structural
assumptions are made on the $\nu$. $\xi$ multiplicatively
dominates all $\nu \in \M$, and the relative entropy between $\mu$
and $\xi$ is bounded by $\ln{w_\mu^{-1}}$. Convergence of $\xi$ to
$\mu$ in a mean squared sense is shown in Theorem \ref{thConv}.
The representation of the universal posterior distribution and the
case $\mu\not\in\M$ are briefly discussed. Various standard sets
$\M$ of probability measures are discussed, including computable,
enumerable, cumulatively enumerable, approximable, finite-state, and
Markov (semi)measures.
\paranodot{Section \ref{secErr}} is essentially a generalization of the
deterministic error bounds found \citein{Hutter:99errbnd} from the
binary alphabet to a general finite alphabet $\X$. Theorem
\ref{thErrBnd} bounds $E^{\Theta_\xi} - E^{\Theta_\mu}$ by
$O(\sqrt{E^{\Theta_\mu}})$, where $E^{\Theta_\xi}$ is the expected
number of errors made by the optimal universal predictor
$\Theta_\xi$, and $E^{\Theta_\mu}$ is the expected number of
errors made by the optimal informed prediction scheme
$\Theta_\mu$. The non-binary setting cannot be reduced to the
binary case! One might think of a binary coding of the symbols
$x_t\in\X$ in the sequence $x_1x_2...$. But this would make it
necessary to predict the whole block of bits coding $x_t$ before
receiving the true block, which differs from the bit-by-bit
prediction scheme considered
\citein{Solomonoff:78,Hutter:99errbnd}.
%
The framework generalizes to the case where an action $y_t \in \Y$
results in a loss $\ell_{x_t y_t}$ if $x_t$ is the next symbol of
the sequence. Optimal universal $\Lambda_\xi$ and optimal informed
$\Lambda_\mu$ prediction schemes are defined for this case, and
loss bounds similar to the error bounds of the last section are
stated. No assumptions on $\ell$ have to be made, besides
boundedness.
\paranodot{Section \ref{secGames}} applies
the loss bounds to games of chance, defined as a sequence of bets,
observations, and rewards. The average profit $\bar
p_n^{\Lambda_\xi}$ achieved by the $\Lambda_\xi$ scheme rapidly
converges to the best possible average profit $\bar
p_n^{\Lambda_\mu}$ achieved by the $\Lambda_\mu$ scheme ($\bar
p_n^{\Lambda_\xi} - \bar p_n^{\Lambda_\mu} = O(n^{-1/2})$). If
there is a profitable scheme at all ($\bar p_n^{\Lambda_\mu} >
\eps > 0$), asymptotically the universal $\Lambda_\xi$ scheme will
also become profitable. Theorem
\ref{thWin} bounds the time needed to reach the winning zone. It
is proportional to the relative entropy of $\mu$ and $\xi$ with a
factor depending on the profit range and on $\bar
p_n^{\Lambda_\mu}$. An attempt is made to give an information
theoretic interpretation of the result.
\paranodot{Section \ref{secLower}} discusses the quality of the
universal predictor and the bounds. We show that
%for every predictor $\Theta$ not knowing $\mu$ (especially
%$\Theta_\xi$)
there are $\M$ and $\mu\in\M$ and weights $w_\nu$ such that the
derived error bounds are tight. This shows that the error bounds
cannot be improved in general. We also show Pareto-optimality of
$\xi$ in the sense that there is no other predictor which performs
at least as well in all environments $\nu\in\M$ and strictly
better in at least one. Optimal predictors can always be based on
mixture distributions $\xi$. This still leaves open how to choose
the weights. We give an Occam's razor argument that the choice
$w_\nu=2^{-K(\nu)}$, where $K(\nu)$ is the length of the shortest
program describing $\nu$, is optimal.
\paranodot{Section \ref{secMisc}} generalizes the setup to continuous
probability classes $\M = \{\mu_\theta\}$ consisting of continuously
parameterized distributions $\mu_\theta$ with parameter
$\theta \in \Set{R}^d$. Under certain smoothness and regularity
conditions a bound for the relative entropy between $\mu$ and
$\xi$, which is central for all presented results, can still be
derived. The bound depends on the Fisher information of $\mu$ and
grows only logarithmically with $n$, the intuitive reason being
the necessity to describe $\theta$ to an accuracy $O(n^{-1/2})$.
%
Furthermore, two ways of using the prediction schemes for partial
sequence prediction, where not every symbol needs to be predicted,
are described. Performing and predicting a sequence of independent
experiments and online learning of classification tasks are
special cases.
%
We also compare the universal prediction scheme studied here to
the popular predictors based on expert advice (PEA)
\cite{Littlestone:89,Vovk:92,Littlestone:94,Cesa:97,Haussler:98,Kivinen:99}.
Although the algorithms, the settings, and the proofs are quite
different, the PEA bounds and our error bound have the same
structure.
%
Finally, we outline possible extensions of the presented theory
and results, including infinite alphabets, delayed and
probabilistic prediction, active systems influencing the
environment, learning aspects, and a unification with PEA.
\paranodot{Section \ref{secSPConc}} summarizes the results.
%------------------------------%
%\paradot{Introductory References}
%------------------------------%
There are good introductions and surveys of Solomonoff sequence
prediction \cite{Li:92b,Li:97}, inductive inference in general
\cite{Angluin:83,Solomonoff:97,Merhav:98}, reasoning under
uncertainty \cite{Gruenwald:98}, and competitive online statistics
\cite{Vovk:99}, with interesting relations to this work. See
Section \ref{secWM} for some more details.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Setup and Convergence}\label{secSetup}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
In this section we show that the mixture $\xi$ converges rapidly to
the true distribution $\mu$.
%
After defining basic notation in Section \ref{secRanSeq},
%
we introduce in Section \ref{secUPPD} the {\em universal or
mixture distribution} $\xi$ as the $w_\nu$-weighted sum of
probability distributions $\nu$ of a set $\M$, which includes the
true distribution $\mu$. No structural assumptions are made on the
$\nu$. $\xi$ multiplicatively dominates all $\nu\in\M$.
%
A posterior representation of $\xi$ with incremental weight update
is presented in Section \ref{secUPost}.
%
In Section \ref{secConv} we show that the relative entropy between $\mu$
and $\xi$ is bounded by $\ln{w_\mu^{-1}}$ and that $\xi$ converges to
$\mu$ in a mean squared sense.
%
The
case $\mu\not\in\M$ is briefly discussed in Section \ref{subsecnotM}.
%
The section concludes with Section \ref{secPCM}, which
discusses various standard sets $\M$ of probability measures,
including computable, enumerable, cumulatively enumerable,
approximable, finite-state, and Markov (semi)measures.
%------------------------------%
\subsection{Random Sequences}\label{secRanSeq}\indxs{sequence}{random}
%------------------------------%
We denote strings over a finite alphabet $\X$ by $x_1x_2...x_n$
with $x_t\in\X$ and $t,n,N\in\Set N$ and $N=|\X|$. We further use
the abbreviations $\epstr$\index{string!empty} for the empty
string, $x_{t:n}:=x_tx_{t+1}...x_{n-1}x_n$ for $t\leq n$ and
$\epstr$ for $t>n$, and $x_{<t}:=x_{1:t-1}$. Greek letters denote
probability distributions; $\rho(x_{1:n})$ is the probability that
a sequence starts with $x_{1:n}$, and $\E_{1:n}$ ($\E_{<t}$)
denotes the $\mu$-expectation over $x_{1:n}$ ($x_{<t}$).
%------------------------------%
\subsection{Universal Prior Probability Distribution}\label{secUPPD}
%------------------------------%
\index{probability distribution!prior}\index{probability distribution!universal}
We define the {\em universal or mixture distribution} $\xi$ as the
$w_\nu$-weighted sum of the distributions $\nu\in\M$
\beq\label{xidefsp}
  \xi(x_{1:n}) \;:=\; \sum_{\nu\in\M}w_\nu\!\cdot\!\nu(x_{1:n}),
  \qquad \sum_{\nu\in\M}w_\nu \;=\; 1, \qquad w_\nu \;>\; 0
\eeq
It is easy to see that $\xi$ is a probability distribution as the
\idx{weights} $w_\nu$ are positive and normalized to 1 and the
$\nu \in \M$ are probabilities.\footnote{The weight $w_\nu$ may be
interpreted as the initial degree of belief in $\nu$ and
$\xi(x_1...x_n)$ as the degree of belief in $x_1...x_n$. If the
existence of true randomness is rejected on philosophical grounds
one may consider $\M$ containing only deterministic environments.
$\xi$ still represents belief
probabilities\indxs{belief}{probability}.} For a finite $\M$ a
possible choice for the $w$ is to give all $\nu$ equal weight
($w_\nu={1\over|\M|}$). We call $\xi$ universal relative to $\M$,
as it multiplicatively dominates all distributions in $\M$
\index{multiplicative!domination}
\beq\label{unixi}
\xi(x_{1:n}) \;\geq\;
w_\nu\!\cdot\!\nu(x_{1:n}) \quad\mbox{for all}\quad
\nu\in\M.
\eeq
In the following, we assume that $\M$ is known and contains the
true distribution, i.e.\ $\mu\in\M$. If $\M$ is chosen
sufficiently large, then $\mu\in\M$ is not a serious
constraint.
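The mixture and the dominance property (\ref{unixi}) can be illustrated numerically. The following minimal Python sketch uses an illustrative two-element class $\M$ of i.i.d.\ Bernoulli distributions with equal weights (the class, weights, and sequence are hypothetical choices, not from the text):

```python
# Illustration of the Bayes mixture xi and multiplicative dominance
# xi(x_1:n) >= w_nu * nu(x_1:n) for all nu in M.

def bernoulli(theta):
    """Return nu(x_1:n) for an i.i.d. Bernoulli(theta) distribution."""
    def nu(xs):
        p = 1.0
        for x in xs:
            p *= theta if x == 1 else 1.0 - theta
        return p
    return nu

# Illustrative finite class M with equal weights w_nu = 1/|M|.
M = {"fair": (0.5, bernoulli(0.5)), "biased": (0.5, bernoulli(0.7))}

def xi(xs):
    """Mixture xi(x_1:n) = sum over nu in M of w_nu * nu(x_1:n)."""
    return sum(w * nu(xs) for (w, nu) in M.values())

seq = [1, 1, 0, 1, 1, 1, 0, 1]
for name, (w_nu, nu) in M.items():
    assert xi(seq) >= w_nu * nu(seq)   # dominance (unixi)
print(xi(seq))
```

Since every summand of the mixture is nonnegative, dominance holds term by term, which is all that is used in the entropy bound below.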
%------------------------------%
\subsection{Universal Posterior Probability Distribution}\label{secUPost}
%------------------------------%
\index{probability distribution!posterior}\index{posterior probability}
\index{probability distribution!conditional}
\index{weights!time dependent}\index{weights!posterior}
All prediction schemes in this work are based on the conditional
probabilities $\rho(x_t|x_{<t}) = \rho(x_{1:t})/\rho(x_{<t})$. The
universal conditional probability possesses the representation
\beqn
  \xi(x_t|x_{<t}) \;=\; \sum_{\nu\in\M} w_\nu(x_{<t})\!\cdot\!\nu(x_t|x_{<t})
  \qmbox{with} w_\nu(x_{<t}) \;:=\; w_\nu{\nu(x_{<t})\over\xi(x_{<t})},
\eeqn
i.e.\ $\xi(x_t|x_{<t})$ is a weighted average of the conditional
$\nu$-probabilities, where the posterior weights $w_\nu(x_{<t})$
can be updated incrementally by Bayes' rule,
$w_\nu(x_{1:t}) = w_\nu(x_{<t})\,\nu(x_t|x_{<t})/\xi(x_t|x_{<t})$.
%------------------------------%
\subsection{Convergence}\label{secConv}
%------------------------------%
\indxs{convergence}{of $\xi$ to $\mu$}
Let $D_n := \E_{1:n}\ln{\mu(x_{1:n})\over\xi(x_{1:n})}$ be the
relative entropy between $\mu$ and $\xi$. Dominance (\ref{unixi})
immediately implies the bound
\beq\label{entropy}
  D_n \;\leq\; \ln w_\mu^{-1},
\eeq
which is central to all results in this work. It also implies a
mean squared convergence of $\xi$ to $\mu$:
%------------------------------%
\ftheorem{thConv}{Convergence}{
%------------------------------%
The universal conditional probability $\xi(x_t|x_{<t})$ converges
to the true conditional probability $\mu(x_t|x_{<t})$ in a mean
squared sense:
\beqn
  \sum_{t=1}^n \E_{<t} \sum_{x_t\in\X}
  \Big(\mu(x_t|x_{<t})-\xi(x_t|x_{<t})\Big)^{\!2}
  \;\leq\; D_n \;\leq\; \ln w_\mu^{-1} \;<\; \infty.
\eeqn
}%------------------------------%
\subsection{The Case $\mu\not\in\M$}\label{subsecnotM}
%------------------------------%
One situation in which the results remain valid for $\mu\not\in\M$
is when $\xi$ still dominates $\mu$, i.e.\ $\xi(x_{1:n}) \geq
\alpha\cdot\mu(x_{1:n})$ for some $\alpha > 0$ (then $w_\mu \geq \alpha$).
\index{probability!nearby distribution}\index{probability distribution!nearby}
Another possibly interesting situation is when the true generating
distribution $\mu \not\in \M$, but a ``nearby'' distribution
$\hat\mu$ with weight $w_{\hat\mu}$ is in $\M$. If we measure
the distance of $\hat\mu$ to $\mu$ with the Kullback-Leibler
divergence
$D_n(\mu||\hat\mu):=\sum_{x_{1:n}}\mu(x_{1:n})\ln{\mu(x_{1:n})\over\hat\mu(x_{1:n})}$
and assume that it is bounded by a constant $c$, then
\beqn
D_n \;=\; \E_{1:n} \ln{\mu(x_{1:n})\over\xi(x_{1:n})} \;=\;
\E_{1:n} \ln{\hat\mu(x_{1:n})\over\xi(x_{1:n})} +
\E_{1:n} \ln{\mu(x_{1:n})\over\hat\mu(x_{1:n})} \;\leq\;
\ln w_{\hat\mu}^{-1} + c.
\eeqn
So $D_n\!\leq\ln w_\mu^{-1}$ remains valid if we define
$w_\mu := w_{\hat\mu} \cdot e^{-c}$.
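All results rest on the conditional probabilities $\rho(x_t|x_{<t})$ of Section \ref{secUPost}, which for the mixture $\xi$ can be computed by incrementally updating the posterior weights with Bayes' rule. A minimal Python sketch with two hypothetical i.i.d.\ Bernoulli hypotheses (an illustrative class, not from the text):

```python
# Sketch of sequential prediction with the mixture xi: the conditional
# probability xi(x_t | x_<t) is a posterior-weighted average of the
# conditional nu-probabilities, and the weights follow Bayes' rule.

hypotheses = {"theta=0.5": 0.5, "theta=0.9": 0.9}   # illustrative class M
weights = {name: 0.5 for name in hypotheses}        # prior weights w_nu

def predict_and_update(weights, x):
    """Return xi(x | history) and the posterior weights after observing x."""
    lik = {n: (th if x == 1 else 1.0 - th) for n, th in hypotheses.items()}
    xi_cond = sum(weights[n] * lik[n] for n in weights)   # xi(x | x_<t)
    posterior = {n: weights[n] * lik[n] / xi_cond for n in weights}
    return xi_cond, posterior

for x in [1, 1, 1, 1, 1, 0, 1, 1]:
    xi_cond, weights = predict_and_update(weights, x)

print(weights)  # posterior mass concentrates on the better-fitting hypothesis
```

After eight observations (seven ones), most of the posterior weight sits on the hypothesis closer to the empirical frequency; unrolling the update reproduces $w_\nu(x_{1:t}) \propto w_\nu\,\nu(x_{1:t})$.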
%------------------------------%
\subsection{Probability Classes $\M$}\label{secPCM}
%------------------------------%
\index{probability!classes}
\indxs{probability class}{discrete}
\indxs{probability class}{countable}
In the following we describe some well-known and some less known
probability classes $\M$. This relates our setting to other
works in this area, embeds it into the historical context,
illustrates the type of classes we have in mind, and discusses
computational issues.
\index{probability distribution!computable}
We get a rather wide class $\M$ if we include {\em all}
(semi)computable probability distributions in $\M$. In this case,
the assumption $\mu\in\M$ is very weak, as it only assumes
that the strings are drawn from {\em any (semi)computable} distribution;
and all valid physical theories (and, hence, all environments)
{\em are} computable to arbitrary precision (in a probabilistic sense).
\index{probability distribution!simple}
We will see that it is favorable to assign high weights $w_\nu$ to the
$\nu$. Simplicity should be favored over complexity, according to
Occam's razor. In our context this means that a high weight should
be assigned to simple $\nu$. The prefix Kolmogorov complexity
$K(\nu)$ is a universal complexity measure
\cite{Kolmogorov:65,Zvonkin:70,Li:97}. It is defined as the length of
the shortest self-delimiting program (on a universal Turing
machine) computing $\nu(x_{1:n})$ given $x_{1:n}$.
If we define
\beqn
w_\nu \;:=\; 2^{-K(\nu)} \qquad
\eeqn
then distributions which can be calculated by short programs,
have high weights. The relative entropy is bounded by the
Kolmogorov complexity of $\mu$ in this case ($D_n \leq K(\mu)
\cdot \ln 2$).
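The Occam weights $w_\nu = 2^{-K(\nu)}$ and the induced bound $D_n \leq \ln w_\mu^{-1} = K(\mu)\cdot\ln 2$ can be made concrete in a small sketch; the program lengths below are hypothetical placeholders for the (incomputable) prefix complexities $K(\nu)$:

```python
# Occam weights w_nu = 2^{-K(nu)} with hypothetical program lengths
# standing in for the incomputable prefix Kolmogorov complexities.
import math

program_length = {"mu": 5, "nu1": 8, "nu2": 12}   # assumed K(nu) in bits
w = {name: 2.0 ** -k for name, k in program_length.items()}

# Self-delimiting programs satisfy the Kraft inequality, so these weights
# sum to at most 1 (xi is then a semimeasure; exact normalization is not
# essential for the bounds).
assert sum(w.values()) <= 1.0

# Entropy bound for the true distribution mu: D_n <= ln(1/w_mu) = K(mu)*ln 2.
bound = math.log(1.0 / w["mu"])
print(bound, program_length["mu"] * math.log(2.0))  # the two agree
```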
\index{probability distribution!Solomonoff}%
\indxs{probability distribution}{universal}%
\indxs{Solomonoff}{semi-measure}%
\indxs{enumerable}{semi-measure}%
Levin's universal semi-measure $\xi_U$
is obtained if we take $\M=\M_U$
to be the (multi)set enumerated by a Turing machine which
enumerates all enumerable semi-measures
\cite{Zvonkin:70,Li:97}.
Recently, $\M$ has been further enlarged to include all
cumulatively enumerable\indxs{cumulatively
enumerable}{semi-measure} semi-measures \cite{Schmidhuber:02gtm}.
In the enumerable and cumulatively enumerable cases, $\xi$ is not
finitely computable, but can still be approximated to arbitrary
but not pre-specifiable precision.%
\indxs{approximable}{probability distribution} If we consider {\em
all} approximable (i.e.\ asymptotically computable) distributions,
then the universal distribution $\xi$, although still well
defined, is not even approximable \cite{Hutter:03unipriors}. An
interesting and quickly
approximable\index{prior!Speed}\index{Speed prior} distribution is
the Speed prior $S$ defined \citein{Schmidhuber:02speed}. It is
related to Levin complexity and Levin search
\cite{Levin:73search,Levin:84}, but it is unclear for now which
distributions are dominated by $S$. If one considers only
\idx{finite-state automata} instead of general Turing machines,
$\xi$ is related to the quickly computable, universal finite-state
prediction scheme of Feder et al.\ \citey{Feder:92}, which itself
is related to the famous Lempel-Ziv data compression
algorithm\index{Lempel-Ziv
compression}\index{compression!Lempel-Ziv}. If one has extra
knowledge on the source generating the sequence, one might further
reduce $\M$ and increase $w$. A detailed analysis of these and
other specific classes $\M$ will be given elsewhere. Note that
$\xi\in\M$ in the enumerable and cumulatively enumerable case, but
$\xi\not\in\M$ in the computable, approximable and finite-state
case. If $\xi$ is itself in $\M$, it is called a universal element
of $\M$ \cite{Li:97}. As we do not need this property here, $\M$
may be {\em any} countable set of distributions. In the following
sections we consider generic $\M$ and $w$.
We have discussed various discrete classes $\M$, which are
sufficient from a constructive or computational point of view. On
the other hand, it is convenient to also allow for continuous
classes $\M$. For instance, the class of {\em all} Bernoulli
processes with parameter $\theta\in[0,1]$ and uniform prior
$w_\theta\equiv 1$ is much easier to deal with than computable
$\theta$ only, with prior $w_\theta=2^{-K(\theta)}$. Other
important continuous classes are the class of i.i.d.\ and Markov
processes. Continuous classes $\M$ are considered in more detail
in Section \ref{secCPC}. \index{probability distribution!generic
class}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Error Bounds}\label{secErr}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\indxs{error}{bound}
In this section we prove error bounds for predictors based on
the mixture $\xi$.
%
Section \ref{secBOP} introduces the concept of Bayes-optimal
predictors $\Theta_\rho$, minimizing $\rho$-expected error.
%
In Section \ref{secTENE} we bound $E^{\Theta_\xi} -
E^{\Theta_\mu}$ by $O(\sqrt{E^{\Theta_\mu}})$, where
$E^{\Theta_\xi}$ is the expected number of errors made by the
optimal universal predictor $\Theta_\xi$, and $E^{\Theta_\mu}$ is
the expected number of errors made by the optimal informed
prediction scheme $\Theta_\mu$.
%
The proof is deferred to Section \ref{secPTErrBnd}.
%
In Section \ref{secLoss} we generalize the framework to the case
where an action $y_t \in \Y$ results in a loss $\ell_{x_t y_t}$ if
$x_t$ is the next symbol of the sequence. Optimal universal
$\Lambda_\xi$ and optimal informed $\Lambda_\mu$ prediction
schemes are defined for this case, and loss bounds similar to the
error bounds are presented. No assumptions on $\ell$ have to be
made, besides boundedness.
%------------------------------%
\subsection{Bayes-Optimal Predictors}\label{secBOP}
%------------------------------%
\indxs{Bayes-optimal}{prediction}
We start with a very simple measure: making a wrong prediction
counts as one error, making a correct prediction counts as no
error. In \cite{Hutter:99errbnd} error bounds have been proven for
the binary alphabet $\X=\{0,1\}$. The following generalization to
an arbitrary alphabet involves only minor additional
complications, but serves as an introduction to the more
complicated model with arbitrary loss function.
%
\indxs{minimize}{error}
Let $\Theta_\mu$ be the optimal prediction scheme when the strings
are drawn from the probability distribution $\mu$, i.e.\ the
probability of $x_t$ given $x_{<t}$ is $\mu(x_t|x_{<t})$, and $\mu$
is known. More generally, for any distribution $\rho$, the
Bayes-optimal predictor $\Theta_\rho$ predicts the symbol with
highest $\rho$-probability,
$x_t^{\Theta_\rho} := \arg\max_{x_t\in\X}\rho(x_t|x_{<t})$, which
minimizes the $\rho$-expected error. The $\mu$-probability that
$\Theta_\rho$ errs at time $t$, and the total $\mu$-expected number
of errors in the first $n$ predictions, are
\beqn
  e_t^{\Theta_\rho}(x_{<t}) \;:=\; 1-\mu(x_t^{\Theta_\rho}|x_{<t}),
  \qquad
  E_n^{\Theta_\rho} \;:=\; \sum_{t=1}^n \E_{<t}\,e_t^{\Theta_\rho}(x_{<t}).
\eeqn
%------------------------------%
\subsection{Total Expected Numbers of Errors}\label{secTENE}
%------------------------------%
\indxs{expected}{number of errors}
Let $S_n := \sum_{t=1}^n \E_{<t}\sum_{x_t\in\X}
\big(\mu(x_t|x_{<t})-\xi(x_t|x_{<t})\big)^2$ be the total expected
squared distance between $\mu$ and $\xi$, which is bounded by the
relative entropy, $S_n \leq D_n \leq \ln w_\mu^{-1}$ (Theorem
\ref{thConv}), and let $Q_n := \sum_{t=1}^n
\E_{<t}[1\!-\!\delta_{x_t^{\Theta_\mu}x_t^{\Theta_\xi}}]$ be the
expected number of time steps at which the predictions of
$\Theta_\mu$ and $\Theta_\xi$ differ.
%------------------------------%
\ftheorem{thErrBnd}{Error Bound}{
%------------------------------%
Let there be sequences $x_1x_2...$ over a finite alphabet $\X$
drawn with probability $\mu(x_{1:n})$ for the first $n$ symbols.
The total expected numbers of errors $E_n^{\Theta_\xi}$ of the
universal predictor $\Theta_\xi$ and $E_n^{\Theta_\mu}$ of the
informed predictor $\Theta_\mu$ are related in the following way:
\beqn
  0 \;\leq\; E_n^{\Theta_\xi}-E_n^{\Theta_\mu} \;\leq\;
  \sqrt{2Q_nS_n} \;\leq\;
  \sqrt{2(E_n^{\Theta_\xi}\!+\!E_n^{\Theta_\mu})S_n} \;\leq\;
  S_n+\sqrt{4E_n^{\Theta_\mu}S_n+S_n^2\,} \;\leq\;
  2S_n+2\sqrt{E_n^{\Theta_\mu}S_n\,}.
\eeqn
}%------------------------------%
\subsection{Proof of Theorem \ref{thErrBnd}}\label{secPTErrBnd}
%------------------------------%
For fixed $t$ and fixed history $x_{<t}$ we abbreviate $y_i :=
\mu(x_t\!=\!i|x_{<t})$ and $z_i := \xi(x_t\!=\!i|x_{<t})$, and
write $m := x_t^{\Theta_\mu}$ and $s := x_t^{\Theta_\xi}$ for the
predictions of $\Theta_\mu$ and $\Theta_\xi$. The instantaneous
quantities are then $e_t^{\Theta_\mu} = 1-y_m$, $e_t^{\Theta_\xi} =
1-y_s$, $q_t := 1-\delta_{ms}$, and $s_t := \sum_{i\in\X}
(y_i-z_i)^2$, with $Q_n = \sum_{t=1}^n\E_{<t}\,q_t$ and $S_n =
\sum_{t=1}^n\E_{<t}\,s_t$. First, $E_n^{\Theta_\xi}\geq
E_n^{\Theta_\mu}$ holds, since $\Theta_\mu$ minimizes the
$\mu$-expected error in each step ($y_m\geq y_s$). For the first
upper bound we try to find constants $A>0$ and $B>0$ that satisfy
the linear inequality
\beq\label{Eineq2}
E_n^{\Theta_\xi} - E_n^{\Theta_\mu} \;\leq\;
A Q_n + B S_n.
\eeq
If we could show
\beq\label{eineq2}
  e_t^{\Theta_\xi}(x_{<t}) - e_t^{\Theta_\mu}(x_{<t}) \;\leq\;
  A\,q_t(x_{<t}) + B\,s_t(x_{<t})
\eeq
for all $t$ and all histories $x_{<t}$, then (\ref{Eineq2}) would
follow by taking the expectation and summing over $t$. One can show
that (\ref{eineq2}) holds for all $A > 0$, provided we insert
$B = {1\over 2A}$. Thus we might
minimize the r.h.s.\ of (\ref{Eineq2}) w.r.t.\ $A$ leading to the upper bound
\beqn\label{detineq5}
E_n^{\Theta_\xi} - E_n^{\Theta_\mu} \;\leq\;
\sqrt{2Q_nS_n}
\qquad\mbox{for}\qquad A^2={S_n\over 2Q_n}
\eeqn
which is the first bound in Theorem \ref{thErrBnd}.
%
For the second bound we have to show $Q_n\leq E_n^{\Theta_\xi} +
E_n^{\Theta_\mu}$, which follows by summation from $q_t\leq
e_t^{\Theta_\xi} + e_t^{\Theta_\mu}$, which is equivalent to
$1-\delta_{ms}\leq 1-y_s+1-y_m$, which holds for $m=s$ as well as
$m\neq s$.
%
For the third bound we have to prove
\beq\label{detineq6}
\sqrt{2(E_n^{\Theta_\xi}\!+\! E_n^{\Theta_\mu})S_n} - S_n \;\leq\;
\sqrt{4E_n^{\Theta_\mu}S_n+S_n^2}.
\eeq
If we square both sides of this expression and simplify, we just
get the second bound. Hence, the second bound implies
(\ref{detineq6}).
%
The last inequality in Theorem \ref{thErrBnd} is a simple triangle
inequality. This completes the proof of Theorem \ref{thErrBnd}.
\qed
Note that the third bound also implies the second one:
\beqn
E_n^{\Theta_\xi}-E_n^{\Theta_\mu} \;\leq\;
\sqrt{2(E_n^{\Theta_\xi}\!+\! E_n^{\Theta_\mu})S_n}
\quad\Leftrightarrow\quad
(E_n^{\Theta_\xi}\!-\!E_n^{\Theta_\mu})^2 \;\leq\;
2(E_n^{\Theta_\xi}\!+\! E_n^{\Theta_\mu})S_n
\quad\Leftrightarrow
\eeqn\vspace{-2ex}
\beqn
\Leftrightarrow\quad
(E_n^{\Theta_\xi}\!-\!E_n^{\Theta_\mu}\!-\!S_n)^2 \;\leq\;
4E_n^{\Theta_\mu}S_n+S_n^2
\quad\Leftrightarrow\quad
E_n^{\Theta_\xi}-E_n^{\Theta_\mu}-S_n \;\leq\;
\sqrt{4E_n^{\Theta_\mu}S_n+S_n^2}
\eeqn
where we only have used $E_n^{\Theta_\xi}\geq E_n^{\Theta_\mu}$.
Nevertheless the bounds are not equal.
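The inequality chain of Theorem \ref{thErrBnd} can also be checked exactly on a small finite example. The following Python sketch uses an illustrative i.i.d.\ $\mu$ over a ternary alphabet and a two-component mixture $\xi$ containing $\mu$ with weight $1/2$, computing all expectations by exhaustive enumeration of histories:

```python
# Exhaustive finite-horizon check of the error-bound chain on an
# illustrative example (mu i.i.d. ternary, xi a two-component mixture).
import itertools
import math

X = [0, 1, 2]
mu_p = [0.5, 0.3, 0.2]            # true i.i.d. distribution mu
nu_p = [0.2, 0.3, 0.5]            # second mixture component
w = 0.5                           # w_mu = w_nu = 1/2
n = 6                             # horizon

def prob(p, xs):
    """i.i.d. probability of string xs under symbol probabilities p."""
    out = 1.0
    for x in xs:
        out *= p[x]
    return out

def xi(xs):
    return w * prob(mu_p, xs) + w * prob(nu_p, xs)

E_xi = E_mu = Q = S = 0.0
for t in range(n):
    for h in itertools.product(X, repeat=t):      # all histories x_<t
        ph = prob(mu_p, h)                        # mu-probability of x_<t
        y = mu_p                                  # mu(x_t | x_<t), i.i.d.
        z = [xi(h + (x,)) / xi(h) for x in X]     # xi(x_t | x_<t)
        pred_mu = max(X, key=lambda i: y[i])      # Theta_mu prediction
        pred_xi = max(X, key=lambda i: z[i])      # Theta_xi prediction
        E_mu += ph * (1.0 - y[pred_mu])
        E_xi += ph * (1.0 - y[pred_xi])
        Q += ph * (0.0 if pred_mu == pred_xi else 1.0)
        S += ph * sum((y[i] - z[i]) ** 2 for i in X)

print(E_xi - E_mu, math.sqrt(2 * Q * S))  # difference and its first bound
```

On this example the whole chain $0 \leq E_n^{\Theta_\xi}-E_n^{\Theta_\mu} \leq \sqrt{2Q_nS_n} \leq \sqrt{2(E_n^{\Theta_\xi}+E_n^{\Theta_\mu})S_n} \leq S_n+\sqrt{4E_n^{\Theta_\mu}S_n+S_n^2}$ can be verified numerically.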
%------------------------------%
\subsection{General Loss Function}\label{secLoss}
%------------------------------%
A prediction is very often the basis for some decision. The
decision results in an action, which itself leads to some reward
or loss. If the action itself can influence the environment we
enter the domain of acting agents which has been analyzed in the
context of universal probability \citein{Hutter:01aixi}. To stay
in the framework of (passive) prediction we have to assume that
the action itself does not influence the environment. Let
$\ell_{x_t y_t} \in \Set{R}$ be the received loss when taking
action $y_t \in \Y$ and $x_t\in\X$ is the $t^{th}$ symbol of the
sequence. We make the assumption that $\ell$ is bounded.
Without loss of generality we normalize $\ell$ by linear scaling
such that $0 \leq \ell_{x_t y_t} \leq 1$. For instance, if we make
a sequence of \idx{weather forecasts} $\X = \{$sunny, rainy$\}$
and base our decision, whether to take an umbrella or wear
sunglasses $\Y = \{$umbrella, sunglasses$\}$ on it, the action of
taking the umbrella or wearing sunglasses does not influence the
future weather (ignoring the \idx{butterfly effect}). The losses
might be
\begin{center}
\begin{tabular}{|c|c|c|}\hline
Loss & sunny & rainy \\\hline
umbrella & 0.1 & 0.3 \\\hline
sunglasses & 0.0 & 1.0 \\\hline
\end{tabular}
\end{center}
\noindent Note that a loss is assigned even when making the right
decision of taking the umbrella when it rains, because sunshine is
still preferable to rain.
In many cases the prediction of $x_t$ can be identified or is already
the action $y_t$. The forecast {\em sunny} can be identified
with the action {\em wear sunglasses}, and {\em rainy} with {\em
take umbrella}. $\X \equiv\Y$ in these cases. The error
assignment of the previous subsections falls into this class
together with a special loss function. It assigns unit loss to an
erroneous prediction ($\ell_{x_t y_t} = 1$ for $x_t \neq y_t$) and
no loss to a correct prediction ($\ell_{x_t x_t} = 0$).
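An action minimizing the expected loss under a predictive distribution $\rho$ can be sketched with the weather loss table above; the predictive probabilities in the example calls are illustrative:

```python
# Sketch: the loss-minimizing action  argmin_y sum_x rho(x|h) * loss[x][y]
# using the umbrella/sunglasses loss table from the text.

loss = {  # loss[x][y], normalized to [0, 1]
    "sunny": {"umbrella": 0.1, "sunglasses": 0.0},
    "rainy": {"umbrella": 0.3, "sunglasses": 1.0},
}

def best_action(rho):
    """rho: predictive probabilities over X; returns the loss-minimizing y."""
    actions = ["umbrella", "sunglasses"]
    return min(actions, key=lambda y: sum(rho[x] * loss[x][y] for x in rho))

print(best_action({"sunny": 0.9, "rainy": 0.1}))  # sunglasses
print(best_action({"sunny": 0.5, "rainy": 0.5}))  # umbrella
```

Under this table the umbrella becomes the optimal action as soon as the rain probability exceeds $1/8$, illustrating that the optimal action depends on the loss function, not only on the most probable symbol.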
\indxs{Bayes-optimal}{prediction} For convenience we name an
action a prediction in the following, even if $\X \neq\Y$. The
true probability of the next symbol being $x_t$, given $x_{<t}$, is
$\mu(x_t|x_{<t})$, and the $\mu$-expected loss when predicting
$y_t$ is $\sum_{x_t}\mu(x_t|x_{<t})\ell_{x_t y_t}$. More generally,
for a distribution $\rho$, the $\Lambda_\rho$ scheme predicts
$y_t^{\Lambda_\rho} :=
\arg\min_{y_t\in\Y}\sum_{x_t}\rho(x_t|x_{<t})\,\ell_{x_t y_t}$. The
instantaneous $\mu$-expected loss of $\Lambda_\rho$ at time $t$ and
the total $\mu$-expected loss in the first $n$ predictions are
\beqn
  l_t^{\Lambda_\rho}(x_{<t}) \;:=\;
  \sum_{x_t}\mu(x_t|x_{<t})\,\ell_{x_t y_t^{\Lambda_\rho}},
  \qquad
  L_n^{\Lambda_\rho} \;:=\; \sum_{t=1}^n \E_{<t}\,l_t^{\Lambda_\rho}(x_{<t}).
\eeqn
By construction, the informed scheme $\Lambda_\mu$ is optimal in
the sense that $L_n^{\Lambda_\mu} \leq L_n^\Lambda$ for every
prediction scheme $\Lambda$. Analogously to the error bounds, one
can show that the universal scheme $\Lambda_\xi$ suffers only
slightly larger loss:
\beq\label{thGLoss}
  0 \;\leq\; L_n^{\Lambda_\xi} - L_n^{\Lambda_\mu} \;\leq\;
  D_n + \sqrt{4L_n^{\Lambda_\mu}D_n + D_n^2\,},
\eeq
where $D_n \leq \ln w_\mu^{-1}$ is the relative entropy between
$\mu$ and $\xi$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Application to Games of Chance}\label{secGames}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\indxs{games}{of chance}
Consider a game of chance, defined as a sequence of bets,
observations, and rewards. In round $t$ we place a bet $y_t\in\Y$,
depending on the history $x_{<t}$, then observe the outcome
$x_t\in\X$, and finally receive a (possibly negative) profit
$p_{x_t y_t}\in[p_{min},p_{max}]$. Identifying the loss with the
rescaled negative profit, $\ell_{x_t y_t} :=
(p_{max}\!-\!p_{x_t y_t})/p_\Delta \in [0,1]$ with profit range
$p_\Delta := p_{max}\!-\!p_{min}$, the total $\mu$-expected profit
of a scheme $\Lambda$ after $n$ rounds is $P_n^\Lambda :=
n\,p_{max} - p_\Delta L_n^\Lambda$. The game is profitable for the
informed player only if $P_n^{\Lambda_\mu} > 0$
(otherwise there is no winning strategy at all, since
$P_n^{\Lambda_\mu} \geq P_n^\Lambda\,\forall\Lambda$). Often we
are not in the favorable position of knowing $\mu$, but we know
(or assume) that $\mu\in\M$ for some $\M$, for instance that $\mu$
is a computable probability distribution. From bound
(\ref{thGLoss}) we see that the\indxs{profit}{average} average
profit per round $\bar p_n^{\Lambda_\xi} := {1\over
n}P_n^{\Lambda_\xi}$ of the universal $\Lambda_\xi$ scheme
converges to the average profit per round $\bar p_n^{\Lambda_\mu}
:= {1\over n}P_n^{\Lambda_\mu}$ of the optimal informed scheme,
i.e.\ asymptotically we can make the same money even without
knowing $\mu$, by just using the universal $\Lambda_\xi$ scheme.
Bound (\ref{thGLoss}) allows us to lower bound the universal
profit $P_n^{\Lambda_\xi}$
\beq\label{pbnd}
P_n^{\Lambda_\xi} \;\geq\; P_n^{\Lambda_\mu} -
p_\Delta D_n-\sqrt{4(n p_{max}\!-\!P_n^{\Lambda_\mu})
p_\Delta D_n+p_\Delta^2D_n^2}.
\eeq
\indxs{maximal}{profit}The time needed for $\Lambda_\xi$ to
perform well can also be estimated. An interesting quantity is the
expected number of rounds needed to reach the winning zone. Using
$P_n^{\Lambda_\mu} > 0$ one can show that the r.h.s.\ of
(\ref{pbnd}) is positive if, and only if
\beq\label{pwin}
n \;>\; {2p_\Delta(2p_{max}\!-\!\bar p_n^{\Lambda_\mu}) \over
(\bar p_n^{\Lambda_\mu})^2} \!\cdot\! D_n.
\eeq
%------------------------------%
\ftheorem{thWin}{Time to Win}{
%------------------------------%
\index{time!to win}%
Let there be sequences $x_1x_2...$ over a finite alphabet
$\X$ drawn with probability $\mu(x_{1:n})$ for the first
$n$ symbols. In step $t$ we make a bet, depending on the history
$x_{<t}$, on the $t^{th}$ symbol $x_t$ of the sequence, and receive
a profit $p_t\in[p_{min},p_{max}]$ with range
$p_\Delta=p_{max}\!-\!p_{min}$. Let $\bar p_n^{\Lambda_\xi}$ and
$\bar p_n^{\Lambda_\mu}$ be the average expected profits per round
of the universal scheme $\Lambda_\xi$ and of the optimal informed
scheme $\Lambda_\mu$. Then
\beqn
\begin{array}{l}
(i)\quad\;\; \bar p_n^{\Lambda_\mu} - \bar p_n^{\Lambda_\xi}
\;\leq\; \sqrt{4p_\Delta p_{max}b_\mu/n\,} + O(1/n)
\\[2mm]
(ii)\quad\, n \;>\; \Big({2p_\Delta\over
\bar p_n^{\Lambda_\mu}}\Big)^{\!2}
\!\cdot\!b_\mu \quad\wedge\quad \bar p_n^{\Lambda_\mu}>0
\quad\Longrightarrow\quad \bar p_n^{\Lambda_\xi}>0
\end{array}
\eeqn
where $b_\mu=\ln w_\mu^{-1}$ with $w_\mu$ being the weight
(\ref{xidefsp}) of $\mu$ in $\xi$ in the discrete case (and $b_\mu$
as in Theorem \ref{thCEB} in the continuous case).
}%------------------------------%
\noindent By dividing (\ref{pbnd}) by $n$ and using $D_n \leq b_\mu$
(\ref{entropy}) we see that the leading order of
$\bar p_n^{\Lambda_\xi} - \bar p_n^{\Lambda_\mu}$ is bounded by
$\sqrt{4p_\Delta p_{max}b_\mu/n}$, which proves $(i)$. The
condition in $(ii)$ is actually a weakening of (\ref{pwin}).
$P_n^{\Lambda_\xi}$ is trivially positive for $p_{min} > 0$, since
in this wonderful case {\em all} profits are positive. For
negative $p_{min}$ the condition of $(ii)$ implies
(\ref{pwin}), since $p_\Delta > p_{max}$, and (\ref{pwin}) implies
positive (\ref{pbnd}), i.e.\ $P_n^{\Lambda_\xi} > 0$, which proves
$(ii)$.
If a winning strategy $\Lambda$ with $\bar p_n^\Lambda > \eps >0$
exists, then $\Lambda_\xi$ is asymptotically also a winning
strategy with the same average profit.
%------------------------------%
\subsection{Example}\label{secGCEx}
%------------------------------%
\indxs{dice}{example}
Let us consider a game with two dice, one with two black and
four white faces, the other with four black and two white faces.
The dealer who repeatedly throws the dice uses one or the other
die according to some deterministic rule, which correlates the
throws (e.g.\ the first die could be used in round $t$ iff the $t^{th}$
digit of $\pi$ is 7). We can bet on black or white;
the stake $s$ is 3\$ in every round; our return $r$ is 5\$ for
every correct prediction.
The profit is $p_t = r\delta_{x_ty_t} - s$. The coloring of the
dice and the selection strategy of the dealer unambiguously
determine $\mu$. $\mu(x_t|x_{<t})$ equals ${2\over 3}$ if $x_t$
is the majority color of the die selected in round $t$, and
${1\over 3}$ otherwise, so the informed scheme $\Lambda_\mu$ bets
on the more probable color and achieves an average expected
profit of $\bar p_n^{\Lambda_\mu}={2\over 3}r-s={1\over 3}\$>0$.
If we don't know $\mu$ we should use
Levin's universal prior with $D_n \leq b_\mu = K(\mu) \cdot \ln
2$, where $K(\mu)$ is the length of the shortest program coding
$\mu$ (see Section \ref{secPCM}). Then we know that betting on
the outcome with higher $\xi$ probability leads asymptotically to
the same profit (Theorem \ref{thWin}$(i)$) and $\Lambda_\xi$
reaches the \indxs{winning}{threshold} winning threshold no later
than $n_{thresh} = 900\ln 2 \cdot K(\mu)$ (Theorem
\ref{thWin}$(ii)$) or sharper $n_{thresh} = 330\ln 2 \cdot K(\mu)$
from (\ref{pwin}), where $p_{max} = r-s = 2\$ $ and $p_\Delta = r
= 5\$ $ have been used.
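The two threshold prefactors can be checked by direct computation. The following snippet is our own illustrative sketch, in exact rational arithmetic; it uses the informed average profit ${2\over 3}r-s={1\over 3}\$ $ per round:

```python
from fractions import Fraction

s, r = Fraction(3), Fraction(5)   # stake and return in $
p_max = r - s                     # best profit per round: +2 $
p_min = -s                        # worst profit per round: -3 $
p_delta = p_max - p_min           # profit range p_Delta = 5 $
p_bar = Fraction(2, 3) * r - s    # informed average profit: 1/3 $ per round

# Theorem (Time to Win)(ii): n > (2 p_Delta / p_bar)^2 * b_mu
coeff_ii = (2 * p_delta / p_bar) ** 2
# Sharper threshold: n > 2 p_Delta (2 p_max - p_bar) / p_bar^2 * D_n
coeff_pwin = 2 * p_delta * (2 * p_max - p_bar) / p_bar ** 2

print(coeff_ii, coeff_pwin)   # 900 330
```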
\index{strategy!die selection}
If the die selection strategy reflected in $\mu$ is not too
complicated, the $\Lambda_\xi$ prediction system reaches the
winning zone after a few thousand rounds. The number of rounds is
not really small because the expected profit per round is one
order of magnitude smaller than the return. This leads to a
constant of two orders of magnitude size in front of $K(\mu)$.
Stated otherwise, it is due to the large stochastic noise, which
makes it difficult to extract the signal, i.e.\ the structure of
the rule $\mu$ (see next subsection). Furthermore, this is only a
bound for the turnaround value of $n_{thresh}$. The true expected
turnaround $n$ might be smaller. However, for every game for which
there exists a computable winning strategy with $\bar
p_n^\Lambda >\eps > 0$, $\Lambda_\xi$ is guaranteed to get
into the winning zone for some $n \sim K(\mu)$.
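The learning behavior of $\Lambda_\xi$ can also be seen in a toy Monte-Carlo simulation. The sketch below is our own illustration and replaces Levin's prior over all enumerable semimeasures by a uniform prior over four hand-picked die-selection rules; the Bayes-mixture bettor then approaches the informed profit of ${1\over 3}\$ $ per round:

```python
import random

random.seed(0)
r, s = 5, 3                        # return and stake in $
BLACK, WHITE = 0, 1
p_black = {0: 2/6, 1: 4/6}         # die 0: 2 black faces; die 1: 4 black faces

# Hypothetical class of deterministic die-selection rules t -> die index.
rules = [lambda t: 0, lambda t: 1, lambda t: t % 2, lambda t: (t % 3 == 0) * 1]
true_rule = rules[2]               # environment: the dealer alternates the dice

w = [1.0 / len(rules)] * len(rules)  # uniform prior over the rule class
profit_xi = profit_mu = 0
n = 10000
for t in range(n):
    # xi-probability of black: posterior-weighted average over the rules
    xi_black = sum(wi * p_black[rule(t)] for wi, rule in zip(w, rules))
    bet_xi = BLACK if xi_black > 0.5 else WHITE
    bet_mu = BLACK if p_black[true_rule(t)] > 0.5 else WHITE
    x = BLACK if random.random() < p_black[true_rule(t)] else WHITE
    profit_xi += r * (bet_xi == x) - s
    profit_mu += r * (bet_mu == x) - s
    # Bayesian update of the weights with the observed color x
    like = [p_black[rule(t)] if x == BLACK else 1 - p_black[rule(t)]
            for rule in rules]
    z = sum(wi * li for wi, li in zip(w, like))
    w = [wi * li / z for wi, li in zip(w, like)]

print(profit_mu / n, profit_xi / n)  # both close to the informed 1/3 $ per round
```

The posterior concentrates on the true selection rule after a few informative rounds, after which both bettors place the same bets.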
%------------------------------%
\subsection{Information-Theoretic Interpretation}\label{secGCITI}
%------------------------------%
We try to give an intuitive explanation of Theorem
\ref{thWin}$(ii)$. We know that $\xi(x_t|x_{<t})$ converges to
$\mu(x_t|x_{<t})$, i.e.\ $\Lambda_\xi$ gradually learns the
structure of $\mu$ from the past data $x_{<t}$. The complexity
$b_\mu=\ln w_\mu^{-1}$ measures the information needed to specify
$\mu$, whereas each round reveals only a profit signal of
relative size $\bar p_n^{\Lambda_\mu}/p_\Delta$ buried in noise,
so of order $(p_\Delta/\bar p_n^{\Lambda_\mu})^2$ rounds are
needed to extract one unit of information about $\mu$. This
renders the threshold of Theorem \ref{thWin}$(ii)$ plausible.
%------------------------------%
\subsection{Lower Error Bound}\label{secLEB}
%------------------------------%
The error bounds derived earlier are sharp in leading order.
Consider a binary alphabet and the class of environments
$\mu_a$ indexed by $a\in\{0,1\}^n$, in which bit $x_t$ equals
$a_t$ with probability $\odt(1\!+\!\eps)$, independently for each
$t\leq n$. If $\mu =\mu_a$, the informed scheme
$\Theta_\mu$ always predicts the bit which has highest $\mu$-probability, i.e.\ $y_t^{\Theta_\mu} =a_t$
\beqn
\Longrightarrow\quad e_t^{\Theta_\mu} = 1-\mu_{a_t}(y_t^{\Theta_\mu})=\odt(1-\eps)
\quad\Longrightarrow\quad E_n^{\Theta_\mu} = \textstyle{n\over
2}(1-\eps).
\eeqn
Since $E_n^{\Theta_\mu}$ is the same for all $a$, we seek an $a$
that maximizes $E_n^\Theta$ for a given predictor $\Theta$ in the
following. Assume $\Theta$ predicts $y_t^\Theta$ (independent of
the history $x_{<t}$ for simplicity). Averaging the expected
number of errors uniformly over all $a\in\{0,1\}^n$, denoted by
$\langle\cdot\rangle_a$, we get
\beqn
\langle E_n^\Theta\rangle_a
\;=\; \Big\langle\sum_{t=1}^n \E[e_t^\Theta(x_{<t})]\Big\rangle_{\!a}
\;=\; \sum_{t=1}^n \E[\langle e_t^\Theta(x_{<t})\rangle_a]
\;=\; \sum_{t=1}^n \E[\odt] \;=\; \odt n
\eeqn
whatever $\Theta$ is chosen: a sort of No-Free-Lunch theorem
\cite{Wolpert:97}, stating that on {\em uniform} average all
predictors perform equally well/bad. The expectation of
$E_n^\Theta$ w.r.t.\ $a$ can only be $\odt n$ if
$E_n^\Theta\geq\odt n$ for some $a$. Fixing such an $a$ and
choosing $\mu=\mu_a$ we get $E_n^\Theta -E_n^{\Theta_\mu} \geq
\odt n\eps = \odt[S_n + \sqrt{4E_n^{\Theta_\mu}S_n+S_n^2}]$,
and similarly $e_n^\Theta-e_n^{\Theta_\mu} \geq \odt\eps =
\odt\sqrt{2s_t(x_{<t})}$, where $s_t=\odt\eps^2$ is the squared
distance (\ref{stmuxi}) between $\mu_a$ and the uniform $\xi$.
This coincides up to a factor of $2$ with the upper bounds, which
are hence sharp in leading order.

A margin condition leads to a stronger conclusion: assume
$|\mu(0|x_{<t})-\odt|>\eps$ for all $t$ and $x$. $\Theta_\mu$
predicts $0$ if $\mu(0|x_{<t})>\odt$. If also
$\xi(0|x_{<t})>\odt$, then $\Theta_\xi$ makes the same prediction
as $\Theta_\mu$, and indeed
$\xi(0|x_{<t})\to\mu(0|x_{<t})>\odt+\eps$.
Conversely, for $\mu(0|x_{<t})<\odt-\eps$, both $\Theta_\mu$ and
(eventually) $\Theta_\xi$ predict $1$. Hence, under the margin
condition, $\Theta_\xi$ makes asymptotically the same predictions
as $\Theta_\mu$.
%------------------------------%
\subsection{Pareto Optimality of $\xi$}\label{secPareto}
%------------------------------%
\indxs{Pareto}{optimality}%
At first sight the universal prior $\xi$ does not seem
distinguished: for any performance measure $\F(\mu,\rho)$ there
may be a $\rho$ with $\F(\nu,\rho)<\F(\nu,\xi)$ for some
$\nu\in\M$. Since we do not know $\mu$ in advance we may ask whether
there exists a $\rho$ with better or equal performance for {\em
all} $\nu \in\M$ and a strictly better performance for one
$\nu \in\M$. This would clearly render $\xi$ suboptimal w.r.t.\
to $\F$. We show that there is no such $\rho$ for most performance
measures studied in this work.
%------------------------------%
\fdefinition{defPareto}{Pareto Optimality}{
%------------------------------%
Let $\F(\mu,\rho)$ be any performance measure of $\rho$ relative to
$\mu$. The universal prior $\xi$ is called Pareto-optimal w.r.t.\
$\F$ if there is no $\rho$ with $\F(\nu,\rho)\leq\F(\nu,\xi)$
for all $\nu \in \M$ and strict inequality for at least one
$\nu$.
}%------------------------------%
%------------------------------%
\ftheorem{thPareto}{Pareto Optimality}{
%------------------------------%
The universal prior $\xi$ is Pareto-optimal w.r.t.\ the
instantaneous and total squared distances $s_t$ and $S_n$
(\ref{stmuxi}), entropy distances $d_t$ and $D_n$ (\ref{dtmuxi}),
errors $e_t$ and $E_n$ (\ref{rhoerr}), and losses $l_t$ and $L_n$
(\ref{rholoss}).
}%------------------------------%
\paradot{Proof} We first prove Theorem \ref{thPareto} for the
instantaneous expected loss $l_t$. We need the more general
$\rho$-expected instantaneous losses
\beq\label{mxexloss}
l_{t\rho}^{\Lambda}(x_{<t}) \;:=\;
\sum_{x_t\in\X}\rho(x_t|x_{<t})\,\ell_{x_ty_t^\Lambda}
\eeq
and the posterior weights
$w_\nu(x_{<t}):=w_\nu\,\nu(x_{<t})/\xi(x_{<t})$, which satisfy
$\sum_\nu w_\nu(x_{<t})=1$ and $w_\nu(x_{<t})>0$ iff
$w_\nu\,\nu(x_{<t})>0$. Assume now that some $\Lambda$ satisfies
$l_{t\nu}^{\Lambda}\leq l_{t\nu}^{\Lambda_\xi}$ for all
$\nu\in\M$, with strict inequality for at least one $\nu$ with
$w_\nu(x_{<t})>0$. Then
\beqn
l_{t\xi}^{\Lambda} \;=\;
\sum_\nu w_\nu(x_{<t})\,l_{t\nu}^{\Lambda} \;<\;
\sum_\nu w_\nu(x_{<t})\,l_{t\nu}^{\Lambda_\xi} \;=\;
l_{t\xi}^{\Lambda_\xi} \;\leq\; l_{t\xi}^{\Lambda}
\eeqn
where the strict inequality uses
$w_\nu(x_{<t})>0$. The last inequality follows from the fact that
$\Lambda_\xi$ minimizes by definition (\ref{xlrdef}) the
$\xi$-expected loss (similarly to (\ref{Lmuopt})). The
contradiction $l_{t\xi}^{\Lambda} < l_{t\xi}^{\Lambda}$ proves
Pareto-optimality of $\xi$ w.r.t.\ $l_t$.
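A minimal one-step instance of this argument can be checked numerically. The class, weights, and losses below are our own toy choices; an exhaustive search confirms that no randomized scheme Pareto-dominates $\Lambda_\xi$:

```python
# One-step illustration with a hypothetical two-element class M = {nu1, nu2}
# and 0/1 loss; all numbers below are our own toy choices.
nu = [0.8, 0.3]   # probability of symbol 0 under each environment
w = [0.5, 0.5]    # prior weights
xi0 = sum(wi * pi for wi, pi in zip(w, nu))   # xi-probability of 0 (= 0.55)

def loss(q, p):
    # expected 0/1 loss of the randomized scheme "predict 0 with prob. q"
    # in an environment with P(0) = p
    return q * (1 - p) + (1 - q) * p

q_xi = 1.0 if xi0 > 0.5 else 0.0      # Lambda_xi predicts the xi-probable 0
l_xi = [loss(q_xi, p) for p in nu]    # losses [0.2, 0.7] in the two environments

# No scheme q is at least as good in both environments and better in one:
for q in (i / 1000 for i in range(1001)):
    l = [loss(q, p) for p in nu]
    better_somewhere = any(a < b for a, b in zip(l, l_xi))
    worse_nowhere = all(a <= b for a, b in zip(l, l_xi))
    assert not (better_somewhere and worse_nowhere)
```

Any scheme that improves the loss in the second environment necessarily worsens it in the first, exactly as the proof predicts.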
In the same way we can prove Pareto-optimality of $\xi$ w.r.t.\
the total loss $L_n$ by defining the $\rho$-expected
total losses
\beqn\label{mxexloss2}
L_{n\rho}^{\Lambda}\;:=\;\sum_{t=1}^n\sum_{x_{<t}}
\rho(x_{<t})\,l_{t\rho}^{\Lambda}(x_{<t})
\eeqn
and proceeding as above; the proofs for the squared distances
$s_t$, $S_n$, the entropy distances $d_t$, $D_n$, and the errors
$e_t$, $E_n$ are analogous.
%------------------------------%
\subsection{On the Optimal Choice of Weights}\label{secOW}
%------------------------------%
Pareto-optimality is a rather weak criterion: every mixture
$\xi_v$ with weights $v_\nu>0$ $\forall\nu$ is Pareto-optimal and
leads asymptotically to optimal predictions.
We have derived bounds for the mean squared sum
$S_{n\nu}^{\xi_w}\leq\ln w_\nu^{-1}$ and for the loss regret
$L_{n\nu}^{\Lambda_{\xi_w}}-L_{n\nu}^{\Lambda_\nu}\leq 2\,\ln
w_\nu^{-1}+2\sqrt{\ln w_\nu^{-1}L_{n\nu}^{\Lambda_\nu}}$.
All bounds monotonically decrease with increasing $w_\nu$. So it
is desirable to assign high weights to all $\nu\in\M$. Due to the
(semi)probability constraint $\sum_\nu w_\nu\leq 1$ one has to
find a compromise.%
\footnote{All results in this paper have been stated and proven
for probability measures $\mu$, $\xi$ and $w_\nu$, i.e.\
$\sum_{x_{1:t}}\xi(x_{1:t}) = \sum_{x_{1:t}}\mu(x_{1:t}) =
\sum_\nu w_\nu = 1$. On the other hand, the class $\M$ considered
here is the class of all \indxs{enumerable}{semimeasure}
enumerable semimeasures and $\sum_\nu
w_\nu<1$. In general, each of the following 4 items could be semi
($<$) or not ($=$): ($\xi$, $\mu$, $\M$, $w_\nu$), where $\M$ is
semi if some elements are semi. Six out of the $2^4$ combinations
make sense. Convergence (Theorem \ref{thConv}), the error bound (Theorem
\ref{thErrBnd}), the loss bound (\ref{th3}), as well as most other
statements hold for $(<,=,<,<)$, but not for $(<,<,<,<)$.
Nevertheless, $\xi\to\mu$ holds also for $(<,<,<,<)$ with maximal
$\mu$ \idx{semi-probability}, i.e.\ fails with
$\mu$ semi-probability 0.}%
\indxs{enumerable}{weights}
In the following we argue that within the class of enumerable
weight functions with short programs, namely ${\cal
V}:=\{v_{(\cdot)}:\M\to \Set{R}^+$ with $\sum_\nu v_\nu\leq 1$
and $K(v)=O(1)\}$, the choice $w_\nu:=2^{-K(\nu)}$ is an optimal
compromise. So let $v_{(\cdot)}\in\cal V$ be arbitrary.
%
Corollary 4.3.1 of \tcite[p255]{Li:97} says that $K(x)\leq -\log_2
P(x) + K(P) + O(1)$ for all $x$ if $P$ is an enumerable discrete
semimeasure. Identifying $P$ with $v$ and $x$ with (the program
index describing) $\nu$ we get
\beqn
\ln w_\nu^{-1} \;\leq \ln v_\nu^{-1} + O(1).
\eeqn
This means that the bounds for $\xi_w$ depending on $\ln
w_\nu^{-1}$ are at most $O(1)$ larger than the bounds for $\xi_v$
depending on $\ln v_\nu^{-1}$. So we lose at most an additive
constant of order one in the bounds when using $\xi_w$ instead of
$\xi_v$. In using $\xi_w$ we are on the safe
side, getting (within $O(1)$) best bounds for {\em all}
environments.
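The coding-theoretic content of this argument can be illustrated in a computable toy setting, with Shannon-Fano code lengths $\lceil-\log_2 v_\nu\rceil$ standing in for the uncomputable complexity $K$ (our own simplification):

```python
from math import ceil, log, log2

# A toy weight function v in V (numbers are our own choice)
v = [0.4, 0.3, 0.2, 0.05]
assert sum(v) <= 1                          # (semi)probability constraint

code_len = [ceil(-log2(x)) for x in v]      # Shannon-Fano code length per nu
assert sum(2 ** -l for l in code_len) <= 1  # Kraft inequality: prefix code exists
w = [2 ** -l for l in code_len]             # length-based weights 2^(-code length)

for wi, vi in zip(w, v):
    # ln w^-1 <= ln v^-1 + O(1), with additive constant ln 2 here
    assert log(1 / wi) <= log(1 / vi) + log(2) + 1e-12
```

The length-based weights lose at most one bit against every $v$, mirroring how $2^{-K(\nu)}$ loses at most an additive constant against every enumerable weight function.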
%------------------------------%
\ftheorem{thSolOpt}{Optimality of universal weights}{
%------------------------------%
\indxs{optimal}{weights}
Within the set $\cal V$ of enumerable weight functions with short
program, the universal weights $w_\nu=2^{-K(\nu)}$ lead to the
smallest loss bounds within an additive (to $\ln w_\mu^{-1}$)
constant in all enumerable environments.
}%------------------------------%
\noindent Since the above justifies the use of $\xi_w$, and
$\xi_w$ assigns high probability to an environment if and only if
it has low (Kolmogorov) complexity, one may interpret the result
as a justification of Occam's razor.$\!$\footnote{The {\em only
if} direction can be shown by an easier and more direct
argument \cite{Schmidhuber:02gtm}.} But
note that this is more of a \idx{bootstrap} argument, since we
implicitly used Occam's razor to justify the restriction to
enumerable semimeasures. We also considered only weight functions
$v$ with low complexity $K(v)=O(1)$. What did not enter as an
assumption but came out as a result is that the specific universal
weights $w_\nu=2^{-K(\nu)}$ are optimal.
On the other hand, this choice for $w_\nu$ is not unique (not
even within a constant factor). For instance, every enumerable
$v$ with $v_\nu>0\,\forall\nu$ leads within a multiplicative constant to
the same universal distribution, but this constant is not
necessarily of ``acceptable'' size. Details will be presented
elsewhere.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Miscellaneous}\label{secMisc}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
This section discusses miscellaneous topics.
%
Section \ref{secCPC} generalizes the setup to continuous
probability classes $\M = \{\mu_\theta\}$ consisting of continuously
parameterized distributions $\mu_\theta$ with parameter
$\theta \in \Set{R}^d$. Under certain smoothness and regularity
conditions a bound for the relative entropy between $\mu$ and
$\xi$, which is central for all presented results, can still be
derived. The bound depends on the Fisher information of $\mu$ and
grows only logarithmically with $n$, the intuitive reason being
the necessity to describe $\theta$ to an accuracy $O(n^{-1/2})$.
%
Section \ref{secFApp} describes two ways of using the prediction
schemes for partial sequence prediction, where not every symbol
needs to be predicted. Performing and predicting a sequence of
independent experiments and online learning of classification
tasks are special cases.
%
In Section \ref{secWM} we compare the universal prediction scheme
studied here to the popular predictors based on expert advice
(PEA)
\cite{Littlestone:89,Vovk:92,Littlestone:94,Cesa:97,Haussler:98,Kivinen:99}.
Although the algorithms, the settings, and the proofs are quite
different, the PEA bounds and our error bound have the same
structure.
%
Finally, in Section \ref{secOut} we outline possible extensions of
the presented theory and results, including infinite alphabets,
delayed and probabilistic prediction, active systems influencing
the environment, learning aspects, and a unification with PEA.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Continuous Probability Classes $\M$}\label{secCPC}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\indxs{probability class}{continuous}
%------------------------------%
%\subsection{Motivation and Definition}
%------------------------------%
\indxs{hypothesis class}{continuous}
\indxs{Bernoulli}{process}
\indxs{i.i.d.}{process}
We have considered thus far countable probability classes
$\M$, which makes sense from a computational point of view as
emphasized in Section \ref{secPCM}. On the other hand in
statistical parameter estimation one often has a continuous
hypothesis class (e.g.\ a Bernoulli($\theta$) process with unknown
$\theta \in[0,1]$). Let
\beqn
\M \;:=\; \{\mu_\theta:\theta\in\Theta\subseteq \Set{R}^d\}
\eeqn
be a family of probability distributions parameterized by a
$d$-dimensional continuous parameter $\theta$. Let $\mu \equiv
\mu_{\theta_0} \in \M$ be the true generating distribution and
$\theta_0$ be in the interior of the compact set $\Theta$. We may
restrict $\M$ to a countable dense subset, like $\{\mu_\theta\}$
with computable (or rational) $\theta$. If $\theta_0$ is itself a
computable real (or rational) vector then Theorem \ref{thConv} and bound
(\ref{thULoss}) apply. From a practical point of view the assumption
of a computable $\theta_0$ is not so serious. It is more from a
traditional analysis point of view that one would like quantities
and results depending smoothly on $\theta$ and not in a weird
fashion depending on the computational complexity of $\theta$.
\indxs{weights}{continuous} For instance, the weight $w(\theta)$
is often a continuous probability density
\beq\label{xidefc}
\xi(x_{1:n}) \;:=\; \int_\Theta \! d\theta\,
w(\theta)\!\cdot\!\mu_\theta(x_{1:n}), \qquad
\int_\Theta \! d\theta\,
w(\theta) \;=\; 1, \qquad
w(\theta) \;\geq\; 0.
\eeq
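For a Bernoulli class with uniform prior $w(\theta)\equiv 1$ the integral (\ref{xidefc}) has a closed form, $\xi(x_{1:n})=k!\,(n-k)!/(n+1)!$ for a sequence with $k$ ones, so the logarithmic growth of the relative entropy $D_n$ can be checked numerically (a sketch with our own function names; $\theta_0=0.3$ is an arbitrary choice):

```python
from math import comb, lgamma, log

def D(n, theta):
    # relative entropy between Bernoulli(theta) and the uniform mixture xi
    total = 0.0
    for k in range(n + 1):                 # k = number of ones in x_{1:n}
        p = comb(n, k) * theta ** k * (1 - theta) ** (n - k)
        log_mu = k * log(theta) + (n - k) * log(1 - theta)
        # xi(x_{1:n}) = Int theta^k (1-theta)^(n-k) dtheta = k!(n-k)!/(n+1)!
        log_xi = lgamma(k + 1) + lgamma(n - k + 1) - lgamma(n + 2)
        total += p * (log_mu - log_xi)
    return total

theta0 = 0.3
for n in (10, 100, 1000):
    print(n, D(n, theta0), 0.5 * log(n))   # D_n grows like (1/2) ln n + const
```

The increments of $D_n$ closely track $\odt\ln n$ ($d=1$ here), matching the intuition that $\theta$ must be described to accuracy $O(n^{-1/2})$.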
%
%------------------------------%
%\subsection{Smoothness and Fisher Information}
%------------------------------%
\noindent The most important property of $\xi$ used in this work was
$\xi(x_{1:n}) \geq w_\nu \cdot \nu(x_{1:n})$
which has been obtained from (\ref{xidefsp}) by dropping the sum
over $\nu$. The analogous construction here is to restrict the
integral over $\Theta$ to a small vicinity $N_\delta$ of $\theta$.
For sufficiently smooth $\mu_\theta$ and $w(\theta)$ we expect
$\xi(x_{1:n}) \approxgeq |N_{\delta_n}| \cdot
w(\theta) \cdot \mu_\theta(x_{1:n})$, where
$|N_{\delta_n}|$ is the volume of $N_{\delta_n}$. This in turn
leads to $D_n \approxleq
\ln{w_\mu^{-1}}+\ln|N_{\delta_n}|^{-1}$, where
$w_\mu:=w(\theta_0)$. $N_{\delta_n}$ should be the largest possible
region in which $\ln\mu_\theta$ is approximately flat on average.
The averaged instantaneous, mean, and total curvature
matrices of $\ln\mu$ are\index{curvature matrix}
\bqan\label{fisher}
j_t(x_{