Optimality of Universal Bayesian Sequence Prediction
Author: Marcus Hutter (2001)
Comments: 15 LaTeX pages
Subj-class: Artificial Intelligence; Learning
ACM-class: I.2; I.2.6; I.2.8; F.1.3; F.2
Paper: PostScript (420kb) - PDF (293kb) - Html/Gif
Slides: PostScript - PDF
Remark: Technical Report
Keywords: Bayesian sequence prediction; mixture distributions; Solomonoff
induction; Kolmogorov complexity; learning; universal probability;
tight loss and error bounds; Pareto-optimality.
Abstract: Various optimality properties of universal sequence prediction
based on Bayes-mixtures in general, and Solomonoff's prediction
scheme in particular, are studied. The probability of observing
$x_t$ at time $t$, given past observations $x_1...x_{t-1}$, can be
computed with Bayes' rule if the true generating distribution
$\mu$ of the sequences $x_1x_2x_3...$ is known. If $\mu$ is
unknown, but known to belong to a class $\mathcal{M}$, one can base one's
prediction on the Bayes-mix $\xi$ defined as the $w_\nu$-weighted
sum of the distributions $\nu\in\mathcal{M}$. The number of
prediction errors $E_\xi$ made by the optimal universal prediction
scheme based on $\xi$ minus the number of errors $E_\mu$ of the
optimal informed prediction scheme based on $\mu$ is bounded by
$O(\sqrt{E_\mu})$. We show that this bound is tight and that no
other predictor can lead to smaller bounds. Furthermore, for
various performance measures we show Pareto-optimality of $\xi$ in
the sense that there is no other predictor which performs at least
as well in all environments $\nu\in\mathcal{M}$ and strictly better in at
least one. Hence, optimal predictors can (w.r.t.\ most performance
measures in expectation) be based on the mixture $\xi$. Finally, we
give an Occam's razor argument that Solomonoff's choice $w_\nu\sim
2^{-K(\nu)}$ for the weights is optimal, where $K(\nu)$ is the
length of the shortest program describing $\nu$.
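As an illustration of the Bayes-mix predictor described above, the following
is a minimal sketch (not taken from the paper). It assumes a finite class
$\mathcal{M}$ of Bernoulli environments with hypothetical parameters and prior
weights, and compares the cumulative 0-1 errors $E_\xi$ of the mixture-based
scheme with the errors $E_\mu$ of the informed scheme.

import numpy as np

# Minimal sketch (not from the paper): Bayes-mixture prediction over a
# hypothetical finite class M of Bernoulli(theta) environments.
thetas = np.array([0.2, 0.5, 0.8])   # class M (illustrative choice)
w = np.array([0.25, 0.50, 0.25])     # prior weights w_nu, summing to 1

rng = np.random.default_rng(0)
mu = 0.8                             # true environment mu (a member of M)
xs = rng.random(200) < mu            # observed sequence x_1 x_2 ... x_200

errors_xi = errors_mu = 0
for x in xs:
    # Bayes-mix predictive probability: xi(1|x_<t) = sum_nu w_nu(x_<t) nu(1|x_<t)
    p_xi = float(np.dot(w, thetas))
    # Optimal predictions under 0-1 loss for the universal and informed schemes
    errors_xi += int((p_xi >= 0.5) != bool(x))
    errors_mu += int((mu >= 0.5) != bool(x))
    # Posterior weight update by Bayes' rule after observing x
    w = w * (thetas if x else 1.0 - thetas)
    w = w / w.sum()

print(f"E_xi = {errors_xi}, E_mu = {errors_mu}, excess = {errors_xi - errors_mu}")

In such runs the excess $E_\xi - E_\mu$ stays small, consistent with the
$O(\sqrt{E_\mu})$ bound stated in the abstract; with Solomonoff's choice the
prior weights would instead be $w_\nu\sim 2^{-K(\nu)}$.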
Table of Contents
- Introduction
- Setup and Convergence
- Error & Loss Bounds
- Lower Error Bound
- Pareto Optimality of ξ
- On the Optimal Choice of Weights
- Conclusions
BibTeX Entry
@TechReport{Hutter:02splower,
author = "Marcus Hutter",
number = "IDSIA-09-01",
institution = "Istituto Dalle Molle di Studi sull'Intelligenza Artificiale (IDSIA)",
title = "Optimality of Universal {Bayesian} Sequence Prediction",
year = "2002",
pages = "15",
address = "Manno (Lugano), CH",
abstract = "Various optimality properties of universal sequence prediction
based on Bayes-mixtures in general, and Solomonoff's prediction
scheme in particular, are studied. The probability of observing
$x_t$ at time $t$, given past observations $x_1...x_{t-1}$, can be
computed with Bayes' rule if the true generating distribution
$\mu$ of the sequences $x_1x_2x_3...$ is known. If $\mu$ is
unknown, but known to belong to a class $\mathcal{M}$, one can base one's
prediction on the Bayes-mix $\xi$ defined as the $w_\nu$-weighted
sum of the distributions $\nu\in\mathcal{M}$. The number of
prediction errors $E_\xi$ made by the optimal universal prediction
scheme based on $\xi$ minus the number of errors $E_\mu$ of the
optimal informed prediction scheme based on $\mu$ is bounded by
$O(\sqrt{E_\mu})$. We show that this bound is tight and that no
other predictor can lead to smaller bounds. Furthermore, for
various performance measures we show Pareto-optimality of $\xi$ in
the sense that there is no other predictor which performs at least
as well in all environments $\nu\in\mathcal{M}$ and strictly better in at
least one. Hence, optimal predictors can (w.r.t.\ most performance
measures in expectation) be based on the mixture $\xi$. Finally, we
give an Occam's razor argument that Solomonoff's choice $w_\nu\sim
2^{-K(\nu)}$ for the weights is optimal, where $K(\nu)$ is the
length of the shortest program describing $\nu$.",
url = "http://www.hutter1.net/ai/splower.htm",
}