Optimality of Universal Bayesian Sequence Prediction for General Loss and Alphabet
Keywords: Bayesian sequence prediction; mixture distributions; Solomonoff
induction; Kolmogorov complexity; learning; universal probability;
tight loss and error bounds; Pareto-optimality; games of chance;
classification.
Abstract: Various optimality properties of universal sequence predictors
based on Bayes-mixtures in general, and Solomonoff's prediction
scheme in particular, will be studied. The probability of
observing $x_t$ at time $t$, given past observations
$x_1...x_{t-1}$, can be computed with the chain rule if the true
generating distribution $\mu$ of the sequences $x_1x_2x_3...$ is
known. If $\mu$ is unknown but known to belong to a countable or
continuous class $\M$, one can base one's prediction on the
Bayes-mixture $\xi$ defined as a $w_\nu$-weighted sum or integral
of distributions $\nu\in\M$. The cumulative expected loss of the
Bayes-optimal universal prediction scheme based on $\xi$ is shown
to be close to the loss of the Bayes-optimal, but infeasible
prediction scheme based on $\mu$. We show that the bounds are
tight and that no other predictor can lead to significantly
smaller bounds. Furthermore, for various performance measures, we
show Pareto-optimality of $\xi$ and give an Occam's razor argument
that the choice $w_\nu\sim 2^{-K(\nu)}$ for the weights is
optimal, where $K(\nu)$ is the length of the shortest program
describing $\nu$. The results are applied to games of chance,
defined as a sequence of bets, observations, and rewards. The
prediction schemes (and bounds) are compared to the popular
predictors based on expert advice. Extensions to infinite
alphabets, partial, delayed and probabilistic prediction,
classification, and more active systems are briefly discussed.
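To make the setup above concrete, here is a minimal illustrative sketch in Python (not taken from the paper): the class $\M$ is a hypothetical finite set of Bernoulli distributions, the weights $w_\nu$ are chosen uniform rather than as $2^{-K(\nu)}$, and at each step the predictor takes the action minimizing the $\xi$-expected loss, updating the weights by Bayes' rule (the chain-rule form of $\xi$). All identifiers (thetas, weights, bayes_optimal_action, ...) are illustrative assumptions, not notation from the paper.

# Toy sketch of a Bayes-mixture predictor xi over a finite class M of
# Bernoulli distributions, with a Bayes-optimal action under a general loss.
import numpy as np

thetas = np.array([0.1, 0.3, 0.5, 0.7, 0.9])       # hypothetical finite class M
weights = np.full(len(thetas), 1.0 / len(thetas))   # prior weights w_nu (uniform here)

def predict(weights, thetas):
    """Mixture predictive probability xi(x_t = 1 | x_<t)."""
    return float(np.dot(weights, thetas))

def update(weights, thetas, x):
    """Bayes update of the weights after observing bit x (chain rule for xi)."""
    likelihood = thetas if x == 1 else 1.0 - thetas
    posterior = weights * likelihood
    return posterior / posterior.sum()

def bayes_optimal_action(p1, loss):
    """Action minimizing xi-expected loss; loss[a][x] is the loss of action a
    when outcome x occurs (0/1 loss reproduces plain prediction)."""
    expected = [(1 - p1) * loss[a][0] + p1 * loss[a][1] for a in range(len(loss))]
    return int(np.argmin(expected))

# Example run: the truth mu = Bernoulli(0.7) is unknown to the predictor.
rng = np.random.default_rng(0)
zero_one_loss = [[0, 1], [1, 0]]   # action a incurs loss 1 iff a != x
total_loss = 0
for t in range(50):
    x = int(rng.random() < 0.7)
    a = bayes_optimal_action(predict(weights, thetas), zero_one_loss)
    total_loss += zero_one_loss[a][x]
    weights = update(weights, thetas, x)
print("cumulative 0/1 loss of the xi-based predictor:", total_loss)

With the universal choice $w_\nu\sim 2^{-K(\nu)}$ each $\nu$ would instead be weighted by its description length; the uniform prior above merely keeps the toy example self-contained.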
Table of Contents
- Introduction
- Setup and Convergence
- Error Bounds
- Application to Games of Chance
- Optimality Properties
- Miscellaneous
- Summary
BibTeX Entry
@Article{Hutter:03optisp,
author = "Marcus Hutter",
title = "Optimality of Universal {B}ayesian Prediction for General Loss and Alphabet",
volume = "4",
year = "2003",
pages = "971--997",
journal = "Journal of Machine Learning Research",
publisher = "MIT Press",
http = "http://www.hutter1.net/ai/optisp.htm",
url = "http://arxiv.org/abs/cs.LG/0311014",
url2 = "http://www.jmlr.org/papers/volume4/hutter03a/",
ftp = "ftp://ftp.idsia.ch/pub/techrep/IDSIA-02-02.ps.gz",
keywords = "Bayesian sequence prediction; mixture distributions; Solomonoff
induction; Kolmogorov complexity; learning; universal probability;
tight loss and error bounds; Pareto-optimality; games of chance;
classification.",
abstract = "Various optimality properties of universal sequence predictors
based on Bayes-mixtures in general, and Solomonoff's prediction
scheme in particular, will be studied. The probability of
observing $x_t$ at time $t$, given past observations
$x_1...x_{t-1}$, can be computed with the chain rule if the true
generating distribution $\mu$ of the sequences $x_1x_2x_3...$ is
known. If $\mu$ is unknown but known to belong to a countable or
continuous class $\M$, one can base one's prediction on the
Bayes-mixture $\xi$ defined as a $w_\nu$-weighted sum or integral
of distributions $\nu\in\M$. The cumulative expected loss of the
Bayes-optimal universal prediction scheme based on $\xi$ is shown
to be close to the loss of the Bayes-optimal, but infeasible
prediction scheme based on $\mu$. We show that the bounds are
tight and that no other predictor can lead to significantly
smaller bounds. Furthermore, for various performance measures, we
show Pareto-optimality of $\xi$ and give an Occam's razor argument
that the choice $w_\nu\sim 2^{-K(\nu)}$ for the weights is
optimal, where $K(\nu)$ is the length of the shortest program
describing $\nu$. The results are applied to games of chance,
defined as a sequence of bets, observations, and rewards. The
prediction schemes (and bounds) are compared to the popular
predictors based on expert advice. Extensions to infinite
alphabets, partial, delayed and probabilistic prediction,
classification, and more active systems are briefly discussed.",
}