ICML 2005 Tutorial - (Marcus Hutter)

How to Predict Sequences with Bayes, MDL, and Experts

Bonn, Germany, Sunday 7, August, 2005
Organizer: Marcus Hutter, IDSIA, Switzerland


Topic Intended Audience Content Format Presenter References Links

Most passive Machine Learning tasks can be (re)stated as sequence prediction problems. This includes pattern recognition, classification, time-series forecasting, and others. Moreover, the understanding of passive prediction also serves as a basis for active learning and decision making. In the recent past, rich theories for sequence prediction have been developed, and this is still an ongoing process. On the other hand, we are arriving at the stage where some important results are already termed classical. While much of the current Learning Theory is formulated under the assumption of independent and identically distributed (i.i.d.) observations, this tutorial focusses on situations without this prerequisite (e.g. weather or stock-market time-series). The three most important approaches to sequence prediction are: Bayesian methods, Prediction with Expert Advice, and the Minimal Description Length (MDL) principle. Although aiming at the same (or similar) goal, comparisons between them are scarce. A tutorial comparing these different approaches fairly is in need. The tutorial shall initiate a discussion between (young) researchers in each field and may also trigger research on sequence prediction towards a more unified treatment.

Keywords: Machine learning; Philosophical foundations; Epicurus+Occam+Bayes+Turing+Solomonoff; Online/sequential prediction; Bayesian objective/subjective probabilities; Mixture distributions; Convergence, consistency, and loss bounds; Minimum description Length principle; Algorithmic information theory; Prefix codes and priors; Universal similarity metric; Tree-based clustering: genetic, music, text; Prediction with expert advice; Weighted majority algorithm; Follow the perturbed leader.

Intended Audience

The tutorial should by interesting for the majority of the ICML community, in particular to researchers interested in predictions of non-iid data. The philosophical and mathematical differences and similarities, advantages and disadvantages, relations, and combinations of the three approaches will be discussed. Basic prior knowledge of probability theory and statistics is assumed, but no more.

Content

30-60 minutes per item. 4-5 hours in total. More material for a full-day tutorial is available.

Philosophical Issues: The tutorial starts with the philosophical problems concerning machine learning in general and induction in particular. I illustrate the problems and their intuitive solution on various (classical) induction examples. The common principle to their solution is Occam's simplicity principle. I briefly describe the formal theory of induction based on Occam's and Epicurus' principle, Bayesian probability theory, and Turing's universal machine. Finally I describe the sequential/online setup considered in this tutorial and place it into the wider machine learning context.
Bayesian framework: Here one considers a (finite or countable or uncountable) class of probabilistic models. Each model is a distribution which generates sequences. The model class is endowed with a prior distribution. The true model, i.e. the model which actually generates the sequence we want to predict, is assumed to be within the class. Now Bayes rule is the optimal way of predicting, provided that the true model is randomly chosen according to the prior distribution. But even if this does not hold, one can give good performance guarantees for Bayesian prediction which depend on the prior of the true model.
MDL/MML: The Minimum Description/Message Length principle is one of the most important concepts in Machine Learning, and serves as a scientific guide, in general. The motivation is as follows: To make predictions involves finding regularities in past data, regularities in data allows for compression, hence short descriptions of data should help in making predictions. More formally, from a model class, a model is chosen which minimizes the joint description length of the model and the data observed so far given the model. These criteria can be related to a MAP (maximum a posteriori) model choice and thus to the Bayesian framework. The MDL method has been studied from very concrete and highly tuned practical applications to general theoretical assertions. Sequence prediction is just one application of MDL.
Universal Similarity Metric: The MDL idea has also be used to define the so called information distance or universal similarity metric, measuring the similarity between two individual objects.
I will present some very impressive recent clustering applications based on standard Lempel-Ziv or bzip2 compression, including a completely automatic reconstruction
- of the evolutionary tree of 24 mammals based on complete mtDNA
- of the classification tree of 52 languages based on the declaration of human rights.
Prediction with expert advice: This very active field of research studies prediction in a worst-case setting, i.e. the actual sequences are generated by an adversary. Instead of trying to predict absolutely well, the goal is a good relative performance with respect to a pool of experts. One can prove that the regret is essentially bounded by the square root of a complexity term and the loss of the best expert. This method and corresponding performance bounds are related or dual to Bayesian sequence prediction.

Format

Sunday 7.Aug'05, 14:00-17:30
Slides
Video

Presenter

Marcus Hutter (*1967) received a masters degree in computer sciences in 1992 at the University of Technology in Munich, Germany. After his PhD in theoretical particle physics in 1995, he developed algorithms in a medical software company for 5 years. Since 2000 he is working as a researcher at the AI institute IDSIA in Lugano, Switzerland, where he published over 35 research papers in/at ECML, ICML, COLT, ALT, JMLR, JCSS, IEEE-TIT, ..., with half of them being on topics of the tutorial. His current interests are centered around reinforcement learning, algorithmic information theory and statistics, universal induction schemes, adaptive control theory, and related areas. In his recently published book "Universal Artificial Intelligence" (Springer, EATCS, 2004), he unifies sequential decision theory with algorithmic information theory. Since 2003, he has also been an honorary official lecturer at the Munich University of Technology.

Address:
Marcus Hutter
IDSIA
Galleria 2
CH-6928 Manno-Lugano
Switzerland

Phone: +41-91/6108668

Fax: +41-91/6108661

Email: marcus@idsia.ch

HomePage: http://www.idsia.ch/~marcus

References

M. Hutter. Optimality of Universal Bayesian Sequence Prediction for General Loss and Alphabet
Journal of Machine Learning Research 4 (2003) 971-1000
P. D. Gruenwald, Tutorial on Minimum Description Length
Chapters 1 and 2 of: Advances in Minimum Description Length: Theory and Applications, MIT Press, pp. 3-80, 2005.
N. Cesa-Bianchi, Y. Freund, D.P. Helmbold, D. Haussler, R.E. Schapire, M.K. Warmuth. How to Use Expert Advice
Journal of the ACM, Vol. 44(3), pp. 427-485, May 1997
Y. Freund and R. E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting
Journal of Computer and System Sciences, 55(1):119--139, 1997.
N. Merhav and M. Feder. Universal Prediction
IEEE Transactions on Information Theory, Vol.44, No.6, pp. 2124-2147, 1998
R. Cilibrasi and P.M.B. Vitanyi, Clustering by compression
IEEE Trans. Information Theory, 51(4):1523-1545, 2005
M. Hutter, Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability
Springer, Berlin, 300 pages, 2004.

Links

ICML 2005 Conference
Bayes Tutorial (ICML 2004)
Workshop on Universal Learning Algorithms and Optimal Search (NIPS 2002)
Occam Workshop (NIPS 2001)
MDL Workshop (NIPS 2001)

Address:	Marcus Hutter IDSIA Galleria 2 CH-6928 Manno-Lugano Switzerland
Phone:	+41-91/6108668
Fax:	+41-91/6108661
Email:	marcus@idsia.ch
HomePage:	http://www.idsia.ch/~marcus

ICML 2005 Tutorial on