Research Projects of Marcus Hutter


Personal motivation. Since my early youth I have pursued two great goals in my life. One is to work on the physical ``Theory of Everything'', which motivated me to do my BSc and PhD in Theoretical Particle Physics. The other is Artificial Intelligence, which motivated me to do my BSc & MSc & Habilitation in Theoretical Computer Science. This personal drive ``explains'' most of my CV. I have also always been interested in significant open math problems, most prominently the P=NP question. My current research is centered around artificial intelligence, machine (reinforcement) learning, information theory and statistics, algorithmic information theory, Kolmogorov complexity, the minimum description length (MDL) principle, Bayesian/robust/expert/MDL/online/sequence prediction, computational complexity theory, universal Solomonoff induction, universal Levin search, sequential decision theory, and adaptive control theory (see e.g. [P04uaibook] for how they fit together).

Current and past projects (...-2010) Below you can find a description of my current and past (and future) projects with links to the corresponding publications. Alternatively you may want to have a look at the (partial and quite condensed) summary slides of my projects. Therein I give an overview of what I regard as the foundations of (statistical) machine learning and of my own research in this direction. I start with a survey of the key subfields and dichotomies. Most, if not all, problems can formally be reduced to predicting sequences or acting well. Occam's razor, Bayes' rule, utility, information, and computability theory are key to solving these problems. I discuss the following developments and applications: determining (in)dependence of samples by mutual information, non-parametric Bayesian inference, robust estimation, Bayesian change point detection, Bayesian sequence prediction, the MDL principle, prediction with expert advice, applications of algorithmic information theory, optimization, computer vision, and image processing. I also briefly summarize my past work in particle physics, medical software development, and other areas.

Future projects (2010-...) It is hard to make predictions, and this also holds for predicting future research projects. Research is not performed in a vacuum, but adapts to trends, opportunities, and the local environment. Some completely new topics may catch my interest, and/or some of my former activities on image processing, 3D graphics, numerical simulations, and others may be revived.

Bayesian statistics. The Bayesian framework is ideally suited for complex statistical induction problems, including classification tasks, pattern recognition, time-series (e.g. stock market and weather) forecasting, and many more, where data is generated from some distribution μ. If μ is unknown but known to belong to some class M of environments = hypotheses = models, Bayes' rule combines one's prior belief wν in environment ν ∈ M with the observed data likelihood to yield one's posterior confidence in hypothesis ν. In the predictive setting one can base one's prediction directly on the Bayes-mixture ξ, defined as the wν-weighted average over these distributions. For i.i.d. data, a novel, non-parametric, consistent, scale-invariant Bayesian tree prior has been constructed, which allows very easy, efficient and exact inference and estimation of most quantities of interest [P05bayestree, P09bayestreex]. An exact and efficient Bayesian regression algorithm based on Dynamic Programming for piecewise constant functions of unknown segment number, boundary locations, and levels has also been developed [P07pcreg].
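For concreteness, the central quantities can be written as follows (standard notation, not tied to any particular paper); x denotes the observed data sequence:

    ξ(x) = Σ_{ν∈M} wν·ν(x)                                    (Bayes-mixture = marginal likelihood)
    wν(x) = wν·ν(x) / ξ(x)                                    (Bayes' rule: posterior belief in ν)
    ξ(x_t | x_1..x_{t-1}) = ξ(x_1..x_t) / ξ(x_1..x_{t-1})     (predictive probability of the next symbol)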
Key open problems: The Bayesian framework leaves open how to choose the model class M and the prior wν. Subjective, objective, universal, and sets of priors are the key choices [P06usp].
Applications are abundant, e.g. density estimation and regression, feature selection, dependency networks, and handling small sample sizes. It is intended to adapt and further develop [P07pcreg] within a joint SNF grant requested with the Oncology Institute of Southern Switzerland (IOSI) and apply it to microarray-CGH data for determining cancer-relevant genes.

Information theory. Shannon entropy is the standard measure of the information rate of a stochastic source. The mutual information between two random variables is defined similarly, and is the default (in)dependency measure. If the true probabilities are unknown, they are typically estimated from observed relative frequencies. The resulting empirical mutual information is then itself a random variable. For robust or credible estimates, rather than risky average predictions, not only the mean but also quantiles or variances are needed. These and higher moments under a second order Dirichlet prior have been derived [P01xentropy], successfully used for robust feature selection [P02feature], and extended to the case of incomplete data [P03mimiss,P05mifs]. For the Imprecise Dirichlet Model, which goes beyond Bayes, a comprehensive collection of analytical formulae for computing robust intervals of entropy, mutual information and other quantities has been derived [P03idm] and applied to infer robust trees [P05tree].
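As an illustration only (not code from the cited papers), the following minimal Python sketch computes the plug-in empirical mutual information from a contingency table of joint counts; for small samples this estimate is itself quite noisy, which is exactly why the distributional results above are needed. Function and variable names are mine.

    import numpy as np

    def empirical_mutual_information(counts):
        """Plug-in estimate of I(X;Y) in nats from a table of joint counts."""
        n = counts.sum()
        p_xy = counts / n                      # empirical joint distribution
        p_x = p_xy.sum(axis=1, keepdims=True)  # empirical marginals
        p_y = p_xy.sum(axis=0, keepdims=True)
        nz = p_xy > 0                          # avoid log(0) terms
        return float((p_xy[nz] * np.log(p_xy[nz] / (p_x @ p_y)[nz])).sum())

    # Example: 100 paired samples of two binary variables
    counts = np.array([[40, 10],
                       [10, 40]])
    print(empirical_mutual_information(counts))   # clearly > 0: suggests dependence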
Key open problems: better treatment of incomplete observations; efficient approximations; principles for choosing acceptance/rejection thresholds.
Example applications: safety critical areas such as planning and designing medical treatments and atomic reactors.

Complexity theory. Computational complexity theory studies the computation time and space required to solve a given problem. Remarkably, an asymptotically fastest algorithm for all well-defined problems exists [P02fast]. A closely related idea led to the computable AIXItl model [P00kcunai,P01aixi]. The class P of polynomial-time (=efficiently) solvable problems and the class NP of problems whose solutions can be verified in polynomial time are the most famous in the ever-increasing zoo of complexity classes. Whether P=NP is the most important open problem. Some NP-hard problems allow for efficient approximation schemes (PTAS); for the knapsack problem the approximation can even be computed in linear time [P06knapsack]. Algorithmic complexity (or information) theory is concerned with the information content in strings, programs, and other objects. Kolmogorov defined the complexity K of (an arbitrary object coded as) a string as the length of the shortest program on a universal Turing machine computing that string. This description ``finds'' all effective regularities of the string and cannot be compressed any further. Shannon entropy, specific MDL codings, Lempel-Ziv or CTW compression, and many others can be regarded as efficient approximations to the incomputable Kolmogorov complexity K.
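In standard notation, with U a fixed universal (prefix) Turing machine and l(p) the length of program p in bits:

    K(x) := min{ l(p) : U(p) = x }

By the invariance theorem, K depends on the choice of U only up to an additive constant, which is why it can serve as an (asymptotically) objective measure of information content.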
Key open problems: effective approximations of K which respect key properties of K; applicability of other K complexity variants for prediction [P03unipriors, P06unipriorx].
Example applications: foundation of information and induction, universal sequence prediction [P01errbnd, P03unimdl, P06unimdlx, P03mlconv, P04mlconvx, P07mlconvxx, P05postbnd, P06usp]; Maxwell's demon; incompressibility proof-method; reversible computing; universal search and optimization [P02fast]; key ingredient to the AIXI theory [P04uaibook]; detecting similarities of objects.

The Minimum Description Length (2003-...) (MDL) principle is one of the most important concepts in Machine Learning, and even serves as a guide for science in general. An MDL estimator seeks a probabilistic model that is simple and at the same time assigns high likelihood to the data. MDL has been studied on all possible levels, from very concrete and highly tuned practical applications up to general theoretical assertions. MDL research mainly concentrates on continuous model classes, coding, modelling, and independent identically distributed (i.i.d.) data. On the other hand, the non-i.i.d. case is important for sequential prediction tasks, and from a computational point of view the largest relevant model class is countable, namely the class of all enumerable semimeasures, deeply rooted in Algorithmic Information Theory and Solomonoff's universal induction scheme. Convergence and stabilization of MDL based sequential prediction for arbitrary discrete (deterministic and probabilistic) model classes has been shown in [P03unimdl, P04mdl2p], but convergence may be slow [P04unimdlx, P04mdlspeed, P05mdl2px, P06mdlspeedx]. Three MDL variants have been identified, and the results have been extended to regression and classification [P05mdlreg]. Apart from these findings, the countable non-i.i.d. case has been left mostly untouched for future exploration.
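Schematically, a two-part-code MDL estimator for data x and a countable model class M can be written as (simplified notation, not specific to the cited papers):

    ν^MDL(x) := arg min_{ν∈M} [ l(ν) − log ν(x) ]

where l(ν) is the description (code) length of model ν and −log ν(x) is the code length of the data under ν; the minimized sum is the total length of a two-part code for x, making precise the trade-off between model simplicity and data fit.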
Key open problems: Characterize the class of coding schemes and discrete model classes for which MDL performs well (as well as Bayes); further explore the differences between the 3 MDL variants.
Example applications: prevention of overfitting; regression for large model classes; Bayesian net learning; inference of decision trees; learning (e.g. grammars) from positive examples only; sequence prediction; molecular biology; functional genomics (expression/mapping/interpretation).

Algorithmic Probability and Information Theory (2002-...) are the mathematical foundations of universal (domain independent) induction and information. Kolmogorov defined the information content or complexity K of (an arbitrary object coded as) a string as the length of the shortest program on a universal Turing machine computing that string. This description "finds" all effective regularities of the string and cannot be compressed any further. Shannon entropy, specific MDL codings, Lempel-Ziv compression, and many others can be regarded as efficient approximations to the incomputable Kolmogorov complexity K. Solomonoff defined (earlier) the closely related universal a priori probability M, which assigns high/low probability to simple/complex strings, thus quantifying Occam's razor and Epicurus' multiple-explanation principle. Solomonoff's induction scheme based on M results in optimal sequential predictions (in expectation and with probability one); the only assumption is that the sequence is sampled from a computable probability distribution [P99errbnd, P03optisp, P03unipriors, P05unipriorx, P05postbnd]. Despite some nearby results and proofs in the literature [P03mlconv], the stronger result of convergence for all (Martin-Löf) random sequences does not hold in general [P04mlconvx, P06mlconvxx].
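For reference, Solomonoff's universal a priori probability can be written in standard notation (up to minor variations between texts) as

    M(x) := Σ_{p : U(p) = x...} 2^(−l(p)),

the sum being over all (minimal) programs p for which the universal monotone Turing machine U outputs a string starting with x; prediction then proceeds via the conditional M(x_t | x_1..x_{t-1}) = M(x_1..x_t) / M(x_1..x_{t-1}).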
Key open problems: effective approximations of K which respect key properties of K; Martin-Löf convergence of M or related quantities; applicability of other K complexity variants for prediction; convergence class of time-limited M variants like Schmidhuber's speed prior; task-dependent universal distance measures.
Example applications: foundation of information and induction; Maxwell's demon; incompressibility proof-method; reversible computing; universal search and optimization; key ingredient to the AIXI theory (see below).

Prediction with expert advice (2002-...) (PEA) is currently a very active field. Many variations known under many names (prediction/learning with expert advice, weighted majority/average, aggregating strategy, boosting, hedge algorithm, ...) have meanwhile been invented. PEA combines the forecasts of experts to form its own prediction and can be shown to perform nearly as well as the best expert in the considered class of experts, without making any assumption on the environment(al data sequence). An old, rediscovered variant, "Follow the Perturbed Leader" (FPL), made it possible to elegantly derive generalized performance/regret bounds for adaptive learning rates (general weights, alphabets and losses) [P04expert, P05expertx], even in reactive environments [P05actexp, P05actexp2]. Dual to this approach, expected performance bounds for predictors based on Bayes mixtures of environments rather than experts have been studied (see below). The duality of both approaches and results has been pointed out in [P04bayespea] and should be further explored in the future, probably leading to novel and fruitful bounds and algorithms.
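As an illustration only (a common FPL variant with a fixed learning rate, not the adaptive scheme analyzed in [P04expert]), a minimal Python sketch:

    import numpy as np

    def fpl_choices(loss_matrix, learning_rate=1.0, rng=None):
        """Follow the Perturbed Leader: at each step follow the expert whose
        perturbed cumulative loss is smallest. loss_matrix[t, i] is the loss
        of expert i at time t (revealed only after the choice is made)."""
        rng = rng or np.random.default_rng(0)
        T, n = loss_matrix.shape
        cumulative = np.zeros(n)
        choices = []
        for t in range(T):
            perturbation = rng.exponential(1.0 / learning_rate, size=n)
            choices.append(int(np.argmin(cumulative - perturbation)))
            cumulative += loss_matrix[t]
        return choices

With a suitably decreasing/adaptive learning rate, the cumulative loss of such a scheme exceeds that of the best expert in hindsight only by a regret term growing sublinearly in the number of time steps.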
Key open problems: quantification, interpretation and exploitation of the duality of PEA and Bayesian sequence prediction (see below); loss bounds for some PEA variant for general weights without the hierarchy trick used in FPL; investigate structures for which PEA is efficient even for a large number of experts; close the sqrt(2) gap between lower and upper bounds.
Example applications: combining forecasting software (for weather, stock market, ...); online shortest path problem; binary search tree.

Bayesian (Sequence) Prediction (2001-...) The Bayesian framework is ideally suited for induction problems, including classification tasks, pattern recognition, time-series (e.g. stock market and weather) forecasting, and many more, where (sequential) data is generated from some distribution μ. If μ is unknown, but known to belong to some class of distributions, one can base one's prediction on the Bayes-mixture ξ, defined as a weighted average over these distributions. One can show that the predictive distribution of ξ converges rapidly to the true predictive distribution μ [P01alpha, P03mlconv, P04mlconvx, P06mlconvxx]. The expected loss of the Bayes-optimal universal prediction scheme based on ξ is very close to the loss of the optimal, but infeasible, prediction scheme based on μ [P01loss, P02spupper]. Tight bounds have been derived, and no other predictor can be significantly better. Furthermore, ξ is Pareto-optimal, and an Occam's razor argument shows that the universal weights based on Algorithmic Information Theory, leading to Solomonoff's universal induction scheme, are optimal [P99errbnd, P03optisp, P03unipriors, P05unipriorx, P05postbnd]. The results may be generalized and applied in various ways, e.g. to games of chance in the form of a sequence of bets, observations, and rewards [P03optisp]. In particular for i.i.d. data, a novel non-parametric, consistent, scale-invariant Bayesian tree prior has been constructed, which allows very easy, efficient and exact inference and estimation of most quantities of interest [P04bayestree, P05bayestreex]. Finally, an exact and efficient Bayesian regression algorithm based on Dynamic Programming for piecewise constant functions of unknown segment number, boundary locations, and levels has been developed [H06pcreg]. It is intended to adapt and further develop this algorithm within a joint SNF grant requested with the Oncology Institute of Southern Switzerland (IOSI) and apply it to microarray-CGH data for determining cancer relevant genes.
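The flavor of these bounds can be conveyed by one representative result, stated here in simplified form from memory (binary alphabet, μ ∈ M with prior weight wμ; see the cited papers for the precise statements and generalizations):

    Σ_{t=1}^∞ E[ ( μ(1|x_1..x_{t-1}) − ξ(1|x_1..x_{t-1}) )² ]  ≤  ln wμ^{-1}  <  ∞,

which forces the predictions of ξ to converge to those of μ rapidly and with μ-probability one, the total prediction error being controlled by how implausible the true environment was a priori.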
Key open problems: High probability performance guarantees.
Example applications: prediction of Markov processes; speech recognition; handwritten word/text/character recognition, microarray-CGH data analysis.

Robust predictions (2002-...), rather than (risky) best-on-average predictions, are sometimes necessary. Various ways (apart from PEA) have been suggested to go beyond maximizing expected performance. In the Bayesian framework this means computing quantiles and/or variances. Higher moments of the mutual information under a second order Dirichlet prior have been derived [P01xentropy], successfully used for robust feature selection [P02feature], and extended to the case of incomplete data [P03mimiss, P02mifs]. For the Imprecise Dirichlet Model, which goes beyond Bayes, a comprehensive collection of analytical formulae for computing robust intervals has been derived [P03idm] and applied to infer robust trees [P03tree]. These approaches are promising, theoretically as well as practically, and should be investigated further.
Key open problems: adequate treatment of incomplete observations; efficient approximations; principles for choosing acceptance/rejection thresholds.
Example applications: safety critical areas such as planning and designing medical treatments and atomic reactors.

Optimization (2002-...) of functions on very large/high-dimensional spaces is crucial in a plethora of applications, including operations research and artificial intelligence, to name only two. The functions often involve all known difficulties at once: multimodality, deceptiveness, discontinuities, ... Usually one has to be satisfied with approximate solutions. For concrete, often NP-complete problems, polynomial-time approximation schemes (PTAS) can often be found. Finding such algorithms problem after problem is a laborious, but rewarding procedure, and having a PTAS does not necessarily mean possessing a practically efficient algorithm. Remarkably, a fully linear time approximation algorithm for knapsack problems could be developed [P02knapsack], improving upon previous PTAS algorithms. Other problems are also expected to possess linear time algorithms. Since a rigorous theoretical approach is not always possible/successful, more heuristic approaches are often used, including simulated annealing and population based optimization. More effective selection and deletion schemes (FUSS/FUDS) for evolutionary (or other population based) algorithms, which generate selection pressure towards sparsely populated fitness regions rather than towards higher fitness, have been invented [P01fuss, P05fuds, P05fuo]. First experiments, to be continued in the future, indicate superiority to all previous selection schemes and to random deletion on many hard optimization problems, including the Traveling Salesman Problem, the set covering problem, and maximum CNF3SAT [P04fussexp, P05fuo]. The interesting ``go with the winners'' algorithms by Aldous and Vazirani may bridge the gap between heuristic and rigorous approaches to complex optimization problems. This connection is currently being explored.
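The core of the fitness uniform selection scheme (FUSS) can be sketched in a few lines of Python, based on the description above (details such as tie-breaking and the companion deletion scheme FUDS follow [P01fuss, P05fuds] and are omitted here):

    import random

    def fuss_select(population, fitness):
        """Sample a fitness level uniformly between the currently worst and best
        fitness and select the individual whose fitness is closest to it.
        This pushes selection towards sparsely populated fitness regions."""
        f = [fitness(ind) for ind in population]
        target = random.uniform(min(f), max(f))
        best_index = min(range(len(population)), key=lambda i: abs(f[i] - target))
        return population[best_index]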
Key open problems: linear time approximation algorithms; test FUSS and variants on other hard optimization problems; Consider ``go with the winners'' on more general practically important graph structures.
Example applications: operations research; scheduling problems.

Reinforcement Learning (2005-...) is a general approach to sequential decision making and active learning. The setup consists of a decision maker (agent) which interacts with an environment (world) by an alternating sequence of actions and observations. The observations consist of an informative part (e.g. a video image) and occasionally a potentially delayed reward part (e.g. the function value in the case of function maximization, or the battery level in the case of self-maintaining robots). The goal is to learn decisions which maximize reward. If the decisions do not influence the environment, the setup reduces to the passive prediction case discussed above. If the environment is known, sequential decision theory provides a formal, but in general inefficient, optimal answer. Dynamic programming efficiently solves the case of known Markov environments with few states. Adaptive/dynamic control theory typically addresses the case of unknown/known linear environments with quadratic loss functions (interesting for engineering, but too limited for AI). The general case of unknown environment and active agents is very difficult. The AIXI(tl) theory [P01aixi, P05aixifoe] represents an optimal but inefficient ``solution''. Efficient learning and planning algorithms exist for completely observable finite state Markov decision processes (MDPs). While these environments are very limited, one can reduce many real-world problems to fit this case. Traditionally this is an art that involves significant effort by designers, but it is possible to automate the reduction process to a large degree [P09phimdp, P09phimdpx]. Extensions to much more powerful structured MDPs are also possible [P09phidbn]. Many practical heuristic algorithms have been developed, in particular versions of temporal difference learning [P08qlearn].
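As an illustration of the temporal-difference family mentioned above, here is a minimal tabular Q-learning sketch in Python (the environment interface reset()/step()/actions is hypothetical and only for illustration, not taken from the cited papers):

    import random
    from collections import defaultdict

    def q_learning(env, episodes=1000, alpha=0.1, gamma=0.95, epsilon=0.1):
        """Tabular Q-learning for a fully observable MDP with finitely many states."""
        Q = defaultdict(float)                       # Q[(state, action)] estimates
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                # epsilon-greedy choice: mostly exploit, sometimes explore
                if random.random() < epsilon:
                    a = random.choice(env.actions)
                else:
                    a = max(env.actions, key=lambda a_: Q[(s, a_)])
                s2, r, done = env.step(a)
                best_next = 0.0 if done else max(Q[(s2, a_)] for a_ in env.actions)
                # temporal-difference update towards the one-step bootstrapped target
                Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
                s = s2
        return Q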
Key open problems: balancing exploration versus exploitation; optimal learning rate and eligibility parameter; convergence in case of function approximation and partial observability; comparison to model-based learning, non-ergodic and/or non-stationary environments; beyond (PO)MDPs; imitation-based learning (e.g. in a multi-agent or symmetric game setup).
Example applications: board games like backgammon and checkers; elevator control; robo-soccer; dynamic channel allocation; inventory problems.

Universal decision theory (2003-...) My favorite research project is on optimal sequential decision making, based on universal algorithmic probability (called AIXI(tl), see below). Whereas past research was focused on proving optimality for ultra general problem classes [P02selfopt], future research will also investigate the applicability of AIXItl. Variants of universal Levin search have been implemented and successfully applied at IDSIA by Schmidhuber to a variety of problems concerning induction, classification, and others. Analogously, we expect variants of AIXItl to be successful in solving problems in ``reactive'' environments, also called active learning. The goal is to implement AIXItl and apply it to difficult, yet simply describable problems and compare the performance to other approaches.
Key open problems: in which sense is AIXI universally optimal; analyze the behavior of AIXI for several concept classes; axiomatic/algebraic characterization of AIXI; choice of the horizon; downscale and implement AIXI (like Levin search and OOPS); more insight into how AIXI works.
Example applications: mathematical and philosophical foundation of intelligence; repeated prisoner problem and generalized variants [P05aixifoe]; the n-armed bandit problem; simple strategic games.

Computer vision & image processing. While humans can easily recognize 3d objects or whole scenes from a stereographic projection, for computers the reconstruction of 3d objects from single or multiple photographic projections is a challenge. Consider the recognition and 3d reconstruction of a specific object in a realistic (messy) outdoor photograph. The picture contains background, other objects, unknown lighting conditions, shadows, and texture. Reliably finding ``landmark'' points of interest for matching them with an idealized 3d model is not easy in general. But for cars, two wheels, if visible, can reliably be detected and provide enough information for performing the 2d-3d registration [P09wheel]. Alternatively, one may match 2d or 3d images directly by minimizing a pixel/voxel-based distance measure that is insensitive to different modalities, like mutual information. Overlaying medical X-ray, CT, MRI, ultrasound, etc. images is an important application. The merged multi-modal volume data can be displayed with volume or (after 3d segmentation) surface rendering algorithms. Sophisticated anti-aliasing based on finite-element interpolation can significantly enhance the display quality [P02uspatent, P01eupatent, R&D@BrainLAB].

Bio-statistics & bio-informatics (2006-2010) The advent of high-throughput DNA sequencing and genomic profiling with micro-arrays resulted in an unprecedented flood of data in molecular biology. This required augmenting or replacing traditional statistical modeling and analysis (bio-statistics) with computation-intensive algorithmic data processing (bio-informatics). Micro-arrays can determine expression, copy number, and loss of heterozygosity (LOH) at millions of locations in our DNA simultaneously. Comparing the results from cancerous and normal cells can reveal the genes and molecules responsible for the progression of cancer. Since the measurements are extremely noisy, sophisticated statistical algorithms are required to separate signal from noise [P09bcnax, P09mbpcrcode]. By combining various data sources one can improve the identification of the different types of genomic aberrations [P09cnloh].

Universal Artificial Intelligence (2000-2006) My all-time favorite research project is about a mathematical theory of intelligence, called AIXI. The AIXI model is a parameter-free optimal reinforcement learning agent embedded in an arbitrary unknown environment. Disciplines of importance involved in the development of the AIXI model are artificial intelligence, reinforcement learning, algorithmic information theory, Kolmogorov complexity, the minimum description length (MDL) principle, computational complexity theory, information theory and statistics, universal Solomonoff induction, universal Levin search, sequential decision theory, and adaptive control theory. The model was first introduced and discussed in [P00kcunai]. More succinct descriptions have been published in [P01aixi, P01decision]. See also [P03aixigentle] for an introductory survey and [P04uaibook] for a comprehensive treatment. The AIXI model has been argued to formally solve a number of problem classes, including sequence prediction, strategic games, function minimization, and reinforcement and supervised learning [P00kcunai]. It comes along with a universal intelligence order-relation/measure [P04uaibook, P05iors]. The generalization AIξ has recently been shown to be self-optimizing and Pareto-optimal [P02selfopt]. Tight [P03optisp] error [P99errbnd, P01alpha] and loss [P01loss] bounds for Solomonoff's universal sequence prediction scheme have been proven. I found a remarkable, asymptotically fastest and shortest algorithm for all well-defined problems - fastest save for a constant factor 5 and a problem class-specific additive constant [P01fast]. A closely related idea led to a computable AIXItl model [P00kcunai,P01aixi]. Ideas, loosely related to AIXI, on a market/economy based reinforcement learner [P01market], a gradient based reinforcement planner [P01grep], and a reinforcement learning mobile robot [P06robot] have been implemented.
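In standard notation (up to minor variations between expositions), AIXI's action in cycle k, with interaction ending at horizon m, can be written as

    a_k := arg max_{a_k} Σ_{o_k r_k} ... max_{a_m} Σ_{o_m r_m} [ r_k + ... + r_m ] · Σ_{q : U(q, a_1..a_m) = o_1 r_1..o_m r_m} 2^(−l(q)),

where the a are actions, the o observations, the r rewards, U a universal monotone Turing machine, and l(q) the length of program q: expected total reward is maximized with respect to Solomonoff-style, program-length-weighted beliefs about the environment.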

Applied R&D at Company BrainLAB (1996-2000) During my full-time employment at BrainLAB from 1996 to 2000 I developed various numerical algorithms in the medical field. They include a Neuro-Navigation system, a Brachytherapy planning system, a dose algorithm (PencilBeam) for radiotherapy and IMRT, a real-time software volume renderer, various image processing modules, and many more. Image and volume data enhancement and post-antialiasing algorithms based on finite-element interpolation have been developed, patented [P01eupatent,P02uspatent], and implemented. Due to tight time schedules and the company's nondisclosure policy, further publications were unfortunately not possible.

PhD in Particle Physics (1993-1996) The universe as a whole is determined by gravity, which is described by the general theory of relativity. Its constituents, the elementary particles, behave according to the rules of quantum field theory or, more specifically, according to the standard model of particle physics. Superstring theory could turn out to be the single, elegant Theory of Everything [P10ctoex] physicists dream of. This may also solve the old outstanding (philosophical) problem of the interpretation of quantum theory (collapse of the wave function, Schrödinger's cat, ...). My interests in physics are centered around these topics. My research concentrated on non-perturbative QCD, especially instantons [P96diss,P96thesis], the gluon mass [P93gluon], the quark propagator [P95prop], the eta' mass [P96eta], meson correlators and masses [P97instanto], and the proton spin [P95spin]. My favorite paper explains the exponential fermion mass spectrum between successive generations [P97family]. On a lighter note, see my paper [P10ctoex] on the importance of observer localization when developing theories of everything, i.e. on how to find theories of everything.

Early student work (1983-1992) A classifier system, allowing for comparison of many popular variants, was implemented in my Master's thesis [P92cfs]. It also contains the first proof of the equivalence of ranking and tournament selection. Hebb nets are unsupervised learners; in [P90fopra] it was demonstrated that Hebb nets can also learn from reinforcement feedback. Computer-aided design demands significant computational resources, and only two serious CAD programs for 8-bit computers exist, one of them (written in assembler) being [P87cad]. Some other mentionable programming jobs include the development of a member organization program in DBase (1983, Steuerberater Keller), a user interface for an expert system under GEM (1987, IABG), and a protection module and organization program for licensing programs in C (1993, IABG).
