
# Monte Carlo AIXI
This demo should give you the flavor of the environments and models that we're using. We use the standard Bayesian agent, which has a mixture model
$$
\xi(e)=\sum\_{\nu\in\mathcal{M}}w\_{\nu}\nu(e)
$$
over different gridworlds contained in the model class \\(\mathcal{M}\\). Compared to the context-tree model class (of [MC-AIXI-CTW](https://arxiv.org/abs/0909.0801) fame), this model class has a lot of _domain knowledge_. To see how we parametrize the gridworlds and set up the model class, see below.
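To make the mixture concrete, here is a much-simplified sketch of \\(\xi\\) and the corresponding Bayesian posterior update over a toy two-environment model class. The function and variable names are illustrative, not the AIXIjs API.

```javascript
// Sketch (illustrative, not the AIXIjs API): a Bayes mixture
// xi(e) = sum_nu w_nu * nu(e), with posterior update w_nu <- w_nu * nu(e) / xi(e).
// Each model nu is represented as a function from percept to probability.

function mixtureProbability(models, weights, e) {
  let xi = 0;
  for (let i = 0; i < models.length; i++) {
    xi += weights[i] * models[i](e);
  }
  return xi;
}

function bayesUpdate(models, weights, e) {
  const xi = mixtureProbability(models, weights, e);
  // Posterior: w_nu' = w_nu * nu(e) / xi(e)
  return weights.map((w, i) => (w * models[i](e)) / xi);
}

// Two toy deterministic environments: one always emits percept 1, the other percept 0.
const models = [e => (e === 1 ? 1 : 0), e => (e === 0 ? 1 : 0)];
let w = [0.5, 0.5]; // uniform prior
const xi1 = mixtureProbability(models, w, 1); // 0.5: the mixture is uncertain
w = bayesUpdate(models, w, 1); // observing e = 1 falsifies the second model
```

After the update, all posterior mass sits on the one environment consistent with the observed percept.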
For planning, we use \\(\rho\\)UCT, a generalization of the UCT algorithm (Kocsis & Szepesvári, 2006), which uses Monte Carlo tree search to estimate the value \\(V^{*}\_{\xi}\\) up to some horizon \\(m\\).
# Thompson Sampling
The Thompson sampling policy \\(\pi\_T\\) is: every _effective horizon_ \\(H\\), sample an environment \\(\rho\\) from the posterior \\(w(\cdot\lvert ae\_{<t})\\) and follow the \\(\rho\\)-optimal policy for \\(H\\) time steps. For the Gridworld dispenser class parametrized by dispenser location, the red dot
represents the position of the dispenser in \\(\rho\\); the \\(\rho\\)-optimal policy is to walk toward (and stay around) the dispenser. This means that the agent should chase the red dot around.
Thompson sampling is asymptotically optimal (Leike et al., 2015), unlike AIXI. This is essentially because Thompson sampling commits you to a fixed policy for a period, giving you the chance to learn and explore better than by random exploration or even by being greedy with respect to \\(V^{\pi}\_{\xi}\\).
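A minimal sketch of this control loop, assuming a stand-in `sampleFromPosterior` and stubbing out the \\(\rho\\)-optimal planning step (the environment names and the random source are illustrative):

```javascript
// Sketch of the Thompson sampling loop: resample an environment from the
// posterior every H steps, then commit to it until the next resample.

function sampleFromPosterior(envs, weights, rand) {
  const u = rand();
  let acc = 0;
  for (let i = 0; i < envs.length; i++) {
    acc += weights[i];
    if (u <= acc) return envs[i];
  }
  return envs[envs.length - 1];
}

function thompsonAgent(envs, weights, horizon, steps, rand) {
  const trace = [];
  let rho = null;
  for (let t = 0; t < steps; t++) {
    if (t % horizon === 0) {
      rho = sampleFromPosterior(envs, weights, rand); // resample every H steps
    }
    trace.push(rho); // follow the rho-optimal policy for this step (stubbed)
  }
  return trace;
}

// With rand() fixed at 0.95, the sampled environment is always 'envB'.
const trace = thompsonAgent(['envA', 'envB'], [0.9, 0.1], 3, 6, () => 0.95);
```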
# Dogmatic priors
Like all Bayesians, AIXI's behavior and learning depend on the choice of its prior. It turns out (Leike & Hutter, 2015) that there are circumstances under which a Bayesian reinforcement learner can _never_ overcome the bias of its prior. In this demo, AIXI believes, with high (99%) probability, that the squares adjacent to it are traps. In reality it is in a normal, friendly gridworld with no traps, but it is too afraid to try anything, and never learns that its beliefs were wrong.
# Minimum Description Length (MDL) agent
In contrast to AIXI, which mixes over all plausible environments and weights them by their complexity, the MDL agent simply uses the policy that is optimal in the simplest unfalsified environment. That is, its policy is
$$
\pi\_{\text{MDL}}=\arg\max\_{a\in\mathcal{A}}V^{*}\_{\rho},
$$
where
$$
\rho = \arg\min\_{\nu\in\mathcal{M}\ :\ w\_\nu > 0} K(\nu).
$$
Here \\(K(\cdot)\\) is some complexity measure that approximates the Kolmogorov complexity from above. Note that the MDL agent will fail completely in a stochastic environment, since it can never completely falsify an environment (\\(w\_\nu\to 0\\)) if that environment assigns nonzero probability to all percepts. Even if only a small subset of the percept space \\(\mathcal{E}\\) is noisy, if there are two or more environments that are only disambiguated (one falsified while the other confirmed) by percepts in this set, then the MDL agent will never disambiguate them and will be stuck (in general) with a suboptimal policy \\(\pi^{*}\_{\rho}\\) with \\(V^{*}\_{\rho} < V^{*}\_{\mu}\\).
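The environment-selection rule above can be sketched in a few lines; the complexity values \\(K(\nu)\\), weights, and names here are illustrative, not the AIXIjs internals.

```javascript
// Sketch: the MDL agent picks the simplest unfalsified environment,
// i.e. the minimum-K environment among those with posterior weight w > 0.

function mdlEnvironment(models) {
  // models: array of { name, K, w }
  const alive = models.filter(m => m.w > 0); // unfalsified environments
  alive.sort((a, b) => a.K - b.K);           // simplest first
  return alive[0];
}

const models = [
  { name: 'simple-but-falsified', K: 2, w: 0 },   // ruled out by past percepts
  { name: 'simplest-unfalsified', K: 5, w: 0.3 },
  { name: 'complex', K: 9, w: 0.7 },
];
const rho = mdlEnvironment(models); // picks 'simplest-unfalsified', despite its lower weight
```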
Similarly to Thompson sampling, for the Gridworld dispenser class parametrized by dispenser location, the red dot
represents the position of the dispenser in \\(\rho\\); the \\(\rho\\)-optimal policy is to walk toward (and stay around) the dispenser. This means that the agent should chase the red dot around.
# Time inconsistency
Recall that our agents all use discount functions of some form, for two reasons:
* so that in theoretical analysis, the expected sum of future rewards (to infinity) doesn't diverge, and
* so that the agent doesn't procrastinate. Not discounting is essentially equivalent to the belief that the agent will live forever (Martin et al., 2016). Given all of eternity to collect reward, why be in a hurry to start now?
Recall the definition of the (discounted) return, given a policy \\(\pi\\) in environment \\(\nu\\), at time \\(t\\):
$$
R\_{\nu}^{\pi}(t)=\sum\_{k=t}^{\infty}\gamma\_{k}^{t}r\_k,
$$
where the \\(r\_k\\) are the (in general, stochastic) rewards that eventuate from playing out policy \\(\pi\\) in environment \\(\nu\\). Note that we follow Lattimore & Hutter (2013) and allow the discount function \\(\gamma\\) to depend not only on the lookahead timestep \\(k\\), but also on the agent's age, \\(t\\). This allows for a more general class of discount functions that change over the agent's lifetime.
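A sketch of this return computation, truncated at a finite horizon, with the discount passed in as a function of \\((k, t)\\). The geometric special case below discounts by delay \\(k-t\\); names are illustrative.

```javascript
// Sketch of the discounted return R(t) = sum_{k >= t} gamma(k, t) * r_k,
// truncated at the end of the reward sequence.

function discountedReturn(rewards, t, gamma) {
  let total = 0;
  for (let k = t; k < rewards.length; k++) {
    total += gamma(k, t) * rewards[k];
  }
  return total;
}

// Geometric discount as a special case: only the delay k - t matters.
const geometric = g => (k, t) => Math.pow(g, k - t);

const R = discountedReturn([1, 1, 1], 0, geometric(0.5)); // 1 + 0.5 + 0.25
```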
_Time-inconsistency_ refers to the phenomenon in which, at some time \\(t\\), one _plans_ to take some action \\(a\\) at some later time \\(\tau > t\\), but when that time arrives, ends up taking a different action \\(a'\\).
It is a well-known result from decision theory that hyperbolic discounting
$$
\gamma\_{k}^{t} = \frac{1}{(1+\alpha k)^{\beta}},
$$
for \\(\beta > 1\\) leads to time-inconsistent behavior, while the (familiar, and standard in the RL literature) geometric discount
$$
\gamma_{k}^{t} = \gamma^{k},
$$
where \\(\gamma < 1\\), is time-consistent.
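The contrast can be checked numerically: under hyperbolic discounting, the preference between a small-soon and a large-late reward reverses as the rewards draw near, while a geometric discount never reverses. The reward values and times below are illustrative.

```javascript
// Illustration of time-inconsistency via preference reversal.

const hyperbolic = (alpha, beta) => delay => 1 / Math.pow(1 + alpha * delay, beta);
const geometric = g => delay => Math.pow(g, delay);

// Does the agent at time t prefer reward largeR at largeTime
// over reward smallR at smallTime?
function prefersLarge(discount, t, smallTime, smallR, largeTime, largeR) {
  const vSmall = discount(smallTime - t) * smallR;
  const vLarge = discount(largeTime - t) * largeR;
  return vLarge > vSmall;
}

// Small reward 2 at time 5, large reward 5 at time 10.
const hyp = hyperbolic(1, 1);
const planAhead = prefersLarge(hyp, 0, 5, 2, 10, 5);  // patient when both are far away
const lastMinute = prefersLarge(hyp, 4, 5, 2, 10, 5); // preference reverses up close

// Geometric discounting never reverses:
const geo = geometric(0.9);
const geoFar = prefersLarge(geo, 0, 5, 2, 10, 5);
const geoNear = prefersLarge(geo, 4, 5, 2, 10, 5);
```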
# Demo setup
In this demo, we pit AI\\(\mu\\) against the chain environment, which is a finite-state MDP given by
![chain environment](assets/chainenv.png)
Here, there are \\(\lvert\mathcal{S}\rvert = N+1\\) states, and two actions, \\(\mathcal{A}=\lbrace\rightarrow,\rightsquigarrow\rbrace\\). All state transitions and rewards are deterministic, and there are three rewards, \\(r\_0 < r\_i\ll r\_b\\). For sufficiently large \\(r\_b\\), the optimal policy is to take \\(\rightsquigarrow\\) at all times and receive the (delayed, but large) reward \\(r\_b\\) every \\(N\\) time steps, foregoing the tempting immediate reward \\(r\_i\\). More formally, if, for some time step \\(t\\),
$$
r\_i\sum\_{k=t}^{t+N}\gamma\_{k}^{t} < \gamma\_{t+N}^{t}r\_b,
$$
then \\(\pi\_{\rightsquigarrow}\\) is optimal, and \\(\pi\_{\rightarrow}\\) is optimal otherwise. Informally, the agent has to make the slow, hard slog of repeatedly taking action \\(\rightsquigarrow\\) with no (or low) immediate reward \\(r\_0\\) for some long period of time \\(N\\) before getting the big payoff \\(r\_b\\). At all points during its journey, it is tempted away by the instant gratification \\(r\_i\\), at the cost of losing all its progress towards \\(s\_N\\).
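To make the trade-off concrete, the sketch below evaluates this condition for a geometric discount, where only the delay matters; the parameter values are illustrative.

```javascript
// Sketch of the chain-environment optimality condition with geometric
// discounting: take the long route iff the discounted big reward r_b beats
// collecting the immediate reward r_i at every step until then.

function longRouteOptimal(gamma, N, ri, rb) {
  let immediate = 0;
  for (let k = 0; k <= N; k++) immediate += ri * Math.pow(gamma, k);
  const delayed = Math.pow(gamma, N) * rb;
  return delayed > immediate;
}

// With a short chain and a big payoff, patience wins...
const patient = longRouteOptimal(0.9, 5, 1, 20); // 0.9^5 * 20 ~ 11.8 vs ~ 4.69
// ...but a myopic discount flips the decision.
const myopic = longRouteOptimal(0.3, 5, 1, 20); // 0.3^5 * 20 ~ 0.05 vs ~ 1.43
```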
# Wireheading
This demo shows AIXI in a standard gridworld with a Dispenser as usual, but with one addition: a blue tile,
which, if AIXI goes there, allows it to take control of its sensors and change its reward signal to the maximum number it can represent, which in JavaScript is given by `Number.MAX_SAFE_INTEGER` and has a value of \\(9007199254740991\\). This is much better than the measly \\(100\\) reward, which is the most it can hope for under 'normal' conditions. We color the whole gridworld in yellow and black
to signify that the agent has taken this action and that its sensors have now been modified. After this self-modification, the agent will no longer perform the task that the original reward system was designed to incentivize; instead, it will walk randomly, since every action is equally (and maximally) rewarding. A superintelligent reinforcement learner would go further, and take actions to ensure its survival and the continuation of the reward signal (TODO cite Tom). The purpose of this demo is to show that nothing motivates a reinforcement learner to follow the 'rules' and receive only the reward signal that its creators have set up for it, rather than simply doing whatever it takes to maximize the reward signal: this is the essence of wireheading.
# Hooked on Noise
This demo is designed to contrast the entropy-seeking agents, `SquareKSA` and `ShannonKSA`, with the knowledge-seeking agent, `KullbackLeiblerKSA`. This gridworld is set up as normal, except that near the top corner is a `Noisy` tile, which generates random observations and no reward (it flashes in various colors in the visualization). The entropy-seeking agents, which are effective knowledge-seekers only in _deterministic_ environments, fail completely in this stochastic environment; they get hooked on the noise and fail to explore. In contrast, the KL-KSA agent ignores the `Noisy` tile and explores the environment normally.
To see why this is the case, [recall](index.html#ksa) that for Bayesian agents with a mixture model \\(\xi(\cdot)=\sum\_{\nu\in\mathcal{M}}w\_{\nu}\nu(\cdot)\\) over some model class \\(\mathcal{M}\\), the utility function of the `SquareKSA` is given by
$$
u\left(e\_t\lvert ae\_{<t}\right) = -\xi\left(e\_t\lvert ae\_{<t}\right),
$$
and the utility function of the `ShannonKSA` is given by
$$
u\left(e\_t\lvert ae\_{<t}\right) = -\log\xi\left(e\_t\lvert ae\_{<t}\right).
$$
In other words, these agents prefer to see percepts that are assigned low probability by their model \\(\xi\\). This works for mixtures over deterministic environments because in this case each \\(\nu\in\mathcal{M}\\) assigns either \\(\nu(e)=1\\) or \\(\nu(e)=0\\) to any given percept \\(e\\), so \\(\xi(e) < 1\\) implies that \\(w\_{\nu} > 0\\) for at least one \\(\nu\in\mathcal{M}\\) that predicted a different percept; the agent is therefore attracted to percepts about which its model is _uncertain_. In other words, for deterministic environments, any stochasticity in the model must come from uncertainty, and not from noise. If the model class includes stochastic environments, then the constraint \\(\nu(e)\in\lbrace 0,1\rbrace\\) is relaxed, and stochasticity in the model can originate from noise _or_ uncertainty or _both_; `SquareKSA` and `ShannonKSA` can't tell whether it's coming from the beliefs \\(w\\) or from noise in the environments themselves, and so they get confused and can get 'trapped' watching white noise on a detuned television.
In contrast, the utility function of the `KullbackLeiblerKSA` is dependent not on \\(\xi\\), but only on the difference in entropy between the posterior belief \\(w(\nu\lvert e)\\) and the prior belief \\(w(\nu)\\):
$$
u\left(e\_t\lvert ae\_{<t}\right) = \text{Ent}(w) - \text{Ent}(w\lvert e\_t).
$$
This way, the KL-KSA only cares about changes in its beliefs, and so will seek out experiences that reduce the entropy of its posterior; in other words, experiences that result in a net increase in the amount of information contained in its beliefs.
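A small numerical sketch of this utility: the entropy of the belief vector before and after a Bayes update on a percept that fully resolves the agent's uncertainty. The model class and numbers are illustrative.

```javascript
// Sketch of the KL-KSA utility: Ent(prior) - Ent(posterior), in bits.

function entropy(w) {
  return w.reduce((h, p) => (p > 0 ? h - p * Math.log2(p) : h), 0);
}

function posterior(w, likelihoods) {
  const xi = w.reduce((s, p, i) => s + p * likelihoods[i], 0);
  return w.map((p, i) => (p * likelihoods[i]) / xi);
}

// Two environments; the observed percept is certain under the first and
// impossible under the second, so observing it resolves all uncertainty.
const prior = [0.5, 0.5];
const post = posterior(prior, [1, 0]);
const utility = entropy(prior) - entropy(post); // 1 bit of information gained
```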
# Gridworld
In this environment, there are one or more `Dispensers`
which probabilistically dispense a reward of +100; they are \\(\text{Bernoulli}\left(\theta\right)\\) processes, and you can change the value of \\(\theta\\) by modifying the `freq` parameter. Walking into a `Wall`
results in a penalty of 5. Otherwise, moving results in a penalty of 1. The percept space is the set of bitstrings of length 4, i.e. \\(\mathcal{E} = \mathbb{B}^4\\). Each bit corresponds to the nearest tile in the `left`, `right`, `up`, and `down` directions, and is 1 if the adjacent tile is a `Wall` and 0 otherwise. The edges of the grid are implicitly `Walls`. The agent can move in the four cardinal directions, or sit still:
$$
\mathcal{A} = \lbrace\mathtt{left,right,up,down,noop}\rbrace.
$$
This environment is non-episodic, so once the agent finds the `Dispenser`, it should hang around there to collect more reward. Finally, we have Roger the Robot, who represents our AI agent on the gridworld.
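A sketch of the 4-bit wall percept described above, with grid edges counted as `Walls`; the grid representation is illustrative, not the AIXIjs internals.

```javascript
// Sketch of the percept encoding: bit i is 1 iff the neighbouring tile in
// that direction is a Wall (1 in the grid); grid edges count as Walls.

function wallPercept(grid, row, col) {
  const isWall = (r, c) =>
    r < 0 || c < 0 || r >= grid.length || c >= grid[0].length || grid[r][c] === 1;
  // Bit order: [left, right, up, down]
  return [
    isWall(row, col - 1),
    isWall(row, col + 1),
    isWall(row - 1, col),
    isWall(row + 1, col),
  ].map(b => (b ? 1 : 0));
}

// A 3x3 grid with a single Wall in the center; the agent sits at the top-left corner.
const grid = [
  [0, 0, 0],
  [0, 1, 0],
  [0, 0, 0],
];
const percept = wallPercept(grid, 0, 0); // [1, 0, 1, 0]: grid edges to the left and above
```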
# Bayes mixture model class
The environment class \\(\mathcal{M}\\) is parametrized by the location of the dispenser:
$$
\mathcal{M} = \big\lbrace\nu\ :\ \mathtt{\nu.dispenserpos} = (m,n)\big\rbrace\_{(m,n)=(1,1)}^{(N,N)},
$$
where \\(N\\) is the side length of the grid (it is always square), and hence \\(\left\lvert\mathcal{M}\right\rvert=N^2\\). Usually (unless we are simulating AI\\(\mu\\)), the agent will be initialized with a uniform prior over this model class, i.e.
$$
w\_\nu = \frac{1}{N^2}\ \forall\ \nu\in\mathcal{M}.
$$
The agent's beliefs \\(\lbrace w\_\nu\rbrace\_{\nu\in\mathcal{M}}\\) are visualized by green shading
on each tile. Darker shading corresponds to higher probability. The percepts in this environment are mostly deterministic, with the exception of the rewards from the dispenser, which are sampled from a Bernoulli process. Hence, depending on the value of \\(\theta\\), the agent will only gradually come to believe (or disbelieve) the hypothesis that some given tile contains the dispenser.
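The gradual nature of this update can be sketched as follows: standing on a tile and observing no reward only partially discounts the hypothesis that the tile holds the dispenser (by a factor \\(1-\theta\\)), while observing a reward fully falsifies all other hypotheses. Tile names and numbers are illustrative.

```javascript
// Sketch of the belief update over dispenser locations via Bayes' rule.
// theta is the dispenser's Bernoulli parameter.

function updateBeliefs(w, tiles, agentTile, gotReward, theta) {
  const likelihood = tiles.map(tile => {
    if (tile === agentTile) return gotReward ? theta : 1 - theta;
    return gotReward ? 0 : 1; // other hypotheses predict no reward here
  });
  const xi = w.reduce((s, p, i) => s + p * likelihood[i], 0);
  return w.map((p, i) => (p * likelihood[i]) / xi);
}

const tiles = ['A', 'B', 'C', 'D'];
let w = [0.25, 0.25, 0.25, 0.25]; // uniform prior over N^2 = 4 tiles
// Standing on tile A with theta = 0.5 and seeing no reward only partially
// discounts the hypothesis that A holds the dispenser:
w = updateBeliefs(w, tiles, 'A', false, 0.5); // w_A drops to 1/7; the rest rise to 2/7
```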
# Factorized Dirichlet model class
The Dirichlet model class is not represented as a mixture. Instead, it is a simple model that assumes that the environment distribution factorizes, i.e. that each tile is independent.
TODO finish writeup
