The World Models paper (Ha & Schmidhuber, 2018), presented at NeurIPS 2018, explores the idea of having an agent train entirely within its latent representation of the world it is in - its world model - and then apply this learned knowledge to the real world. The paper distills many years and areas of research into a highly efficient framework for model-based reinforcement learning. It is worth taking a moment to understand, because many of the advances made since its release build on ideas developed here.
Humans use world models to plan and make decisions quickly and efficiently - something model-free algorithms have not been able to achieve to date. Of course, humans do not maintain a model at the full dimensionality of the world. To cope with its complexity, our brains form and learn abstract representations of the “temporal and spatial aspects of information”. As we have previously discussed, prediction plays a vital role - see Andy Clark’s Surfing Uncertainty.
By allowing an agent to dream and imagine within the world it inhabits, we can dramatically improve its sample efficiency and overall performance.
The key is to recognise how limited our model is, given how much information we have gathered about the environment. When building a model of the environment, we want the data we have collected to be representative of the dynamics of the system. The model can then be trained in an unsupervised manner on the latent representations formed from that data.
Let’s take a step back. What are the pieces we need to build an agent that can learn by imagination?
1. Sample random rollouts according to a random policy.
2. Extract visual information (pixels to features) from the environment.
3. Ensure the information is representative of the environment.
4. Form a latent representation.
5. Form predictions of future states.
6. Select appropriate actions.
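The data-gathering step (step 1) can be sketched in a few lines. This is a minimal illustration, not the paper's actual pipeline; `ToyEnv` is a hypothetical stand-in for a gym-style environment such as CarRacing:

```python
import numpy as np

class ToyEnv:
    """Hypothetical stand-in for a gym-style environment (e.g. CarRacing)."""
    def __init__(self, obs_dim=8, n_actions=3, horizon=50):
        self.obs_dim, self.n_actions, self.horizon = obs_dim, n_actions, horizon
    def reset(self):
        self.t = 0
        return np.random.rand(self.obs_dim)
    def step(self, action):
        self.t += 1
        obs = np.random.rand(self.obs_dim)
        done = self.t >= self.horizon
        return obs, 0.0, done

def collect_random_rollouts(env, n_rollouts=5):
    """Step 1: gather experience under a random policy. This data is later
    used to train the vision (V) and memory (M) models without supervision."""
    rollouts = []
    for _ in range(n_rollouts):
        obs, done, traj = env.reset(), False, []
        while not done:
            action = np.random.randint(env.n_actions)
            next_obs, reward, done = env.step(action)
            traj.append((obs, action, next_obs))
            obs = next_obs
        rollouts.append(traj)
    return rollouts
```

Because the policy is random, the rollouts sample the environment's dynamics broadly rather than along one behaviour, which is what makes them usable for building the world model.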
How would this work? The architecture proposed in World Models is summarised as follows.
The architecture consists of three components: the vision model (V), the memory model (M) and the controller (C). The vision and memory models form the ‘world model’. This world model can accomplish steps 1-4 above. High dimensional data is encoded into a latent representation by the vision model. Historical data along with current observations allow prediction of future states. These predictions are used to select the best actions based on expected future returns.
A Variational Autoencoder (VAE) is used as the vision model. Essentially, the VAE tries to capture the important information in the images as they change over time. For example, in the car racing environment the most important information is how the track changes from frame to frame. A latent representation would ideally capture only this information. The smaller the representation, the faster we can train the networks and form predictions. The VAE also learns a decoder, which allows the latent variable to be decoded back into an image.
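As a rough sketch of the encoding step, assuming the paper's car-racing dimensions (64x64x3 frames, a 32-dimensional latent) but substituting untrained linear maps for the real convolutional encoder and decoder:

```python
import numpy as np

rng = np.random.default_rng(0)
frame = rng.random((64, 64, 3))        # one observation frame
x = frame.reshape(-1)                  # 12288-dim flattened pixels
z_dim = 32                             # latent size used for car racing

# Untrained stand-in weights; the real VAE learns conv layers from rollouts.
W_mu = rng.normal(size=(z_dim, x.size)) * 0.01
W_logvar = rng.normal(size=(z_dim, x.size)) * 0.01
W_dec = rng.normal(size=(x.size, z_dim)) * 0.01

# Encoder outputs a distribution over z; sample via the reparameterisation trick.
mu, logvar = W_mu @ x, W_logvar @ x
eps = rng.normal(size=z_dim)
z = mu + np.exp(0.5 * logvar) * eps

# Decoder maps the 32-dim latent back to pixel space.
recon = W_dec @ z
```

The point of the sketch is the compression: downstream models operate on the 32-dimensional `z` rather than 12,288 pixels, which is what makes prediction and control fast.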
An MDN-RNN (a recurrent network with a mixture density output head) is used as the memory model. It uses past events to output a distribution over future outcomes, conditioned on what has been and is currently being observed by the vision model. That is, given one observation, the memory model outputs several predictions, each with an associated probability of occurring. This is nicely displayed in the graphic below. Interestingly, a reproduction of this experiment found that training the MDN-RNN for the car racing environment does not improve performance, yet removing the MDN-RNN entirely does degrade performance. The investigators interpret this as evidence that the MDN-RNN's recurrent state carries crucial temporal information that is absent from individual frames.
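A minimal sketch of what sampling from a mixture density head looks like, simplified to a single latent dimension (the function name and this simplification are mine, not the paper's):

```python
import numpy as np

def mdn_sample(logit_pi, mu, log_sigma, rng):
    """Sample one prediction from a Gaussian mixture head.
    logit_pi, mu, log_sigma: arrays of shape (K,) for K mixture components.
    In the full model, the RNN emits these parameters at every timestep,
    one set per latent dimension."""
    pi = np.exp(logit_pi - logit_pi.max())
    pi /= pi.sum()                       # mixture weights via softmax
    k = rng.choice(len(pi), p=pi)        # pick a component by its probability...
    return rng.normal(mu[k], np.exp(log_sigma[k]))  # ...then sample from it
```

Each mixture component is one of the "several predictions" described above; the softmax weights are the associated probabilities.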
Learning by imagination
Conventionally, one would form predictions to select the action with the highest expected return, execute it, and use the environment's response to update the model. This is done in the first half of the paper, where state-of-the-art (at the time of release) performance is achieved. Naturally, one wonders whether we could instead continue training our models on the predicted latent variables alone. This is exactly what the authors do in the VizDoom environment. The memory model is amended to also predict whether the agent dies in the next frame, to ensure the world model reproduces the same dynamics the VizDoom environment would return.
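The resulting training loop never touches the real environment. A minimal sketch, where `step_fn` is a hypothetical stand-in for the MDN-RNN (amended, as in the VizDoom experiment, to also predict death) and `controller` for the controller C:

```python
def dream_rollout(step_fn, controller, z0, h0, max_steps=100):
    """Roll out entirely inside the world model - no real environment calls.
    step_fn(z, a, h) -> (z_next, h_next, done) plays the role of the MDN-RNN,
    including its predicted 'done' (death) signal."""
    z, h, total = z0, h0, 0
    for _ in range(max_steps):
        a = controller(z, h)          # controller acts on latent state + memory
        z, h, done = step_fn(z, a, h)  # world model predicts the next state
        total += 1
        if done:                      # the dream itself decides when we die
            break
    return total
```

The survival time returned here is exactly the kind of imagined score the controller is optimised against.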
Thus by training entirely using the predictions formed by the world model, we can train our agent to be successful within the imagined world.
The aim, of course, is not to simply be successful within the world model, but to be successful in the real world. Since the virtual environment has an identical interface to the real environment, the policy learned by imagination can be easily transferred and applied in the real world. This is a major insight in this paper.
The authors note that the agent can learn to exploit imperfections in the world model. Adding a temperature parameter \( \tau \) during sampling of the predicted latent variable increases the difficulty of surviving in the imagined environment, and for suitable choices of \( \tau \) actually improves performance after transfer. The intuition is that a noisier dream makes it harder for the agent to exploit these imperfections.
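One common way to apply temperature to a mixture density head, sketched below, is to flatten the mixture weights and widen the sampling noise as \( \tau \) grows; as I understand it, the authors' released code scales the logits by \( 1/\tau \) and the noise by \( \sqrt{\tau} \), but treat the exact scheme here as an assumption:

```python
import numpy as np

def mdn_sample_with_temperature(logit_pi, mu, log_sigma, tau, rng):
    """Temperature-scaled sampling from a mixture density head.
    Higher tau flattens the mixture weights and widens each Gaussian,
    making the imagined environment noisier and harder to exploit."""
    scaled = logit_pi / tau
    pi = np.exp(scaled - scaled.max())
    pi /= pi.sum()                       # flatter softmax as tau grows
    k = rng.choice(len(pi), p=pi)
    noise = rng.normal() * np.sqrt(tau)  # wider noise as tau grows
    return mu[k] + noise * np.exp(log_sigma[k])
```

At \( \tau = 1 \) this reduces to ordinary mixture sampling; at small \( \tau \) the dream becomes nearly deterministic and easy to exploit.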
This paper is really quite remarkable in its deceptive simplicity. The individual components of the models presented can and have been improved since its publication, but the idea of training by imagination, using latent representations to improve training performance, was a crucial insight that is used in algorithms like SimPLe and MuZero.
Check out David Ha’s presentation at NeurIPS 2018 below.
Finally, their interactive online presentation of their work is outstanding. Go play around!