Visualizing MuZero Models

Joery de Vries & Ken Voskuil


This blog post briefly discusses the main idea behind our recent project [de Vries et al., 2021], in which we created visualizations of MuZero, an interesting state-of-the-art algorithm by Google DeepMind. To give some background and intuition on MuZero, we illustrate its differences with AlphaZero (which has recently received quite some attention in the media) using animations and figures.

The AlphaZero algorithm presented a breakthrough in deep reinforcement learning, being the first algorithm to master a variety of board games without human supervision [Silver et al., 2018]. Although AlphaZero can seem quite complicated, the core of the algorithm is the combination of learning and planning.

To decide which action to take in the current environment state, AlphaZero evaluates its available actions by simulating them within an internal model. The actions it simulates are chosen by a planning algorithm, in this case a variant of Monte Carlo Tree Search (MCTS). AlphaZero improves upon standard MCTS by learning a value function over environment states and a probability distribution over the actions it can simulate. The learned value function is used to predict the value of simulated states, whereas the learned probability distribution is used to focus planning and simulation on actions that are predicted to be beneficial.
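
For intuition, the sketch below shows one common way these two learned quantities enter the tree search, namely a PUCT-style selection rule. The `node` and `child` attributes (`children`, `visit_count`, `value_sum`, `prior`) are hypothetical names used only for this illustration and are not taken from any particular implementation.

```python
import math

def puct_select(node, c_puct=1.25):
    """Select the child action maximizing the PUCT score, which combines the
    learned value estimates (exploitation) with the learned policy prior
    (exploration bonus)."""
    total_visits = sum(child.visit_count for child in node.children.values())
    best_action, best_score = None, -math.inf
    for action, child in node.children.items():
        # Mean value of the simulations that passed through this child.
        q_value = child.value_sum / child.visit_count if child.visit_count else 0.0
        # Exploration term weighted by the policy prior predicted by the network.
        u_value = c_puct * child.prior * math.sqrt(total_visits) / (1 + child.visit_count)
        score = q_value + u_value
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```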

Summarized in one sentence: AlphaZero plans over actions within a simulator, where the planning gradually becomes more efficient as the agent learns action preferences and accurate predictions of environment outcomes.


Intro

Naturally, a practical limitation of AlphaZero is that a simulator must be available for planning. MuZero is a successor algorithm that tackles this issue by jointly learning an AlphaZero agent along with an environment model [Schrittwieser et al., 2020]. This model takes the form of a Recurrent Neural Network (RNN) that is trained on sequences of rewards and the AlphaZero targets (i.e., the value and policy computed by the AlphaZero-MCTS procedure). The animation below neatly illustrates the difference between the two types of action simulation: AlphaZero performs actions within an actual game of Tic-Tac-Toe, whereas MuZero performs actions within a neural network.



Visualizing a (learned) MuZero model

Although MuZero effectively addresses a key practical limitation of AlphaZero, the additional neural networks also make the decision-making process much more esoteric. To make more sense of what MuZero is 'simulating', we created illustrations of its neural network activations and outputs for particular observations and inputs. Note that MuZero defines two networks for action simulation in addition to those of AlphaZero. The first network is a function over the observations of the environment, called the encoder. The second network is called the dynamics function, and can loosely be interpreted as a gradient field over the encoder's domain, conditioned on the environment actions.
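
As a rough sketch of what 'performing actions within a neural network' amounts to, the code below wires up the encoder, the dynamics function, and a prediction head as small PyTorch modules. The layer sizes, method names, and the one-hot action encoding are illustrative assumptions rather than the architecture used by DeepMind.

```python
import torch
import torch.nn as nn

class MuZeroNets(nn.Module):
    """Minimal sketch of MuZero's networks; sizes and layers are illustrative."""

    def __init__(self, obs_dim, num_actions, latent_dim=32):
        super().__init__()
        # Encoder h: observation -> latent (hidden) RNN state.
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                     nn.Linear(64, latent_dim))
        # Dynamics g: (latent state, action) -> next latent state and reward.
        self.dynamics = nn.Sequential(nn.Linear(latent_dim + num_actions, 64), nn.ReLU(),
                                      nn.Linear(64, latent_dim + 1))
        # Prediction f: latent state -> policy logits and value.
        self.prediction = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                        nn.Linear(64, num_actions + 1))

    def initial_inference(self, obs):
        # Encode a real observation and predict policy/value for the root.
        state = self.encoder(obs)
        policy_value = self.prediction(state)
        return state, policy_value[..., :-1], policy_value[..., -1]

    def recurrent_inference(self, state, action_onehot):
        # Simulate one action purely inside the latent space.
        out = self.dynamics(torch.cat([state, action_onehot], dim=-1))
        next_state, reward = out[..., :-1], out[..., -1]
        policy_value = self.prediction(next_state)
        return next_state, reward, policy_value[..., :-1], policy_value[..., -1]
```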

The encoder defines a mapping from the environment's observations to the hidden state of the RNN; this allows the dynamics network to use these values for simulation through unrolling. An interesting observation during our experiments was that the encoder seems to warp observations such that input coordinates are separated based on their dynamic distance within the RNN space. This means that even if a coordinate x is adjacent to a coordinate y in observation space, the encoder can warp these coordinates far apart if the difference in their values is considerably large.
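
To make the notion of dynamic distance slightly more concrete, the toy snippet below compares the Euclidean distance between two nearby MountainCar observations before and after encoding; the `encoder` here is merely a placeholder standing in for the learned network from the sketch above.

```python
import numpy as np

# Placeholder for the learned encoder (observation -> latent state);
# swap in the trained network to reproduce the warping described above.
encoder = lambda obs: np.tanh(np.tile(obs, 16))

# Two MountainCar observations that are adjacent in (position, velocity) space.
x = np.array([-0.50, 0.030])
y = np.array([-0.50, 0.028])

print(np.linalg.norm(x - y))                    # small distance between observations
print(np.linalg.norm(encoder(x) - encoder(y)))  # latent distance; large for a trained
                                                # encoder when x and y are dynamically distant
```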

The illustration below shows this concept on OpenAI Gym's MountainCar. In this environment, a cart has to build up momentum to move from the valley to the top of the right hill. Most importantly for us, all possible MountainCar observations can be laid out on a 2D grid. This is shown in the left figure: the x-axis plots the cart's position and the y-axis plots the cart's velocity, and the small red circle highlights the input coordinate rendered in the right figure. The colour contour is a learned function that roughly translates to the number of steps MuZero needs to reach the top of the right hill: the required number of steps is large for the coordinates coloured blue and gradually becomes smaller moving towards the yellow contour.
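
For readers who want to reproduce such a contour, the sketch below enumerates the full 2D observation grid and evaluates a value estimate at every coordinate; `value_fn` is a placeholder standing in for MuZero's encoder followed by its value prediction.

```python
import numpy as np

# Observation bounds of Gym's MountainCar: position in [-1.2, 0.6],
# velocity in [-0.07, 0.07].
positions = np.linspace(-1.2, 0.6, 200)
velocities = np.linspace(-0.07, 0.07, 200)

# Enumerate every (position, velocity) coordinate of the 2D grid.
grid = np.stack(np.meshgrid(positions, velocities), axis=-1).reshape(-1, 2)

# Placeholder for MuZero's value prediction on the encoded observation;
# replace with encoder + value head to reproduce the learned contour.
value_fn = lambda obs: -np.linalg.norm(obs)

values = np.array([value_fn(obs) for obs in grid]).reshape(len(velocities), len(positions))
```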



What is important to note here is that the blue and green-yellow contours are strongly separated. The trajectory shown in the contour figure depicts a MuZero agent (successfully) attempting to climb the hill. As we can see, it follows a sharp edge in the colour contour. The coordinates along this edge are adjacent in terms of their observations, but they are dynamically distant: if the agent were to select a bad action, the cart could lose its momentum and reset its progress, jumping from the green contour back to the blue contour.

In this example, we mapped the learned dynamic distance back to the 2D MountainCar grid. We visualized the encoder's geometry by computing the embeddings of all possible observations and applying a linear dimensionality reduction technique. This approach is also occasionally seen in conjunction with kernel methods [Szymanski et al., 2011]. In the animation below we portray the embedded geometry of the MountainCar's observation space for the first three Principal Components (PCs); this includes the colour contour of the previous figure and accounts for approximately 90% of the variance within the actual embedding. PC projections are linear, which allows us to preserve both global and local semantics within the visualization. This in turn allows us to infer spatial relations more easily, e.g., as argued earlier, we can see that coordinates with high and low values are warped far apart in Euclidean space.
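
A minimal sketch of this projection step is given below, assuming the observation embeddings have already been collected into a single array; scikit-learn's PCA is used here, though any linear dimensionality reduction would do.

```python
import numpy as np
from sklearn.decomposition import PCA

# `embeddings` stands in for the encoder output of every grid observation,
# shape (num_observations, latent_dim); random data is used here as a placeholder.
embeddings = np.random.randn(40000, 32)

# Linear projection onto the first three principal components.
pca = PCA(n_components=3)
projected = pca.fit_transform(embeddings)   # shape (num_observations, 3)

# Fraction of variance captured by the three components
# (approximately 90% for the actual MuZero embedding, as noted above).
print(pca.explained_variance_ratio_.sum())
```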

Simulating actions inside an RNN

The dynamics network of MuZero is even more difficult to interpret than the encoder. The above animation shows that the encoder warps the observation space into a dynamically separated space (at least in this specific scenario). However, this visualization is unfaithful to the actual space in which MuZero operates, which is of course the full dimensionality of the RNN state. Canonically, MuZero is trained without any constraints on the state transitions. As a result, when MuZero simulates actions within the RNN, it does not need to adhere to the geometry stipulated by the encoder. This is visualized in the animation below: in this example, we recursively simulated a sequence of actions both within the RNN and within the actual environment, and projected the encoded observations (green) and the simulated states (white) onto the Principal Components of the embedding.
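
The snippet below sketches the two trajectories being compared, using placeholder functions in place of the trained networks and the fitted PCA: the real observations are each passed through the encoder, while the simulated states are produced by unrolling the dynamics function on its own output.

```python
import numpy as np

# Placeholders; in practice these are MuZero's encoder, its dynamics function
# (here conditioned on a scalar action id), and `pca.transform` from earlier.
encode = lambda obs: np.tanh(np.tile(obs, 16))
dynamics = lambda state, action: np.tanh(state + 0.1 * (action - 1))
project = lambda state: state[:3]

actions = [2, 2, 0, 0, 2, 2]   # example MountainCar action sequence
observations = [np.array([-0.5 + 0.01 * t, 0.0]) for t in range(len(actions) + 1)]  # placeholder real trajectory

# Green trajectory: encode every real observation and project it.
encoded_path = [project(encode(obs)) for obs in observations]

# White trajectory: encode only the first observation, then unroll the dynamics
# function recursively on its own output and project each simulated state.
state = encode(observations[0])
simulated_path = [project(state)]
for action in actions:
    state = dynamics(state, action)
    simulated_path.append(project(state))
```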



In the above example, the trajectory mapped out by the dynamics function follows entirely different semantics than the exact same trajectory mapped from the environment to the embedding. The trajectory of environment observations is of course embedded within the geometry defined by the encoder, whereas the dynamics-function trajectory departs from the embedding and follows its own path. Despite this, the value along the simulated trajectory is similar to that of the embedded trajectory, as the value function is interpolated throughout the whole RNN space, though we cannot see this here.

Because the dynamics function follows an entirely separate path from the encoder, the agent's simulations are highly obscure to a spectator. In this respect, MuZero is even more esoteric than AlphaZero: its contrived semantics seem comprehensible only to the algorithm itself. In our paper we investigated the effect of self-consistency penalties, such as a contrastive distance-based loss between the encoder and the dynamics function. Ultimately, we found that this can force the dynamics function to stay much closer to the encoder, which in turn makes the action simulation slightly more comprehensible.
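
As an indication of what such a penalty can look like, the sketch below implements a simplified, non-contrastive distance term that pulls the dynamics prediction towards the encoder's embedding of the observed next state; the exact loss used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def consistency_penalty(encoded_next_state, predicted_next_state):
    """Simplified distance-based self-consistency term: pull the latent state
    predicted by the dynamics function towards the encoder's embedding of the
    observed next state. A contrastive variant would additionally push apart
    non-matching state pairs."""
    # Stop gradients through the encoder target so the penalty mainly
    # adjusts the dynamics function (an illustrative design choice).
    target = encoded_next_state.detach()
    return F.mse_loss(predicted_next_state, target)
```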


Conclusion

All in all, MuZero is a very interesting algorithm that enables AlphaZero-style action simulation within an end-to-end learnable RNN space. However, as these algorithms become increasingly complex and add more layers of abstraction, visualizations like ours may aid in comprehending the functions learned by such agents. For more visualizations of MuZero, definitely check out our visualization tool here (data files for the visualizations can be found in our GitHub repository), and for more details regarding our project, along with visualizations of the contrastively trained agents, see our paper [de Vries et al., 2021].


Other Useful References

  • Tutorials on AlphaZero: