DON’T GUESS YOUR NEXT MOVE. PLAN IT!

Deep RL: A Model-Based Approach (Part 4)

Our quest to make Reinforcement Learning 200 times more efficient

Enrico Busto - EN · Published in The Startup · 6 min read · Nov 30, 2020


Image from: Hello I’m Nik on Unsplash

In this series of articles, we explained why sample inefficiency is a critical limitation of Deep Reinforcement Learning and why a model-based approach can help solve it. We then presented a state-of-the-art algorithm called PlaNet and used it to test the hypothesis. Today we present the results we obtained.

Experiments

To maintain consistency with the original PlaNet paper, we ran all the experiments on the DeepMind Control Suite. This suite provides continuous control tasks built for benchmarking reinforcement learning agents. We chose four of them: Cartpole, Cheetah, Walker, and Reacher.

Visual representation of the 4 chosen environments in the same order, from top right: Cartpole, Cheetah, Walker, and Reacher.
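For reference, here is a minimal sketch of how such tasks can be loaded with the dm_control package. The specific task variants below (swingup, run, walk, easy) are our assumption, matching the standard PlaNet benchmark, and may differ from the exact setup used here.

```python
# A minimal sketch, assuming the standard domain/task pairings from the
# DeepMind Control Suite; the exact variants are an assumption on our part.
from dm_control import suite

TASKS = [
    ("cartpole", "swingup"),
    ("cheetah", "run"),
    ("walker", "walk"),
    ("reacher", "easy"),
]

for domain, task in TASKS:
    env = suite.load(domain_name=domain, task_name=task)
    spec = env.action_spec()
    print(f"{domain}-{task}: action dimension = {spec.shape[0]}")
```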

Reward prediction model

The reward model is a critical component of the agent because the policy depends on it. For this reason, we created a plot comparing the predicted reward and the real one over an entire episode. We can clearly see that the model correctly approximates the real reward function.

Reward vs. Predicted Reward.
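As an illustration, a minimal plotting sketch, assuming the per-step real and predicted rewards of one episode are already available as arrays (the array and function names below are hypothetical):

```python
# A minimal sketch: overlay the real reward and the reward predicted by the
# learned reward model, one value per step of the episode.
import numpy as np
import matplotlib.pyplot as plt

def plot_reward_comparison(real_rewards, predicted_rewards):
    steps = np.arange(len(real_rewards))
    plt.figure(figsize=(8, 4))
    plt.plot(steps, real_rewards, label="Real reward")
    plt.plot(steps, predicted_rewards, label="Predicted reward", linestyle="--")
    plt.xlabel("Episode step")
    plt.ylabel("Reward")
    plt.title("Reward vs. Predicted Reward")
    plt.legend()
    plt.show()
```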

Next, we tested the model’s ability to reconstruct the observation.

The model works so well that it is difficult to see the difference between the two images. Thus, we produced another plot in which we compared the real and predicted observations quantitatively.

In the next plot, the y-axis shows the MSE between the real observation and the predicted one. We extended the planning horizon to 19 steps to show how the errors accumulate as the predictions unroll.

MSE between the real observations and the predicted ones
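A minimal sketch of the computation behind this plot, assuming the real and predicted observation sequences are already available as arrays (the function name is hypothetical):

```python
# A minimal sketch: one MSE value per open-loop prediction step.
import numpy as np

def mse_over_horizon(real_obs, predicted_obs):
    """real_obs, predicted_obs: arrays of shape (horizon, H, W, C) in [0, 1]."""
    errors = []
    for t in range(len(real_obs)):
        errors.append(np.mean((real_obs[t] - predicted_obs[t]) ** 2))
    return np.array(errors)

# Errors typically grow with the step index: small model inaccuracies
# compound because each predicted frame builds on the previous predictions.
```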

The heatmap below highlights the areas where the model makes the most errors (brighter areas indicate higher errors). As we could expect, the heatmap shows that the model makes more errors around the cheetah's hind and front legs and its head.

Heatmap shows where the model computes the most errors.
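A minimal sketch of how such a heatmap can be computed, assuming a batch of real and predicted frames is available (the function name is hypothetical):

```python
# A minimal sketch: average the squared per-pixel error over many
# (real, predicted) frame pairs, then display the resulting (H, W) map.
import numpy as np
import matplotlib.pyplot as plt

def error_heatmap(real_frames, predicted_frames):
    """Both inputs: arrays of shape (N, H, W, C) with values in [0, 1]."""
    per_pixel_error = (real_frames - predicted_frames) ** 2
    heatmap = per_pixel_error.mean(axis=(0, -1))  # average over frames and channels
    plt.imshow(heatmap, cmap="hot")
    plt.colorbar(label="Mean squared error")
    plt.title("Where the model makes the most errors")
    plt.show()
    return heatmap
```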

Finally, we can see how the trained agent performs. For each agent video, the left side shows the environment observation at full resolution (the agent receives a resized version), while the right side shows the corresponding images inside the agent's mind (its predictions).

Left: Environment observation. Right: Agent's mind predictions (blurred because the agent's model works at low resolution).

Model-Based vs. Model-Free

We compared our results with the ones obtained with two model-free baselines: Soft Actor-Critic (SAC) and D4PG.

We show the cumulative reward collected after 500k steps for PlaNet and SAC, and after 100 million steps for D4PG.

We can see that PlaNet outperforms SAC in all four environments and achieves results comparable to D4PG while using 200 times fewer steps.

Red frame shows the algorithm trained over 100M steps.

Next, we compared the same PlaNet results with other model-free baselines trained directly on the full state (NOT from pixels). In this case as well, the red frame indicates the algorithms trained over 100 million steps (A3C and D4PG).
Comparing this result with the previous one from SAC and D4PG, we can see how much harder training from images is compared to training where the agent has full knowledge of the environment's state. Despite this difference, PlaNet still delivers competitive results.

Red frame shows the algorithms trained over 100M steps.

State-of-the-art

We also provide a comparison with the current state of the art. We compared our PlaNet implementation against the original one, against the newer version published by the same authors, called Dreamer, and against another method called Contrastive Unsupervised Representations for Reinforcement Learning (CURL). Even though CURL is not a model-based method, we included it as the source of the comparison data.

All the algorithms are trained over 500k steps. We can see how our results surpass all the others in all four environments, with only one exception where the CURL method has an advantage.

Repeating the experiment after 1 million steps, we see that all the methods converge to similar results, except for the Cheetah environment, where Dreamer keeps learning beyond the point where the other methods converge.

You can find our Open Source implementation here: PlanPix

Bonus

We tried to push PlaNet's limits further. Observing the plot of the rewards obtained during training, we can see that the agent obtains poor results in the initial episodes. That's because it has not yet collected enough experience, and its reward model tends to overestimate the predicted reward. Because of these misleading predictions, the planner reaches sub-optimal solutions. The problem disappears once the agent collects more experience.

A comparison for the first 100 training episodes

Since we want the best possible agent from the very first episodes, we introduced a regularization technique. The basic idea is to penalize the agent when it makes unrealistic predictions, encouraging predictions that are more similar to the observations collected so far. We trained a Denoising Auto Encoder (DAE) on the collected experience, and then used this newly trained model to measure how far a prediction deviates from the real observations. Plotting the results obtained with this new component, we can see that it successfully fixes the problem.
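The following is a minimal PyTorch sketch of this idea, not our exact implementation: the DAE architecture, observation shapes, and penalty weight are illustrative assumptions only.

```python
# A minimal sketch, assuming flattened observations in [0, 1]: a small
# denoising autoencoder is trained on real observations, and at planning time
# its reconstruction error on a predicted observation is used as a penalty
# subtracted from the predicted reward. All sizes and weights are illustrative.
import torch
import torch.nn as nn

class DenoisingAutoEncoder(nn.Module):
    def __init__(self, obs_dim=64 * 64 * 3, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(hidden, obs_dim), nn.Sigmoid())

    def forward(self, x, noise_std=0.1):
        noisy = x + noise_std * torch.randn_like(x)  # corrupt the input
        return self.decoder(self.encoder(noisy))

def regularized_reward(predicted_reward, predicted_obs, dae, penalty_weight=1.0):
    """Penalize predicted observations the DAE cannot reconstruct well,
    i.e. observations that look unlike the experience collected so far."""
    with torch.no_grad():
        recon = dae(predicted_obs, noise_std=0.0)
    deviation = ((recon - predicted_obs) ** 2).mean(dim=-1)
    return predicted_reward - penalty_weight * deviation
```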

Finally, in the next plot, we can see how the agent that uses the regularizer outperforms the one that doesn't. This new agent reaches a cumulative reward of 400 within the first 100 episodes.

The regularizer has provided promising results, in line with what appears in the literature, but it needs further experiments to be validated.

Conclusions

In conclusion, our experiments suggest that a model-based approach is crucial for sample efficiency, even when the environment can only be observed through camera images.

We showed that with a model-based approach we can achieve better results with fewer samples/interactions than model-free methods.

Even though we used a publicly available model for our experiments, our implementation produces better results than the original one.

We also proposed an improvement that increases the model's performance when only a little experience is available.

This article was written in collaboration with Luca Sorrentino.


Founding Partner and CTO @ Addfor S.p.A. We develop Artificial Intelligence Solutions.