DON’T GUESS YOUR NEXT MOVE. PLAN IT!

Deep RL: a Model-Based approach (part 3)

Enrico Busto - EN · Published in Analytics Vidhya · 3 min read · Nov 30, 2020


The Deep Planning Network (PlaNet)

Image from: Jason Leung on Unsplash

In the previous article Deep RL: a Model-Based approach (part 2), we saw how Deep Reinforcement Learning (DRL) works and how the model-based approach can improve sample efficiency. In this article, we present a specific model-based algorithm called PlaNet.

Learning a model allows the robot to plan before acting. Source: BAIR

Learning from observation

In the vast majority of cases, we use a simulator to create the environment used to train an agent with reinforcement learning. At each time step, the simulator collects all the necessary information to produce a new state and sends it to the agent.

In real-world scenarios, this information is not always available. For example, a robot that moves a teacup does not know the cup’s exact coordinates with respect to the table; often it can only rely on a camera to capture images. For this reason, we must develop an algorithm that allows the agent to solve the problem while building its own internal representation. In other words, the agent must autonomously identify, collect, and maintain all the necessary information. This makes the training problem considerably more complicated.

PlaNet: a Deep Planning Network

In 2019, Hafner et al. released the Deep Planning Network (PlaNet), a model-based algorithm capable of learning the environment model directly from image inputs alone and using it for planning. Let’s briefly explain how it works.

Notice: To better understand this method, you should know how an Autoencoder and a Gated Recurrent Unit (GRU) network work.

The Autoencoder compresses the input into a latent vector and then reconstructs it. It is trained to minimize the Mean Squared Error (MSE) between the original input and the reconstruction. This way, we ensure that the latent vector retains the principal information of the input needed for reconstruction.

The Autoencoder compresses the input into a latent vector. Source: Dr. Lux’s Laboratory.
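To make this concrete, here is a minimal convolutional autoencoder sketch in PyTorch. It is purely illustrative: the image size, channel counts, and latent dimension are assumptions and do not correspond to PlaNet’s exact architecture.

```python
# Minimal autoencoder sketch (illustrative; sizes are assumptions, not PlaNet's values).
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, latent_dim=32):
        super().__init__()
        # Encoder: compress a 64x64 RGB image into a latent vector.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),    # 64 -> 31
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),   # 31 -> 14
            nn.Conv2d(64, 128, 4, stride=2), nn.ReLU(),  # 14 -> 6
            nn.Flatten(),
            nn.Linear(128 * 6 * 6, latent_dim),
        )
        # Decoder: reconstruct the image from the latent vector.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128 * 6 * 6),
            nn.Unflatten(1, (128, 6, 6)),
            nn.ConvTranspose2d(128, 64, 4, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

model = Autoencoder()
images = torch.rand(8, 3, 64, 64)             # dummy batch of image observations
recon, latent = model(images)
loss = nn.functional.mse_loss(recon, images)  # reconstruction (MSE) objective
```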

The GRU network, in turn, processes the sequence of latent vectors together with the list of performed actions. In this way, it extracts temporal information such as the velocity and direction of objects.
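As a rough illustration (not PlaNet’s actual recurrent state-space model), a GRU can consume the latent vectors concatenated with the actions taken and summarise the history into a hidden state; all dimensions below are assumptions.

```python
# Sketch: a GRU over latent vectors and actions extracts temporal features.
import torch
import torch.nn as nn

latent_dim, action_dim, hidden_dim = 32, 4, 200
gru = nn.GRU(input_size=latent_dim + action_dim, hidden_size=hidden_dim, batch_first=True)

latents = torch.randn(8, 50, latent_dim)   # batch of 50-step latent sequences
actions = torch.randn(8, 50, action_dim)   # actions executed at each step
inputs = torch.cat([latents, actions], dim=-1)

features, last_hidden = gru(inputs)        # features[:, t] summarises steps 0..t
```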

The agent encodes all the visual and temporal information into a compact sequence of latent vectors, building its own abstract representation. Instead of predicting the next image directly, it learns to predict the next latent vector and to reconstruct the corresponding observation from it. PlaNet also learns the reward function.

A general overview of PlaNet architecture. Source: Dr. Lux’s Laboratory.
The reward and the observation model; the latter reconstructs the observation from the state. Source: Dr. Lux’s Laboratory.
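A highly simplified, deterministic sketch of this idea follows (PlaNet’s real model, the Recurrent State-Space Model, also has a stochastic latent component): small heads on top of the recurrent feature predict the next latent vector and the reward, while the autoencoder’s decoder plays the role of the observation model. Sizes are assumptions.

```python
# Sketch: prediction heads on top of the recurrent feature (simplified, deterministic).
import torch
import torch.nn as nn

latent_dim, hidden_dim = 32, 200

next_latent_head = nn.Linear(hidden_dim, latent_dim)  # transition model: next latent vector
reward_head = nn.Linear(hidden_dim, 1)                # learned reward model
# The observation model would be the autoencoder's decoder,
# reconstructing the image from the predicted latent vector.

feature_t = torch.randn(8, hidden_dim)                # GRU output at step t
predicted_next_latent = next_latent_head(feature_t)
predicted_reward = reward_head(feature_t)
```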

Working with latent vectors also makes the planning process faster. The planner is based on a population-based optimization algorithm called the Cross-Entropy Method (CEM). For each generation, we sample a population of action sequences from a multivariate Gaussian distribution.

Then we use the learned reward model to estimate, for each action sequence, the amount of reward we would obtain by executing it. Once all the action sequences have been evaluated, we pick the best-scoring subgroup (the top candidates).

A visual representation of the CEM algorithm. Source: Dr. Lux’s Laboratory.

We use this subgroup to update the Gaussian parameters and generate the next population.
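Below is a minimal NumPy sketch of this planning loop. The reward_model argument is a hypothetical stand-in for scoring an action sequence with the learned model; the horizon, population size, number of top candidates, and iteration count are assumptions.

```python
# Sketch of the Cross-Entropy Method planner (parameters are illustrative).
import numpy as np

def cem_plan(reward_model, horizon=12, action_dim=4,
             population=1000, top_k=100, iterations=10):
    # Start from a broad Gaussian over action sequences.
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))

    for _ in range(iterations):
        # Sample a population of candidate action sequences.
        candidates = mean + std * np.random.randn(population, horizon, action_dim)
        # Score each sequence with the learned reward model.
        scores = np.array([reward_model(c) for c in candidates])
        # Keep the best sequences (the "top candidates").
        elite = candidates[np.argsort(scores)[-top_k:]]
        # Refit the Gaussian to the elite set and repeat.
        mean, std = elite.mean(axis=0), elite.std(axis=0)

    return mean[0]  # execute only the first action, then replan

# Dummy reward model just to show the call signature.
first_action = cem_plan(lambda seq: -np.square(seq).sum())
```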

In the next article, we will present some experiments made with PlaNet, in which we analyze its actual prediction ability and compare the results with model-free baselines.

This article was written in collaboration with Luca Sorrentino.
