DON’T GUESS YOUR NEXT MOVE. PLAN IT!

Deep RL: a Model-Based approach (part 2)

Enrico Busto - EN · Published in Analytics Vidhya · Nov 30, 2020 · 5 min read

Model-based Deep Reinforcement Learning explained

Image from: Jason Leung on Unsplash

In the previous article, Deep RL: a Model-Based approach (part 1), we saw how Deep Reinforcement Learning (DRL) can be very effective and yet very sample-inefficient. Now we examine how it works and why a model-based approach can drastically improve sample efficiency.

Reinforcement learning

In reinforcement learning, an agent acts in an unknown environment to reach an unknown goal.

Time is discretized into time steps. At each of them, the agent receives information about the environment and takes an action. It then receives a feedback signal called the reward. The reward is positive when the action brings the agent closer to its goal and negative in the opposite case. The agent is built to maximize the cumulative reward. The learning process is therefore based on the idea that the agent will make less likely the situations (the actions computed in a particular state) where it received a negative reward, and more likely those where it received a positive one. More formally, we can say:

Reward Hypothesis: All goals can be described by the maximization of expected cumulative reward.
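
In symbols, using the standard discounted formulation (the discount factor γ is not spelled out in the hypothesis above, but it is almost always used in practice), the quantity being maximized is the expected return:

G_t = R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + … = Σ_{k=0}^{∞} γᵏ·R_{t+k+1},  with 0 ≤ γ ≤ 1.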

Space Invaders

Reinforcement Learning schema
  • At each time step, the game computes its current state and sends it to the agent. This state can be a vector with all the principal information, like the number of enemies, their speed, the player’s position, etc. If we have no direct access to all this information, we can give the agent partial information, like a game frame.
  • The agent receives this information and uses it to pick an action. Every agent has an internal policy function that maps any state to the action that maximizes the expected cumulative reward (the expected sum of all the rewards during the game). In this case, there is a discrete set of possible actions from which the policy can choose: move left, move right, fire. In other cases, the action space is continuous, which means that real-valued vectors describe the actions.
  • The environment receives the agent’s action and sends it to its internal Model, which produces the reward signal and calculates the new state.
  • Optionally, the agent can store all the interactions in an Experience Replay Buffer (a minimal code sketch of this loop follows below).
Experience Replay Buffer, image kindly provided by Dr. Lux’s Laboratory
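
Putting the loop together, here is a minimal sketch of the interaction described above. It assumes the classic Gym API and uses a random action as a stand-in for the agent’s policy; the environment name and buffer size are arbitrary choices for illustration.

```python
from collections import deque

import gym  # classic Gym API assumed: reset() -> state, step() -> (state, reward, done, info)

env = gym.make("CartPole-v1")          # stand-in environment; SpaceInvaders-v0 would need the Atari extras
replay_buffer = deque(maxlen=100_000)  # the optional Experience Replay Buffer

state = env.reset()
for t in range(1_000):                                 # time is discretized into time steps
    action = env.action_space.sample()                 # placeholder for the agent's policy: state -> action
    next_state, reward, done, info = env.step(action)  # the environment's Model returns the reward and the new state
    replay_buffer.append((state, action, reward, next_state, done))  # store the interaction
    state = env.reset() if done else next_state
```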

The Model

The environment model is composed of a Reward Function and a Transition Function.

The Reward Function is the only form of supervision the agent receives, and it depends on both the state and the action.

The Transition Function can be viewed as “the rules of the game,” since it establishes how the environment can change as a result of the agent’s action. Since Reinforcement Learning is used in stochastic environments, the same pair of state and action does not always produce the same next state. The transition function therefore produces a distribution that indicates the probability of every possible next state.
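
As a rough illustration, a learned environment model could look like the sketch below: one head approximates the Reward Function and another outputs a distribution over next states. A Gaussian transition model is one common choice; the architecture, sizes, and names here are assumptions, not taken from the article.

```python
import torch
import torch.nn as nn

class LearnedModel(nn.Module):
    """Illustrative environment model: a Reward Function r(s, a) and a
    stochastic Transition Function p(s' | s, a)."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.reward_head = nn.Sequential(      # approximates r(s, a)
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )
        self.transition_head = nn.Sequential(  # outputs mean and log-variance of p(s' | s, a)
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2 * state_dim)
        )

    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)
        reward = self.reward_head(x)
        mean, log_var = self.transition_head(x).chunk(2, dim=-1)
        # the transition is a distribution over next states, not a single point
        next_state_dist = torch.distributions.Normal(mean, log_var.exp().sqrt())
        return reward, next_state_dist
```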

Model-free Vs. Model-based

Model-free algorithms do not have an explicit representation of the environment model. They express the policy as a deep neural network with parameters θ, trained to predict the best action for each state.
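
For example, such a policy network could look like the snippet below. This is purely illustrative: the state size and the mapping to the three Space Invaders actions are assumptions made for the example.

```python
import torch
import torch.nn as nn

# Illustrative model-free policy with parameters θ: the network maps a state
# directly to a preference over the three discrete actions (left, right, fire).
# The state size (8) is an arbitrary choice for the example.
policy = nn.Sequential(
    nn.Linear(8, 64), nn.ReLU(),
    nn.Linear(64, 3),
)

state = torch.randn(1, 8)  # a fake state, just to show the call
action = torch.distributions.Categorical(logits=policy(state)).sample()
```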

Model-based algorithms use deep neural networks to approximate both the reward function and the transition distribution. They can then use the learned model in many ways; the two principal ones are expressing the policy as a planner or using the model to produce synthetic transitions that enrich the experience replay buffer.
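
The second idea (enriching the buffer with imagined data, in the spirit of Dyna) can be sketched in a few lines. The function below is illustrative only; it assumes a model with the interface sketched above, i.e. model(state, action) returns a predicted reward and a distribution over next states.

```python
import random

def enrich_buffer(model, replay_buffer, n_synthetic=1_000):
    """Dyna-style augmentation sketch: replay stored states and actions through
    the learned model and add the imagined transitions to the buffer."""
    for _ in range(n_synthetic):
        state, action, *_ = random.choice(replay_buffer)  # reuse a real starting point
        reward, next_state_dist = model(state, action)    # imagine one step with the learned model
        next_state = next_state_dist.sample()             # sample from the predicted distribution
        replay_buffer.append((state, action, reward, next_state, False))
```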

How Model-Based improves the sample efficiency

Imagine you are playing chess for the first time. You do not know what a chess piece is, how it moves, the goal of the game, or the main strategies. On your turn, you only know what the legal moves are, and for each of them, you receive reward feedback.

Source: Best Chess Strategy Tips For Club Players

With a model-free approach, the amount of information you can extract from a single game is meager, and to obtain a good approximation of the true value of the moves, you need to play an immense number of games.

On the contrary, understanding the rules of the game is a more straightforward task. Furthermore, once you know how the game works, you can simulate all the possible evolutions of the game resulting from your actions. So you can speed up the process of collecting data by simulating games.

In some cases, with the model-based approach, you can even stop the training once you have learned the model and build a policy directly from a planner. In other words, at each step, you use the simulations as a planner to choose the best action to take.
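
One simple way to do this is random shooting: sample candidate action sequences, roll each of them out inside the learned model, and execute the first action of the best imagined trajectory. The sketch below assumes the same model interface as above and a continuous action space; the horizon, candidate count, and Gaussian sampling are arbitrary choices for illustration.

```python
import torch

def plan(model, state, action_dim, horizon=10, n_candidates=500):
    """Random-shooting planner sketch: simulate candidate action sequences in
    the learned model and return the first action of the best one."""
    candidates = torch.randn(n_candidates, horizon, action_dim)  # random action sequences
    returns = torch.zeros(n_candidates)
    for i, actions in enumerate(candidates):
        s = state
        for a in actions:                      # simulate the trajectory step by step
            reward, next_state_dist = model(s, a)
            returns[i] += reward.item()
            s = next_state_dist.sample()
        # no environment interaction happened: the rollout was entirely imagined
    best = returns.argmax()
    return candidates[best, 0]                 # execute only the first planned action
```

More refined planners, such as the cross-entropy method used by PlaNet, iteratively refit the distribution from which the candidates are sampled, but the underlying idea is the same.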

Conclusions

In this article, Deep RL: a Model-Based approach (part 2), we saw the theoretical difference between model-free and model-based approaches. In the next article, we will analyze a practical model-based algorithm called PlaNet.

This article was written in collaboration with Luca Sorrentino.

Enrico Busto - EN
Founding Partner and CTO @ Addfor S.p.A. We develop Artificial Intelligence Solutions.