# RL algorithms – the key ideas summarized

We assume you know the basic ideas of reinforcement learning: agents, actions, rewards, environment, episodes, total episode reward, discounted reward, Q-values, state values.

Over the history of reinforcement learning, tens of algorithms were developed and it can be difficult to keep track of all the progress. As a refresher, here I list the main classical algorithms as well as the main deep RL algorithms. This can serve as a reference to quickly compare algorithms and remember the key idea behind them. Here are the algorithms that will be covered:

Classical algorithms:

• Q Learning
• SARSA
• Actor Critic

Deep reinforcement learning algorithms:

• Deep Q Network (DQN)
• Double DQN (DDQN)
• Dueling DQN
• Double Deep Policy Gradient (DDPG)
• Trust Region Policy Optimization (TRPO)
• Proximal Policy Optimization (PPO)
• Asynchronous Advantage Actor Critic (A3C)
• Soft Actor Critic (SAC)
• Rainbow

Note: if you want the pseudo-code and a more detailed explanation for the algorithm, you can check Open AI’s Spinning Up created by Josh Achiam.

Q Learning (Watkins, 1989).

In state s, the agent could pick many actions. The goal is to pick the best one, so the agent picks the action a’ with the highest Q value:

a’ = \underset{a}{\mathrm{argmax}}\; Q(s, a)

The quality of this policy depends on the quality of the Q-values. Each time the agent does an action, the corresponding Q-value is updated. More precisely, if the agent picks action a in state s, receiving reward r and moving to state s’, then the following update rule is used:

Q^{new}(s, a) = Q^{old}(s, a) + \alpha\cdot TD\,error
Q^{new}(s, a) = Q^{old}(s, a) + \alpha\cdot(r + \gamma\;\underset{a’}{\mathrm{max}}\; Q(s’, a’) – Q^{old}(s, a))

where \alpha is the learning rate and the Temporal Difference error measures the error in the old Q-value. This expresses that the difference between Q^{old}(s, a) and \underset{a'}{\mathrm{max}}\; Q(s', a') should be \mathbb{E}(r), because if that’s true, then on average the TD error is 0

SARSA (Rummery & Niranjan, 1994).

SARSA is almost identical to Q-learning, except for 1 change in the update rule. In Q-learning, the considered action a’ in the next state s’ is always the one with the highest Q-value. In SARSA, we pick the one that the policy would actually use in s’.

Q^{new}(s, a_t) = Q^{old}(s, a_t) + \alpha\cdot(r + \gamma\; Q(s’, a_{t+1}) – Q^{old}(s, a_t))

One good question would be: what’s the difference? Isn’t the action we’ll pick in the next state the one with the highest Q-value? Not necessarily. This was the policy we used in Q-learning, but if we add exploration (picking random action once in a while) or decide to pick the actions another way, then SARSA would adapt to that new policy, while Q-learning wouldn’t.

So, in other words, SARSA uses the action that the policy would actually pick while Q-learning could potentially use in the update rule an action that wouldn’t be the one picked by the policy. For this reason, we say SARSA is on-policy while Q-learning is off-policy.

Note: Q-Learning, Sarsa and other methods that rely on picking the action with the best Q-value have one weakness: if the number of actions is too large or infinite, it can’t be done in practice. This makes them unusable for environments where the actions are continuous.

Policy Gradient / REINFORCE (Williams, 1992).

Actor Critic (Konda & Tsitsiklis, 2000).

DQN – Deep Q-Learning (Mnih et al, 2013, 2015).

DDQN – Double DQN (van Hasselt et al, 2015).

Dueling DQN (Wang et al, 2015).

DDPG – Deep Deterministic Policy Gradient (Lillicrap et al, 2015).

A3C – Asynchronous Advantage Actor-Critic (Mnih et al, 2016).

Instead of experience replay, we asynchronously execute multiple agents in parallel, on multiple instances of the environment. This parallelism also decorrelates the agents’ data into a more stationary process, since at any given time-step the parallel agents will be experiencing a variety of different states. This simple idea enables a much larger spectrum of fundamental on-policy RL algorithms, such as Sarsa, n-step methods, and actor-critic methods, as well as off-policy RL algorithms such as Q-learning, to be applied robustly and effectively using deep neural networks.

Multiple actors-learners running in parallel are likely to be exploring different parts of the environment. Moreover, one can explicitly use different exploration policies in each actor-learner to maximize this diversity. By running different exploration policies in different threads, the overall changes being made to the parameters by multiple actor-learners applying online updates in parallel are likely to be less correlated in time than a single agent applying online updates. Hence, we do not use a replay memory and rely on parallel actors employing different exploration policies to perform the stabilizing role undertaken by experience replay in the DQN training algorithm

The model parameters for the main and target network are shared by all threads/agents! There isn’t a different model for each agent, all agents share the same model parameters. The idea is to add stabilization without using experience replay, not to train K different models (which wouldn’t improve performance, it would just train more models).

• Can train in half the time than previous approaches, with far fewer resources (a CPU instead of a GPU)
• Can combine on-policy algorithms with deep neural networks without becoming unstable (previous results with experience replay allowed only off-policy algorithms)
• Surpasses previous state-of-the-art performance
• General, works on many different problems. Articles show that it can navigate 3D mazes using images, solve some Atari Games, play an F1 game and solve continuous motor control problems (robots)

We believe that the success of A3C on both 2D and 3D games, discrete and continuous action spaces, as well as its ability to train feedforward and recurrent agents makes it the most general and successful reinforcement learning agent to date.

TRPO – Trust Region Policy Optimization (Schulman et al, 2015) & PPO – Proximal Policy Optimization (Schulman et al, 2017).

In these two algorithms, we have a neural network that estimates the value of a state, e.g. the discounted reward we expect to receive at that state. Since it’s a neural network, the output is noisy.

We use the policy many times until we have many trajectories, along with the list of actions, visited states and rewards at each step. We can compute the discounted reward for each action taken (at each state). To see if we should make that action more likely or less likely, we can check if the reward is greater or smaller than the value predicted by the neural network for that state. The difference between the real reward and the predicted reward is called the advantage.

If the advantage is positive (better than expected) we make taking that action in that state more likely. If the advantage is negative (worse than expected) we make that action at that state less likely.

However, since the output of the neural network is noisy, the advantage may have high variance and be unreliable, especially at the beginning. If we blindly trusted it, the policy could change a lot in a single policy gradient update and become very bad. We should be more cautious and limit how much the policy changes at each update step.

TRPO addresses this using KL-divergence, which measures how similar two distributions are. In their approach, the KL-divergence between the old policy distribution and the new policy distribution must be below a certain threshold.

PPO is a much simpler approach, which ensures the change is not too big by clipping the log(probability) term in the gradient loss expression.

Soft Actor Critic (Haarnoja et al, 2018).

Rainbow (Hessel et al, 2017).

Tags:, ,