Contents
- RL Overview
  - MDP
  - Trajectory
  - Infinite-horizon discounted return
  - Expected return
  - The RL problem
  - Value functions
  - Bellman equations
  - Advantage function
  - Approaches to (model-free) RL: Q-Learning, Policy Optimisation
- Policy Gradient Algorithms
  - Policy gradient
  - Grad-Log-Prob of a Trajectory
  - Log derivative trick
  - Policy gradient estimate
  - "Overfitting" on the current policy
  - Expected Grad-Log-Prob Lemma
  - Baselines
  - Vanilla policy gradient algorithm
  - TRPO
  - PPO
RL Overview
MDP
A 5-tuple $(S, A, R, P, \rho_0)$ where:
- $S$ is the set of states
- $A$ is the set of actions
- $R$ is the reward function, with $r_t = R(s_t, a_t, s_{t+1})$
- $P$ is the transition probability function, with $P(s' \mid s, a)$ the probability of moving to state $s'$ from state $s$ under action $a$
- $\rho_0$ is the starting state distribution
Trajectory
A sequence of states and actions, $\tau = (s_0, a_0, s_1, a_1, \dots)$, with $s_0 \sim \rho_0$, $a_t \sim \pi(\cdot \mid s_t)$ and $s_{t+1} \sim P(\cdot \mid s_t, a_t)$.
Infinite-horizon discounted return
$R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t$, where $\gamma \in (0, 1)$ is the discount factor.
Expected return
$J(\pi) = \mathbb{E}_{\tau \sim \pi}[R(\tau)]$, the return averaged over the trajectory distribution induced by $\pi$.
The RL problem
Find the optimal policy $\pi^* = \arg\max_{\pi} J(\pi)$.
Value functions
The on-policy value function: $V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi}[R(\tau) \mid s_0 = s]$
The on-policy action-value function: $Q^{\pi}(s, a) = \mathbb{E}_{\tau \sim \pi}[R(\tau) \mid s_0 = s, a_0 = a]$
Bellman equations
$V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s),\, s' \sim P(\cdot \mid s, a)}\!\left[ R(s, a, s') + \gamma V^{\pi}(s') \right]$
$Q^{\pi}(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\!\left[ R(s, a, s') + \gamma\, \mathbb{E}_{a' \sim \pi(\cdot \mid s')}\!\left[ Q^{\pi}(s', a') \right] \right]$
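As a concrete illustration, here is a minimal sketch of iterative policy evaluation, which applies the Bellman equation for $V^{\pi}$ as an update rule until it converges to a fixed point. The MDP here is a made-up tabular one with the reward simplified to depend only on $(s, a)$; none of the numbers come from the notes above.

```python
import numpy as np

# Hypothetical tabular MDP: 3 states, 2 actions (illustrative numbers only).
n_states, n_actions, gamma = 3, 2, 0.9
P = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = np.random.randn(n_states, n_actions)                                # r(s, a)
pi = np.full((n_states, n_actions), 1.0 / n_actions)                    # uniform policy

# Iterative policy evaluation: repeatedly apply the Bellman backup.
V = np.zeros(n_states)
for _ in range(1000):
    # V(s) = sum_a pi(a|s) [ r(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
    V_new = (pi * (R + gamma * P @ V)).sum(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
```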
Advantage function
$A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$: how much better taking action $a$ in state $s$ is than acting according to $\pi$.
Approaches to (model-free) RL
Q-Learning
- Aim = to learn a parameterised approximation to the optimal action-value function: $Q_{\theta}(s, a) \approx Q^*(s, a)$
- Action taken by the policy is given by: $a(s) = \arg\max_{a} Q_{\theta}(s, a)$
- Objective function based on the Bellman equation (minimise the squared Bellman/TD error; see the sketch after this list)
- Optimisation typically off-policy
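A minimal sketch of that objective, assuming a DQN-style setup in PyTorch with a hypothetical `q_net` and a frozen `target_net` (both mapping states to per-action values); these names are placeholders, not from the notes:

```python
import torch
import torch.nn.functional as F

def q_learning_loss(q_net, target_net, batch, gamma=0.99):
    """Squared Bellman-error loss on a batch of off-policy transitions."""
    s, a, r, s_next, done = batch  # tensors: states, actions, rewards, next states, done flags

    # Q_theta(s, a) for the actions actually taken
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)

    # Bellman target: r + gamma * max_a' Q_target(s', a'), with no bootstrap at terminal states
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values

    return F.mse_loss(q_sa, target)
```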
Policy Optimisation
- Aim = to learn a parameterised approximation to the optimal policy: $\pi_{\theta} \approx \pi^*$
- Also often involves learning a value function to act as a "critic"
- Done by gradient ascent on an estimate of $J(\pi_{\theta})$ (or on a surrogate objective)
- Optimisation typically on-policy
Policy Gradient Algorithms
Policy gradient
The gradient of the policy performance, i.e. of the expected return: $\nabla_{\theta} J(\pi_{\theta})$, where $J(\pi_{\theta}) = \mathbb{E}_{\tau \sim \pi_{\theta}}[R(\tau)]$.
Grad-Log-Prob of a Trajectory
Simply equals the sum of the action grad-log-probs:
$\nabla_{\theta} \log P(\tau \mid \theta) = \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)$
since the start-state and transition probabilities don't depend on $\theta$, so their gradients vanish.
Log derivative trick
$\nabla_{\theta} P(\tau \mid \theta) = P(\tau \mid \theta)\, \nabla_{\theta} \log P(\tau \mid \theta)$
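Writing the expectation as an integral and combining the trick with the grad-log-prob expression above gives the basic policy gradient as an expectation we can sample:

$$
\nabla_{\theta} J(\pi_{\theta})
= \nabla_{\theta} \int_{\tau} P(\tau \mid \theta)\, R(\tau)
= \int_{\tau} \nabla_{\theta} P(\tau \mid \theta)\, R(\tau)
= \int_{\tau} P(\tau \mid \theta)\, \nabla_{\theta} \log P(\tau \mid \theta)\, R(\tau)
= \mathbb{E}_{\tau \sim \pi_{\theta}}\!\left[ \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, R(\tau) \right]
$$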
Policy gradient estimate
We simply average, over the sampled trajectories, the return-to-go-weighted sum of the action grad-log-probs:
$\hat{g} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, \hat{R}_t$, where $\hat{R}_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$ is the return-to-go from step $t$.
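A minimal PyTorch sketch of this estimator, assuming a hypothetical `policy` module that returns a `torch.distributions` object over actions; minimising the loss below is gradient ascent along the policy gradient estimate.

```python
import torch

def reward_to_go(rewards, gamma=0.99):
    """Discounted return-to-go for one trajectory's list of rewards."""
    rtg = torch.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

def pg_loss(policy, obs, acts, rtg):
    """Surrogate loss whose negative gradient is the policy gradient estimate."""
    logp = policy(obs).log_prob(acts)   # log pi_theta(a_t | s_t) for each step in the batch
    return -(logp * rtg).mean()         # gradient ascent on J == gradient descent on this loss
```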
"Overfitting" on the current policy
- Updating $\theta$ along the estimated policy gradient gives us better parameters for the trajectories sampled using the current policy $\pi_{\theta}$.
- But it does not account for the fact that the policy itself has changed after the update.
- Thus the new trajectories we sample will come from a different distribution to the one we optimised for.
- If we change $\theta$ too much, the new parameters will only be suited to the old trajectory distribution and will not generalise to the new one.
Expected Grad-Log-Prob Lemma
The expected value of a grad-log-prob is 0:
$\mathbb{E}_{x \sim P_{\theta}}[\nabla_{\theta} \log P_{\theta}(x)] = 0$
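This follows because probabilities integrate to 1, so the gradient of the total probability is zero:

$$
\mathbb{E}_{x \sim P_{\theta}}[\nabla_{\theta} \log P_{\theta}(x)]
= \int_{x} P_{\theta}(x)\, \nabla_{\theta} \log P_{\theta}(x)
= \int_{x} \nabla_{\theta} P_{\theta}(x)
= \nabla_{\theta} \int_{x} P_{\theta}(x)
= \nabla_{\theta} 1
= 0
$$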
Baselines
The expected grad-log-prob lemma results in:
$\mathbb{E}_{a_t \sim \pi_{\theta}}[\nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, b(s_t)] = 0$
since the baseline $b(s_t)$ doesn't depend on $a_t$.
This means we can replace $\hat{R}_t$ in our policy gradient estimate with $\hat{R}_t - b(s_t)$, or any other expression that leaves the expectation of the policy gradient unchanged.
Some potential choices for the per-step weight are:
- The advantage function: $A^{\pi_{\theta}}(s_t, a_t)$, estimated e.g. as $\hat{R}_t - V_{\phi}(s_t)$ with a learned value function
- Generalised advantage estimation (GAE): an exponentially weighted sum of advantage estimates looking different numbers of steps into the future, which reduces the variance of the policy gradient estimate at the cost of some bias (sketched below)
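A minimal sketch of GAE, assuming per-step arrays of rewards and value predictions from a single trajectory; `lam` is the GAE decay parameter and the final entry of `values` is a bootstrap value for the last state.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalised advantage estimation for one trajectory.

    rewards: array of r_t, length T
    values:  array of V(s_t), length T + 1 (includes a bootstrap value for the final state)
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # GAE recursion: A_t = delta_t + gamma * lam * A_{t+1}
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```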
Vanilla policy gradient algorithm
Repeat (one iteration is sketched in code below this list):
- Collect rollouts with the current policy
- Compute advantage estimates for each state in the rollouts
- Estimate the policy gradient from the rollouts
- Take a gradient ascent step on the policy parameters using this estimate
- Fit the critic by minimising the MSE between its value predictions and the rollout returns
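A minimal sketch of one such iteration in PyTorch, assuming the rollout data has already been collected into tensors and that `policy(obs)` returns an action distribution while `critic(obs)` returns value predictions; all names here are placeholders, not from the original notes.

```python
import torch

def vpg_update(policy, critic, policy_opt, critic_opt, obs, acts, advantages, returns):
    """One vanilla policy gradient update from a batch of rollout data."""
    # Policy step: gradient ascent on the advantage-weighted log-probs.
    logp = policy(obs).log_prob(acts)
    policy_loss = -(logp * advantages).mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

    # Critic step: regress value predictions onto the empirical returns.
    value_loss = ((critic(obs).squeeze(-1) - returns) ** 2).mean()
    critic_opt.zero_grad()
    value_loss.backward()
    critic_opt.step()
```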
TRPO
- Vanilla policy gradient
- Plus a KL constraint at each update, which bounds how far the policy can move from the old policy
- The constraint is approximated by a second-order Taylor approximation
- The conjugate gradient trick is used to avoid computing and inverting the whole Hessian (only Hessian-vector products are needed)
- This effectively leads to following the "natural policy gradient"
- Uses a surrogate objective like PPO, but without clipping (the constrained problem is written out below)
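For reference, the KL-constrained problem TRPO approximately solves at each update is:

$$
\max_{\theta}\; \mathbb{E}_{s, a \sim \pi_{\theta_{\text{old}}}}\!\left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, A^{\pi_{\theta_{\text{old}}}}(s, a) \right]
\quad \text{s.t.} \quad
\mathbb{E}_{s \sim \pi_{\theta_{\text{old}}}}\!\left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s)\, \big\|\, \pi_{\theta}(\cdot \mid s) \right) \right] \le \delta
$$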
PPO
PPO-clip:
Vanilla policy gradient but using the surrogate objective:
$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\left( r_t(\theta)\, \hat{A}_t,\; \mathrm{clip}\!\left(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\right) \hat{A}_t \right) \right]$, where $r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the probability ratio and $\epsilon$ is a small hyperparameter (e.g. 0.2).
Note that the clipping only binds in the cases where the update would appear to significantly improve the policy on the sampled data: pushing the ratio above $1 + \epsilon$ for positive-advantage actions, or below $1 - \epsilon$ for negative-advantage ones.
Hence the objective gives no extra benefit for moving the new policy too far from the old policy on the training data, which limits the size of each update.
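A minimal sketch of the PPO-clip loss in PyTorch, assuming log-probs under the old policy were stored at rollout time; the function and argument names are placeholders.

```python
import torch

def ppo_clip_loss(policy, obs, acts, advantages, logp_old, clip_eps=0.2):
    """Clipped surrogate objective (negated, so minimising it is gradient ascent on L^CLIP)."""
    logp = policy(obs).log_prob(acts)
    ratio = torch.exp(logp - logp_old)                       # r_t(theta) = pi_theta / pi_theta_old
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Take the minimum of the unclipped and clipped terms, then negate for a loss.
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```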
PPO-penalty:
Approximately solves a KL-constrained update like TRPO, but penalises the KL-divergence in the objective function instead of making it a hard constraint; the penalty coefficient is typically adapted over training so the KL stays near a target value.
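A corresponding sketch of the penalised objective, assuming the old policy's action distributions (and log-probs) were stored at rollout time; again, all names are placeholders.

```python
import torch

def ppo_penalty_loss(policy, obs, acts, advantages, logp_old, old_dist, beta=1.0):
    """Surrogate objective with a KL penalty instead of clipping (negated as a loss)."""
    dist = policy(obs)
    ratio = torch.exp(dist.log_prob(acts) - logp_old)
    kl = torch.distributions.kl_divergence(old_dist, dist)   # D_KL(pi_old || pi_theta) per state
    return -(ratio * advantages - beta * kl).mean()
```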