Deep Reinforcement Learning

RL Overview


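A Markov decision process (MDP) is a 5-tuple $\langle \mathcal{S}, \mathcal{A}, R, P, \rho_0 \rangle$ where:
  1. $\mathcal{S}$ is the set of states
  1. $\mathcal{A}$ is the set of actions
  1. $R : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the reward function, with $r_t = R(s_t, a_t, s_{t+1})$
  1. $P : \mathcal{S} \times \mathcal{A} \to \mathcal{P}(\mathcal{S})$ is the transition probability function, with $P(s' \mid s, a)$ the probability of moving to state $s'$ after taking action $a$ in state $s$
  1. $\rho_0$ is the starting state distribution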
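As a rough illustration of the five components, the sketch below builds a small random tabular MDP and samples one trajectory from it. The array names (`P`, `R`, `rho0`), the sizes and the `rollout` helper are all assumptions made for this example, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 3, 2                      # sizes of the state and action sets
# P[s, a] is a probability distribution over next states s'
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
# R[s, a, s'] is the reward for the transition (s, a, s')
R = rng.normal(size=(n_states, n_actions, n_states))
# rho0 is the starting state distribution (here: always start in state 0)
rho0 = np.array([1.0, 0.0, 0.0])


def rollout(policy, horizon):
    """Sample (state, action, reward) triples; policy[s] is a distribution over actions."""
    s = rng.choice(n_states, p=rho0)
    trajectory = []
    for _ in range(horizon):
        a = rng.choice(n_actions, p=policy[s])
        s_next = rng.choice(n_states, p=P[s, a])
        trajectory.append((s, a, R[s, a, s_next]))
        s = s_next
    return trajectory


uniform_policy = np.full((n_states, n_actions), 1.0 / n_actions)
trajectory = rollout(uniform_policy, horizon=5)
```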


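Infinite-horizon discounted return

$$R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t, \quad \gamma \in (0, 1)$$
where $\tau = (s_0, a_0, s_1, a_1, \dots)$ is a trajectory.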
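In practice the return is computed from a finite list of sampled rewards, which truncates the infinite sum. A minimal sketch, where the function name `discounted_return` is an assumption of this example:

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t], accumulated from the last reward backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25
```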

Expected return

$$J(\pi) = \mathbb{E}_{\tau \sim \pi}[R(\tau)]$$

The RL problem

$$\pi^* = \arg\max_{\pi} J(\pi)$$

Value functions

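The on-policy value function: $V^\pi(s) = \mathbb{E}_{\tau \sim \pi}[R(\tau) \mid s_0 = s]$
The on-policy action-value function: $Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi}[R(\tau) \mid s_0 = s, a_0 = a]$

Bellman equations

$$V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s),\, s' \sim P(\cdot \mid s, a)}[R(s, a, s') + \gamma V^\pi(s')]$$
$$Q^\pi(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}[R(s, a, s') + \gamma\, \mathbb{E}_{a' \sim \pi(\cdot \mid s')}[Q^\pi(s', a')]]$$

Advantage function

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$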
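For a small tabular MDP the Bellman equation for the value function is a linear system, so the value, action-value and advantage functions can be computed exactly. The sketch below assumes arrays `P[s, a, s']`, `R[s, a, s']` and `pi[s, a]`; these names and the two-state example are hypothetical.

```python
import numpy as np


def evaluate_policy(P, R, pi, gamma):
    """Solve the Bellman equations exactly for a tabular MDP and policy."""
    n_states = P.shape[0]
    r_sa = np.einsum("ijk,ijk->ij", P, R)     # expected reward of each (s, a)
    P_pi = np.einsum("ij,ijk->ik", pi, P)     # state-to-state transitions under pi
    r_pi = np.einsum("ij,ij->i", pi, r_sa)    # expected one-step reward under pi
    # V = r_pi + gamma * P_pi V  <=>  (I - gamma * P_pi) V = r_pi
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    Q = r_sa + gamma * (P @ V)                # Bellman equation for Q
    A = Q - V[:, None]                        # advantage function
    return V, Q, A


# Two states, one action: state 0 moves to state 1 with reward 1, state 1 is absorbing.
P = np.array([[[0.0, 1.0]], [[0.0, 1.0]]])
R = np.array([[[0.0, 1.0]], [[0.0, 0.0]]])
pi = np.ones((2, 1))
V, Q, A = evaluate_policy(P, R, pi, gamma=0.9)
```

A useful sanity check is that the advantage averages to zero under the policy in every state, since $V^\pi(s)$ is the policy-weighted mean of $Q^\pi(s, a)$.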

Approaches to (model-free) RL


Q-Learning

  1. Aim = to learn a parameterised approximation $Q_\theta(s, a)$ to the optimal action-value function $Q^*(s, a)$
  1. Action taken by policy is given by: $a(s) = \arg\max_a Q_\theta(s, a)$
  1. Objective function based on the Bellman equation
  1. Optimisation is typically off-policy
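The points above can be sketched with a tabular stand-in for $Q_\theta$, where the table entries play the role of the parameters. This is the classic tabular Q-learning update rather than a deep variant; the function names and default hyperparameters are assumptions of this example.

```python
import numpy as np


def greedy_action(Q, s):
    """Action chosen by the policy: argmax over a of Q[s, a]."""
    return int(np.argmax(Q[s]))


def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """Move Q[s, a] towards the Bellman target r + gamma * max_a' Q[s', a']."""
    target = r + (0.0 if done else gamma * np.max(Q[s_next]))
    Q[s, a] += alpha * (target - Q[s, a])
    return Q


Q = np.zeros((2, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1, done=False, alpha=0.5, gamma=0.9)
```

The update is off-policy because the target uses the greedy action in `s_next`, regardless of which behaviour policy generated the transition.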

Policy Optimisation

  1. Aim = to learn a parameterised approximation to the optimal policy: $\pi_\theta(a \mid s)$
  1. Also often involves learning a value function $V_\phi(s)$ to act as a "critic"
  1. Done by gradient ascent on an estimate of $J(\pi_\theta)$ (or a surrogate objective)
  1. Optimisation is typically on-policy

Policy Gradient Algorithms

Policy gradient

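The gradient of the policy performance, i.e. of the expected return, with respect to the policy parameters $\theta$:
$$\nabla_\theta J(\pi_\theta) = \nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$$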

Grad-Log-Prob of a Trajectory

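Since the environment dynamics and the starting state distribution do not depend on $\theta$, this simply equals the sum of the action grad-log-probs:
$$\nabla_\theta \log P(\tau \mid \theta) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$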

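Log derivative trick

$$\nabla_\theta P(\tau \mid \theta) = P(\tau \mid \theta)\, \nabla_\theta \log P(\tau \mid \theta)$$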
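Combining the log derivative trick with the grad-log-prob of a trajectory turns the policy gradient into an expectation that can be estimated from samples. A sketch of the standard derivation, assuming finite-length trajectories with final step $T$:

```latex
\begin{align*}
\nabla_\theta J(\pi_\theta)
  &= \nabla_\theta \int_\tau P(\tau \mid \theta)\, R(\tau) \\
  &= \int_\tau \nabla_\theta P(\tau \mid \theta)\, R(\tau) \\
  &= \int_\tau P(\tau \mid \theta)\, \nabla_\theta \log P(\tau \mid \theta)\, R(\tau) \\
  &= \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \nabla_\theta \log P(\tau \mid \theta)\, R(\tau) \right] \\
  &= \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \right]
\end{align*}
```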

Policy gradient estimate

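Given a set $\mathcal{D}$ of trajectories sampled with $\pi_\theta$, we average over the trajectories the sum of the action grad-log-probs, each weighted by its return-to-go:
$$\hat{g} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{R}_t$$
where $\hat{R}_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$ is the (discounted) return-to-go from step $t$. Weighting by the return-to-go rather than the full return $R(\tau)$ leaves the expectation unchanged, because an action cannot affect rewards received before it was taken.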
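The estimate can be sketched for a tabular softmax policy, where the grad-log-prob has a closed form. The policy parameterisation (one row of logits per state), the trajectory format (lists of `(state, action, reward)` triples) and all function names are assumptions of this example.

```python
import numpy as np


def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()


def grad_log_prob(theta, s, a):
    """Gradient of log pi_theta(a|s) w.r.t. theta, for logits theta[s]."""
    g = np.zeros_like(theta)
    g[s] = -softmax(theta[s])   # derivative of the log-normaliser
    g[s, a] += 1.0              # derivative of the chosen logit
    return g


def returns_to_go(rewards, gamma):
    """Discounted return-to-go for every time step of one trajectory."""
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


def policy_gradient_estimate(theta, trajectories, gamma):
    """Average over trajectories of the return-to-go-weighted grad-log-probs."""
    g_hat = np.zeros_like(theta)
    for traj in trajectories:
        rtg = returns_to_go([r for _, _, r in traj], gamma)
        for (s, a, _), weight in zip(traj, rtg):
            g_hat += grad_log_prob(theta, s, a) * weight
    return g_hat / len(trajectories)


theta = np.zeros((1, 2))                       # one state, two actions
g_hat = policy_gradient_estimate(theta, [[(0, 0, 1.0)]], gamma=0.99)
```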

"Overfitting" on the current policy

  1. Updating $\theta$ along the policy gradient gives us better parameters for the trajectories sampled using the current policy $\pi_{\theta_k}$.
  1. But it does not account for the fact that the policy itself has changed.
  1. Thus the new trajectories we sample will come from a different distribution to the one we optimised on.
  1. If we change $\theta$ too much, the new policy will only be suitable for the old trajectory distribution and will not generalise to the new one.

Expected Grad-Log-Prob Lemma

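The expected value of a grad-log-prob is 0. For any parameterised distribution $P_\theta$ over a random variable $x$:
$$\mathbb{E}_{x \sim P_\theta}[\nabla_\theta \log P_\theta(x)] = 0$$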
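The lemma is easy to check numerically for a softmax distribution, where the grad-log-prob with respect to the logits is a one-hot vector minus the probabilities. The four-way distribution below is an arbitrary example chosen for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=4)
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Row a holds the gradient of log p(a) w.r.t. the logits: one_hot(a) - probs
grad_log_probs = np.eye(4) - probs

# Exact expectation over a ~ p of the grad-log-prob
expected_grad = probs @ grad_log_probs
```

Here the expectation is computed exactly by summing over all outcomes, so it vanishes up to floating-point error rather than only on average.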


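The expected grad-log-prob lemma results in:
$$\mathbb{E}_{a_t \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)] = 0$$
as the baseline $b(s_t)$ doesn't depend on $a_t$.
This means we can replace the return-to-go $\hat{R}_t$ in our policy gradient estimate with $\Phi_t = \hat{R}_t - b(s_t)$, or any other expression $\Phi_t$ that leaves the expectation of the policy gradient unchanged.
Some potential choices of $\Phi_t$ are:
  1. The advantage function: $\Phi_t = A^{\pi_\theta}(s_t, a_t)$
  1. Generalised advantage estimation: an exponentially weighted sum of advantage estimates, each looking a different number of steps into the future. This reduces the variance of the policy gradient estimate.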
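Generalised advantage estimation is usually computed with a backward recursion over one-step TD errors. A minimal sketch, assuming a critic has already produced `values` for every visited state plus one bootstrap value for the final state; the function name and default hyperparameters are assumptions of this example.

```python
import numpy as np


def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """GAE for one trajectory.

    rewards has T entries; values has T + 1 entries, the last being the
    bootstrap value of the final state (0 if the episode terminated).
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    # One-step TD errors: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages


adv_long = gae_advantages([1.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=1.0)
adv_short = gae_advantages([1.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=0.0)
```

With `lam=1` the estimate reduces to the return-to-go minus the value baseline, and with `lam=0` it reduces to the one-step TD error, so `lam` trades variance against bias.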

Vanilla policy gradient algorithm

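  1. Collect rollouts using the current policy
  1. Compute advantage estimates for each time step
  1. Estimate the policy gradient
  1. Update the policy by gradient ascent using the policy gradient estimate
  1. Optimise the critic using the MSE between its value predictions and the rollout returns-to-go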
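The five steps above can be sketched end to end on a toy problem. Everything here is an assumption made for illustration: the corridor environment (`step`), the tabular softmax policy and tabular critic, the hyperparameters, and the choice of return-to-go minus the critic value as the advantage estimate.

```python
import numpy as np

N_STATES, HORIZON, GAMMA = 4, 20, 0.99


def step(s, a):
    """Hypothetical corridor: action 1 moves right, 0 moves left; the last state is terminal."""
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    done = s_next == N_STATES - 1
    return s_next, float(done), done


def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()


def train(iterations=200, episodes=16, pi_lr=0.5, v_lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((N_STATES, 2))    # policy logits
    values = np.zeros(N_STATES)        # tabular critic
    for _ in range(iterations):
        pi_grad = np.zeros_like(theta)
        v_grad = np.zeros_like(values)
        n_steps = 0
        for _ in range(episodes):
            # 1. Rollout with the current policy
            s, traj = 0, []
            for _ in range(HORIZON):
                a = rng.choice(2, p=softmax(theta[s]))
                s_next, r, done = step(s, a)
                traj.append((s, a, r))
                s = s_next
                if done:
                    break
            # Returns-to-go (no bootstrap if the episode hit the horizon)
            rtg, running = [], 0.0
            for _, _, r in reversed(traj):
                running = r + GAMMA * running
                rtg.append(running)
            rtg.reverse()
            for (s_t, a_t, _), ret in zip(traj, rtg):
                adv = ret - values[s_t]              # 2. advantage estimate
                glp = -softmax(theta[s_t])
                glp[a_t] += 1.0
                pi_grad[s_t] += glp * adv            # 3. policy gradient estimate
                v_grad[s_t] += values[s_t] - ret     # gradient of 0.5 * (V - ret)^2
                n_steps += 1
        theta += pi_lr * pi_grad / episodes          # 4. gradient ascent on the policy
        values -= v_lr * v_grad / n_steps            # 5. gradient descent on the critic MSE
    return theta, values


theta, values = train()
```

After training, one would expect `softmax(theta[0])` to favour action 1 (moving right), since that is the only way to reach the rewarding state.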


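TRPO

  • Vanilla policy gradient
  • Plus a KL constraint at each update, which bounds how far the policy can change: $\bar{D}_{KL}(\theta \,\|\, \theta_k) \le \delta$
  • Constraint approximated by a second-order Taylor expansion
  • Conjugate gradient with Hessian-vector products is used to avoid computing (and inverting) the whole Hessian
  • This effectively leads to following the "natural policy gradient"
  • Uses a surrogate objective like PPO, but without clipping: $L(\theta_k, \theta) = \mathbb{E}_{s, a \sim \pi_{\theta_k}}\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)} A^{\pi_{\theta_k}}(s, a)\right]$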
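The Hessian-avoiding step can be sketched with conjugate gradient, which solves $Hx = g$ using only products $Hv$. In this example `hvp` is a hypothetical callable backed by an explicit matrix for demonstration; in a real implementation it would typically be computed by automatic differentiation of the KL divergence. The backtracking line search that TRPO adds on top is not shown.

```python
import numpy as np


def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Approximately solve H x = g using only Hessian-vector products hvp(v) = H v."""
    x = np.zeros_like(g)
    r = g.copy()               # residual g - H x, with x = 0 initially
    p = r.copy()               # search direction
    rs_old = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rs_old / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x


def natural_gradient_step(hvp, g, delta):
    """Scale the direction H^{-1} g so that 0.5 * step^T H step equals delta."""
    x = conjugate_gradient(hvp, g)
    step_size = np.sqrt(2.0 * delta / (x @ hvp(x)))
    return step_size * x


H = np.array([[2.0, 0.0], [0.0, 1.0]])     # stand-in for the KL Hessian
g = np.array([2.0, 1.0])                   # stand-in for the policy gradient
direction = conjugate_gradient(lambda v: H @ v, g)
step = natural_gradient_step(lambda v: H @ v, g, delta=0.01)
```

The scaling in `natural_gradient_step` comes from the second-order approximation of the KL constraint, $\frac{1}{2} \Delta\theta^\top H \Delta\theta \le \delta$, taken with equality.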


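PPO

Vanilla policy gradient, but using the clipped surrogate objective:
$$L(s, a, \theta_k, \theta) = \min\left( \frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)} A^{\pi_{\theta_k}}(s, a),\ \mathrm{clip}\left( \frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)},\, 1 - \epsilon,\, 1 + \epsilon \right) A^{\pi_{\theta_k}}(s, a) \right)$$
Note that the clipping only takes effect in the cases where the update would significantly improve the objective: making positive-advantage actions more probable (ratio above $1 + \epsilon$) or negative-advantage actions less probable (ratio below $1 - \epsilon$).
Hence the objective gives no further benefit for moving the new policy far beyond the old policy on the training data.
The alternative PPO-Penalty variant approximately solves a KL-constrained update like TRPO, but penalises the KL divergence in the objective function instead of making it a hard constraint.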
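The clipped surrogate objective can be sketched directly from log-probabilities and advantage estimates. The function name, argument names and the default `eps=0.2` are assumptions of this example.

```python
import numpy as np


def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Mean clipped surrogate objective (to be maximised)."""
    ratio = np.exp(logp_new - logp_old)              # pi_theta / pi_theta_k
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))


zero = np.array([0.0])
# Positive advantage, ratio 1.5: the gain is capped at (1 + eps) * A
capped_gain = ppo_clip_objective(np.log([1.5]), zero, np.array([1.0]))
# Negative advantage, ratio 0.5: the gain is capped at (1 - eps) * A
capped_loss = ppo_clip_objective(np.log([0.5]), zero, np.array([-1.0]))
# Positive advantage, ratio 0.5: the policy got worse, so nothing is clipped
unclipped = ppo_clip_objective(np.log([0.5]), zero, np.array([1.0]))
```

The three cases illustrate that the minimum only removes the incentive to move further in a direction that already improves the objective; moves that make the objective worse are still penalised in full.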