# RL Overview

Contents: MDP · Trajectory · Infinite-horizon discounted return · Expected return · The RL problem · Value functions · Bellman equations · Advantage function · Approaches to (model-free) RL · Q-Learning · Policy Optimisation · Policy Gradient Algorithms · Policy gradient · Grad-Log-Prob of a Trajectory · Log derivative trick · Policy gradient estimate · "Overfitting" on the current policy · Expected Grad-Log-Prob Lemma · Baselines · Vanilla policy gradient algorithm · TRPO · PPO
(This page is based primarily on material from OpenAI's Spinning Up)
An MDP is a tuple $\langle S, A, R, P, \rho_0 \rangle$, where:
- $S$ is the set of states
- $A$ is the set of actions
- $R : S \times A \times S \to \mathbb{R}$ is the reward function, with $r_t = R(s_t, a_t, s_{t+1})$
- $P : S \times A \to \mathcal{P}(S)$ is the transition probability function, where $P(s' \mid s, a)$ is the probability of transitioning into $s'$ from state $s$ under action $a$
- $\rho_0$ is the starting state distribution
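As a concrete illustration of these objects, here is a minimal sketch of sampling a trajectory and computing its finite-horizon discounted return in a made-up two-state MDP (the states, dynamics, and rewards below are illustrative toys, not from the source):

```python
import random

# Toy MDP (illustrative): P[s][a] is a list of (next_state, probability).
P = {
    0: {0: [(0, 0.9), (1, 0.1)], 1: [(1, 1.0)]},
    1: {0: [(0, 1.0)], 1: [(1, 0.8), (0, 0.2)]},
}
# Reward function R(s, a, s'): being in state 1 afterwards pays 1.
R = lambda s, a, s2: 1.0 if s2 == 1 else 0.0

def step(s, a, rng):
    """Sample the next state from P(. | s, a) and return the reward."""
    states, probs = zip(*P[s][a])
    s2 = rng.choices(states, weights=probs)[0]
    return s2, R(s, a, s2)

def sample_trajectory(policy, s0, horizon, rng):
    """Roll out `policy` from start state s0 for `horizon` steps."""
    traj, s = [], s0
    for _ in range(horizon):
        a = policy(s)
        s2, r = step(s, a, rng)
        traj.append((s, a, r))
        s = s2
    return traj

def discounted_return(traj, gamma=0.99):
    """Discounted return R(tau) = sum_t gamma^t * r_t (truncated horizon)."""
    return sum(gamma**t * r for t, (_, _, r) in enumerate(traj))
```

Sampling with the deterministic policy "always take action 1" and discounting the rewards gives one Monte Carlo sample of the expected return.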
The on-policy value function gives the expected (infinite-horizon discounted) return $R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t$ if you start in state $s$ and always act according to $\pi$:
$$V^\pi(s) = \mathbb{E}_{\tau \sim \pi}\left[R(\tau) \mid s_0 = s\right]$$
The on-policy action-value function additionally conditions on a first action $a$:
$$Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi}\left[R(\tau) \mid s_0 = s, a_0 = a\right]$$
- Aim = to learn a parameterised approximation $Q_\theta(s, a)$ to the optimal action-value function $Q^*(s, a)$
- Action taken by the policy is given by: $a(s) = \arg\max_a Q_\theta(s, a)$
- Objective function based on the Bellman equation
- Optimisation is typically off-policy
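A minimal tabular sketch of the idea (the two-state chain environment is a made-up toy, not from the source). The behaviour policy is $\epsilon$-greedy, while the update target bootstraps from the greedy action via the Bellman optimality equation, which is what makes the method off-policy:

```python
import random

def env_step(s, a):
    """Toy deterministic chain: action 1 moves to state 1, which pays reward 1."""
    s2 = 1 if a == 1 else 0
    return s2, (1.0 if s2 == 1 else 0.0)

def q_learning(steps=500, alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0], [0.0, 0.0]]  # Q[state][action]
    s = 0
    for _ in range(steps):
        # epsilon-greedy behaviour policy
        if rng.random() < eps:
            a = rng.randrange(2)
        else:
            a = max((0, 1), key=lambda x: Q[s][x])
        s2, r = env_step(s, a)
        # Bellman-optimality target: r + gamma * max_a' Q(s', a')
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2
    return Q
```

After training, the greedy action in state 0 should be the rewarding action 1.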
- Aim = to learn a parameterised approximation $\pi_\theta$ to the optimal policy $\pi^*$
- Also often involves learning a value function to act as a "critic"
- Done by gradient ascent on an estimate of $J(\pi_\theta)$ (or a surrogate objective)
- Optimisation typically on-policy
The gradient of the policy performance, i.e. the expected return $J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau)\right]$, is
$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\nabla_\theta \log P(\tau \mid \theta)\, R(\tau)\right]$$
The grad-log-prob of a trajectory simply equals the sum of the action grad-log-probs, since the start-state and transition probabilities do not depend on $\theta$:
$$\nabla_\theta \log P(\tau \mid \theta) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
To estimate the gradient from a batch of trajectories $\mathcal{D}$, we simply average over each trajectory the return-to-go-weighted sum of the action grad-log-probs:
$$\hat{g} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{R}_t, \qquad \hat{R}_t = \sum_{t'=t}^{T} r_{t'}$$
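A minimal numpy sketch of this estimator for a single-state ("bandit-style") softmax policy, where the grad-log-prob has the closed form $\nabla_\theta \log \pi_\theta(a) = \mathbf{1}_a - \pi_\theta$ (the setup is illustrative, chosen so the gradient can be written analytically):

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # for numerical stability
    e = np.exp(z)
    return e / e.sum()

def grad_log_pi(theta, a):
    """For a softmax policy pi(a) proportional to exp(theta_a):
    grad_theta log pi(a) = onehot(a) - pi."""
    g = -softmax(theta)
    g[a] += 1.0
    return g

def pg_estimate(theta, trajectories):
    """Average over trajectories the reward-to-go-weighted sum of
    action grad-log-probs. Each trajectory is a list of (action, reward)."""
    g_hat = np.zeros_like(theta)
    for traj in trajectories:
        rewards = [r for _, r in traj]
        for t, (a, _) in enumerate(traj):
            rtg = sum(rewards[t:])  # reward-to-go from step t
            g_hat += grad_log_pi(theta, a) * rtg
    return g_hat / len(trajectories)
```

If action 1 consistently earns reward and action 0 does not, the estimated gradient points toward increasing the logit of action 1.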
- Updating $\theta$ in the direction of the policy gradient gives us better parameters for the trajectories sampled using the current $\pi_\theta$.
- But it does not account for the fact that the policy has changed too.
- Thus the new trajectories we sample will come from a different distribution to the one we optimised on.
- If we change $\theta$ too much, the policy will only be suited to the old trajectory distribution and will not generalise to the new one.
The expected value of a grad-log-prob is 0:
$$\mathbb{E}_{x \sim P_\theta}\left[\nabla_\theta \log P_\theta(x)\right] = 0$$
The expected grad-log-prob lemma results in:
$$\mathbb{E}_{a_t \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\right] = 0$$
as the "baseline" $b(s_t)$ doesn't depend on $a_t$.
This means we can replace $\hat{R}_t$ in our policy gradient estimate with $\hat{R}_t - b(s_t)$, or any other expression $\Phi_t$ that leaves the expectation of the policy gradient unchanged.
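The lemma follows in one line from the fact that probabilities are normalised, using the log-derivative trick $\nabla_\theta P_\theta(x) = P_\theta(x)\, \nabla_\theta \log P_\theta(x)$:

```latex
\begin{align}
\mathbb{E}_{x \sim P_\theta}\left[\nabla_\theta \log P_\theta(x)\right]
  &= \int P_\theta(x)\, \nabla_\theta \log P_\theta(x)\, dx \\
  &= \int \nabla_\theta P_\theta(x)\, dx
     && \text{(log-derivative trick)} \\
  &= \nabla_\theta \int P_\theta(x)\, dx
     = \nabla_\theta 1 = 0.
\end{align}
```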
Some potential choices of $\Phi_t$ are:
- The advantage function: $A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$
- Generalised advantage estimation (GAE): a weighted sum of advantage estimates looking different numbers of steps into the future. Reduces the variance of the policy gradient estimate.
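GAE is commonly computed as a single backward pass over the TD residuals $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$; a sketch, assuming `values` carries one extra bootstrap entry for the final state:

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalised advantage estimation: an exponentially weighted
    (by gamma * lam) sum of TD residuals, computed backwards in time.
    len(values) == len(rewards) + 1 (bootstrap value of the last state)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

The `lam` parameter interpolates between the one-step TD residual (`lam=0`, low variance, high bias) and the full discounted-return-minus-baseline estimate (`lam=1`, high variance, low bias).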
- Compute advantage estimates $\hat{A}_t$ for each state in the rollouts
- Estimate the policy gradient $\hat{g}$ using these advantages
- Update the policy parameters by gradient ascent on $\hat{g}$
- Optimise the critic by minimising the MSE between its predictions and the rollout returns
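The critic-fitting step (the last bullet above) can be sketched for a tabular value function; the setup is an illustrative toy, fitting $V(s)$ to empirical rollout returns by gradient descent on the squared error:

```python
import numpy as np

def fit_value_table(num_states, states, returns, lr=0.1, iters=200):
    """Fit a tabular value function by gradient descent on the MSE
    between V[s] and the empirical (rollout) return G observed from s."""
    V = np.zeros(num_states)
    for _ in range(iters):
        for s, G in zip(states, returns):
            V[s] -= lr * 2.0 * (V[s] - G)  # gradient of (V[s] - G)^2
    return V
```

With several returns observed from the same state, the fitted value settles near their mean, which is the MSE-minimising prediction.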
- Vanilla policy gradient
- Plus a KL constraint at each update, which bounds how far the policy can change
- The constraint is approximated by a second-order Taylor expansion
- The conjugate gradient method (with Hessian-vector products) is used to avoid computing the whole Hessian
- This effectively leads to following the "natural policy gradient"
- Uses a surrogate objective like PPO, but without clipping
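The Hessian-avoidance trick can be illustrated generically: the constrained step requires solving $F x = g$ for the (Fisher/Hessian) matrix $F$, but conjugate gradient needs only matrix-vector products $F v$, never $F$ itself. A minimal sketch (demonstrated on an explicit small matrix for clarity; in TRPO the product would come from automatic differentiation):

```python
import numpy as np

def conjugate_gradient(mvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only the matrix-vector
    product mvp(v) = F v, so F is never formed or inverted."""
    x = np.zeros_like(g)
    r = g.copy()        # residual g - F x (x starts at 0)
    p = r.copy()        # search direction
    rs = r @ r
    for _ in range(iters):
        Fp = mvp(p)
        alpha = rs / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```

For an $n$-dimensional symmetric positive-definite system, CG converges in at most $n$ iterations; in practice a small fixed number of iterations suffices.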
Vanilla policy gradient, but using the clipped surrogate objective:
$$L(s, a, \theta_k, \theta) = \min\left(\frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)}\, A^{\pi_{\theta_k}}(s, a),\; \operatorname{clip}\left(\frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)},\, 1 - \epsilon,\, 1 + \epsilon\right) A^{\pi_{\theta_k}}(s, a)\right)$$
Note that the clipping only binds in the cases where the update would otherwise significantly improve the objective: making positive-advantage actions more probable and negative-advantage ones less probable. Hence the objective gives no benefit from moving the policy too far from the old policy on the training data.
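The clipped objective is straightforward to write down directly; a numpy sketch, taking the probability ratio and advantage as inputs:

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    """PPO-clip surrogate: min(ratio * A, clip(ratio, 1-eps, 1+eps) * A),
    where ratio = pi_theta(a|s) / pi_theta_old(a|s)."""
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)
```

With a positive advantage the objective saturates once the ratio exceeds $1 + \epsilon$; with a negative advantage it saturates once the ratio drops below $1 - \epsilon$, so the policy gains nothing from moving further in either improving direction.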
An alternative variant (PPO-penalty) approximately solves a KL-constrained update like TRPO, but penalises the KL divergence in the objective function instead of making it a hard constraint.