Contents
- RL Overview
  - MDP
  - Trajectory
  - Infinite-horizon discounted return
  - Expected return
  - The RL problem
  - Value functions
  - Bellman equations
  - Advantage function
  - Approaches to (model-free) RL: Q-Learning, Policy Optimisation
- Policy Gradient Algorithms
  - Policy gradient
  - Grad-Log-Prob of a Trajectory
  - Log derivative trick
  - Policy gradient estimate
  - "Overfitting" on the current policy
  - Expected Grad-Log-Prob Lemma
  - Baselines
  - Vanilla policy gradient algorithm
  - TRPO
  - PPO
RL Overview
MDP
A 5-tuple $(S, A, R, P, \rho_0)$ where:
- $S$ is the set of states
- $A$ is the set of actions
- $R$ is the reward function, with $r_t = R(s_t, a_t, s_{t+1})$
- $P$ is the transition probability function, with $P(s' \mid s, a)$ the probability of moving to state $s'$ from state $s$ under action $a$
- $\rho_0$ is the starting state distribution
Trajectory
A sequence of states and actions, $\tau = (s_0, a_0, s_1, a_1, \dots)$, with $s_0 \sim \rho_0$, $a_t \sim \pi(\cdot \mid s_t)$ and $s_{t+1} \sim P(\cdot \mid s_t, a_t)$.
Infinite-horizon discounted return
$R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t$, where $\gamma \in (0, 1)$ is the discount factor.
Expected return
$J(\pi) = \mathbb{E}_{\tau \sim \pi}[R(\tau)]$, the return averaged over the trajectory distribution induced by $\pi$.
The RL problem
Find the optimal policy $\pi^* = \arg\max_{\pi} J(\pi)$.
Value functions
The on-policy value function: $V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi}[R(\tau) \mid s_0 = s]$
The on-policy action-value function: $Q^{\pi}(s, a) = \mathbb{E}_{\tau \sim \pi}[R(\tau) \mid s_0 = s, a_0 = a]$
Bellman equations
$V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s),\, s' \sim P(\cdot \mid s, a)}\!\left[ R(s, a, s') + \gamma V^{\pi}(s') \right]$
$Q^{\pi}(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\!\left[ R(s, a, s') + \gamma\, \mathbb{E}_{a' \sim \pi(\cdot \mid s')}\!\left[ Q^{\pi}(s', a') \right] \right]$
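As a concrete illustration, here is a minimal sketch of iterative policy evaluation, which applies the Bellman equation for $V^{\pi}$ as an update rule until it converges to a fixed point. The MDP here is a made-up tabular one with the reward simplified to depend only on $(s, a)$; none of the numbers come from the notes above.

```python
import numpy as np

# Hypothetical tabular MDP: 3 states, 2 actions (illustrative numbers only).
n_states, n_actions, gamma = 3, 2, 0.9
P = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = np.random.randn(n_states, n_actions)                                # r(s, a)
pi = np.full((n_states, n_actions), 1.0 / n_actions)                    # uniform policy

# Iterative policy evaluation: repeatedly apply the Bellman backup.
V = np.zeros(n_states)
for _ in range(1000):
    # V(s) = sum_a pi(a|s) [ r(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
    V_new = (pi * (R + gamma * P @ V)).sum(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
```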
Advantage function
$A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$: how much better taking action $a$ in state $s$ is than acting according to $\pi$.
Approaches to (model-free) RL
Q-Learning
- Aim = to learn a parameterised approximation to the optimal action-value function: $Q_{\theta}(s, a) \approx Q^*(s, a)$
- Action taken by the policy is given by: $a(s) = \arg\max_{a} Q_{\theta}(s, a)$
- Objective function based on the Bellman equation (minimise the squared Bellman/TD error; see the sketch after this list)
- Optimisation typically off-policy
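A minimal sketch of that objective, assuming a DQN-style setup in PyTorch with a hypothetical `q_net` and a frozen `target_net` (both mapping states to per-action values); these names are placeholders, not from the notes:

```python
import torch
import torch.nn.functional as F

def q_learning_loss(q_net, target_net, batch, gamma=0.99):
    """Squared Bellman-error loss on a batch of off-policy transitions."""
    s, a, r, s_next, done = batch  # tensors: states, actions, rewards, next states, done flags

    # Q_theta(s, a) for the actions actually taken
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)

    # Bellman target: r + gamma * max_a' Q_target(s', a'), with no bootstrap at terminal states
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values

    return F.mse_loss(q_sa, target)
```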
Policy Optimisation
- Aim = to learn a parameterised approximation to the optimal policy: $\pi_{\theta} \approx \pi^*$
- Also often involves learning a value function to act as a "critic"
- Done by gradient ascent on an estimate of $J(\pi_{\theta})$ (or on a surrogate objective)
- Optimisation typically on-policy
Policy Gradient Algorithms
Policy gradient
The gradient of the policy performance, i.e. of the expected return: $\nabla_{\theta} J(\pi_{\theta})$, where $J(\pi_{\theta}) = \mathbb{E}_{\tau \sim \pi_{\theta}}[R(\tau)]$.
Grad-Log-Prob of a Trajectory
Simply equals the sum of the action grad-log-probs:
$\nabla_{\theta} \log P(\tau \mid \theta) = \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)$
since the start-state and transition probabilities don't depend on $\theta$, so their gradients vanish.
Log derivative trick
$\nabla_{\theta} P(\tau \mid \theta) = P(\tau \mid \theta)\, \nabla_{\theta} \log P(\tau \mid \theta)$
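Writing the expectation as an integral and combining the trick with the grad-log-prob expression above gives the basic policy gradient as an expectation we can sample:

$$
\nabla_{\theta} J(\pi_{\theta})
= \nabla_{\theta} \int_{\tau} P(\tau \mid \theta)\, R(\tau)
= \int_{\tau} \nabla_{\theta} P(\tau \mid \theta)\, R(\tau)
= \int_{\tau} P(\tau \mid \theta)\, \nabla_{\theta} \log P(\tau \mid \theta)\, R(\tau)
= \mathbb{E}_{\tau \sim \pi_{\theta}}\!\left[ \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, R(\tau) \right]
$$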
Policy gradient estimate
We simply average, over the sampled trajectories, the return-to-go-weighted sum of the action grad-log-probs:
$\hat{g} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, \hat{R}_t$, where $\hat{R}_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$ is the return-to-go from step $t$.
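A minimal PyTorch sketch of this estimator, assuming a hypothetical `policy` module that returns a `torch.distributions` object over actions; minimising the loss below is gradient ascent along the policy gradient estimate.

```python
import torch

def reward_to_go(rewards, gamma=0.99):
    """Discounted return-to-go for one trajectory's list of rewards."""
    rtg = torch.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

def pg_loss(policy, obs, acts, rtg):
    """Surrogate loss whose negative gradient is the policy gradient estimate."""
    logp = policy(obs).log_prob(acts)   # log pi_theta(a_t | s_t) for each step in the batch
    return -(logp * rtg).mean()         # gradient ascent on J == gradient descent on this loss
```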
"Overfitting" on the current policy
- Updating $\theta$ along the estimated policy gradient gives us better parameters for the trajectories sampled using the current policy $\pi_{\theta}$.
- But it does not account for the fact that the policy itself has changed after the update.
- Thus the new trajectories we sample will come from a different distribution to the one we optimised for.
- If we change $\theta$ too much, the new parameters will only be suited to the old trajectory distribution and will not generalise to the new one.
Expected Grad-Log-Prob Lemma
The expected value of a grad-log-prob is 0:
$\mathbb{E}_{x \sim P_{\theta}}[\nabla_{\theta} \log P_{\theta}(x)] = 0$
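This follows because probabilities integrate to 1, so the gradient of the total probability is zero:

$$
\mathbb{E}_{x \sim P_{\theta}}[\nabla_{\theta} \log P_{\theta}(x)]
= \int_{x} P_{\theta}(x)\, \nabla_{\theta} \log P_{\theta}(x)
= \int_{x} \nabla_{\theta} P_{\theta}(x)
= \nabla_{\theta} \int_{x} P_{\theta}(x)
= \nabla_{\theta} 1
= 0
$$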
Baselines
The expected grad-log-prob lemma results in:
$\mathbb{E}_{a_t \sim \pi_{\theta}}[\nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, b(s_t)] = 0$
since the baseline $b(s_t)$ doesn't depend on $a_t$.
This means we can replace $\hat{R}_t$ in our policy gradient estimate with $\hat{R}_t - b(s_t)$, or any other expression that leaves the expectation of the policy gradient unchanged.
Some potential choices for the per-step weight are:
- The advantage function: $A^{\pi_{\theta}}(s_t, a_t)$, estimated e.g. as $\hat{R}_t - V_{\phi}(s_t)$ with a learned value function
- Generalised advantage estimation (GAE): an exponentially weighted sum of advantage estimates looking different numbers of steps into the future, which reduces the variance of the policy gradient estimate at the cost of some bias (sketched below)
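A minimal sketch of GAE, assuming per-step arrays of rewards and value predictions from a single trajectory; `lam` is the GAE decay parameter and the final entry of `values` is a bootstrap value for the last state.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalised advantage estimation for one trajectory.

    rewards: array of r_t, length T
    values:  array of V(s_t), length T + 1 (includes a bootstrap value for the final state)
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # GAE recursion: A_t = delta_t + gamma * lam * A_{t+1}
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```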
Vanilla policy gradient algorithm
Repeat (one iteration is sketched in code below this list):
- Collect rollouts with the current policy
- Compute advantage estimates for each state in the rollouts
- Estimate the policy gradient from the rollouts
- Take a gradient ascent step on the policy parameters using this estimate
- Fit the critic by minimising the MSE between its value predictions and the rollout returns
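A minimal sketch of one such iteration in PyTorch, assuming the rollout data has already been collected into tensors and that `policy(obs)` returns an action distribution while `critic(obs)` returns value predictions; all names here are placeholders, not from the original notes.

```python
import torch

def vpg_update(policy, critic, policy_opt, critic_opt, obs, acts, advantages, returns):
    """One vanilla policy gradient update from a batch of rollout data."""
    # Policy step: gradient ascent on the advantage-weighted log-probs.
    logp = policy(obs).log_prob(acts)
    policy_loss = -(logp * advantages).mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

    # Critic step: regress value predictions onto the empirical returns.
    value_loss = ((critic(obs).squeeze(-1) - returns) ** 2).mean()
    critic_opt.zero_grad()
    value_loss.backward()
    critic_opt.step()
```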
TRPO
- Vanilla policy gradient
- Plus a KL constraint at each update, which bounds how far the policy can move from the old policy
- The constraint is approximated by a second-order Taylor approximation
- The conjugate gradient trick is used to avoid computing and inverting the whole Hessian (only Hessian-vector products are needed)
- This effectively leads to following the "natural policy gradient"
- Uses a surrogate objective like PPO, but without clipping (the constrained problem is written out below)
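For reference, the KL-constrained problem TRPO approximately solves at each update is:

$$
\max_{\theta}\; \mathbb{E}_{s, a \sim \pi_{\theta_{\text{old}}}}\!\left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, A^{\pi_{\theta_{\text{old}}}}(s, a) \right]
\quad \text{s.t.} \quad
\mathbb{E}_{s \sim \pi_{\theta_{\text{old}}}}\!\left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s)\, \big\|\, \pi_{\theta}(\cdot \mid s) \right) \right] \le \delta
$$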
PPO
PPO-clip:
Vanilla policy gradient but using the surrogate objective:
$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\left( r_t(\theta)\, \hat{A}_t,\; \mathrm{clip}\!\left(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\right) \hat{A}_t \right) \right]$, where $r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the probability ratio and $\epsilon$ is a small hyperparameter (e.g. 0.2).
Note that the clipping only binds in the cases where the update would appear to significantly improve the policy on the sampled data: pushing the ratio above $1 + \epsilon$ for positive-advantage actions, or below $1 - \epsilon$ for negative-advantage ones.
Hence the objective gives no extra benefit for moving the new policy too far from the old policy on the training data, which limits the size of each update.
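A minimal sketch of the PPO-clip loss in PyTorch, assuming log-probs under the old policy were stored at rollout time; the function and argument names are placeholders.

```python
import torch

def ppo_clip_loss(policy, obs, acts, advantages, logp_old, clip_eps=0.2):
    """Clipped surrogate objective (negated, so minimising it is gradient ascent on L^CLIP)."""
    logp = policy(obs).log_prob(acts)
    ratio = torch.exp(logp - logp_old)                       # r_t(theta) = pi_theta / pi_theta_old
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Take the minimum of the unclipped and clipped terms, then negate for a loss.
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```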
PPO-penalty:
Approximately solves a KL-constrained update like TRPO, but penalises the KL-divergence in the objective function instead of making it a hard constraint; the penalty coefficient is typically adapted over training so the KL stays near a target value.
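A corresponding sketch of the penalised objective, assuming the old policy's action distributions (and log-probs) were stored at rollout time; again, all names are placeholders.

```python
import torch

def ppo_penalty_loss(policy, obs, acts, advantages, logp_old, old_dist, beta=1.0):
    """Surrogate objective with a KL penalty instead of clipping (negated as a loss)."""
    dist = policy(obs)
    ratio = torch.exp(dist.log_prob(acts) - logp_old)
    kl = torch.distributions.kl_divergence(old_dist, dist)   # D_KL(pi_old || pi_theta) per state
    return -(ratio * advantages - beta * kl).mean()
```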