### Contents

- RL Overview: MDP, Trajectory, Infinite-horizon discounted return, Expected return, The RL problem, Value functions, Bellman equations, Advantage function
- Approaches to (model-free) RL: Q-Learning, Policy Optimisation
- Policy Gradient Algorithms: Policy gradient, Grad-Log-Prob of a Trajectory, Log derivative trick, Policy gradient estimate, "Overfitting" on the current policy, Expected Grad-Log-Prob Lemma, Baselines, Vanilla policy gradient algorithm, TRPO, PPO

### RL Overview

#### MDP

A 5-tuple $(\mathcal{S}, \mathcal{A}, R, P, \rho_0)$ where:

- $\mathcal{S}$ is the set of states

- $\mathcal{A}$ is the set of actions

- $R$ is the reward function, with $r_t = R(s_t, a_t, s_{t+1})$

- $P$ is the transition probability function, with $P(s' \mid s, a)$ the probability of moving to state $s'$ from state $s$ under action $a$

- $\rho_0$ is the starting state distribution

#### Trajectory
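
A sequence of states and actions, $\tau = (s_0, a_0, s_1, a_1, \ldots)$, with $s_0 \sim \rho_0$, actions sampled from the policy $a_t \sim \pi(\cdot \mid s_t)$, and transitions $s_{t+1} \sim P(\cdot \mid s_t, a_t)$.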

#### Infinite-horizon discounted return
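
The sum of rewards along a trajectory, discounted by how far in the future they occur:

$$R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t, \qquad \gamma \in (0, 1)$$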

#### Expected return
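
The return averaged over the trajectory distribution induced by the policy:

$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \right]$$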

#### The RL problem
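
Find the policy that maximises expected return:

$$\pi^* = \arg\max_\pi J(\pi)$$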

#### Value functions

The on-policy value function: $V^\pi(s) = \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s \right]$

The on-policy action-value function: $Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s, a_0 = a \right]$

#### Bellman equations
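
The value of a state (or state-action pair) is the expected immediate reward plus the discounted value of wherever you land next:

$$V^\pi(s) = \mathbb{E}_{a \sim \pi,\, s' \sim P}\left[ r(s, a) + \gamma V^\pi(s') \right]$$

$$Q^\pi(s, a) = \mathbb{E}_{s' \sim P}\left[ r(s, a) + \gamma\, \mathbb{E}_{a' \sim \pi}\left[ Q^\pi(s', a') \right] \right]$$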

#### Advantage function
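
How much better it is to take a specific action $a$ in state $s$ than to act according to the policy:

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$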

### Approaches to (model-free) RL

#### Q-Learning

- Aim = to learn a parameterised approximation to the optimal action-value function: $Q_\theta(s, a) \approx Q^*(s, a)$

- Action taken by the policy is given by: $a(s) = \arg\max_a Q_\theta(s, a)$

- Objective function based on the Bellman equation, e.g. a mean-squared Bellman error (see the sketch after this list)

- Optimisation typically off-policy
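
A minimal sketch of such a Bellman-error objective in the style of DQN, assuming hypothetical `q_net` and `target_net` modules that map a batch of states to per-action values, and a replay batch of tensors; none of these names come from the notes above.

```python
import torch
import torch.nn.functional as F

def q_learning_loss(q_net, target_net, batch, gamma=0.99):
    """Mean-squared Bellman error on an (off-policy) replay batch."""
    s, a, r, s_next, done = batch  # a: LongTensor of actions, done: float 0/1 terminal flags

    # Q_theta(s, a) for the actions actually taken
    q_pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    # Bellman target: r + gamma * max_a' Q_target(s', a'), with no bootstrap past terminal states
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values
        target = r + gamma * (1.0 - done) * q_next

    return F.mse_loss(q_pred, target)
```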

#### Policy Optimisation

- Aim = to learn a parameterised approximation to the optimal policy: $\pi_\theta \approx \pi^*$

- Also often involves learning a value function to act as a "critic"

- Done by gradient ascent on an estimate of $J(\pi_\theta)$ (or a surrogate objective)

- Optimisation typically on-policy

### Policy Gradient Algorithms

#### Policy gradient

The gradient of the policy performance, i.e. of the expected return:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \right]$$

#### Grad-Log-Prob of a Trajectory

Simply equals the sum of the action grad-log-probs, since the start-state distribution and transition probabilities don't depend on $\theta$:

$$\nabla_\theta \log P(\tau \mid \theta) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

#### Log derivative trick
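
For any parameterised distribution, $\nabla_\theta P_\theta(x) = P_\theta(x)\, \nabla_\theta \log P_\theta(x)$. Applied to the trajectory distribution, this lets us turn the gradient of the expected return into an expectation we can estimate from samples:

$$\nabla_\theta J(\pi_\theta) = \nabla_\theta \int P(\tau \mid \theta)\, R(\tau)\, d\tau = \int P(\tau \mid \theta)\, \nabla_\theta \log P(\tau \mid \theta)\, R(\tau)\, d\tau = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \nabla_\theta \log P(\tau \mid \theta)\, R(\tau) \right]$$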

#### Policy gradient estimate

We simply average, over the sampled trajectories, the return-to-go-weighted sum of the action grad-log-probs:

$$\hat{g} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{R}_t, \qquad \hat{R}_t = \sum_{t'=t}^{T} \gamma^{t'-t}\, r_{t'}$$
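
A minimal sketch of this estimate as a "pseudo-loss" whose gradient is $\hat{g}$, assuming a hypothetical policy network that returns a `torch.distributions` object; the names are illustrative, not from the notes above.

```python
import torch

def pg_pseudo_loss(policy, obs, actions, returns_to_go):
    """Negated pseudo-loss: differentiating it w.r.t. the policy parameters gives minus the PG estimate."""
    dist = policy(obs)                   # e.g. a torch.distributions.Categorical over actions
    log_probs = dist.log_prob(actions)   # log pi_theta(a_t | s_t)
    # Minimising the negative performs gradient ascent on expected return.
    return -(log_probs * returns_to_go).mean()
```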

#### "Overfitting" on the current policy

- Updating $\theta$ in the direction of the policy gradient gives us better parameters for the trajectories sampled using the current policy $\pi_\theta$.

- But it *does not account for the fact that the policy itself has changed*.

- Thus the new trajectories we sample will come from a different distribution to the one we optimised for.

- If we change $\theta$ too much, the update will only suit the old trajectory distribution and will not generalise to the new one.

#### Expected Grad-Log-Prob Lemma

The expected value of a grad-log-prob is 0:

$$\mathbb{E}_{x \sim P_\theta}\left[ \nabla_\theta \log P_\theta(x) \right] = \int P_\theta(x)\, \frac{\nabla_\theta P_\theta(x)}{P_\theta(x)}\, dx = \nabla_\theta \int P_\theta(x)\, dx = \nabla_\theta 1 = 0$$

#### Baselines

The expected grad-log-prob lemma results in:

$$\mathbb{E}_{a_t \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t) \right] = 0$$

as $b(s_t)$ doesn't depend on $a_t$.

This means we can replace the return-to-go $\hat{R}_t$ in our policy gradient estimate with $\hat{R}_t - b(s_t)$, or any other expression that leaves the expectation of the policy gradient unchanged.

Some potential choices for this weighting term are:

**The advantage function:** $A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$, typically estimated using the learned critic.

**Generalised advantage estimation (GAE):** a weighted sum of advantage estimates looking different numbers of steps into the future. Reduces the variance of the policy gradient estimate (see the sketch below).
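
A minimal sketch of the standard GAE($\lambda$) recursion, $\hat{A}_t = \delta_t + \gamma \lambda \hat{A}_{t+1}$ with TD error $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, for a single rollout with no mid-rollout terminations; the names are illustrative.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """GAE(lambda) advantages for one rollout.

    rewards: length-T array of rewards
    values:  length-(T+1) array of critic values, including the bootstrap value of the final state
    """
    T = len(rewards)
    advantages = np.zeros(T)
    next_advantage = 0.0
    for t in reversed(range(T)):
        # One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # GAE recursion: A_t = delta_t + gamma * lambda * A_{t+1}
        next_advantage = delta + gamma * lam * next_advantage
        advantages[t] = next_advantage
    return advantages
```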

#### Vanilla policy gradient algorithm

Repeat:

- Collect rollouts with the current policy

- Compute advantage estimates for each state visited

- Estimate the policy gradient

- Take a gradient ascent step on the policy parameters

- Fit the critic by minimising the MSE between its predictions and the rollout returns (one full update is sketched below)
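
A minimal sketch of one such update, assuming the rollout data has already been turned into tensors and advantage estimates (e.g. with the GAE sketch above); the module and optimiser names are illustrative.

```python
import torch.nn.functional as F

def vanilla_pg_update(policy, critic, policy_opt, critic_opt, obs, actions, advantages, returns):
    """One policy + critic update on a batch of rollout data from the current policy."""
    # Policy step: gradient ascent on the advantage-weighted log-probs
    dist = policy(obs)
    policy_loss = -(dist.log_prob(actions) * advantages).mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

    # Critic step: MSE between predicted values and the empirical rollout returns
    value_loss = F.mse_loss(critic(obs).squeeze(-1), returns)
    critic_opt.zero_grad()
    value_loss.backward()
    critic_opt.step()
```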

#### TRPO

- Vanilla policy gradient

- Plus a KL constraint at each update which bounds how far the policy can change

- Constraint approximated by a second-order Taylor expansion

- A conjugate-gradient / Hessian-vector-product trick is used to avoid computing the whole Hessian

- This effectively leads to following the "natural policy gradient"

- Uses a surrogate objective like PPO, but without clipping

#### PPO

PPO-clip:

Vanilla policy gradient, but using the clipped surrogate objective:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[ \min\left( r_t(\theta)\, \hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\right) \hat{A}_t \right) \right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

Note that the clipping only happens in the cases where we expect to *significantly improve* the policy: making positive-advantage actions more probable and negative-advantage ones less probable. Hence the objective gives no benefit from making the policy too much better than the old policy on the training data.
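
A minimal sketch of this clipped surrogate as a loss to minimise, assuming the old policy's log-probs were stored when the rollouts were collected; the names are illustrative.

```python
import torch

def ppo_clip_loss(policy, obs, actions, advantages, old_log_probs, clip_eps=0.2):
    """Negated PPO-clip surrogate objective."""
    dist = policy(obs)
    log_probs = dist.log_prob(actions)

    # Probability ratio r_t(theta) = pi_theta(a|s) / pi_theta_old(a|s)
    ratio = torch.exp(log_probs - old_log_probs)

    # Taking the elementwise min removes any benefit from pushing the ratio
    # outside [1 - eps, 1 + eps] in the direction that would improve the objective
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```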

PPO-penalty:

Approximately solves a KL-constrained update like TRPO, but penalises the KL-divergence in the objective function instead of making it a hard constraint.
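
In symbols, following the penalised objective from the PPO paper (with coefficient $\beta$ adapted during training):

$$L^{\text{KLPEN}}(\theta) = \mathbb{E}_t\left[ r_t(\theta)\, \hat{A}_t - \beta\, \mathrm{KL}\!\left[ \pi_{\theta_{\text{old}}}(\cdot \mid s_t),\ \pi_\theta(\cdot \mid s_t) \right] \right]$$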