
# Deep Reinforcement Learning

### RL Overview

#### MDP

A 5-tuple $\langle S, A, R, P, \rho_0 \rangle$ where:
1. $S$ is the set of states
1. $A$ is the set of actions
1. $R$ is the reward function, with $r_t = R(s_t, a_t, s_{t+1})$
1. $P$ is the transition probability function, with $P(s' \mid s, a)$ the probability of reaching $s'$ from $s$ under action $a$
1. $\rho_0$ is the starting state distribution
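The 5-tuple can be written down concretely as arrays. This is a hypothetical toy MDP (not from the notes), just to make the shapes of $P$, $R$ and $\rho_0$ explicit:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: S = {0, 1}, A = {0, 1}.
n_states, n_actions = 2, 2

# P[s, a, s'] = transition probability P(s' | s, a)
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0
    [[0.5, 0.5], [0.0, 1.0]],   # transitions from state 1
])

# R[s, a] = expected immediate reward for taking action a in state s
R = np.array([
    [0.0, 1.0],
    [0.5, 2.0],
])

# rho0[s] = starting-state distribution
rho0 = np.array([1.0, 0.0])

# Each P[s, a] must be a valid distribution over next states.
assert np.allclose(P.sum(axis=-1), 1.0)
```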

#### Value functions

The on-policy value function: $V^\pi(s) = \mathbb{E}_{\tau \sim \pi}\left[R(\tau) \mid s_0 = s\right]$
The on-policy action-value function: $Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi}\left[R(\tau) \mid s_0 = s, a_0 = a\right]$
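For a finite MDP and a fixed policy, $V^\pi$ can be computed exactly from the Bellman expectation equation $V = R_\pi + \gamma P_\pi V$. A sketch on a hypothetical 2-state example (the transition matrix and rewards are made up for illustration):

```python
import numpy as np

gamma = 0.9
P_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])   # P(s' | s) under a fixed policy pi
R_pi = np.array([1.0, 0.0])     # expected one-step reward per state under pi

# Solve (I - gamma * P_pi) V = R_pi for the on-policy values V^pi.
V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
print(V)  # V^pi(s) for s = 0, 1
```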

### Approaches to (model-free) RL

#### Q-Learning

1. Aim = to learn a parameterised approximation to the optimal action-value function: $Q_\theta(s, a) \approx Q^*(s, a)$
1. Action taken by the policy is given by: $a(s) = \arg\max_a Q_\theta(s, a)$
1. Objective function based on the Bellman equation: minimise the squared error between $Q_\theta(s, a)$ and the target $r + \gamma \max_{a'} Q_\theta(s', a')$
1. Optimisation is typically off-policy
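The Bellman-based objective can be sketched for a single transition. This assumes a hypothetical linear function approximator `q_net` (a stand-in for whatever network is used) and a separate target parameter set, as in DQN-style training:

```python
import numpy as np

def q_net(theta, s):
    """Hypothetical linear Q-approximator: Q_theta(s, .) = theta @ s."""
    return theta @ s

def td_target(theta_target, r, s_next, done, gamma=0.99):
    """One-step target from the Bellman equation:
    y = r + gamma * max_a' Q_target(s', a'), with no bootstrap if done."""
    return r + gamma * (1.0 - done) * np.max(q_net(theta_target, s_next))

def q_loss(theta, theta_target, s, a, r, s_next, done):
    """Squared Bellman error for one (s, a, r, s', done) transition."""
    y = td_target(theta_target, r, s_next, done)
    return 0.5 * (q_net(theta, s)[a] - y) ** 2
```

Because the targets are built from stored transitions, this loss can be minimised off-policy from a replay buffer.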

#### Policy Optimisation

1. Aim = to learn a parameterised approximation to the optimal policy: $\pi_\theta \approx \pi^*$
1. Also often involves learning a value function $V_\phi$ to act as a "critic"
1. Done by gradient ascent on an estimate of $J(\pi_\theta)$ (or a surrogate objective)
1. Optimisation is typically on-policy

The gradient of the policy performance, i.e. the expected return $J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$:

simply equals the expected return-weighted sum of action grad-log-probs:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, R(\tau)\right]$$

#### Log derivative trick

The identity $\nabla_\theta p_\theta(x) = p_\theta(x) \nabla_\theta \log p_\theta(x)$ turns the gradient of an expectation into an expectation of grad-log-probs, which we can estimate by sampling. In practice we simply average, over sampled trajectories, the return-to-go-weighted sum of the action grad-log-probs:

$$\hat{g} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, \hat{R}_t$$

where $\hat{R}_t$ is the return-to-go from step $t$.
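This estimator (essentially REINFORCE with reward-to-go) can be sketched for a softmax policy over discrete actions; the function names here are illustrative, not from the notes:

```python
import numpy as np

def softmax(z):
    z = z - z.max()           # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def grad_log_prob(theta, a):
    """grad_theta log pi_theta(a) for a softmax policy with logits theta:
    one-hot(a) minus the probability vector."""
    g = -softmax(theta)
    g[a] += 1.0
    return g

def reinforce_gradient(theta, trajectory, gamma=0.99):
    """Policy gradient estimate from one trajectory of (action, reward)
    pairs: sum_t grad log pi(a_t) * return-to-go from t."""
    rewards = np.array([r for _, r in trajectory], dtype=float)
    # discounted return-to-go: G_t = sum_{k >= t} gamma^(k-t) r_k
    G = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    grad = np.zeros_like(theta)
    for (a, _), g_t in zip(trajectory, G):
        grad += grad_log_prob(theta, a) * g_t
    return grad
```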

#### "Overfitting" on the current policy

1. Updating $\theta$ in the direction of the policy gradient gives us better parameters for the trajectories sampled using the current policy $\pi_\theta$.
1. But it does not account for the fact that the policy itself has changed too.
1. Thus the new trajectories we sample will come from a different distribution to the one we optimised for.
1. If we change $\theta$ too much, it will only be suitable for the old trajectory distribution and will not generalise to the new one.

The expected value of a grad-log-prob is 0:

$$\mathbb{E}_{x \sim p_\theta}\left[\nabla_\theta \log p_\theta(x)\right] = 0$$
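The lemma is easy to check numerically. A sketch for a softmax distribution over three outcomes (a hypothetical toy setup), where $\nabla_\theta \log p_\theta(x) = \mathrm{onehot}(x) - p$:

```python
import numpy as np

rng = np.random.default_rng(0)

# A fixed softmax distribution p_theta over 3 outcomes.
theta = np.array([0.3, -1.2, 0.7])
p = np.exp(theta - theta.max())
p /= p.sum()

# Sample x ~ p_theta and average grad_theta log p_theta(x) = onehot(x) - p.
xs = rng.choice(3, size=200_000, p=p)
grads = np.eye(3)[xs] - p
print(grads.mean(axis=0))  # close to [0, 0, 0] up to sampling noise
```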

#### Baselines

The expected grad-log-prob lemma results in:

$$\mathbb{E}_{a_t \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t) \, b(s_t)\right] = 0$$

as $b(s_t)$ doesn't depend on $a_t$.
This means we can replace $\hat{R}_t$ in our policy gradient estimate with $\hat{R}_t - b(s_t)$, or any other expression that leaves the expectation of the policy gradient unchanged.
Some potential choices of $b(s_t)$ are:
1. Generalised advantage estimation: a weighted sum of advantage functions looking different numbers of steps into the future. Reduces the variance of the policy gradient estimate.
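GAE itself is a short computation: an exponentially-weighted ($\lambda$) sum of one-step TD residuals $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$. A minimal sketch over a single rollout, assuming critic values are already available:

```python
import numpy as np

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalised advantage estimation over one rollout.

    A_t = sum_{k >= 0} (gamma * lam)^k * delta_{t+k},
    where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    `values` holds V(s_t) per step; `last_value` is the V(s_T) bootstrap.
    """
    v_next = np.append(values[1:], last_value)
    deltas = rewards + gamma * v_next - values
    adv = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv
```

Setting `lam=0` recovers the one-step TD residual (low variance, high bias); `lam=1` recovers the full Monte Carlo advantage (high variance, low bias).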

Repeat:
1. Collect rollouts with the current policy
1. Compute advantage estimates for each state
1. Optimise the policy via the advantage-weighted policy gradient
1. Optimise the critic using the MSE between its predictions and the rollout values
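The critic step of the loop above is plain regression. A minimal sketch for a hypothetical linear critic $V_\phi(s) = \phi^\top s$ (the targets would be rollout returns or TD targets):

```python
import numpy as np

def critic_mse_step(phi, states, targets, lr=1e-2):
    """One gradient step on the critic loss
    L(phi) = mean_i (V_phi(s_i) - target_i)^2
    for a linear critic V_phi(s) = phi @ s."""
    preds = states @ phi
    err = preds - targets
    grad = 2.0 * states.T @ err / len(targets)   # dL/dphi
    return phi - lr * grad
```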

#### TRPO

• Plus a KL constraint at each update which bounds how far the policy can change
• The constraint is approximated by a second-order Taylor expansion
• A conjugate-gradient trick using Hessian-vector products is used to avoid computing the whole Hessian
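The trick in the last bullet: TRPO's natural-gradient step requires solving $H x = g$ (with $H$ the KL Hessian), and conjugate gradient only ever needs $H v$ products, never $H$ itself. A minimal sketch, with `hvp` standing in for however those products are computed in practice:

```python
import numpy as np

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Solve H x = g using only Hessian-vector products hvp(v) = H @ v,
    so the full Hessian H is never formed."""
    x = np.zeros_like(g)
    r = g.copy()          # residual g - H x (x starts at 0)
    p = r.copy()
    rr = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rr / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rr_new = r @ r
        if rr_new < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x
```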