
PrefixRL

Title
PrefixRL: Optimization of Parallel Prefix Circuits using Deep Reinforcement Learning
Authors
Rajarshi Roy, Jonathan Raiman, Neel Kant, Ilyas Elkin, Robert Kirby, Michael Siu, Stuart Oberman, Saad Godil, Bryan Catanzaro
Date
2022
Venue
DAC (Design Automation Conference)
Keywords

Intro

  • “The optimization of prefix circuits is challenging as their large design space grows exponentially with input length and is intractable to enumerate” → search quickly becomes intractable
  • Existing approaches rely on hand-crafted heuristics, which yield suboptimal circuits

Background

Prefix graphs

  • Note: prefix sum = Haskell scan (cf. scanl1); see the sketch after this list
  • Can be generalised to any binary (2-arg) associative operator
  • Easily parallelisable
  • Computation can be described as a graph where:
    • Node (i:j) is the sum of elements j to i (i ≥ j); the inputs are the nodes (i:i)
    • The upper parent of (i:j) is (i:k) and the lower parent is (k−1:j), for some j < k ≤ i
    • Each node (other than the inputs) has an upper and a lower parent
    • This defines the constraints of the computational graph we wish to find and turn into a circuit
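
A serial reference for the prefix scan, with any associative binary operator, is just `itertools.accumulate` in Python; the parallel prefix graph computes the same outputs with a tree of 2-input nodes:

```python
from itertools import accumulate
import operator

# Serial reference for a prefix scan: out[i] = x[0] op x[1] op ... op x[i].
# Works for any associative binary operator; a parallel prefix circuit
# computes the same outputs with a tree of 2-input "op" nodes.
def prefix_scan(xs, op):
    return list(accumulate(xs, op))

xs = [3, 1, 4, 1, 5]
print(prefix_scan(xs, operator.add))  # [3, 4, 8, 9, 14]  (prefix sums)
print(prefix_scan(xs, max))           # [3, 3, 4, 4, 5]   (running max)
```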

RL

  • Our target for the Q update is the observed reward, plus our estimate for subsequent returns
  • This is calculated by taking the best possible action (via argmax) and putting it into the Q-function (along with the next state in the observed trajectory)
  • Problem:
    • The Q-function selects the best action and gives the value of it
    • Q estimates are noisy, so some actions get overestimated - taking the max means we both select those actions and evaluate the target with the same overestimate, biasing targets upwards
    • By learning two separate Q functions we can decouple this - one network selects the action, the other evaluates it, so selection and evaluation no longer share the same estimation error (see the sketch after this list)
    • The evaluation (target) network is updated less frequently, so the two stay sufficiently different
  • Why not just use the action that was actually taken (rather than the argmax)? Then we’re on-policy (see pros/cons of on vs off-policy)
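
A minimal numpy sketch of the vanilla vs. double-DQN targets (not the paper's code; `q_online` / `q_target` are assumed callables returning per-action value estimates, and terminal-state masking is omitted):

```python
import numpy as np

def dqn_target(r, s_next, q_target, gamma=0.99):
    # Vanilla DQN: one network both selects and evaluates the next action,
    # so the max is taken over noisy estimates -> systematic overestimation.
    return r + gamma * np.max(q_target(s_next))

def double_dqn_target(r, s_next, q_online, q_target, gamma=0.99):
    # Double DQN: the online network selects the action (argmax), the slowly
    # updated target network evaluates it, decoupling the two sources of error.
    a_star = int(np.argmax(q_online(s_next)))
    return r + gamma * q_target(s_next)[a_star]
```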

Prefix + RL

State space:
  • the current prefix graph, encoded as a grid (adjacency-matrix-style) with per-node graph statistics as features, fed to a CNN (see Questions below)
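
A sketch of what such a grid encoding could look like, assuming one channel for node presence and one for a per-node statistic such as depth; the paper's exact channels/features may differ:

```python
import numpy as np

def encode_state(nodes, levels, n):
    """Encode a prefix graph as a (channels, N, N) tensor for a CNN.

    nodes:  set of (msb, lsb) pairs present in the graph
    levels: dict mapping (msb, lsb) -> depth of that node
    (Assumed features; the paper's exact channels may differ.)
    """
    state = np.zeros((2, n, n), dtype=np.float32)
    for (i, j) in nodes:
        state[0, i, j] = 1.0              # node-presence channel
        state[1, i, j] = levels[(i, j)]   # per-node graph statistic (depth)
    return state
```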
Action space:
  • add or delete any non-input/output node
    • post-action “legalisation procedure” to add/remove nodes to satisfy constraints, or forbid actions in the first place
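
A sketch of one plausible legalisation rule (an assumption, not necessarily the paper's exact procedure): force the input/output nodes to be present and, for each non-input node, add its implied lower parent if missing, repeating to a fixpoint.

```python
def legalize(nodes, n):
    """Make a set of (msb, lsb) nodes into a valid prefix graph.

    Assumed convention: node (i, j) combines inputs j..i; its upper parent is
    the nearest existing node (i, k) with k > j in the same row, so its lower
    parent is (k - 1, j), which must then also be present.
    """
    nodes = set(nodes)
    nodes |= {(i, i) for i in range(n)}   # input nodes are always present
    nodes |= {(i, 0) for i in range(n)}   # output nodes (prefixes 0..i) too
    changed = True
    while changed:
        changed = False
        for (i, j) in sorted(nodes):
            if i == j:
                continue                  # inputs have no parents
            k = min(k2 for (i2, k2) in nodes if i2 == i and k2 > j)
            lower = (k - 1, j)
            if lower not in nodes:
                nodes.add(lower)          # missing lower parent: add and re-check
                changed = True
    return nodes
```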
Reward:
  • Vector of area and delay reduction
  • DDQN adapted (scalarised) to handle this
  • We can vary the area/delay weighting w to create a Pareto front of designs
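
A sketch of the scalarisation idea, assuming the Q-network has separate area and delay heads that are combined with a weight w (the exact formulation may differ from the paper's):

```python
import numpy as np

def scalarized_q(q_area, q_delay, w):
    # Combine per-objective Q estimates into a single value per action;
    # sweeping w over [0, 1] trades area against delay and traces out a
    # Pareto front of circuits.
    return w * q_area + (1.0 - w) * q_delay

# Greedy action for a given trade-off weight (hypothetical numbers).
q_area  = np.array([0.2, 0.5, 0.1])
q_delay = np.array([0.4, 0.1, 0.3])
a = int(np.argmax(scalarized_q(q_area, q_delay, w=0.7)))
```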

Questions

  • Why not use a GNN → they’re putting an adjacency matrix into a CNN! The features are just the node’s graph statistics
  • Is area & delay exactly what we want to optimise?
  • At a glance, looks like they could train for a bit longer?
  • How would this run on the IPU?
    • How large are the models?
    • How large is the replay buffer? 400k elements. If each is a transition (s, a, r, s′), most of this is taken up by the state, which is N×N bytes (assuming a byte per grid cell). For N=32 this gives ~1024 bytes per buffer element, so ~400 MB. So we can fit this on-chip, but only one and it'll be a bit of a squeeze. (Worked numbers in the sketch after this list.)
    • Do you need a replay buffer? I feel like we could use on-policy? I used PPO for Snowflake which worked fine. Update: the reward calculation is expensive (each design has to be synthesised to measure area/delay), so they want to be able to re-use experience
    • Looks like simulator will be the bottleneck, so I think we may as well stay in the off-policy regime (especially as it improves sample-efficiency, meaning less simulation needed!)
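
Back-of-the-envelope for the buffer estimate above (one byte per grid cell is an assumption):

```python
elements = 400_000            # replay buffer size from the note above
n = 32                        # adder width
state_bytes = n * n           # one byte per grid cell (assumed encoding)
print(f"{elements * state_bytes / 1e6:.0f} MB")  # ~410 MB, dominated by stored states
```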