Intro
- “The optimization of prefix circuits is challenging as their large design space grows exponentially with input length and is intractable to enumerate” → search quickly becomes intractable
- Existing approaches rely on hand-crafted heuristics, which yield suboptimal circuits
Background
Prefix graphs
- Note: prefix sum = Haskell’s scan
- Generalises to any binary (two-argument) associative operator
- Easily parallelisable, precisely because associativity lets sub-ranges be computed independently and combined (sketch after this list)
- The computation can be described as a graph where:
- node (i:j) is the sum of elements i down to j (with i ≥ j)
- the upper parent of (i:j) is (i:k) and the lower parent is (k-1:j), for some k with j < k ≤ i
- each non-input node has exactly one upper parent and one lower parent
- This defines the constraints of the computational graph we wish to find and turn into a circuit
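A minimal sketch of the scan idea above (plain Python, illustrative only): itertools.accumulate is a sequential prefix scan over any associative operator, and associativity is what lets the same result be computed by splitting and merging sub-ranges, which is exactly the freedom the prefix graph encodes.

```python
# Sequential prefix scan over an arbitrary associative binary operator.
from itertools import accumulate
from operator import add

xs = [3, 1, 4, 1, 5]
print(list(accumulate(xs, add)))  # [3, 4, 8, 9, 14]  running sums
print(list(accumulate(xs, max)))  # [3, 3, 4, 4, 5]   running maxima

# Associativity means any prefix (i:1) can be split as
# combine((i:k), (k-1:1)), so sub-ranges can be computed independently
# and merged - the structure the prefix graph describes.
```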
RL
- Recap on Double Q-learning (https://paperswithcode.com/method/double-q-learning)
- Our target for the Q update is the observed reward, plus our estimate for subsequent returns
- This is calculated by taking the best possible action (via argmax) and feeding it, along with the next state from the observed trajectory, into the Q-function (see the sketch after this list)
- Problem:
- The same Q-function both selects the best action and provides its value estimate
- Q-estimates are noisy, so some actions are naturally overestimated; using one network for both steps means the argmax preferentially picks those overestimated actions and then values them with the same inflated estimate
- By learning two separate Q-functions we can decouple this: one network selects the action (possibly an overestimate), and the other evaluates that action (its errors are independent, so not systematically an overestimate)
- The evaluation (target) network is updated less frequently so the two stay sufficiently different
- Why not just use the action that was actually taken (rather than the argmax)? Then we’d be on-policy (SARSA-style) - see pros/cons of on- vs off-policy
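A minimal, non-authoritative sketch of the Double-DQN-style target described above (PyTorch; the names q_online / q_target and gamma=0.99 are my own, not from the paper):

```python
import torch

def double_dqn_target(q_online, q_target, reward, next_state, done, gamma=0.99):
    """Double DQN target: the online network selects the action (argmax),
    the less-frequently-updated target network evaluates it. This decouples
    action selection from evaluation and reduces overestimation bias."""
    with torch.no_grad():
        # Selection: argmax over actions using the online network.
        best_action = q_online(next_state).argmax(dim=-1, keepdim=True)
        # Evaluation: value of that action under the target network.
        next_value = q_target(next_state).gather(-1, best_action).squeeze(-1)
    return reward + gamma * (1.0 - done) * next_value
```

Vanilla DQN would instead take the max of q_target(next_state) directly, so the same (possibly overestimated) values are used for both selection and evaluation.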
Prefix + RL
State space: the current prefix graph, encoded as a grid/adjacency-style matrix with per-node features (see the GNN question below)
Action space:
- add or delete any non-input/output node
- post-action “legalisation procedure” to add/remove nodes to satisfy constraints, or forbid actions in the first place
Reward:
- Vector of area and delay reduction
- DDQN adapted (scalarised) to handle this (sketch after this list)
- We can vary the scalarisation weight to trace out a Pareto front of designs
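A sketch of the reward scalarisation, assuming a simple weighted combination; the weight name w and the exact form are my assumptions, the notes only say the vector reward is scalarised:

```python
def scalarised_reward(area_reduction: float, delay_reduction: float, w: float) -> float:
    """Combine the two reward components into one scalar; sweeping
    w in [0, 1] across training runs traces out a Pareto front of
    area/delay trade-offs."""
    assert 0.0 <= w <= 1.0
    return w * area_reduction + (1.0 - w) * delay_reduction
```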
Questions
- Why not use a GNN? → they’re putting an adjacency matrix into a CNN! The features are just each node’s graph statistics
- Is area & delay exactly what we want to optimise?
- At a glance, looks like they could train for a bit longer?
- How would this run on the IPU?
- How large are the models?
- How large is the replay buffer? 400k elements. If each is a transition (state, action, reward, next state), most of the space is taken up by the state, which is N × N bytes. For N = 32 this gives ~1024 bytes per buffer element, so ~400 MB total (back-of-envelope check at the end of this list). So we can fit this on-chip, but only one copy, and it’ll be a bit of a squeeze.
- Do you need a replay buffer? Feels like on-policy could work - I used PPO for Snowflake and it worked fine. Update: the reward calculation is expensive, so they want to be able to re-use experience
- Looks like the simulator will be the bottleneck, so I think we may as well stay in the off-policy regime (especially as it improves sample efficiency, meaning less simulation needed!)
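Back-of-envelope check of the replay-buffer estimate above (one byte per grid cell is my assumption):

```python
N = 32                        # input length; state is an N x N grid
state_bytes = N * N           # ~1024 bytes per element, dominated by the state
buffer_elems = 400_000
print(buffer_elems * state_bytes / 1e6)  # ~410 MB, roughly the 400 MB quoted above
```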