
PrefixRL

Title
PrefixRL: Optimization of Parallel Prefix Circuits using Deep Reinforcement Learning
Authors
Rajarshi Roy, Jonathan Raiman, Neel Kant, Ilyas Elkin, Robert Kirby, Michael Siu, Stuart Oberman, Saad Godil, Bryan Catanzaro
Date
2022
Venue
DAC (Design Automation Conference)
Keywords

Intro

  • “The optimization of prefix circuits is challenging as their large design space grows exponentially with input length and is intractable to enumerate” → search quickly becomes intractable
  • Existing approaches rely on hand-crafted heuristics, which yield suboptimal circuits

Background

Prefix graphs

  • Note: prefix sum = Haskell scan (cf. scanl1); see the sketch after this list
  • Can be generalised to any binary (2-arg) associative operator
  • Easily parallelisable
  • Computation can be described as a graph where:
    • Node (i:j) is the sum of elements j to i (i ≥ j); the inputs are the nodes (i:i)
    • The upper parent of (i:j) is (i:k) and the lower parent is (k−1:j), for some j < k ≤ i
    • Each node (other than the inputs) has an upper and a lower parent
    • This defines the constraints of the computational graph we wish to find and turn into a circuit
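
A serial reference for the prefix scan, with any associative binary operator, is just `itertools.accumulate` in Python; the parallel prefix graph computes the same outputs with a tree of 2-input nodes:

```python
from itertools import accumulate
import operator

# Serial reference for a prefix scan: out[i] = x[0] op x[1] op ... op x[i].
# Works for any associative binary operator; a parallel prefix circuit
# computes the same outputs with a tree of 2-input "op" nodes.
def prefix_scan(xs, op):
    return list(accumulate(xs, op))

xs = [3, 1, 4, 1, 5]
print(prefix_scan(xs, operator.add))  # [3, 4, 8, 9, 14]  (prefix sums)
print(prefix_scan(xs, max))           # [3, 3, 4, 4, 5]   (running max)
```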

RL

  • Our target for the Q update is the observed reward, plus our estimate for subsequent returns
  • This is calculated by taking the best possible action (via argmax) and putting it into the Q-function (along with the next state in the observed trajectory)
  • Problem:
    • The Q-function selects the best action and gives the value of it
    • Q estimates are noisy, so some actions get overestimated - taking the max means we both select those actions and evaluate the target with the same overestimate, biasing targets upwards
    • By learning two separate Q functions we can decouple this - one network selects the action, the other evaluates it, so selection and evaluation no longer share the same estimation error (see the sketch after this list)
    • The evaluation (target) network is updated less frequently, so the two stay sufficiently different
  • Why not just use the action that was actually taken (rather than the argmax)? Then we’re on-policy (see pros/cons of on vs off-policy)
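
A minimal numpy sketch of the vanilla vs. double-DQN targets (not the paper's code; `q_online` / `q_target` are assumed callables returning per-action value estimates, and terminal-state masking is omitted):

```python
import numpy as np

def dqn_target(r, s_next, q_target, gamma=0.99):
    # Vanilla DQN: one network both selects and evaluates the next action,
    # so the max is taken over noisy estimates -> systematic overestimation.
    return r + gamma * np.max(q_target(s_next))

def double_dqn_target(r, s_next, q_online, q_target, gamma=0.99):
    # Double DQN: the online network selects the action (argmax), the slowly
    # updated target network evaluates it, decoupling the two sources of error.
    a_star = int(np.argmax(q_online(s_next)))
    return r + gamma * q_target(s_next)[a_star]
```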

Prefix + RL

State space:
  • the current prefix graph, encoded as a grid (adjacency-matrix-style) with per-node graph statistics as features, fed to a CNN (see Questions below)
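
A sketch of what such a grid encoding could look like, assuming one channel for node presence and one for a per-node statistic such as depth; the paper's exact channels/features may differ:

```python
import numpy as np

def encode_state(nodes, levels, n):
    """Encode a prefix graph as a (channels, N, N) tensor for a CNN.

    nodes:  set of (msb, lsb) pairs present in the graph
    levels: dict mapping (msb, lsb) -> depth of that node
    (Assumed features; the paper's exact channels may differ.)
    """
    state = np.zeros((2, n, n), dtype=np.float32)
    for (i, j) in nodes:
        state[0, i, j] = 1.0              # node-presence channel
        state[1, i, j] = levels[(i, j)]   # per-node graph statistic (depth)
    return state
```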
Action space:
  • add or delete any non-input/output node
    • post-action “legalisation procedure” to add/remove nodes to satisfy constraints, or forbid actions in the first place
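
A sketch of one plausible legalisation rule (an assumption, not necessarily the paper's exact procedure): force the input/output nodes to be present and, for each non-input node, add its implied lower parent if missing, repeating to a fixpoint.

```python
def legalize(nodes, n):
    """Make a set of (msb, lsb) nodes into a valid prefix graph.

    Assumed convention: node (i, j) combines inputs j..i; its upper parent is
    the nearest existing node (i, k) with k > j in the same row, so its lower
    parent is (k - 1, j), which must then also be present.
    """
    nodes = set(nodes)
    nodes |= {(i, i) for i in range(n)}   # input nodes are always present
    nodes |= {(i, 0) for i in range(n)}   # output nodes (prefixes 0..i) too
    changed = True
    while changed:
        changed = False
        for (i, j) in sorted(nodes):
            if i == j:
                continue                  # inputs have no parents
            k = min(k2 for (i2, k2) in nodes if i2 == i and k2 > j)
            lower = (k - 1, j)
            if lower not in nodes:
                nodes.add(lower)          # missing lower parent: add and re-check
                changed = True
    return nodes
```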
Reward:
  • Vector of area and delay reduction
  • DDQN adapted (scalarised) to handle this
  • We can vary the area/delay weighting w to create a Pareto front of designs
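
A sketch of the scalarisation idea, assuming the Q-network has separate area and delay heads that are combined with a weight w (the exact formulation may differ from the paper's):

```python
import numpy as np

def scalarized_q(q_area, q_delay, w):
    # Combine per-objective Q estimates into a single value per action;
    # sweeping w over [0, 1] trades area against delay and traces out a
    # Pareto front of circuits.
    return w * q_area + (1.0 - w) * q_delay

# Greedy action for a given trade-off weight (hypothetical numbers).
q_area  = np.array([0.2, 0.5, 0.1])
q_delay = np.array([0.4, 0.1, 0.3])
a = int(np.argmax(scalarized_q(q_area, q_delay, w=0.7)))
```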

Questions

  • Why not use a GNN → they’re putting an adjacency matrix into a CNN! The features are just the node’s graph statistics
  • Is area & delay exactly what we want to optimise?
  • At a glance, looks like they could train for a bit longer?
  • How would this run on the IPU?
    • How large are the models?
    • How large is the replay buffer? 400k elements. If each is a transition (s, a, r, s′), most of this is taken up by the state, which is N×N bytes (assuming a byte per grid cell). For N=32 this gives ~1024 bytes per buffer element, so ~400 MB. So we can fit this on-chip, but only one and it'll be a bit of a squeeze. (Worked numbers in the sketch after this list.)
    • Do you need a replay buffer? I feel like we could use on-policy? I used PPO for Snowflake which worked fine. Update: the reward calculation is expensive (each design has to be synthesised to measure area/delay), so they want to be able to re-use experience
    • Looks like simulator will be the bottleneck, so I think we may as well stay in the off-policy regime (especially as it improves sample-efficiency, meaning less simulation needed!)
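
Back-of-the-envelope for the buffer estimate above (one byte per grid cell is an assumption):

```python
elements = 400_000            # replay buffer size from the note above
n = 32                        # adder width
state_bytes = n * n           # one byte per grid cell (assumed encoding)
print(f"{elements * state_bytes / 1e6:.0f} MB")  # ~410 MB, dominated by stored states
```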