🍟

Chip Placement with DRL

Title
Chip Placement with Deep Reinforcement Learning
Authors
Azalia Mirhoseini, Anna Goldie, Mustafa Yazgan, Joe Jiang, Ebrahim Songhori, Shen Wang, Young-Joon Lee, Eric Johnson, Omkar Pathak, Sungmin Bae, Azade Nazi, Jiwoo Pak, Andy Tong, Kavya Srinivasa, William Hang, Emre Tuncer, Anand Babu, Quoc V. Le, James Laudon, Richard Ho, Roger Carpenter, Jeff Dean
Date
2021
Venue
Nature
DBLP
Keywords
rl
gnn
chip-placement


My Summary

The high-level problem here is that chip placement currently requires a lot of manual work, and evaluating the true objective (via slow commercial EDA tools) is expensive, taking hours to days per design. This method gives a much faster approach using a neat approximation.
More broadly, the chip placement problem is an optimisation problem—specifically, a non-differentiable discrete constrained multi-objective one. That it is non-differentiable means that we can't just use standard gradient-based methods to optimise the objective(s).
The authors' solution is to turn this into an RL problem - each discrete step (placing the next macro, in decreasing order of size) becomes an action, the state is the current chip layout, and the reward is some approximation of our objectives.
That this appears to work well suggests similar problems could also have RL solutions. This is the most interesting aspect of the paper, and could make RL very real-world-useful for many non-differentiable optimisation problems!
To train the model, the authors use two steps:
  1. Train a value network on a generated dataset of varied chip placements, using supervised learning with the placement reward (proxy wirelength and congestion) as the target. This produces the graph embeddings used in step 2 ⬇️
  2. Train an RL policy to output macro placements. The reward is a fast approximation of the true reward.
The architecture diagram here is very helpful in understanding how this all works:
[figure: architecture diagram]
In terms of the architecture, a (single layer?) MPNN is used to learn the graph embeddings, and deconv nets are used to output the action.
The authors claim that the GNN architecture is novel, but it seems extremely standard to me.
Step 2) is done on training tasks, and can then be zero-shot transferred or fine-tuned on a subsequent chip/task.
I don't know enough about the domain to judge the baseline comparisons well, but what can be said is that a) zero-shot performance appears to be really good - we can generalise really well to unseen netlists (thanks GNNs!), b) fine-tuning can give fairly substantial boosts.
In terms of real-world use, with fine-tuning this method takes about 6 hours on a new task to generate its best chip placement, versus weeks for standard human-in-the-loop methods. This appears pretty promising. The authors suggest their approach is competitive with existing non-ML methods. I'm a little sceptical, and Yann LeCun has posted ⬇️, so I'm reserving judgement for now.

Introduction

Context

  • End of Moore's law / Dennard scaling means specialist AI hardware v. valuable
  • AI chips currently take years to design ➡️ we have to try and anticipate requirements of models years in advance
  • Aim = shortening development cycle
  • If we can use AI for chip design, will create virtuous circle

Problem

Task: place a netlist graph (connectivity of the circuit) of macros (e.g. SRAMs) and standard cells (logic gates) onto a chip canvas (bounded 2D space)
Objective: optimize PPA (power, performance, area), while adhering to placement constraints

Current Approach

  • Human experts still required, many weeks of iteration
  • Huge netlist graphs (millions to billions of nodes)
  • ⬆️ cost of computing true target metrics (hours to days for a single design)

Proposed Approach

Pose chip placement as RL problem
Key motivation for using RL: our (multi-) objective is non-differentiable
Iterations of training ➡️ for each one, all of the macros are placed by the agent, then the standard cells are placed by a secondary (non-learned) method
Fast approximate reward used

Generalisation

First placement approach which can generalise to new unseen netlists
As agent is exposed to more different chips, becomes better and better on new ones

Contribution

  1. Placement method that learns from experience
  2. Superior to baselines on real AI chips, and comparable to human experts

Related Work

Categories of previous approaches

  1. partitioning-based methods (divide-and-conquer)
  2. stochastic/hill-climbing (simulated annealing)
  3. analytic solvers (e.g. non-linear optimisers) inc. SOTA RePlAce

Difference of proposed approach

Like analytic solvers, it uses gradient updates to optimise the objective
However, optimisation is no longer from scratch - it leverages knowledge from prior placements on different chips (domain adaptation)
Unlike other approaches, optimises target metrics directly without having to make convex approximations

Method

Problem statement

Map the nodes of a netlist (connectivity of the circuit) onto a chip canvas (bounded 2D space), such that PPA (power, performance, area) is optimised.

Overview

MDP:
States: possible partial placements of the netlist on the canvas. A concatenation of features including:
  1. graph embedding of the netlist (inc. placed & unplaced nodes)
  2. node embedding of the current macro
  3. netlist metadata
  4. mask representing the feasibility of placing the current node onto each cell of the grid
Actions: (given the current macro to place) the set of locations in the canvas space (discrete grid cells) the macro can be placed, without violating any hard density constraints
State transition: deterministic - the chosen macro simply ends up at the chosen grid cell
Reward:
True reward: output from slow commercial EDA tools
Proxy reward: 0 for every action except the last, which receives the negative weighted sum of proxy wirelength and congestion (sketched after this list)
T = number of macros in netlist
Algorithm: PPO
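To make the MDP concrete, here is a minimal sketch of one episode under my reading of the paper. The grid size, the half-perimeter wirelength (HPWL) proxy, the congestion stand-in and all names below are my own illustrative choices, not the authors' implementation:

```python
import numpy as np

GRID = 32  # illustrative canvas discretisation; the paper chooses the grid per netlist

def hpwl(pins):
    """Half-perimeter wirelength of one net's placed pins - a common wirelength
    proxy, used here as a stand-in for the paper's exact proxy."""
    if len(pins) < 2:
        return 0.0
    xs, ys = zip(*pins)
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def run_episode(macros, nets, policy, lam=0.5):
    """One episode: place macros one per step; reward is 0 until the final step,
    which receives -(wirelength + lam * congestion) as the proxy for PPA."""
    placements, density = {}, np.zeros((GRID, GRID))
    for m in macros:                           # episode length T = number of macros
        mask = density < 1.0                   # feasibility mask from the hard density constraint
        x, y = policy(placements, m, mask)     # action = a grid cell for the current macro
        placements[m] = (x, y)
        density[x, y] += 1.0                   # deterministic state transition
    # (standard cells would now be placed by a separate, non-learned method)
    wirelength = sum(hpwl([placements[p] for p in net]) for net in nets)
    congestion = float(density.max())          # crude stand-in for the routing-congestion proxy
    return -(wirelength + lam * congestion)    # terminal proxy reward

# toy usage: a random policy that respects the feasibility mask
rng = np.random.default_rng(0)
def random_policy(placements, macro, mask):
    free = np.argwhere(mask)                   # feasible cells only
    return tuple(free[rng.integers(len(free))])

print(run_episode(["m0", "m1", "m2"], [["m0", "m1"], ["m1", "m2"]], random_policy))
```

In the real method a learned policy network replaces `random_policy`, sampling a cell from the masked distribution it outputs.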

Cost function

Average of returns over all netlists in the dataset, given the current policy:
$J(\theta, G) = \frac{1}{K} \sum_{g \sim G} E_{g, p \sim \pi_\theta}\left[R_{p,g}\right]$
where $G$ is the dataset of netlist graphs of size $K$, and $R_{p,g}$ is the episode reward for placement $p$ of netlist $g$.
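In practice this objective is just the empirical mean of episode rewards over a batch of netlists; a tiny sketch, reusing the hypothetical `run_episode` helper from the MDP sketch above:

```python
def estimate_objective(netlists, policy):
    """Monte-Carlo estimate of J(theta, G): average episode reward over the K netlists."""
    returns = [run_episode(macros, nets, policy) for macros, nets in netlists]
    return sum(returns) / len(returns)
```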

Architecture

[figure: overall model architecture]

Supervised embedding learning

Why? We can train this easily across many different netlists/placements, giving strong generalisation.
The ValueNet is trained (supervised) first, which enables the learning of the embeddings:
Dataset = 10,000 chip placements; input = state, label = reward for that placement (wirelength and congestion) ← built by picking 5 accelerator netlists and then generating 2,000 placements of varying quality for each using a basic policy network.
GraphConv: standard MPNN:
embeddings = 32-dim
$e_{ij} = fc_1(\mathrm{concat}(fc_0(v_i) \,|\, fc_0(v_j) \,|\, w^e_{ij}))$
$v_i = \mathrm{mean}_{j \in N(v_i)}(e_{ij})$
$fc_0$, $fc_1$ = feed-forward networks
$w^e_{ij}$ = learnable weights
(a code sketch follows at the end of this section)
Single GNN layer??
outputs = node & edge embeddings
For a specific task we can fine-tune these weights further.
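A minimal PyTorch sketch of this edge-centric message passing as I read it: fc_0/fc_1 are small feed-forward nets, edge embeddings are computed from the two endpoint features plus an edge weight, and node embeddings are the mean over incident edge embeddings. The layer sizes, the value head, and passing edge weights as inputs (the paper treats them as learnable) are my simplifications:

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """One round of the edge-centric message passing described above (illustrative sizes)."""
    def __init__(self, node_dim=32, emb_dim=32):
        super().__init__()
        self.fc0 = nn.Linear(node_dim, emb_dim)         # per-node transform
        self.fc1 = nn.Linear(2 * emb_dim + 1, emb_dim)  # edge update from both endpoints + edge weight

    def forward(self, x, edge_index, edge_weight):
        # x: [N, node_dim] node features; edge_index: [E, 2] long; edge_weight: [E, 1]
        h = self.fc0(x)
        src, dst = edge_index[:, 0], edge_index[:, 1]
        e = self.fc1(torch.cat([h[src], h[dst], edge_weight], dim=-1))   # edge embeddings
        # node embedding = mean of embeddings of incident edges
        v = torch.zeros_like(h).index_add_(0, src, e).index_add_(0, dst, e)
        ones = torch.ones(e.size(0), 1)
        deg = torch.zeros(x.size(0), 1).index_add_(0, src, ones).index_add_(0, dst, ones)
        return e, v / deg.clamp(min=1)

class ValueNet(nn.Module):
    """Graph embedding (mean of edge embeddings) feeding a small value head that
    regresses the reward label - the supervised stage described above."""
    def __init__(self):
        super().__init__()
        self.gnn = GraphConv()
        self.head = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x, edge_index, edge_weight):
        e, _ = self.gnn(x, edge_index, edge_weight)
        return self.head(e.mean(dim=0))   # predicted reward for this placement
```

After this supervised stage, the encoder weights are reused (and optionally fine-tuned) by the policy.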

Policy Network

Deconvolution (transposed convolution: like convolution, but the input is "spaced out", so spatial dimensionality increases) layers used to upsample to the action grid (sketched below)
Batch norm used
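A rough sketch of the deconvolution head: transposed convolutions with batch norm upsample a state embedding into a grid of logits, and the feasibility mask zeroes out illegal cells before the softmax. The channel counts, the 4x4 seed feature map and the 32x32 output grid are my own guesses, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class DeconvPolicyHead(nn.Module):
    """Upsample a state embedding into per-grid-cell placement probabilities (illustrative)."""
    def __init__(self, emb_dim=32):
        super().__init__()
        self.fc = nn.Linear(emb_dim, 16 * 4 * 4)  # seed a small 4x4 feature map
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(16, 8, kernel_size=4, stride=2, padding=1),  # 4x4 -> 8x8
            nn.BatchNorm2d(8), nn.ReLU(),
            nn.ConvTranspose2d(8, 4, kernel_size=4, stride=2, padding=1),   # 8x8 -> 16x16
            nn.BatchNorm2d(4), nn.ReLU(),
            nn.ConvTranspose2d(4, 1, kernel_size=4, stride=2, padding=1),   # 16x16 -> 32x32
        )

    def forward(self, state_emb, mask):
        # state_emb: [B, emb_dim]; mask: [B, 32, 32] with 1 = feasible cell
        x = self.fc(state_emb).view(-1, 16, 4, 4)
        logits = self.deconv(x).squeeze(1)                     # [B, 32, 32] grid logits
        logits = logits.masked_fill(mask == 0, float("-inf"))  # forbid infeasible cells
        return torch.softmax(logits.flatten(1), dim=-1)        # distribution over grid cells
```

PPO then samples the next macro's grid cell from this masked distribution.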

Results

We can see clearly that a) zero-shot performance is really good, b) finetuning can give fairly substantial boosts.
Finetuning also gets us to strong performance much more quickly.
Left: pre-training on larger datasets really helps zero-shot performance and the first few hours of fine-tuning. Perhaps as expected, though perhaps a little disappointingly, after long fine-tuning the dataset size makes less difference.
Right: as expected, ⬆️ data = ⬇️ cost for supervised training.
Appears superior to simulated annealing.
Ultimately takes about 6 hours to generate strong placements, whereas human-in-the-loop methods take several weeks.