Contents:
Contents:My SummaryIntroductionContextProblemCurrent ApproachProposed ApproachGeneralisationContributionRelated WorkCategories of previous approachesDifference of proposed approachMethodProblem statementOverviewCost functionArchitectureSupervised embedding learningPolicy NetworkResults
My Summary
The high-level problem here is that the chip-placement problem is currently done with a lot of manual work and evaluating our true objective is a very slow and expensive process. This method gives a much faster approach using a neat approximation.
More broadly, the chip placement problem is an optimisation problem—specifically, a non-differentiable discrete constrained multi-objective one. That it is non-differentiable means that we can't just use standard gradient-based methods to optimise the objective(s).
The authors' solution is to turn this into an RL problem - each discrete step (placing the next-smallest macro) becomes an action, the state is the current chip layout, and the reward is some approximation of our objectives.
That this appears to work well suggests similar problems could also have RL solutions. This is the most interesting aspect of the paper, and could make RL very real-world-useful for many non-differentiable optimisation problems!
To train the model, the authors use two steps:
- Train a value network on a generated dataset using various chip layouts, using supervised learning with the slow "true" reward as a target. This generates embeddings used in ⬇️
- Train an RL policy to output macro placements. The reward is an approximation of the true reward.
The architecture diagram here is very helpful in understanding how this all works:

In terms of the architecture, a (single layer?) MPNN is used to learn the graph embeddings, and deconv nets are used to output the action.
The authors claim that the GNN architecture is novel, but it seems extremely standard to me.
Step 2) is done on training tasks, and can then be zero-shot transferred or fine-tuned on a subsequent chip/task.
I don't know enough about the domin to compare well with respect to the baselines, but what can be said is that a) zero-shot performance appears to be really good - we can generalise really well to unseen architectures (thanks GNNs!), b) finetuning can give fairly substantial boosts.
In terms of real-world use, this method can take 6 hours with fine-tuning on a new task to generate its best chip-placement, relative to weeks for standard human-in-the-loop methods. This appears pretty promising. The authors suggest their approach is competitive to existing non-ML methods. I'm a little sceptical, and Yann LeCun has posted ⬇️, so I'm reserving judgement for now.
Introduction
Context
- End of Moore's law / Dennard scaling means specialist AI hardware v. valuable
- AI chips currently take years to design ➡️ we have to try and anticipate requirements of models years in advance
- Aim = shortening development cycle
- If we can use AI for chip design, will create virtuous circle
Problem
Task: place a netlist graph (connectivity of the circuit) of macros (e.g. SRAMs) and standard cells (logic gates) onto a chip canvas (bounded 2D space)
Objective: optimize PPA (power, performance, area), while adhering to placement constraints
Current Approach
- Human experts still required, many weeks of iteration
- Huge netlist graphs (millions to billions of nodes)
- ⬆️ cost of computing true target metrics (hours to days for single design)
Proposed Approach
Pose chip placement as RL problem
Key motivation for using RL: our (multi-) objective is non differentiable
Iterations of training ➡️ for each one, all of the macros are placed by the agent, then the standard cells are placed by a secondary (non-learned) method
Fast approximate reward used
Generalisation
First placement approach which can generalise to new unseen netlists
As agent is exposed to more different chips, becomes better and better on new ones
Contribution
- Placement method that learns from experience
- Superior to baselines on real AI chips, and comparable to human experts
Related Work
Categories of previous approaches
- partitioning-based methods (divide-and-conqueor)
- stochastic/hill-climing (simmulated annealing)
- analytic solvers (e.g. no-linear optimisers) inc. SOTA RePLAce
Difference of proposed approach
Like analytic solvers uses gradient updates to optimise objective
However, optimisation no longer from-scratch - leverages knowledge from prior placements on different chips (domain adaptation)
Unlike other approaches, optimises target metrics directly without having to make convex approximations
Method
Problem statement
Map the nodes of a netlist(connectivity of the circuit) onto a chip canvas (bounded 2D space), such that the PPA (power, performance, area) is optimised.

Overview
MDP:
States: possible partial placements of the netlist on the canvas. A concatenation of features including:
- graph embedding of the netlist (inc. placed & unplaced nodes)
- node embedding of the current macro
- netlist metadata
- mask representing the feasibility of placing the current node onto each cell of the grid
Actions: (given the current macro to place) the set of locations in the canvas space (discrete grid cells) the macro can be placed, without violating any hard density constraints
State transition: just the obvious deterministic outcome?
Reward:
True reward: output from slow commercial tool (EDA)
Proxy reward: 0 for all actions except the last, where = - weighted sum of proxy wirelength and congestion
T = number of macros in netlist
Algorithm: PPO
Cost function
Average of returns over all netlists in dataset, given current policy:
where  is the dataset of graphlist of size  and
Architecture

Supervised embedding learning
Why? We can train this easily across many different netlists/placements, giving strong generalisation.
The ValueNet is trained (supervised) first, which enables the learning of the embeddings:
Dataset = 10,000 chip placements; input = standard state, label = true (slow) reward ← built by picking 5 accelerator netlists and then generating 2,000 placements for each by using a basic policy network.
GraphConv: standard MPNN:
embeddings = 32-dim
 =  feed-forward
 =  feed-forward
 = learnable  weights
Single GNN layer??
outputs = node & edge embeddings
For a specific task we can fine-tune these weights further.
Policy Network
Deconvolution (like convolution, but input is "spaced out": see link) layers used to increase dimensionality.
Batch norm used
Results





Ultimately takes about 6 hours to generate strong placements, whereas it takes human experts in-the-loop several weeks.