The overall approach will probably use a hybrid of the RippleFPGA and UTPlaceF algorithms for combined packing and placing. Early experiments with VPR demonstrated to me that an artificial barrier between these stages is not helpful. These algorithms all use quadratic programming. These two algorithms are fairly similar to each other.

For now, routing will just use whatever algorithm is "typical" (Pathfinder?). I may end up with egg on my face, but it seems that this is "just" an informed search problem. Some ideas that may end up enhancing this stage:

A* or other heuristics (XXX can we use inadmissible heuristics?)
Beam search

Miscellaneous papers on how we "got to" analytic placers from SA placers:

http://janders.eecg.toronto.edu/1387/readings/marcelfpl12.pdf
http://www.ece.umich.edu/cse/awards/pdfs/iccad10-simpl.pdf

Proposed overall flow for KinglerPAR:

Read yosys netlist
Cell munging (general purpose and arch-specific)
- Types: IO, LUT, FF, RAM/DSP/large-in-fabric-movable-blocks, fixed specials (blocks like PLLs, flash, ADC), movable specials?
Arch-specific logic propagates around some constraints now (e.g. for special blocks or for promoting globals (XXX should this be generic?))
- XXX should we have some logic for handling clock domains? Is this ever useful?
Arch-specific greedy packing (e.g. for RAM/DSP/special)
Arch-specific greedy placement (e.g. for specials)
Build HeAP-style superblocks for carry chains
- Warn if carry chains are huge?
- XXX can we auto-split them? This is hard; involves injecting feedthrough cells, which affects legalization
- XXX UNTESTEED IDEA: Can we twiddle net weights to encourage packing of carry chains and BLEs and other similar inside-CLB features? Do we need to?
XXX can QP work with zero constraints? If not, assign non-user-constrained IOs uniformly so as to avoid getting a giant blob of 100% overlapped cells (XXX what about clocks and stuff?).
Distribute all cells uniformly in fabric (required to seed B2B).
Run some iterations of HPWL-driven QP (3 iterations? just to get something that vaguely looks right. We need this for the next step)
Optionally run RippleFPGA partitioning (e.g. this doesn't work on MAX V because it's way too small. It's unclear for iCE40 which doesn't have as much directional routing bias.)
Run HPWL-driven QP to a convergence criteria.
- QP legalization will probably be UTPlaceF/POLAR-like for the spreading phase
Arch-specific hook point (e.g. for clocks). This hook point has enough information to e.g. assign quadrants but is also early enough that there is room to "recover from" constraints such as quadrants.
Run some iterations of congestion-driven QP.
- Fully legalize RAM/DSP now, but don't lock it in yet (UTPlaceF)
- Will probably use RippleFPGA-style area inflating and congestion estimation. Definitely won't import NTUPlace.
Finish congestion-driven global placement (skip RippleFPGA soft BLE packing -- intuition tells me this probably doesn't do much)
CLB packing with full legalization (XXX if it fails do we run QP again?)
Arch-specific hook point (e.g. for clocks). This is an "extra" hook point that I'm not currently sure how to use. It is after CLB packing, so that might be useful.
Save a design checkpoint now; possibly rewrite data structures ((somewhat) packed netlist)
Detail placement (RippleFPGA-style with alignment)
Final slot assignment
Save a design checkpoint now; possibly rewrite data structures (packed+placed netlist)
Arch-specific hook point (e.g. for clocks). This is intended for code to actually implement clock routing.
XXX research how to implement routing algorithms
- It seems we should be able to reenter pack+place and adjust cell density if we save enough state
Save a design checkpoint now; possibly rewrite data structures (fully-PARed result)
- XXX we probably want a way to back-annotate this
Convert to arch-specific data structures and write bitstream

Need to further research (directly-relevant algorithm details):

What does the RippleFPGA deferred slot assignment actually gain us?
- Simplicity, but it seems we need to do per-arch manual simplifying of legalization rules to take advantage of this
Can MAX V or iCE40 or other "simple LUT4" FPGAs skip complexity in CLB packing?
- See above
How do UTPlaceF/RippleFPGA CLB packing differ?

Need to further research (FPGA architectures):

Investigate global net structures in real FPGAs
How will other "special" blocks affect our PAR algorithm? (e.g. PLLs, ADCs, user flash, etc.)
Does any FPGA have "fracturable" RAM/DSP blocks? (Answer: YES) What to do about those? (Probably just treat them similarly and apply legalization etc. except with fixing their locations earlier)

Need to further research (general abstract CS topics):

Efficient cache-optimized data structures, especially for multithreading
General research on "modern" approaches to multithreaded algorithms
Brief research on numerical stability (can we just use integers/fixed?)

Features for minimum viable product:

Quadratic programming core engine, HPWL only
CLB packing
Detail placement
Final assignment
MVP will target iCE40 + MAX V

Post-MVP top priorities:

Carry chains (can probably ignore for MVP as long as code gets plumbed properly for them)
RAM/DSP (iCE40 has, MAX V doesn't)
LUT6/ALM (question: can we understand X-ray by then? If not then we can consider working on S6 or Cyc10GX)
Congestion-driven placement (does this need to be higher priority?)
Partitioning

XXX additional notes:

Need to ensure we don't pessimize LUT6/ALM architectures (everything I know well are "simple" LUT4)
VPR XML packing descriptions for AAPack are ridiculously overengineered -- KinglerPAR will just use code instead of declarative data
- How to make this code sufficiently reusable? MAX V is probably simpler than iCE40 is simpler than ECP5 is simpler than 7
G-cell congestion estimates seem to be designed around "Xilinx-style" big switch box architectures, what happens on Altera-style architectures?
Can we "dump" data back into VPR for routing initially?

Architectures:

MAX V -- "Altera-style", LUT4
iCE40 -- "Altera-style", LUT4, RAM
7 -- "Xilinx-style", LUT6/ALM, RAM+DSP
ECP5 -- "Xilinx-style", LUT4, RAM+DSP
MAX10 (future) -- "Altera-style", LUT4, RAM+mult
Cyc10LP (future) -- "Altera-style", LUT4, RAM+mult
Cyc10GX (future) -- "Altera-style", LUT6/ALM, RAM+DSP

Unsorted references:

http://www.cse.cuhk.edu.hk/~byu/papers/C54-ICCAD2016-RippleFPGA-slides.pdf
https://chengengjie.github.io/papers/C2-ICCAD16-RippleFPGA.pdf
https://pdfs.semanticscholar.org/c1c9/43c047e9d834c8ac487ec6d5292485546744.pdf
http://wuxili.net/pdf/TCAD17_UTPlaceF_Li.pdf
http://sci-hub.tw/https://ieeexplore.ieee.org/document/6691143/
http://sci-hub.tw/https://ieeexplore.ieee.org/document/1560039/
http://www.cse.cuhk.edu.hk/~fyyoung/paper/iccad11_ripple.pdf
http://appsrv.cse.cuhk.edu.hk/~jkuang/pdf/ripple2.pdf
http://janders.eecg.toronto.edu/1387/readings/marcelfpl12.pdf
http://www.ece.umich.edu/cse/awards/pdfs/iccad10-simpl.pdf
https://atrium.lib.uoguelph.ca/xmlui/bitstream/handle/10214/12985/Abuowaimer_Ziad_201805_PhD.pdf?sequence=5&isAllowed=y

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

notes.md

notes.md

Files

notes.md

Latest commit

History

notes.md

File metadata and controls