RTL Simulation Reading Group Notes

Papers / Patents

Emulation HW

Custom ASIC-based emulation hardware produced by industry.

FPGA Overlays

FPGA overlay-oriented emulation hardware and techniques for word-level FPGA compilation.

Academic Attempts at Emulation

Academic efforts to create emulation hardware either using an FPGA overlay or modeling a custom emulation ASIC.

Compiler Partitioning Strategy

FireSim ancestors

Power & Gate-Level Simulation

Software RTL Simulation

HW-Accelerated (Non-FPGA) RTL Simulation

etc


Tentative schedule

Week 1 - uarch

Summary

Processor design

  • Instructions execute step by step (no control flow, fixed set of instructions in IMEM). Each iteration through the instruction memory corresponds to one target cycle (a sketch of this step loop follows the list)
  • Two data memories (denoted the input stack and the data stack)
    • In each step, the function bit out (FBO) is stored in the data stack
    • In each step, the input from the switch is stored in the input stack. The instruction encodes which other processor to accept the incoming switch bit from (a bit can be ignored or broadcast to X other processors)
    • Logic computation needs to be performed in a BFS manner so that each step can use the values produced by previous steps
  • LUTs are configured to simulate arbitrary N-1 gates. Operands are read from the input/data stacks.
  • Bits can be forwarded to nearby processors (N-3 ~ N+3) instead of going through the network.
    • This saves one cycle: if the bit goes over the network, the consuming processor has to store it in the input stack and can only use it in the following cycle
  • Instruction memory is split into two parts: left and right
    • For logic emulation, the left and right halves both encode an operation to perform
    • For memory (SRAM?) emulation, the right instruction is essentially the data array. 16 processors are grouped together, and a bit from each processor in the group is used to generate the address for the memory operation
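
A minimal Python sketch of the step loop described above. All names here (the Instr fields, lut_eval, the switch object) are illustrative assumptions, not taken from the patent:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Instr:
    lut_table: int             # truth table of the configured gate
    data_operands: List[int]   # data-stack offsets to read
    input_operands: List[int]  # input-stack offsets to read
    src_processor: int         # whose switch bit to capture this step

def lut_eval(lut_table: int, operands: List[int]) -> int:
    """Evaluate an arbitrary gate: the operand bits index into the truth table."""
    index = 0
    for bit in operands:
        index = (index << 1) | bit
    return (lut_table >> index) & 1

def run_target_cycle(imem, data_stack, input_stack, switch):
    """One full pass over the instruction memory == one target cycle."""
    for step, instr in enumerate(imem):
        # Operands come from the input/data stacks at compiler-chosen offsets,
        # so logic must be evaluated in BFS order (producers before consumers).
        operands = [data_stack[i] for i in instr.data_operands] + \
                   [input_stack[i] for i in instr.input_operands]
        fbo = lut_eval(instr.lut_table, operands)   # function bit out
        data_stack[step] = fbo                      # FBO stored every step
        # The incoming switch bit (from the processor this instruction names)
        # lands in the input stack and is usable from the next step onward.
        input_stack[step] = switch.recv(instr.src_processor)
        switch.send(fbo)
```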

Emulation module, board, platform

  • Module
    • 64 processors are grouped together as a module
    • All the processors within a module are connected as a crossbar
  • Board
    • Collection of emulation modules
    • Module ports are connected in a pre-configured fashion
  • Platform
    • Collection of boards, DRAM(?), host communication logic, and other platform control logic
  • Need to synchronize every cycle across all boards. How should this global synchronization be achieved? Also, can we allow certain parts to slip ahead of this global synchronization barrier? (I think we can, but the benefit might not be significant due to straggler effects)

  • Can chain multiple processors to simulate logic whose depth is larger than the max steps per processor
    • The performance degradation as the target design size increases is gradual
  • Inter-board communication has to happen with a fixed latency that the compiler is aware of (to the compiler, the link doesn't really matter except that the scheduling might change a little)
  • Need to have a core that can run testbench code near the machine (display messages, assertions, C++ models ...)
  • For 4-state simulation, just use software and inject state
    • However, there are other cases where 4-state sim makes sense: external IP can inject 4-state values, low-power simulation, ...
    • Cadence added support for X-prop in their latest Palladium
    • The problem with X-prop is that you have to use 2 bits to simulate a single bit (00 -> 0, 01 -> 1, 10 -> X, 11 -> Z), which can be very area-inefficient (especially since X is a rare state compared to just 0 & 1)
    • But for the problems we are trying to deal with (functional & performance verification), 2-state simulation may be sufficient
  • Expanding SRAM depth is cheap because we can use custom macros
    • So when you can increase the frequency of the design, you would want to increase the SRAM depth so that each processor can emulate more gates w/o performance loss
    • However, if the frequency is fixed, increasing the number of steps per cycle translates to lower simulation perf
    • That seems to be how the IBM people arrived at 128 steps
    • Need to find the optimal step count for FPGA & ASIC with modern technology nodes.
  • One implementation option: add the processor grid as a FireSim LI-BDN where the interface is fixed (e.g., your tile)
    • Can share the FireSim bridge/IO infrastructure
    • Can save FPGA resources by mapping parts of the design directly onto the FPGA and only the parts where you anticipate RTL changes onto the emulation processors
    • For an FPGA overlay (300 MHz), we may have to simplify the network to save FPGA routing resources
      • The compiler has to be aware of the network latency (the network has to be designed to have static latency & map onto the FPGA in a way that matches the switch boxes well)
      • The compiler has to be able to pipeline instructions to hide the extra network latency
    • GCD is a good place to start
    • FMR will increase (perhaps similarly to when running TracerV)
      • Jerry's opinion is that we shouldn't try to compromise on performance
      • In my opinion, this is somewhat inevitable and not too bad
  • Approximating how many gates we can emulate when using an FPGA overlay (worked numbers after this list)
    • The FPGA can simulate N ASIC gates
    • Each emulation processor corresponds to M ASIC gates and has at most T steps (T gates)
      • M has to account for the network
    • The number of gates that can be emulated is approximately (N / M) * T
    • Need to measure T/M by implementing a dummy module and building a bitstream with it
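
A back-of-the-envelope version of the estimate above; all three inputs are placeholder assumptions to be replaced with measured values:

```python
# Capacity estimate from above: emulated gates ~= (N / M) * T.
N = 10_000_000   # ASIC-gate-equivalent capacity of the FPGA (assumed)
M = 5_000        # ASIC gates per emulation processor, incl. network share (assumed)
T = 128          # max steps per target cycle = gates emulated per processor (assumed)

processors = N // M                  # processors that fit on the FPGA
emulated_gates = processors * T      # total emulatable target gates
print(processors, emulated_gates)    # 2000 processors -> 256,000 gates
```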

Discussion/Questions

  • Should the compiler always cut across register boundaries?
    • If an RTL block mapped to a single processor contains sequential logic, the processor cannot use the bit in the data stack that corresponds to the FF, as it will be overwritten. So that bit must go across the network and come back, and the compiler would have to insert NOPs. -> utilization vs performance tradeoff
    • Alternatively, can double the on-chip memory so that each half works like a master (producing bits) and a slave (storing bits for the next cycle); sketched after this list. This enables more partitioning flexibility in the compiler but decreases the area efficiency of the processors
  • How many processors can fit in a single FPGA & how many processors/modules/boards would we need to simulate a reasonably sized Chipyard SoC?
  • What are some problems that might show up when scaling this system up to support a billion-gate simulation?
  • (Since this word seems like some magic keyword to people) Heterogeneous integration of processor designs? Can we design certain modules/blocks to have different numbers of operands, bitwidths, ... to optimize for area & performance?
  • How to do X-propagation? We can encode it by using 2 bits instead of 1, but that has a significant area overhead. However, the most recent Palladium supports X-propagation as well. Maybe only certain processors have X-modeling while most processors support only 2-state simulation? Static analysis could identify gates that are guaranteed never to be X.
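
A minimal sketch of the double-buffered data stack idea from the first question above; the structure is an assumption, not taken from any paper:

```python
# One half of the data stack holds last cycle's bits (read-only this cycle),
# the other half collects this cycle's results; swapping at the cycle boundary
# lets a FF's bit survive without a round trip over the network plus NOPs.

class Instr:
    def __init__(self, operands, fn):
        self.operands = operands   # offsets into last cycle's half
        self.fn = fn               # gate function over the operand bits

imem = [Instr([0, 1], lambda a, b: a & b)]   # placeholder program
stacks = [[0] * 16, [0] * 16]                # the two halves

cur = 0
for _cycle in range(4):
    prev = 1 - cur
    for step, instr in enumerate(imem):
        bits = [stacks[prev][i] for i in instr.operands]
        stacks[cur][step] = instr.fn(*bits)
    cur = 1 - cur                            # swap master/slave halves
```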

Week 2 - uarch

Discussion

  • What is the GDM for? Why not just use LUTs like in the patent?

  • Go over SRAM emulation

  • What does it mean to propagate the clock distribution logic?

    • It must be due to how their clock distribution network is designed
    • Predates the multi-clock-domain era
    • Can simulate clock gating; however, there is no performance benefit from logic skipping
    • You can have logic in the clock tree -> the clock input is a combinational function of some data & a clock
    • So you can simulate a FF at the transistor, gate, or functional level; as you go down in abstraction, it takes more cycles to simulate a single FF
  • 4 state simulation support

    • Can model X-optimism & pessimism
  • Very different from Cyclist

  • If there is slipping, it has to be by a fixed amount because you will need memory to perform some sort of bookkeeping

  • Skipping has to happen in a coarse-grained manner & the parts that can be skipped at the same time have to be pre-determined

    • Also, the amount of host cycles that can be skipped has to be predetermined & known by the compiler -> this starts to resemble multicore simulation logic
    • However, the simulation throughput is determined by the worst-case processor steps
  • Core functional logic

    • GDM: it is for tuning X-prop pessimism & optimism (for 4-state simulation); the variants are sketched in code after this list
      • x opt: X & 1 -> 0 or 1
      • x pess: X & 1 -> X
      • x symbolic: if the output can be proven, use that value (even with this, there are cases where you need X-prop for registers, e.g. in an RR arbiter)
      • interrupt logic
        • propagates X only after a certain point in simulation, to check whether X-prop breaks stuff
        • can be used to generate trigger conditions
  • vs Quickturn

    • much more effort was put into 4-state sim & clock-gating modeling
    • possibly because they had low confidence in their digital logic
    • z3 has 4 state -> (emulated 4 state by using multiple bits, probably X-optimistic simulation)
  • How clock trees are modeled in modern palladiums
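
The X-optimism/pessimism distinction above, in code, using the 2-bit encoding from Week 1 (00 -> 0, 01 -> 1, 10 -> X, 11 -> Z). This is an illustration, not Cadence's actual scheme:

```python
ZERO, ONE, X, Z = 0b00, 0b01, 0b10, 0b11

def and_pessimistic(a, b):
    """X-pessimistic AND: an X (or Z, treated as X) operand poisons the
    output unless the other operand is a controlling 0."""
    if a == ZERO or b == ZERO:
        return ZERO
    if a in (X, Z) or b in (X, Z):
        return X
    return ONE

def and_optimistic(a, b, resolve=ZERO):
    """X-optimistic AND: resolve X to a concrete bit (here a fixed choice)
    and compute 2-state, so X & 1 yields 0 or 1 instead of X."""
    a2 = resolve if a in (X, Z) else a
    b2 = resolve if b in (X, Z) else b
    return a2 & b2

assert and_pessimistic(X, ONE) == X            # x pess: X & 1 -> X
assert and_optimistic(X, ONE) in (ZERO, ONE)   # x opt: X & 1 -> 0 or 1
```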

Week 3 - uarch

Week 4 - compiler

  • Yorktown simulation SW support

  • IBM Logic Engine

  • Compiler/HW complexity tradeoffs

    • Unit delay model vs rank order
    • How does this tradeoff space differ from FPGAs vs ASIC?
  • Partitioning & instruction scheduling (a toy NOP-inserting scheduler is sketched after this list)

    • When partitioning, should we try to partition across register boundaries? Or if we have a partition that is balanced & minimizes communication, would that also be a nice partition?
    • Linker: can it link at arbitrary boundaries, or are there conditions on these link boundaries? How can we use the permuters for incremental compilation flows?
  • What is a nice interface/method to load the compiled instructions into these processors?

    • FESVR -> too slow?
    • ???
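
A toy illustration of the NOP-insertion side of this tradeoff space: an instruction may issue only after its operands' producer steps plus a fixed network latency have elapsed, and the compiler fills the gaps with NOPs. The op representation and LATENCY constant are assumptions:

```python
LATENCY = 2  # static network latency in steps, known to the compiler (assumed)

def schedule(ops):
    """ops: list of (name, deps) in topological order -> issue schedule."""
    ready_at = {}              # name -> first step at which its result is usable
    out, step = [], 0
    for name, deps in ops:
        earliest = max((ready_at[d] for d in deps), default=0)
        while step < earliest:
            out.append("NOP")  # compiler-inserted bubble
            step += 1
        out.append(name)
        ready_at[name] = step + 1 + LATENCY
        step += 1
    return out

print(schedule([("a", []), ("b", []), ("c", ["a", "b"]), ("d", ["c"])]))
# -> ['a', 'b', 'NOP', 'NOP', 'c', 'NOP', 'NOP', 'd']
```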

Week 5 - academic

Cyclist

  • No traction at all, sad
  • Related work
    • Compared against EVE, YSE
    • Palladium: 100 million gates an hour
      • compilation performance shows strong scaling with more cores -> most of the compilation time is in partitioning
    • Malibu (another related work)
  • Simulates at the RTL-operator level -> datapath width vs simulator platform capacity tradeoff
  • Uarch
    • Modified Rocket, 32-bit-wide instructions
    • No custom logic function, it uses ALUs to perform computation
    • ISA
      • log2: are they recovering the RTL semantics to find use cases for log2 (find the highest set bit; in Chisel, this circuit is blasted out)?
      • cat: more a consequence of FIRRTL having Cat & it made implementation easier
      • mul: extreme -> may be area-inefficient to have in every single emulation core
    • 32 architectural registers
      • They didn't want to spend too much time
    • Explicit NOPs to resolve data hazards
    • Only neighbor-to-neighbor routing -> a lot of cycles are spent routing data across the network
    • Can broadcast outputs to all the neighbors
  • Debug
    • Nice engineering
    • Capture IO traces and replay them later
  • Utilization is only 4%
  • Pay as you go
    • perform annealing to come up with a better compilation output while loading & running the simulation
    • high engineering effort, but not impossible
    • must maintain a mapping to the new compilation, done on the host
  • Interactive visibility
    • Find a signal's value at a particular point in time
    • Take periodic snapshots & replay (sketched below)
    • Only 12% perf slowdown (on Palladium, it is more like 2 ~ 5x)
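
A sketch of the snapshot-and-replay scheme above (the sim object and its methods are assumed stand-ins, not Cyclist's actual interfaces): checkpoint full state every K cycles, then to inspect a signal at cycle t, restore the nearest earlier checkpoint and re-simulate forward:

```python
import copy

SNAPSHOT_EVERY = 1000   # cycles between checkpoints; tunes the runtime overhead

def simulate(sim, num_cycles):
    snapshots = {}
    for cycle in range(num_cycles):
        if cycle % SNAPSHOT_EVERY == 0:
            snapshots[cycle] = copy.deepcopy(sim.state)
        sim.step()
    return snapshots

def signal_at(sim, snapshots, cycle, signal):
    """Restore the nearest earlier snapshot, then replay forward to `cycle`."""
    base = max(c for c in snapshots if c <= cycle)
    sim.state = copy.deepcopy(snapshots[base])
    for _ in range(cycle - base):
        sim.step()
    return sim.probe(signal)
```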

Week 6 - academic

  • Manticore: Hardware-Accelerated RTL Simulation with Static Bulk-Synchronous Parallelism (ASPLOS 2023)

  • Recap:

    • Cyclist: message passing between cores (mesh), low utilization
    • neighbor-to-neighbor routing results in low utilization
  • Manticore:

    • No message passing
      • Bulk-synchronous parallelism (separate bit-shuffling phase)
      • 2D torus network
      • Leads to low utilization
      • Enables the compiler to perform core-local scheduling of instructions
      • However, during the communication phase the compiler still has to be aware of the NoC traffic and make sure things don't collide
    • Statically scheduled via the compiler (the two-phase loop is sketched after this list)
    • Verilator is the baseline (not a fair comparison); like Verilator and RepCut, Manticore uses a bulk-synchronous execution model
    • Each tile is larger because of the above execution model
      • State has to be duplicated & maintained within each tile
    • 14-stage pipeline
      • Specialized to FPGAs
      • No interlocks; the compiler inserts NOPs
    • Large datapath & low utilization
    • Custom function unit
      • Particular design
    • Results
      • 2x compared to Rocket / cannot extract enough parallelism to compete with Xeons
      • Low utilization, NOPs...
  • Taxonomy

    • Event-driven vs static -> where is the static/dynamic boundary? accessing SRAM?
    • Bulk-synchronous vs fine-grained message passing
    • Core compute element (LUT vs ALU)... the degree to which it looks like a LUT / datapath width
    • Synchronous vs intra-cycle timing
    • 4-state simulation support
    • Memory and encoding support / how SRAMs are mapped
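
The bulk-synchronous execution model above, sketched; the core/NoC interfaces are illustrative assumptions, not Manticore's actual design:

```python
def run_target_cycle(cores, noc_schedule):
    # Phase 1: every core runs its statically scheduled local program to
    # completion. No messages arrive mid-phase, so the compiler can schedule
    # each core's instructions purely locally.
    for core in cores:
        core.run_compute_phase()

    # Phase 2: statically scheduled bit shuffling over the torus. The compiler
    # laid out noc_schedule so that no two transfers collide on a link.
    for src, dst, reg in noc_schedule:
        cores[dst].write(reg, cores[src].read(reg))

    # Implicit barrier: the next target cycle starts only after both phases
    # finish, so throughput is set by the slowest (straggler) core.
```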

Week 7 - misc

Week 8 - Power & gate level simulation

  • CPF_palladium (cadence manual)
  • LowPowerCPF-Simulation-Guide (cadence manual)

Week 9 - FPGA overlay

Week 10 - FPGA based emulation


  • A good partition doesn't mean the scheduling results will be good
  • They found that "good partitions" usually contain certain nets in the partition cuts
    • What does "certain nets" mean in this context?
    • Probably some structural characteristic of the graph
  • They train a GCN to obtain a probability P(e), where e represents a net and P(e) represents the probability that it will be included in the partition cut
  • During the partitioning process, they use the GCN to guide partitioning decisions so that the scheduling quality will be high (see the sketch after this list)
  • To limit the explosion of compute requirements, they only apply the above technique in the final partitioning step (where subpartitions are again partitioned onto emulation processors)
  • But what are the characteristics of the nets that have high P(e) vs the ones that do not? This isn't revealed in the paper
  • The results look quite promising, and they seem to have used Palladium compilers as the baseline (on average 10% fewer steps than the Palladium compiler for open-source designs, up to 33% fewer steps)
  • It seems like the KaHyPar partitioner provides pretty decent compilation results as well, though
  • FPGA + boolean processor approach
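
A sketch of how the GCN's per-net probabilities might plug into a partitioner's cut cost; the notes don't give the paper's exact formulation, so everything below is an assumption:

```python
def edge_weight(p_cut, scale=10.0):
    """Map the GCN's P(e) to a partitioner edge weight: nets likely to end up
    in the cut (P(e) ~ 1) are cheap to cut, unlikely ones are expensive."""
    return 1.0 + scale * (1.0 - p_cut)

def cut_cost(partition, nets, p):
    """partition: node -> block id; nets: (u, v) pairs; p: net -> P(e)."""
    return sum(edge_weight(p[e]) for e in nets
               if partition[e[0]] != partition[e[1]])
```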
