RTL Simulation Reading Group Notes

Papers / Patents

Emulation HW

Custom ASIC-based emulation hardware produced by industry.

FPGA Overlays

FPGA overlay-oriented emulation hardware and techniques for word-level FPGA compilation.

Academic Attempts at Emulation

Academic efforts to create emulation hardware either using an FPGA overlay or modeling a custom emulation ASIC.

Compiler Partitioning Strategy

FireSim ancestors

Power & Gate-Level Simulation

Software RTL Simulation

HW-Accelerated (Non-FPGA) RTL Simulation

etc


Tentative schedule

Week 1 - uarch

Summary

Processor design

  • Instructions execute step by step (no control flow, fixed set of instructions in IMEM). Each iteration through the instruction memory corresponds to one target cycle (a sketch of this step loop follows the list)
  • Two data memories (denoted the input stack and the data stack)
    • In each step, the function bit out (FBO) is stored in the data stack
    • In each step, the input from the switch is stored in the input stack. The instruction encodes which other processor to accept the incoming switch bit from (a bit can be ignored or broadcast to X other processors)
    • Logic computation needs to be performed in a BFS manner so that each step can use the values produced by previous steps
  • LUTs are configured to simulate arbitrary N-1 gates. Operands are read from the input/data stacks.
  • Bits can be forwarded to nearby processors (N-3 ~ N+3) instead of going through the network.
    • This saves one cycle: if the bit goes over the network, the consuming processor has to store it in the input stack and can only use it in the following cycle
  • Instruction memory is split into two parts: left and right
    • For logic emulation, the left and right halves both encode an operation to perform
    • For memory (SRAM?) emulation, the right instruction is essentially the data array. 16 processors are grouped together, and a bit from each processor in the group is used to generate the address for the memory operation
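
A minimal Python sketch of the step loop described above. All names here (the Instr fields, lut_eval, the switch object) are illustrative assumptions, not taken from the patent:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Instr:
    lut_table: int             # truth table of the configured gate
    data_operands: List[int]   # data-stack offsets to read
    input_operands: List[int]  # input-stack offsets to read
    src_processor: int         # whose switch bit to capture this step

def lut_eval(lut_table: int, operands: List[int]) -> int:
    """Evaluate an arbitrary gate: the operand bits index into the truth table."""
    index = 0
    for bit in operands:
        index = (index << 1) | bit
    return (lut_table >> index) & 1

def run_target_cycle(imem, data_stack, input_stack, switch):
    """One full pass over the instruction memory == one target cycle."""
    for step, instr in enumerate(imem):
        # Operands come from the input/data stacks at compiler-chosen offsets,
        # so logic must be evaluated in BFS order (producers before consumers).
        operands = [data_stack[i] for i in instr.data_operands] + \
                   [input_stack[i] for i in instr.input_operands]
        fbo = lut_eval(instr.lut_table, operands)   # function bit out
        data_stack[step] = fbo                      # FBO stored every step
        # The incoming switch bit (from the processor this instruction names)
        # lands in the input stack and is usable from the next step onward.
        input_stack[step] = switch.recv(instr.src_processor)
        switch.send(fbo)
```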

Emulation module, board, platform

  • Module
    • 64 processors are grouped together as a module
    • All the processors within a module are connected as a crossbar
  • Board
    • Collection of emulation modules
    • Module ports are connected in a pre-configured fashion
  • Platform
    • Collection of boards, DRAM(?), host communication logic, and other platform control logic
  • Need to synchronize every cycle across all boards. How should this global synchronization be achieved? Also, can we allow certain parts to slip ahead of this global synchronization barrier? (I think we can, but the benefit might not be significant due to straggler effects)

  • Can chain multiple processors to simulate logic whose depth is larger than the max steps per processor
    • The performance degradation as the target design size increases is gradual
  • Inter-board communication has to happen with a fixed latency that the compiler is aware of (to the compiler, the link doesn't really matter except that the scheduling might change a little)
  • Need to have a core that can run testbench code near the machine (display messages, assertions, C++ models ...)
  • For 4-state simulation, just use software and inject state
    • However, there are other cases where 4-state sim makes sense: external IP can inject 4-state values, low-power simulation, ...
    • Cadence added support for X-prop in their latest Palladium
    • The problem with X-prop is that you have to use 2 bits to simulate a single bit (00 -> 0, 01 -> 1, 10 -> X, 11 -> Z), which can be very area-inefficient (especially since X is a rare state compared to just 0 & 1)
    • But for the problems we are trying to deal with (functional & performance verification), 2-state simulation may be sufficient
  • Expanding SRAM depth is cheap because we can use custom macros
    • So when you can increase the frequency of the design, you would want to increase the SRAM depth so that each processor can emulate more gates w/o performance loss
    • However, if the frequency is fixed, increasing the number of steps per cycle translates to lower simulation perf
    • That seems to be how the IBM people arrived at 128 steps
    • Need to find the optimal step count for FPGA & ASIC with modern technology nodes.
  • One implementation option: add the processor grid as a FireSim LI-BDN where the interface is fixed (e.g., your tile)
    • Can share the FireSim bridge/IO infrastructure
    • Can save FPGA resources by mapping parts of the design directly onto the FPGA and only the parts where you anticipate RTL changes onto the emulation processors
    • For an FPGA overlay (300 MHz), we may have to simplify the network to save FPGA routing resources
      • The compiler has to be aware of the network latency (the network has to be designed to have static latency & map onto the FPGA in a way that matches the switch boxes well)
      • The compiler has to be able to pipeline instructions to hide the extra network latency
    • GCD is a good place to start
    • FMR will increase (perhaps similarly to when running TracerV)
      • Jerry's opinion is that we shouldn't try to compromise on performance
      • In my opinion, this is somewhat inevitable and not too bad
  • Approximating how many gates we can emulate when using an FPGA overlay (worked numbers after this list)
    • The FPGA can simulate N ASIC gates
    • Each emulation processor corresponds to M ASIC gates and has at most T steps (T gates)
      • M has to account for the network
    • The number of gates that can be emulated is approximately (N / M) * T
    • Need to measure T/M by implementing a dummy module and building a bitstream with it
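
A back-of-the-envelope version of the estimate above; all three inputs are placeholder assumptions to be replaced with measured values:

```python
# Capacity estimate from above: emulated gates ~= (N / M) * T.
N = 10_000_000   # ASIC-gate-equivalent capacity of the FPGA (assumed)
M = 5_000        # ASIC gates per emulation processor, incl. network share (assumed)
T = 128          # max steps per target cycle = gates emulated per processor (assumed)

processors = N // M                  # processors that fit on the FPGA
emulated_gates = processors * T      # total emulatable target gates
print(processors, emulated_gates)    # 2000 processors -> 256,000 gates
```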

Discussion/Questions

  • Should the compiler always cut across register boundaries?
    • If an RTL block mapped to a single processor contains sequential logic, the processor cannot use the bit in the data stack that corresponds to the FF, as it will be overwritten. So that bit must go across the network and come back, and the compiler would have to insert NOPs. -> utilization vs performance tradeoff
    • Alternatively, can double the on-chip memory so that each half works like a master (producing bits) and a slave (storing bits for the next cycle); sketched after this list. This enables more partitioning flexibility in the compiler but decreases the area efficiency of the processors
  • How many processors can fit in a single FPGA & how many processors/modules/boards would we need to simulate a reasonably sized Chipyard SoC?
  • What are some problems that might show up when scaling this system up to support a billion-gate simulation?
  • (Since this word seems like some magic keyword to people) Heterogeneous integration of processor designs? Can we design certain modules/blocks to have different numbers of operands, bitwidths, ... to optimize for area & performance?
  • How to do X-propagation? We can encode it by using 2 bits instead of 1, but that has a significant area overhead. However, the most recent Palladium supports X-propagation as well. Maybe only certain processors have X-modeling while most processors support only 2-state simulation? Static analysis could identify gates that are guaranteed never to be X.
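
A minimal sketch of the double-buffered data stack idea from the first question above; the structure is an assumption, not taken from any paper:

```python
# One half of the data stack holds last cycle's bits (read-only this cycle),
# the other half collects this cycle's results; swapping at the cycle boundary
# lets a FF's bit survive without a round trip over the network plus NOPs.

class Instr:
    def __init__(self, operands, fn):
        self.operands = operands   # offsets into last cycle's half
        self.fn = fn               # gate function over the operand bits

imem = [Instr([0, 1], lambda a, b: a & b)]   # placeholder program
stacks = [[0] * 16, [0] * 16]                # the two halves

cur = 0
for _cycle in range(4):
    prev = 1 - cur
    for step, instr in enumerate(imem):
        bits = [stacks[prev][i] for i in instr.operands]
        stacks[cur][step] = instr.fn(*bits)
    cur = 1 - cur                            # swap master/slave halves
```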

Week 2 - uarch

Discussion

  • What is the GDM for? Why not just use LUTs like in the patent?

  • Go over SRAM emulation

  • What does it mean to propagate the clock distribution logic?

    • It must be due to how their clock distribution network is designed
    • Predates the multi-clock-domain era
    • Can simulate clock gating; however, there is no performance benefit from logic skipping
    • You can have logic in the clock tree -> the clock input is a combinational function of some data & a clock
    • So you can simulate a FF at the transistor, gate, or functional level; as you go down in abstraction, it takes more cycles to simulate a single FF
  • 4 state simulation support

    • Can model X-optimism & pessimism
  • Very different from Cyclist

  • If there is slipping, it has to be by a fixed amount because you will need memory to perform some sort of bookkeeping

  • Skipping has to happen in a coarse-grained manner & the parts that can be skipped at the same time have to be pre-determined

    • Also, the amount of host cycles that can be skipped has to be predetermined & known by the compiler -> this starts to resemble multicore simulation logic
    • However, the simulation throughput is determined by the worst-case processor steps
  • Core functional logic

    • GDM: it is for tuning X-prop pessimism & optimism (for 4-state simulation); the variants are sketched in code after this list
      • x opt: X & 1 -> 0 or 1
      • x pess: X & 1 -> X
      • x symbolic: if the output can be proven, use that value (even with this, there are cases where you need X-prop for registers, e.g. in an RR arbiter)
      • interrupt logic
        • propagates X only after a certain point in simulation, to check whether X-prop breaks stuff
        • can be used to generate trigger conditions
  • vs Quickturn

    • much more effort was put into 4-state sim & clock-gating modeling
    • possibly because they had low confidence in their digital logic
    • z3 has 4 state -> (emulated 4 state by using multiple bits, probably X-optimistic simulation)
  • How clock trees are modeled in modern palladiums
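
The X-optimism/pessimism distinction above, in code, using the 2-bit encoding from Week 1 (00 -> 0, 01 -> 1, 10 -> X, 11 -> Z). This is an illustration, not Cadence's actual scheme:

```python
ZERO, ONE, X, Z = 0b00, 0b01, 0b10, 0b11

def and_pessimistic(a, b):
    """X-pessimistic AND: an X (or Z, treated as X) operand poisons the
    output unless the other operand is a controlling 0."""
    if a == ZERO or b == ZERO:
        return ZERO
    if a in (X, Z) or b in (X, Z):
        return X
    return ONE

def and_optimistic(a, b, resolve=ZERO):
    """X-optimistic AND: resolve X to a concrete bit (here a fixed choice)
    and compute 2-state, so X & 1 yields 0 or 1 instead of X."""
    a2 = resolve if a in (X, Z) else a
    b2 = resolve if b in (X, Z) else b
    return a2 & b2

assert and_pessimistic(X, ONE) == X            # x pess: X & 1 -> X
assert and_optimistic(X, ONE) in (ZERO, ONE)   # x opt: X & 1 -> 0 or 1
```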

Week 3 - uarch

Week 4 - compiler

  • Yorktown simulation SW support

  • IBM Logic Engine

  • Compiler/HW complexity tradeoffs

    • Unit delay model vs rank order
    • How does this tradeoff space differ from FPGAs vs ASIC?
  • Partitioning & instruction scheduling (a toy NOP-inserting scheduler is sketched after this list)

    • When partitioning, should we try to partition across register boundaries? Or if we have a partition that is balanced & minimizes communication, would that also be a nice partition?
    • Linker: can it link at arbitrary boundaries, or are there conditions on these link boundaries? How can we use the permuters for incremental compilation flows?
  • What is a nice interface/method to load the compiled instructions into these processors?

    • FESVR -> too slow?
    • ???
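
A toy illustration of the NOP-insertion side of this tradeoff space: an instruction may issue only after its operands' producer steps plus a fixed network latency have elapsed, and the compiler fills the gaps with NOPs. The op representation and LATENCY constant are assumptions:

```python
LATENCY = 2  # static network latency in steps, known to the compiler (assumed)

def schedule(ops):
    """ops: list of (name, deps) in topological order -> issue schedule."""
    ready_at = {}              # name -> first step at which its result is usable
    out, step = [], 0
    for name, deps in ops:
        earliest = max((ready_at[d] for d in deps), default=0)
        while step < earliest:
            out.append("NOP")  # compiler-inserted bubble
            step += 1
        out.append(name)
        ready_at[name] = step + 1 + LATENCY
        step += 1
    return out

print(schedule([("a", []), ("b", []), ("c", ["a", "b"]), ("d", ["c"])]))
# -> ['a', 'b', 'NOP', 'NOP', 'c', 'NOP', 'NOP', 'd']
```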

Week 5 - academic

Cyclist

  • No traction at all, sad
  • Related work
    • Compared against EVE, YSE
    • Palladium: 100 million gates an hour
      • compilation performance shows strong scaling with more cores -> most of the compilation time is in partitioning
    • Malibu (another related work)
  • Simulates at the RTL-operator level -> datapath width vs simulator platform capacity tradeoff
  • Uarch
    • Modified Rocket, 32-bit-wide instructions
    • No custom logic function, it uses ALUs to perform computation
    • ISA
      • log2: are they recovering the RTL semantics to find use cases for log2 (find the highest set bit; in Chisel, this circuit is blasted out)?
      • cat: more a consequence of FIRRTL having Cat & it made implementation easier
      • mul: extreme -> may be area-inefficient to have in every single emulation core
    • 32 architectural registers
      • They didn't want to spend too much time
    • Explicit NOPs to resolve data hazards
    • Only neighbor-to-neighbor routing -> a lot of cycles are spent routing data across the network
    • Can broadcast outputs to all the neighbors
  • Debug
    • Nice engineering
    • Capture IO traces and replay them later
  • Utilization is only 4%
  • Pay as you go
    • perform annealing to come up with a better compilation output while loading & running the simulation
    • high engineering effort, but not impossible
    • must maintain a mapping to the new compilation, done on the host
  • Interactive visibility
    • Find a signal's value at a particular point in time
    • Take periodic snapshots & replay (sketched below)
    • Only 12% perf slowdown (on Palladium, it is more like 2 ~ 5x)
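
A sketch of the snapshot-and-replay scheme above (the sim object and its methods are assumed stand-ins, not Cyclist's actual interfaces): checkpoint full state every K cycles, then to inspect a signal at cycle t, restore the nearest earlier checkpoint and re-simulate forward:

```python
import copy

SNAPSHOT_EVERY = 1000   # cycles between checkpoints; tunes the runtime overhead

def simulate(sim, num_cycles):
    snapshots = {}
    for cycle in range(num_cycles):
        if cycle % SNAPSHOT_EVERY == 0:
            snapshots[cycle] = copy.deepcopy(sim.state)
        sim.step()
    return snapshots

def signal_at(sim, snapshots, cycle, signal):
    """Restore the nearest earlier snapshot, then replay forward to `cycle`."""
    base = max(c for c in snapshots if c <= cycle)
    sim.state = copy.deepcopy(snapshots[base])
    for _ in range(cycle - base):
        sim.step()
    return sim.probe(signal)
```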

Week 6 - academic

  • Manticore: Hardware-Accelerated RTL Simulation with Static Bulk-Synchronous Parallelism (ASPLOS 2023)

  • Recap:

    • Cyclist: message passing between cores (mesh), low utilization
    • neighbor-to-neighbor routing results in low utilization
  • Manticore:

    • No message passing
      • Bulk-synchronous parallelism (separate bit-shuffling phase)
      • 2D torus network
      • Leads to low utilization
      • Enables the compiler to perform core-local scheduling of instructions
      • However, during the communication phase the compiler still has to be aware of the NoC traffic and make sure things don't collide
    • Statically scheduled via the compiler (the two-phase loop is sketched after this list)
    • Verilator is the baseline (not a fair comparison); like Verilator and RepCut, Manticore uses a bulk-synchronous execution model
    • Each tile is larger because of the above execution model
      • State has to be duplicated & maintained within each tile
    • 14-stage pipeline
      • Specialized to FPGAs
      • No interlocks; the compiler inserts NOPs
    • Large datapath & low utilization
    • Custom function unit
      • Particular design
    • Results
      • 2x compared to Rocket / cannot extract enough parallelism to compete with Xeons
      • Low utilization, NOPs...
  • Taxonomy

    • Event-driven vs static -> where is the static/dynamic boundary? accessing SRAM?
    • Bulk-synchronous vs fine-grained message passing
    • Core compute element (LUT vs ALU)... the degree to which it looks like a LUT / datapath width
    • Synchronous vs intra-cycle timing
    • 4-state simulation support
    • Memory and encoding support / how SRAMs are mapped
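
The bulk-synchronous execution model above, sketched; the core/NoC interfaces are illustrative assumptions, not Manticore's actual design:

```python
def run_target_cycle(cores, noc_schedule):
    # Phase 1: every core runs its statically scheduled local program to
    # completion. No messages arrive mid-phase, so the compiler can schedule
    # each core's instructions purely locally.
    for core in cores:
        core.run_compute_phase()

    # Phase 2: statically scheduled bit shuffling over the torus. The compiler
    # laid out noc_schedule so that no two transfers collide on a link.
    for src, dst, reg in noc_schedule:
        cores[dst].write(reg, cores[src].read(reg))

    # Implicit barrier: the next target cycle starts only after both phases
    # finish, so throughput is set by the slowest (straggler) core.
```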

Week 7 - misc

Week 8 - Power & gate level simulation

  • CPF_palladium (cadence manual)
  • LowPowerCPF-Simulation-Guide (cadence manual)

Week 9 - FPGA overlay

Week 10 - FPGA based emulation


  • A good partition doesn't mean the scheduling results will be good
  • They found that "good partitions" usually contain certain nets in the partition cuts
    • What does "certain nets" mean in this context?
    • Probably some structural characteristic of the graph
  • They train a GCN to obtain a probability P(e), where e represents a net and P(e) represents the probability that it will be included in the partition cut
  • During the partitioning process, they use the GCN to guide partitioning decisions so that the scheduling quality will be high (see the sketch after this list)
  • To limit the explosion of compute requirements, they only apply the above technique in the final partitioning step (where subpartitions are again partitioned onto emulation processors)
  • But what are the characteristics of the nets that have high P(e) vs the ones that do not? This isn't revealed in the paper
  • The results look quite promising, and they seem to have used Palladium compilers as the baseline (on average 10% fewer steps than the Palladium compiler for open-source designs, up to 33% fewer steps)
  • It seems like the KaHyPar partitioner provides pretty decent compilation results as well, though
  • FPGA + boolean processor approach
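
A sketch of how the GCN's per-net probabilities might plug into a partitioner's cut cost; the notes don't give the paper's exact formulation, so everything below is an assumption:

```python
def edge_weight(p_cut, scale=10.0):
    """Map the GCN's P(e) to a partitioner edge weight: nets likely to end up
    in the cut (P(e) ~ 1) are cheap to cut, unlikely ones are expensive."""
    return 1.0 + scale * (1.0 - p_cut)

def cut_cost(partition, nets, p):
    """partition: node -> block id; nets: (u, v) pairs; p: net -> P(e)."""
    return sum(edge_weight(p[e]) for e in nets
               if partition[e[0]] != partition[e[1]])
```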
