feat(CtrlUnit, DCache): support L1 DCache RAS #4009

Open
wants to merge 6 commits into
base: master

@cz4e cz4e commented Dec 9, 2024

L1 DCache RAS extension support

The L1 DCache supports part of the Reliability, Availability, and
Serviceability (RAS) extension.

  • Single Error Correct, Double Error Detect (SECDED) ECC protection on
    the L1 DCache RAMs, covering both the tag and data RAMs.
    Erroneous tags or data are not recovered.
  • Fault handling interrupt (Bus Error Unit (BEU) interrupt, number 65)
  • Error injection
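
The SECDED property named in the list above can be illustrated with a toy extended Hamming code over 4 data bits. This is only a sketch: the code width and encoding actually used by the L1 DCache RAMs are not stated in this description, and the function names `encode` and `classify` are hypothetical. Consistent with the list above, the sketch only classifies errors and does not correct them.

```python
def encode(nibble: int) -> int:
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) codeword."""
    bits = [0] * 8  # bit 0: overall parity, bits 1..7: Hamming(7,4)
    bits[3] = (nibble >> 0) & 1
    bits[5] = (nibble >> 1) & 1
    bits[6] = (nibble >> 2) & 1
    bits[7] = (nibble >> 3) & 1
    bits[1] = bits[3] ^ bits[5] ^ bits[7]  # covers positions 1, 3, 5, 7
    bits[2] = bits[3] ^ bits[6] ^ bits[7]  # covers positions 2, 3, 6, 7
    bits[4] = bits[5] ^ bits[6] ^ bits[7]  # covers positions 4, 5, 6, 7
    for pos in range(1, 8):
        bits[0] ^= bits[pos]               # make the total parity even
    return sum(bit << pos for pos, bit in enumerate(bits))


def classify(word: int) -> str:
    """Return "ok", "single" (correctable) or "double" (detect only)."""
    syndrome = 0
    parity = 0
    for pos in range(8):
        if (word >> pos) & 1:
            syndrome ^= pos  # XOR of the positions of all set bits
            parity ^= 1
    if parity:
        return "single"      # odd overall parity: exactly one bit flipped
    return "double" if syndrome else "ok"
```

With an odd number of flipped bits the overall parity exposes a single-bit error; with even parity and a non-zero syndrome the word contains a double-bit error that can be detected but not located.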

ECC Error Detect

An ECC error may be triggered when the L1 DCache is accessed.

  • Error Report:

    • Tag ECC Error: If an ECC error occurs in the tag of any way, a tag
      ECC error is reported.
    • Data ECC Error: If an ECC error occurs in the hit line, a data ECC
      error is reported. If the access misses, the error is not processed.
    • If an instruction's access triggers an ECC error, it is treated as a
      Hardware Error and an exception is reported.
    • Whenever an error is raised, an error message must be sent to
      the BEU.
    • When the hardware detects an error, it reports the error to the BEU
      and triggers NMI external interrupt 65.
  • Load instruction:

    • Only tag or data ECC errors can be triggered during execution. The
      error is reported to the BEU, and a Hardware Error is reported.
  • Probe/Snoop:

    • If a tag ECC error occurs, the cache state does not need to change,
      and a ProbeAck with corrupt=1 must be returned to L2.
    • If a data ECC error occurs, the cache state changes according to
      the normal rules. If data must be returned, a ProbeAckData with
      corrupt=1 must be returned to L2.
  • Replace/Evict:

    • A ReleaseData with corrupt=1 must be returned to L2.
  • Store to L1 DCache:

    • If a tag ECC error occurs, the cache line is released following the
      Replace/Evict process and the data is written to the L1 DCache without
      reporting an error to L2.
    • If a data ECC error occurs, the data is written directly without reporting
      the error to L2.
  • Atomics:

    • Report a Hardware Error; do not report the error to L2.
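
The per-access rules above can be summarized as a lookup table. The following is a minimal sketch that records only the actions the list states explicitly; the action strings and the helper names `actions` and `sends_corrupt_to_l2` are hypothetical and do not come from the PR's source code.

```python
# (access type, error kind) -> actions stated in the list above.
# All action names are illustrative labels, not identifiers from the design.
RULES = {
    ("load", "tag"): ("report_beu", "hardware_error"),
    ("load", "data"): ("report_beu", "hardware_error"),
    ("probe", "tag"): ("keep_state", "probe_ack_corrupt"),
    ("probe", "data"): ("update_state", "probe_ack_data_corrupt_if_data_returned"),
    ("release", "tag"): ("release_data_corrupt",),
    ("release", "data"): ("release_data_corrupt",),
    ("store", "tag"): ("release_line", "write_data", "no_l2_report"),
    ("store", "data"): ("write_data", "no_l2_report"),
    ("atomic", "tag"): ("hardware_error", "no_l2_report"),
    ("atomic", "data"): ("hardware_error", "no_l2_report"),
}


def actions(access: str, kind: str) -> tuple:
    """Return the actions listed for an ECC error of `kind` on `access`."""
    return RULES[(access, kind)]


def sends_corrupt_to_l2(access: str, kind: str) -> bool:
    """True if the rule returns a message with corrupt=1 to L2."""
    return any("corrupt" in action for action in actions(access, kind))
```

For example, `sends_corrupt_to_l2("probe", "tag")` is true, while stores and atomics never forward the error to L2.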

Error Inject

Each core's L1 DCache has a controller driven by memory-mapped registers,
and each hardware unit that supports ECC has its own control bank. Once
the bank registers have been configured, the L1 DCache triggers an ECC
error on the first subsequent access to the L1 DCache.

[Figure: err_inject]

Address Space

The address space is 0x38022000-0x3802207F, 128 bytes in total.
This space is local to each hart.

[Figure: ctl_bank]

L1 DCache Control Bank

Each control bank contains the registers ECCCTL, ECCEID, and ECCMASK;
each register is 8 bytes wide.
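
The address arithmetic can be checked with a short sketch. The register offsets (0, 8, 16) are taken from the store offsets used in the Error Inject Example in this description; placing the bank at the start of the window, and the names `OFFSETS` and `reg_addr`, are assumptions made for illustration.

```python
BASE = 0x38022000   # first byte of the hart-local window
END = 0x3802207F    # last byte of the hart-local window
REG_BYTES = 8       # each register is 8 bytes wide

# Offsets as used by the sd instructions in the injection example.
# Assumption: the control bank starts at BASE.
OFFSETS = {"ECCCTL": 0, "ECCEID": 8, "ECCMASK": 16}

WINDOW_BYTES = END - BASE + 1            # size of the window in bytes
REG_SLOTS = WINDOW_BYTES // REG_BYTES    # number of 8-byte slots in the window


def reg_addr(name: str, base: int = BASE) -> int:
    """Return the MMIO address of a control-bank register."""
    return base + OFFSETS[name]
```

The window therefore holds 16 eight-byte register slots; how the slots beyond the three registers shown are used is not specified here.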

[Figure: eccctl]

  • ECCCTL (ECC Control): ECC injection control register.
    • ese (error signaling enable): Indicates that injection is enabled;
      initialized to 0. When an injection succeeds and pst==0,
      ese is cleared.
    • pst (persist): Inject repeatedly. When pst==1, after the ECCEID
      counter has decremented to 0 and the injection has succeeded, the
      counter is restored to the last value written to ECCEID and the
      injection repeats; when pst==0, the error is injected only once.
    • ede (error delay enable): Indicates that the delay counter is enabled;
      initialized to 0. If
      • ese==1 and ede==0, error injection takes effect immediately.
      • ese==1 and ede==1, injection takes effect only after ECCEID
        has decremented to 0.
    • cmp (component): Injection target, initialized to 0.
      • 1'b0: The injection target is the tag.
      • 1'b1: The injection target is the data.
    • bank: Bank valid bits, initialized to 0. When a bit in
      bank is set, the corresponding mask is valid.
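
As a rough illustration of how an ECCCTL value could be assembled, the sketch below packs the one-bit fields into an integer. The bit positions (ese at bit 0, pst at bit 1, ede at bit 2, cmp at bit 3) are an assumption inferred from the example value 0x7 and its comment in this description; they are not confirmed by the text, and the bank field is left out because its position is not given. The helper name `pack_eccctl` is hypothetical.

```python
# Assumed bit positions, inferred from "0x7 # comp = 0, ede = 1, pst = 1, ese = 1".
ESE_BIT = 0
PST_BIT = 1
EDE_BIT = 2
CMP_BIT = 3

CMP_TAG = 0   # 1'b0: inject into the tag
CMP_DATA = 1  # 1'b1: inject into the data


def pack_eccctl(ese: int, pst: int, ede: int, cmp: int) -> int:
    """Pack the one-bit ECCCTL fields under the assumed layout."""
    for field in (ese, pst, ede, cmp):
        if field not in (0, 1):
            raise ValueError("each field must be 0 or 1")
    return (ese << ESE_BIT) | (pst << PST_BIT) | (ede << EDE_BIT) | (cmp << CMP_BIT)
```

Under this assumed layout, `pack_eccctl(ese=1, pst=1, ede=1, cmp=CMP_TAG)` yields the 0x7 written in the example.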

[Figure: ecceid]

  • ECCEID (ECC Error Inject Delay): ECC injection delay counter.
    • When ese==1 and ede==1, it decrements until it reaches 0.
      Currently it counts at the core clock frequency; a divided clock
      could also be used. Because ECC injection relies on an L1 DCache
      access, the time at which the delay expires and the time at which
      the ECC error is actually triggered may differ.
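
The interaction of ese, pst, ede, and the ECCEID counter can be sketched as a small behavioral model. This is one reading of the description, not the hardware implementation: the class name `Injector` and its methods are hypothetical, and details such as exactly when the counter reloads are assumptions.

```python
class Injector:
    """Toy model of the injection control described above."""

    def __init__(self) -> None:
        self.ese = 0      # error signaling enable
        self.pst = 0      # persist
        self.ede = 0      # error delay enable
        self.eid = 0      # last value written to ECCEID
        self.counter = 0  # running delay counter

    def configure(self, ese: int, pst: int, ede: int, eid: int = 0) -> None:
        self.ese, self.pst, self.ede = ese, pst, ede
        self.eid = eid
        self.counter = eid

    def tick(self) -> None:
        """One clock cycle: count down while injection and delay are enabled."""
        if self.ese and self.ede and self.counter > 0:
            self.counter -= 1

    def armed(self) -> bool:
        """Injection takes effect if enabled and any delay has expired."""
        return bool(self.ese) and (not self.ede or self.counter == 0)

    def access(self) -> bool:
        """Model one L1 DCache access; return True if an error is injected."""
        if not self.armed():
            return False
        if self.pst:
            self.counter = self.eid  # persist: reload the delay and re-arm
        else:
            self.ese = 0             # one-shot: ese is cleared
        return True
```

Because the error only fires on `access()`, an expired counter with no cache access produces nothing, which mirrors the note that the delay expiry and the actual error may not coincide.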

[Figure: eccmask]

  • ECCMASK (ECC Mask): ECC injection mask register.
    • A 0 bit leaves the corresponding bit unchanged; a 1 bit flips it.
      Tag injection uses only the bits in ECCMASK0 that correspond to
      the tag length.
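
The mask semantics amount to an XOR restricted to the width of the target. A minimal sketch, assuming the relevant mask bits are the low `width` bits (the description does not say which bits of ECCMASK0 map to the tag) and using a hypothetical helper name `apply_mask`:

```python
def apply_mask(value: int, mask: int, width: int) -> int:
    """Flip the bits of `value` selected by `mask`, limited to `width` bits."""
    keep = (1 << width) - 1           # only the low `width` bits take part
    return (value & keep) ^ (mask & keep)
```

For instance, `apply_mask(0b1010, 0x1, 4)` flips bit 0 and gives `0b1011`, matching the "flip bit 0" mask used in the example; mask bits above the target width are ignored.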

Error Inject Example

# set control bank base address
li x3, $(BASEADDR)

# set eid
li x5, 500 # delay 500 cycles
sd x5, 8(x3) # mmio store to ECCEID

# set mask
li x5, 0x1 # flip bit 0
sd x5, 16(x3) # mmio store to ECCMASK

# set ctl
li x5, 0x7 # cmp = 0, ede = 1, pst = 1, ese = 1
sd x5, 0(x3) # mmio store to ECCCTL

@cz4e cz4e added the enhancement New feature in plan label Dec 9, 2024
@cz4e cz4e requested a review from Anzooooo December 9, 2024 10:54
@cz4e cz4e force-pushed the feat-l1dcache-ras-support branch from cad44db to 488c05c Compare December 11, 2024 05:55
@XiangShanRobot

[Generated by IPC robot]
commit: be82a1c

commit astar copy_and_run coremark gcc gromacs lbm linux mcf microbench milc namd povray wrf xalancbmk
be82a1c 1.904 0.440 2.689 1.227 2.833 2.461 2.368 0.916 1.396 2.042 3.432 2.719 2.368 3.231

master branch:

commit astar copy_and_run coremark gcc gromacs lbm linux mcf microbench milc namd povray wrf xalancbmk
d4265a7 1.899 2.701 1.219 2.833 2.461 2.395 0.921 1.426 2.022 3.432 2.707 2.368 3.227
81ed416
ad7236c 1.904 1.227 2.833 2.461 0.916 2.042 3.432 2.368 3.231
7dc438a 1.904 0.450 2.698 1.227 2.833 2.461 2.395 0.916 1.386 2.042 3.432 2.719 2.368 3.231
b867eb9 1.917 0.450 2.689 1.232 2.842 2.461 2.394 0.925 1.418 2.035 3.435 2.704 2.383 3.261

@linjuanZ linjuanZ changed the title feat(L1DCache RAS): l1dcache ras support feat(CtrlUnit, DCache): support L1 DCache RAS Dec 12, 2024
@XiangShanRobot

[Generated by IPC robot]
commit: 3580c5e

commit astar copy_and_run coremark gcc gromacs lbm linux mcf microbench milc namd povray wrf xalancbmk
3580c5e 1.899 0.440 2.689 1.219 2.833 2.461 2.368 0.921 1.396 2.022 3.432 2.707 2.368 3.227

master branch:

commit astar copy_and_run coremark gcc gromacs lbm linux mcf microbench milc namd povray wrf xalancbmk
f346d72
d29ebcf
9cf1e44 1.899 1.219 2.833 2.461 0.921 2.022 3.432 2.368 3.227
98d2aaa 1.899 0.450 2.701 1.219 2.833 2.461 2.395 0.921 1.426 2.022 3.432 2.707 2.368 3.227
433cc30 1.899 0.450 2.701 1.219 2.833 2.461 2.395 0.921 1.426 2.022 3.432 2.707 2.368 3.227

@cz4e cz4e force-pushed the feat-l1dcache-ras-support branch from 3580c5e to 56127a2 Compare December 13, 2024 06:02
@XiangShanRobot

[Generated by IPC robot]
commit: fbe3e9f

commit astar copy_and_run coremark gcc gromacs lbm linux mcf microbench milc namd povray wrf xalancbmk
fbe3e9f 1.937 0.440 2.691 1.227 2.866 2.462 2.368 0.930 1.387 2.054 3.437 2.718 2.367 3.210

master branch:

commit astar copy_and_run coremark gcc gromacs lbm linux mcf microbench milc namd povray wrf xalancbmk
c7ca40e 1.937 0.451 2.697 1.227 2.866 2.462 2.393 0.930 1.425 2.054 3.437 2.718 2.367 3.210
38d0d7c
8ffb12e
99baa88 1.899 0.450 2.701 1.219 2.833 2.461 2.395 0.921 1.426 2.022 3.432 2.707 2.368 3.227
f346d72 1.899 0.450 2.701 1.219 2.833 2.461 2.395 0.921 1.426 2.022 3.432 2.707 2.368 3.227
