Merge pull request #43 from ved-rivos/ar_approved
AR updates
ved-rivos authored Jan 18, 2024
2 parents d764e3f + c2dada4 commit c5ecffa
Showing 4 changed files with 67 additions and 66 deletions.
2 changes: 1 addition & 1 deletion reri_contributors.adoc
@@ -3,4 +3,4 @@
This RISC-V specification has been contributed to directly or indirectly by (in alphabetical order):

[%hardbreaks]
Aaron Durbin, Allen Baum, Andrew Walter, Anup Patel, Cameron McNairy, Dimitris Gizopoulos, Daniele Rossi, David Kruckemeyer, Dhaval Sharma, Greg Favor, Himanshu Chauhan, Holger Blasum, Mostafa Hadizadeh, Nicasio Canino, Petar Radojkovic, Vedvyas Shanbhogue, Xiaohan Ma
Aaron Durbin, Allen Baum, Andrew Walter, Anup Patel, Cameron McNairy, Dimitris Gizopoulos, Daniele Rossi, David Kruckemeyer, Dhaval Sharma, Greg Favor, Himanshu Chauhan, Holger Blasum, Nicasio Canino, Petar Radojkovic, Shubu Mukherjee, Vedvyas Shanbhogue, Xiaohan Ma
94 changes: 47 additions & 47 deletions reri_err_reporting.adoc
@@ -273,8 +273,8 @@ is as follows:
{bits: 1, name: 'else'},
{bits: 1, name: 'cece'},
{bits: 2, name: 'ces'},
{bits: 2, name: 'udes'},
{bits: 2, name: 'uues'},
{bits: 2, name: 'ueds'},
{bits: 2, name: 'uecs'},
{bits: 24, name: 'WPRI'},
{bits: 16, name: 'eid'},
{bits: 1, name: 'sinv'},
@@ -306,8 +306,8 @@ continue to use containment techniques like data poisoning even when error
reporting is disabled.
====

The `ces`, `udes`, and `uues` are WARL fields used to enable signaling of CE, UDE,
and UUE respectively when they are logged (i.e. when `else` is 1). Enables for
The `ces`, `ueds`, and `uecs` are WARL fields used to enable signaling of CE, UED,
and UEC respectively when they are logged (i.e. when `else` is 1). Enables for
unsupported classes of errors may be hardwired to 0. The encodings of these
fields are specified in <<ERR_SIG_ENABLES>>.

@@ -405,7 +405,7 @@ of 0. Writing a value of 0 disables the counter. If error injection is not
supported by the error record then the `eid` field may be hardwired to 0. When
`eid` reaches a count of 0, the status register is made valid by setting the
`status_i.v` bit to 1. The `status_i.v` transition from 0 to 1 generates a RAS
signal corresponding to the class of error (CE, UDE, or UUE) setup in the
signal corresponding to the class of error (CE, UED, or UEC) setup in the
`status_i` register. The counter continues to count even if the `status_i`
register was overwritten by a hardware detected error before the `eid` counts
down to 0.
@@ -441,8 +441,8 @@ the hardware unit.
{reg: [
{bits: 1, name: 'v'},
{bits: 1, name: 'ce'},
{bits: 1, name: 'ude'},
{bits: 1, name: 'uue'},
{bits: 1, name: 'ued'},
{bits: 1, name: 'uec'},
{bits: 2, name: 'pri'},
{bits: 1, name: 'mo'},
{bits: 1, name: 'c'},
@@ -466,23 +466,23 @@ The error record holds a valid error log if the `v` field is 1. The `status_i`
register does not accept a software write when the `v` field is 1.

If the detected error was corrected then `ce` is set to 1. If the detected error
could not be corrected but was deferred then `ude` is set to 1. If the detected
error could not be corrected or deferred and thus needs urgent handling by an
RAS handler, then the `uue` bit is set to 1. If the error record does not log a
class of errors (e.g., does not support UDE), then the corresponding bit may be
could not be corrected but was deferred then `ued` is set to 1. If the detected
error could not be corrected or deferred and thus needs immediate handling by an
RAS handler, then the `uec` bit is set to 1. If the error record does not log a
class of errors (e.g., does not support UED), then the corresponding bit may be
hardwired to 0. If the bits corresponding to more than one error class are set
to 1 then the error record holds information about the highest severity error
class among the bits set. The error record may be used to provide an
informational update by setting the `v` bit to 1 and setting `ce`, `ude`, and
`uue` bits to 0. Such informational updates are signaled using the signal
informational update by setting the `v` bit to 1 and setting `ce`, `ued`, and
`uec` bits to 0. Such informational updates are signaled using the signal
configured in `control_i.ces`.
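
As an illustration of the class bits described above, the following Python sketch decodes the low byte of a `status_i` value and reports the highest-severity class held in the record. It assumes the fields in the register listing are packed from bit 0 upward in the order shown (`v`, `ce`, `ued`, `uec`, `pri`, `mo`, `c`). The helper names `decode_status` and `logged_class` are hypothetical and are not part of the specification.

```python
def decode_status(raw: int) -> dict:
    """Split the low byte of status_i into named fields.

    Assumption: fields are packed LSB-first in the order of the
    register listing (v, ce, ued, uec, pri[1:0], mo, c).
    """
    return {
        "v": raw & 1,
        "ce": (raw >> 1) & 1,
        "ued": (raw >> 2) & 1,
        "uec": (raw >> 3) & 1,
        "pri": (raw >> 4) & 3,
        "mo": (raw >> 6) & 1,
        "c": (raw >> 7) & 1,
    }


def logged_class(raw: int):
    """Return the highest-severity class in a valid record, else None.

    When several class bits are set, the record describes the most
    severe one (UEC > UED > CE). A valid record with no class bit set
    is treated here as an informational update.
    """
    f = decode_status(raw)
    if not f["v"]:
        return None
    if f["uec"]:
        return "UEC"
    if f["ued"]:
        return "UED"
    if f["ce"]:
        return "CE"
    return "INFO"
```

A handler could call `logged_class` first and only then interpret the remaining fields, since the text states that they describe the highest-severity class among the bits set.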

When `v` is 1, if more errors of the same class as the error currently logged in
the error record occur then the multiple-occurrence (`mo`) bit is set to indicate
the multiple occurrence of errors of the same severity. See <<OVERWRITE_RULES>>
for rules on overwriting the error record in such cases.

Each error of an error class (CE, UDE, or UUE) that may be logged in an error
Each error of an error class (CE, UED, or UEC) that may be logged in an error
record may be associated with a priority which is a number between 0 and 3;
priority value of 3 being the highest priority and priority value of 0 being the
lowest priority. The priority values indicate relative priority among errors of
@@ -505,17 +505,17 @@ implementation may support only a subset of legal values for this field and
an implementation that does not support reporting of a priority per error may
hardwire this field to 0.

The error record overwrite rules use the error class (CE, UDE, or UUE) and the
The error record overwrite rules use the error class (CE, UED, or UEC) and the
error priority (`pri`) as specified in <<OVERWRITE_RULES>>.

When an UUE occurs the containable (`c`) bit may be set to 1 to indicate
When an UEC occurs the containable (`c`) bit may be set to 1 to indicate
that the error has not propagated beyond the boundaries of the hardware unit
that detected the error and thus may be *containable* through recovery actions
(e.g., terminating the computation, etc.) carried out by the RAS handler.
The `c` bit is WARL. For error classes other than UUE, the interpretation of
The `c` bit is WARL. For error classes other than UEC, the interpretation of
the `c` bit may be specified in a future standard extension.

For a RISC-V hart, some UUE may cause a Hardware Error exception cite:[PRIV].
For a RISC-V hart, some UEC may cause a Hardware Error exception cite:[PRIV].
A Hardware Error is a synchronous exception, triggered when corrupted or
uncorrectable data is accessed, either explicitly or implicitly, by an
instruction. In this context, "data" encompasses all types of information used
@@ -593,23 +593,24 @@ to 0.
| 7 | Implicit write.
|===

For a RISC-V hart, the Privileged specification cite:[PRIV] defines memory
accesses by instructions as either explicit or implicit. An Implicit read or
write is an access that may be implicitly performed by hardware to perform an
explicit operation. For example, a load or store instruction executed by the
hart may perform implicit memory accesses to page table data structures.
Instruction memory accesses by a hart are termed as implicit accesses by the
Privileged specification. However, for the purposes of error reporting, only
the implicit accesses to data structures, such as the (guest) page tables that
are used to determine the address of the instructions to be fetched, are termed
as implicit accesses. The read to fetch the instruction bytes themselves is
classified as an explicit read.

[NOTE]
====
Implementations may report additional information about the transaction (e.g.,
whether speculative, on-demand vs. prefetch, etc.) in the `info_i` and/or
`suppl_info_i` registers.
For a RISC-V hart, the Privileged specification cite:[PRIV] defines memory
accesses by instructions as either explicit or implicit. Implicit read and write
are accesses that may be implicitly performed by hardware to perform an explicit
operation. For example, a load or store instruction executed by the hart may
perform implicit memory accesses to page table data structures. Instruction
memory accesses by a hart are termed as implicit accesses by the Privileged
specification. However for the purposes of error reporting only the implicit
accesses to data structures like the (guest) page tables used to determine the
address of the instruction to fetch are termed as implicit accesses. The
read to fetch the instruction bytes themselves are termed as explicit reads.
A non-hart component may also perform implicit accesses in order to process an
explicit transaction. For example, processing a memory transaction may require
a fabric component to implicitly access a routing table data structure.
@@ -695,7 +696,7 @@ writing a new error into the record and setting the `v` field to 1, then softwar
should repeat this process.
====

When an UUE or UDE error is logged in an error record, the `cec` and `ceco` fields
When an UEC or UED error is logged in an error record, the `cec` and `ceco` fields
of the error record are not modified and retain their values.

==== Address or information Register (`addr_info_i`)
@@ -777,12 +778,12 @@ When a hardware unit detects an error it may find its error record still valid
due to an earlier detected error that has not yet been consumed by software.

The overwrite rules allow a higher severity error to overwrite a lower severity
error. UUE has the highest severity, followed by UDE, and then CE. When the two
error. UEC has the highest severity, followed by UED, and then CE. When the two
errors have same severity the priority of the errors (as determined by
`status_i.pri`) is used to determine if the error record is overwritten. Higher
priority errors overwrite the lower priority errors. When a error record is
overwritten by a higher severity error (UDE/CE by UUE, UDE by UUE, or CE by
UUE/UDE), the status bits indicating the severity of the older errors are
overwritten by a higher severity error (UED/CE by UEC, UED by UEC, or CE by
UEC/UED), the status bits indicating the severity of the older errors are
retained (i.e., are sticky).
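
The severity and priority rules above can be sketched in Python as follows. This is a minimal model of only the part of the revised pseudocode listing that is visible in this commit: it updates the class bits, `mo`, and `rdip`, and returns the `overwrite` decision. Copying the remaining fields and setting `v` are left to the caller because that part of the listing is not shown here. The `Status` class and the `log_error` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Status:
    # Only the fields used by the visible overwrite rules are modeled.
    v: int = 0
    ce: int = 0
    ued: int = 0
    uec: int = 0
    pri: int = 0
    mo: int = 0
    rdip: int = 0


def severity(s: Status) -> int:
    """UEC is the most severe class, then UED, then CE."""
    if s.uec:
        return 3
    if s.ued:
        return 2
    if s.ce:
        return 1
    return 0


def log_error(rec: Status, new: Status) -> bool:
    """Update class bits, mo and rdip in rec; return the overwrite decision."""
    overwrite = False
    if rec.v == 1:
        new_sev, old_sev = severity(new), severity(rec)
        if new_sev > old_sev:
            # Higher severity overwrites and clears mo.
            rec.mo = 0
            overwrite = True
        if new_sev == old_sev:
            # Same severity sets mo; higher priority overwrites.
            rec.mo = 1
            if new.pri > rec.pri:
                overwrite = True
        # Earlier class bits are retained (sticky); rdip is cleared.
        rec.rdip = 0
        rec.uec |= new.uec
        rec.ued |= new.ued
        rec.ce |= new.ce
    else:
        # Empty record: log only the highest class of the new error.
        rec.rdip = 1
        rec.uec = new.uec
        rec.ued = new.ued & ~new.uec & 1
        rec.ce = new.ce & ~new.uec & ~new.ued & 1
        rec.mo = 0
        overwrite = True
    return overwrite
```

In this model the sticky OR of the class bits runs whenever the record is already valid, including when the new error is of lower severity than the logged one, which mirrors where the revised listing places those statements.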

When an error writes or overwrites an error record, the `status_i.cec` and
@@ -801,32 +802,31 @@ overflow on `cec` increment sets `ceco` to 1.
if status_i.v == 1
// There is a valid first error recorded
if ( severity(new_error) > severity(status_i) )
// Higher severity errors overwrite less severe errors, retaining
// previous error status bits (sticky) but clearing the rdip bit.
status_i.rdip = 0
status_i.uue |= new_status.uue
status_i.ude |= new_status.ude
status_i.ce |= new_status.ce
// Higher severity errors overwrite less severe errors and clear mo
status_i.mo = 0
overwrite = TRUE
endif
if ( severity(new_status) == severity(status_i) )
// Second errors of the same severity set MO and clear rdip.
// Second errors of the same severity set MO
status_i.mo = 1
status_i.rdip = 0
// Second error of same severity overwrites previous error if it
// has higher priority (status_i.pri).
if ( new_status.pri > status_i.pri )
overwrite = TRUE;
endif
endif
// previous error status bits are retained (sticky) but rdip bit is cleared.
status_i.rdip = 0
status_i.uec |= new_status.uec
status_i.ued |= new_status.ued
status_i.ce |= new_status.ce
else
// No valid error recorded; new error logged, clearing sticky history
// and MO bit, and rdip is set.
status_i.rdip = 1
status_i.uue = new_status.uue
status_i.ude = new_status.ude & ~new_status.uue
status_i.ce = new_status.ce & ~new_status.uue & ~new_status.ude
status_i.uec = new_status.uec
status_i.ued = new_status.ued & ~new_status.uec
status_i.ce = new_status.ce & ~new_status.uec & ~new_status.ued
status_i.mo = 0
overwrite = TRUE;
endif
@@ -849,10 +849,10 @@ overflow on `cec` increment sets `ceco` to 1.

<<<

If the `status_i.v`, `status_i.mo`, and `status_i.uue` are all 1 then the RAS
If the `status_i.v`, `status_i.mo`, and `status_i.uec` are all 1 then the RAS
handler should preferably restart the system to bring it to a correct state as
an UUE record has been lost. If the `status_i.v` and `status_i.mo` are 1 but
`status_i.uue` is 0 (i.e., the logged error is a UDE or a CE) then the RAS
an UEC record has been lost. If the `status_i.v` and `status_i.mo` are 1 but
`status_i.uec` is 0 (i.e., the logged error is a UED or a CE) then the RAS
handler may keep the system operational.
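
The restart guidance above reduces to a single predicate, sketched here in Python. The function name `uec_record_lost` is hypothetical, and the arguments are assumed to be the `v`, `mo`, and `uec` bits already extracted from `status_i`.

```python
def uec_record_lost(v: int, mo: int, uec: int) -> bool:
    """True when a UEC record has been lost (v, mo and uec all 1).

    In that case the text recommends that the RAS handler restart the
    system; otherwise the handler may keep the system operational.
    """
    return bool(v and mo and uec)
```

A handler would typically evaluate this before attempting any finer-grained recovery, since a lost UEC record means the full extent of the corruption is unknown.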

If multiple errors occur simultaneously then they may be recorded individually
17 changes: 9 additions & 8 deletions reri_header.adoc
@@ -1,9 +1,9 @@
[[header]]
:description: RISC-V RAS Error Record Register Interface Specification
:company: RISC-V.org
:revdate: 03/2023
:revnumber: 0.1
:revremark: This document is in development. Assume everything can change. See http://riscv.org/spec-state for details.
:revdate: 01/2024
:revnumber: 1.0-rc1
:revremark: This document is in stable state. Assume everything can change. See http://riscv.org/spec-state for details.
:url-riscv: http://riscv.org
:doctype: book
:preface-title: Preamble
@@ -39,11 +39,12 @@ RERI Task Group

// Preamble
[WARNING]
.This document is in the link:http://riscv.org/spec-state[Development state]
.This document is in the link:http://riscv.org/spec-state[Stable state]
====
Assume everything can change. This draft specification will change before
being accepted as standard, so implementations made to this draft
specification will likely not conform to the future standard.
Assume anything could still change, but limited change should be expected.
This draft specification will change before being accepted as standard, so
implementations made to this draft specification will likely not conform to
the future standard.
====

[preface]
@@ -53,7 +54,7 @@ Attribution 4.0 International License (CC-BY 4.0). The full
license text is available at
https://creativecommons.org/licenses/by/4.0/.

Copyright 2022 by RISC-V International.
Copyright 2022 - 2024 by RISC-V International.

[preface]
include::reri_contributors.adoc[]
20 changes: 10 additions & 10 deletions reri_intro.adoc
@@ -139,15 +139,15 @@ corrected by the hardware are called *Corrected Errors (CE)*.
Errors that could not be corrected are called uncorrected errors. A component
that detects an uncorrected error may allow possibly corrupted data to
propagate to the requester of the data but associate an indicator (e.g., poison)
with the data. Such errors are said to be *Uncorrected Deferred Errors (UDE)* as
with the data. Such errors are said to be *Uncorrected Errors Deferred (UED)* as
they allow the component to continue operation and defer dealing with the error
to a later point in time if the data corrupted by the error is consumed. Deferring
errors allows deferring the error handling to an ultimate consumer of the
corrupted data that may be able to provide more precise information to a RAS
handler about the contexts affected by the corruption and thus enable more
precise error recover actions by the RAS handler. The component that detected
and deferred the error may notify a RAS handler by reporting the UDE
but such a UDE does not need an immediate remedial action to be performed by the
and deferred the error may notify a RAS handler by reporting the UED
but such a UED does not need an immediate remedial action to be performed by the
RAS handler. For example, a memory controller may detect an uncorrectable ECC
error on data in memory but since there is no immediate consumer of the data the
memory controller may just mark the data as poisoned and defer the error
@@ -158,17 +158,17 @@ data is only partially written then the data continues to be marked as poisoned.

A component that detects an uncorrected error may be unable to defer the
handling of the error by techniques such as poisoning. Such errors are said to
be *Uncorrected Urgent Errors (UUE)* and a RAS handler is invoked as
be *Uncorrected Errors Critical (UEC)* and a RAS handler is invoked as
immediate remedial actions are required. For example, a cache controller
may detect an uncorrectable ECC error on the memory used to hold cache tags
and since such errors cannot be attributed to any particular data element
these errors may be classified as UUE. If poisoned data is attempted to be
consumed by a component (e.g. a hart, an IOMMU, a device, etc.) then an UUE
these errors may be classified as UEC. If poisoned data is attempted to be
consumed by a component (e.g. a hart, an IOMMU, a device, etc.) then an UEC
occurs as immediate remedial actions are required and further deferral of the
error is not possible.

A component that signals a request for execution of an RAS handler
for an UUE may indicate that the error has not propagated beyond the boundaries
for an UEC may indicate that the error has not propagated beyond the boundaries
of the component that detected the error and thus may be *containable* through
recovery actions (e.g., terminating the computation, etc.) carried out by the
RAS handler.
@@ -180,7 +180,7 @@ endpoint. In such cases the component may receive the data with a deferred
error. Such a component may propagate the error and not log an error by itself.
However, if the component to which the data is being propagated (e.g. a PCIe
endpoint) is not capable of handling poison then the former component must
signal a UUE instead of propagating the corrupted data, as the act of
signal a UEC instead of propagating the corrupted data, as the act of
propagation breaks containment of the error.

An error detected by a component may lead to a failure mode where the component
@@ -331,8 +331,8 @@ between hardware components and error records/banks.
| SPA | Supervisor Physical Address. See Priv. specification.
| TLB | Translation Lookaside Buffer.
| VA | Virtual Address. See Priv. specification.
| UDE | Uncorrected Deferred Error.
| UUE | Uncorrected Urgent Error.
| UED | Uncorrected Error Deferred.
| UEC | Uncorrected Error Critical.
| WARL | Write Any values, Reads Legal values: Attribute of a
register field that is only defined for a subset of bit
encodings, but allow any value to be written while
