Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AdditionalVectorCrypto] adding section on addition vector crypto extensions #1306

Draft
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

nibrunieAtSi5
Copy link
Contributor

@nibrunieAtSi5 nibrunieAtSi5 commented Mar 28, 2024

This pull request is the riscv-isa-manual version of a pull request started on riscv-crypto: riscv/riscv-crypto#362.

/!\ This pull request is a draft for the future fast track extensions.

This pull requests draft the changes associated with two fast track extensions for vector crypto.

During the specification process for vector crypto 1.0.0 a few items had to be discarded because they appeared too late in the process. This fast track extension tries to address some of them.

The official demand that will be discussed in the Task Group and submitted to the Unpriv Committee is being drafter here: https://docs.google.com/document/d/1zpYhnZi2NxhjfcBGvPOy0oDhx6lTXchscG17Qcl6wv8/edit?usp=sharing

New features:

  • Zvbc32e: Extending vclmul[h].v[vh] instruction to support SEW=32-bit value
    • should be available standalone (ELEN >= 32) or in addition to Zvbc (ELEN >= 64)
    • no new encoding
  • Zvkgs: Adding .vs variants to vghsh and vghmul
    • should depend on Zvkg
    • new encodings

Open questions:

  • Should Zvbc32e be allowed when ELEN >= 32 without depending on Zvbc ? (Answer: YES)
  • Should Zvbc32e support SEW=16 ? (SEW=8 ?)
  • Find encodings
  • How to name the two new extensions
  • Do we need to define a Zvkt(bc/bc32e) to extend Zvkt to the extension of vclmul[h/] defined in Zvbc32e ?

Related changes:

Draft versions:

Version pdf
v0.0.1 (August 31st 2023) https://github.com/riscv/riscv-crypto/files/12487628/riscv-crypto-spec-vector-extra.pdf
v0.0.2 (January 17th 2024) https://github.com/riscv/riscv-crypto/files/13970691/riscv-crypto-spec-vector-extra.pdf
v0.0.3 (February 1st 2024) https://github.com/riscv/riscv-crypto/files/14146438/riscv-crypto-spec-vector-extra_v0.0.3.pdf
v0.0.4 (February 6th 2024) riscv-crypto-spec-vector-extra_v0.0.4.pdf
v0.0.5 (March 7th 2024) riscv-crypto-spec-vector-extra_v0.0.5.pdf
v0.0.6 (June 19th 2024) riscv-crypto-spec-vector-extra_v0.0.6.pdf
v0.0.7 (July 31st 2024) riscv-crypto-spec-vector-extra_v0.0.7.pdf
v0.0.9 (September 23rd 2024) riscv-crypto-spec-vector-extra_v0.0.9.pdf
v0.0.10 (October 19th 2024) riscv-crypto-spec-vector-extra_v0.0.10.pdf
v0.0.11 (December 5th 2024) riscv-crypto-spec-vector-extra_v0.0.11.pdf

Changelogs

  • from v0.0.5 to v0.0.6:
    • adding vs2 / vd overlap as reserved encoding for new vghsh.vs / vgmul.vs instructions (review feedback from @QJtaibai )
  • from v0.0.8 to v0.0.9:
    • change name OP-P to OP-VE
    • adding justification for Zvbc32e and Zvkgs
  • from v0.0.9 to v0.0.10:
    • Changing Zvbc32e specification to incorporate 8 and 16-bit carry-less multiplication
    • Fixing typos (including unifying to "carry-less" (1))
  • from v0.0.10 to v0.0.11:
    • implementing ARC feedback:
      • explicit mention of "reserved" behavior for vl / vstart not multiple of EGS
      • split into two chapters (one per extension)
    • multiple typo fixes + switch to in "development" state

Original Plan for the fast track schedule

image

References

@QJtaibai
Copy link

QJtaibai commented Jun 5, 2024

Will it be reserved or not for instructions (vghsh and vgmul) , if the vd register group overlaps with the vs2 register group.

@nibrunieAtSi5
Copy link
Contributor Author

Will it be reserved or not for instructions (vghsh and vgmul) , if the vd register group overlaps with the vs2 register group.

Good question @QJtaibai , this is not listed but should be to aligned with the other .vs instructions (and since overlapping those does not make sense functionally and can make implementation harder).

@Yunzezhu94
Copy link

Hello! I have a question that in the instruction part of document vghsh.vs shares same encoding with vghsh.vv, and has func6:101100, while in Appendix A it looks vghsh.vs have func6:100011. I wonder which one is correct?

@nibrunie
Copy link

nibrunie commented Jul 5, 2024

Hello! I have a question that in the instruction part of document vghsh.vs shares same encoding with vghsh.vv, and has func6:101100, while in Appendix A it looks vghsh.vs have func6:100011. I wonder which one is correct?

Appendix A and the riscv-v opcode Pull request (https://github.com/nibrunieAtSi5/riscv-opcodes/pull/1/files) are right, the instruciton part of the document was incorrect. I have fixed it.

Note that those opcodes are simply suggestion at this point (but that does not mean the suggestions should not be consistent).

Thank you for pointing that out.

@nibrunieAtSi5
Copy link
Contributor Author

I have developed a small example to demonstrate how Zvbc32e can be leveraged to implement CRC (with the folding method): https://github.com/nibrunie/rvv-examples/blob/e55d529fe54316a95b733aea82cadbfbfbf08e67/src/crc/vector_crc_be_zvbc32e.c

The current implementation provides "performance" of about 0.65 instruction per bytes (on 1MB input buffer).

@nibrunie
Copy link

nibrunie commented Aug 1, 2024

wangpc-pp added a commit to wangpc-pp/llvm-project that referenced this pull request Aug 14, 2024
These two extensions add addtional instructions for carryless
multiplication with 32-bits elements and Vector-Scalar GCM
instructions.

Please see riscv/riscv-isa-manual#1306.
[wavedrom, , svg]
....
{reg:[
{bits: 7, name: 'OP-P'},

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have renamed it to OP_VE in #1311.

@wangpc-pp
Copy link

Will it be reserved or not for instructions (vghsh and vgmul) , if the vd register group overlaps with the vs2 register group.

Good question @QJtaibai , this is not listed but should be to aligned with the other .vs instructions (and since overlapping those does not make sense functionally and can make implementation harder).

We don't have this constraint on vghsh.vv and vgmul.vv? Is it OK to make it stricter?

@topperc
Copy link
Collaborator

topperc commented Aug 15, 2024

Will it be reserved or not for instructions (vghsh and vgmul) , if the vd register group overlaps with the vs2 register group.

Good question @QJtaibai , this is not listed but should be to aligned with the other .vs instructions (and since overlapping those does not make sense functionally and can make implementation harder).

We don't have this constraint on vghsh.vv and vgmul.vv? Is it OK to make it stricter?

For LMUL>1, the operations likely take place over multiple cycles or uops. The restriction on .vs ensure that the scalar element in the first register is not overwritten before all LMUL parts have been processed. This is the same reason we have rules on widening and narrowing instructions. This restriction is not needed for .vv since the input bits and output bits are from the same offset in the register group. Overwriting an earlier part of the register doesn't affect operations on later parts of the register.

@wangpc-pp
Copy link

Will it be reserved or not for instructions (vghsh and vgmul) , if the vd register group overlaps with the vs2 register group.

Good question @QJtaibai , this is not listed but should be to aligned with the other .vs instructions (and since overlapping those does not make sense functionally and can make implementation harder).

We don't have this constraint on vghsh.vv and vgmul.vv? Is it OK to make it stricter?

For LMUL>1, the operations likely take place over multiple cycles or uops. The restriction on .vs ensure that the scalar element in the first register is not overwritten before all LMUL parts have been processed. This is the same reason we have rules on widening and narrowing instructions. This restriction is not needed for .vv since the input bits and output bits are from the same offset in the register group. Overwriting an earlier part of the register doesn't affect operations on later parts of the register.

Yeah, thanks! I get it! But it seems that vd can overlap vs1 for reductions, aren't they the same?

@topperc
Copy link
Collaborator

topperc commented Aug 15, 2024

Will it be reserved or not for instructions (vghsh and vgmul) , if the vd register group overlaps with the vs2 register group.

Good question @QJtaibai , this is not listed but should be to aligned with the other .vs instructions (and since overlapping those does not make sense functionally and can make implementation harder).

We don't have this constraint on vghsh.vv and vgmul.vv? Is it OK to make it stricter?

For LMUL>1, the operations likely take place over multiple cycles or uops. The restriction on .vs ensure that the scalar element in the first register is not overwritten before all LMUL parts have been processed. This is the same reason we have rules on widening and narrowing instructions. This restriction is not needed for .vv since the input bits and output bits are from the same offset in the register group. Overwriting an earlier part of the register doesn't affect operations on later parts of the register.

Yeah, thanks! I get it! But it seems that vd can overlap vs1 for reductions, aren't they the same?

Maybe the assumption for reductions is that vd would only be updated after all of vs1 is read since you need to see all elements before you know the result.

@aswaterman
Copy link
Member

@topperc Right, for reductions, the amount of additional state that’s needed is nil for many implementations, but in the worst case, it’s bounded by ELEN. Here, I guess it’s bounded by EGW. That difference could be material for narrow vector machines. @nibrunieAtSi5 can you illuminate the situation?

@DavidYu360
Copy link

What about the VPR alignment rules for vghsh.vs / vgmul.vs?
As I mentioned in riscv-software-src/riscv-isa-sim#1548

Possible alignment value: 1, 2, 4, 8 (1 for any VPR, 2 for V0, V2, ...)
Prerequisite: VLEN * LMUL >= EGW

  1. Single key in .vs inst: max(EGW/VLEN, 1)
  2. Others: max(LMUL, 1)

wangpc-pp added a commit to wangpc-pp/llvm-project that referenced this pull request Aug 19, 2024
These two extensions add addtional instructions for carryless
multiplication with 32-bits elements and Vector-Scalar GCM
instructions.

Please see riscv/riscv-isa-manual#1306.
wangpc-pp added a commit to llvm/llvm-project that referenced this pull request Aug 19, 2024
These two extensions add addtional instructions for carryless
multiplication with 32-bits elements and Vector-Scalar GCM
instructions.

Please see riscv/riscv-isa-manual#1306.
@nibrunieAtSi5
Copy link
Contributor Author

The non-overlap constraint for vector element group is part of the vector crypto specification

Source/Destination overlap constraints::

In the case of the .vs instructions defined in this specification, vs2 holds a 128-bit scalar element group.
For implementations with VLEN ≥ 128, vs2 refers to a single register. Thus, the vd register group must not
overlap the vs2 register.
However, in implementations where VLEN < 128, vs2 refers to a register group comprised of the number
of registers needed to hold the 128-bit scalar element group. In this case, the vd register group must not
overlap this vs2 register group.

[%autowidth]
[%header,cols="4,4,4"]
|===
| Instruction
| Register
| Cannot Overlap

| vaes*.vs | vs2 | vd
| vsm4r.vs | vs2 | vd
| vsha2c[hl] | vs1, vs2 | vd
| vsha2ms | vs1, vs2 | vd
| vsm3me | vs2 | vd
| vsm3c | vs2 | vd

It was defined as no overlapping between vector register groups.

This was not necessarily repeated for each operation

* Only for the `.vs` form: the `vd` register group overlaps the `vs2` scalar element group

  • Only for the .vs form: the vd register group overlaps the vs2 scalar element group

I think this constraint makes sense here as well since allowing overlapping would constraint uarch which cannot save/stage the scalar element group before finishing the operation (as stated by @topperc it is highly likely that for LMUL > EGW / VLEN implementations will iterate to compute the operation over the full vector register groups in a few cycles). Note that .vs makes sense for vl / EGS > 1 since this is where we can actually exploit the fact of having a scalar element group.
The size of vs2 vector register group should be understood as with EMUL = min(EGW / VLEN, 1).
The example in the base vector crypto spec seems to hint this way:

However, in implementations where VLEN < 128, vs2 refers to a register group comprised of the number
of registers needed to hold the 128-bit scalar element group.

but should maybe be generalized to EGW.

@topperc
Copy link
Collaborator

topperc commented Aug 29, 2024

Zvkgs: Adding .vs variants to vghsh and vghmul
should depend on Zvkgs

Is the second line supposed to say Zvkg rather than Zvkgs?

@nibrunie
Copy link

nibrunie commented Sep 2, 2024

Zvkgs: Adding .vs variants to vghsh and vghmul
should depend on Zvkgs

Is the second line supposed to say Zvkg rather than Zvkgs?

Yes, you are right @topperc , this is a typo dependency to Zvkg was intended (the assumption is that the vector-scalar(element group) variant do not make sense without the vector-vector variant.

@httoracle
Copy link

In the v0.0.7 spec, I have a question of the operation code of the vghsh.vs and vgmul.vs. Take vghsh.vs for example, It says H is common to all element groups, and you put the "let H = brev8(...)" code outside the loop. I suspect the H of all the group should be the same, and is from the first 128bit of vs2. But inside the each loop, H will be changed.
So if you write just like the operation code in the spec, the H of the next loop will not be the first 128bit of the vs2. I want to ask it is just the purpose of the instruction or is just a miss?
If you want the H to be the same value in the beginning of the each loop, you should put "let H = brev8(...)" inside the loop, so in the beginning of the loop, we can get the H value from vs2 again.

@nibrunie
Copy link

nibrunie commented Sep 10, 2024

In the v0.0.7 spec, I have a question of the operation code of the vghsh.vs and vgmul.vs. Take vghsh.vs for example, It says H is common to all element groups, and you put the "let H = brev8(...)" code outside the loop. I suspect the H of all the group should be the same, and is from the first 128bit of vs2. But inside the each loop, H will be changed. So if you write just like the operation code in the spec, the H of the next loop will not be the first 128bit of the vs2. I want to ask it is just the purpose of the instruction or is just a miss? If you want the H to be the same value in the beginning of the each loop, you should put "let H = brev8(...)" inside the loop, so in the beginning of the loop, we can get the H value from vs2 again.

@httoracle You are right.
H needs to be reset inside the outer loop. this is a mistake in the current behavior code.
It should be fixed by 9982d13.

@nibrunie
Copy link

Additional Vector Crypto extension v0.0.8 (September 9th 2024):
riscv-crypto-spec-vector-extra_v0.0.8.pdf

Changelog from v0.0.7:

  • Fixing behavior of vghsh.vs and vgmul.vs (thanks to @httoracle )
  • Upgrading alias for 0x77 opcode from OP-P to OP-VE (thanks to @wangpc-pp )

Signed-off-by: Nicolas Brunie <[email protected]>
@nibrunie nibrunie force-pushed the additional-vector-crypto branch from 9982d13 to 9b12009 Compare September 20, 2024 21:11
@nibrunie nibrunie force-pushed the additional-vector-crypto branch from 9b12009 to a12c20d Compare September 20, 2024 21:21
@wangpc-pp
Copy link

wangpc-pp commented Sep 23, 2024

Maybe a silly question: do we need zvbc8e/zvbc16e? I feel it can be useful, but I have no data to show how common CRC8/CRC16 are.

@nibrunieAtSi5
Copy link
Contributor Author

@wangpc-pp

Maybe a silly question: do we need zvbc8e/zvbc16e? I feel it can be useful, but I have no data to show how common CRC8/CRC16 are.

That is definitely no a silly question and it has been appeared a few times.
There are a few use cases for 8-bit and 16-bit vector carryless multiply (as you mentioned at least 8 and 16-bit CRCs) but when I asked in the crypto task group for motivations, no one was able to say this was a key use cases which would justify specific extensions (rather than relying on widening and existing instructions from Zvbc[32e], it seems 8 and 16-bit CRCs are less used than the 32-bit versions).

  • From a hardware point of view, I have run a few experiments and supporting 8 and 16-bit vclmul[h] on top of the 32-bit and 64-bit variants would have a minimal cost
  • No new opcode would need to be allocated since this would simply be extending the range of valid vsew for vclmul[h] to cover 0 and 1
  • Another question was: if we go down that route do we need to specific Zvbc[8,16,32]e independently or to avoid some fragmentation they would better be specified in a single extension.

@nibrunie
Copy link

v 0.0.9:
Changelog: Integration feeback, including ARC's:

  • change name OP-P to OP-Ve
  • adding justification for Zvbc32e and Zvkgs

riscv-crypto-spec-vector-extra_v0.0.9.pdf

Note:: part of the justification for Zvkgs relies on the fact that a scalar element group in a .vs vector instruction has EMUL=EGW/VLEN. This fact implies that when VLEN >= 128, one single vector register is required to hold a scalar element group for Zvkgs and it does not need to have an index aligned on LMUL (similar to what should apply for Zvkned or Zvksed). Although this can appear intuitive, I don't think this is ever mentioned explicitly in the original vector crypto specification.

@wangpc-pp
Copy link

@wangpc-pp

Maybe a silly question: do we need zvbc8e/zvbc16e? I feel it can be useful, but I have no data to show how common CRC8/CRC16 are.

That is definitely no a silly question and it has been appeared a few times. There are a few use cases for 8-bit and 16-bit vector carryless multiply (as you mentioned at least 8 and 16-bit CRCs) but when I asked in the crypto task group for motivations, no one was able to say this was a key use cases which would justify specific extensions (rather than relying on widening and existing instructions from Zvbc[32e], it seems 8 and 16-bit CRCs are less used than the 32-bit versions).

* From a hardware point of view, I have run a few experiments and supporting 8 and 16-bit `vclmul[h]` on top of the 32-bit and 64-bit variants would have a **minimal cost**

* No new opcode would need to be allocated since this would simply be extending the range of valid `vsew` for `vclmul[h]` to cover `0` and `1`

* Another question was: if we go down that route do we need to specific Zvbc[8,16,32]e independently or to avoid some fragmentation they would better be specified in a single extension.

I don't know the answer, but here are some investigations:

If the cost of supporting 8/16 bits won't be large, personally, I would like to have them added.

@nibrunieAtSi5
Copy link
Contributor Author

I don't know the answer, but here are some investigations:

If the cost of supporting 8/16 bits won't be large, personally, I would like to have them added.

That is very interesting, it would be also interesting to check if anyone has ever ported jerasure/gf-complete over RISC-V (Vector) and what would be the expected benefit overall.

@topperc
Copy link
Collaborator

topperc commented Sep 24, 2024

Is there any concern that the name Zvkgs is too similar to the existing Zvksg? Though this isn't the first time we had such similar names, we have Zbc and Zcb.

@nibrunieAtSi5
Copy link
Contributor Author

Is there any concern that the name Zvkgs is too similar to the existing Zvksg? Though this isn't the first time we had such similar names, we have Zbc and Zcb.

No concerns were voiced so far, but if you think this could be an issue we can try to come up with another name. (Zvkga for additional Zvkg instructions ?)

@wangpc-pp
Copy link

Zvbc32e may not be suitable now if we extend SEW to 8/16/32 bits. Maybe Zvbcxw(Extended Width), Zvbcxe(Extended EEW)?

@nibrunieAtSi5
Copy link
Contributor Author

nibrunieAtSi5 commented Oct 6, 2024

Zvbc32e may not be suitable now if we extend SEW to 8/16/32 bits. Maybe Zvbcxw(Extended Width), Zvbcxe(Extended EEW)?

I discussed this with @aswaterman and he suggested that the suffix 32* has already been used to mean any element width 32-bit and lower, e.g. in Zve32x, so I think Zvbc32e still works (although the e in Zve32x stands for embedded).

@aswaterman
Copy link
Member

Agreed

@wangpc-pp
Copy link

I don't agree this wording, I strongly recommond that we should change to another name.
My thinking is that 32 in Zve32x/f is Minimum VLEN, not Supported EEW. For Zvbc32e, the postfix e indicates that it is about EEW, so people may think only 32 bits is supported.

@nibrunieAtSi5
Copy link
Contributor Author

I don't agree this wording, I strongly recommond that we should change to another name. My thinking is that 32 in Zve32x/f is Minimum VLEN, not Supported EEW. For Zvbc32e, the postfix e indicates that it is about EEW, so people may think only 32 bits is supported.

I understand your concern @wangpc-pp
Even if we interpret Zve has meaning is the minimum VLEN then it also works here for Zvbc(32e/e32/e32x)
We could switch to e32 rather than 32e and have Zvbce32 (IIRC we need to avoid ending an extension name with a number to facilitate parsing ISA string and this is why I used 32e) to mean that the minimum VLEN supported by the extension is 32 and it has the same supported list of EEW as Zve32(x/f): that is 8, 16, and 32.

If I summarize, the options are:

  • Zvbc32e
  • Zvbcxw (Extended Width),
  • Zvbcxe (extended EEW)
  • Zvbce32
  • Zvbce32x

I would be in favour of keeping 32 somewhere in the name because at least it makes explicit one of the constraint (VLEN/ELEN) but I agree that Zvbc was not explicit on listing the only EEW it supports and having only 32 in the name does not make it explicit that it supports 8 and 16 bits SEW values as well.

@topperc
Copy link
Collaborator

topperc commented Oct 11, 2024

I don't agree this wording, I strongly recommond that we should change to another name. My thinking is that 32 in Zve32x/f is Minimum VLEN, not Supported EEW. For Zvbc32e, the postfix e indicates that it is about EEW, so people may think only 32 bits is supported.

I don't think the 32 in Zve32x/f is about VLEN. The focus on the Zve* extensions is about reducing ALU cost by limiting which element sizes are supported or what operations are supported on large elements. The removal of 64-bit integers elements is the important difference between Zve32x and Zve64x not the minimum VLEN.

@nibrunieAtSi5
Copy link
Contributor Author

I don't agree this wording, I strongly recommond that we should change to another name. My thinking is that 32 in Zve32x/f is Minimum VLEN, not Supported EEW. For Zvbc32e, the postfix e indicates that it is about EEW, so people may think only 32 bits is supported.

I don't think the 32 in Zve32x/f is about VLEN. The focus on the Zve* extensions is about reducing ALU cost by limiting which element sizes are supported or what operations are supported on large elements. The removal of 64-bit integers elements is the important difference between Zve32x and Zve64x not the minimum VLEN.

That was my initial interpretation: that the VLEN constraint relaxation was actually a consequence on the limitation of ELEN in general or on the largest SEW supported for floating-point elements.

@wangpc-pp
Copy link

Thanks for these explanations! I am still not convinced but I will let it be since it seems I'm in the minority. 😄

@kasanovic
Copy link
Collaborator

The 32 in Zve32 represents ELEN. The minimum VLEN is set to be the same as ELEN for these extensions (Table 62 in current vector chapter).

@nibrunie
Copy link

New version: v0.0.10:

Changelog:

  • Changing Zvbc32e definition to include 8-bit and 16-bit SEW values
  • Fixing typos (including unifying to "carry-less" (1))

riscv-crypto-spec-vector-extra_v0.0.10.pdf

(1): it seems the original vector crypto specification uses both "carryless" and "carry-less". Looking at wikipedia, both seem valid (https://en.wikipedia.org/wiki/Carry-less_product, https://en.wikipedia.org/wiki/Finite_field_arithmetic#Carryless_multiply). I chose to unify to "carry-less" but I am open to suggestion.

@nibrunie
Copy link

nibrunie commented Dec 5, 2024

New version of the spec draft (v0.0.11 Dec5th 2024):

riscv-crypto-spec-vector-extra_v0.0.11.pdf

Changelog:

  • implementing ARC feedback:
    • explicit mention of "reserved" behavior for vl / vstart not multiple of EGS
    • split into two chapters (one per extension)
  • multiple typo fixes + switch to in "development" state

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants