-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clarify namespacemaps used in SPDX documents and collections #403
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clarifying words, Gary! However, as far as I understand it, there can be multiple SpdxDocument
s per payload, so I think this does not solve the issue I describe in this comment.
You are correct. I think of the SPDX Document as a "unit of transfer" of SPDX information where there would only be one SPDX Document per payload for the "original" creation, although downstream users may bundle multiple SPDX documents in a different payload or collection. Hence defining the namespace map to be the "original". |
@zvr - I'm getting the following error from the CI - let me know any advice on tracking it down:
Thanks |
There can be more than one SpdxDocument per payload, but only one of them refers to the payload it is contained in. (That's not strictly true, someone could create 5 SpdxDocuments referring to the same payload, but that would be stupid.) Any other SpdxDocuments refer to other payloads, allowing hashes of the others to be verified and their elements imported. A use case would be an Sbom with 10 elements split over two payloads. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good and can be merged. But would like to "downgrade" the originalNamespaces
from a reliable mechanism
to just an optional hint
.
|
||
## Description | ||
|
||
This field is used for providing a reliable mechanism in reproducing an original serialization for a given SPDX Document. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of reliable mechanism
for me it feels more like a hint
.
We probably could also highlight that its support is optional? Especially that it can be omitted, even if some magic shortened IDs while serialization. So, one could serialize using some namespaceMap like features, but is not forced to store this information in that map? Might be helpful to simplify implementation.
Especially if someone "mints" the SpdxDocument and later wants to serialize it. Otherwise, one would be forced to do the namespace choices before minting the element. And one would need to do the same choice for all serializations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It being just a hint, also resolves the concerns above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually changed my mind. I would prefer to have two PRs that are discussed independently:
- drop
namespacesMap
now from the model - add some functionality to store the
namespacesMap
, that is supported by existing serializations. We can wait with that till later as it is just an optimization. And later we can actually test our assumptions.
It's a bit easier for me to agree to both together since the original motivation for adding the namespaces was to support round-trip fidelity for the other format. That being said, waiting until a decision is made on the formats we plan to support does make sense. I'll split into 2 PR's. |
Partial solution to #390 by removing the namespace and NamespaceMap entirely from the model. Note that this was suggested as a separate PR and comment in PR #403 which still kept the NamespaceMap for SPDX Documents. Once this PR is agreed to / merged, we can create a separate PR to add a namespace map hint to SPDX documents. Signed-off-by: Gary O'Neall <[email protected]>
I agree with dropping it entirely. I don't care whether it goes back into SpdxDocument because it serves no purpose but causes no harm. An analogy would be JSON or XML data - whitespace is insignificant. An application can call json.dump with an indent=2 parameter, but there is no API to tell dump to look for an indent parameter in the data to be serialized. With parsing there's no API to return the indentation used to create the data, it's just part of the data that exists but is intended to be insignificant. So if we want to read namespaceMap when parsing and set it when dumping, the serialization spec for each format will need to define an API to do it. The API will need to examine the serialized data itself, not hints in any logical element. |
Note related PR #431 |
It looks like after merging PR #411 this PR is still valid, so moving this from draft back to proposed. |
@davaya - There is a specific use case I would like to solve for:
This is the use case I'm trying to accomplish with the SPDX Document class. The information needs to be in the model itself in order to be preserved between different deserialize / serialize forms. This use case is important for backwards compatibility with SPDX 2.3. |
@maxhbr - Can you comment on any concerns with this PR w.r.t. adding back in the functionality to store namespace maps? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding back in Namespaces as agreed in prior discussion.
My concern: it is an optional feature and might or might not be serialized. And it might or might not be correctly recovered from parsing a serialized file. Therefore parsing such an SpdxDocument Element could result in inconsistent content. Serializing to multiple formats and de-serializing with multiple tools that do or do not support this feature will result in different SpdxDocument objects. Which will mess up all hashing and consistency checks. Can we clarify the |
@maxhbr I'm not sure I completely understand the concern. I think we've always had the issue that if you deserialize and reserialize an SPDX document, it would have different hashes - that's something the canonicalization team was chartered to address. From my perspective, it is acceptable for intermediate suppliers to not support the round trip use case - I just want to make it possible to support if all parties are interested in supporting it - so optional is fine. |
I see issues when serializing and deserialize and then have an Element with a slightly different content. |
The element being the |
Every In order for this use case:
to have any effect Payload1 must include the SpdxDocument describing Payload1. If it does so and SpdxDocument has NamespaceMap, then an application can reproduce Payload2 that is bit-for-bit identical to Payload1 if you assume everything else in canonicalization has been fully defined and followed by the implementations. I still don't see this as an interesting use case (just as I don't see the benefit of reproducing JSON documents with identical indentation - what people care about is having the identical data after deserialization), but I don't object to:
|
This is currently described as:
so there is a choice. Either a generator or parser can choose to not support or use it. Or do I misunderstand this part? So, that means that after generation and parsing, depending on choices, one has a different result? |
An SPDX Document is an element, so once it is "minted", it can not change. What I intended by "purely" optional is that when minting an SPDX document, you do not need to include this property - as in it is not required. If it is present, then it should always be present in all serializations of the same element ID. @maxhbr - I think we need to discuss realtime - I don't think my description is anywhere adequate, but I'm not understanding the issues. |
I think Gary's explanation is clear.
|
|
||
This field is used for providing the ability to reproduce an original serialization for a given SPDX Document, especially if the serialization format does not natively support namespace mapping. | ||
|
||
This purely optional field is not intended for implementing or representing the NameSpaceMap for all serializations where this SPDX document may be found. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some questions regarding semantics:
- if this is not provided, is the serialization allowed to use any
NameSpaceMap
feature?- or more specific: are only the explicitly stated
NameSpaceMap
entries allowed?
- or more specific: are only the explicitly stated
- if there is some
NameSpaceMap
like feature of a serialization used, should it be added to theNameSpaceMap
of the correspondingSpdxDocument
? - What are the semantics, if a serialized blob contains not exactly one
SpdxDocument
?
Whether the optional
works depends somewhat on these questions.
And I see two levels of optional here. The entity minting the SpdxDocument might choose to not add Namespaces and the serializer might choose to not use the NameSpaceMap in the Document. Are both optional?
maybe not so relevant:
- How would this approach handle potential inconsistent constraint on mappings? There might be keys, that are allowed in some serialization but disallowed in another one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this is not provided, is the serialization allowed to use any NameSpaceMap feature?
If your question is related to the native namespace map used by the serialization format (not some property serialized data), there is no restriction whether or not the original namespaces is provided.
if there is some NameSpaceMap like feature of a serialization used, should it be added to the NameSpaceMap of the corresponding SpdxDocument?
If the creator of the SPDX data wishes to support namespace round tripping, yes.
What are the semantics, if a serialized blob contains not exactly one SpdxDocument?
Each SpdxDocument object represents one serialization of SPDX data. If you have a serialization which makes copies of multiple serializations which support the round tripping of namespace data, you can have multiple SPDX documents. Each SpdxDocument properties can then be used to recreate the original SPDX namespaces.
And I see two levels of optional here. The entity minting the SpdxDocument might choose to not add Namespaces and the serializer might choose to not use the NameSpaceMap in the Document. Are both optional?
Yes - These are to support recreating the original namespace map. The creator does not have to support this and can omit the necessary data. The consumer can ignore the data and create whatever serialization specific namespace map they want. This is only for the case where the producer wishes to have the namespace map preserved and the consumer would like to recreate it.
How would this approach handle potential inconsistent constraint on mappings? There might be keys, that are allowed in some serialization but disallowed in another one.
The producer would just record the namespace mapping they used. If the consumer can't handle the namespace mapping for their chosen serialization format, they would have to handle that in a method specific to the output serialization format.
Any subsequent serialization is allowed to use any NameSpaceMap - as long as the resulting URI is the same.
It should be added if the creator of the SpdxDocument desires the ability to recreate the original document utilizing the same NameSpaceMap. I would probably say the creator "can" use the NameSpaceMap.
The SpdxDocument should relate to one and only one "original" serialization (this could be clarified in the description). Subsequent serializations may aggregate multiple SpdxDocuments and represent each of the original serializations with separate SpdxDocuments within the same serialization. |
|
This means, that one mints a different I still think that this is an issue that we should tackle very late, when we are clear how our serializations will look like and when we are able to test our assumptions on real data. |
If one chooses to create an SpdxDocument at all, it refers to a single minted blob. If the SpdxDocument contains verifiedBy (which is its primary purpose), every blob will have its own verification information. Having a single SpdxDocument that can verify multiple blobs would be possible only if we develop canonical serialization. I would guess that the canonical form would not use compaction.
Even if SpdxDocument does not contain verifiedBy, handlers must be able to serialize a set of elements that includes zero or more SpdxDocument elements. Data to be serialized is the "data plane", serialization parameters are knobs on the "control plane" and the compaction knob can be turned without serializing SpdxDocuments. |
How would one compute the hash of something containing the hash a property? If the minted |
@davaya , you say that:
so the "verifiedBy" is its primary purpose. But the description of the
I think there is a mismatch between your understanding and the current definition. I think that we need to align on the description of SpdxDocument first. |
Yes, the serialization sub-team and the tech team need a common understanding of serialization and the model, and until recently the model made no distinction between logical element values and serialized blobs. Now it does, and discussion will elaborate on how serialized values and the logical model relate to each other.
Normally security functions have a payload with protected and unprotected header information wrapped around it. In cases where envelope values (such as a hash) are to be included inside the payload, the normal approach is to assign a fixed value (e.g., zero or an empty string) to the payload field when computing the signature/hash, then filling in the computed value before transmission. Validation reverses the process, nulling the payload field before validating. This is only necessary if serialized blobs contain their own security values, which is why its easier for serialized blobs to only contain security values for other blobs, as is the use case for "importing" elements from one blob to another. Another topic for the model/serialization discussion. |
Absolutely not - there is only one The verifiedBy is an interesting topic, but not the primary reason I wanted have an We definitely need to discuss this real-time - the thread is getting very long and we're mixing a lot of different issues. |
I agree, but today and next Tuesday we have public holidays. Will miss the meetings. In my opinion, the reasons for me derailing the discussion is, that semantics of SpdxDocument are not clear. I think we need to start earlier and at a later point this problem can be solved. |
The If there is any serialization-dependent information contained in the |
This PR is an attempt to solve issue #390 by making moving the namespaceMap to SPDX Documents rather than all collections and redefining the semantics to be specific to the round-tripping issue for serialization namespaces Signed-off-by: Gary O'Neall <[email protected]>
Signed-off-by: Gary O'Neall <[email protected]>
Signed-off-by: Gary O'Neall <[email protected]>
Proposal: I suggest that the parser that wants to persist the information for reproducing the serialization should create a new Element which is minted on parsing, which
and this Element is then linked to the parsed content. This new type would need to be added to the Model, maybe in a small profile. |
@maxhbr +1 That "new type" is already in the model as SpdxDocument - it can be minted by the parser of a payload. |
I disagree. I think the |
That's one use case. But the intent of the "X" element (SpdxDocument element, Payload element, File element) is that it can be generated by user Y who parses a Payload created last year by creator Z. Maybe we can agree on the on-the-fly element generation as a hard requirement, and then decide what the name of the X element should be, as Gary suggests.
That would be in the Bom / ElementCollection element - the creator creates a Bom element and can give it a name, and every element has creationInfo. The payload doesn't need a name other than a "filename" or "downloadLocation" hint, and wouldn't have a name at all if it's an API response rather than a file. The name/location are ephemeral, created by whoever puts a payload in a filesystem or web server. Including an "X" element in the payload it describes is only a hint for where it might be found. This is analogous to an html or markdown document which doesn't have it's own name/url as part of its content. The document does however have references to other documents. |
In general I like this approach as it removes the confusion and ambiguity between the serialization namespace map and the model version of the namespace map. It is similar to the approach where we handle this in the serialization spec but does provide a model object to store the information - even if it isn't every "serialized". The edge case that is not covered is if there is an intermediate deserialization / serialization. In that case we could loose the original name space map if the intermediate chose not follow the same namespace map. Not sure if this edge case is important, but I thought I would point this out. |
You are describing two different use cases:
Does the consumer follow Max's proposal to create a "golden namespace map" element that can be used to serialize any set of elements in any data format? In that case, the map is never lost Or does the consumer follow Max's proposal to create a payload-specific element that lists the format-specific download location and hash of the payload and the SpdxIds A, B, and C? In that case the map isn't lost either, because the "original namespace map" element that was created by parsing payload 1 can be serialized into intermediate payloads 2, 3, and 4 even if data format 2 doesn't carry namespace map in the serialized data (and thus can never be the source of a golden or original map). |
|
||
## Summary | ||
|
||
Provides a NamespaceMap representing the NamspaceMap used for the original serialization of an SpdxDocument. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: s/NamspaceMap/NamespaceMap/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
Thanks @lumjjb for the catch!
Signed-off-by: Gary O'Neall <[email protected]>
Superseded by #490 |
This PR is an attempt to solve issue #390 by making moving the namespaceMap to SPDX Documents rather than all collections and redefining the semantics to be specific to the round-tripping issue for serialization namespaces