Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Propose more explicit expectations of serialization of UnixFS data #271

Closed
wants to merge 1 commit into from

Conversation

willscott
Copy link
Contributor

Continuing the conversation in ipfs/go-unixfsnode#27, to propose a spec change for allowing non protobuf-serialized data to be interpreted as UnixFS data.

@@ -78,7 +78,8 @@ This `Data` object is used for all non-leaf nodes in Unixfs.

For files that are comprised of more than a single block, the 'Type' field will be set to 'File', the 'filesize' field will be set to the total number of bytes in the file (not the graph structure) represented by this node, and 'blocksizes' will contain a list of the filesizes of each child node.

This data is serialized and placed inside the 'Data' field of the outer merkledag protobuf, which also contains the actual links to the child nodes of this object.
This data is serialized and placed inside the 'Data' field of the outer structure, which also contains the actual links to the child nodes of this object.
The final structure is most commonly serialized as a merkledag protobuf, but serialized data from other codecs may also be interpreted by the Unixfs ADL.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"serialized data from other codecs" is too broad...

An implicit constraint of the merkledag protobuf encoding is that Links is a single-level list of structs with a well defined single pointer in each.

By saying "you can replace the unxifs merkledag container with anything else" you lose the implicit constraints and need to redefine them somewhere ( here? )

Stepping back once again however - what is the real purpose of this change request to a strongly-ossified protocol? If it is just for ease of test fixtures - this is overkill. If there are other (future?) use-cases: please share them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand any of this comment. Why is that phrasing too broad?

Copy link
Member

@warpfork warpfork Mar 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking in with the sheer passage of time here: the debated phrasing is from >3 years ago.

To contextualize that: we didn't have an IPLD Data Model spec back then (or at best it was still nascent and hotly contested). We didn't have a concept of ADL back then. I'm not the original author, but ISTM it's very likely that this was defined in terms of "merkledag" and "protobuf" not because those were critically important but simply because the terminology for that was accessible at the time and seemed sufficiently clear.

We have clear language for the Data Model now, and that means clear language for nested data structures that are agnostic of codec. Defining unixfsv1 in terms of that language is now easy and clear (and implementing it that way is also easy, as demonstrated by the PR that spawned this discussion, which is incidentally tiny -- very strong evidence of the actual simplicity of this).

It's true that unixfs is a long standing piece of history, and for most of its time it's been protobuf only. But I don't really see a reason to treat that as a sacred cow.

(And so what if the first given reason is test fixtures? That's an excellent reason. If we need a second one, I humbly submit "that it makes sense". As a third: "freedom from codecs is one of the founding principles of IPLD".)

@Stebalien
Copy link
Member

This doesn't buy us anything (it just sucks worse than the current unixfs), adds a bunch of work to change our current libraries, and complicates the upgrade story for unixfs-v2 (where we can currently play some unholy "if dag-pb" games, if necessary).

@warpfork
Copy link
Member

warpfork commented Mar 8, 2022

I think it's important to contextualize that the origin of this is a PR which is in service to demonstrating some test fixtures which use dag-json to describe a unixfs structure. In the original context, I think it's deeply uncontroversial and entirely sensible. (And I will repeatedly bang the drum of: look at the code in that PR: it's very simple, and I think that is worth treating as evidence in favor of the general simplicity of what's being discussed here.)

We ended up here with a specs PR because people were uncomfortable with that change without making it described in specs somehow.

I don't know if this description is perfect, but I strongly believe that what Will's trying to accomplish is sensible.

@warpfork
Copy link
Member

warpfork commented Mar 8, 2022

It seems potentially worth noting that we have a description of the logical format of DAG-PB also published for several years now. ISTM that describing the behavior of UnixFSv1 in terms of the logical form, potentially with slightly less emphasis on DAG-PB where we can emphasize the logical form instead, is generally reasonable.

We may also still want to describe UnixFSv1 as often being triggered by or specifically restricted to DAG-PB. These are two orthogonal things.

I think the nuance of this may actually be clearer in the original implementation PR that spawned this discussion, and just hasn't been emphasized in the opening of this PR.

@aschmahmann
Copy link
Contributor

I don't know if this description is perfect, but I strongly believe that what Will's trying to accomplish is sensible.

AFAICT that thing seems to be to write specs that document some UnixFS things. Given that we are effectively breaking the spec here since all of a sudden UnixFS implementations that refuse to parse some-new-codec-unixfs-cid will now be non-compliant. Yes, we could lean on the "may" in the change to say that actually those things will still be compliant. However, this make the lives of implementations of other specs (e.g. the in-progress HTTP Gateway spec could reasonably require only resolving traditional dag-pb UnixFS nodes) miserable since now they'll need to know about the details behind this "may" in the implementations they're using.

Making new ADLs is free, it's as cheap as just using a different name. Couldn't we just leave this spec alone and instead have a new ADL for unixfs-loose or something?

and complicates the upgrade story for unixfs-v2 (where we can currently play some unholy "if dag-pb" games, if necessary).

I'm less familiar with the history here, but over the last several years I've heard that the intent with future UnixFS upgrades included potentially leveraging the controlled nature of the current spec and potentially the lack of use for dag-pb beyond UnixFS in the greater ecosystem. This could be compared in some ways to the CIDv0 vs v1 where the controlled nature of CIDv0 has generally been useful to us (instead of say allowing any multihash to be a valid CIDv0).

I'd like to not make our lives harder here when the alternative (having a second ADL that have only been proposed to be used in test fixtures anyway) is so easy.

@warpfork
Copy link
Member

warpfork commented Mar 8, 2022

I think I'd generally +1 to that. That seems congruent with the suggestion I made earlier in the implementation PR, that the library should do what it's going to do, and document it, and if things consuming the library want to have rules about only applying the logic on dag-pb data, that's entirely fine and reasonable.

(It begs the question of how we got here a bit, though. I thought that's pretty close to where the implementation PR started, and would've been more or less a reason not to come poking at the prose of this spec? Oh well. I can't complain at the outcome.)

(Nit/sidenote: some of that phrasing, specifcally about naming something differently, sorta rubs against the definition of what an ADL is. The mental orientation for ADLs is supposed to be: they're a "lens" for data, that you apply at your option and choice. So giving an ADL a different name based on how and exact where it's suggested to be signalled would be getting things a bit backwards. But, again, not really entirely topical to this specs prose PR here; just offering the comment informatively.)

@willscott
Copy link
Contributor Author

This isn't a change I feel particularly strongly about, and I don't know if we get anywhere productive out of this discussion.

I guess my general perspective here went along these lines:

  • If I'm coming from a context like a selector, where I have explicitly signaled that I want to data to be 'interpretas''s UnixFS, I probably do want that to happen if the data is in the right 'shape', and not limited to data that happens to have been serialized from a specific on-disk form.
  • In fact, I'd argue that this second, 'loose' codec better matches what I would expect as the default if I were to say {interpretAs: 'unixfs'}. (and that would have been the direct implication of go-unixfsnode#27), so this isn't purely 'for test fixtures' in that sense.

we are also for a non-insignificant amount of time in a place where making adls is not free. We demurred from having a canonical table of ADLs, so instead we have to find different uses spread through code and try to keep them in sync as much as possible for consistency.

@BigLep
Copy link
Contributor

BigLep commented Mar 22, 2022

So this has been open for a couple of weeks now. What are the next steps?

@BigLep BigLep added this to the Best Effort Track milestone Mar 22, 2022
@aschmahmann
Copy link
Contributor

IMO we should close this, if people want to archive this proposal in something like an exploration report (e.g. https://github.com/ipld/ipld/tree/0bdc7c3f870b417e35be01afb58c07c048904d21/notebook/exploration-reports) we can certainly set up a similar structure for IPFS.

While I understand the concept of building ADLs on top of generic IPLD data model structures, I think it's also quite reasonable that there will be cases where users expect their ADLs to be tied to particular codecs or underlying structures such as when building compatibility with existing well defined formats.

UnixFS has been one of those formats for a number of years and potentially complicates the UnixFSv2 upgrade story (#271 (comment)).

It'd be easier to make the case here about over night making all of our existing UnixFS libraries no longer work for all UnixFS data but instead work for some UnixFS data if there was a compelling use case other than testing (#271 (comment)).


I recognize that there is some issue here in that there's no pre-defined process for how to resolve these spec requests when there is disagreement. In this particular case my feeling is to not mess with the many year old spec unless there are particularly compelling user needs to be met. IMO there's nothing wrong with defining a looseUnixFS ADL with the relevant changes and that seems to me like an easy thing to approve since you're just defining what some new system does.

@BigLep
Copy link
Contributor

BigLep commented Apr 8, 2022

Per IPLD triage with @rvagg there was agreement with closing as we don't believe a spec change in this area for testing purposes is justified. I'm sorry :(

Good callout that we don't have a defined process for handling these disagreement cases. I am ok to move forward with closing in this case given without the process because we have someone knowledgeable about IPFS (@aschmahmann ), IPLD (@rvagg ), and outside (@Stebalien ) not supportive.

@BigLep BigLep closed this Apr 8, 2022
rvagg added a commit that referenced this pull request Jun 27, 2022
The strict coupling poses some problems for IPLD implementations that:

 1. do not retain codec information between the serialization and UnixFS (ADL)
    reification; and/or
 2. do not have a mechanism to strictly mandate that UnixFS data be _only_
    encoded in a particular codec.

A mandate in the specification that strictly defines the layering on top of
the codec makes it difficult to implement it as an ADL, which also presents
difficulty for using tooling that builds on IPLD lens-style layering.

Ref: ipfs/go-unixfsnode#27
Ref: #271
Ref: ipld/go-car#304
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Archived in project
Development

Successfully merging this pull request may close these issues.

6 participants