add swarm protocol code #73

Closed
bgits opened this issue Nov 13, 2018 · 25 comments

@bgits

bgits commented Nov 13, 2018

Currently there does not exist a protocol code for Ethereum's Swarm content hashes (https://swarm-guide.readthedocs.io/en/latest/usage.html#bzz-url-schemes).

This makes it difficult to use multiaddr in Ethereum Improvement Proposal #1577 (ethereum/EIPs#1577).

@Stebalien
Member

Is this a hash function or a transport protocol?

@Arachnid

@Stebalien Swarm is a content addressed store, just like ipfs.

@Stebalien
Member

Yes, but which parts need multicodecs, and for what? That is:

  • we have multicodecs for hash functions.
  • we have multicodecs for IPLD formats (eth, git, dagcbor, etc.)
  • we have multicodecs for protocols in multiaddrs (ip4, tcp, quic, etc.)

It looks like you need an IPLD format but I'm not sure.

@Arachnid

We need a multiaddr protocol, so we can address, e.g., /swarm/hash. The necessary hash functions are already in place.

@Arachnid

To clarify, our end goal is to have a content field that can contain either a swarm hash or an IPFS hash (or other related content identifiers). If multiaddr isn't the appropriate way to do that, please let us know.

@Stebalien
Member

We generally distinguish between names and addresses. Addresses generally tell you where to look for some resource (multiaddr), names generally tell you what resource to look for (peer ID, content ID).

So, a multiaddr in an ENS record would address some endpoint (network or otherwise), not a piece of content. To name a piece of content, you'd probably just want a free-form path. For example, /ipns/Qm..., /ipns/a.com, /ipfs/Qm.../path/to/file.txt, /ipld/.../path/to/data, etc.

(One of the motivations in switching away from using /ipfs/$PEER_ID in multiaddrs was to remove this ambiguity).

@Arachnid

Is there a multiformat suitable for encoding an ipfs hash or path? Storing text isn't ideal due to the high cost of onchain storage.

@Arachnid

It looks like multicodec might be suitable? Would it make sense to add ipfs and swarm multihash types to the public multicodec table?

@jbenet
Member

jbenet commented Dec 3, 2018

Would it make sense to add ipfs and swarm multihash types to the public multicodec table?

I filed multiformats/multicodec#94 and multiformats/multicodec#95 for these ^

@jbenet
Member

jbenet commented Dec 3, 2018

@Stebalien can you take ownership of these issues for ENS?

@Stebalien
Member

@jbenet I'll see what I can do but this will need quite a bit of consensus.

@Arachnid multicodec is really just a common table of unique codes. Other multiformats then use these codes but multicodec doesn't really describe a generalized path spec.

I think what you're looking for here are CIDs. CIDs are (multicodec, multihash) tuples where the hash is the hash of the content and the codec indicates how one should interpret the content. We use this datastructure in IPLD to support arbitrary merkle-dag formats (e.g., ETH, git, etc.).

It looks like the correct path forward here is to:

  1. Assign a multicodec to swarm.
  2. Optionally (see cid#26, "Move CIDs to the 'multiformats' project"): define an IPLD format for swarm.
  3. Add a CID field to ENS.

However, there are two drawbacks to 3:

  1. It requires adding a new field to ENS.
  2. It won't support paths. This isn't really an issue for swarm but other IPLD formats expect users to be able to path through datastructures, key by key.

To handle this (and unify our path concepts) we could introduce a new generalized "multipath" spec, subsuming multiaddr. This spec would define a universal pathing system, a global conflict free namespace (basically, the string version of the multicodec table), and a "compact" path form.

This would basically just be taking the current multiaddr spec as-is and extending it slightly to say "and you can use this to address content as well".


Thoughts on this?

@Arachnid

Arachnid commented Dec 3, 2018

I think what you're looking for here are CIDs. CIDs are (multicodec, multihash) tuples where the hash is the hash of the content and the codec indicates how one should interpret the content. We use this datastructure in IPLD to support arbitrary merkle-dag formats (e.g., ETH, git, etc.).

That sounds right. What's the value of the multicodec field, though? Or are you saying it's a multicodec type prefix, followed by multihash content? Isn't that what we're already proposing?

@Arachnid

Arachnid commented Dec 3, 2018

Okay, I see I misunderstood multicodec - it's only a prefix, the spec doesn't include the data.

You list one disadvantage as having to add a new field to ENS, but we're already adding a field to ENS for content hashes; this discussion is trying to determine what it will look like. I assume you're not saying we'd have to add another field in addition to that. I don't think there can be an option that doesn't require adding a field to ENS?

I am a little confused, though, because I don't see a multicodec value for IPFS content-hashes anywhere in the table. What value do CIDs use?

@Stebalien
Member

That sounds right. What's the value of the multicodec field, though? Or are you saying it's a multicodec type prefix, followed by multihash content? Isn't that what we're already proposing?

Yes. Well, specifically, it's: (<multibase>)<cid-version=1><content-multicodec><content-multihash>.
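
A minimal Python sketch of that layout in its binary form (no multibase prefix) may help. It assumes the unsigned-varint encoding used across the multiformats specs and uses the sha2-256 (0x12) and dag-pb (0x70) table codes purely as an example; the helper names `encode_varint`, `multihash`, and `cid_v1` are made up for illustration and are not any library's API.

```python
import hashlib


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    if n < 0:
        raise ValueError("varint must be non-negative")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


SHA2_256 = 0x12  # multihash code for sha2-256
DAG_PB = 0x70    # multicodec code for the dag-pb IPLD format


def multihash(fn_code: int, digest: bytes) -> bytes:
    """<hash-function-code><digest-length><digest>"""
    return encode_varint(fn_code) + encode_varint(len(digest)) + digest


def cid_v1(content_codec: int, mh: bytes) -> bytes:
    """<cid-version=1><content-multicodec><content-multihash>"""
    return encode_varint(1) + encode_varint(content_codec) + mh


digest = hashlib.sha256(b"hello").digest()
cid = cid_v1(DAG_PB, multihash(SHA2_256, digest))
print(cid.hex())  # begins with 01 70 12 20, followed by the 32-byte digest
```

The content multicodec only says how to interpret the hashed bytes; nothing in these bytes names a network or storage system.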

Okay, I see I misunderstood multicodec - it's only a prefix, the spec doesn't include the data.

Yes. Well, really, the term "multicodec" is a bit overloaded. It refers both to the codec itself and formats of the form <multicodec><data>. When I say multicodec, I almost always mean the codec, not formats of the form <multicodec><data>.

but we're already adding a field to ENS for content hashes; this discussion is trying to determine what it will look like.

Ah. Sorry, I thought multiaddr support in ENS was already a done deal (the EIP was merged). My point was that we could try to unify multiaddr and other paths into a master path spec if that were the case.

If not, and if you just need to be able to reference content, CIDs are the way to go.

I am a little confused, though, because I don't see a multicodec value for IPFS content-hashes anywhere in the table. What value do CIDs use?

TL;DR: There is no "IPFS" multicodec.

  1. In a CID, the multicodec describes the data being referenced, not the CID itself.
  2. The multicodec doesn't describe how or where to get the data. A CID describes the data, not where to find it.
  3. We use multiple multicodecs and multiple data formats (DagPB, DagCBOR, Raw, etc.) in IPFS/IPLD. We also have support for Git, Ethereum, Bitcoin, etc.

So, how would this work for swarm? We'd define a new multicodec for swarm and then, in ENS, you'd use CIDs of the form <cid-version=1><swarm-multicodec><multihash> (where the multihash is a sha3 multihash).
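
As a rough illustration of what a client could read back out of such a CID, here is a small Python sketch that splits a binary CIDv1 into its parts. No swarm multicodec exists at this point in the thread, so `PLACEHOLDER_SWARM_CODEC` is an invented value, and using the keccak-256 multihash code (0x1b) for the "sha3" hash is likewise an assumption; `read_varint` and `split_cid` are hypothetical helper names.

```python
def read_varint(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode one unsigned LEB128 varint; return (value, next position)."""
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def split_cid(cid: bytes) -> dict:
    """Split a binary CIDv1 into content codec, hash function, and digest."""
    version, pos = read_varint(cid)
    codec, pos = read_varint(cid, pos)
    hash_fn, pos = read_varint(cid, pos)
    length, pos = read_varint(cid, pos)
    digest = cid[pos:pos + length]
    if version != 1 or len(digest) != length or pos + length != len(cid):
        raise ValueError("not a well-formed binary CIDv1")
    return {"codec": codec, "hash_fn": hash_fn, "digest": digest}


PLACEHOLDER_SWARM_CODEC = 0x0622  # invented for illustration, not an assigned code
KECCAK_256 = 0x1B                 # assumed to be the hash function meant by "sha3"

# version 1, codec 0x0622 as a varint (a2 0c), keccak-256, 32-byte digest
example = bytes([0x01, 0xA2, 0x0C, KECCAK_256, 0x20]) + bytes(32)
parts = split_cid(example)
print(hex(parts["codec"]), hex(parts["hash_fn"]), len(parts["digest"]))
```

An ENS client following this approach would compare `parts["codec"]` against the swarm code to decide that the record describes swarm-formatted data.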

@Arachnid

Arachnid commented Dec 4, 2018

Yes. Well, specifically, it's: (<multibase>)<cid-version=1><content-multicodec><content-multihash>.

Can we reasonably omit the multibase prefix if we specify it'll always be in binary format?

Ah. Sorry, I thought multiaddr support in ENS was already a done deal (the EIP was merged). My point was that we could try to unify multiaddr and other paths into a master path spec if that were the case.

It's merged, but still only a draft. I think we'd rather harmonise now than deprecate later.

If not, and if you just need to be able to reference content, CIDs are the way to go.

I think that's what we want. Are you implying the existence of other, more flexible options, though?

  1. In a CID, the multicodec describes the data being referenced, not the CID itself.
  2. The multicodec doesn't describe how or where to get the data. A CID describes the data, not where to find it.

Hm, that seems like a problem for this approach. We need to be able to look at the data in the ENS record and know which distributed storage system to query (and what identifier to use to query it); it sounds like that's not going to be doable as-is, since the CID metadata doesn't describe the storage system, just the stored data?

@Stebalien
Member

Can we reasonably omit the multibase prefix if we specify it'll always be in binary format?

Yes, sort of. See: multiformats/cid#28

Basically, you can always turn a CID into text (well, the EIP doesn't have to specify how to do this but the CID spec does). However (while some may say otherwise...) there's no reason to waste the byte if the encoding can't be anything other than "raw bytes".
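
To make the "raw bytes on chain, text when needed" point concrete, here is a hedged Python sketch that converts between a binary CID and a multibase text form. It covers only two multibase prefixes, `f` (lowercase base16) and `b` (lowercase base32 without padding); `to_text` and `from_text` are illustrative names, and the example CID uses the raw (0x55) and sha2-256 (0x12) codes with an all-zero digest as dummy data.

```python
import base64


def to_text(cid: bytes, base: str = "b") -> str:
    """Prefix the encoded CID with a one-character multibase code."""
    if base == "f":  # base16, lowercase
        return "f" + cid.hex()
    if base == "b":  # base32, lowercase, no padding
        return "b" + base64.b32encode(cid).decode("ascii").lower().rstrip("=")
    raise ValueError(f"unsupported multibase prefix: {base!r}")


def from_text(text: str) -> bytes:
    """Strip the multibase prefix and recover the binary CID."""
    prefix, body = text[0], text[1:]
    if prefix == "f":
        return bytes.fromhex(body)
    if prefix == "b":
        padded = body.upper() + "=" * (-len(body) % 8)  # restore stripped padding
        return base64.b32decode(padded)
    raise ValueError(f"unsupported multibase prefix: {prefix!r}")


# Binary CIDv1: version 1, raw codec, sha2-256, 32-byte dummy digest.
binary_cid = bytes([0x01, 0x55, 0x12, 0x20]) + bytes(32)
print(to_text(binary_cid, "f"))
print(to_text(binary_cid, "b"))
```

A record that is always stored as bytes would hold `binary_cid` directly and only call something like `to_text` for display.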

It's merged, but still only a draft. I think we'd rather harmonise now than deprecate later.

Got it.

I think that's what we want. Are you implying the existence of other, more flexible options, though?

Not at the moment, no. My point is that CIDs address content by hash and that's all. However, multiaddrs address network endpoints. If you need something that does both, CIDs won't cut it.

Hm, that seems like a problem for this approach.

So, in this case, it shouldn't actually be an issue. Given that you only support one data format, you can just say "if it's a swarm object, look it up in swarm".

More generally, I'd be careful about bundling content addressing with location addressing. For example, what if the same data is available through multiple storage systems? What if you need to migrate from one to another?

Really, you almost want something like a magnet/meta link. That is a CID along with some description of where the content might be found. I'm not familiar enough with ENS to give concrete suggestions but I'd consider a separate field with location hints. Alternatively, we could try to come up with some way to bundle location hints with a CID but I'm not really sure about the best way to do that (usually, I'd pass those hints along out-of-band, e.g., in a separate field).

@nolash

nolash commented Dec 4, 2018

Hm, that seems like a problem for this approach. We need to be able to look at the data in the ENS record and know which distributed storage system to query (and what identifier to use to query it);

https://github.com/ipld/cid#cidv1 specifies:

<cidv1> ::= <multibase-prefix><cid-version><multicodec-content-type><multihash-content-address>

From the table.csv linked from the multicodec page, one gets the impression that all the entries there are actually defined as codecs (even multicodec is itself a codec). Is this correct? It's a bit confusing, as @Stebalien also suggested:

Yes. Well, really, the term "multicodec" is a bit overloaded.

@Arachnid Is the plan to have multiple entries in ENS, one for each underlying "codec"? In this case, wouldn't referencing the same content in ipfs (codec 0x01a5) and swarm (if swarm codec was 0x0622) just be two records comprising, for example:

SWARM: z | base58btc(010622 | 1b20 | keccak256hash)
IPFS: z | base58btc(0101A5 | whatever_multihash_ipfs_uses)

@Arachnid

Arachnid commented Dec 4, 2018

So, in this case, it shouldn't actually be an issue. Given that you only support one data format, you can just say "if it's a swarm object, look it up in swarm".

That implies "and everything else should be looked up in IPFS", though, which is neither very neutral nor very extensible.

More generally, I'd be careful about bundling content addressing with location addressing. For example, what if the same data is available through multiple storage systems? What if you need to migrate from one to another?

Really, you almost want something like a magnet/meta link. That is a CID along with some description of where the content might be found. I'm not familiar enough with ENS to give concrete suggestions but I'd consider a separate field with location hints. Alternatively, we could try to come up with some way to bundle location hints with a CID but I'm not really sure about the best way to do that (usually, I'd pass those hints along out-of-band, e.g., in a separate field).

We want to store an identifier sufficient for the end user to fetch the content. While I recognise that "this is the content hash" is distinct from "and here is the system to look for it in", realistically, due to different systems having different methods of hashing, chunking, and building trees, the chances of having the same content hash accessible in different systems seem low-to-nil.

I think it makes the most sense to combine content hash and location metadata together into a single identifier for that reason.

I'm afraid this still leaves me in the dark as to what the best solution is, however - everything proposed so far seems to have significant issues.

@Stebalien
Member

Is the plan to have multiple entries in ENS, one for each underlying "codec"? In this case, wouldn't referencing the same content in ipfs (codec 0x01a5) and swarm (if swarm codec was 0x0622) just be two records comprising, for example:

Not if one uses CIDs. You'd have one identifying the content and another hinting at where to find the content.

I guess you could also have multiple CIDs (for "alternative" versions of the content).


That implies "and everything else should be looked up in IPFS", though, which is neither very neutral nor very extensible.

Not really. An application wishing to resolve CIDs to data would use a pluggable (parallel and/or hierarchical) resolver. Ideally, it would have:

  • Type-specific resolvers for resolving:
    • Swarm blocks with Swarm.
    • Ethereum blocks.
    • Git objects with GitHub or GitLab.
    • etc...
  • General-purpose resolvers for resolving arbitrary blocks (e.g., IPFS's bitswap).

Now yes, the data exchange protocol IPFS uses (bitswap) can fetch arbitrary blocks, but that's just because it's a general-purpose data exchange protocol.

Really, the issue here is that CIDs are entirely neutral. They don't say anything about how the data should be retrieved (although this can sometimes be inferred from the type).
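
The pluggable resolver described above could be sketched in Python roughly as follows. This is only an illustration of the hierarchical variant (type-specific fetchers first, then general-purpose ones), not how go-ipfs or any real client is structured; `Resolver`, `Fetcher`, and `PLACEHOLDER_SWARM_CODEC` are invented names, and plain dictionaries stand in for Swarm and a bitswap-like exchange.

```python
from typing import Callable, Optional

# A fetcher takes a binary CID and returns the data, or None if it has no answer.
Fetcher = Callable[[bytes], Optional[bytes]]


class Resolver:
    """Try type-specific fetchers for the CID's codec, then general-purpose ones."""

    def __init__(self) -> None:
        self._by_codec: dict[int, list[Fetcher]] = {}
        self._general: list[Fetcher] = []

    def register(self, fetcher: Fetcher, codec: Optional[int] = None) -> None:
        if codec is None:
            self._general.append(fetcher)
        else:
            self._by_codec.setdefault(codec, []).append(fetcher)

    def resolve(self, codec: int, cid: bytes) -> Optional[bytes]:
        for fetch in self._by_codec.get(codec, []) + self._general:
            data = fetch(cid)
            if data is not None:
                return data
        return None


PLACEHOLDER_SWARM_CODEC = 0x0622  # invented for illustration, not an assigned code

# In-memory stand-ins for real networks.
swarm_store = {b"cid-a": b"swarm data"}
general_store = {b"cid-b": b"generic data"}

resolver = Resolver()
resolver.register(swarm_store.get, codec=PLACEHOLDER_SWARM_CODEC)
resolver.register(general_store.get)

print(resolver.resolve(PLACEHOLDER_SWARM_CODEC, b"cid-a"))
print(resolver.resolve(0x70, b"cid-b"))
```

Adding another storage system here means registering another fetcher; the CIDs themselves do not change.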

While I recognise that "this is the content hash" is distinct from "and here is the system to look for it in", realistically, due to different systems having different methods of hashing, chunking, and building trees, the chances of having the same content hash accessible in different systems seem low-to-nil.

If Swarm gains traction, the chances are pretty high: go-ipfs will almost certainly get a plugin for resolving Swarm CIDs with Swarm. Once fetched, the data would be cached in the local IPFS datastore and made available over bitswap.

This is the entire point of IPLD (CIDs are a part of the IPLD spec): interoperability between merkledag systems.

However, I do agree that hints indicating where content can likely be found are important for performance. On the other hand, I'm still not convinced bundling location hints with content identity is a good idea.

@nolash

nolash commented Dec 5, 2018

and another hinting at where to find the content.

but that would be multiple "anothers" if there are multiple locations, right? (IPFS, Swarm...)

They don't say anything about how the data should be retrieved (although this can sometimes be inferred from the type). [...] I'm still not convinced bundling location hints with content identity is a good idea.

Maybe this inference is enough for ENS as a case?

@Stebalien
Member

but that would be multiple "anothers" if there are multiple locations, right? (IPFS, Swarm...)

We'd only need one field type. If ENS is like DNS, you'd repeat the field once per system. Alternatively, you could encode a list of systems.

Maybe this inference is enough for ENS as a case?

Maybe? For swarm, it should be.

Also, a "location" hint can always be added after the fact as a new (optional) field if that becomes an issue.


On the other hand, I'm still not convinced bundling location hints with content identity is a good idea.

On second thought, I can see a reason to bundle these if ENS supports multiple "alternative" records like DNS does. That is, can I have:

CONTENT=<swarm_content>
CONTENT=<something_else>
...

Where the client should pick the first supported data source? If so, then it makes sense to bind these location hints to the records themselves (although I'm not sure what the best way to do this is).
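
A client-side selection over such alternative records might look like the Python sketch below. It assumes, hypothetically, that each CONTENT record is available to the client as a (source hint, CID bytes) pair in record order; ENS defines no such structure in this thread, and `pick_record` is an invented helper.

```python
from typing import Iterable, Optional

Record = tuple[str, bytes]  # (data source hint, binary CID), both hypothetical


def pick_record(records: Iterable[Record], supported: set[str]) -> Optional[Record]:
    """Return the first record whose data source this client can use."""
    for source, cid in records:
        if source in supported:
            return source, cid
    return None


records = [
    ("swarm", b"<swarm_content>"),
    ("other", b"<something_else>"),
]

print(pick_record(records, {"other"}))           # skips the swarm record
print(pick_record(records, {"swarm", "other"}))  # record order decides
```

Because order carries the preference, the publisher rather than the client decides which source is tried first.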

@Arachnid

Arachnid commented Dec 5, 2018

If Swarm gains traction, the chances are pretty high: go-ipfs will almost certainly get a plugin for resolving Swarm CIDs with Swarm. Once fetched, the data would be cached in the local IPFS datastore and made available over bitswap.

If CIDs don't encode information on where to get content, what makes something a Swarm CID, and how does IPFS know how to fetch it from swarm?

Not if one uses CIDs. You'd have one identifying the content and another hinting at where to find the content.

How does a CID hint at where to find the content?

Also, if CIDs don't do that, then what's the purpose of storing a CID in ENS over just a multihash?

@Stebalien
Member

If CIDs don't encode information on where to get content, what makes something a Swarm CID, and how does IPFS know how to fetch it from swarm?

Swarm uses a custom merkledag format (Swarm-Hash?). CIDs pointing to swarm content would be <cidv1><swarm-hash-codec><multihash>.

How does a CID hint at where to find the content?

Sorry, one record/field identifying the content (with a CID) and one (or more) record(s)/field(s) indicating where content related to the ENS name can be found.

Also, if CIDs don't do that, then what's the purpose of storing a CID in ENS over just a multihash?

CIDs tell you how to interpret the referenced merkledag. A bare multihash is sufficient to identify the content but it doesn't tell you if the content is just a raw binary object, a swarm merkletree, a git object, an ethereum block, etc.

@Arachnid

Arachnid commented Dec 5, 2018

Swarm uses a custom merkledag format (Swarm-Hash?). CIDs pointing to swarm content would be <cidv1><swarm-hash-codec><multihash>.

I really think we must be talking at cross-purposes here.

As I understand it, your position is that identifiers should be purely content identifiers, and shouldn't integrate information about the location of that content. Is that correct?

But at the same time, you seem to be suggesting using metadata about the content identifiers, like what sort of hashing they use, to identify where to find the data.

This seems like it has the same effect as including hints about content location, but less reliably, since it's possible that there could be multiple storage locations for a single content hash.

What am I misunderstanding? Can you give a concrete example of what you think an ENS record pointing to a resource that can be either IPFS or Swarm would look like?

Sorry, one record/field identifying the content (with a CID) and one (or more) record(s)/field(s) indicating where content related to the ENS name can be found.

What do you mean "content related to the ENS name"? The goal isn't to store ENS information in Swarm or IPFS, it's to point to Swarm and IPFS resources from ENS.

CIDs tell you how to interpret the referenced merkledag. A bare multihash is sufficient to identify the content but it doesn't tell you if the content is just a raw binary object, a swarm merkletree, a git object, an ethereum block, etc.

So, what is the canonical format for an IPFS identifier, such as those that users enter into their browsers today? Is it a multihash, or a CID? Isn't any of this metadata stored with the actual IPFS object?

@Stebalien
Member

A multicodec has been added in multiformats/multicodec#104 but it's not currently a "multiaddr" codec. However, given that multiaddrs are defined as <multicodec><value>..., we can always choose to redefine this later.


5 participants