Track number of client nodes in the IPFS DHT Network #30

Closed
yiannisbot opened this issue Feb 7, 2023 · 4 comments


@yiannisbot (Member)

Summarising several approaches from out-of-band discussions here, to have them documented.

Approach 1: kubo README file - idea initially circulated by @BigLep

Description: The kubo README file is stored and advertised by every node in the network (ipfs/kubo#9590 (comment)), regardless of whether the node starts out as a client or a server. The provider records for this README become stale after a while, either because peers are categorised as clients (and are therefore unreachable), or because they leave the network (churn), but the records remain in place until they expire. We could count the number of providers across the network for the kubo README CID and approximate the network-wide client vs server ratio.
Downside: This approach would only count kubo nodes (which is a good start and likely the vast majority of clients).
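
For illustration, a rough Go sketch of how the provider count could be gathered from a locally running kubo daemon's RPC API. The endpoint path (`routing/findprovs`; older releases use `dht/findprovs`), the `num-providers` option, the `Type == 4` provider-event convention, and the README CID placeholder are assumptions to be checked against the kubo version in use:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// event mirrors the ndjson objects streamed by findprovs; objects with
// Type == 4 ("Provider") carry provider infos in Responses (assumption based
// on go-libp2p routing query events; verify against your kubo version).
type event struct {
	Type      int
	Responses []struct{ ID string }
}

// countProviders asks the local kubo daemon for providers of a CID and
// returns the number of unique provider PeerIDs seen in the response stream.
func countProviders(c string) (int, error) {
	endpoint := "http://127.0.0.1:5001/api/v0/routing/findprovs?num-providers=10000&arg=" +
		url.QueryEscape(c)
	resp, err := http.Post(endpoint, "", nil) // kubo's RPC API expects POST
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	seen := map[string]struct{}{}
	dec := json.NewDecoder(resp.Body)
	for dec.More() {
		var ev event
		if err := dec.Decode(&ev); err != nil {
			return 0, err
		}
		if ev.Type == 4 { // provider event
			for _, p := range ev.Responses {
				seen[p.ID] = struct{}{}
			}
		}
	}
	return len(seen), nil
}

func main() {
	// Placeholder only: the actual kubo README CID is referenced in ipfs/kubo#9590.
	readmeCID := "<kubo-README-CID>"
	n, err := countProviders(readmeCID)
	if err != nil {
		panic(err)
	}
	fmt.Printf("unique providers of the kubo README: %d\n", n)
}
```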

Approach 2: Honeypot - idea circulated by @dennis-tra

Description: We have:

  • the honeypot, which tracks inbound connections over time,
  • the crawls, which tell us how many routing tables the honeypot appears in.

Maybe we can estimate what share of queries should come across the honeypot and, based on the number of unique clients the honeypot sees, estimate the total number of clients in the network. This would be a low-overhead setup and may allow better estimates with more honeypots.
Downside: The approach comes with the maintenance and infrastructure cost of the honeypot(s).
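
As a back-of-the-envelope illustration of the extrapolation (all numbers below are made up, and it assumes client lookups are spread uniformly, so that the honeypot's routing-table coverage translates directly into the share of clients it observes):

```go
package main

import "fmt"

func main() {
	// Hypothetical inputs, for illustration only.
	uniqueClientsSeen := 5_000.0 // unique client PeerIDs observed by the honeypot
	coverage := 0.02             // fraction of server routing tables containing the honeypot (from crawls)

	// If lookups are spread uniformly, the honeypot sees roughly `coverage`
	// of all active clients, so scale the observation up accordingly.
	estimatedClients := uniqueClientsSeen / coverage
	fmt.Printf("estimated clients in the network: ~%.0f\n", estimatedClients)
}
```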

Approach 3: Baby-Hydras - idea circulated by @guillaumemichel

Description: Another approximation we could get is by running multiple DHT servers. Think of a few baby hydras. Each DHT server would log all PeerIDs sending DHT requests, and we would get the % of clients vs servers by correlating the logs with crawl results. This gives the % of clients and servers observed; we average the results across all DHT servers and extrapolate to get the total number of clients, given that we know the total number of servers.
Downside: The approach comes with the maintenance and infrastructure cost of the DHT servers/baby-hydras.
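
A minimal sketch of the correlation/extrapolation step, with hypothetical inputs (the PeerIDs logged by our DHT server(s), the server PeerIDs found by a crawl, and the crawl's total server count); it also assumes clients and servers are equally likely to send us requests:

```go
package main

import "fmt"

func main() {
	// Hypothetical inputs: requester PeerIDs logged by our DHT server(s) and
	// server PeerIDs reported by a crawl such as Nebula.
	requesters := []string{"peerA", "peerB", "peerC", "peerD", "peerE"}
	crawledServers := map[string]bool{"peerB": true, "peerE": true}
	totalServers := 20_000.0 // total DHT servers in the network, from the crawl

	var clients, servers float64
	for _, p := range requesters {
		if crawledServers[p] {
			servers++
		} else {
			clients++
		}
	}

	// Scale the observed client:server ratio by the known number of servers.
	estimatedClients := totalServers * clients / servers
	fmt.Printf("observed %.0f%% clients; estimated total clients: ~%.0f\n",
		100*clients/(clients+servers), estimatedClients)
}
```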

Approach 4: Bootstrapper + Nebula - info gathered by @yiannisbot

Description: We capture the total number of unique PeerIDs through the bootstrappers. What this gives us is the "total number of nodes that joined the network as either clients or servers". Given that we have the total number of DHT server nodes from the Nebula crawler, we can get a pretty good estimate of the number of clients that join the network. The calculation would simply be: total number of unique PeerIDs (seen by bootstrappers) - DHT server PeerIDs (found by Nebula). In this case, clients will include other non-kubo clients (whether based on the Go IPFS codebase, Iroh, etc.) and js-ipfs based ones too (Node.js, and maybe browser, although the browser ones shouldn't be talking to the bootstrappers anyway).
Downside: We rely on data from a central point - the bootstrappers.
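
The arithmetic itself is just a set difference; a minimal sketch with hypothetical inputs:

```go
package main

import "fmt"

func main() {
	// Hypothetical inputs, for illustration only.
	bootstrapperPeerIDs := map[string]bool{"p1": true, "p2": true, "p3": true, "p4": true}
	nebulaServerPeerIDs := map[string]bool{"p2": true, "p4": true}

	// Clients ≈ PeerIDs seen by the bootstrappers that the crawler did not
	// classify as DHT servers.
	clients := 0
	for id := range bootstrapperPeerIDs {
		if !nebulaServerPeerIDs[id] {
			clients++
		}
	}
	fmt.Printf("estimated client PeerIDs: %d\n", clients)
}
```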


Approach 4 seems like the easiest way to get quick results. All of the rest would be good to have so that we can compare results and gather extra data points.

Any other views, or suggested approaches?

@guillaumemichel (Contributor) commented Feb 8, 2023

Concerning approach 1:
Is there a limit on the number of Provider Records that a DHT Server can (1) store or (2) return in a DHT lookup response?

Concerning approach 2:
How would the nodes get to contact the honeypot?

@lidel commented Feb 9, 2023

In case you want to explore a more generalized version of Approach 1: the majority of active client nodes are likely, at some point, to fetch and then reprovide these empty objects:

  • QmUNLLsPACCz1vLxQVkXqqLX5R1X345qqfHbsf67hvA3Nn (empty unixfs directory)
  • bafkreihdwdcefgh4dqkjv67uzcmw7ojee6xedzdetojuzjevtenxquvyku (empty raw block)
  • baguqeeraiqjw7i2vwntyuekgvulpp2det2kpwt6cd7tx5ayqybqpmhfk76fa (empty dag-json)
  • bafyreigbtj4x7ip5legnfznufuopl4sg4knzc2cof6duas4b3q2fy6swua (empty dag-cbor)

This comes with the nice side-effect of catching real nodes that announce their block caches, and is less likely to pick up random PeerIDs from CI, etc.
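
If one wanted to fold these into the Approach 1 sketch above, the loop could look like this (it reuses the hypothetical countProviders helper from that sketch, so it is a continuation rather than a standalone program):

```go
// Continuation of the Approach 1 sketch: tally unique providers for each of
// the empty objects listed above (countProviders is the hypothetical helper
// defined earlier; this snippet is not standalone).
emptyObjects := map[string]string{
	"empty unixfs directory": "QmUNLLsPACCz1vLxQVkXqqLX5R1X345qqfHbsf67hvA3Nn",
	"empty raw block":        "bafkreihdwdcefgh4dqkjv67uzcmw7ojee6xedzdetojuzjevtenxquvyku",
	"empty dag-json":         "baguqeeraiqjw7i2vwntyuekgvulpp2det2kpwt6cd7tx5ayqybqpmhfk76fa",
	"empty dag-cbor":         "bafyreigbtj4x7ip5legnfznufuopl4sg4knzc2cof6duas4b3q2fy6swua",
}
for name, c := range emptyObjects {
	n, err := countProviders(c)
	if err != nil {
		fmt.Printf("%s (%s): error: %v\n", name, c, err)
		continue
	}
	fmt.Printf("%s (%s): %d unique providers\n", name, c, n)
}
```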

@BigLep commented Mar 6, 2023

Noting that approach 4 is what is being followed for "Number of Client vs Server Nodes in the DHT" in https://www.notion.so/pl-strflt/IPFS-KPIs-f331f51033cc45979d5ccf50f591ee01?pvs=4#ce43d82d30b94de0848c71a9fad414ab

@yiannisbot (Member, Author)

Closing this issue as, for now, we're following Approach 4 above.

If we end up using a different approach in the future (e.g., when nodes persist their routing tables upon restart and bootstrappers end up capturing only new nodes joining), or want to get a more holistic view of clients in the IPFS network (e.g., as per: #45), we'll re-open the issue, if needed.

@yiannisbot changed the title from "Track number of client nodes in the IPFS Network" to "Track number of client nodes in the IPFS DHT Network" on May 11, 2023