Track number of client nodes in the IPFS DHT Network #30
In case you want to explore a more generalized version of Approach 1: the majority of active client nodes may, at some point, fetch and reprovide empty objects.
This comes with the nice side-effect of catching real nodes that announce their block caches, and makes it less likely to count random PeerIDs from CI runs etc.
Noting that approach 4 is what is being followed for "Number of Client vs Server Nodes in the DHT" in https://www.notion.so/pl-strflt/IPFS-KPIs-f331f51033cc45979d5ccf50f591ee01?pvs=4#ce43d82d30b94de0848c71a9fad414ab
Closing this issue, as for now we're following Approach 4 above.
If we end up using a different approach in the future (e.g., when nodes persist their routing tables upon restart and bootstrappers end up capturing only new nodes joining), or want to get a more holistic view of clients in the IPFS network (e.g., as per: #45), we'll re-open the issue, if needed.
Summarising several approaches from out-of-band discussions here to have them documented.
Approach 1: kubo README file - idea initially circulated by @BigLep
Description: The kubo README file is stored and advertised by every node in the network (ipfs/kubo#9590 (comment)), regardless of whether the node starts out as a client or a server. The provider records for this README become stale after a while, either because peers are categorised as clients (and are therefore unreachable), or because they leave the network (churn). But the records are still there until they expire. We could count the number of providers across the network for the kubo README CID and approximate the network-wide client vs server ratio.
Downside: This approach would only count kubo nodes (which is a good start and likely the vast majority of clients).
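As a rough sketch of the calculation (all numbers and names below are hypothetical, not real measurements): compare the total provider records for the kubo README CID against the subset of providers that are still reachable (i.e., DHT servers); the unreachable remainder approximates the clients.

```python
def estimate_client_ratio(total_providers: int, reachable_providers: int) -> float:
    """Approximate the client share among kubo nodes: provider records
    whose peers are no longer reachable are assumed to belong to clients
    (or churned nodes), while reachable providers are DHT servers."""
    clients = total_providers - reachable_providers
    return clients / total_providers

# e.g. 10,000 provider records for the README CID, 3,000 still reachable
ratio = estimate_client_ratio(10_000, 3_000)
print(f"estimated client share: {ratio:.0%}")  # estimated client share: 70%
```

Note that churned servers inflate the "client" side here, so this would be an upper bound on the client share rather than an exact figure.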
Approach 2: Honeypot - idea circulated by @dennis-tra
Description: We have:
Maybe we can estimate what share of queries should come across the honeypot and then estimate the total number of clients in the network, based on the number of unique clients the honeypot sees. This would be a low-overhead setup and may allow better estimates with more honeypots.
Downside: The approach would need maintenance and infrastructure cost of the honeypot(s).
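To illustrate the scaling step in Approach 2 (a minimal sketch; the function name, the hit probability, and the sample figures are all assumptions for illustration, and estimating the hit probability itself from the DHT's routing behaviour is the hard part):

```python
def estimate_total_clients(observed_clients: int, hit_probability: float) -> int:
    """Scale up the unique client PeerIDs seen by the honeypot, assuming
    each client's lookups reach the honeypot with `hit_probability`."""
    return round(observed_clients / hit_probability)

# e.g. the honeypot saw 5,000 unique client PeerIDs, and we estimate
# that 10% of all clients' queries cross it
print(estimate_total_clients(5_000, 0.10))  # 50000
```

With multiple honeypots, the per-honeypot estimates could be averaged (or the hit probability recomputed jointly) to tighten the estimate.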
Approach 3: Baby-Hydras - idea circulated by @guillaumemichel
Description: Another approximation we could get is by running multiple DHT servers. Think of a few baby hydras. Each DHT server would log all PeerIDs sending DHT requests, and get the % of clients vs servers by correlating the logs with crawl results. This gives the % of clients and servers observed; we average the results of all DHT servers, and extrapolate this number to get the total number of clients, given that we know the total number of servers.
Downside: The approach would need maintenance and infrastructure cost of the DHT servers/baby-hydras.
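The extrapolation step in Approach 3 could look like this (a sketch with made-up figures; it assumes the averaged client share c satisfies clients / (clients + servers) = c, so clients = servers * c / (1 - c)):

```python
def extrapolate_clients(client_shares: list[float], total_servers: int) -> int:
    """Average the client share observed by each logging DHT server,
    then extrapolate using the known total number of servers:
    clients = servers * c / (1 - c)."""
    c = sum(client_shares) / len(client_shares)
    return round(total_servers * c / (1 - c))

# e.g. three baby-hydras observed 60%, 62% and 58% clients in their logs,
# and a crawl found 20,000 DHT servers
print(extrapolate_clients([0.60, 0.62, 0.58], 20_000))  # 30000
```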
Approach 4: Bootstrapper + Nebula - info gathered by @yiannisbot
Description: We capture the total number of Unique PeerIDs through the bootstrapper. What this gives us is the "Total number of nodes that joined the network as either clients or servers". Given that we have the total number of DHT server nodes from the Nebula crawler, we can have a pretty good estimation of the number of clients that join the network. The calculation would simply be:
Total number of Unique PeerIDs (seen by bootstrappers) - DHT Server PeerIDs (found by Nebula)
In this case, clients will include other non-kubo clients (whether based on the Go IPFS codebase, Iroh, etc.) and js-ipfs-based ones too (nodejs, and maybe browser, although the browser ones shouldn't be talking to the bootstrappers anyway).
Downside: We rely on data from a central point: the bootstrappers.
Approach 4 seems like the easiest to get us quick results. All of the rest would be good to have to compare results and have extra data points.
Any other views, or suggested approaches?