Context
We're seeing a very large number of offline peers each week (graph below, latest graph here). Offline peers are defined as those that are seen online 10% of the time or less (https://probelab.io/ipfsdht/#availability). This might be affecting the churn we measure in the network: the churn CDF shows a median lifetime of ~20 minutes, but the true median is likely lower, since the churn measurement excludes nodes we have never contacted.
Such short-lived peers do not actually contribute to the network: they fill other peers' routing tables but do not stay online to provide records, if they happen to store any.
This is a tracking issue for figuring out more details, together with some thoughts on what we can do to find out where this large number is coming from.
Facts
We see:
~13-20k unique peers offline each week, which make up 30-40% of all peers seen.
~1250 connection errors per crawl
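To make the "offline" definition concrete, here is a minimal sketch of how the offline share could be computed from per-peer probe results. The data layout (a dict of peer ID to a list of reachable/unreachable booleans) and the function names are assumptions for illustration, not the actual probelab or Nebula pipeline; only the 10% threshold comes from the definition above.

```python
from typing import Dict, List

# From the definition above: a peer is "offline" if it is seen online
# 10% of the time or less.
OFFLINE_THRESHOLD = 0.10


def uptime_fraction(observations: List[bool]) -> float:
    """Fraction of probes in which the peer was reachable."""
    if not observations:
        return 0.0
    return sum(observations) / len(observations)


def offline_share(probes: Dict[str, List[bool]]) -> float:
    """Share of all seen peers whose uptime is at or below the threshold.

    `probes` is a hypothetical layout: peer ID -> one bool per probe.
    """
    if not probes:
        return 0.0
    offline = [
        peer for peer, obs in probes.items()
        if uptime_fraction(obs) <= OFFLINE_THRESHOLD
    ]
    return len(offline) / len(probes)
```

For example, with three peers of which one was reachable in every probe, one in a single probe out of ten, and one never, `offline_share` returns 2/3.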
What might be happening
It could be very short-lived nodes whose lifetime fits between crawler runs (30m intervals).
On startup, a node contacts its neighbours, and they add the new node to their routing tables. The node could then go offline before the next crawl and never be seen online by the crawler.
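As a rough illustration of this hypothesis, the sketch below estimates the chance that a crawl catches a short-lived node while it is still online. It rests on simplifying assumptions that are not from the measurements: crawls are instantaneous snapshots at a fixed interval, and the node's start time is uniformly distributed within the interval. Under those assumptions the probability is simply lifetime / interval, capped at 1.

```python
import random


def p_seen_online(lifetime_min: float, crawl_interval_min: float) -> float:
    """Analytic estimate: chance that one crawl falls inside the node's lifetime."""
    return min(1.0, lifetime_min / crawl_interval_min)


def simulate(lifetime_min: float, crawl_interval_min: float,
             trials: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo check of the same quantity."""
    rng = random.Random(seed)
    seen = 0
    for _ in range(trials):
        # Node joins at a uniformly random offset after the previous crawl.
        start = rng.uniform(0.0, crawl_interval_min)
        # The next crawl happens at t = crawl_interval_min; the node is
        # seen online only if it is still up at that moment.
        if start + lifetime_min > crawl_interval_min:
            seen += 1
    return seen / trials
```

With a 30-minute crawl interval, a node that lives for 5 minutes would be seen online in roughly one case out of six under this model; with a 5-minute interval it would always be caught.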
Ways forward
We need to:
find what proportion of the ~20k offline peers have never been contacted
catch peers with short lifetimes - get user agent and lifetime estimate
possible experiment: run an instance of Nebula with a 5-minute crawl interval
find the in-degree of the unresponsive peers - how many other peers have them in their routing table?
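Two of the questions above (the never-contacted proportion and the in-degree) could be answered from crawl output along these lines. This is a sketch under assumed inputs: `routing_tables` maps each crawled peer to the peer IDs found in its routing table, and `ever_contacted` is the set of peers the crawler has successfully connected to at least once. Neither name reflects Nebula's actual schema.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def in_degree(routing_tables: Dict[str, Iterable[str]]) -> Counter:
    """Count, for every peer, how many other peers list it in their routing table."""
    counts: Counter = Counter()
    for owner, neighbours in routing_tables.items():
        # Deduplicate entries and ignore self-references.
        for peer in set(neighbours):
            if peer != owner:
                counts[peer] += 1
    return counts


def never_contacted_share(offline: Set[str], ever_contacted: Set[str]) -> float:
    """Proportion of offline peers the crawler has never reached."""
    if not offline:
        return 0.0
    return len(offline - ever_contacted) / len(offline)
```

Restricting the `in_degree` result to the offline set would show whether unresponsive peers sit in only a handful of routing tables or are spread widely across the network.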
As a solution, we could avoid adding peers to the routing table immediately after they're seen online, and instead wait for some amount of time before adding them. New peers can also be pinged more frequently when they are first added to the routing table, with the ping frequency gradually decreasing as the peer proves to be stable.
The primary question here would be how long we should wait before adding peers to the routing table.
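The idea can be sketched as a small state machine. This is an illustrative toy, not the go-libp2p-kad-dht implementation: the class, its method names, and all default values (10-minute wait, 30-second first ping, doubling backoff) are hypothetical placeholders, and the wait time in particular is exactly the open question.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class PeerState:
    first_seen: float     # when we first heard from the peer (seconds)
    ping_interval: float  # current gap between liveness pings
    next_ping: float      # when the next ping is due


class DelayedAdmission:
    """Admit peers only after they stay alive for `wait_s`; back off pings over time."""

    def __init__(self, wait_s: float = 600.0, first_ping_s: float = 30.0,
                 backoff: float = 2.0, max_ping_s: float = 3600.0) -> None:
        self.wait_s = wait_s
        self.first_ping_s = first_ping_s
        self.backoff = backoff
        self.max_ping_s = max_ping_s
        self.peers: Dict[str, PeerState] = {}
        self.routing_table: Set[str] = set()

    def observe(self, peer_id: str, now: float) -> None:
        """Track a newly seen peer without adding it to the routing table yet."""
        if peer_id not in self.peers:
            self.peers[peer_id] = PeerState(
                now, self.first_ping_s, now + self.first_ping_s)

    def due_pings(self, now: float) -> List[str]:
        """Peers whose next liveness ping is due."""
        return [p for p, s in self.peers.items() if s.next_ping <= now]

    def on_ping(self, peer_id: str, now: float, alive: bool) -> None:
        """Handle a ping result: drop dead peers, admit stable ones, back off."""
        state = self.peers.get(peer_id)
        if state is None:
            return
        if not alive:
            del self.peers[peer_id]
            self.routing_table.discard(peer_id)
            return
        if now - state.first_seen >= self.wait_s:
            self.routing_table.add(peer_id)
        # Ping less often as the peer keeps responding.
        state.ping_interval = min(state.ping_interval * self.backoff,
                                  self.max_ping_s)
        state.next_ping = now + state.ping_interval
```

A peer that disappears within the first few pings never reaches the routing table, while a long-lived peer ends up being pinged at most once per `max_ping_s`.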
Other thoughts and ideas more than welcome.
yiannisbot changed the title from "Large Number of Unresponsive Peers" to "Large Number of Unavailable Peers" on Jul 28, 2023
We pushed a fix for this in Kubo 0.21.0 (libp2p/go-libp2p-kad-dht#820).
You will need to wait multiple months, maybe years, for this release to become prevalent in the network; currently its share is very small.
Note: the current patch does not handle peers that are available for short amounts of time. However, given your description, I'm pretty sure the nodes that are filtered by libp2p/go-libp2p-kad-dht#820 would still show up in your graph above.
That makes it very hard to discern how much of this is a problem we already know about and how much is a similar but different thing.