p2p: cherry-pick commits from geth for peering issues #1267

pratikspatil024 · 2024-06-13T04:29:07Z

Description

Cherry-picked the following PRs from geth to fix peering issues:
ethereum/go-ethereum#29572
ethereum/go-ethereum#29864
ethereum/go-ethereum#29801
ethereum/go-ethereum#29827
ethereum/go-ethereum#29836
ethereum/go-ethereum#29844
ethereum/go-ethereum#29235

Changes

Bugfix (non-breaking change that solves an issue)
Hotfix (change that solves an urgent issue, and requires immediate attention)
New feature (non-breaking change that adds functionality)
Breaking change (change that is not backwards-compatible and/or changes current functionality)
Changes only for a subset of nodes

Breaking changes

Please complete this section if any breaking changes have been made, otherwise delete it

Nodes audience

In case this PR includes changes that must be applied only to a subset of nodes, please specify how you handled it (e.g. by adding a flag with a default value...)

Checklist

I have added at least 2 reviewers or the whole pos-v1 team
I have added sufficient documentation in code
I will be resolving comments - if any - by pushing each fix in a separate commit and linking the commit hash in the comment reply
Created a task in Jira and informed the team for implementation in Erigon client (if applicable)
Includes RPC methods changes, and the Notion documentation has been updated

Cross repository changes

This PR requires changes to heimdall
- In case link the PR here:
This PR requires changes to matic-cli
- In case link the PR here:

Testing

I have added unit tests
I have added tests to CI
I have tested this code manually on local environment
I have tested this code manually on remote devnet using express-cli
I have tested this code manually on mumbai/amoy
I have created new e2e tests into express-cli

Manual tests

Please complete this section with the steps you performed if you ran manual tests for this functionality, otherwise delete it

Additional comments

Please post additional comments in this section if you have them, otherwise delete it

Node discovery periodically revalidates the nodes in its table by sending PING, checking if they are still alive. I recently noticed some issues with the implementation of this process, which can cause strange results such as nodes dropping unexpectedly, certain nodes not getting revalidated often enough, and bad results being returned to incoming FINDNODE queries.

In this change, the revalidation process is improved with the following logic:

- We maintain two 'revalidation lists' containing the table nodes, named 'fast' and 'slow'.
- The process chooses random nodes from each list on a randomized interval, the interval being shorter for the 'fast' list, and performs revalidation for the chosen node.
- Whenever a node is newly inserted into the table, it goes into the 'fast' list. Once validation passes, it transfers to the 'slow' list. If a request fails, or the node changes endpoint, it transfers back into 'fast'.
- livenessChecks is incremented by one for successful checks. Unlike the old implementation, we will not drop the node on the first failing check. We instead quickly decay livenessChecks to give it another chance.
- Order of nodes in bucket doesn't matter anymore.

I am also adding a debug API endpoint to dump the node table content.

Co-authored-by: Martin HS <[email protected]>
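The list transitions described above can be sketched as a small state function. This is a minimal illustration, not the code from the cherry-picked commit: the names `listKind` and `afterCheck`, the decay factor of 3, and the rule that a node is dropped once its counter reaches zero are all assumptions made for the example.

```go
package main

import "fmt"

// listKind names the revalidation list a node sits in (hypothetical type).
type listKind int

const (
	fastList listKind = iota // revalidated on a short interval
	slowList                 // revalidated on a long interval
	dropped                  // removed from the table
)

// afterCheck returns the new liveness counter and list for a node after one
// revalidation PING. The decay factor and the drop-at-zero rule are assumed.
func afterCheck(checks uint, responded, endpointChanged bool) (uint, listKind) {
	if !responded {
		checks /= 3 // decay quickly instead of dropping immediately
		if checks == 0 {
			return 0, dropped
		}
		return checks, fastList // give the node another chance soon
	}
	checks++
	if endpointChanged {
		return checks, fastList // a new endpoint must be confirmed quickly
	}
	return checks, slowList
}

func main() {
	// A newly inserted node passes its first check and moves to 'slow'.
	checks, list := afterCheck(0, true, false)
	fmt.Println(checks, list == slowList)

	// A long-lived node fails once: the counter decays, the node stays.
	checks, list = afterCheck(9, false, false)
	fmt.Println(checks, list == fastList)
}
```

With these assumed numbers, a node that has passed many checks survives a single failed PING, while one with very few successful checks is removed quickly.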

In #29572, I assumed the revalidation list that the node is contained in could only ever be changed by the outcome of a revalidation request. It turns out that is not true: if the node gets removed due to FINDNODE failure, it will also be removed from the list it is in. This causes a crash. The invariant is: while a node is in the table, it is always in exactly one of the two lists. So it seems best to store a pointer to the current list within the node itself.
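To illustrate the invariant, here is a minimal sketch in which each node carries a pointer to the list that currently holds it. The types `revalList` and `tableNode` and the helper `moveTo` are hypothetical names for this example and are not taken from the commit.

```go
package main

import "fmt"

// revalList is a hypothetical revalidation list.
type revalList struct {
	name  string
	nodes []*tableNode
}

// tableNode is a hypothetical table entry. The list field is nil when the
// node is not in the table, and points to exactly one list otherwise.
type tableNode struct {
	id   string
	list *revalList
}

func (l *revalList) add(n *tableNode) {
	if n.list != nil {
		panic("node is already in a list")
	}
	l.nodes = append(l.nodes, n)
	n.list = l
}

func (l *revalList) remove(n *tableNode) {
	for i, m := range l.nodes {
		if m == n {
			l.nodes = append(l.nodes[:i], l.nodes[i+1:]...)
			n.list = nil
			return
		}
	}
}

// moveTo transfers n to dest. It does nothing if n was removed from the
// table while a revalidation request was still in flight.
func moveTo(dest *revalList, n *tableNode) {
	if n.list == nil {
		return
	}
	n.list.remove(n)
	dest.add(n)
}

func main() {
	fast := &revalList{name: "fast"}
	slow := &revalList{name: "slow"}
	n := &tableNode{id: "a"}

	fast.add(n)
	moveTo(slow, n) // validation passed
	fmt.Println(n.list.name, len(fast.nodes), len(slow.nodes))

	slow.remove(n)  // e.g. removed after a FINDNODE failure
	moveTo(fast, n) // late revalidation result is ignored
	fmt.Println(n.list == nil, len(fast.nodes))
}
```

Because the node itself records where it lives, a late revalidation result never has to guess which list to remove it from.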

enode.Node has separate accessor functions for getting the IP, UDP port and TCP port. These methods performed separate checks for attributes set in the ENR. With this PR, the accessor methods will now return cached information, and the endpoint is determined when the node is created. The logic to determine the preferred endpoint is now more correct, and considers how 'global' each address is when both IPv4 and IPv6 addresses are present in the ENR.
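As a rough sketch of what "how global each address is" could mean, the example below ranks addresses with the standard `net/netip` classification methods and prefers the higher-ranked one. The function names `rank`, `preferred`, and `pick`, the specific ranking order, and the tie-break in favor of IPv4 are assumptions for illustration, not the logic used in enode.

```go
package main

import (
	"fmt"
	"net/netip"
)

// rank scores an address by how widely reachable it is likely to be.
// The ordering is an assumption made for this sketch.
func rank(a netip.Addr) int {
	switch {
	case !a.IsValid() || a.IsUnspecified():
		return 0
	case a.IsLoopback():
		return 1
	case a.IsLinkLocalUnicast():
		return 2
	case a.IsPrivate():
		return 3
	default:
		return 4
	}
}

// preferred picks between the IPv4 and IPv6 address of a record.
// Ties go to IPv4 (assumed tie-break).
func preferred(ip4, ip6 netip.Addr) netip.Addr {
	if rank(ip6) > rank(ip4) {
		return ip6
	}
	return ip4
}

// pick is a string-based convenience wrapper; "" means the address is absent.
func pick(ip4, ip6 string) string {
	var a4, a6 netip.Addr
	if ip4 != "" {
		a4 = netip.MustParseAddr(ip4)
	}
	if ip6 != "" {
		a6 = netip.MustParseAddr(ip6)
	}
	return preferred(a4, a6).String()
}

func main() {
	// Private IPv4 next to a non-private IPv6 address.
	fmt.Println(pick("192.168.1.5", "2001:db8::1"))
	// Public IPv4 next to a link-local IPv6 address.
	fmt.Println(pick("8.8.8.8", "fe80::1"))
}
```

Computing this once at node creation, as the description says, means the accessor methods can return the cached result instead of re-reading ENR attributes on every call.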

It seems the semantic differences between addFoundNode and addInboundNode were lost: addFoundNode is for nodes whose availability we are unsure of, whereas addInboundNode is for adding nodes that have contacted the local node and that we can verify are active. handleAddNode seems to be the consolidation of those two methods, yet it bumps the node in the bucket (updating its IP address) even if the node was not inbound. This PR fixes this. It wasn't originally caught in tests like TestTable_addSeenNode because the manipulation of the node object actually modified the node value used by the test.

New logic is added to reject non-inbound updates unless the sequence number of the (signed) ENR increases. Inbound updates, which are published by the updated node itself, are always accepted. If an inbound update changes the endpoint, the node will be revalidated on an expedited schedule.

Co-authored-by: Felix Lange <[email protected]>
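The acceptance rule for record updates can be sketched as follows. The types `record` and `entry` and the method `update` are hypothetical stand-ins for illustration; they do not mirror the actual Table code.

```go
package main

import "fmt"

// record is a hypothetical, simplified view of a signed node record.
type record struct {
	seq      uint64 // ENR sequence number
	endpoint string // IP:port announced by the record
}

// entry is a hypothetical table entry.
type entry struct {
	rec            record
	fastRevalidate bool // set when the node should be re-checked soon
}

// update applies newRec to e and reports whether it was accepted.
// Non-inbound updates need a higher sequence number; inbound updates,
// which come from the node itself, are always accepted.
func (e *entry) update(newRec record, inbound bool) bool {
	if !inbound && newRec.seq <= e.rec.seq {
		return false
	}
	if inbound && newRec.endpoint != e.rec.endpoint {
		e.fastRevalidate = true // endpoint changed: revalidate on the fast schedule
	}
	e.rec = newRec
	return true
}

func main() {
	e := &entry{rec: record{seq: 5, endpoint: "10.0.0.1:30303"}}

	// Hearsay from a FINDNODE response with the same seq is rejected.
	fmt.Println(e.update(record{seq: 5, endpoint: "10.0.0.9:30303"}, false))

	// The node itself contacts us from a new endpoint: accepted.
	fmt.Println(e.update(record{seq: 5, endpoint: "10.0.0.2:30303"}, true), e.fastRevalidate)
}
```

The point of the rule is that a third party cannot overwrite a node's endpoint in the table unless it presents a newer signed record.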

Here we clean up internal uses of type discover.node, converting most code to use enode.Node instead. The discover.node type used to be the canonical representation of network hosts before ENR was introduced. Most code worked with *node to avoid conversions when interacting with Table methods. Since *node also contains internal state of Table and is a mutable type, using *node outside of Table code is prone to data races. It's also cleaner not having to wrap/unwrap *enode.Node all the time.

discover.node has been renamed to tableNode to clarify its purpose.

While here, we also change most uses of net.UDPAddr into netip.AddrPort. While this is technically a separate refactoring from the *node -> *enode.Node change, it is more convenient because *enode.Node handles IP addresses as netip.Addr. The switch to package netip in discovery would've happened very soon anyway.

The change to netip.AddrPort stops at certain interface points. For example, since package p2p/netutil has not been converted to use netip.Addr yet, we still have to convert to net.IP/net.UDPAddr in a few places.
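For readers unfamiliar with the two address types, the conversion at such interface points can be done with the standard library alone. The sketch below is illustrative only: the helper names `toAddrPort`, `toUDPAddr`, and `legacyAddr` are made up for this example, while `(*net.UDPAddr).AddrPort`, `net.UDPAddrFromAddrPort`, and `netip.Addr.Unmap` are standard library calls.

```go
package main

import (
	"fmt"
	"net"
	"net/netip"
)

// toAddrPort converts a legacy *net.UDPAddr into a netip.AddrPort.
// net.IP usually stores IPv4 addresses in 16-byte form, so Unmap is used
// to get a plain IPv4 netip.Addr instead of an IPv4-mapped IPv6 one.
func toAddrPort(a *net.UDPAddr) netip.AddrPort {
	ap := a.AddrPort()
	return netip.AddrPortFrom(ap.Addr().Unmap(), ap.Port())
}

// toUDPAddr converts back for code that still expects *net.UDPAddr.
func toUDPAddr(ap netip.AddrPort) *net.UDPAddr {
	return net.UDPAddrFromAddrPort(ap)
}

// legacyAddr builds a *net.UDPAddr the way older code typically would.
func legacyAddr(ip string, port int) *net.UDPAddr {
	return &net.UDPAddr{IP: net.ParseIP(ip), Port: port}
}

func main() {
	ap := toAddrPort(legacyAddr("10.0.0.1", 30303))
	fmt.Println(ap, ap.Addr().Is4())

	back := toUDPAddr(ap)
	fmt.Println(back.IP, back.Port)
}
```

Unlike *net.UDPAddr, netip.AddrPort is a comparable value type, so it can be copied and used as a map key without aliasing concerns.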

Co-authored-by: Stefan <[email protected]>

fjl and others added 7 commits June 11, 2024 13:53

p2p/enode: fix TCPEndpoint (#29827)

b2c178d

p2p: fix race in dialScheduler (#29235)

642f552

Co-authored-by: Stefan <[email protected]>

pratikspatil024 added the "do not squash and merge" label Jun 13, 2024

pratikspatil024 requested a review from a team June 13, 2024 04:29

cffls approved these changes Jun 13, 2024

manav2401 approved these changes Jun 13, 2024

pratikspatil024 merged commit bd5d54e into develop Jun 13, 2024
11 checks passed

pratikspatil024 deleted the psp-p2p-upstream branch June 13, 2024 07:31
