Delayed RPC Send Using Tokens #5923

Open · wants to merge 66 commits into base: unstable from delayed-rpc-response

Conversation

ackintosh
Member

@ackintosh ackintosh commented Jun 13, 2024

Issue Addressed

closes #5785

Proposed Changes

The diagram below shows the differences in how the receiver (responder) behaves before and after this PR. The following sections detail the changes.

```mermaid
flowchart TD

subgraph "*** After ***"
    Start2([START]) --> AA[Receive request]
    AA --> COND1{Is there already an active request <br> with the same protocol?}
    COND1 --> |Yes| CC[Send error response]
    CC --> End2([END])
    COND1 --> |No| COND2{Request is too large?}
    COND2 --> |Yes| CC
    COND2 --> |No| DD[Process request]
    DD --> EE{Rate limit reached?}
    EE --> |Yes| FF[Wait until tokens are regenerated]
    FF --> EE
    EE --> |No| GG[Send response]
    GG --> End2
end

subgraph "*** Before ***"
    Start([START]) --> A[Receive request]
    A --> B{Rate limit reached <br> or <br> request is too large?}
    B -->|Yes| C[Send error response]
    C --> End([END])
    B -->|No| E[Process request]
    E --> F[Send response]
    F --> End
end
```

Is there already an active request with the same protocol?

This check is not performed in Before. It is taken from a PR to the consensus-specs repository, which proposes updates regarding rate limiting and response timeouts.
https://github.com/ethereum/consensus-specs/pull/3767/files

The requester MUST NOT make more than two concurrent requests with the same ID.

The PR mentions the requester side. In this PR, I introduced the ActiveRequestsLimiter for the responder side, to prevent more than two requests with the same protocol from running simultaneously per peer. If the limiter disallows a request, the responder sends a rate-limited error and penalizes the requester.
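
For illustration, here is a minimal Rust sketch of such a per-peer, per-protocol cap on concurrently running inbound requests. The names (and the `u64` stand-in for the peer ID) are hypothetical, not the PR's actual code:

```rust
use std::collections::HashMap;

// Stand-ins for the real libp2p/lighthouse types, to keep the sketch self-contained.
type PeerId = u64;
type Protocol = &'static str;

struct ActiveRequestsLimiter {
    /// Maximum number of concurrently running requests per (peer, protocol).
    limit: usize,
    /// Count of requests currently being served, keyed by (peer, protocol).
    active: HashMap<(PeerId, Protocol), usize>,
}

impl ActiveRequestsLimiter {
    fn new(limit: usize) -> Self {
        Self { limit, active: HashMap::new() }
    }

    /// Returns true and records the request if the peer is under the limit for
    /// this protocol. Otherwise the responder replies with a rate-limited
    /// error and penalizes the requester.
    fn allows(&mut self, peer: PeerId, protocol: Protocol) -> bool {
        let count = self.active.entry((peer, protocol)).or_insert(0);
        if *count < self.limit {
            *count += 1;
            true
        } else {
            false
        }
    }

    /// Called once the response stream for a request has completed.
    fn request_completed(&mut self, peer: PeerId, protocol: Protocol) {
        if let Some(count) = self.active.get_mut(&(peer, protocol)) {
            *count = count.saturating_sub(1);
            if *count == 0 {
                self.active.remove(&(peer, protocol));
            }
        }
    }
}
```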

Request is too large?

UPDATE: I removed the RequestSizeLimiter and added RPC::is_request_size_too_large() instead, which checks whether the count of requested blocks/blobs is within the limit defined in the specification. That is much simpler.

discussion: #5923 (comment)
commit log: 5a9237f
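
For illustration, this is roughly what the size check amounts to; a sketch assuming the spec constant MAX_REQUEST_BLOCKS (1024 for blocks-by-range requests), not the exact signature of RPC::is_request_size_too_large():

```rust
/// Spec-defined cap on the number of blocks in a single by-range request
/// (per the consensus specs; blob/column requests have analogous caps).
const MAX_REQUEST_BLOCKS: u64 = 1024;

/// Returns true if the request asks for more blocks than the spec allows,
/// in which case the responder sends an error response up front.
fn is_request_size_too_large(requested_count: u64) -> bool {
    requested_count > MAX_REQUEST_BLOCKS
}
```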

This request size check is also performed in Before using the Limiter, but in this PR, I introduced RequestSizeLimiter to handle it. Unlike the Limiter, RequestSizeLimiter is dedicated to performing this check.

The reasons why I introduced RequestSizeLimiter are:

  • In After, the rate limiter is shared between the behaviour and the handler (Arc<Mutex<RateLimiter>>). (This is detailed in the next section.)
  • The request size check does not heavily depend on the rate-limiting logic, so it can be separated with minimal code duplication.
  • The request size check is performed on the behaviour side. By separating the request size check from the rate limiter, we can reduce the locking of the rate limiter.

Rate limit reached? and Wait until tokens are regenerated

UPDATE: I moved the limiter logic to the behaviour side. #5923 (comment)

The rate limiter is shared between the behaviour and the handler (Arc<Mutex<RateLimiter>>). The handler checks the rate limit and queues the response if the limit is reached. The behaviour handles pruning.
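
Roughly, the sharing works as sketched below (illustrative types, not the PR's exact code): both sides hold clones of the same Arc, so the limit holds per peer even when several connections, and thus several handlers, exist for that peer.

```rust
use std::sync::{Arc, Mutex};

// Stand-in for lighthouse's rate limiter, to keep the sketch self-contained.
struct RateLimiter;

impl RateLimiter {
    fn allows(&mut self) -> bool {
        // The real limiter consumes tokens per (peer, protocol).
        true
    }
    fn prune(&mut self) {
        // The real limiter drops idle token buckets here.
    }
}

fn main() {
    // One limiter, shared between the behaviour and every handler.
    let behaviour_limiter = Arc::new(Mutex::new(RateLimiter));
    let handler_limiter = Arc::clone(&behaviour_limiter);

    // Handler side: check the limit before sending a response.
    if handler_limiter.lock().unwrap().allows() {
        // Send the response now.
    } else {
        // Queue the response until tokens are regenerated.
    }

    // Behaviour side: periodically prune idle entries.
    behaviour_limiter.lock().unwrap().prune();
}
```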

I considered not sharing the rate limiter between the behaviour and the handler, and performing all of these either within the behaviour or handler. However, I decided against this for the following reasons:

  • Regarding performing everything within the behaviour: The behaviour is unable to recognize the response protocol when RPC::send_response() is called, especially when the response is RPCCodedResponse::Error. Therefore, the behaviour can't rate limit responses based on the response protocol.
  • Regarding performing everything within the handler: When multiple connections are established with a peer, there could be multiple handlers interacting with that peer. Thus, we cannot enforce rate limiting per peer solely within the handler. (Any ideas? 🤔 )
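
Returning to the "Wait until tokens are regenerated" step above: here is a generic token-bucket sketch that reports how long a queued response must wait. The field names are illustrative, not lighthouse's actual limiter API:

```rust
use std::time::{Duration, Instant};

struct TokenBucket {
    capacity: f64,       // maximum number of tokens the bucket can hold
    tokens: f64,         // tokens currently available
    refill_per_sec: f64, // regeneration rate
    last_refill: Instant,
}

impl TokenBucket {
    fn new(capacity: f64, refill_per_sec: f64) -> Self {
        Self { capacity, tokens: capacity, refill_per_sec, last_refill: Instant::now() }
    }

    /// Add the tokens regenerated since the last call, up to capacity.
    fn refill(&mut self) {
        let elapsed = self.last_refill.elapsed().as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill = Instant::now();
    }

    /// Ok(()) if the response can be sent now; otherwise Err with the time
    /// until enough tokens have regenerated, i.e. how long to delay it.
    fn try_consume(&mut self, cost: f64) -> Result<(), Duration> {
        self.refill();
        if self.tokens >= cost {
            self.tokens -= cost;
            Ok(())
        } else {
            let missing = cost - self.tokens;
            Err(Duration::from_secs_f64(missing / self.refill_per_sec))
        }
    }
}
```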

Additional Info

Naming

I have renamed the fields of the behaviour to make them more intuitive:

  • limiter -> response_limiter
  • self_limiter -> outbound_request_limiter

Testing

I have run a beacon node with these changes for 24 hours, and it appears to work fine.

The rate-limited error no longer occurs while running this branch.


@ackintosh ackintosh added work-in-progress PR is a work-in-progress Networking skip-ci Don't run the `test-suite` labels Jun 13, 2024
@ackintosh ackintosh force-pushed the delayed-rpc-response branch from 29e3f00 to 90361d6 on June 19, 2024 22:04
@ackintosh ackintosh force-pushed the delayed-rpc-response branch from 90361d6 to 7e0c630 on June 19, 2024 22:44
@ackintosh ackintosh removed the skip-ci Don't run the `test-suite` label Jul 1, 2024
@ackintosh ackintosh marked this pull request as ready for review July 14, 2024 23:23
@pawanjay176
Member

pawanjay176 commented Oct 30, 2024

> Even if we set the limiter to be based only on the number of concurrent requests, we still make a concurrent request that breaks the rate limit.

I did not understand this. If we get a request that breaks the concurrent limit, why can't we just wait to send that until the existing streams have concluded?

Right now, if I run this branch on a Kurtosis PeerDAS network, the lighthouse server rightly responds with an error because the lighthouse client is breaking the concurrency limit:

Oct 30 02:23:38.044 DEBG RPC Error                               direction: Outgoing, score: 0, peer_id: 16Uiu2HAmAcK84B5BJz3EiXzLK7SzQsEi7JWNsqL1YcdiAGZDj4aA, client: Lighthouse: version: v5.3.0-450326c, os_version: aarch64-linux, err: RPC response was an error: Rate limited with reason: Rate limited. There is an active request with the same protocol, protocol: data_column_sidecars_by_range, service: libp2p

@ackintosh
Member Author

ackintosh commented Nov 2, 2024

This is the test I used (created in another branch forked from this PR). In the test, the sender maintains two (or fewer) active requests at a time, but self rate-limiting is triggered. The test is quite simple, but I think it shows that we need to limit both concurrent requests and (optionally) tokens. What do you think?


@ackintosh
Member Author

The spec PR has been merged.

@jxs
Member

jxs commented Nov 19, 2024

> The spec PR has been merged.

nice! Is this ready for another review then @ackintosh?

@ackintosh
Member Author

Let me organize the remaining tasks of this PR 🙏 :

  1. Need to limit concurrent requests on client (outbound requests) side, to comply with the spec update.
  2. Whether we also need to limit tokens is under discussion.

Note: Currently, if we run this branch (on a Kurtosis local network), the lighthouse server responds with an error as the client side is exceeding the concurrency limit, resulting in a ban.

I think task 1. needs to be implemented in this PR because of the banning mentioned above. What do you think, @jxs? (Asking because of this comment)

@jxs
Member

jxs commented Nov 22, 2024

> Let me organize the remaining tasks of this PR 🙏 :
>
> 1. Need to limit concurrent requests on client (outbound requests) side, to comply with the spec update.
>
> 2. Whether we also need to limit tokens is under discussion.
>
> Note: Currently, if we run this branch (on a Kurtosis local network), the lighthouse server responds with an error as the client side is exceeding the concurrency limit, resulting in a ban.
>
> I think task 1. needs to be implemented in this PR because of the banning mentioned above. What do you think, @jxs? (Asking because of this comment)

yeah makes sense Akihito, go for it 🚀

@AgeManning AgeManning removed the v6.0.0 New major release for hierarchical state diffs label Nov 24, 2024
@ackintosh ackintosh added work-in-progress PR is a work-in-progress and removed ready-for-review The code is ready for review labels Nov 25, 2024
@ackintosh ackintosh force-pushed the delayed-rpc-response branch from fe458ac to 2d7a679 on November 27, 2024 21:20
@ackintosh ackintosh force-pushed the delayed-rpc-response branch from aa57e8b to b73a336 on November 29, 2024 22:16
@ackintosh
Member Author

https://github.com/sigp/lighthouse/actions/runs/12109786536/job/33759312718?pr=5923

The network-tests failure was not reproduced locally. It might be related to #6646.

@ackintosh
Member Author

@jxs @pawanjay176 @AgeManning

This is ready for another review. 🙏

I have added a concurrency limit to the self-limiter. Now, the self-limiter limits outbound requests based on both the number of concurrent requests and tokens (optional). Whether we also need to limit tokens in the self-limiter is still under discussion. Let me know if you have any ideas.
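
Conceptually, the self-limiter's decision for an outbound request now has two stages, as in this sketch (reusing the TokenBucket sketch from the earlier comment; the names are illustrative, not the PR's exact API):

```rust
enum Decision {
    SendNow,
    /// Delayed until an active request completes or tokens regenerate.
    Queue,
}

struct SelfLimiter {
    active_requests: usize,
    /// Concurrency cap per (peer, protocol), per the spec update.
    max_concurrent: usize,
    /// Optional token-based limiting (`TokenBucket` from the sketch above).
    token_limiter: Option<TokenBucket>,
}

impl SelfLimiter {
    fn check(&mut self) -> Decision {
        // 1. The concurrency limit required by the spec update.
        if self.active_requests >= self.max_concurrent {
            return Decision::Queue;
        }
        // 2. The optional token-based limit.
        if let Some(bucket) = &mut self.token_limiter {
            if bucket.try_consume(1.0).is_err() {
                return Decision::Queue;
            }
        }
        self.active_requests += 1;
        Decision::SendNow
    }
}
```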


(FYI)

I also ran lighthouse (this branch) on the testnet for ~24 hours. During this time, the LH node responded with 21 RateLimited errors due to the number of active requests. These errors appear in the logs like the example below. Note that this is about inbound requests, not the self-limiter (outbound requests).

Dec 09 13:38:56.806 DEBG There is an active request with the same protocol, protocol: beacon_blocks_by_range, request: Blocks by range: Start Slot: 2738468, Count: 64, Step: 1, peer_id: 16Uiu2HAmERvtCC321A2Nu1pH6QmhXgvqALxgCnrp4qxr2qeVWr2P, service: libp2p_rpc, service: libp2p, module: lighthouse_network::rpc:491

@ackintosh ackintosh added ready-for-review The code is ready for review and removed work-in-progress PR is a work-in-progress labels Dec 12, 2024