Implement basic equivocations detection loop #2367
I know - you'll be polishing the error handling in future PRs, just wanted to leave a comment here, because it may be a bit confusing (and I already see that the non-connection error is called "irrecoverable" here). So in relays we have two kinds of errors. Connection errors are errors that are likely network errors or `jsonrpsee` internal errors. If you see a connection error, you need to reconnect the client - no other solution is possible. Other errors (non-connection errors) are assumed to be recoverable - they are caused e.g. by us submitting an obsolete transaction, or by the server rejecting our RPC because it is overloaded.

The `relay_utils::relay_loop` function is supposed to handle the connection errors. So when you're starting a loop using `relay_utils::relay_loop`, your "run_loop" function is expected to return an error when a connection error is met. Then the `RelayLoop` itself reconnects to the failed client and restarts `run_loop`. Other (non-connection) errors are supposed to be handled by the `run_loop` function itself (e.g. by retrying with some exponential backoff or simply sleeping for some time). Right now error handling here is implemented the opposite way - you yield to `RelayLoop` on non-connection errors and do the reconnect yourself.

This isn't a big issue - let's see the future error-handling PR, I just wanted to share that ^^^. Maybe it'll allow you to remove some code here in favor of already existing code.
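To make that split concrete, here is a minimal sketch under stated assumptions: `Client`, `LoopError` and `do_one_tick` are made-up stand-ins, not the actual `relay_utils` API - only the shape of the error handling follows the comment above (retry non-connection errors inside `run_loop` with a backoff; return connection errors so the outer loop can reconnect and restart).

```rust
use std::{thread, time::Duration};

// Hypothetical stand-ins for illustration only; the real relay has its own
// client and error types, and `relay_utils::relay_loop` has a different signature.
struct Client;
impl Client {
    fn reconnect() -> Client {
        Client
    }
}

enum LoopError {
    /// Network / `jsonrpsee`-style failure: the only fix is to reconnect the
    /// client, so this is returned to the outer loop.
    Connection(String),
    /// Everything else (obsolete transaction, overloaded server, ...): assumed
    /// recoverable and retried locally.
    Recoverable(String),
}

// Placeholder for one iteration of the detection loop.
fn do_one_tick(_client: &mut Client) -> Result<(), LoopError> {
    Ok(())
}

/// `run_loop`-style body: non-connection errors are retried here with a backoff,
/// connection errors are propagated so the caller can reconnect and restart us.
fn run_loop(client: &mut Client) -> Result<(), LoopError> {
    let mut backoff = Duration::from_secs(1);
    loop {
        match do_one_tick(client) {
            Ok(()) => {
                backoff = Duration::from_secs(1);
                thread::sleep(Duration::from_secs(6)); // wait for the next tick
            }
            Err(LoopError::Connection(e)) => return Err(LoopError::Connection(e)),
            Err(LoopError::Recoverable(e)) => {
                eprintln!("tick failed: {e}; retrying in {backoff:?}");
                thread::sleep(backoff);
                backoff = (backoff * 2).min(Duration::from_secs(60));
            }
        }
    }
}

/// Reconnect-and-restart wrapper, mimicking what `relay_utils::relay_loop` provides.
fn main() {
    let mut client = Client;
    loop {
        if let Err(LoopError::Connection(e)) = run_loop(&mut client) {
            eprintln!("connection error: {e}; reconnecting");
            client = Client::reconnect();
        }
    }
}
```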
Thanks for the explanations. Personally I thought that any error other than a connection error is irrecoverable. For example:

- we can't generate the key ownership proof, because we're too many blocks ahead;
- the chains were updated and the relayer has old data structures, so it can't encode/decode them;

etc.

I think we should take this into consideration, because otherwise we risk getting stuck in a retrial loop. But it's true that other errors could be recoverable. So I don't know what would be best here. It seems like error handling will be a complex problem for a future PR.

But for the moment I'm changing the strategy to consider every error recoverable: just sleep a bit and, when possible, skip the item that led to the error.
That's true. Normally (in other relay loops) we'll eventually break out of this loop when we read the updated state and realize that something has changed. But until then we'll keep retrying. Which is fine (imo), unless we spam the RPC server with failing RPC requests without some backoff.

In your examples:

- "we can't generate the key ownership proof, because we're too many blocks ahead": then the loop should just log an error and go further (imo);
- "the chains were updated and the relayer has old data structures, so it can't encode/decode them": that's definitely a maintenance flaw - the best we could do here is to have a backoff mechanism.

For instance, your current loop implementation would just exit the process if you encounter any non-connection error. Is that good? E.g. if we can't generate an ownership proof, then we could just keep working with the next justifications. But as I said - in the end it is up to you how to handle issues :)
This could be a loop `while self.from_block_num <= self.until_block_num {}` and do full catch-up checks in a single tick. It doesn't look like that much work to have latency concerns.

Also, I would actually suggest skipping the check of the full (sub)chain and just checking the latest (source highest/best finalized as seen by the target) - only if that has a different hash than the block on the source at the same height do you know it forked somewhere since the last tick, and then you can iterate to find the fork. WDYT?
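A rough sketch of that idea, under assumptions: `SourceClient`, `TargetClient`, `BlockNumber` and `Hash` below are placeholders rather than the relay's actual abstractions, and note that the thread later settles on checking every submitted header instead. The fast path compares only the latest synced source header; the range is only walked when that hash differs.

```rust
/// Placeholder aliases; the real loop uses the chain's own types.
type BlockNumber = u32;
type Hash = [u8; 32];

/// Minimal stand-in for the source chain client.
trait SourceClient {
    fn header_hash_at(&self, number: BlockNumber) -> Hash;
}

/// Minimal stand-in for the target chain client.
trait TargetClient {
    /// Hash of the source header at `number`, as synced to the target chain.
    fn synced_source_hash_at(&self, number: BlockNumber) -> Hash;
    /// Highest source block finalized on the target chain.
    fn best_synced_source_number(&self) -> BlockNumber;
}

/// One catch-up tick: the fast path compares only the latest synced header; the
/// slow path iterates `from_block_num..=best` to find where the fork starts.
fn find_fork_start(
    source: &impl SourceClient,
    target: &impl TargetClient,
    mut from_block_num: BlockNumber,
) -> Option<BlockNumber> {
    let best = target.best_synced_source_number();
    // Fast path: latest synced header matches the source chain, so nothing
    // forked since the last tick.
    if target.synced_source_hash_at(best) == source.header_hash_at(best) {
        return None;
    }
    // Slow path: full catch-up in a single tick, looking for the first mismatch.
    while from_block_num <= best {
        if target.synced_source_hash_at(from_block_num) != source.header_hash_at(from_block_num) {
            return Some(from_block_num);
        }
        from_block_num += 1;
    }
    None
}
```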
👍 Also @serban300 please note that `from_block_num` is initialized to zero, so it'll start from genesis - I think it should be initialized to the best target block at the beginning of the loop?

I think an attacker (or attackers) may be able to submit `{ submit_finality_proof(forked_header#100), submit_message(malicious_message), submit_finality_proof(good_header#101) }` in a single block. So if you only look at the latest, you may skip this `forked_header` submission. So imo we should look at all headers here.
but how would `good_header#101` have the correct hash if it's been built on top of `bad_header#100`? Once the relayer introduces any forked header, the whole subsequent child chain will be a fork, no?
We don't check the connection between headers anywhere, so it could be built on top of another "normal" header. E.g. in the following schema:
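(The original schema isn't preserved in this extract; based on the description below, it is presumably a fork of this shape.)

```
source chain:  ... ── 99 ── 100 ── 101 ── ...   (canonical chain, 101 built on the canonical 100)
                       └─── 100'                (bad fork tracked by relayer1)
```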
relayer1 may track the bad fork and submit `100'` first, and relayer2 may then submit `101`.
oh, ok, I thought (without checking the code) we did. (I now realize we don't check chain continuity because we don't want to import every header, and we're happy with every Nth header as long as it has a GRANDPA justification.)
valid scenario then, let's keep the equivocation check for every submitted header 👍
Sounds good. Done.
Yes, you're right. That was the intention, but somehow I forgot to implement it.
That's true. Also I think it's good to check for equivocations even if synced hash == source hash. There might be cases where we could still find equivocations. And it shouldn't add a big overhead.