fix(lit-core): LIT-4016 - Enhance error handling during epoch changes #710
base: master
- LitCore is now an event emitter
- Events for `disconnected`, `connected` and `error` are emitted
- We no longer call `_stopListeningForNewEpoch()` during `_connect()`
- Concurrent calls to `connect()` are chained automatically
- Error handling / execution flow updated in `_handleStakingContractStateChange()`
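For readers who want to see how a consumer might use the events listed above, here is a minimal sketch. It assumes the emitter behaves like Node's `EventEmitter`. `CoreSketch`, `markConnected()` and `markFailed()` are hypothetical stand-ins rather than the real `LitCore` API, and the order in which `error` and `disconnected` fire is also an assumption. Only the three event names come from this PR.

```typescript
import { EventEmitter } from 'node:events';

// Hypothetical stand-in for LitCore; only the event names come from the PR.
class CoreSketch extends EventEmitter {
  ready = false;

  // Simulates a successful (re)connect.
  markConnected(): void {
    this.ready = true;
    this.emit('connected');
  }

  // Simulates a failure while processing an epoch change.
  // Emitting `error` before `disconnected` is an assumption of this sketch.
  markFailed(cause: Error): void {
    this.ready = false;
    this.emit('error', cause);
    this.emit('disconnected');
  }
}

const core = new CoreSketch();
const seen: string[] = [];

core.on('connected', () => seen.push('connected'));
core.on('disconnected', () => seen.push('disconnected'));
core.on('error', (err: Error) => seen.push(`error: ${err.message}`));

core.markConnected();
core.markFailed(new Error('handshake failed'));
```

One thing to keep in mind with Node's `EventEmitter`: emitting `error` with no listener attached throws, so consumers of an emitter like this would normally register an `error` handler before connecting.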
@@ -492,11 +507,6 @@ export class LitCore {
  }

  private async _connect() {
    // Ensure an ill-timed epoch change event doesn't trigger concurrent config changes while we're already doing that
Removed this call to stop listening -- if the code inside our listener calls `connect()`, it will just chain on any already-pending `connect()` logic (see the `connect()` logic and promise chain on `this._connectingPromise`).
This doesn't make the client self-healing, but it does mean that a failure won't leave the client in a state where it is no longer listening for further epoch change events.
But will it chain the promises since we specifically check that if a pending connection is open then just return it?
We always reset `_connectionPromise` to null when a `connect()` call finishes -- so if we are connecting, multiple calls to `connect()` will all return the same promise. This means that if we're still processing `connect()` when we receive another epoch change event, it will effectively be a no-op.
It occurred to me that we could implement cancellation across this entire callstack, and then cancel the entire existing call to `connect()` (along with any pending `fetch()` calls that it is running), and then run it again from the top. I'd like to implement a much more robust network handling layer based on the `v7` branch and land this as-is for now, since it's an improvement over what we've got.
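The behaviour described in this reply (callers share one in-flight promise, which is cleared once it settles) can be sketched roughly as below. This is an illustration under stated assumptions, not the actual `LitCore` code: `ConnectSketch`, `connectingPromise`, `doConnect()` and `connectRuns` are hypothetical names, and the timer stands in for the real handshake work.

```typescript
// Hypothetical sketch of in-flight promise sharing; not the real LitCore code.
class ConnectSketch {
  private connectingPromise: Promise<void> | null = null;
  connectRuns = 0;

  connect(): Promise<void> {
    // While a connect is in flight, every caller gets the same promise.
    if (this.connectingPromise) {
      return this.connectingPromise;
    }
    this.connectingPromise = this.doConnect().finally(() => {
      // Clear the slot on success or failure so the next call starts fresh.
      this.connectingPromise = null;
    });
    return this.connectingPromise;
  }

  private async doConnect(): Promise<void> {
    this.connectRuns += 1;
    // Stand-in for handshakes and epoch state fetches.
    await new Promise<void>((resolve) => setTimeout(resolve, 10));
  }
}

async function demo(): Promise<{ samePromise: boolean; runs: number }> {
  const client = new ConnectSketch();
  const first = client.connect();
  const second = client.connect(); // e.g. triggered by an epoch change event
  const samePromise = first === second;
  await Promise.all([first, second]);
  await client.connect(); // the earlier attempt has settled, so this runs again
  return { samePromise, runs: client.connectRuns };
}
```

In this sketch the second call made during an in-flight connect does no extra work, which matches the "effectively a no-op" description. A call made after the first attempt settles starts a new attempt, because the stored promise is cleared in `finally` whether the attempt succeeded or failed.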
…ected` event is being emitted as expected - Removed misleading globalThis console.warn -- it was backwards, in that our code doesn't actually override existing entries on `globalThis` -- it actually _skips_ initializing any that already exist
…ating across `this.connectedNodes` to build sessionSigs, but the client is disconnected - the sessionSigs map could be incomplete in this case
Can you please confirm that in the |
Currently the loading of modules in the |
Are we able to successfully & repeatedly reproduce this error @MaximusHaximus ? How has the fix been tested? |
Left a few comments
      'Error while attempting to reconnect to nodes after epoch transition:',
      message
    } else {
      // In case of centralised networks, we don't run `connect()` flow, so we will manually update epochInfo here
Why is a centralized network treated differently? The only difference should be in the attestation but the handshake should be invariant to the network?
My understanding is that for centralized networks, the node list doesn't change, so re-handshaking with all the nodes on epoch change would be redundant; if that's not true, we should definitely fix it...
If we can treat them the same way as decentralized networks, I would say we should, especially since the "centralization" is more of a special case than something really relevant to the network design.
This way we can also test node changes locally (centralized).
@@ -2203,6 +2203,18 @@ export class LitNodeClientNodeJs

    const signatures: SessionSigsMap = {};

    if (!this.ready) {
Okay, `getSessionSigs()` was the only function missing the `this.ready` check. Actually, don't we check this in `signSessionKey()`?
We do check it in `signSessionKey()`, but that method isn't called from `getSessionSigs()` -- it is only called from the `authNeededCallback()` defined in `getPkpSessionSigs()` and in some auth providers -- so I added it here for completeness 👍
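As a rough illustration of the guard discussed here, the sketch below refuses to iterate over the connected nodes when the client is not ready. The class `SessionSketch`, the `NotReadyError` type, the error message and the placeholder signature strings are all assumptions made for the example. Only `ready`, `connectedNodes`, `SessionSigsMap` and the idea of a not-ready error appear in this PR, and the real types may differ.

```typescript
// Hypothetical error type; the SDK's actual NotReady error may look different.
class NotReadyError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'NotReadyError';
  }
}

// Simplified shape for illustration only.
type SessionSigsMap = Record<string, string>;

class SessionSketch {
  ready = false;
  connectedNodes: string[] = [];

  getSessionSigs(): SessionSigsMap {
    // Guard before iterating: a disconnected client may hold a partial
    // node list, which would yield an incomplete signatures map.
    if (!this.ready) {
      throw new NotReadyError('Client is not ready; call connect() first.');
    }

    const signatures: SessionSigsMap = {};
    this.connectedNodes.forEach((nodeUrl) => {
      signatures[nodeUrl] = `sig-for-${nodeUrl}`; // placeholder, not a real signature
    });
    return signatures;
  }
}
```

Failing early like this trades a silent partial result for an explicit error, which callers can handle by waiting for a reconnect and retrying.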
Yes, that's right -- if any handshakes fail, the entire connect() chain rejects immediately |
Our crypto module code checks to see if globalThis. has been set, and if it has, it does nothing at all. I agree it's a bit confusing -- the good news is that in v7+, we no longer keep global state around and there is no |
Unfortunately this is a very corner-case issue that is caused by failures in entirely internal code. It also requires that we trigger epoch changes to actually verify that the fix is working :( With very creative Jest mocks, and long-running Shiva tests, I can write a test case that will reproduce it consistently and verify this fix is always present, but our current local-tests don't really facilitate this degree of testing, and we don't have live epoch-change tests yet :(. I was, however, able to test the fix by adding manual |
LGTM but it is pointing to master with v7 code. This should be updated (and use v7 errors) or point the PR to a v6 branch
BTW, cool feature to make LitCore be an event emitter 🚀
          this._epochState = await this._fetchCurrentEpochState(
            validatorData.epochInfo
          );
          if (state === StakingStates.Active) {
Shouldn't we throw/disconnect when the new state is `Paused`?
Description
Updated error handling and logic flow for handling epoch change events, and signalling errors during epoch change processing to consumers/listeners
- `LitCore` is now an event emitter; events for `disconnected`, `connected` and `error` are emitted
- We no longer call `_stopListeningForNewEpoch()` during `_connect()`; concurrent calls to `connect()` are chained automatically already.
- Error handling / execution flow updated in `_handleStakingContractStateChange()`
- `this.ready` is now set to `false` when epoch change events are not processed correctly
- Throw a `NotReady` error from `getSessionSigs()` if we get to the point we're going to map across `this.connectedNodes`, but the `LitCore` instance is not ready.
Type of change
How Has This Been Tested?
Checklist: