Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(lit-core): LIT-4016 - Enhance error handling during epoch changes #710

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

MaximusHaximus
Copy link
Contributor

@MaximusHaximus MaximusHaximus commented Nov 4, 2024

Description

Updated error handling and logic flow for handling epoch change events, and signalling errors during epoch change processing to consumers/listeners

  • LitCore is now an event emitter
    • Used EventEmitter3 which is extremely widely used, ESM compatible, and written in ES3 for maximum compatibility (!)
  • Events for disconnected, connected and error are emitted
  • We no longer call _stopListeningForNewEpoch() during _connect(); concurrent calls to connect() are chained automatically already.
    • This fixes a corner case where failures to connect during event handling would cause LitCore to stop listening for future epoch changes.
  • Error handling / execution flow updated in _handleStakingContractStateChange()
    • Errors encountered anywhere in the process will emit error/disconnected events appropriately
    • this.ready is now set to false when epoch change events are not processed correctly
  • We now throw an explicit NotReady error from getSessionSigs() if we get to the point we're going to map across this.connectedNodes, but the LitCore instance is not ready.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

How Has This Been Tested?

  • Ran all local-tests

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules

- LitCore is now an event emitter
- Events for `disconnected`, `connected` and `error` are emitted
- We no longer call `_stopListeningForNewEpoch()` during `_connect()` - concurrent calls to `connect()` are chained automatically
- Error handling / execution flow updated in `_handleStakingContractStateChange()`
@@ -492,11 +507,6 @@ export class LitCore {
}

private async _connect() {
// Ensure an ill-timed epoch change event doesn't trigger concurrent config changes while we're already doing that
Copy link
Contributor Author

@MaximusHaximus MaximusHaximus Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this call to stop listening -- if the code inside our listener calls connect(), it will just chain on any already-pending connect() logic (see connect() logic and promise chain on `this._connectingPromise.

This doesn't make the client self-healing, but it does mean that a failure won't leave the client in a state where it is no longer listening for further epoch change events.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But will it chain the promises since we specifically check that if a pending connection is open then just return it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We always re-set _connectionPromise to null when a connect() call finishes -- so if we are connecting, multiple calls to connect() will all return the same promise. This means that if we're still processing connect() when we receive another epoch change event, it will be effectively a no-op.

It occurred to me that we could implement cancellation across this entire callstack, and then cancel the entire existing call to connect() (along with any pending fetch() calls that it is running), and then run it again from the top. I'd like to implement a much more robust network handling layer based on the v7 branch and land this as-is for now, since it's an improvement over what we've got.

…ected` event is being emitted as expected

- Removed misleading globalThis console.warn -- it was backwards, in that our code doesn't actually override existing entries on `globalThis` -- it actually _skips_ initializing any that already exist
…ating across `this.connectedNodes` to build sessionSigs, but the client is disconnected - the sessionSigs map could be incomplete in this case
@MaximusHaximus MaximusHaximus marked this pull request as ready for review November 4, 2024 17:58
@DashKash54
Copy link
Collaborator

Can you please confirm that in the _connect() even if we're unable to handshake to a single node we never set this.ready=true? @MaximusHaximus @Ansonhkg

@DashKash54
Copy link
Collaborator

Currently the loading of modules in the connect() is coupled with the node handshake? it's possible that the handshake fails for any reason so the this.ready=false but we'll still try to reload the WASM modules even though they've been loaded in the gloabalThis? Maybe multiple loads isn't that bad or shall we revisit this?

@DashKash54
Copy link
Collaborator

Are we able to successfully & repeatedly reproduce this error @MaximusHaximus ? How has the fix been tested?

Copy link
Collaborator

@DashKash54 DashKash54 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments

'Error while attempting to reconnect to nodes after epoch transition:',
message
} else {
// In case of centralised networks, we don't run `connect()` flow, so we will manually update epochInfo here
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is a centralized network treated differently? The only difference should be in the attestation but the handshake should be invariant to the network?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that for centralized networks, the node list doesn't change, so re-handshaking with all the nodes on epoch change would be redundant; if that's not true, we should definitely fix it...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can treat them the same way as decentralized I would say we do so, specially considering the "centralization" is more like a special thing instead of something really relevant in the network design

Also this way we can test nodes change in local (centralized)

@@ -492,11 +507,6 @@ export class LitCore {
}

private async _connect() {
// Ensure an ill-timed epoch change event doesn't trigger concurrent config changes while we're already doing that
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But will it chain the promises since we specifically check that if a pending connection is open then just return it?

@@ -2203,6 +2203,18 @@ export class LitNodeClientNodeJs

const signatures: SessionSigsMap = {};

if (!this.ready) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay getSessionSig() was the only function missing this.ready check. Actually we check this in the signSessionKey()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do check it in signSessionKey(), but that method isn't called from getSessionSigs() -- it is only called from the authNeededCallback() defined in getPkpSessionSigs() and in some auth providers -- so I added it here for completeness 👍

@MaximusHaximus
Copy link
Contributor Author

Can you please confirm that in the _connect() even if we're unable to handshake to a single node we never set this.ready=true? @MaximusHaximus @Ansonhkg

Yes, that's right -- if any handshakes fail, the entire connect() chain rejects immediately

@MaximusHaximus
Copy link
Contributor Author

Currently the loading of modules in the connect() is coupled with the node handshake? it's possible that the handshake fails for any reason so the this.ready=false but we'll still try to reload the WASM modules even though they've been loaded in the gloabalThis? Maybe multiple loads isn't that bad or shall we revisit this?

Our crypto module code checks to see if globalThis. has been set, and if it has, it does nothing at all. I agree it's a bit confusing -- the good news is that in v7+, we no longer keep global state around and there is no loadModules() method -- so this is temporary logic, thankfully :)

@MaximusHaximus
Copy link
Contributor Author

Are we able to successfully & repeatedly reproduce this error @MaximusHaximus ? How has the fix been tested?

Unfortunately this is a very corner-case issue that is caused by failures in entirely internal code. It also requires that we trigger epoch changes to actually verify that the fix is working :(

With very creative Jest mocks, and long-running Shiva tests, I can write a test case that will reproduce it consistently and verify this fix is always present, but our current local-tests don't really facilitate this degree of testing, and we don't have live epoch-change tests yet :(. I was, however, able to test the fix by adding manual throw Error() calls into specific spots in the SDK callstack and verifying the SDK followed the right code paths 👍

Copy link
Contributor

@FedericoAmura FedericoAmura left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but it is pointing to master with v7 code. This should be updated (and use v7 errors) or point the PR to a v6 branch

BTW, cool feature to make LitCore be an event emitter 🚀

'Error while attempting to reconnect to nodes after epoch transition:',
message
} else {
// In case of centralised networks, we don't run `connect()` flow, so we will manually update epochInfo here
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can treat them the same way as decentralized I would say we do so, specially considering the "centralization" is more like a special thing instead of something really relevant in the network design

Also this way we can test nodes change in local (centralized)

this._epochState = await this._fetchCurrentEpochState(
validatorData.epochInfo
);
if (state === StakingStates.Active) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we throw/disconnect when the new state is Paused?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants