Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validator set pruning using relayed Validator sets #327

Merged
merged 7 commits into from
May 10, 2021

Conversation

jkilpatr
Copy link
Contributor

Reasoning

This pull request re-introduces pruning for validator sets but in a way that is guaranteed to be secure.

In the past we simply pruned validator sets after a given number of blocks. This solution was simple and would work well the vast majority of the time. But could result in a catastrophic failure of the bridge in some edge cases.

Lets take for example a case where the EVM chain halts for 24 hours for whatever reason. During this time the Cosmos chain continues normally and has a complete validator set change out with no hostile intent on the part of any validator. The former pruning logic would remove the signed validator sets from the history as the slashing window passed and by the time the EVM chain restarted no update signed by the validator set that controlled the chain 24 hours ago would still be in the store.

The bridge and all the funds in it would be lost at this point. Even though the signatures required to transition it to the new validator set where created and submitted, they where just deleted before they could be used.

The only solution to this edge case is to do something we've wanted to avoid for a while. Ferry the ValsetUpdated event from Ethereum to Cosmos as an Oracle claim.

We have tried to avoid this mainly because when it was around early in the development of Gravity developers had a habit of misusing it and generally assuming it's more accurate than is possible considering the constraints imposed by the separate timing of two different chains.

But without some awareness of the past state of the Ethereum chain we would have to wait weeks before we could safely assume that the validator set had been updated on the EVM chain. This was previously accepted as a 'solution' but the failure of testnet1 under mountains of validator set updates is a pretty strong takedown of that design decision.

Secondary Reasoning

In addition to our pruning problem having a reasonable idea of what validator set exists on Ethereum will resolve some audit issues. One of the audit issues Informal raised was that a Cosmos full node operator might not know that the bridge has been highjacked.

Now that we are ferrying the data we can (although this pull request does not implement it) validate that the validator set on Ethereum matches the stored value for that nonce on Cosmos. In the case of a failure the Cosmos chain can panic, or raise whatever warning is required without requiring the user to run any sort of external binary to determine that.

@jkilpatr jkilpatr changed the title Validator set pruning using relayed Validator sets WIP: Validator set pruning using relayed Validator sets Apr 13, 2021
@jkilpatr jkilpatr self-assigned this Apr 13, 2021
@jkilpatr jkilpatr marked this pull request as draft April 13, 2021 16:45
@jkilpatr jkilpatr marked this pull request as ready for review April 15, 2021 19:12
@jkilpatr jkilpatr changed the title WIP: Validator set pruning using relayed Validator sets Validator set pruning using relayed Validator sets Apr 15, 2021
@jackzampolin
Copy link
Member

Like this change. cc @fedekunze @marbar3778 this will need to be in the refactor.

Copy link
Member

@jackzampolin jackzampolin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great change!

@jackzampolin
Copy link
Member

Looks like integration tests are still broken.

Copy link
Member

@tac0turtle tac0turtle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs documentation within the PR.

// Question: what here can be epoched?
slashing(ctx, k)
attestationTally(ctx, k)
cleanupTimedOutBatches(ctx, k)
cleanupTimedOutLogicCalls(ctx, k)
createValsets(ctx, k)
pruneValsets(ctx, k, params)
// TODO: prune claims, attestations when they pass in the handler
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please open a issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -111,6 +111,14 @@ func (a AttestationHandler) Handle(ctx sdk.Context, att types.Attestation, claim

// Add to denom-erc20 mapping
a.keeper.setCosmosOriginatedDenomToERC20(ctx, claim.CosmosDenom, claim.TokenContract)
case *types.MsgValsetUpdatedClaim:
// TODO here we should check the contents of the validator set against
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please open an issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ctx.EventManager().EmitEvent(
sdk.NewEvent(
sdk.EventTypeMessage,
sdk.NewAttribute(sdk.AttributeKeyModule, msg.Type()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The message type is not the module.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is generally the pattern we use in the file. So if this is an issue we should open an issue and change it throughout the fie.

Copy link
Member

@tac0turtle tac0turtle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall looks good. It would be good to capture the design in a markdown file, this will help future teams with onboarding.

@jkilpatr jkilpatr force-pushed the jkilpatr/valset-oracle-v2 branch 3 times, most recently from 419a16c to 13a5606 Compare April 17, 2021 18:55
@jkilpatr jkilpatr mentioned this pull request Apr 27, 2021
14 tasks
@jkilpatr jkilpatr force-pushed the jkilpatr/valset-oracle-v2 branch from 13a5606 to 15af0e1 Compare May 6, 2021 18:28
jkilpatr added 6 commits May 6, 2021 14:28
This patch does somthing that we've tried to avoid for a while. Which is
adding an event nonce to the validator set updated events. So as to make
it possible to relay them into the Cosmos module oracle.

This design change is driven by problems with validator set pruning. In
the worst case validator sets can be produced every single block. But
the worst case on the EVM chain is similarly bad. It could be halted for
several hours or even days.

In order to retain enough validator sets to ensure bridge continuity and
also prune as many validator sets as safely possible we need this
information.
This patch moves the Ethereum event signatures into constants in the
gravity_utils module rather than repeated string values peppered across
the code. Should make it easier to deal with changes there in the
future.
This patch updates the event parsing in the orchestrator for the new
valset update with event nonce.
This patch adds the MsgValsetUpdatedEvent to the cosmos module logic,
this processes and stores the latest validator set as an event.

We've resisted doing this for quite a long time. With concern that
developers would want to use this value in ways it's not safe to use.
But we need it for two purposes that it is safe to use for.

1) warnings about bridge highjacking
2) efficient pruning of validator sets without pruning too many sets
This code provides the most efficient possible validator set pruning
that we can prove is correct.

By pruning the last observed validator we can keep the number of valsets
in the store to a very low value during normal operation and safely
extend the required storage in the case that the bridge becomes halted
for some reason.
Now that we are emitting validator set update events with an event nonce
we have an event that is using event nonce zero. This is a big problem
on the Cosmos side as the logic has all been developed with the
assumption that event nonce 0 is a special unused value.
@jkilpatr jkilpatr force-pushed the jkilpatr/valset-oracle-v2 branch from 15af0e1 to c13d12e Compare May 6, 2021 18:28
@jkilpatr jkilpatr enabled auto-merge (rebase) May 6, 2021 18:28
@jkilpatr
Copy link
Contributor Author

jkilpatr commented May 6, 2021

overall looks good. It would be good to capture the design in a markdown file, this will help future teams with onboarding.

@jtremback is working on the spec here. Design is also outlined in #340

@jkilpatr jkilpatr requested a review from tac0turtle May 6, 2021 18:31
@jkilpatr jkilpatr dismissed tac0turtle’s stale review May 10, 2021 18:12

ok'd with marko offline

@jkilpatr jkilpatr merged commit 006a772 into cosmos:main May 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants