
Bug: Hermes relayer client expiry & channel instability #4310

Closed
aubrika opened this issue May 3, 2024 · 8 comments

@aubrika
Contributor

aubrika commented May 3, 2024

Initial finding of expired Hermes client on Osmosis & associated discussion: https://discord.com/channels/824484045370818580/930154881040404480/1235706087542620233

There are two observed issues here:

  1. The Hermes client expired on the Osmosis testnet for unknown reasons: why did this happen? We seem to have ruled out insufficient funds as a cause.

  2. The channel ID must remain stable over time, so resuming transfer functionality on a new channel ID, as was demonstrated, is not an acceptable mitigation. Once we determine the root cause, how can we ensure channel ID stability?

@aubrika aubrika added A-IBC Area: IBC integration with Penumbra C-bug Category: a bug security Issues or work related to security. ibc-hard-breaking Breaks existing connections, requires manual intervention on counterparty chains labels May 3, 2024
@aubrika aubrika added this to the Sprint 6 milestone May 3, 2024
@github-project-automation github-project-automation bot moved this to Backlog in Penumbra May 3, 2024
@github-actions github-actions bot added the needs-refinement unclear, incomplete, or stub issue that needs work label May 3, 2024
@aubrika aubrika moved this from Backlog to Todo in Penumbra May 6, 2024
@zbuc
Member

zbuc commented May 7, 2024

Putting on hold to work on #4340

@aubrika aubrika added _P-high High priority and removed needs-refinement unclear, incomplete, or stub issue that needs work labels May 8, 2024
@conorsch
Contributor

As a stopgap measure, I've set the Hermes instance relaying between Penumbra and Osmosis to restart every 20m. We think this will trigger the automatic client-refresh logic on service start, thereby keeping the channels open. It's a hack, but let's see if it helps in the meantime, until we have a more durable solution.
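
For reference, a minimal sketch of what that stopgap looks like as a systemd override (the unit name hermes.service is an assumption; adjust to match the actual host):

# Sketch of the restart-every-20m stopgap. RuntimeMaxSec terminates the
# service after 20 minutes of runtime, and Restart=always brings it back up,
# re-running Hermes' client-refresh logic on each start.
sudo systemctl edit hermes.service
# ...then add in the editor:
#   [Service]
#   Restart=always
#   RuntimeMaxSec=20m
sudo systemctl restart hermes.service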

@conorsch
Contributor

conorsch commented May 13, 2024

Unfortunately, despite the frequent restarts, we're still seeing client expiry. Here's log output from a manual tx attempt, as described in our IBC testing dev docs:

2024-05-13T17:01:44.650313Z DEBUG ThreadId(35) send_messages_and_wait_commit{chain=osmo-test-5 tracking_id=ft-transfer}: sending 1 messages as 1 batches to chain osmo-test-5 in parallel
2024-05-13T17:01:44.650368Z DEBUG ThreadId(35) send_messages_and_wait_commit{chain=osmo-test-5 tracking_id=ft-transfer}:send_tx_with_account_sequence_retry{chain=osmo-test-5 account.sequence=4}: max fee, for use in tx simulation: Fee { amount: "600000uosmo", gas_limit: 4000000 }
2024-05-13T17:01:45.368568Z ERROR ThreadId(35) send_messages_and_wait_commit{chain=osmo-test-5 tracking_id=ft-transfer}:send_tx_with_account_sequence_retry{chain=osmo-test-5 account.sequence=4}:estimate_gas: failed to simulate tx. propagating error to caller: gRPC call `send_tx_simulate` failed with status: status: Unknown, message: "failed to execute message; message index: 0: cannot send packet using client (07-tendermint-3497) with status Expired: client state is not active [cosmos/ibc-go/[email protected]/modules/core/04-channel/keeper/packet.go:75] With gas wanted: '300000000' and gas used: '87721' ", details: [], metadata: MetadataMap { headers: {"server": "nginx", "date": "Mon, 13 May 2024 17:01:45 GMT", "content-type": "application/grpc", "x-cosmos-block-height": "7500523"} }
2024-05-13T17:01:45.368623Z ERROR ThreadId(35) send_messages_and_wait_commit{chain=osmo-test-5 tracking_id=ft-transfer}:send_tx_with_account_sequence_retry{chain=osmo-test-5 account.sequence=4}: gas estimation failed or encountered another unrecoverable error error=gRPC call `send_tx_simulate` failed with status: status: Unknown, message: "failed to execute message; message index: 0: cannot send packet using client (07-tendermint-3497) with status Expired: client state is not active [cosmos/ibc-go/[email protected]/modules/core/04-channel/keeper/packet.go:75] With gas wanted: '300000000' and gas used: '87721' ", details: [], metadata: MetadataMap { headers: {"server": "nginx", "date": "Mon, 13 May 2024 17:01:45 GMT", "content-type": "application/grpc", "x-cosmos-block-height": "7500523"} }
ERROR transfer error: failed while submitting the Transfer message to chain osmo-test-5: gRPC call `send_tx_simulate` failed with status: status: Unknown, message: "failed to execute message; message index: 0: cannot send packet using client (07-tendermint-3497) with status Expired: client state is not active [cosmos/ibc-go/[email protected]/modules/core/04-channel/keeper/packet.go:75] With gas wanted: '300000000' and gas used: '87721' ", details: [], metadata: MetadataMap { headers: {"server": "nginx", "date": "Mon, 13 May 2024 17:01:45 GMT", "content-type": "application/grpc", "x-cosmos-block-height": "7500523"} }
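
The manual attempt above was roughly the following Hermes CLI invocation (a sketch: the channel ID and amount here are placeholders, not values from the logs):

# Manual transfer attempt; channel ID and amount are placeholders.
hermes tx ft-transfer \
  --src-chain osmo-test-5 \
  --dst-chain penumbra-testnet-deimos-8 \
  --src-port transfer \
  --src-channel channel-XXXX \
  --denom uosmo \
  --amount 100 \
  --timeout-seconds 120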

@zbuc
Member

zbuc commented May 13, 2024

The underlying issue seems to be a mismatch between client trusting periods.

Here is the configured trusting period for the Penumbra client on the Osmosis side: https://www.mintscan.io/osmosis-testnet/tx/4FD74E36AB8AE73904AA16A17237A83E127D1AA3EF7F41BDCD7CCD4826A55895?height=7450680

And the configured trusting period for Osmosis on the Penumbra side:

$ cargo run --release --bin pcli -- q ibc clients
...
    "trustingPeriod": {
      "seconds": "288000",
      "nanos": 0
    },
    "unbondingPeriod": {
      "seconds": "432000",
      "nanos": 0
    },
...

In order for the client refreshes to work properly, these values should closely match. We should be able to bring them into line by adjusting the unbonding_delay staking parameter on the Penumbra side:

432000 seconds / 5 seconds per block = 86400 blocks

Currently this is configured as 2158 blocks, which corresponds to the 5 * 2158 = 10790 seconds we see in the mintscan link above.

https://github.com/penumbra-zone/penumbra/blob/main/crates/core/component/stake/src/params.rs#L72
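
A quick sanity check on the arithmetic (plain shell, assuming the ~5s block time used above):

# Sanity-checking the block math, assuming ~5s blocks:
echo $(( 432000 / 5 ))   # target unbonding_delay in blocks: 86400
echo $(( 5 * 2158 ))     # current setting in seconds: 10790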

However, in a multi-chain IBC context, I am not sure how this works. Does every chain align on similar unbonding periods?

@conorsch
Contributor

We remain confident that it's the Penumbra client on the Osmosis chain that's expiring. See relevant log messages:

May 13 08:59:53 hermes hermes[2385594]: 2024-05-13T08:59:53.515511Z  INFO ThreadId(20) spawn:chain{chain=penumbra-testnet-deimos-8}:client{client=07-tendermint-0}:connection{connection=connection-0}:channel{channel=channel-0}:worker.client.refresh{client=07-tendermint-0 src_chain=osmo-test-5 dst_chain=penumbra-testnet-deimos-8}:foreign_client.refresh{client=osmo-test-5->penumbra-testnet-deimos-8:07-tendermint-0}: client needs to be refreshed elapsed=97162.760300615s refresh_period=96000s
May 13 09:00:10 hermes hermes[2385594]: 2024-05-13T09:00:10.353216Z  INFO ThreadId(20) spawn:chain{chain=penumbra-testnet-deimos-8}:client{client=07-tendermint-0}:connection{connection=connection-0}:channel{channel=channel-0}:worker.client.refresh{client=07-tendermint-0 src_chain=osmo-test-5 dst_chain=penumbra-testnet-deimos-8}:foreign_client.refresh{client=osmo-test-5->penumbra-testnet-deimos-8:07-tendermint-0}: client needs to be refreshed elapsed=97850.27514765s refresh_period=96000s
May 13 09:01:17 hermes hermes[2385594]: 2024-05-13T09:01:17.600829Z  INFO ThreadId(20) spawn:chain{chain=penumbra-testnet-deimos-8}:client{client=07-tendermint-0}:connection{connection=connection-0}:channel{channel=channel-0}:worker.client.refresh{client=07-tendermint-0 src_chain=osmo-test-5 dst_chain=penumbra-testnet-deimos-8}:foreign_client.refresh{client=osmo-test-5->penumbra-testnet-deimos-8:07-tendermint-0}: client needs to be refreshed elapsed=117812.022070053s refresh_period=96000s
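
A convenient way to watch for this on the relayer host is to follow the refresh worker's log lines (the unit name hermes.service is an assumption):

# Follow client-refresh activity; lines where elapsed exceeds refresh_period
# indicate refreshes are falling behind.
journalctl -u hermes.service -f | grep 'worker.client.refresh'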

@conorsch
Contributor

The RPC endpoint we're using for Osmosis has disabled tx indexing:

May 13 16:10:17 hermes hermes[2389916]: 2024-05-13T16:10:17.004508Z  WARN ThreadId(01) health_check{chain=osmo-test-5}: chain is not healthy: transaction indexing for chain 'osmo-test-5' is disabled (`node_info.other.tx_index` is off)

Or, actually, it's only sometimes enabled:

❯ curl -s https://rpc.testnet.osmosis.zone/status | jq '.result.node_info.id,.result.node_info.other.tx_index'
"c9c0ebc613b20ec8ac5844e8c8430d449b5559b4"
"off"

❯ curl -s https://rpc.testnet.osmosis.zone/status | jq '.result.node_info.id,.result.node_info.other.tx_index'
"aa3503317a63e9310e8d306b4592878f3eb85744"
"on"

Looks like the endpoint is load-balanced, and the backend nodes don't share the same config. This means that Hermes sometimes fails when it tries to look up Osmosis events. Let's find a node RPC URL that gives us consistent results. A repeated probe makes the inconsistency easy to see, as sketched below.
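
# Probe the load-balanced endpoint repeatedly and tally which backend node IDs
# report tx indexing enabled:
for i in $(seq 1 10); do
  curl -s https://rpc.testnet.osmosis.zone/status \
    | jq -r '[.result.node_info.id, .result.node_info.other.tx_index] | @tsv'
done | sort | uniq -c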

@conorsch
Contributor

For now we've switched to using the Polkachu RPC endpoints, which seem to be much more reliable. We've made no Hermes code changes. With the new RPC endpoints, we confirmed working transfers via Osmosis.

Let's check again tomorrow, or any time more than 3h from now, to confirm that the trusting-period issues are resolved and that channel-3 on the Penumbra side remains valid. If so, we should update the chain-registry before closing out.
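
To verify the new endpoint behaves consistently, the same probe can be pointed at it (the URL here follows Polkachu's usual naming scheme and is an assumption; substitute whatever endpoint is actually configured):

# Same probe against the new endpoint; the URL is an assumption.
for i in 1 2 3; do
  curl -s https://osmosis-testnet-rpc.polkachu.com/status \
    | jq -r '.result.node_info.other.tx_index'
done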

@conorsch
Contributor

After migrating to the Polkachu RPC endpoints, the latest Osmosis <-> Penumbra channel remains stable: I was able to transfer both in and out this morning, well past the end of the initial trusting-period window, showing that client updates are working again as expected.

As an immediate follow-up, we should update the prax-registry with the latest Osmosis channel. Medium-term, we should revisit the RPC endpoints we're using: either switch to a non-load-balanced first-party instance, or run our own Osmosis testnet node. The latter entails an operational commitment that could be high-touch, but I don't know that for sure.

The initial problem motivating this issue is now resolved, thanks to deep debugging by @zbuc and @avahowell, so I'm closing the issue.
