Bug: Hermes relayer client expiry & channel instability #4310
Putting on hold to work on #4340.
As a stopgap measure, I've set the Hermes instance relaying between Penumbra and Osmosis to restart every 20 minutes. We think this will trigger the automatic client-refresh logic on service start, thereby keeping the channels open. It's a hack, but let's see if it helps in the meantime, until we have a more durable solution.
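For reference, the stopgap amounts to something like the loop below. This is only a sketch: the actual deployment may use a service manager rather than a shell loop, and nothing here besides the 20-minute interval is taken from the real setup.
# Sketch of the stopgap: run Hermes, kill it after 20 minutes, then start it
# again so the client-refresh logic that runs at startup fires regularly.
while true; do
  timeout 20m hermes start   # `hermes start` runs the relayer until killed
  sleep 5                    # brief pause before restarting
done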
Unfortunately, despite the frequent restarts, we're still seeing client expiry. It showed up in the log output from a manual tx attempt, performed as described in our IBC testing dev docs.
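The dev-docs procedure isn't reproduced here; as a rough illustration, a manual transfer attempt driven from the Hermes side looks something like the command below. The chain ids, channel, and amount are placeholders, not our actual testnet values.
# Illustrative only: submits a single ICS-20 transfer over the given channel.
hermes tx ft-transfer \
  --src-chain penumbra-testnet \
  --dst-chain osmo-test-5 \
  --src-port transfer \
  --src-channel channel-0 \
  --amount 100 \
  --timeout-height-offset 1000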
The underlying issue seems to be a mismatch between client trusting periods. Here is the configured trusting period for the Penumbra client on the Osmosis side: https://www.mintscan.io/osmosis-testnet/tx/4FD74E36AB8AE73904AA16A17237A83E127D1AA3EF7F41BDCD7CCD4826A55895?height=7450680
And the configured trusting period for the Osmosis client on the Penumbra side:
$ cargo run --release --bin pcli -- q ibc clients
...
"trustingPeriod": {
"seconds": "288000",
"nanos": 0
},
"unbondingPeriod": {
"seconds": "432000",
"nanos": 0
},
...
In order for the client refreshes to work properly, these values should closely match. We should be able to do this by adjusting the value currently configured at https://github.com/penumbra-zone/penumbra/blob/main/crates/core/component/stake/src/params.rs#L72. However, in a multi-chain IBC context, I am not sure how this works: does every chain align on similar unbonding periods?
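For context, the usual Tendermint light-client guidance is that the trusting period sits at or below two thirds of the unbonding period, and on the relayer side Hermes takes a per-chain trusting_period setting in its config.toml. The values quoted above sit exactly at that ratio, so the mismatch is between the two clients rather than within either one. A quick arithmetic check on the numbers above:
# 2/3 of the 432000 s (5 day) unbonding period:
echo $(( 432000 * 2 / 3 ))   # 288000 s, matching the trusting period shown above
echo $(( 288000 / 3600 ))    # 80 hours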
We remain confident that it's the Penumbra client on the Osmosis chain that's expiring, based on the relevant log messages.
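If the Hermes build in use has the client-status query (present in recent 1.x releases), this can be confirmed directly from the Osmosis side; the client id below is a placeholder, not the actual one on the testnet.
# Reports Active / Expired / Frozen for the given client on the host chain.
hermes query client status --chain osmo-test-5 --client 07-tendermint-0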
The RPC endpoint we're using for Osmosis has tx indexing disabled. Or, actually, it's only sometimes disabled.
Looks like there's load-balancing going on, and the backend nodes do not all have the same config. This means that sometimes when Hermes tries to look up Osmosis events, it fails to do so. Let's figure out a node RPC URL that gives us consistent results.
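A quick way to see the flapping from outside Hermes is to repeat the same indexed query against the endpoint and watch the answers disagree. The URL below is a placeholder for whatever load-balanced endpoint is in use.
# Repeat an identical /tx_search query; behind a load balancer with mixed
# configs, some responses succeed and others fail with an indexing-disabled error.
for i in 1 2 3 4 5; do
  curl -s 'https://osmosis-testnet-rpc.example.com/tx_search?query="tx.height>0"&per_page=1' | head -c 120
  echo
done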
For now we've switched to using the Polkachu RPC endpoints, which seem to be much more reliable. We've not made any Hermes code changes. With the new RPC endpoints, we confirmed working transfers via Osmosis.
Let's check again tomorrow, or any time more than 3h from now, to confirm that the trusting period issues are resolved and the channel remains stable.
After migrating to the Polkachu RPC endpoints, the latest Osmosis <-> Penumbra channel remains stable: I was able to transfer both in and out this morning, well past the lapsing of the initial trusting period window, showing that client updates are working again, as expected.
As an immediate follow-up, we should update the prax-registry with the latest Osmosis channel. Medium-term we should revisit the RPC endpoints we're using: either switching to a non-LB'd first-party instance, or else running our own Osmosis testnet node. The latter entails a commitment that could be high-touch, but I don't know that for sure.
The initial problem motivating this issue is now resolved, thanks to deep debugging by @zbuc and @avahowell, so I'm closing the issue.
Initial finding of expired Hermes client on Osmosis & associated discussion: https://discord.com/channels/824484045370818580/930154881040404480/1235706087542620233
There are two observed issues here:
1. The Hermes client expired on the Osmosis testnet for unknown reasons. Why did this happen? We seem to have ruled out insufficient funds as a cause.
2. The channel id must remain stable over time, so resuming transfer functionality with a new channel id as demonstrated is not an acceptable mitigation. Once we determine the root cause, how can we ensure channel ID stability?