Reapply 8644 on 9260 #9313
base: yy-beat-itest-optimize
Conversation
Also updated the logging. This new state will be used in the following commit.
This prepares the following commit, where we let the fee bumper decide whether to broadcast immediately or not.
This commit changes how inputs are handled upon receiving a bump result. Previously the inputs were taken from `BumpResult.Tx`; they are now handled locally, as we remember the input set when sending the bump request and handle that input set when a result is received.
This commit adds a new method `handleInitialBroadcast` to handle the initial broadcast. Previously we'd broadcast immediately inside `Broadcast`, which will no longer work once `blockbeat` is implemented, as publishing will then always be triggered by a new block. Meanwhile, we keep the option to bypass the block trigger so users can broadcast immediately by setting `Immediate` to true.
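A minimal sketch of that dispatch, with illustrative stand-in types; only `handleInitialBroadcast` and the `Immediate` flag are named in the commit:

```go
package main

import "fmt"

// BumpRequest stands in for the sweeper's request type; only the
// Immediate flag comes from the commit text.
type BumpRequest struct {
	Immediate bool
}

type txPublisher struct{}

func (t *txPublisher) broadcast(requestID uint64) {
	fmt.Println("publishing tx for request", requestID)
}

// handleInitialBroadcast publishes right away only when the caller set
// Immediate; otherwise the record is left to be picked up by the next
// block trigger.
func (t *txPublisher) handleInitialBroadcast(req *BumpRequest,
	requestID uint64) {

	if req.Immediate {
		t.broadcast(requestID)
		return
	}

	// Nothing to do here: the monitored record will be handled on
	// the next blockbeat.
}

func main() {
	t := &txPublisher{}
	t.handleInitialBroadcast(&BumpRequest{Immediate: true}, 1)
}
```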
Previously, in `markInputFailed`, we'd remove all inputs under the same group via `removeExclusiveGroup`. This is wrong: when the current sweep fails for this input, it shouldn't affect other inputs.
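A hedged sketch of the fix; only `markInputFailed` and `removeExclusiveGroup` are named in the commit, the types are simplified stand-ins:

```go
package main

// SweepState and SweeperInput are simplified stand-ins for the
// sweeper's real types.
type SweepState int

const Failed SweepState = 1

type SweeperInput struct {
	state          SweepState
	exclusiveGroup *uint64
}

// markInputFailed now only marks this one input as failed. The old
// code also removed every input in the same exclusive group here,
// which wrongly punished siblings for this input's failure:
//
//	s.removeExclusiveGroup(*pi.exclusiveGroup) // removed
func markInputFailed(pi *SweeperInput) {
	pi.state = Failed
}

func main() {
	markInputFailed(&SweeperInput{})
}
```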
Also updated `handlePendingSweepsReq` to skip immature inputs so the returned results are the same as those in pre-0.18.0.
In combination with the following commit, this gives us more granular control over the bump result when handling it in the sweeper.
After the previous commit, it should be clear that the tx may have failed to be created in a `TxFailed` event. We now make sure to catch this case to avoid a panic.
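A sketch of the guard, assuming a simplified `BumpResult`; the nil check is the point, the handling body is illustrative:

```go
package main

import (
	"fmt"

	"github.com/btcsuite/btcd/wire"
)

// BumpResult is a simplified stand-in for the fee bumper's result.
type BumpResult struct {
	Tx  *wire.MsgTx
	Err error
}

// handleTxFailed only dereferences the tx after checking it exists,
// since in a TxFailed event the tx may have failed to be created at
// all.
func handleTxFailed(r *BumpResult) {
	if r.Tx == nil {
		fmt.Printf("fee bump failed before tx creation: %v\n", r.Err)
		return
	}

	fmt.Printf("fee bump failed for tx=%v: %v\n", r.Tx.TxHash(), r.Err)
}

func main() {
	handleTxFailed(&BumpResult{Err: fmt.Errorf("insufficient fees")})
}
```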
This commit inits the package `chainio` and defines the interfaces `Blockbeat` and `Consumer`. `Consumer` must be implemented by other subsystems if they require block epoch subscriptions.
In this commit, a minimal implementation of `Blockbeat` is added to synchronize block heights, which will be used in `ChainArb`, `Sweeper`, and `TxPublisher` so blocks are processed sequentially among them.
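A sketch of the two interfaces, with method sets assumed for illustration (the real definitions live in lnd's `chainio` package):

```go
package chainio

// Blockbeat carries the information of a single new block to its
// consumers.
type Blockbeat interface {
	// Height returns the block height this beat refers to.
	Height() int32
}

// Consumer must be implemented by any subsystem, such as ChainArb,
// Sweeper or TxPublisher, that needs block epoch notifications via the
// blockbeat dispatcher.
type Consumer interface {
	// Name returns a human-readable name used in logs.
	Name() string

	// ProcessBlock handles a new beat and returns once the consumer
	// is done with it, which keeps block handling sequential.
	ProcessBlock(beat Blockbeat) error
}
```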
This commit adds two methods to handle dispatching beats. These are exported methods so other systems can send beats to their managed subinstances.
This commit adds a blockbeat dispatcher which handles sending new blocks to all subscribed consumers.
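Continuing the sketch above, a dispatcher that fans a beat out to its consumers one at a time; `BlockbeatDispatcher` is named in the commits, the fields and method body are illustrative:

```go
package chainio

import "fmt"

// BlockbeatDispatcher keeps the registered consumers and sends each
// new beat to them.
type BlockbeatDispatcher struct {
	consumers []Consumer
}

// dispatchBeat delivers the beat to every consumer in order, stopping
// at the first failure, so a block is processed sequentially among
// ChainArb, Sweeper and TxPublisher.
func (d *BlockbeatDispatcher) dispatchBeat(beat Blockbeat) error {
	for _, c := range d.consumers {
		if err := c.ProcessBlock(beat); err != nil {
			return fmt.Errorf("%s failed on block %d: %w",
				c.Name(), beat.Height(), err)
		}
	}

	return nil
}
```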
This commit implements `Consumer` on `TxPublisher`, `UtxoSweeper`, `ChainArbitrator` and `ChannelArbitrator`.
This commit removes the independent block subscriptions in `UtxoSweeper` and `TxPublisher`. These subsystems now listen to the `BlockbeatChan` for new blocks.
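On the consumer side, the old subscription loop becomes a select on `BlockbeatChan`; a sketch with an assumed quit channel and handler name:

```go
package chainio

// sweeperSketch stands in for UtxoSweeper or TxPublisher.
type sweeperSketch struct {
	// BlockbeatChan receives beats from the dispatcher.
	BlockbeatChan chan Blockbeat

	quit chan struct{}
}

// handleBlockbeat would process all pending work at the new height.
func (s *sweeperSketch) handleBlockbeat(beat Blockbeat) {}

// collector replaces the subsystem's independent block subscription.
func (s *sweeperSketch) collector() {
	for {
		select {
		case beat := <-s.BlockbeatChan:
			s.handleBlockbeat(beat)

		case <-s.quit:
			return
		}
	}
}
```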
This commit removes the hack introduced in lightningnetwork#4851. Previously we had this issue because the chain notifier was stopped before the sweeper, which was changed a while back and we now always stop the chain notifier last. In addition, since we no longer subscribe to the block epoch chan directly, this issue can no longer happen.
The sweeper can handle the waiting, so there's no need to wait for blocks inside the resolvers. Offering the inputs prior to their mature heights also guarantees that inputs with the same deadline are aggregated.
This commit removes the block subscriptions used in `ChainArbitrator` and replaces them with the blockbeat managed by `BlockbeatDispatcher`.
This commit removes the block subscriptions used in `ChannelArbitrator`, replacing them with the blockbeat managed by `BlockbeatDispatcher`.
This `immediate` flag was added as a hack so that during a restart, the pending resolvers would offer their inputs to the sweeper and ask it to sweep them immediately. This is no longer needed thanks to `blockbeat`: during restart, a block is now always sent to all subsystems via the flow `ChainArb` -> `ChannelArb` -> resolvers -> sweeper. Thus, when there are pending inputs offered, they will be processed by the sweeper immediately.
To avoid calling GetBestBlock again.
This is needed so the consumers have an initial state about the current block.
In this commit we start to break up the starting process into smaller pieces, which is needed in the following commit to initialize blockbeat consumers.
Refactor the `Start` method to fix the linter error:

```
contractcourt/chain_arbitrator.go:568: Function 'Start' is too long (242 > 200) (funlen)
```
Looks like a postgres itest failed. Looking into it. Doesn't appear to be related to the btcwallet deadlock fixed recently, so that's good!
It looks like the shutdown came too early for the sweeper. This seems to happen rarely enough that I can't reproduce it locally.
Looks really good! I think we just need some comments and perhaps some direct coverage for the `batch` package, then it's good to go!
```go
// sqldb retry and still re-execute the
// failing request individually.
dbErr := sqldb.MapSQLError(err)
if !sqldb.IsSerializationError(dbErr) {
```
nit: It'd be nice to cover `batch` with a simple unit test to make sure the serialization errors are correctly handled and we don't regress later.
Makefile (outdated):
```makefile
# each can run concurrently. Note that many of the settings here are
# specifically for integration testing and are not fit for running
# production nodes.
docker run --name lnd-postgres -e POSTGRES_PASSWORD=postgres \
	-p 6432:5432 -d postgres:13-alpine -N 1500 \
	-c max_pred_locks_per_transaction=1024 \
	-c max_locks_per_transaction=128 \
	-c jit=off -c work_mem=8MB \
	-c checkpoint_timeout=10min -c enable_seqscan=off
```
Do the settings `work_mem=8MB` and `jit=off` add to the test case stability? If yes, could you please add some comments on why these were changed (along with the other params)?
I got those settings from @djkazic's suggestions. I think it's likely they're unnecessary for the itests because the databases they're working with are pretty small. Will try running without them to see how it goes, and add comments for the rest.
Yeah, for context those are the settings I'm using in my production postgres-backed LND. DB size is ~5GB. `jit=off` should help in all scenarios, as JIT'ing mostly benefits long-running queries, whereas postgres_kvdb tends to spawn many small, fast-executing queries. `work_mem=8MB`, on the other hand, benefits larger DBs, as it gives postgres more breathing room for each individual query to do sorts etc.
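Folding that rationale back into the Makefile target might look like this; the comments are an illustrative sketch, not taken from the PR:

```makefile
# Postgres tuned for itest parallelism, not production:
#   -N 1500: allow many concurrent connections from parallel itests.
#   max_pred_locks_per_transaction / max_locks_per_transaction: raised
#     so predicate locks from parallel serializable transactions don't
#     exhaust the defaults.
#   jit=off: JIT mostly helps long-running queries; postgres_kvdb
#     issues many small, fast ones.
#   work_mem=8MB: more per-query memory for sorts; mainly helps larger
#     DBs.
docker run --name lnd-postgres -e POSTGRES_PASSWORD=postgres \
	-p 6432:5432 -d postgres:13-alpine -N 1500 \
	-c max_pred_locks_per_transaction=1024 \
	-c max_locks_per_transaction=128 \
	-c jit=off -c work_mem=8MB \
	-c checkpoint_timeout=10min -c enable_seqscan=off
```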
Force-pushed from 9452cf8 to a94aa1c.
Updated to address comments, add release notes, and do a CI run. I'm still running this locally to see if the new DB settings give me any trouble.
Got a postgres unit test failure. Looking into it... The error is here. It looks similar to errors (chain notification-related) that happened on both the postgres and sqlite unit test runs on #9260, which doesn't have parallel DB transactions re-enabled. I believe this is actually related to a flake in chainntnfs where a certain series of events makes the notifier attempt to send one more event than the (unread) channel has room for. The easiest fix is to add 2 to the channel allocation here, which has eliminated similar errors from all of our (Lightspark) CI runs in our private fork. This is in lieu of actually tracking down the flake further and fixing it "properly." I'll push it as soon as this CI run is complete. The updated DB settings seem to be working OK locally on my machine as well.
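A sketch of the kind of fix described, assuming a test helper that knows how many block epoch events it expects; the names are illustrative:

```go
// newEpochChan pads the buffer by 2 so a racing notifier that emits an
// extra event or two never blocks on an unread channel.
func newEpochChan(numExpectedEvents int) chan int32 {
	return make(chan int32, numExpectedEvents+2)
}
```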
Thanks a lot for digging into the unit test flake! That certainly was an easy fix, and it was originally used in #9258. Then I realized there's another place we need the dedup check, which was fixed in 4632044. Since this PR is based on #9260 and that PR has this commit in its upstream, I'd expect it to be fixed properly there, but it seems not? The other related fix is #9309, which fixed another flake. Moving forward, I think we can focus on getting the postgres-related changes merged, then open new PRs to fix the unit test flakes.
I'm off until Monday but will remove the latest commit then. Thanks!
This reverts commit 67419a7.
To make this itest work reliably with multiple parallel SQL transactions, we need to count both the settle and final HTLC events. Otherwise, sometimes the final events from earlier forwards are counted before the forward events from later forwards, causing a miscount of the settle events. If we expect both the settle and final event for each forward, we don't miscount.
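A sketch of the counting change, with illustrative event tags standing in for the HTLC event types:

```go
// waitForSettles counts both the settle and the final event for each
// forward, so out-of-order delivery between an earlier forward's final
// event and a later forward's events can no longer skew the count.
func waitForSettles(events <-chan string, numForwards int) {
	settles, finals := 0, 0
	for settles < numForwards || finals < numForwards {
		switch <-events {
		case "settle":
			settles++
		case "final":
			finals++
		}
	}
}
```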
Force-pushed from a94aa1c to fb50cec.
I tried to take off the latest commit and force-push, but GitHub got stuck.
The only test that failed looks unrelated to the DB (protofsm unit-race), woohoo! We did get some coverage reductions, but the workflow isn't tracking coverage for itests that use the postgres backend. Maybe we should turn that on?
Force-pushed from cd62b3a to ad8c68a.
Rebase of #9242 on #9260. Now includes btcsuite/btcwallet#967.