
Reapply 8644 on 9260 #9313

Open · wants to merge 155 commits into base: yy-beat-itest-optimize
Conversation

@aakselrod (Contributor) commented Nov 27, 2024

Rebase of #9242 on #9260. Now includes btcsuite/btcwallet#967.

yyforyongyu and others added 30 commits November 25, 2024 13:49
Also updated the logging. This new state will be used in the following
commit.
This prepares for the following commit, where we let the fee bumper
decide whether to broadcast immediately or not.
This commit changes how inputs are handled upon receiving a bump result.
Previously the inputs were taken from `BumpResult.Tx`; this is now
handled locally instead, as we remember the input set when sending the
bump request and handle that input set when a result is received.
This commit adds a new method `handleInitialBroadcast` to handle the
initial broadcast. Previously we'd broadcast immediately inside
`Broadcast`, which will no longer work once `blockbeat` is implemented,
as the publish action is then always triggered by a new block. Meanwhile,
we keep the option to bypass the block trigger so users can broadcast
immediately by setting `Immediate` to true.
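As a rough illustration of the behavior this commit describes (not lnd's actual code; the `sweepRequest` and `publisher` names are invented for the sketch), the broadcast decision might look like:

```go
package sweep

// sweepRequest is an illustrative stand-in for the sweeper's real
// request type; only the Immediate flag matters here.
type sweepRequest struct {
	// Immediate bypasses the block trigger so the tx is published
	// right away.
	Immediate bool
}

// publisher models the part of the sweeper that decides when to
// broadcast.
type publisher struct {
	pending []sweepRequest
}

// handleInitialBroadcast publishes immediately only when asked;
// otherwise the request waits for the next block.
func (p *publisher) handleInitialBroadcast(req sweepRequest) {
	if req.Immediate {
		p.publish(req)
		return
	}
	p.pending = append(p.pending, req)
}

// onNewBlock is driven by the blockbeat and flushes everything queued
// since the last block.
func (p *publisher) onNewBlock() {
	for _, req := range p.pending {
		p.publish(req)
	}
	p.pending = nil
}

func (p *publisher) publish(req sweepRequest) {
	// Actual network broadcast elided.
	_ = req
}
```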
Previously in `markInputFailed`, we'd remove all inputs under the same
group via `removeExclusiveGroup`. This is wrong: when the current sweep
fails for this input, it shouldn't affect other inputs.
Also updated `handlePendingSweepsReq` to skip immature inputs so the
returned results are the same as those in pre-0.18.0.
Combined with the following commit, this gives more granular control
over the bump result when handling it in the sweeper.
After the previous commit, it should be clear that the tx may fail to be
created in a `TxFailed` event. We now make sure to catch this to avoid a
panic.
This commit inits the package `chainio` and defines the interfaces
`Blockbeat` and `Consumer`. `Consumer` must be implemented by any
subsystem that requires block epoch subscriptions.
In this commit, a minimal implementation of `Blockbeat` is added to
synchronize block heights, which will be used in `ChainArb`, `Sweeper`,
and `TxPublisher` so blocks are processed sequentially among them.
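For illustration, a minimal sketch of what these two interfaces could look like (method names are assumptions, not necessarily the exact `chainio` API):

```go
package chainio

// Blockbeat carries the information of a new block that consumers
// react to.
type Blockbeat interface {
	// Height returns the height of the block this beat represents.
	Height() int32
}

// Consumer is implemented by any subsystem that needs to process
// blocks sequentially via the dispatcher.
type Consumer interface {
	// Name returns a human-readable identifier used in logs.
	Name() string

	// ProcessBlock is called for every new beat and should return
	// only once the consumer has finished handling the block.
	ProcessBlock(b Blockbeat) error
}
```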
This commit adds two methods to handle dispatching beats. These are
exported methods so other systems can send beats to their managed
subinstances.
This commit adds a blockbeat dispatcher which handles sending new blocks
to all subscribed consumers.
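A sketch of how such a dispatcher might fan beats out, assuming the `Blockbeat`/`Consumer` interfaces sketched above; the queue structure here is illustrative, not the dispatcher's literal implementation:

```go
package chainio

import "fmt"

// Dispatcher sends each new beat to its registered consumer queues.
// Consumers within one queue are notified sequentially, so dependent
// subsystems (e.g. ChainArb before Sweeper) see blocks in order.
type Dispatcher struct {
	// queues holds ordered lists of consumers; each inner slice is
	// processed front to back before the next block is handled.
	queues [][]Consumer
}

// RegisterQueue appends a set of consumers that must process blocks in
// the given order.
func (d *Dispatcher) RegisterQueue(consumers []Consumer) {
	d.queues = append(d.queues, consumers)
}

// notify blocks until every consumer has processed the beat.
func (d *Dispatcher) notify(b Blockbeat) error {
	for _, q := range d.queues {
		for _, c := range q {
			if err := c.ProcessBlock(b); err != nil {
				return fmt.Errorf("consumer %s: %w",
					c.Name(), err)
			}
		}
	}
	return nil
}
```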
This commit implements `Consumer` on `TxPublisher`, `UtxoSweeper`,
`ChainArbitrator` and `ChannelArbitrator`.
This commit removes the independent block subscriptions in `UtxoSweeper`
and `TxPublisher`. These subsystems now listen to the `BlockbeatChan`
for new blocks.
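On the consumer side, the main event loop might drain the channel roughly like this (field and channel names are assumptions for illustration, building on the sketches above):

```go
package chainio

// exampleConsumer models a subsystem such as the sweeper after this
// change: no private epoch subscription, just a BlockbeatChan.
type exampleConsumer struct {
	// BlockbeatChan receives beats from the dispatcher.
	BlockbeatChan chan Blockbeat

	quit chan struct{}
}

// collector is the subsystem's main event loop.
func (c *exampleConsumer) collector() {
	for {
		select {
		case beat := <-c.BlockbeatChan:
			// Handle everything triggered by the new block,
			// e.g. sweep pending inputs.
			_ = beat.Height()

		case <-c.quit:
			return
		}
	}
}
```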
This commit removes the hack introduced in lightningnetwork#4851. Previously we had this
issue because the chain notifier was stopped before the sweeper, which
was changed a while back and we now always stop the chain notifier last.
In addition, since we no longer subscribe to the block epoch chan
directly, this issue can no longer happen.
The sweeper can handle the waiting, so there's no need to wait for
blocks inside the resolvers. Offering the inputs prior to their maturity
heights also guarantees that inputs with the same deadline are
aggregated.
This commit removes the block subscriptions used in `ChainArbitrator`
and replaces them with the blockbeat managed by `BlockbeatDispatcher`.
This commit removes the block subscriptions used in `ChannelArbitrator`
and replaces them with the blockbeat managed by `BlockbeatDispatcher`.
This `immediate` flag was added as a hack so that during a restart, the
pending resolvers would offer the inputs to the sweeper and ask it to
sweep them immediately. This is no longer needed due to `blockbeat`, as
now during restart, a block is always sent to all subsystems via the
flow `ChainArb` -> `ChannelArb` -> resolvers -> sweeper. Thus, when
there are pending inputs offered, they will be processed by the sweeper
immediately.
To avoid calling GetBestBlock again.
This is needed so the consumers have an initial state about the current
block.
In this commit we start to break up the starting process into smaller
pieces, which is needed in the following commit to initialize blockbeat
consumers.
Refactor the `Start` method to fix the linter error:
```
contractcourt/chain_arbitrator.go:568: Function 'Start' is too long (242 > 200) (funlen)
```
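The usual shape of such a refactor, sketched here with invented helper names, is to hoist coherent chunks of the long `Start` into named steps so each function stays under the linter's length budget:

```go
package contractcourt

// ChainArbitrator is reduced to a stub for this sketch.
type ChainArbitrator struct{}

// Start now reads as a short sequence of well-named steps.
func (c *ChainArbitrator) Start() error {
	if err := c.loadOpenChannels(); err != nil {
		return err
	}
	if err := c.loadPendingCloseChannels(); err != nil {
		return err
	}
	return c.startWatchers()
}

// The helpers below stand in for the extracted chunks of the original
// 242-line Start method.
func (c *ChainArbitrator) loadOpenChannels() error         { return nil }
func (c *ChainArbitrator) loadPendingCloseChannels() error { return nil }
func (c *ChainArbitrator) startWatchers() error            { return nil }
```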
@aakselrod (Contributor Author) commented:

Looks like a postgres itest failed. Looking into it. Doesn't appear to be related to the btcwallet deadlock fixed recently, so that's good!

@saubyk saubyk added this to the v0.19.0 milestone Nov 27, 2024
@aakselrod (Contributor Author) commented Nov 27, 2024

It looks like the shutdown came too early and the sweeper in monitorFeeBumpResult didn't get time to process the result of the sweep broadcast (exited at sweeper.go:1665). I don't think that's related to the SQL DB?

This seems to happen rarely enough that I can't seem to reproduce it locally.

@bhandras (Collaborator) left a comment

Looks really good! I think we just need some comments and perhaps some direct coverage for the batch package, then it's good to go!

```go
// sqldb retry and still re-execute the
// failing request individually.
dbErr := sqldb.MapSQLError(err)
if !sqldb.IsSerializationError(dbErr) {
```
nit: It'd be nice to cover batch with a simple unit test to make sure the serialization errors are correctly handled and we don't regress later.
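A rough sketch of the kind of unit test suggested here, with the batcher's fallback rule modeled by a stand-in function rather than wired into the real `batch` scheduler (the `runWithFallback` helper and error sentinel are assumptions for illustration):

```go
package batch

import (
	"errors"
	"testing"
)

// errSerialization stands in for a mapped postgres 40001 failure.
var errSerialization = errors.New("serialization error")

// runWithFallback mimics the retry rule from the snippet above: batch
// failures that are serialization errors are re-executed individually.
func runWithFallback(execBatch, execSingle func() error) error {
	if err := execBatch(); err != nil {
		if !errors.Is(err, errSerialization) {
			return err
		}
		return execSingle()
	}
	return nil
}

func TestSerializationErrorFallsBackToSingle(t *testing.T) {
	var singleRuns int

	err := runWithFallback(
		func() error { return errSerialization },
		func() error { singleRuns++; return nil },
	)
	if err != nil {
		t.Fatalf("unexpected error: %v", err)
	}
	if singleRuns != 1 {
		t.Fatalf("expected 1 individual retry, got %d", singleRuns)
	}
}
```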

Resolved review threads: channeldb/graph.go (×2), sqldb/sqlerrors.go
Makefile (outdated):
```
# each can run concurrently. Note that many of the settings here are
# specifically for integration testing and are not fit for running
# production nodes.
docker run --name lnd-postgres -e POSTGRES_PASSWORD=postgres \
	-p 6432:5432 -d postgres:13-alpine -N 1500 \
	-c max_pred_locks_per_transaction=1024 \
	-c max_locks_per_transaction=128 -c jit=off -c work_mem=8MB \
	-c checkpoint_timeout=10min -c enable_seqscan=off
```
Collaborator:
Do the settings work_mem=8MB and jit=off add to test stability? If so, could you please add some comments on why these were changed (along with the other params)?

@aakselrod (Contributor Author):

I got those settings from @djkazic's suggestions. I think it's likely they're unnecessary for the itests because the databases they're working with are pretty small. Will try running without them to see how it goes, and add comments for the rest.

@djkazic (Contributor) commented Nov 28, 2024:

Yeah, for context, those are the settings I'm using in my production postgres-backed LND. DB size is ~5GB. jit=off should help in all scenarios, as JIT'ing mostly benefits long-running queries, whereas postgres_kvdb tends to spawn many small, fast-executing queries. work_mem=8MB, on the other hand, benefits larger DBs, as it gives postgres more breathing room for each individual query to do sorts etc.

@aakselrod (Contributor Author) commented:
Updated to address comments, add release notes, and do a CI run. I'm still running this locally to see if the new DB settings give me any trouble.

@aakselrod (Contributor Author) commented Nov 28, 2024

Got a postgres unit test failure. Looking into it...

Error is here. Looks similar to errors (chain notification-related) that happened on both postgres and sqlite unit test runs on #9260, which doesn't have parallel DB transactions re-enabled.

I believe this is actually related to a flake in chainntnfs where a certain series of events makes the notifier attempt to send one more event than the (unread) channel has room for. The easiest fix is to add 2 to the channel allocation here, which has eliminated similar errors from all of our (Lightspark) CI runs in our private fork. This is in lieu of actually tracking down the flake further and fixing it "properly." I'll push it as soon as this CI run is complete.
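Hypothetically, the workaround amounts to sizing the buffered channel with headroom; a sketch (helper and names invented, not the actual chainntnfs code):

```go
package chainntnfs

// makeEventChan is a hypothetical helper showing the shape of the
// workaround: give the buffered channel two extra slots so a rare
// surplus event produced by the unfixed race can't block the sender.
func makeEventChan(expectedEvents int) chan int32 {
	return make(chan int32, expectedEvents+2)
}
```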

The updated DB settings seem to be working OK locally on my machine as well.

@yyforyongyu (Member) commented:

I believe this is actually related to a flake in chainntnfs where a certain series of events makes the notifier attempt to send one more event than the (unread) channel has room for.

Thanks a lot for digging into the unit test flake! That certainly was an easy fix, and it was originally used in #9258. Then I realized there's another place we need the dedup check, which was fixed in 4632044. Since this PR is based on #9260 and that PR has this commit upstream, I'd expect it to be fixed properly there, but it seems not? The other related fix is #9309, which fixed another flake.

Moving forward, I think we can focus on getting the postgres-related changes merged, then open new PRs to fix the unit test flakes.

@aakselrod (Contributor Author) commented:

Moving forward, I think we can focus on getting the postgres-related changes merged, then open new PRs to fix the unit test flakes.

I'm off until Monday but will remove the latest commit then. Thanks!

To make this itest work reliably with multiple parallel SQL
transactions, we need to count both the settle and final HTLC
events. Otherwise, sometimes the final events from earlier
forwards are counted before the forward events from later
forwards, causing a miscount of the settle events. If we
expect both the settle and final event for each forward,
we don't miscount.
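A sketch of the counting change this commit message describes (event type names are illustrative, not the itest's literal code):

```go
package itest

// htlcEventKind is an illustrative stand-in for lnd's HTLC event
// types.
type htlcEventKind int

const (
	settleEvent htlcEventKind = iota
	finalEvent
	otherEvent
)

// waitForForwards drains events until every forward has produced both
// its settle and its final event, so interleaving of events across
// parallel SQL transactions can't cause a miscount.
func waitForForwards(events <-chan htlcEventKind, numForwards int) {
	settles, finals := 0, 0
	for settles < numForwards || finals < numForwards {
		switch <-events {
		case settleEvent:
			settles++
		case finalEvent:
			finals++
		default:
			// Other event kinds don't affect the count.
		}
	}
}
```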
@aakselrod (Contributor Author) commented:

I tried to take off the latest commit and force-push, but GitHub got stuck processing the latest update, so I rearranged the order of the remaining commits a little and force-pushed again.

@aakselrod (Contributor Author) commented Dec 2, 2024

The only test that failed looks unrelated to the DB (protofsm unit-race), woohoo! We did get some coverage reductions, but the workflow isn't tracking coverage for itests that use the postgres backend. Maybe we should turn that on?

@yyforyongyu yyforyongyu force-pushed the yy-beat-itest-optimize branch 5 times, most recently from cd62b3a to ad8c68a Compare December 4, 2024 06:38