
Reapply 8644 on 9260 #9313

Open · wants to merge 155 commits into base: yy-beat-itest-optimize
Conversation

@aakselrod (Contributor) commented Nov 27, 2024

Rebase of #9242 on #9260. Now includes btcsuite/btcwallet#967.

yyforyongyu and others added 30 commits November 25, 2024 13:49
Also updated the logging. This new state will be used in the following
commit.
This prepares for the following commit, where we let the fee bumper
decide whether to broadcast immediately or not.
This commit changes how inputs are handled upon receiving a bump result.
Previously the inputs were taken from `BumpResult.Tx`; this is now
handled locally instead, as we remember the input set when sending the
bump request and handle that input set when a result is received.
This commit adds a new method `handleInitialBroadcast` to handle the
initial broadcast. Previously we'd broadcast immediately inside
`Broadcast`, which will no longer work once `blockbeat` is implemented,
as the publish action is then always triggered by a new block. Meanwhile,
we keep the option to bypass the block trigger so users can broadcast
immediately by setting `Immediate` to true.
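As a rough illustration of the behavior this commit describes (not lnd's actual code; the `sweepRequest` and `publisher` names are invented for the sketch), the broadcast decision might look like:

```go
package sweep

// sweepRequest is an illustrative stand-in for the sweeper's real
// request type; only the Immediate flag matters here.
type sweepRequest struct {
	// Immediate bypasses the block trigger so the tx is published
	// right away.
	Immediate bool
}

// publisher models the part of the sweeper that decides when to
// broadcast.
type publisher struct {
	pending []sweepRequest
}

// handleInitialBroadcast publishes immediately only when asked;
// otherwise the request waits for the next block.
func (p *publisher) handleInitialBroadcast(req sweepRequest) {
	if req.Immediate {
		p.publish(req)
		return
	}
	p.pending = append(p.pending, req)
}

// onNewBlock is driven by the blockbeat and flushes everything queued
// since the last block.
func (p *publisher) onNewBlock() {
	for _, req := range p.pending {
		p.publish(req)
	}
	p.pending = nil
}

func (p *publisher) publish(req sweepRequest) {
	// Actual network broadcast elided.
	_ = req
}
```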
Previously in `markInputFailed`, we'd remove all inputs under the same
group via `removeExclusiveGroup`. This is wrong: when the current sweep
fails for this input, it shouldn't affect other inputs.
Also updated `handlePendingSweepsReq` to skip immature inputs so the
returned results are the same as those in pre-0.18.0.
Combined with the following commit, this gives more granular control
over the bump result when handling it in the sweeper.
After the previous commit, it should be clear that the tx may fail to be
created in a `TxFailed` event. We now make sure to catch this to avoid a
panic.
This commit inits the package `chainio` and defines the interfaces
`Blockbeat` and `Consumer`. `Consumer` must be implemented by any
subsystem that requires block epoch subscriptions.
In this commit, a minimal implementation of `Blockbeat` is added to
synchronize block heights, which will be used in `ChainArb`, `Sweeper`,
and `TxPublisher` so blocks are processed sequentially among them.
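For illustration, a minimal sketch of what these two interfaces could look like (method names are assumptions, not necessarily the exact `chainio` API):

```go
package chainio

// Blockbeat carries the information of a new block that consumers
// react to.
type Blockbeat interface {
	// Height returns the height of the block this beat represents.
	Height() int32
}

// Consumer is implemented by any subsystem that needs to process
// blocks sequentially via the dispatcher.
type Consumer interface {
	// Name returns a human-readable identifier used in logs.
	Name() string

	// ProcessBlock is called for every new beat and should return
	// only once the consumer has finished handling the block.
	ProcessBlock(b Blockbeat) error
}
```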
This commit adds two methods to handle dispatching beats. These are
exported methods so other systems can send beats to their managed
subinstances.
This commit adds a blockbeat dispatcher which handles sending new blocks
to all subscribed consumers.
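A sketch of how such a dispatcher might fan beats out, assuming the `Blockbeat`/`Consumer` interfaces sketched above; the queue structure here is illustrative, not the dispatcher's literal implementation:

```go
package chainio

import "fmt"

// Dispatcher sends each new beat to its registered consumer queues.
// Consumers within one queue are notified sequentially, so dependent
// subsystems (e.g. ChainArb before Sweeper) see blocks in order.
type Dispatcher struct {
	// queues holds ordered lists of consumers; each inner slice is
	// processed front to back before the next block is handled.
	queues [][]Consumer
}

// RegisterQueue appends a set of consumers that must process blocks in
// the given order.
func (d *Dispatcher) RegisterQueue(consumers []Consumer) {
	d.queues = append(d.queues, consumers)
}

// notify blocks until every consumer has processed the beat.
func (d *Dispatcher) notify(b Blockbeat) error {
	for _, q := range d.queues {
		for _, c := range q {
			if err := c.ProcessBlock(b); err != nil {
				return fmt.Errorf("consumer %s: %w",
					c.Name(), err)
			}
		}
	}
	return nil
}
```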
This commit implements `Consumer` on `TxPublisher`, `UtxoSweeper`,
`ChainArbitrator` and `ChannelArbitrator`.
This commit removes the independent block subscriptions in `UtxoSweeper`
and `TxPublisher`. These subsystems now listen to the `BlockbeatChan`
for new blocks.
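On the consumer side, the main event loop might drain the channel roughly like this (field and channel names are assumptions for illustration, building on the sketches above):

```go
package chainio

// exampleConsumer models a subsystem such as the sweeper after this
// change: no private epoch subscription, just a BlockbeatChan.
type exampleConsumer struct {
	// BlockbeatChan receives beats from the dispatcher.
	BlockbeatChan chan Blockbeat

	quit chan struct{}
}

// collector is the subsystem's main event loop.
func (c *exampleConsumer) collector() {
	for {
		select {
		case beat := <-c.BlockbeatChan:
			// Handle everything triggered by the new block,
			// e.g. sweep pending inputs.
			_ = beat.Height()

		case <-c.quit:
			return
		}
	}
}
```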
This commit removes the hack introduced in lightningnetwork#4851. Previously we had this
issue because the chain notifier was stopped before the sweeper, which
was changed a while back and we now always stop the chain notifier last.
In addition, since we no longer subscribe to the block epoch chan
directly, this issue can no longer happen.
The sweeper can handle the waiting, so there's no need to wait for
blocks inside the resolvers. Offering the inputs prior to their maturity
heights also guarantees that inputs with the same deadline are
aggregated.
This commit removes the block subscriptions used in `ChainArbitrator`
and replaces them with the blockbeat managed by `BlockbeatDispatcher`.
This commit removes the block subscriptions used in `ChannelArbitrator`
and replaces them with the blockbeat managed by `BlockbeatDispatcher`.
This `immediate` flag was added as a hack so that during a restart, the
pending resolvers would offer the inputs to the sweeper and ask it to
sweep them immediately. This is no longer needed due to `blockbeat`, as
now during restart, a block is always sent to all subsystems via the
flow `ChainArb` -> `ChannelArb` -> resolvers -> sweeper. Thus, when
there are pending inputs offered, they will be processed by the sweeper
immediately.
To avoid calling GetBestBlock again.
This is needed so the consumers have an initial state about the current
block.
In this commit we start to break up the starting process into smaller
pieces, which is needed in the following commit to initialize blockbeat
consumers.
Refactor the `Start` method to fix the linter error:
```
contractcourt/chain_arbitrator.go:568: Function 'Start' is too long (242 > 200) (funlen)
```
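The usual shape of such a refactor, sketched here with invented helper names, is to hoist coherent chunks of the long `Start` into named steps so each function stays under the linter's length budget:

```go
package contractcourt

// ChainArbitrator is reduced to a stub for this sketch.
type ChainArbitrator struct{}

// Start now reads as a short sequence of well-named steps.
func (c *ChainArbitrator) Start() error {
	if err := c.loadOpenChannels(); err != nil {
		return err
	}
	if err := c.loadPendingCloseChannels(); err != nil {
		return err
	}
	return c.startWatchers()
}

// The helpers below stand in for the extracted chunks of the original
// 242-line Start method.
func (c *ChainArbitrator) loadOpenChannels() error         { return nil }
func (c *ChainArbitrator) loadPendingCloseChannels() error { return nil }
func (c *ChainArbitrator) startWatchers() error            { return nil }
```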
@aakselrod (Contributor Author) commented:

Looks like a postgres itest failed. Looking into it. Doesn't appear to be related to the btcwallet deadlock fixed recently, so that's good!

@saubyk saubyk added this to the v0.19.0 milestone Nov 27, 2024
@aakselrod (Contributor Author) commented Nov 27, 2024

It looks like the shutdown came too early and the sweeper in monitorFeeBumpResult didn't get time to process the result of the sweep broadcast (exited at sweeper.go:1665). I don't think that's related to the SQL DB?

This seems to happen rarely enough that I can't seem to reproduce it locally.

@bhandras (Collaborator) left a comment

Looks really good! I think we just need some comments and perhaps some direct coverage for the batch package, then it's good to go!

```go
// sqldb retry and still re-execute the
// failing request individually.
dbErr := sqldb.MapSQLError(err)
if !sqldb.IsSerializationError(dbErr) {
```
nit: It'd be nice to cover batch with a simple unit test to make sure the serialization errors are correctly handled and we don't regress later.
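A rough sketch of the kind of unit test suggested here, with the batcher's fallback rule modeled by a stand-in function rather than wired into the real `batch` scheduler (the `runWithFallback` helper and error sentinel are assumptions for illustration):

```go
package batch

import (
	"errors"
	"testing"
)

// errSerialization stands in for a mapped postgres 40001 failure.
var errSerialization = errors.New("serialization error")

// runWithFallback mimics the retry rule from the snippet above: batch
// failures that are serialization errors are re-executed individually.
func runWithFallback(execBatch, execSingle func() error) error {
	if err := execBatch(); err != nil {
		if !errors.Is(err, errSerialization) {
			return err
		}
		return execSingle()
	}
	return nil
}

func TestSerializationErrorFallsBackToSingle(t *testing.T) {
	var singleRuns int

	err := runWithFallback(
		func() error { return errSerialization },
		func() error { singleRuns++; return nil },
	)
	if err != nil {
		t.Fatalf("unexpected error: %v", err)
	}
	if singleRuns != 1 {
		t.Fatalf("expected 1 individual retry, got %d", singleRuns)
	}
}
```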

Resolved review threads: channeldb/graph.go (×2), sqldb/sqlerrors.go
Makefile (outdated):
```
# each can run concurrently. Note that many of the settings here are
# specifically for integration testing and are not fit for running
# production nodes.
docker run --name lnd-postgres -e POSTGRES_PASSWORD=postgres \
	-p 6432:5432 -d postgres:13-alpine -N 1500 \
	-c max_pred_locks_per_transaction=1024 \
	-c max_locks_per_transaction=128 -c jit=off -c work_mem=8MB \
	-c checkpoint_timeout=10min -c enable_seqscan=off
```
Collaborator:
Do the settings work_mem=8MB and jit=off add to test stability? If so, could you please add some comments on why these were changed (along with the other params)?

@aakselrod (Contributor Author):

I got those settings from @djkazic's suggestions. I think it's likely they're unnecessary for the itests because the databases they're working with are pretty small. Will try running without them to see how it goes, and add comments for the rest.

@djkazic (Contributor) commented Nov 28, 2024:

Yeah, for context, those are the settings I'm using in my production postgres-backed LND. DB size is ~5GB. jit=off should help in all scenarios, as JIT'ing mostly benefits long-running queries, whereas postgres_kvdb tends to spawn many small, fast-executing queries. work_mem=8MB, on the other hand, benefits larger DBs, as it gives postgres more breathing room for each individual query to do sorts etc.

@aakselrod (Contributor Author) commented:
Updated to address comments, add release notes, and do a CI run. I'm still running this locally to see if the new DB settings give me any trouble.

@aakselrod (Contributor Author) commented Nov 28, 2024

Got a postgres unit test failure. Looking into it...

Error is here. Looks similar to errors (chain notification-related) that happened on both postgres and sqlite unit test runs on #9260, which doesn't have parallel DB transactions re-enabled.

I believe this is actually related to a flake in chainntnfs where a certain series of events makes the notifier attempt to send one more event than the (unread) channel has room for. The easiest fix is to add 2 to the channel allocation here, which has eliminated similar errors from all of our (Lightspark) CI runs in our private fork. This is in lieu of actually tracking down the flake further and fixing it "properly." I'll push it as soon as this CI run is complete.
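Hypothetically, the workaround amounts to sizing the buffered channel with headroom; a sketch (helper and names invented, not the actual chainntnfs code):

```go
package chainntnfs

// makeEventChan is a hypothetical helper showing the shape of the
// workaround: give the buffered channel two extra slots so a rare
// surplus event produced by the unfixed race can't block the sender.
func makeEventChan(expectedEvents int) chan int32 {
	return make(chan int32, expectedEvents+2)
}
```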

The updated DB settings seem to be working OK locally on my machine as well.

@yyforyongyu (Member) commented:

I believe this is actually related to a flake in chainntnfs where a certain series of events makes the notifier attempt to send one more event than the (unread) channel has room for.

Thanks a lot for digging into the unit test flake! That certainly was an easy fix, and it was originally used in #9258. Then I realized there's another place we need the dedup check, which was fixed in 4632044. Since this PR is based on #9260 and that PR has this commit upstream, I'd expect it to be fixed properly there, but it seems not? The other related fix is #9309, which fixed another flake.

Moving forward, I think we can focus on getting the postgres-related changes merged, then open new PRs to fix the unit test flakes.

@aakselrod (Contributor Author) commented:

Moving forward, I think we can focus on getting the postgres-related changes merged, then open new PRs to fix the unit test flakes.

I'm off until Monday but will remove the latest commit then. Thanks!

To make this itest work reliably with multiple parallel SQL
transactions, we need to count both the settle and final HTLC
events. Otherwise, sometimes the final events from earlier
forwards are counted before the forward events from later
forwards, causing a miscount of the settle events. If we
expect both the settle and final event for each forward,
we don't miscount.
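A sketch of the counting change this commit message describes (event type names are illustrative, not the itest's literal code):

```go
package itest

// htlcEventKind is an illustrative stand-in for lnd's HTLC event
// types.
type htlcEventKind int

const (
	settleEvent htlcEventKind = iota
	finalEvent
	otherEvent
)

// waitForForwards drains events until every forward has produced both
// its settle and its final event, so interleaving of events across
// parallel SQL transactions can't cause a miscount.
func waitForForwards(events <-chan htlcEventKind, numForwards int) {
	settles, finals := 0, 0
	for settles < numForwards || finals < numForwards {
		switch <-events {
		case settleEvent:
			settles++
		case finalEvent:
			finals++
		default:
			// Other event kinds don't affect the count.
		}
	}
}
```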
@aakselrod (Contributor Author) commented:

I tried to take off the latest commit and force-push, but GitHub got stuck processing the latest update, so I rearranged the order of the remaining commits a little and force-pushed again.

@aakselrod (Contributor Author) commented Dec 2, 2024

The only test that failed looks unrelated to the DB (protofsm unit-race), woohoo! We did get some coverage reductions, but the workflow isn't tracking coverage for itests that use the postgres backend. Maybe we should turn that on?

@yyforyongyu yyforyongyu force-pushed the yy-beat-itest-optimize branch 5 times, most recently from cd62b3a to ad8c68a Compare December 4, 2024 06:38