feat: use many tasks to order streams and discover undelivered events at startup #620

Merged
merged 9 commits into from
Dec 5, 2024

dav1do
Contributor

@dav1do dav1do commented Nov 27, 2024

After an IPFS migration, we have to review all the data in the database to make sure we have the complete stream history before we can send events out of the API. We originally kept it simple: a single task would read events, process them, and repeat. This was taking far too long on large datasets (e.g. 100s of GBs). Now we spawn multiple tasks to read the events from the database, and they send events over a channel to the ordering task (like we do during normal operation). The ordering task was also modified to spawn multiple tasks to process events by stream and order them. Both changes appeared necessary during testing, as otherwise one side would end up waiting on the other. This allows us to keep a solid rate of processing going, and we've seen a substantial improvement in runtime (~60-100x faster).
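To illustrate the reader/orderer flow described above, here is a minimal sketch using Python threads and a bounded queue as a stand-in for the real tasks and channel. The names `reader`, `drain`, `run_pipeline`, and `DONE` are hypothetical and not taken from this PR; the sketch only shows several readers feeding one consumer that empties the channel before it works on what it received.

```python
import queue
import threading

CHANNEL_SIZE = 10_000  # capacity used for this sketch
DONE = object()        # hypothetical sentinel: one per reader, sent last


def reader(events, channel):
    """Push one partition's events onto the shared channel."""
    for event in events:
        channel.put(event)  # blocks while the channel is full (backpressure)
    channel.put(DONE)


def drain(channel, first):
    """Return `first` plus everything currently queued, without blocking."""
    items = [first]
    while True:
        try:
            items.append(channel.get_nowait())
        except queue.Empty:
            return items


def run_pipeline(partitions, channel_size=CHANNEL_SIZE):
    """Run one reader per partition and collect everything they send."""
    channel = queue.Queue(maxsize=channel_size)
    readers = [
        threading.Thread(target=reader, args=(part, channel))
        for part in partitions
    ]
    for t in readers:
        t.start()

    remaining = len(readers)
    received = []
    while remaining:
        # Block for one item, then empty the channel before processing.
        for item in drain(channel, channel.get()):
            if item is DONE:
                remaining -= 1
            else:
                received.append(item)  # real code would order by stream here

    for t in readers:
        t.join()
    return received
```

Because every reader sends its sentinel after its last event, the consumer has seen all events once it has counted one sentinel per reader.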

On the discovery side: At startup, we spawn 16 tasks to read batches from the database. The number of events read each time was reduced to 250, as reading 1000 was taking seconds. The values are somewhat arbitrary, but this seemed like a "fast enough" choice during testing (the goal is simply to keep the channel full). The event data is partitioned using (rowid % number_tasks) = task_number, so we don't have to do anything clever to split the data into batches up front. Each task starts from the beginning and picks up any events that have been missed. Once a full pass has completed, subsequent runs are fast, so we spawn the tasks regardless of whether they're needed.
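The partitioning above can be sketched against an in-memory SQLite database. This is an illustration only: the `events` table, its `payload` column, the exact query, and the function names are assumptions rather than the schema or code from this PR. Only the 16 tasks, the batch size of 250, and the `rowid % number_tasks = task_number` predicate come from the description.

```python
import sqlite3

NUMBER_TASKS = 16  # from the description above
BATCH_SIZE = 250   # from the description above


def read_partition(conn, task_number,
                   number_tasks=NUMBER_TASKS, batch_size=BATCH_SIZE):
    """Yield batches of rowids that belong to one task's partition."""
    last_rowid = 0  # every task starts from the beginning
    while True:
        rows = conn.execute(
            "SELECT rowid FROM events "
            "WHERE rowid > ? AND rowid % ? = ? "
            "ORDER BY rowid LIMIT ?",
            (last_rowid, number_tasks, task_number, batch_size),
        ).fetchall()
        if not rows:
            return
        batch = [row[0] for row in rows]
        yield batch
        last_rowid = batch[-1]  # resume after the last row we saw


def make_demo_db(n_events):
    """Build a throwaway database with a hypothetical events table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (payload TEXT)")
    conn.executemany(
        "INSERT INTO events (payload) VALUES (?)",
        [(f"event-{i}",) for i in range(n_events)],
    )
    return conn
```

Each task number in `range(NUMBER_TASKS)` walks a disjoint set of rows, so together the tasks cover the whole table without coordinating with each other.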

On the ordering side, a few changes were made. First, the channel size was reduced to 10000 (the previous value was far too large), and we try to empty it before doing any ordering, since we have more tasks to process the set and any events found may avoid database reads if they're for the same stream. Once events are grouped by stream, we split the streams into batches and spawn 1-16 tasks to process each batch. This processing involves CPU-bound work but also requires database reads, so multiple tasks have been beneficial. The tasks then send their ordered data back to the manager, which handles writing to the database. During testing, I made a change to remove a read-only (RO) connection from the pool (and allow it to grow afterward) for each of these tasks. It didn't seem to make an obvious difference, but it may be useful to revisit.
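The grouping and fan-out described above might look roughly like the following. This is a sketch under stated assumptions: events are modeled as `(stream_id, sequence)` pairs, "ordering" is a plain sort standing in for the real work, streams are assigned to batches round-robin, and the worker count is `min(16, number of streams)`. None of those details are confirmed by this PR, and all function names are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

MAX_TASKS = 16  # upper bound from the description above


def group_by_stream(events):
    """Group (stream_id, sequence) pairs by stream."""
    streams = defaultdict(list)
    for stream_id, sequence in events:
        streams[stream_id].append(sequence)
    return streams


def split_into_batches(streams, max_tasks=MAX_TASKS):
    """Spread streams round-robin across 1..max_tasks batches."""
    task_count = max(1, min(max_tasks, len(streams)))
    batches = [{} for _ in range(task_count)]
    for i, (stream_id, sequences) in enumerate(streams.items()):
        batches[i % task_count][stream_id] = sequences
    return batches


def order_batch(batch):
    """Placeholder for the real ordering work: sort each stream."""
    return {stream_id: sorted(seqs) for stream_id, seqs in batch.items()}


def order_streams(events):
    """Fan batches out to workers; the caller acts as the manager."""
    batches = split_into_batches(group_by_stream(events))
    ordered = {}
    with ThreadPoolExecutor(max_workers=len(batches)) as pool:
        for result in pool.map(order_batch, batches):
            # Only the manager merges results (and, in the real system,
            # would be the one writing them to the database).
            ordered.update(result)
    return ordered
```

Keeping a whole stream inside one batch means no two workers ever touch the same stream, so the results can be merged without conflict.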

@dav1do dav1do force-pushed the feat/parallelize-undelivered branch from a8bef06 to aa30168 Compare November 27, 2024 05:03
@dav1do dav1do marked this pull request as ready for review November 27, 2024 18:06
@dav1do dav1do requested review from nathanielc and a team as code owners November 27, 2024 18:06
@dav1do dav1do requested review from sam701 and removed request for a team November 27, 2024 18:06
Base automatically changed from chore/undelivered-logs to main November 27, 2024 18:42
@dav1do dav1do force-pushed the feat/parallelize-undelivered branch from aa30168 to 7031f11 Compare November 27, 2024 22:19
@dav1do dav1do requested a review from stbrody as a code owner November 27, 2024 22:19
@dav1do dav1do changed the base branch from main to chore/db-optimize November 27, 2024 22:19
@dav1do dav1do force-pushed the feat/parallelize-undelivered branch from 7031f11 to d6b152e Compare November 27, 2024 22:20
Base automatically changed from chore/db-optimize to main November 28, 2024 06:03
@dav1do dav1do force-pushed the feat/parallelize-undelivered branch 2 times, most recently from 5f1f615 to 83308f9 Compare December 2, 2024 22:02
@dav1do dav1do marked this pull request as draft December 2, 2024 23:25
@dav1do dav1do force-pushed the feat/parallelize-undelivered branch from ee59405 to df8e38f Compare December 3, 2024 00:07
@dav1do dav1do force-pushed the feat/parallelize-undelivered branch from df8e38f to 1d03275 Compare December 3, 2024 01:38
@dav1do dav1do force-pushed the feat/parallelize-undelivered branch from 1d03275 to c52992f Compare December 3, 2024 02:35
@dav1do dav1do changed the base branch from main to fix/sqlite-config December 3, 2024 02:41
@dav1do dav1do force-pushed the feat/parallelize-undelivered branch from c52992f to 8601811 Compare December 3, 2024 03:24
@dav1do dav1do force-pushed the fix/sqlite-config branch from 404974b to dfdac38 Compare December 3, 2024 03:40
@dav1do dav1do force-pushed the feat/parallelize-undelivered branch from 8601811 to 7d16585 Compare December 3, 2024 03:41
@dav1do dav1do force-pushed the feat/parallelize-undelivered branch from 7d16585 to f389ea2 Compare December 4, 2024 16:20
Base automatically changed from fix/sqlite-config to main December 5, 2024 17:08
… start up

This will help some as we're able to do all the sorting/reading of event history in one task while the other finds new events that need to be added. It is similar to the insert/ordering task flow now.
We can process each stream individually, so we spawn tasks to handle batches of streams so we can do db reads in parallel.
@dav1do dav1do force-pushed the feat/parallelize-undelivered branch from f389ea2 to cd39825 Compare December 5, 2024 19:21
@dav1do dav1do changed the title feat: use two tasks to process undelivered events at startup feat: use many tasks to order streams and discover undelivered events at startup Dec 5, 2024
@dav1do dav1do marked this pull request as ready for review December 5, 2024 20:03
@dav1do dav1do added this pull request to the merge queue Dec 5, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Dec 5, 2024
@dav1do dav1do added this pull request to the merge queue Dec 5, 2024
Merged via the queue into main with commit c959cc3 Dec 5, 2024
5 checks passed
@dav1do dav1do deleted the feat/parallelize-undelivered branch December 5, 2024 23:03
@smrz2001 smrz2001 mentioned this pull request Dec 9, 2024
2 participants