compute: feed MV sink frontier from persist write handle #30662

teskje · 2024-11-29T14:50:45Z

This PR changes the way the MV sink produces updates to the shared sink frontier. Previously it would inspect the persist stream and forward that stream's frontier. The issue with this approach is that, in the face of ingestion spikes, the persist stream could fall behind, which would also make the MV sink's reported write frontier fall behind.

This led to a regression in the CrossJoin feature benchmark. In this benchmark, a large constant MV is created. As soon as the MV dataflow has produced its snapshot and has reported the empty frontier, the controller instructs the replica to drop it. If the reporting of the empty frontier is delayed because the MV sink is busy reading back the entire snapshot from persist, the controller's drop command also gets delayed. The result is that queries against the cluster get slowed down by the MV dataflow's processing, more than they would have been had the MV dataflow been dropped quickly.

Attaching the shared write frontier to the WriteHandle-based frontier stream used by the mint operator avoids the issue.

This PR also makes two minor changes to the MV sink (commits 1 and 3):

adjusts trace logging to include the sink ID
re-applies changes I originally made to address Dan's review comments in compute: MV sink refresh #26745 but then lost in a rebase

Motivation

This PR fixes a recognized bug.

Fixes https://github.com/MaterializeInc/database-issues/issues/8795

Tips for reviewer

Commits are meaningful and can be reviewed separately.

Checklist

This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

This commit improves the trace logging in the MV sink v2 by adding a `sink_id` field.

This commit changes the way the MV sink produces updates to the shared sink frontier. Previously it would inspect the `persist` stream and forward that stream's frontier. The issue with this approach is that, in the face of ingestion spikes, the `persist` stream could fall behind, which would also make the MV sink's reported write frontier fall behind. This lead to a regression in the `CrossJoin` feature benchmark. In this benchmark, a large constant MV is created. As soon as the MV dataflow has produced its snapshot and has reported the empty frontier, the controller instructs the replica to drop it. If the reporting of the empty frontier is delayed because the MV sink is busy reading back the entire snapshot from persist, the controller's drop command also gets delayed. The result is that queries against the cluster get slowed down by the MV dataflow's processing, more than they would have been had the MV dataflow been dropped quickly. Attaching the shared write frontier to the `WriteHandle`-based frontier stream used by the `mint` operator avoids the issue.

This commit re-applies three minor changes that were the result of Dan's review comments on MaterializeInc#26745 but that somehow got lost in a rebase before that PR merged. * Replace `debug_assert`s with `assert`s. * Don't unnecessarily box operator tokens. * Add a clarifying comment about valid batch descriptions.

teskje · 2024-12-01T15:01:48Z

All nightlies failures are unrelated, afaict. In particular the upsert test failure also occurred in https://buildkite.com/materialize/nightly/builds/10582#_, which ran on a different PR.

teskje marked this pull request as ready for review November 30, 2024 08:15

teskje requested a review from a team as a code owner November 30, 2024 08:15

teskje requested a review from antiguru November 30, 2024 08:16

teskje and others added 4 commits November 30, 2024 12:21

compute: improve trace logging in MV sink v2

88c61c1

This commit improves the trace logging in the MV sink v2 by adding a `sink_id` field.

upsert test: Save previous workflows' logs

ec965a9

teskje force-pushed the faster-mv-sink-frontier-reporting branch from 234d320 to ec965a9 Compare November 30, 2024 11:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

compute: feed MV sink frontier from persist write handle #30662

compute: feed MV sink frontier from persist write handle #30662

teskje commented Nov 29, 2024 •

edited

Loading

teskje commented Dec 1, 2024

compute: feed MV sink frontier from persist write handle #30662

Are you sure you want to change the base?

compute: feed MV sink frontier from persist write handle #30662

Conversation

teskje commented Nov 29, 2024 • edited Loading

Motivation

Tips for reviewer

Checklist

teskje commented Dec 1, 2024

teskje commented Nov 29, 2024 •

edited

Loading