Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: track progress for SourceBackfill (blocking DDL) #18112

Merged
merged 2 commits into from
Sep 6, 2024

Conversation

xxchan
Copy link
Member

@xxchan xxchan commented Aug 19, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Support blocking DDL for source #18338

Core changes:

  1. Add SourceScan to tracking_progress_actor_ids of a MV. So a MV will not enter Created state until the SourceBackfillExecutor's actor reported progress.
  2. In SourceBackfillExecutor, the progress reporting is a dummy one: it simply reused MV's mechanism, and only uses finish.

A subtle case to consider: whether to report finish after entering SourceCatchingUp state, or after Finished state?

  • Finished might be more intuitive and safe. However, it may make a foreground DDL blocked forever until upstream messages come. (See state diagram in Let source backfill finish when there's no data from upstream #18299. We will be in SourceCatchingUp.) It may seem strange: why with refactor: use high watermark to finish backfill faster #18342 (end at high watermark), we still cannot enter Finished state? Or why no messages come from upstream when we already backfilled to high watermark?

    Here's a illustration: Assume there's no messages from upstream, and we reached high_watermark. Now we cannot distinguish whether upstream has reached high_watermark, so we enter SourceCatchingUp state. If it hasn't caught up, we can enter Finished when it catches up. If it's already at latest, we will not be able to transition state. (Let source backfill finish when there's no data from upstream #18299's point 4 mentioned that we might distinguish this situation if we can do has_message_available)
    image.png

  • So it seems we'd better end at SourceCatchingUp to avoid blocking forever. It's also reasonable, since now we already exited backfill stream. But the problem is that after reporting finish, and the DDL returns, online scaling may happen now. But we need to handle it carefully (Problem of online scaling for source backfill #18300).

    Note: for scaling, "backfill finished" == "progress reported finish". Not exactly the same as the loop code in the executor. So we can't use the old hack like scanning whole state table to delay finished. Now we do want to finish earlier! But need to make it scaling-safe.

    So the solution is to decouple 2 timings:

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

Copy link
Member Author

xxchan commented Aug 19, 2024

@xxchan xxchan force-pushed the 08-15-refactor_prost_optimize_some_debug_representation branch from 2a91599 to 9460191 Compare August 23, 2024 05:39
@xxchan xxchan force-pushed the 08-16-feat_track_progress_for_sourcebackfill branch 2 times, most recently from d32fec9 to 222775f Compare August 23, 2024 05:47
Base automatically changed from 08-15-refactor_prost_optimize_some_debug_representation to 08-14-refactor_add_some_comments_for_mv_progress_tracking August 23, 2024 05:48
@xxchan xxchan force-pushed the 08-14-refactor_add_some_comments_for_mv_progress_tracking branch from 8bf4b4d to 9e34c99 Compare August 23, 2024 05:53
@xxchan xxchan changed the base branch from 08-14-refactor_add_some_comments_for_mv_progress_tracking to xxxxxxxx August 23, 2024 05:55
@xxchan xxchan force-pushed the 08-16-feat_track_progress_for_sourcebackfill branch from 222775f to 3c893e7 Compare August 23, 2024 06:01
@xxchan xxchan changed the base branch from xxxxxxxx to 08-14-refactor_add_some_comments_for_mv_progress_tracking August 23, 2024 06:01
Base automatically changed from 08-14-refactor_add_some_comments_for_mv_progress_tracking to main August 23, 2024 06:25
@xxchan xxchan force-pushed the 08-16-feat_track_progress_for_sourcebackfill branch from 3c893e7 to 2d650db Compare September 2, 2024 08:08
@xxchan xxchan changed the base branch from main to 08-24-refactor_use_high_watermark_to_finish_backfill_faster September 2, 2024 08:08
@xxchan xxchan force-pushed the 08-16-feat_track_progress_for_sourcebackfill branch from 2d650db to b3d554c Compare September 2, 2024 08:58
@xxchan xxchan marked this pull request as ready for review September 2, 2024 09:06
@xxchan xxchan requested a review from a team as a code owner September 2, 2024 09:06
@xxchan xxchan requested review from BugenZhao and removed request for a team September 2, 2024 09:07
@xxchan xxchan force-pushed the 08-16-feat_track_progress_for_sourcebackfill branch from b3d554c to a0ddd6d Compare September 2, 2024 09:08
Comment on lines +365 to +367
& (FragmentTypeFlag::Values as u32
| FragmentTypeFlag::StreamScan as u32
| FragmentTypeFlag::SourceScan as u32))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is core change 1

@xxchan xxchan force-pushed the 08-24-refactor_use_high_watermark_to_finish_backfill_faster branch from efc2a77 to dc3aae9 Compare September 3, 2024 04:19
@xxchan xxchan force-pushed the 08-16-feat_track_progress_for_sourcebackfill branch 4 times, most recently from 2ed4c17 to 6328de6 Compare September 3, 2024 06:00
Comment on lines -678 to +707
true,
false,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a minor bug. (not related

Comment on lines -422 to +435
if !self.backfill_finished(&backfill_stage.states).await? {
{
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the early exit condition, let it only exit in one place to make it safer & simpler

@xxchan xxchan requested review from chenzl25 and fuyufjh September 3, 2024 06:06
@xxchan xxchan changed the title feat: track progress for SourceBackfill feat: track progress for SourceBackfill (blocking DDL) Sep 3, 2024
Copy link
Member

@fuyufjh fuyufjh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM

// For now, we just rely on the same code path, and for source backfill, the progress will always be 99.99%.
tracing::info!("progress finish");
let epoch = barrier.epoch;
self.progress.finish(epoch, 114514);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 2nd argument of progress.finish() is current_consumed_rows, which I think it's easy to count and report an accurate number. Why not doing it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. Not did it just because no total_consumed_rows and thought it meaningless

Copy link
Member

@fuyufjh fuyufjh Sep 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following the #18112 (comment) will make it more meaningful. I think we can do the counting part in this PR.

if self.backfill_finished(&backfill_stage.states).await? {
break 'backfill_loop;
if self.should_report_finished(&backfill_stage.states) {
// TODO: use a specialized progress for source
Copy link
Member

@fuyufjh fuyufjh Sep 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An idea: We may report the process and total rows in the progress as well.

current: (show jobs)

  Id  |                     Statement                      | Progress
------+----------------------------------------------------+----------
 1010 | CREATE MATERIALIZED VIEW mv1 AS SELECT * FROM mv1  | 2.21%

expected:

  Id  |                     Statement                         | Processed | Total | Progress
------+-------------------------------------------------------+-----------+-------+---------
 1010 | CREATE MATERIALIZED VIEW mv1 AS SELECT * FROM mv1     | 221       | 1000  | 2.21% 
 1012 | CREATE MATERIALIZED VIEW mv2 AS SELECT * FROM source1 | 114514    |       | 

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found #17735 already did similar thing when set streaming_use_snapshot_backfill = true. It just changed the Progress string.

dev=> show jobs;
┌────┬────────────────────────────────────────────────┬───────────────────┐
│ Id │                   Statement                    │     Progress      │
├────┼────────────────────────────────────────────────┼───────────────────┤
│ 29 │ CREATE MATERIALIZED VIEW mv AS SELECT * FROM t │ Snapshot [32.08%] │
└────┴────────────────────────────────────────────────┴───────────────────┘
(1 row)

pub(super) fn gen_ddl_progress(&self) -> DdlProgress {
let progress = match &self.status {
CreatingStreamingJobStatus::ConsumingSnapshot {
create_mview_tracker,
..
} => {
if create_mview_tracker.has_pending_finished_jobs() {
"Snapshot finished".to_string()
} else {
let progress = create_mview_tracker
.gen_ddl_progress()
.remove(&self.info.table_fragments.table_id().table_id)
.expect("should exist");
format!("Snapshot [{}]", progress.progress)
}
}
CreatingStreamingJobStatus::ConsumingLogStore {
start_consume_log_store_epoch,
..
} => {
let max_collected_epoch = max(
self.barrier_control.max_collected_epoch().unwrap_or(0),
self.backfill_epoch,
);
let lag = Duration::from_millis(
Epoch(*start_consume_log_store_epoch)
.physical_time()
.saturating_sub(Epoch(max_collected_epoch).physical_time()),
);
format!(
"LogStore [remain lag: {:?}, epoch cnt: {}]",
lag,
self.barrier_control.inflight_barrier_count()
)
}
CreatingStreamingJobStatus::ConsumingUpstream { .. } => {
format!(
"Upstream [unattached: {}, epoch cnt: {}]",
self.barrier_control.unattached_epochs().count(),
self.barrier_control.inflight_barrier_count(),
)
}
CreatingStreamingJobStatus::Finishing { .. } => {
format!(
"Finishing [epoch count: {}]",
self.barrier_control.inflight_barrier_count()
)
}
};
DdlProgress {
id: self.info.table_fragments.table_id().table_id as u64,
statement: self.info.definition.clone(),
progress,
}
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me do this in a separate PR, because it's more complex than I expected. Since one MV may have multiple backfill upstreams, and I want to distinguish MV backfill rows and source backfill rows.

@xxchan xxchan force-pushed the 08-24-refactor_use_high_watermark_to_finish_backfill_faster branch from 4a5c008 to 1a663a9 Compare September 3, 2024 10:10
@xxchan xxchan force-pushed the 08-16-feat_track_progress_for_sourcebackfill branch from 6328de6 to e2fd423 Compare September 3, 2024 10:10
Base automatically changed from 08-24-refactor_use_high_watermark_to_finish_backfill_faster to main September 4, 2024 02:26
@graphite-app graphite-app bot requested a review from a team September 4, 2024 02:26
@xxchan xxchan force-pushed the 08-16-feat_track_progress_for_sourcebackfill branch from e2fd423 to a54a89f Compare September 4, 2024 02:32
@xxchan xxchan enabled auto-merge September 4, 2024 02:33
@xxchan xxchan disabled auto-merge September 4, 2024 02:33
@xxchan xxchan added this pull request to the merge queue Sep 6, 2024
Merged via the queue into main with commit 6489e9a Sep 6, 2024
31 of 32 checks passed
@xxchan xxchan deleted the 08-16-feat_track_progress_for_sourcebackfill branch September 6, 2024 05:28
@xxchan xxchan mentioned this pull request Sep 11, 2024
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants