Arrangement backfill fails recovery test, get_row could not find vnode=0 #14910

Closed
kwannoel opened this issue Feb 1, 2024 · 4 comments · Fixed by #15787
Comments

kwannoel commented Feb 1, 2024

Branch to reproduce: #14931 (please fork the branch if you want to do any debugging, to make sure we always have a reproducible case).

See #14888 (comment).

@kwannoel kwannoel added the type/bug Something isn't working label Feb 1, 2024
@github-actions github-actions bot added this to the release-1.7 milestone Feb 1, 2024
kwannoel commented Feb 8, 2024

Here is the root cause:

Currently, we deregister the table or MV before sending the barrier that stops the actors. The metadata update is then broadcast to the compute nodes (CN), which causes an inconsistency before the actors are dropped.

And the solution:

Correct. I think we can send and collect the barrier first, and deregister the table only after the stop-actor barrier is successfully collected or a recovery is triggered.
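
To make the ordering problem concrete, here is a toy model of the two drop sequences described above. It is only a sketch: `ComputeNode`, `drop_deregister_first`, and `drop_barrier_first` are hypothetical names invented for illustration and are not RisingWave APIs. It assumes that barriers already in flight (as happens under high barrier latency) reach the actors before the stop barrier does.

```python
class ComputeNode:
    """Toy compute node (hypothetical): tracks registered tables and live actors."""

    def __init__(self):
        self.registered_tables = set()
        self.actors = {}  # actor_id -> table_id the actor reads from

    def on_barrier(self, stop_actors=()):
        """Handle one barrier: drop stopped actors, then let the rest read their table."""
        for actor_id in stop_actors:
            self.actors.pop(actor_id, None)
        errors = []
        for actor_id, table_id in self.actors.items():
            if table_id not in self.registered_tables:
                errors.append(f"actor {actor_id}: table {table_id} not found")
        return errors


def make_node():
    """Build a node with one table and one actor reading from it."""
    cn = ComputeNode()
    cn.registered_tables.add("sink_table")
    cn.actors[1] = "sink_table"
    return cn


def drop_deregister_first(cn, table_id, actor_ids, in_flight_barriers):
    """Problematic order: metadata is removed while the actors are still running."""
    cn.registered_tables.discard(table_id)
    errors = []
    for _ in range(in_flight_barriers):  # barriers queued ahead of the stop barrier
        errors += cn.on_barrier()
    errors += cn.on_barrier(stop_actors=actor_ids)
    return errors


def drop_barrier_first(cn, table_id, actor_ids, in_flight_barriers):
    """Proposed order: collect the stop barrier, then deregister the table."""
    errors = []
    for _ in range(in_flight_barriers):
        errors += cn.on_barrier()
    errors += cn.on_barrier(stop_actors=actor_ids)
    cn.registered_tables.discard(table_id)  # safe: no actor references it now
    return errors


if __name__ == "__main__":
    print(drop_deregister_first(make_node(), "sink_table", [1], in_flight_barriers=2))
    print(drop_barrier_first(make_node(), "sink_table", [1], in_flight_barriers=2))
```

In this model, deregistering first only fails when at least one barrier is still in flight, which is consistent with the issue showing up under high barrier latency.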

kwannoel commented Feb 8, 2024

It seems this can happen when barrier latency is high and we drop a sink.

It occurs in the arrangement backfill recovery test, when we create a sink and then drop it while barrier latency is high.

See e2e_test/streaming/rate_limit/snapshot_amplification.slt

@kwannoel

This is an edge case that can happen when dropping a sink under high barrier latency. We can fix it in the next release.
