Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: stuck in recovery simulation test #15296

Closed
hzxa21 opened this issue Feb 27, 2024 · 9 comments · Fixed by #15908
Closed

bug: stuck in recovery simulation test #15296

hzxa21 opened this issue Feb 27, 2024 · 9 comments · Fixed by #15908
Assignees
Labels
type/bug Something isn't working
Milestone

Comments

@hzxa21
Copy link
Collaborator

hzxa21 commented Feb 27, 2024

Describe the bug

In the CI run of #15262 (cherry-pick #15260), the recovery simulation test is stuck when running batch e2e with MADSIM_TEST_SEED=5.

Because #15260 can pass all CI tests and can be successfully while the release-1.7 cherry-pick PR #15262 is stuck in recovery simulation test, after a discussion with @wangrunji0408 , we believe that the PR triggers a bug that is not related to the PR changes themselves.

Error message/log

Logs can be found in artifacts in https://buildkite.com/risingwavelabs/pull-request/builds/42771#018de4d6-1fcb-4023-a97e-63007ab1720c

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

git commit: 14531cd

Additional context

No response

@hzxa21 hzxa21 added the type/bug Something isn't working label Feb 27, 2024
@github-actions github-actions bot added this to the release-1.7 milestone Feb 27, 2024
@hzxa21
Copy link
Collaborator Author

hzxa21 commented Feb 27, 2024

Note that re-run doesn't help and the recovery simulation test is always stuck when running with MADSIM_TEST_SEED=5.

@kwannoel
Copy link
Contributor

kwannoel commented Feb 27, 2024

Previously fixed issue: #14104.

The fix should be already in 1.7. But this bug could be due to similar reasons.

@hzxa21
Copy link
Collaborator Author

hzxa21 commented Feb 27, 2024

Previously fixed issue: #14104.

The fix should be already in 1.7. But this bug could be due to similar reasons.

The log indeed stops, indicating that the cluster is stuck or very slow. I cannot reproduce it locally though.

@hzxa21
Copy link
Collaborator Author

hzxa21 commented Feb 27, 2024

@kwannoel
Copy link
Contributor

kwannoel commented Mar 7, 2024

It's a deadlock somewhere in DROP again. Here all all the tail logs:

2022-04-05T06:39:21.652236Z DEBUG risingwave_simulation::slt: Running cmd=Drop
2022-04-05T06:39:21.652236Z DEBUG risingwave_simulation::slt: retry count iteration=0
2022-04-05T06:39:21.726798Z  INFO risingwave_simulation::cluster: kill compute-2
2022-04-05T06:39:21.726801Z  INFO risingwave_meta::barrier::recovery: recovery start!
2022-04-05T06:39:21.726801Z DEBUG risingwave_meta::manager::catalog: creating_tables ids: []
2022-04-05T06:39:21.734000Z DEBUG risingwave_meta::barrier::recovery: clean dirty table fragments: {}
2022-04-05T06:39:21.748186Z  INFO risingwave_meta::barrier::recovery: recovering mview progress
2022-04-05T06:39:21.762740Z  INFO risingwave_meta::barrier::recovery: recovered mview progress
2022-04-05T06:39:21.892556Z  INFO risingwave_simulation::cluster: kill compute-1
2022-04-05T06:39:22.043266Z  INFO risingwave_simulation::cluster: kill meta-1
2022-04-05T06:39:22.088142Z  INFO risingwave_simulation::cluster: kill compute-3
2022-04-05T06:39:22.103008Z  INFO risingwave_simulation::cluster: restart compute-3
2022-04-05T06:39:22.120371Z  INFO risingwave_simulation::cluster: restart meta-1
2022-04-05T06:39:22.185730Z  INFO risingwave_simulation::cluster: kill compactor-1
2022-04-05T06:39:22.299107Z  INFO risingwave_simulation::cluster: kill compactor-2
2022-04-05T06:39:22.384536Z  INFO risingwave_simulation::cluster: restart compute-1
2022-04-05T06:39:22.452952Z  INFO risingwave_simulation::cluster: restart compactor-2
2022-04-05T06:39:23.051853Z  INFO risingwave_simulation::cluster: restart compactor-1
2022-04-05T06:39:42.109102Z  INFO risingwave_simulation::cluster: restart compute-2

I gave it 15 mins more to execute here: https://github.com/risingwavelabs/risingwave/pull/15512/files.

Notice the DROP command:

2022-04-05T06:39:21.652236Z DEBUG risingwave_simulation::slt: Running cmd=Drop

Something is blocking it from completion.

@kwannoel
Copy link
Contributor

kwannoel commented Mar 7, 2024

Another issue that could be related: #14910

@kwannoel
Copy link
Contributor

kwannoel commented Mar 26, 2024

Screenshot 2024-03-26 at 11 26 02 AM

Root cause is deadlock between barrier recovery and drop stream job.

@hzxa21
Copy link
Collaborator Author

hzxa21 commented Mar 26, 2024

Screenshot 2024-03-26 at 11 26 02 AM

Root cause is deadlock between barrier recovery and drop stream job.

Great finding! Just curious: how do you debug the issue? Did you use any deadlock detection tool?

@kwannoel
Copy link
Contributor

kwannoel commented Mar 26, 2024

Great finding! Just curious: how do you debug the issue? Did you use any deadlock detection tool?

Nope I just logged it, since I suspect deadlock due to drop 🤣 Any tools to share, perhaps we should add some to our toolbox.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants