Better handling of Lustre volumes that go to an unrecoverable Failed status #239
Comments
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its staleness rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/remove-lifecycle stale |
/lifecycle stale |
/lifecycle rotten |
/remove-lifecycle rotten |
/lifecycle stale |
/remove-lifecycle stale |
@jacobwolfaws I haven't seen this happening for a long while, though I'm not sure whether that is due to changes on the AWS side or I simply haven't run into capacity issues lately. Feel free to close it, since it also seems that nobody else is reporting it. |
@kanor1306 thanks for updating this! I'm going to leave this thread open, because I do agree we need to improve our capacity-shortage messaging. |
/lifecycle frozen |
Is your feature request related to a problem?/Why is this needed
From time to time I get Lustre volumes in "Failed" state, with an error message that I assume points to a lack of capacity on the AWS side (as I am not reaching my quota limits).
The issue is with the PVC: it stays in the Pending state forever, while the volume it represents is in a state that is not recoverable. This is visible both in the PVC description and in the file system's status in the AWS console. The PVC keeps looping, waiting for the volume to be created, while the FSx volume has simply failed, and that will not change.
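For reference, the only place this terminal state is visible today is on the AWS side. Below is a minimal sketch of that check, assuming boto3 with configured credentials; the region and file system ID are placeholders, not values from this issue:

```python
# Sketch: check whether an FSx for Lustre file system has reached FAILED,
# assuming boto3 is configured with credentials for the right account.
import boto3

fsx = boto3.client("fsx", region_name="eu-west-1")  # region is an assumption

# The file system ID below is a placeholder; in practice you would take it
# from the bound PV's volumeHandle, or list all file systems.
resp = fsx.describe_file_systems(FileSystemIds=["fs-0123456789abcdef0"])

for fs in resp["FileSystems"]:
    lifecycle = fs["Lifecycle"]  # e.g. CREATING, AVAILABLE, FAILED, DELETING
    if lifecycle == "FAILED":
        # FailureDetails carries AWS's reason, e.g. a capacity problem.
        reason = fs.get("FailureDetails", {}).get("Message", "unknown")
        print(f"{fs['FileSystemId']} is FAILED: {reason}")
```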
/feature
Describe the solution you'd like in detail
My proposed solution does not solve the issue directly, but it at least allows you to manage the problem yourself. I would like the PVC to move to a different state that makes it clear it is in an unrecoverable situation. That way you could handle the "Failed" situation from within your own software just by checking the status of the PVC, without using the AWS API.
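Until something like that exists, a rough approximation from your own software is to watch the PVC's Warning events rather than its phase (which stays Pending). A sketch using the official Kubernetes Python client; the namespace, PVC name, and the "ProvisioningFailed" reason string are assumptions rather than anything confirmed in this issue:

```python
# Sketch: detect a PVC whose provisioning keeps failing by inspecting its
# Warning events, since .status.phase simply stays "Pending" in this case.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

namespace, pvc_name = "default", "my-lustre-pvc"  # hypothetical names

events = v1.list_namespaced_event(
    namespace,
    field_selector=(
        "involvedObject.kind=PersistentVolumeClaim,"
        f"involvedObject.name={pvc_name}"
    ),
)

for ev in events.items:
    # "ProvisioningFailed" is the reason usually emitted by the CSI
    # external-provisioner; repeated occurrences suggest a stuck claim.
    if ev.type == "Warning" and ev.reason == "ProvisioningFailed":
        print(f"{pvc_name}: {ev.message}")
```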
Describe alternatives you've considered
Additional context
Another effect of the current situation is that when the PVC is removed, the Lustre volume is left behind in AWS, so you need to clean it up manually.
Also note that if you remove the Failed volume, the driver will create a new one, and the PVC becomes healthy again (provided the new volume does not also go to the Failed state).
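For illustration, both the manual cleanup and the "delete so the driver retries" step come down to the same AWS call. A hedged sketch, assuming boto3 and that you have verified the file systems really are the failed or orphaned ones:

```python
# Sketch: find FSx for Lustre file systems stuck in FAILED and delete them so
# the driver can provision a fresh one (or to clean up volumes orphaned by a
# deleted PVC). Deletion is destructive -- double-check the IDs first.
# NextToken pagination of describe_file_systems is omitted for brevity.
import boto3

fsx = boto3.client("fsx", region_name="eu-west-1")  # region is an assumption

resp = fsx.describe_file_systems()
for fs in resp["FileSystems"]:
    if fs["FileSystemType"] == "LUSTRE" and fs["Lifecycle"] == "FAILED":
        print(f"Deleting failed file system {fs['FileSystemId']}")
        fsx.delete_file_system(FileSystemId=fs["FileSystemId"])
```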
Edit: added the Additional context section.