https://issues.redhat.com/browse/ACM-15319 CIM regression KI #7296

Merged
merged 4 commits into from
Dec 12, 2024

oafischer
Contributor

2.12 only

@oafischer oafischer requested a review from trewest December 2, 2024 15:15
=== Nodes shut down after removing `BareMetalHost` resource
//2.12:ACM-15319

If you remove the `BareMetalHost` resource from a managed cluster, the nodes shut down.
Contributor

This should be the BMH in the hub cluster, right?

cc @trewest

Contributor

That is correct. This issue occurs when you remove the relevant spoke BMH from the hub cluster.

Contributor Author

Thanks for catching that, updated!

Contributor

@swopebe swopebe Dec 10, 2024

@carbonin @trewest @oafischer -- for a simple known issue like this, we still need to ask ourselves what the user needs to know and when so that they don't have to stop and think: What did I miss?

  1. There is no information about how to get out of this.
  2. If there is no way out, we really need to say at some point in the doc:

Do not remove the BareMetalHost resource.

If you already say that in the main documentation and they are still doing it, we need to make it clearer there and not here in the known issues.

  3. Let's say only a few users would do this and get caught up, but most are not impacted, and that is why we choose to document it here. I think we still need to say something more:

You must reinstall the resource to get your nodes to run....
You can manually restart the nodes by....

Something to tell them how to get out of it or what the next step is.

Contributor

This is a bug, so ideally the user could remove that BMH without the node shutting down. For that reason, I don't think it makes sense for us to tell them not to remove the BMH in the main doc.

The user shouldn't need to reinstall anything to get the node back; just powering it back on will do the job. However, we don't know anything about how they do power management outside of the BMO integration. If you think it's valuable to say "power the node back on", then I think it's fine to put that here.

Contributor Author

Thanks for the detailed explanation @carbonin. I'll add the part about powering the node back on. Let me know if that's sufficient @swopebe.

Contributor

The documentation should also prevent a bug report and a customer call. We also don't want the user to have to stop and think about what they did wrong or how to start over. @oafischer let me know what you decide to add here.

Contributor Author

As discussed in the meeting, I'll add the recovery step of powering the node back on, so that we don't have to remember to remove any notes in the main doc once this issue is fixed.

@oafischer oafischer requested a review from carbonin December 2, 2024 17:17
@openshift-ci openshift-ci bot removed the lgtm label Dec 12, 2024

openshift-ci bot commented Dec 12, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: carbonin, oafischer, swopebe

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot commented Dec 12, 2024

New changes are detected. LGTM label has been removed.

@oafischer oafischer merged commit 7ce2b67 into 2.12_stage Dec 12, 2024
1 of 2 checks passed
@oafischer oafischer deleted the of-15319 branch December 12, 2024 16:28

4 participants