https://issues.redhat.com/browse/ACM-15319 CIM regression KI #7296

Merged 4 commits on Dec 12, 2024
Changes from 1 commit
6 changes: 6 additions & 0 deletions clusters/release_notes/mce_known_issues.adoc
@@ -453,3 +453,9 @@ If the `managed-serviceaccount` add-on is required, you can work around the issu
. Update the `managed-serviceaccount` `ManagedClusterAddon` in the local cluster namespace to use the `addonDeploymentConfig` custom resource you created.

See link:../../add-ons/configure_nodeselector_tolerations_addons.adoc#configure-nodeselector-tolerations-addons[Configuring nodeSelectors and tolerations for klusterlet add-ons] to learn more about how to use the `addonDeploymentConfig` custom resource to configure `tolerations` and `nodeSelector` for add-ons.

[#nodes-shut-down-bmh]
=== Nodes shut down after removing `BareMetalHost` resource
//2.12:ACM-15319

If you remove the `BareMetalHost` resource from a managed cluster, the nodes shut down.
Contributor

This should be the BMH in the hub cluster, right?

cc @trewest

Contributor

That is correct. This issue occurs when you remove the relevant spoke BMH from the hub cluster.

Contributor Author

Thanks for catching that, updated!

Contributor

@swopebe, Dec 10, 2024

@carbonin @trewest @oafischer -- for a simple known issue like this, we still need to ask ourselves what the user needs to know and when so that they don't have to stop and think: What did I miss?

  1. There is no information about how to get out of this.
  2. If there is no way out, we really need to say at some point in the doc:

Do not remove the BareMetalHost resource.

If you already say that in the main documentation and they are still doing it, we need to make it clearer there and not here in the known issues.

  3. Let's say only a few users would do this and get caught up, but most are not impacted, and that is why we choose to document it here. I think we still need to say something more:

You must reinstall the resource to get your nodes to run....
You can manually restart the nodes by....

Something to tell them how to get out of it or what the next step is.

Contributor

This is a bug, so ideally we would allow the user to remove that BMH without the node shutting down. I don't think it makes sense for us to tell them not to remove the BMH in the main doc.

The user shouldn't need to reinstall anything to get the node back; powering it back on will do the job. However, we don't know anything about how they do power management outside of the BMO integration. If you think it's valuable to say "power the node back on", then I think it's fine to put that here.

Contributor Author

Thanks for the detailed explanation @carbonin. I'll add the part about powering the node back on. Let me know if that's sufficient @swopebe.

Contributor

The documentation should also prevent a bug; prevent a customer call. We also don't want to make the user have to stop and think what they did wrong or how they start over. @oafischer let me know what you decide to add here.

Contributor Author

As discussed in the meeting, I'll add the recovery step of powering the node back on, so that we don't have to remember to remove any notes in the main doc once this issue is fixed.