Cluster: Remove all the leases from cluster #3886
Conversation
…configmap
OCP 4.14 moves to use LeasesResourceLock instead of ConfigMapsLeasesResourceLock, so it doesn't have the `machine-config-controller` configmap and `crc start` fails on those bundles. This PR makes sure that if the configmap is not present, we don't report an error. - openshift/machine-config-operator#3842
A lease is a mechanism to lock shared resources in k8s, and since we start a stopped cluster it is better to remove all the leases, so that the locks do not stay around for a longer period of time in case an older lease has expired and is only renewed once the respective resource asks for it. Also, during 4.14 testing we found that if we only remove the leases from the `openshift-machine-config-operator` namespace, getting the pull-secret onto the disk takes longer than usual. I am still not sure where it gets stuck, but https://issues.redhat.com/browse/OCPBUGS-7583 is closed as expected behaviour, and until 4.13 we used the workaround. From 4.14 that workaround doesn't work because there is no configmap and removing just this namespace's leases doesn't help, so I am removing all the leases from the older cluster, which at least performs better.
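For illustration, a minimal sketch of how the lease cleanup described above could look, using the `RunOcCommand` and `WaitForOpenshiftResource` helpers that appear in the diff further down; the exact surrounding code is an assumption, not the merged implementation:

```go
// Sketch only: mirrors the behaviour described above.
func DeleteMCOLeaderLease(ctx context.Context, ocConfig oc.Config) error {
	// Wait until the cluster serves the lease API resource.
	if err := WaitForOpenshiftResource(ctx, ocConfig, "lease"); err != nil {
		return err
	}
	// Remove leases in every namespace so stale locks from the stopped
	// cluster do not delay controllers when it is started again.
	_, _, err := ocConfig.RunOcCommand("delete", "-A", "lease", "--all")
	return err
}
```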
/retest
This PR passes for pre-release version
if _, _, err := ocConfig.RunOcCommand("delete", "-n", "openshift-machine-config-operator", "configmap", "machine-config-controller"); err != nil {
	return err
}

if _, stderr, err := ocConfig.RunOcCommand("delete", "-n", "openshift-machine-config-operator", "configmap", "machine-config-controller"); err != nil {
Does `oc` make a difference between "not found" and "network error" (for example) through the error code it returns?
It is the same error code
$ oc get cm foo
Error from server (NotFound): configmaps "foo" not found
$ echo $?
1
$ oc get ns
Unable to connect to the server: dial tcp: lookup api.cr.testing: Temporary failure in name resolution
$ echo $?
1
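Given identical exit codes, one way to tolerate only the "not found" case is to inspect what `oc` printed on stderr. A fragment-level sketch (the `NotFound` substring check and the `strings` import are assumptions, not the merged code):

```go
if _, stderr, err := ocConfig.RunOcCommand("delete",
	"-n", "openshift-machine-config-operator",
	"configmap", "machine-config-controller"); err != nil {
	// The exit status alone cannot distinguish "not found" from e.g. a
	// network error, so look at the stderr output instead.
	if !strings.Contains(stderr, "NotFound") {
		return err
	}
	// 4.14 bundles no longer ship this configmap; a missing object is fine.
}
```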
@@ -507,6 +507,6 @@ func DeleteMCOLeaderLease(ctx context.Context, ocConfig oc.Config) error {
 	if err := WaitForOpenshiftResource(ctx, ocConfig, "lease"); err != nil {
 		return err
 	}
-	_, _, err := ocConfig.RunOcCommand("delete", "-n", "openshift-machine-config-operator", "lease", "--all")
+	_, _, err := ocConfig.RunOcCommand("delete", "-A", "lease", "--all")
From 4.14 that workaround doesn't work because there is no configmap and removing just this namespace's leases doesn't help, so I am removing all the leases from the older cluster, which at least performs better.
Were you able to find out what extra lease(s) have to be removed? How many leases does `-A` remove in addition to the one(s) we want to remove for this specific issue?
It removes all the leases in all the namespaces. I didn't count, but lease creation is not time-consuming.
But we can't know that removing all the leases is 100% harmless now, and/or won't cause issues in the future, can we? Shouldn't we be more specific in what we remove?
But we can't know that removing all the leases is 100% harmless now, and/or won't cause issues in the future, can we?
Kubernetes lease objects lock shared resources and coordinate activity between members of a set (https://kubernetes.io/docs/concepts/architecture/leases/). I am hoping the lease objects will be recreated as soon as they are deleted, due to cluster reconcile.
Shouldn't we be more specific in what we remove?
I tried to remove the MCO-specific leases but that didn't work at all.
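For background on why deleted leases come back: controllers typically hold them through client-go leader election, and whichever replica wins the next acquisition recreates the Lease object. A generic, self-contained sketch (names, namespace, and durations are illustrative only, not taken from any OpenShift component):

```go
package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The Lease object is (re)created by whichever replica wins the election,
	// so deleting it only forces a fresh acquisition on the next retry.
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "example-operator-lock", Namespace: "example-ns"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: "example-pod-identity"},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { /* controller work */ },
			OnStoppedLeading: func() { /* step down */ },
		},
	})
}
```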
I am hoping the lease objects will be recreated as soon as they are deleted, due to cluster reconcile.
Sounds like we don't know for sure this won't cause any problems?
I tried to remove the MCO-specific leases but that didn't work at all.
Since removing all the leases helps, there must be a set of leases which fixes the problem when you remove them?
What are the leases which are being removed by the `-A` call?
$ oc get lease -A
NAMESPACE NAME HOLDER AGE
kube-node-lease crc-rb68f-master-0 crc-rb68f-master-0 19d
kube-system cluster-policy-controller-lock crc-rb68f-bootstrap_aee6cf0c-afce-497f-b60b-097560769571 19d
kube-system kube-apiserver-sgkskaacrm2qjpzjkwtoelunve kube-apiserver-sgkskaacrm2qjpzjkwtoelunve_813cc02d-b985-42ac-b65d-c027ddb2efdb 19d
kube-system kube-controller-manager crc-rb68f-master-0_24eae1a8-50c9-4456-a6b5-2ce4bf610dfa 19d
kube-system kube-scheduler 19d
openshift-apiserver-operator openshift-apiserver-operator-lock openshift-apiserver-operator-7949f48785-2gxlp_74b9fa94-33c1-481b-bd15-f159ceb3a089 19d
openshift-authentication-operator cluster-authentication-operator-lock authentication-operator-66f4f5cd8b-bjstd_793364e7-b299-4846-b448-7178bb13c6d0 19d
openshift-cloud-credential-operator cloud-credential-operator-leader 19d
openshift-cluster-machine-approver cluster-machine-approver-leader crc-rb68f-master-0_2d6d2116-bf57-44bf-934d-2ea04eb5bb3c 19d
openshift-cluster-version version crc-rb68f-master-0_f1182ed3-6505-41e3-8ecf-9229e2dbe51a 19d
openshift-config-operator config-operator-lock openshift-config-operator-6f8b6bc78b-45z2s_94bbb9d8-8355-4237-b1a9-6196a4f6e5e4 19d
openshift-console-operator console-operator-lock console-operator-97895dfb9-m8bp6_58eea847-1ec0-4458-8374-53567246672c 19d
openshift-controller-manager-operator openshift-controller-manager-operator-lock openshift-controller-manager-operator-5565fcdcfc-4dbjr_1e2bbadb-fa00-470f-9ba5-388e25c0e901 19d
openshift-controller-manager openshift-master-controllers controller-manager-fb96b945f-pfllb 19d
openshift-etcd-operator openshift-cluster-etcd-operator-lock etcd-operator-695544bd94-tjlf5_06c44f0b-fdd7-409e-a3aa-013630c95e19 19d
openshift-image-registry openshift-master-controllers cluster-image-registry-operator-575b79c6cc-nbn6t_f3285285-536d-43a1-b86a-44eab35a91de 19d
openshift-kube-apiserver-operator kube-apiserver-operator-lock kube-apiserver-operator-7d77ccf95c-jhxxq_7e7dd169-ea1f-406f-a7fe-666fd717bb02 19d
openshift-kube-apiserver cert-regeneration-controller-lock crc-rb68f-master-0_327ff250-5f29-4aaa-b3e9-c2d972b1cb34 19d
openshift-kube-controller-manager-operator kube-controller-manager-operator-lock kube-controller-manager-operator-758d9d7f7b-6jxpb_93499777-27bd-4183-a05c-5d5ae67fd4b2 19d
openshift-kube-controller-manager cert-recovery-controller-lock crc-rb68f-master-0_571b71aa-da5b-4741-a833-cb21cd8cd144 19d
openshift-kube-controller-manager cluster-policy-controller-lock crc-rb68f-master-0_03d7d0bb-d5b8-441d-9d34-80c9ea3c261c 19d
openshift-kube-scheduler-operator openshift-cluster-kube-scheduler-operator-lock openshift-kube-scheduler-operator-7c88d54f4f-kwrj7_61c2a2df-89d7-4409-b75b-ac913477acd8 19d
openshift-kube-scheduler cert-recovery-controller-lock crc-rb68f-master-0_a626382f-6e44-4b18-bd5d-05d949a4e015 19d
openshift-kube-scheduler kube-scheduler crc-rb68f-master-0_7ecee300-2653-40b1-83f7-ae139a1ad70e 19d
openshift-machine-api cluster-api-provider-healthcheck-leader machine-api-controllers-7dcc8d4786-66fvp_b5ec0e33-7b19-4e4c-b401-799c0ea492dc 19d
openshift-machine-api cluster-api-provider-libvirt-leader machine-api-controllers-7dcc8d4786-66fvp_ff60f420-d747-41ba-96c0-bee25776da99 19d
openshift-machine-api cluster-api-provider-machineset-leader machine-api-controllers-7dcc8d4786-66fvp_1bf1da6e-e0f1-4e84-b075-18ff0c76e208 19d
openshift-machine-api cluster-api-provider-nodelink-leader machine-api-controllers-7dcc8d4786-66fvp_6789b0bd-6268-4d3d-884a-2003a86b3d81 19d
openshift-machine-api control-plane-machine-set-leader control-plane-machine-set-operator-7f8bdc8669-z9qgz_ac722c46-b860-4c43-b125-6c0df7b182e4 19d
openshift-machine-api machine-api-operator machine-api-operator-65ffbd6cbb-7f7gg_d75be0e5-4384-426f-96db-038d80c075a6 19d
openshift-machine-config-operator machine-config machine-config-operator-5d7dd48bcf-7kwl5_a3637b9d-af16-4f4c-a0ab-eb7e8665d4d5 5h14m
openshift-machine-config-operator machine-config-controller machine-config-controller-6578468c58-n2zn4_9507711e-faec-41d7-8464-feeb71a339a5 5h14m
openshift-marketplace marketplace-operator-lock marketplace-operator-687c6f9688-v8jf7 19d
openshift-network-operator network-operator-lock crc-rb68f-master-0_8ccd6c8e-25aa-4adc-8df1-c003547fc02c 19d
openshift-operator-lifecycle-manager packageserver-controller-lock package-server-manager-7b475647f5-zvcfd_7d4ef4dc-5904-49b7-b847-8f479963bbff 19d
openshift-route-controller-manager openshift-route-controllers route-controller-manager-94444cf76-45zl2 19d
openshift-sdn openshift-network-controller crc-rb68f-master-0 19d
openshift-service-ca-operator service-ca-operator-lock service-ca-operator-78d8578bd9-xvtf7_81dad111-c2ed-4372-bc93-cab3d5225daf 19d
openshift-service-ca service-ca-controller-lock service-ca-865665db86-glkb8_f103c0be-1e77-4d68-a693-f6749d52ef72 19d
I think the best option is to take this as it is and create a follow-up issue to track it, and work on it when we have some time?
We don't know for sure if it is harmless, and we are not fully sure what fixes/avoids the problem we are seeing. I don't feel really confident that this won't cause unrelated issues, which will cost more time to debug/fix than spending a bit more time now to at least only remove the necessary leases.
We don't know for sure if it is harmless, and we are not fully sure what fixes/avoids the problem we are seeing. I don't feel really confident that this won't cause unrelated issues, which will cost more time to debug/fix than spending a bit more time now to at least only remove the necessary leases.
Yes, we are not sure, but we also don't know where to start debugging, and if we want to go with OCP 4.14.x in this release then I don't see any alternative for the time being. The options are either to allocate more time to this and again have a conversation with the internal team (if they are willing to help us debug), or to have a separate issue to track it and put in the effort when we have time, without blocking the current release. @crc-org/crc-team wdyt?
It should at least be doable, without impacting our release schedule too much, to not use `oc delete -A` and to remove a minimal set of leases.
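One possible narrower variant (hypothetical; the minimal set of namespaces was not identified in this thread) would restrict the cleanup to an explicit namespace list instead of `-A`:

```go
// Hypothetical: only clear leases in namespaces suspected to matter for the
// slow pull-secret reconcile; this list is a guess and would need to be
// validated before it could replace the `-A` call.
namespaces := []string{
	"openshift-machine-config-operator",
	"openshift-kube-apiserver",
}
for _, ns := range namespaces {
	if _, _, err := ocConfig.RunOcCommand("delete", "-n", ns, "lease", "--all"); err != nil {
		return err
	}
}
```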
I think we have to either release with this and take the possible risk of it not being completely harmless, or not take this change in and face a long startup/reconcile penalty and possible timeouts. So: possible timeouts, or the risk of an issue... but weighing n timeouts against a reconcile that takes a few seconds? I am for the last option (take the change)... and investigating further based on the comments.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: gbraad The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
fixes: #3883