
Cluster: Remove all the lease from cluster #3886

Merged · 2 commits · Nov 7, 2023

Conversation

praveenkumar
Member

A lease is a mechanism to lock shared resources in Kubernetes. Since we start a stopped cluster, it is better to remove all the leases so that a lock does not hang around for a long time: an old lease only expires and gets renewed once the respective resource asks for it.

Also, during 4.14 testing we found out that if we only remove the lease from the openshift-machine-config-operator namespace, then having the pull-secret on the disk takes longer than usual. I am still not sure where it gets stuck, but https://issues.redhat.com/browse/OCPBUGS-7583 is closed as expected behaviour, and until 4.13 we used the workaround. From 4.14 that workaround doesn't work because there is no configmap, and just removing the leases of this namespace is not helpful, so I am removing all the leases from the older cluster, which at least performs better.

fixes: #3883
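For illustration only (generic client-go usage, not crc code), the "stale lock" situation described above can be observed by comparing a lease's renewTime plus leaseDurationSeconds against the current time; all names below are illustrative:

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// printStaleLeases lists leases in all namespaces and reports the ones whose
// holder has not renewed them within their lease duration.
func printStaleLeases(ctx context.Context, clientset kubernetes.Interface) error {
	leases, err := clientset.CoordinationV1().Leases("").List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, l := range leases.Items {
		if l.Spec.RenewTime == nil || l.Spec.LeaseDurationSeconds == nil {
			continue
		}
		expiry := l.Spec.RenewTime.Add(time.Duration(*l.Spec.LeaseDurationSeconds) * time.Second)
		if time.Now().After(expiry) {
			fmt.Printf("stale lease %s/%s (last renewed %s)\n", l.Namespace, l.Name, l.Spec.RenewTime)
		}
	}
	return nil
}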

…configmap

OCP 4.14 moves to using LeasesResourceLock instead of ConfigMapsLeasesResourceLock, so it no longer has the `machine-config-controller` configmap and `crc start` fails on those bundles. This PR makes sure that if the configmap is not present, no error is returned.

- openshift/machine-config-operator#3842
@praveenkumar
Member Author

/retest

@adrianriobo
Contributor

This PR passes for pre-release version 4.14.0-rc.7; results can be checked at https://crcqe-asia.s3.amazonaws.com/nightly/ocp/4.14.0-rc.7/qe-results/e2e-non-ux.xml

-	if _, _, err := ocConfig.RunOcCommand("delete", "-n", "openshift-machine-config-operator", "configmap", "machine-config-controller"); err != nil {
-		return err
+	if _, stderr, err := ocConfig.RunOcCommand("delete", "-n", "openshift-machine-config-operator", "configmap", "machine-config-controller"); err != nil {
Contributor

Does oc differentiate between "not found" and a "network error" (for example) through the error code it returns?

Member Author

It is the same error code

 $ oc get cm foo
Error from server (NotFound): configmaps "foo" not found
 $ echo $?
1

$ oc get ns
Unable to connect to the server: dial tcp: lookup api.cr.testing: Temporary failure in name resolution
$ echo $?
1
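
Since the exit code alone cannot distinguish the two failures, the caller has to look at the error text. A minimal sketch of how the captured stderr could be used (not necessarily the exact code in this PR; the "not found" substring match and the crc import path are assumptions):

import (
	"strings"

	"github.com/crc-org/crc/pkg/crc/oc" // assumed import path for the oc.Config helper
)

// deleteMCCConfigMapIfPresent deletes the machine-config-controller configmap
// and treats a "not found" response from oc as success, so that 4.14 bundles
// (which no longer ship this configmap) do not fail `crc start`.
func deleteMCCConfigMapIfPresent(ocConfig oc.Config) error {
	_, stderr, err := ocConfig.RunOcCommand("delete", "-n", "openshift-machine-config-operator", "configmap", "machine-config-controller")
	if err != nil && !strings.Contains(stderr, "not found") {
		// Network errors and any other failures are still propagated.
		return err
	}
	return nil
}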

@@ -507,6 +507,6 @@ func DeleteMCOLeaderLease(ctx context.Context, ocConfig oc.Config) error {
 	if err := WaitForOpenshiftResource(ctx, ocConfig, "lease"); err != nil {
 		return err
 	}
-	_, _, err := ocConfig.RunOcCommand("delete", "-n", "openshift-machine-config-operator", "lease", "--all")
+	_, _, err := ocConfig.RunOcCommand("delete", "-A", "lease", "--all")
Contributor

> From 4.14 that workaround doesn't work because there is no configmap, and just removing the leases of this namespace is not helpful, so I am removing all the leases from the older cluster, which at least performs better.

Were you able to find out which extra lease(s) have to be removed?
How many leases does -A remove in addition to the one(s) we want to remove for this specific issue?

Member Author

It removes all the leases in all the namespaces. I didn't count, but lease creation is not time-consuming.

Contributor

But we can't know that removing all the leases is 100% harmless now, and/or won't cause issues in the future, can we? Shouldn't we be more specific in what we remove?

Member Author

> But we can't know that removing all the leases is 100% harmless now, and/or won't cause issues in the future, can we?

Kubernetes lease objects lock shared resources and coordinate activity between members of a set (https://kubernetes.io/docs/concepts/architecture/leases/). I am hoping the lease objects are recreated as soon as they are deleted, due to cluster reconciliation.
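
For context (generic client-go leader election, not crc code), components that hold these leases typically acquire them through client-go, which recreates the Lease object on the next acquisition attempt if it has been deleted; the identity and timings below are illustrative:

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runWithLease runs the given function while holding a Lease-based leader
// election lock; client-go creates the Lease if it does not exist.
func runWithLease(ctx context.Context, clientset kubernetes.Interface, run func(context.Context)) {
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "machine-config-controller",
			Namespace: "openshift-machine-config-operator",
		},
		Client:     clientset.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: "example-holder"},
	}
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   90 * time.Second,
		RenewDeadline:   60 * time.Second,
		RetryPeriod:     30 * time.Second,
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: run,
			OnStoppedLeading: func() {},
		},
	})
}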

> Shouldn't we be more specific in what we remove?

I tried to remove the MCO-specific leases, but that didn't work at all.

Contributor

> I am hoping the lease objects are recreated as soon as they are deleted, due to cluster reconciliation.

Sounds like we don't know for sure that this won't cause any problems?

> I tried to remove the MCO-specific leases, but that didn't work at all.

Since removing all the leases helps, there must be a subset of leases which fixes the problem when you remove them?

Contributor

What are the leases which are being removed by the -A call?

Member Author

$ oc get lease -A
NAMESPACE                                    NAME                                             HOLDER                                                                                        AGE
kube-node-lease                              crc-rb68f-master-0                               crc-rb68f-master-0                                                                            19d
kube-system                                  cluster-policy-controller-lock                   crc-rb68f-bootstrap_aee6cf0c-afce-497f-b60b-097560769571                                      19d
kube-system                                  kube-apiserver-sgkskaacrm2qjpzjkwtoelunve        kube-apiserver-sgkskaacrm2qjpzjkwtoelunve_813cc02d-b985-42ac-b65d-c027ddb2efdb                19d
kube-system                                  kube-controller-manager                          crc-rb68f-master-0_24eae1a8-50c9-4456-a6b5-2ce4bf610dfa                                       19d
kube-system                                  kube-scheduler                                                                                                                                 19d
openshift-apiserver-operator                 openshift-apiserver-operator-lock                openshift-apiserver-operator-7949f48785-2gxlp_74b9fa94-33c1-481b-bd15-f159ceb3a089            19d
openshift-authentication-operator            cluster-authentication-operator-lock             authentication-operator-66f4f5cd8b-bjstd_793364e7-b299-4846-b448-7178bb13c6d0                 19d
openshift-cloud-credential-operator          cloud-credential-operator-leader                                                                                                               19d
openshift-cluster-machine-approver           cluster-machine-approver-leader                  crc-rb68f-master-0_2d6d2116-bf57-44bf-934d-2ea04eb5bb3c                                       19d
openshift-cluster-version                    version                                          crc-rb68f-master-0_f1182ed3-6505-41e3-8ecf-9229e2dbe51a                                       19d
openshift-config-operator                    config-operator-lock                             openshift-config-operator-6f8b6bc78b-45z2s_94bbb9d8-8355-4237-b1a9-6196a4f6e5e4               19d
openshift-console-operator                   console-operator-lock                            console-operator-97895dfb9-m8bp6_58eea847-1ec0-4458-8374-53567246672c                         19d
openshift-controller-manager-operator        openshift-controller-manager-operator-lock       openshift-controller-manager-operator-5565fcdcfc-4dbjr_1e2bbadb-fa00-470f-9ba5-388e25c0e901   19d
openshift-controller-manager                 openshift-master-controllers                     controller-manager-fb96b945f-pfllb                                                            19d
openshift-etcd-operator                      openshift-cluster-etcd-operator-lock             etcd-operator-695544bd94-tjlf5_06c44f0b-fdd7-409e-a3aa-013630c95e19                           19d
openshift-image-registry                     openshift-master-controllers                     cluster-image-registry-operator-575b79c6cc-nbn6t_f3285285-536d-43a1-b86a-44eab35a91de         19d
openshift-kube-apiserver-operator            kube-apiserver-operator-lock                     kube-apiserver-operator-7d77ccf95c-jhxxq_7e7dd169-ea1f-406f-a7fe-666fd717bb02                 19d
openshift-kube-apiserver                     cert-regeneration-controller-lock                crc-rb68f-master-0_327ff250-5f29-4aaa-b3e9-c2d972b1cb34                                       19d
openshift-kube-controller-manager-operator   kube-controller-manager-operator-lock            kube-controller-manager-operator-758d9d7f7b-6jxpb_93499777-27bd-4183-a05c-5d5ae67fd4b2        19d
openshift-kube-controller-manager            cert-recovery-controller-lock                    crc-rb68f-master-0_571b71aa-da5b-4741-a833-cb21cd8cd144                                       19d
openshift-kube-controller-manager            cluster-policy-controller-lock                   crc-rb68f-master-0_03d7d0bb-d5b8-441d-9d34-80c9ea3c261c                                       19d
openshift-kube-scheduler-operator            openshift-cluster-kube-scheduler-operator-lock   openshift-kube-scheduler-operator-7c88d54f4f-kwrj7_61c2a2df-89d7-4409-b75b-ac913477acd8       19d
openshift-kube-scheduler                     cert-recovery-controller-lock                    crc-rb68f-master-0_a626382f-6e44-4b18-bd5d-05d949a4e015                                       19d
openshift-kube-scheduler                     kube-scheduler                                   crc-rb68f-master-0_7ecee300-2653-40b1-83f7-ae139a1ad70e                                       19d
openshift-machine-api                        cluster-api-provider-healthcheck-leader          machine-api-controllers-7dcc8d4786-66fvp_b5ec0e33-7b19-4e4c-b401-799c0ea492dc                 19d
openshift-machine-api                        cluster-api-provider-libvirt-leader              machine-api-controllers-7dcc8d4786-66fvp_ff60f420-d747-41ba-96c0-bee25776da99                 19d
openshift-machine-api                        cluster-api-provider-machineset-leader           machine-api-controllers-7dcc8d4786-66fvp_1bf1da6e-e0f1-4e84-b075-18ff0c76e208                 19d
openshift-machine-api                        cluster-api-provider-nodelink-leader             machine-api-controllers-7dcc8d4786-66fvp_6789b0bd-6268-4d3d-884a-2003a86b3d81                 19d
openshift-machine-api                        control-plane-machine-set-leader                 control-plane-machine-set-operator-7f8bdc8669-z9qgz_ac722c46-b860-4c43-b125-6c0df7b182e4      19d
openshift-machine-api                        machine-api-operator                             machine-api-operator-65ffbd6cbb-7f7gg_d75be0e5-4384-426f-96db-038d80c075a6                    19d
openshift-machine-config-operator            machine-config                                   machine-config-operator-5d7dd48bcf-7kwl5_a3637b9d-af16-4f4c-a0ab-eb7e8665d4d5                 5h14m
openshift-machine-config-operator            machine-config-controller                        machine-config-controller-6578468c58-n2zn4_9507711e-faec-41d7-8464-feeb71a339a5               5h14m
openshift-marketplace                        marketplace-operator-lock                        marketplace-operator-687c6f9688-v8jf7                                                         19d
openshift-network-operator                   network-operator-lock                            crc-rb68f-master-0_8ccd6c8e-25aa-4adc-8df1-c003547fc02c                                       19d
openshift-operator-lifecycle-manager         packageserver-controller-lock                    package-server-manager-7b475647f5-zvcfd_7d4ef4dc-5904-49b7-b847-8f479963bbff                  19d
openshift-route-controller-manager           openshift-route-controllers                      route-controller-manager-94444cf76-45zl2                                                      19d
openshift-sdn                                openshift-network-controller                     crc-rb68f-master-0                                                                            19d
openshift-service-ca-operator                service-ca-operator-lock                         service-ca-operator-78d8578bd9-xvtf7_81dad111-c2ed-4372-bc93-cab3d5225daf                     19d
openshift-service-ca                         service-ca-controller-lock                       service-ca-865665db86-glkb8_f103c0be-1e77-4d68-a693-f6749d52ef72                              19d

Contributor

> I think the best option is to take this as it is and create a follow-up issue to track it and work on it when we have some time?

We don't know for sure if it is harmless, and we are not fully sure what fixes/avoids the problem we are seeing. I don't feel really confident that this won't cause unrelated issues, which would cost more time to debug/fix than spending a bit more time now to remove only the necessary leases.

Member Author

> We don't know for sure if it is harmless, and we are not fully sure what fixes/avoids the problem we are seeing. I don't feel really confident that this won't cause unrelated issues, which would cost more time to debug/fix than spending a bit more time now to remove only the necessary leases.

Yes, we are not sure, but we also don't know where to start debugging, and if we want to go with OCP 4.14.x in this release then I don't see any alternative for the time being. The options are either to allocate more time to this and have another conversation with the internal team (if they are willing to help us debug), or to create a separate issue to track it and put effort into it when we have time, without blocking the current release. @crc-org/crc-team wdyt?

Contributor

It should at least be doable, without impacting our release schedule too much, to not use oc delete -A and instead remove only a minimal set of leases.
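
A hypothetical shape for that follow-up (the namespace list would have to be worked out and is illustrative here, not a verified minimal set; ocConfig is crc's existing oc.Config helper):

// deleteLeasesInNamespaces removes leases only from a known set of namespaces
// instead of using `oc delete -A lease --all`.
func deleteLeasesInNamespaces(ocConfig oc.Config, namespaces []string) error {
	for _, ns := range namespaces {
		if _, _, err := ocConfig.RunOcCommand("delete", "-n", ns, "lease", "--all"); err != nil {
			return err
		}
	}
	return nil
}

// Example (illustrative namespaces only):
// deleteLeasesInNamespaces(ocConfig, []string{"openshift-machine-config-operator", "openshift-kube-apiserver"})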

@gbraad
Contributor

gbraad commented Nov 2, 2023

I think we have to release with this and accept the possible risk of it not being completely harmless.

Not taking this change in means facing a long startup/reconcile penalty and possible timeouts; taking it means a risk of an issue... but no timeouts and a reconcile within a few seconds? I am for the last option... and we can investigate further based on the comments.


openshift-ci bot commented Nov 7, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gbraad

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot added the approved label Nov 7, 2023
@anjannath merged commit 9354dd4 into crc-org:main Nov 7, 2023
13 checks passed

Successfully merging this pull request may close these issues.

[release-4.14] crc start fails when using an ocp pre-release version 4.14.0-rc.X