Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RHOAIENG-8390: feat(odh-nbc): search for imagestreams only in the namespace the controller runs in #375

Conversation

shalberd
Copy link

@shalberd shalberd commented Aug 2, 2024

https://issues.redhat.com/browse/RHOAIENG-8390

We would like to think whether we can improve the way how we search for the ImageStreams in case the cluster doesn't have an internal image registry enabled. Not all cluster deployments run with the standard name opendatahub or redhat-ods-applications

Description

Instead of hardcoding the central namespace to look for imagestreams, since all central components run in the main namespace where also by default the imagestreams are located:

  • determine the namespace that the controller runs in
  • use that namespace to continue with imagestream search

https://kubernetes.io/docs/tasks/run-application/access-api-from-pod/#directly-accessing-the-rest-api

"the default namespace to be used for namespaced API operations is placed in a file at /var/run/secrets/kubernetes.io/serviceaccount/namespace in each container."

How Has This Been Tested?

not tested yet. I propose testing with built PR image ...

  • quay.io/opendatahub/kubeflow-notebook-controller:pr-375
    • no changes went into this one
  • quay.io/opendatahub/odh-notebook-controller:pr-375

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

Copy link

openshift-ci bot commented Aug 2, 2024

Hi @shalberd. Thanks for your PR.

I'm waiting for a opendatahub-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@jiridanek
Copy link
Member

This does not even compile, but I think it's reasonable to give it ok to test now so the images build after issues are fixed.

@shalberd
Copy link
Author

shalberd commented Aug 2, 2024

This does not even compile, but I think it's reasonable to give it ok to test now so the images build after issues are fixed.

working on it, one reason is the collegue from Red Hat Zurich @bartoszmajsak also declared this in his networkpolicy and istio work at

https://github.com/shalberd/kubeflow/blob/v1.7-branch/components/odh-notebook-controller/controllers/notebook_network.go#L213

albeit suboptimal in my opinion.

Working on it, will fix these issues

@shalberd
Copy link
Author

shalberd commented Aug 2, 2024

@jiridanek not sure what do do about unit test failure with regards to this part of getControllerNamespace

    // Check if running inside a Kubernetes cluster
    if _, err := rest.InClusterConfig(); err != nil {
        return "", fmt.Errorf("not running inside a Kubernetes cluster")
    }

What I prefer not to do is have a fallback namespace like redhat-odh-applications ...
During runtime, controller always runs on Openshift / in K8s, does it not? I guess the unit tests do not.

@shalberd shalberd force-pushed the odh_notebook_ctrl_dynamic_single_namespace_for_imagestream_search branch from ff5c01a to 8065cdb Compare August 2, 2024 16:33
@shalberd
Copy link
Author

shalberd commented Aug 2, 2024

@shalberd shalberd changed the title RHOAIENG-8390 odh notebook controller imagestreams logic: namespace imagestreams from variable namespace controller runs in [WIP] RHOAIENG-8390 odh notebook controller imagestreams logic: namespace imagestreams from variable namespace controller runs in Aug 2, 2024
@jiridanek
Copy link
Member

Any idea what's up with the unittests? Also, why it's passing on github actions (https://github.com/opendatahub-io/kubeflow/actions/runs/10218934134/job/28276148238?pr=375#step:4:211) but failing on openshift-ci? I guess I can take a look too, if necessary.

I never properly understood what openshift-ci is doing, maybe it's time for me to change that.

@shalberd
Copy link
Author

shalberd commented Aug 26, 2024

@shalberd shalberd changed the title [WIP] RHOAIENG-8390 odh notebook controller imagestreams logic: namespace imagestreams from variable namespace controller runs in RHOAIENG-8390 odh notebook controller imagestreams logic: namespace imagestreams from variable namespace controller runs in Aug 26, 2024
@jiridanek
Copy link
Member

/cherrypick v1.9-branch

@openshift-cherrypick-robot

@jiridanek: once the present PR merges, I will cherry-pick it on top of v1.9-branch in a new PR and assign it to you.

In response to this:

/cherrypick v1.9-branch

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@jiridanek
Copy link
Member

/cherrypick stable

@openshift-cherrypick-robot

@jiridanek: once the present PR merges, I will cherry-pick it on top of stable in a new PR and assign it to you.

In response to this:

/cherrypick stable

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@shalberd
Copy link
Author

shalberd commented Oct 8, 2024

same here, shall I change this to v1.9 branch?

@jiridanek
Copy link
Member

same answer, main

@shalberd shalberd changed the base branch from v1.7-branch to main October 8, 2024 11:38
@shalberd
Copy link
Author

shalberd commented Oct 9, 2024

/retest

@shalberd
Copy link
Author

shalberd commented Oct 9, 2024

@harshad16
I have had the Openshift CI PROW unit and e2e tests fail before, then suceed again ... no idea, but thinking there is something periodically wrong with the underlying infrastructure.

@jiridanek
Copy link
Member

weird indeed; the tests should not be this flaky, also, they seem to be flaky only for you ;P

/test odh-notebook-controller-unit

see what happens now

@openshift-ci openshift-ci bot added the lgtm label Nov 26, 2024
@jiridanek
Copy link
Member

The e2e job seems to be failing in the same way here as it does on https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/opendatahub-io_kubeflow/469/pull-ci-opendatahub-io-kubeflow-main-odh-notebook-controller-e2e/1861352298442133504

image

So that's fine

@shalberd
Copy link
Author

/retest

@jstourac
Copy link
Member

/override ci/prow/odh-notebook-controller-e2e

The e2e tests won't pass as the 2 test failures are expected as discussed above.

Copy link

openshift-ci bot commented Nov 26, 2024

@jstourac: Overrode contexts on behalf of jstourac: ci/prow/odh-notebook-controller-e2e

In response to this:

/override ci/prow/odh-notebook-controller-e2e

The e2e tests won't pass as the 2 test failures are expected as discussed above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@jstourac
Copy link
Member

LGTM in general, but for the whole context - I still don't think I have enough knowledge about the whole codebase here 🙂

/lgtm

@jiridanek
Copy link
Member

LGTM in general, but for the whole context - I still don't think I have enough knowledge about the whole codebase here 🙂

It implements my suggestions from #375 (comment), so I have to be happy, right?

And one of the ideas of Go was that the code should be unsurprising and one should be able to reason about parts of the codebase in isolation https://go.dev/doc/faq#overloading Guess that's well and good, until you start building on top of Kubernetes ;p

@jiridanek
Copy link
Member

/override ci/prow/odh-notebook-controller-e2e

Copy link

openshift-ci bot commented Nov 26, 2024

@jiridanek: Overrode contexts on behalf of jiridanek: ci/prow/odh-notebook-controller-e2e

In response to this:

/override ci/prow/odh-notebook-controller-e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@atheo89
Copy link
Member

atheo89 commented Nov 27, 2024

Thank you Sven for you patience waiting a review! 🙂
I started to check this new enhancements on a cluster with rhoiai-2.16 operator. I will come back soon with updates

      devFlags:
        manifests:
          - contextDir: components/odh-notebook-controller/config
            uri: 'https://github.com/atheo89/kubeflow/tarball/review-375'

Folks, feel free to use this patch on the operator to check the new behavior.

@shalberd
Copy link
Author

all good, just one small comment with regards to notebook_controller_test unit test ... there is a comment that in that form is not correct

https://github.com/opendatahub-io/kubeflow/blob/main/components/odh-notebook-controller/controllers/notebook_controller_test.go#L465

I suggest changing the lines

								// Since for unit tests we do not have context,
								// namespace will fallback to test pod namespace
								// if run in CI or `redhat-ods-applications` if run locally

to

                                                         // Since for unit tests we do not have context,
							// namespace will fallback to test pod namespace
							// i.e. `redhat-ods-applications` as defined in suite_test.go

see

https://github.com/opendatahub-io/kubeflow/pull/375/files#diff-02e5ce127a538e5eb5f63b210d7c1e98f84b2bb1cdd7adca39378dd24b6546fbR74

and

https://github.com/opendatahub-io/kubeflow/pull/375/files#diff-02e5ce127a538e5eb5f63b210d7c1e98f84b2bb1cdd7adca39378dd24b6546fbR191

I mean, unit test namespace is always redhat-ods-applications now for lack of context, is it not?

@openshift-ci openshift-ci bot removed the lgtm label Nov 27, 2024
@jiridanek
Copy link
Member

/override ci/prow/odh-notebook-controller-e2e

same as before

Copy link

openshift-ci bot commented Nov 27, 2024

@jiridanek: Overrode contexts on behalf of jiridanek: ci/prow/odh-notebook-controller-e2e

In response to this:

/override ci/prow/odh-notebook-controller-e2e

same as before

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@atheo89
Copy link
Member

atheo89 commented Nov 27, 2024

All changes look good after my inspection across several scenarios. 💯

I also tested with the internal registry enabled and disabled to ensure full functionality, and everything worked as expected.

However, I did notice a slight latency in behavior when switching the internal registry to Removed. Specifically, after triggering the creation of a new workbench (just a few seconds after status.dockerImageRepository was updated to empty), the logs indicated that the controller hit the if condition, resulting in an ImagePullOff state.

I retried shortly after, and this time, it correctly handled the case without a registry.

/lgtm

@jiridanek
Copy link
Member

slight latency in behavior when switching the internal registry to Removed

That would be a latency in https://github.com/opendatahub-io/odh-dashboard, then, which did not notice yet that registry disappeared and still used "image-registry.openshift-image-registry.svc:5000". Since this is coming purely from dashboard, no?

@jiridanek
Copy link
Member

Thanks for the reviews and ofc for the work! + 💯

/approve

Copy link

openshift-ci bot commented Nov 27, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jiridanek

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jiridanek
Copy link
Member

Let's test more as Jira moves to testing, then!

@openshift-merge-bot openshift-merge-bot bot merged commit bb2dd26 into opendatahub-io:main Nov 27, 2024
12 checks passed
@openshift-cherrypick-robot

@jiridanek: new pull request created: #474

In response to this:

/cherrypick v1.9-branch

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/kubeflow that referenced this pull request Nov 27, 2024
…espace the controller runs in (opendatahub-io#375) * [Fix] search for imagestreams only in the namespace the controller runs in.

- Take namespace that the controller runs in
- Log that dynamically-determined namespace once at startup.
- Use hard-coded namespace for unit tests.
@openshift-cherrypick-robot

@jiridanek: #375 failed to apply on top of branch "stable":

Applying: RHOAIENG-8390: feat(odh-nbc): search for imagestreams only in the namespace the controller runs in (#375) * [Fix] search for imagestreams only in the namespace the controller runs in.
Using index info to reconstruct a base tree...
M	components/odh-notebook-controller/controllers/notebook_controller.go
M	components/odh-notebook-controller/controllers/notebook_controller_test.go
M	components/odh-notebook-controller/controllers/notebook_network.go
M	components/odh-notebook-controller/controllers/notebook_webhook.go
M	components/odh-notebook-controller/controllers/suite_test.go
Falling back to patching base and 3-way merge...
Auto-merging components/odh-notebook-controller/controllers/suite_test.go
CONFLICT (content): Merge conflict in components/odh-notebook-controller/controllers/suite_test.go
Auto-merging components/odh-notebook-controller/controllers/notebook_webhook.go
Auto-merging components/odh-notebook-controller/controllers/notebook_network.go
CONFLICT (content): Merge conflict in components/odh-notebook-controller/controllers/notebook_network.go
Auto-merging components/odh-notebook-controller/controllers/notebook_controller_test.go
CONFLICT (content): Merge conflict in components/odh-notebook-controller/controllers/notebook_controller_test.go
Auto-merging components/odh-notebook-controller/controllers/notebook_controller.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0001 RHOAIENG-8390: feat(odh-nbc): search for imagestreams only in the namespace the controller runs in (#375) * [Fix] search for imagestreams only in the namespace the controller runs in.

In response to this:

/cherrypick stable

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@jiridanek
Copy link
Member

we don't care for ^^^, that was set when we had a different process, we don't cherrypick anymore

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants