kfp-api: Waiting for relational-db data #632

Open
ACodingfreak opened this issue Dec 5, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@ACodingfreak

Bug Description

I have a two-node MicroK8s cluster and am trying to install the Kubeflow 1.9 charm using Juju. The deployment has still not completed after 2 hours; the current status is:

$ juju status --color | grep -E "blocked|error|maintenance|waiting|App|Unit"
App                      Version                  Status       Scale  Charm                    Channel          Rev  Address         Exposed  Message
kfp-api                                           waiting          1  kfp-api                  2.2/stable      1611  10.152.183.218  no       installing agent
kfp-db                                            waiting          1  mysql-k8s                8.0/stable       180  10.152.183.202  no       installing agent
kfp-persistence                                   waiting          1  kfp-persistence          2.2/stable      1560  10.152.183.231  no       installing agent
kfp-schedwf                                       maintenance      1  kfp-schedwf              2.2/stable      1571  10.152.183.105  no       Reconciling charm: executing component kubernetes:auth-and-crds
kfp-ui                                            waiting          1  kfp-ui                   2.2/stable      1555  10.152.183.141  no       installing agent
kubeflow-volumes                                  maintenance      1  kubeflow-volumes         1.9/stable       348  10.152.183.146  no       Reconciling charm: executing component kubernetes:auth
pvcviewer-operator                                waiting          1  pvcviewer-operator       1.9/stable       204  10.152.183.217  no       installing agent
Unit                        Workload     Agent  Address       Ports          Message
kfp-api/0*                  maintenance  idle   10.1.121.210                 Creating K8S resources
kfp-persistence/0*          blocked      idle   10.1.121.212                 [relation:kfp-api] Expected data from exactly 1 related applications - got 0.
kfp-schedwf/0*              maintenance  idle   10.1.121.214                 Reconciling charm: executing component kubernetes:auth-and-crds
kfp-ui/0*                   blocked      idle   10.1.69.153                  [relation:kfp-api] Expected data from exactly 1 related applications - got 0.
kubeflow-volumes/0*         maintenance  idle   10.1.121.222                 Reconciling charm: executing component kubernetes:auth
pvcviewer-operator/0*       maintenance  idle   10.1.69.157                  Reconciling charm: executing component pvc-viewer-pebble-service
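
For anyone triaging a similar deployment, the grep filter above can be approximated from Juju's machine-readable output. The sketch below assumes that `juju status --format=json` lists units under `applications.<app>.units.<unit>.workload-status` with `current` and `message` keys. `unsettled_units` and `SAMPLE` are illustrative names, and the sample is hand-written from values in this report rather than captured output.

```python
from typing import Dict, List, Tuple


def unsettled_units(status: Dict) -> List[Tuple[str, str, str]]:
    """Return (unit, workload state, message) for every unit that is not 'active'."""
    rows = []
    for app in status.get("applications", {}).values():
        # An application without units may omit the key, hence the fallback.
        for unit_name, unit in (app.get("units") or {}).items():
            workload = unit.get("workload-status", {})
            state = workload.get("current", "unknown")
            if state != "active":
                rows.append((unit_name, state, workload.get("message", "")))
    return sorted(rows)


# Hand-written sample built from unit names and messages in this report.
# The JSON layout is an assumption, not captured `juju status` output.
SAMPLE = {
    "applications": {
        "kfp-api": {
            "units": {
                "kfp-api/0": {
                    "workload-status": {
                        "current": "maintenance",
                        "message": "Creating K8S resources",
                    }
                }
            }
        },
        "kfp-ui": {
            "units": {
                "kfp-ui/0": {
                    "workload-status": {
                        "current": "blocked",
                        "message": "[relation:kfp-api] Expected data from exactly 1 related applications - got 0.",
                    }
                }
            }
        },
        "admission-webhook": {
            "units": {
                "admission-webhook/0": {"workload-status": {"current": "active"}}
            }
        },
    }
}

if __name__ == "__main__":
    for unit, state, message in unsettled_units(SAMPLE):
        print(f"{unit:<24} {state:<12} {message}")
```

To run it against a live model, `SAMPLE` could be replaced with the parsed output of `juju status --format=json`, for example via `json.load(sys.stdin)`.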

I initially filed this bug against the Kubeflow bundle, but after troubleshooting, the issue seems to be caused by kfp-api:
canonical/bundle-kubeflow#1168

A similar issue was raised in this repository in the past, but it was closed due to inactivity:
#382

To Reproduce

I am just following the instructions provided in:

https://documentation.ubuntu.com/charmed-mlflow/en/latest/tutorial/mlflow-kubeflow/

Environment

Both machines have the following configuration:
16 CPU cores
64 GB RAM

Ubuntu 22.04
Microk8s 1.29
Juju 3.4
Kubeflow 1.9

Relevant Log Output

2024-12-05T14:44:07.499Z [container-agent] 2024-12-05 14:44:07 INFO juju-log HTTP Request: PATCH https://10.152.183.1/api/v1/namespaces/kubeflow/serviceaccounts/kfp-api-sa?force=true&fieldManager=l
2024-12-05T14:44:07.689Z [container-agent] 2024-12-05 14:44:07 INFO juju-log HTTP Request: PATCH https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/clusterroles/kfp-api-role?force=true&fieldMan
2024-12-05T14:44:07.903Z [container-agent] 2024-12-05 14:44:07 INFO juju-log HTTP Request: PATCH https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/clusterroles/kubeflow-pipelines-edit?force=tr
2024-12-05T14:44:08.110Z [container-agent] 2024-12-05 14:44:08 INFO juju-log HTTP Request: PATCH https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/clusterroles/kubeflow-pipelines-view?force=tr
2024-12-05T14:44:08.319Z [container-agent] 2024-12-05 14:44:08 INFO juju-log HTTP Request: PATCH https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/clusterroles/aggregate-to-kubeflow-pipelines-
2024-12-05T14:44:08.528Z [container-agent] 2024-12-05 14:44:08 INFO juju-log HTTP Request: PATCH https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/clusterroles/aggregate-to-kubeflow-pipelines-
2024-12-05T14:44:08.724Z [container-agent] 2024-12-05 14:44:08 INFO juju-log HTTP Request: PATCH https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/clusterroles/argo-aggregate-to-admin?force=tr
2024-12-05T14:44:08.916Z [container-agent] 2024-12-05 14:44:08 INFO juju-log HTTP Request: PATCH https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/clusterroles/argo-aggregate-to-edit?force=tru
2024-12-05T14:44:09.102Z [container-agent] 2024-12-05 14:44:09 INFO juju-log HTTP Request: PATCH https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/clusterroles/argo-aggregate-to-view?force=tru
2024-12-05T14:44:09.290Z [container-agent] 2024-12-05 14:44:09 INFO juju-log HTTP Request: PATCH https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/clusterrolebindings/kfp-api-binding?force=tru
2024-12-05T14:44:09.480Z [container-agent] 2024-12-05 14:44:09 INFO juju-log HTTP Request: PATCH https://10.152.183.1/api/v1/namespaces/kubeflow/services/ml-pipeline?force=true&fieldManager=lightku
2024-12-05T14:44:09.682Z [container-agent] 2024-12-05 14:44:09 INFO juju-log HTTP Request: PATCH https://10.152.183.1/api/v1/namespaces/kubeflow/services/minio-service?force=true&fieldManager=light
2024-12-05T14:44:09.776Z [container-agent] 2024-12-05 14:44:09 INFO juju-log Reconcile completed successfully
2024-12-05T14:44:10.171Z [container-agent] 2024-12-05 14:44:10 INFO juju-log Found empty relation data for relational-db relation.
2024-12-05T14:44:10.186Z [container-agent] 2024-12-05 14:44:10 ERROR juju-log Failed to generate container configuration.
2024-12-05T14:44:10.474Z [container-agent] 2024-12-05 14:44:10 ERROR juju-log Failed to handle <UpdateStatusEvent via KfpApiOperator/on/update_status[4132]> with error: Waiting for relational-db da
2024-12-05T14:44:10.883Z [container-agent] 2024-12-05 14:44:10 INFO juju.worker.uniter.operation runhook.go:186 ran "update-status" hook (via hook dispatching script: dispatch)

Additional Context

No response

@ACodingfreak ACodingfreak added the bug Something isn't working label Dec 5, 2024

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6650.

This message was autogenerated

@mvlassis
Contributor

mvlassis commented Dec 11, 2024

Hello! If possible, can you provide the status and the logs of kfp-db:

juju status kfp-db
juju debug-log -i kfp-db --replay

Thanks!

@ACodingfreak
Author

ACodingfreak commented Dec 11, 2024

Please find the requested status and logs below:

$ juju status kfp-db
Model     Controller          Cloud/Region             Version  SLA          Timestamp
kubeflow  juju-kf-controller  juju-kf-cloud/localhost  3.4.6    unsupported  07:16:51-08:00

App     Version  Status   Scale  Charm      Channel     Rev  Address         Exposed  Message
kfp-db           waiting      1  mysql-k8s  8.0/stable  180  10.152.183.202  no       installing agent

Unit       Workload  Agent  Address      Ports  Message
kfp-db/0*  unknown   idle   10.1.69.152    
$ juju debug-log -i kfp-db --replay | more
unit-kfp-db-0: 23:59:41 WARNING unit.kfp-db/0.juju-log Failed to check if cluster metadata exists from_instance='kfp-db-0.kfp-db-endpoints.kubeflow.svc.cluster.local'
unit-kfp-db-0: 23:59:41 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-kfp-db-0: 00:04:18 WARNING unit.kfp-db/0.juju-log Failed to check if cluster metadata exists from_instance='kfp-db-0.kfp-db-endpoints.kubeflow.svc.cluster.local'
unit-kfp-db-0: 00:04:19 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-kfp-db-0: 00:10:02 WARNING unit.kfp-db/0.juju-log Failed to check if cluster metadata exists from_instance='kfp-db-0.kfp-db-endpoints.kubeflow.svc.cluster.local'
unit-kfp-db-0: 00:10:02 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-kfp-db-0: 00:14:53 WARNING unit.kfp-db/0.juju-log Failed to check if cluster metadata exists from_instance='kfp-db-0.kfp-db-endpoints.kubeflow.svc.cluster.local'
unit-kfp-db-0: 00:14:54 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-kfp-db-0: 00:20:22 WARNING unit.kfp-db/0.juju-log Failed to check if cluster metadata exists from_instance='kfp-db-0.kfp-db-endpoints.kubeflow.svc.cluster.local'
unit-kfp-db-0: 00:20:23 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-kfp-db-0: 00:25:14 WARNING unit.kfp-db/0.juju-log Failed to check if cluster metadata exists from_instance='kfp-db-0.kfp-db-endpoints.kubeflow.svc.cluster.local'
unit-kfp-db-0: 00:25:15 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-kfp-db-0: 00:31:06 WARNING unit.kfp-db/0.juju-log Failed to check if cluster metadata exists from_instance='kfp-db-0.kfp-db-endpoints.kubeflow.svc.cluster.local'
unit-kfp-db-0: 00:31:07 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-kfp-db-0: 00:36:57 WARNING unit.kfp-db/0.juju-log Failed to check if cluster metadata exists from_instance='kfp-db-0.kfp-db-endpoints.kubeflow.svc.cluster.local'
unit-kfp-db-0: 00:36:57 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-kfp-db-0: 00:41:32 WARNING unit.kfp-db/0.juju-log Failed to check if cluster metadata exists from_instance='kfp-db-0.kfp-db-endpoints.kubeflow.svc.cluster.local'
unit-kfp-db-0: 00:41:33 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-kfp-db-0: 00:46:22 WARNING unit.kfp-db/0.juju-log Failed to check if cluster metadata exists from_instance='kfp-db-0.kfp-db-endpoints.kubeflow.svc.cluster.local'
unit-kfp-db-0: 00:46:22 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-kfp-db-0: 00:51:42 WARNING unit.kfp-db/0.juju-log Failed to check if cluster metadata exists from_instance='kfp-db-0.kfp-db-endpoints.kubeflow.svc.cluster.local'
unit-kfp-db-0: 00:51:42 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-kfp-db-0: 00:56:27 WARNING unit.kfp-db/0.juju-log Failed to check if cluster metadata exists from_instance='kfp-db-0.kfp-db-endpoints.kubeflow.svc.cluster.local'
unit-kfp-db-0: 00:56:27 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-kfp-db-0: 01:01:42 WARNING unit.kfp-db/0.juju-log Failed to check if cluster metadata exists from_instance='kfp-db-0.kfp-db-endpoints.kubeflow.svc.cluster.local'
unit-kfp-db-0: 01:01:42 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-kfp-db-0: 01:05:56 WARNING unit.kfp-db/0.juju-log Failed to check if cluster metadata exists from_instance='kfp-db-0.kfp-db-endpoints.kubeflow.svc.cluster.local'
unit-kfp-db-0: 01:05:57 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-kfp-db-0: 01:10:47 WARNING unit.kfp-db/0.juju-log Failed to check if cluster metadata exists from_instance='kfp-db-0.kfp-db-endpoints.kubeflow.svc.cluster.local'
unit-kfp-db-0: 01:10:47 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-kfp-db-0: 01:16:01 WARNING unit.kfp-db/0.juju-log Failed to check if cluster metadata exists from_instance='kfp-db-0.kfp-db-endpoints.kubeflow.svc.cluster.local'
unit-kfp-db-0: 01:16:02 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-kfp-db-0: 01:20:36 WARNING unit.kfp-db/0.juju-log Failed to check if cluster metadata exists from_instance='kfp-db-0.kfp-db-endpoints.kubeflow.svc.cluster.local'
unit-kfp-db-0: 01:20:36 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-kfp-db-0: 01:26:31 WARNING unit.kfp-db/0.juju-log Failed to check if cluster metadata exists from_instance='kfp-db-0.kfp-db-endpoints.kubeflow.svc.cluster.local'
unit-kfp-db-0: 01:26:32 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-kfp-db-0: 01:31:47 WARNING unit.kfp-db/0.juju-log Failed to check if cluster metadata exists from_instance='kfp-db-0.kfp-db-endpoints.kubeflow.svc.cluster.local'
unit-kfp-db-0: 01:31:47 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-kfp-db-0: 01:36:33 WARNING unit.kfp-db/0.juju-log Failed to check if cluster metadata exists from_instance='kfp-db-0.kfp-db-endpoints.kubeflow.svc.cluster.local'
unit-kfp-db-0: 01:36:34 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-kfp-db-0: 01:42:17 WARNING unit.kfp-db/0.juju-log Failed to check if cluster metadata exists from_instance='kfp-db-0.kfp-db-endpoints.kubeflow.svc.cluster.local'
unit-kfp-db-0: 01:42:18 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-kfp-db-0: 01:47:58 WARNING unit.kfp-db/0.juju-log Failed to check if cluster metadata exists from_instance='kfp-db-0.kfp-db-endpoints.kubeflow.svc.cluster.local'
unit-kfp-db-0: 01:47:59 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
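
Since the log above repeats the same two lines on every update-status run, it can be easier to read as counts. The following is a minimal sketch that assumes each line has the shape `<entity>: <HH:MM:SS> <LEVEL> <module> <message>`, as in the excerpt. `LINE_RE`, `summarize` and `SAMPLE_LOG` are illustrative names, and the sample reuses four lines from this comment.

```python
import re
from collections import Counter

# Assumed line shape, taken from the excerpt in this comment:
#   <entity>: <HH:MM:SS> <LEVEL> <module> <message>
LINE_RE = re.compile(
    r"^(?P<entity>\S+): (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) (?P<module>\S+) (?P<message>.*)$"
)


def summarize(lines):
    """Count occurrences of each (level, message) pair; skip lines that do not match."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[(match["level"], match["message"])] += 1
    return counts


# Four lines copied from the kfp-db log in this comment.
SAMPLE_LOG = """\
unit-kfp-db-0: 23:59:41 WARNING unit.kfp-db/0.juju-log Failed to check if cluster metadata exists from_instance='kfp-db-0.kfp-db-endpoints.kubeflow.svc.cluster.local'
unit-kfp-db-0: 23:59:41 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-kfp-db-0: 00:04:18 WARNING unit.kfp-db/0.juju-log Failed to check if cluster metadata exists from_instance='kfp-db-0.kfp-db-endpoints.kubeflow.svc.cluster.local'
unit-kfp-db-0: 00:04:19 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
"""

if __name__ == "__main__":
    for (level, message), count in summarize(SAMPLE_LOG.splitlines()).most_common():
        print(f"{count:>4}  {level:<8} {message}")
```

Counting by message only shows how often each line recurs. It does not explain why the cluster metadata check fails.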

@ACodingfreak
Author

Also, after a week, the latest status is:

$ juju status 
Model     Controller          Cloud/Region             Version  SLA          Timestamp
kubeflow  juju-kf-controller  juju-kf-cloud/localhost  3.4.6    unsupported  07:22:05-08:00

App                      Version                  Status   Scale  Charm                    Channel          Rev  Address         Exposed  Message
admission-webhook                                 active       1  admission-webhook        1.9/stable       344  10.152.183.246  no       
argo-controller                                   waiting      1  argo-controller          3.4/stable       600  10.152.183.177  no       installing agent
dex-auth                                          active       1  dex-auth                 2.39/stable      588  10.152.183.214  no       
envoy                                             active       1  envoy                    2.2/stable       310  10.152.183.95   no       
istio-ingressgateway                              active       1  istio-gateway            1.22/stable     1280  10.152.183.111  no       
istio-pilot                                       active       1  istio-pilot              1.22/stable     1169  10.152.183.154  no       
jupyter-controller                                active       1  jupyter-controller       1.9/stable      1083  10.152.183.164  no       
jupyter-ui                                        active       1  jupyter-ui               1.9/stable       961  10.152.183.197  no       
katib-controller                                  active       1  katib-controller         0.17/stable      813  10.152.183.70   no       
katib-db                 8.0.37-0ubuntu0.22.04.3  active       1  mysql-k8s                8.0/stable       180  10.152.183.106  no       
katib-db-manager                                  active       1  katib-db-manager         0.17/stable      713  10.152.183.219  no       
katib-ui                                          active       1  katib-ui                 0.17/stable      713  10.152.183.22   no       
kfp-api                                           waiting      1  kfp-api                  2.2/stable      1611  10.152.183.218  no       installing agent
kfp-db                                            waiting      1  mysql-k8s                8.0/stable       180  10.152.183.202  no       installing agent
kfp-metadata-writer                               active       1  kfp-metadata-writer      2.2/stable       617  10.152.183.190  no       
kfp-persistence                                   waiting      1  kfp-persistence          2.2/stable      1560  10.152.183.231  no       installing agent
kfp-profile-controller                            active       1  kfp-profile-controller   2.2/stable      1518  10.152.183.104  no       
kfp-schedwf                                       active       1  kfp-schedwf              2.2/stable      1571  10.152.183.105  no       
kfp-ui                                            waiting      1  kfp-ui                   2.2/stable      1555  10.152.183.141  no       installing agent
kfp-viewer                                        active       1  kfp-viewer               2.2/stable      1586  10.152.183.209  no       
kfp-viz                                           active       1  kfp-viz                  2.2/stable      1504  10.152.183.50   no       
knative-eventing                                  active       1  knative-eventing         1.12/stable      459  10.152.183.19   no       
knative-operator                                  active       1  knative-operator         1.12/stable      496  10.152.183.27   no       
knative-serving                                   active       1  knative-serving          1.12/stable      487  10.152.183.135  no       
kserve-controller                                 active       1  kserve-controller        0.13/stable      655  10.152.183.213  no       
kubeflow-dashboard                                active       1  kubeflow-dashboard       1.9/stable       659  10.152.183.240  no       
kubeflow-profiles                                 active       1  kubeflow-profiles        1.9/stable       458  10.152.183.232  no       
kubeflow-roles                                    active       1  kubeflow-roles           1.9/stable       240  10.152.183.113  no       
kubeflow-volumes                                  active       1  kubeflow-volumes         1.9/stable       348  10.152.183.146  no       
metacontroller-operator                           active       1  metacontroller-operator  3.0/stable       352  10.152.183.221  no       
minio                    res:oci-image@220b31a    active       1  minio                    ckf-1.9/stable   383  10.152.183.237  no       
mlmd                                              active       1  mlmd                     ckf-1.9/stable   213  10.152.183.220  no       
oidc-gatekeeper                                   active       1  oidc-gatekeeper          ckf-1.9/stable   423  10.152.183.239  no       
pvcviewer-operator                                active       1  pvcviewer-operator       1.9/stable       204  10.152.183.217  no       
tensorboard-controller                            active       1  tensorboard-controller   1.9/stable       355  10.152.183.76   no       
tensorboards-web-app                              active       1  tensorboards-web-app     1.9/stable       343  10.152.183.58   no       
training-operator                                 active       1  training-operator        1.8/stable       545  10.152.183.96   no       

Unit                        Workload     Agent      Address       Ports          Message
admission-webhook/0*        active       idle       10.1.121.202                 
argo-controller/0*          blocked      idle       10.1.121.203                 [kubernetes:crds-cm-and-secrets] Not all resources found in cluster.  This may be transient if we haven't tried to de...
dex-auth/0*                 active       idle       10.1.121.204                 
envoy/0*                    active       idle       10.1.69.145                  
istio-ingressgateway/0*     active       idle       10.1.121.205                 
istio-pilot/0*              active       idle       10.1.69.146                  
jupyter-controller/0*       active       idle       10.1.69.147                  
jupyter-ui/0*               active       idle       10.1.121.207                 
katib-controller/0*         active       idle       10.1.69.148                  
katib-db-manager/0*         active       idle       10.1.69.149                  
katib-db/0*                 active       executing  10.1.121.209                 Primary
katib-ui/0*                 active       idle       10.1.69.150                  
kfp-api/0*                  waiting      idle       10.1.121.210                 Waiting for relational-db data
kfp-db/0*                   unknown      idle       10.1.69.152                  
kfp-metadata-writer/0*      active       idle       10.1.121.211                 
kfp-persistence/0*          maintenance  idle       10.1.121.212                 Reconciling charm: executing component sa-token:persistenceagent
kfp-profile-controller/0*   active       idle       10.1.121.213                 
kfp-schedwf/0*              active       idle       10.1.121.214                 
kfp-ui/0*                   blocked      idle       10.1.69.153                  [relation:kfp-api] Expected data from exactly 1 related applications - got 0.
kfp-viewer/0*               active       idle       10.1.121.215                 
kfp-viz/0*                  active       idle       10.1.69.154                  
knative-eventing/0*         active       idle       10.1.69.156                  
knative-operator/0*         active       idle       10.1.121.219                 
knative-serving/0*          active       idle       10.1.69.158                  
kserve-controller/0*        active       idle       10.1.69.161                  
kubeflow-dashboard/0*       active       idle       10.1.121.218                 
kubeflow-profiles/0*        active       idle       10.1.121.223                 
kubeflow-roles/0*           active       idle       10.1.69.159                  
kubeflow-volumes/0*         active       idle       10.1.121.222                 
metacontroller-operator/0*  active       idle       10.1.121.220                 
minio/0*                    active       idle       10.1.121.217  9000-9001/TCP  
mlmd/0*                     active       idle       10.1.69.164                  
oidc-gatekeeper/0*          active       idle       10.1.69.163                  
pvcviewer-operator/0*       active       idle       10.1.69.157                  
tensorboard-controller/0*   active       idle       10.1.121.221                 
tensorboards-web-app/0*     active       idle       10.1.69.162                  
training-operator/0*        active       idle       10.1.69.155     

@mvlassis
Contributor

Hello! Unfortunately, I could not reproduce this on a machine with the same environment (22.04, microk8s 1.29, Juju 3.4, CKF 1.9). If you still have the cluster, I would suggest using juju refresh to redeploy the kfp-db charm, which seems to be causing the issue for kfp-api. You can also try a different revision of the charm using the --revision option; see the command's documentation for more details.
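
As a rough illustration of the suggestion above, the sketch below builds the two `juju refresh` invocations and prints them without running anything by default. `refresh_command`, `run` and `PLACEHOLDER_REVISION` are hypothetical names. The revision number is a placeholder, not a known-good revision, so check which revisions exist before using one.

```python
import shlex
import subprocess
from typing import List, Optional

# Placeholder only: this is NOT a known-good revision of the mysql-k8s charm.
PLACEHOLDER_REVISION = 123


def refresh_command(app: str, revision: Optional[int] = None) -> List[str]:
    """Build the argv for `juju refresh <app>`, optionally pinning a revision."""
    cmd = ["juju", "refresh", app]
    if revision is not None:
        cmd += ["--revision", str(revision)]
    return cmd


def run(cmd: List[str], dry_run: bool = True) -> None:
    """Print the command; execute it only when dry_run is False."""
    print(shlex.join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(refresh_command("kfp-db"))
    run(refresh_command("kfp-db", revision=PLACEHOLDER_REVISION))
```

The dry run is the default so that nothing changes on the cluster until the printed command has been reviewed.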

We will track this issue and link back here if we encounter it as well.
