Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

workflows list page delay in showing retry status from details page #12868

Open
3 of 4 tasks
amfage opened this issue Apr 2, 2024 · 5 comments
Open
3 of 4 tasks

workflows list page delay in showing retry status from details page #12868

amfage opened this issue Apr 2, 2024 · 5 comments
Labels
area/ui P3 Low priority type/bug type/regression Regression from previous behavior (a specific type of bug)

Comments

@amfage
Copy link

amfage commented Apr 2, 2024

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what did you expect to happen?

If a workflow fails, when viewing the workflow and clicking "retry", when returning to the main workflows listing page, the workflow still shows failed, although the workflow is running. Sometimes the workflow listing page will, after a second or maybe 5 seconds, show the correct status, but sometimes it stays displaying the failed status.
retry-workflow-argo-3-5-5
The expected behaviour is that the icon shows the blue circle running icon, and the started, finished, duration and message areas show that the workflow is running, as it would if the workflow was running for the first time. This is how it behaved in the previous version we were running which was 3.4.11.

Version

v3.5.5

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

# This example sleeps for 30 seconds then fails, so it can be retried
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: test-simple-fail-for-retry-
spec:
  entrypoint: template-fail
  templates:
  - name: template-fail
    script:
      image: python:alpine3.6
      command: ["python"]
      # fail
      source: |
        import sys;
        import time;
        time.sleep(30); 
        sys.exit(1)

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

Logs from in your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
@amfage amfage added the type/bug label Apr 2, 2024
@agilgur5 agilgur5 changed the title Argo Workflows main workflows list page does not show workflow retry is running workflows list page does not show workflow retry is running Apr 2, 2024
@agilgur5 agilgur5 changed the title workflows list page does not show workflow retry is running workflows list page delay in showing retry status from details page Apr 3, 2024
@agilgur5
Copy link
Contributor

agilgur5 commented Apr 3, 2024

after a second or maybe 5 seconds, show the correct status

That sounds correct. The Workflows List page runs a ListWatch on the visible k8s resources, so there can be a time delay between when the retry occurs and when it shows up in the ListWatch.

The Workflows Details page can have similar issues as it relies on a RetryWatch.

Albeit there is a difference, due to a copy+paste issue, when a retry is started, the page refreshes, same as with resubmit (the code is copied from resubmit, but a retry does not create a new Workflow, so there isn't actually a need to change pages. and a page change should happen via the internal router as well instead of the browser URL). The retry does not necessarily happen immediately though (especially not after #12538), so it might not be caught during the refresh either.

The main way to improve that would be for the UI to keep a cache of Workflows and reconcile them with the Server, that way the Details page and the List page could re-use the data for a faster update, but that would require a sizeable refactor.

@agilgur5 agilgur5 added the P3 Low priority label Apr 3, 2024
@amfage
Copy link
Author

amfage commented Apr 3, 2024

It is definitely the case that sometimes the status does not change to running at all - that happened the first time I submitted a retry on version 3.5.5 and I have seen it several times since.

@agilgur5
Copy link
Contributor

agilgur5 commented Apr 4, 2024

but sometimes it stays displaying the failed status.

It is definitely the case that sometimes the status does not change to running at all - that happened the first time I submitted a retry on version 3.5.5 and I have seen it several times since.

You mentioned that before, thanks for confirming it's happened multiple times. That's odd, that really shouldn't happen that an update is missed... there are some race conditions that can happen, but not frequently 🤔 Do you have a rough estimate of how often that happens? Like "1/20 retries" for example

I also checked the UI merge logic which seems to be correct afaict

This is how it behaved in the previous version we were running which was 3.4.11.

This part is interesting, suggesting a regression. A few things changed in 3.5 here. The options panel was added for retries in #11632, which also added the hard refresh after a retry. And most of the UI was refactored to use React hooks; in particular, I did a massive refactor of the Workflows List page in #11891 which made it substantially more efficient as there were many duplicate network calls. That PR has had a few (less than one-liner) bugs in it though.
This sounds like one of the duplicate network calls may have been catching this in 3.4.x 🤔 IIRC one of them was a full re-list, which should refresh the state... However the hard refresh after a retry should also end up refreshing the state

technically speaking, the back-end k8s client Informer libraries rebuild their cache on a certain timeframe as well (the "resync period" is every 20min by default), specifically because they might miss an update (cache invalidation is hard)... I wonder if we should perhaps do some similar logic here in the UI 🤔

@agilgur5 agilgur5 added the type/regression Regression from previous behavior (a specific type of bug) label Apr 4, 2024
@amfage
Copy link
Author

amfage commented Apr 16, 2024

You mentioned that before, thanks for confirming it's happened multiple times. That's odd, that really shouldn't happen that an update is missed... there are some race conditions that can happen, but not frequently 🤔 Do you have a rough estimate of how often that happens? Like "1/20 retries" for example

I am not sure how often it happens but it did happen last week and was consistent for at least 10 minutes over several workflows, but then reverted to the behaviour of updating the status after about 5 seconds just as I was about to troubleshoot it. I do know that some of those workflow retries were stuck in a "Pending" state (InvalidImageName error, but the workflow didn't fail) but I haven't been able to reproduce it since.

@tooptoop4
Copy link
Contributor

@amfage i see a related PR was merged, have u tested on latest version?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ui P3 Low priority type/bug type/regression Regression from previous behavior (a specific type of bug)
Projects
None yet
Development

No branches or pull requests

4 participants