List workflows sometimes fail with out of sort memory error #13229

Closed
3 of 4 tasks
kangyanzhou opened this issue Jun 20, 2024 · 4 comments
Labels
area/api Argo Server API area/upstream This is an issue with an upstream dependency, not Argo itself area/workflow-archive type/bug

Comments

@kangyanzhou

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what did you expect to happen?

On the Workflows page, if I select the most recent 500 workflows, the request sometimes fails with:

Internal Server Error: Error 1038 (HY001): Out of sort memory, consider increasing server sort buffer size

Searching the web suggests this is a MySQL issue, but I'm not sure.

Unfortunately, it only fails in our prod deployment, and I couldn't upgrade the version there to see whether v3.5.8 still has the same issue.

Some example error logs I can find in the argo server log:

time="2024-06-20T21:35:12.559Z" level=error msg="finished unary call with code Internal" error="rpc error: code = Internal desc = Error 1038 (HY001): Out of sort memory, consider increasing server sort buffer size" grpc.code=Internal grpc.method=ListWorkflows grpc.service=workflow.WorkflowService grpc.start_time="2024-06-20T21:35:11Z" grpc.time_ms=797.957 span.kind=server system=grpc
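
To check how often this error occurs and which gRPC method it hits, the key=value log line above can be split into fields. This is a minimal sketch that assumes the server log keeps the logfmt-style layout shown here; `parse_logfmt` and `is_sort_memory_error` are hypothetical helper names, not part of Argo.

```python
import shlex


def parse_logfmt(line: str) -> dict[str, str]:
    """Split a logfmt-style line into key/value pairs.

    shlex keeps quoted values (which may contain spaces) together
    and strips the surrounding double quotes.
    """
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that are not key=value
            fields[key] = value
    return fields


def is_sort_memory_error(fields: dict[str, str]) -> bool:
    """True when the entry reports MySQL error 1038 (out of sort memory)."""
    return "Error 1038" in fields.get("error", "")


# The log line quoted in this issue.
line = (
    'time="2024-06-20T21:35:12.559Z" level=error '
    'msg="finished unary call with code Internal" '
    'error="rpc error: code = Internal desc = Error 1038 (HY001): '
    'Out of sort memory, consider increasing server sort buffer size" '
    'grpc.code=Internal grpc.method=ListWorkflows '
    'grpc.service=workflow.WorkflowService '
    'grpc.start_time="2024-06-20T21:35:11Z" grpc.time_ms=797.957 '
    'span.kind=server system=grpc'
)
fields = parse_logfmt(line)
```

Here `fields["grpc.method"]` is `ListWorkflows`, which matches the failing list request described above.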

Version

v3.5.7

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

N/A

Logs from the workflow controller

N/A

Logs from your workflow's wait container

N/A
@agilgur5 agilgur5 added the area/api Argo Server API label Jun 22, 2024
@agilgur5
Contributor

agilgur5 commented Jun 22, 2024

Unfortunately, it only fails in our prod deployment, and I couldn't upgrade the version there to see whether v3.5.8 still has the same issue.

3.5.8 fixed a pretty severe bug in #13166 which affects the new in-memory SQLiteDB in 3.5.7 from #13021 / #12736, so I would really suggest upgrading.

Was there a stack trace in the logs? Can you provide the preceding logs and the logs after this error?

Searching the web suggests this is a MySQL issue, but I'm not sure.

Can you provide your ConfigMap? You didn't mention an important detail here; are you using the Workflow Archive or status offloading features that require an external DB? In your case, sounds like MySQL?

If it's a MySQL resource allocation issue, I'm not sure there's anything Argo can do about that; you would have to change your MySQL configuration.

@agilgur5 agilgur5 added the problem/more information needed Not enough information has been provide to diagnose this issue. label Jun 22, 2024
@agilgur5
Contributor

agilgur5 commented Jun 22, 2024

Indeed this does seem like a specific MySQL issue per this SO answer which links to a MySQL bug report which links to another and so forth. Unfortunately, if I followed the thread correctly, it seems like they closed it with a documentation update instead of fixing the regression in MySQL >= 8.0.18 😕
What version of MySQL are you on?

It seems to specifically affect sorts on tables with JSON columns, especially those >1MB. The Workflow status is stored as a JSON column, which can exceed 1MB, especially for node status offloading cases (which quite literally exist to work around the etcd 1MB limit).
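
To get a rough sense of whether a given workflow object crosses that size, its serialized JSON can be measured. This is only a sketch: it reads ">1MB" as 1 MiB (an assumption), and MySQL stores JSON in its own binary format, so the number is an approximation of what the database actually sorts. `json_size_bytes` and `exceeds_one_mib` are hypothetical helper names.

```python
import json

ONE_MIB = 1024 * 1024  # assumed reading of the ">1MB" figure in this thread


def json_size_bytes(obj) -> int:
    """Size in bytes of the compact UTF-8 JSON encoding of obj."""
    return len(json.dumps(obj, separators=(",", ":")).encode("utf-8"))


def exceeds_one_mib(obj) -> bool:
    """True when the serialized object is larger than 1 MiB."""
    return json_size_bytes(obj) > ONE_MIB


# Made-up data for illustration: a tiny workflow and one with many large nodes.
small = {"status": {"phase": "Succeeded", "nodes": {}}}
large = {"status": {"nodes": {f"node-{i}": "x" * 1024 for i in range(1100)}}}
```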

@agilgur5 agilgur5 added the area/upstream This is an issue with an upstream dependency, not Argo itself label Jun 22, 2024
@kangyanzhou
Author

kangyanzhou commented Jun 24, 2024

Unfortunately, if I followed the thread correctly, it seems like they closed it with a documentation update instead of fixing the regression in MySQL >= 8.0.18 😕 What version of MySQL are you on?

It seems to specifically affect sorts on tables with JSON columns, especially those >1MB. The Workflow status is stored as a JSON column, which can exceed 1MB, especially for node status offloading cases (which quite literally exist to work around the etcd 1MB limit).

Our MySQL db version is 8.0.35, and we use workflow archive but not the node status offload feature

@agilgur5
Contributor

agilgur5 commented Jun 24, 2024

8.0.35 >= 8.0.18, so you would indeed be affected by that MySQL regression.
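
The comparison above has to be done on numeric components rather than on strings (as strings, "8.0.9" would sort after "8.0.18"). A minimal sketch, assuming a plain dotted version with no suffix such as `-log`; the 8.0.18 threshold is the one quoted in this thread, and `is_affected` is a hypothetical helper name.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn a dotted version such as '8.0.35' into (8, 0, 35)."""
    return tuple(int(part) for part in version.split("."))


# First version reported in this thread as having the regression.
REGRESSION_SINCE = parse_version("8.0.18")


def is_affected(version: str) -> bool:
    """True when the version is at or above the threshold from this thread."""
    return parse_version(version) >= REGRESSION_SINCE
```

With this, `is_affected("8.0.35")` is true and `is_affected("8.0.9")` is false.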

If you're not using status offloading, then you're probably under 1MB (you might still have compressed nodes, which might be uncompressed in the archive, I'm not sure).

Unfortunately the MySQL issue is not limited to >1MB, it just seems to happen more often in those cases, per the linked threads.

It's out of Argo's hands at this point; you have to configure your DB to have more sort memory or use one of the other workarounds in the MySQL threads and docs.
It sounds like more people need to complain to MySQL to fix the regression in the optimizer that causes the sort memory calculation to underestimate though 😕 So I'd suggest upvoting and adding comments to those issues on their bug tracker. Or if you have commercial support, contact your provider and otherwise push them to fix it.
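
For the first option (more sort memory), the relevant MySQL system variable is `sort_buffer_size`. The sketch below only builds the statements; running them needs a MySQL client and sufficient privileges, and the 4 MiB value is an arbitrary example rather than a recommendation. `sort_buffer_statements` is a hypothetical helper name.

```python
def sort_buffer_statements(size_bytes: int, scope: str = "GLOBAL") -> list[str]:
    """Build statements to inspect and then raise MySQL's sort_buffer_size.

    scope is "GLOBAL" (applies to new connections) or "SESSION"
    (applies to the current connection only).
    """
    if scope not in ("GLOBAL", "SESSION"):
        raise ValueError("scope must be 'GLOBAL' or 'SESSION'")
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    return [
        # Show the current value before changing it.
        "SHOW VARIABLES LIKE 'sort_buffer_size'",
        # Raise the buffer; the size is an example value, tune it for your DB.
        f"SET {scope} sort_buffer_size = {int(size_bytes)}",
    ]


statements = sort_buffer_statements(4 * 1024 * 1024)
```

A value set with `SET GLOBAL` does not survive a server restart, so a lasting change also has to go into the MySQL configuration.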

@agilgur5 agilgur5 closed this as not planned Jun 24, 2024
@agilgur5 agilgur5 removed the problem/more information needed Not enough information has been provide to diagnose this issue. label Jun 24, 2024