-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: don't load entire archived workflow into memory in list APIs #12912
fix: don't load entire archived workflow into memory in list APIs #12912
Conversation
Signed-off-by: Jiacheng Xu <[email protected]>
Signed-off-by: Jiacheng Xu <[email protected]>
Signed-off-by: Jiacheng Xu <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
StartedAt: v1.Time{Time: md.StartedAt}, | ||
FinishedAt: v1.Time{Time: md.FinishedAt}, | ||
Progress: wfv1.Progress(md.Progress), | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What did you base the metadata requirements on? Based on workflows-service.ts
and reports.tsx
, we may also want status.message
, status.estimatedDuration
, status.resourceDuration
, and spec.suspend
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was checking columns in the workflow UI and decided what metadata to include, but if the fields that you mentioned are required, I can open another PR to add them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically speaking, fields
is a query parameter in the API, so ideally we'd dynamically set the fields.
If you didn't notice some of them it may be because you can expand items in the list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, that could be, I will prepare a PR to add those fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be causing a regression now: #13098
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm looking at #11121 (comment) more, it may have also unintentionally introduced a feature where you can get all parts of an Archived Workflows in the list
API, whereas you could not in 3.4 if I'm reading correctly.
I.e. fixing that bug also removed that feature. I don't necessarily mind that if it wasn't present in 3.4, but I can see how that could cause some back-and-forth turbulence & instability for users (which, well, #11121 caused a lot of regressions & instability unfortunately (and predates me))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR to add the above missing fields: #13136
…2912) Signed-off-by: Jiacheng Xu <[email protected]> (cherry picked from commit f80b9e8)
Backported cleanly into |
items.status.nodes
disappeared from /api/v1/workflows
endpoint for completed Workflows
#13098
Signed-off-by: Mason Malone <[email protected]>
…j#13601 As explained in argoproj#13601 (comment), I believe argoproj#12912 introduced a performance regression when listing workflows for PostgreSQL users. Reverting that PR could re-introduce the memory issues mentioned in the PR description, so instead this mitigates the impact by converting the `workflow` column to be of type `jsonb`. Initially `workflow` was of type `text`, and was migrated to `json` in argoproj#2152. I'm not sure why `jsonb` wasn't chosen, but [based on this comment in the linked issue](argoproj#2133 (comment)), I think it was simply an oversight. Here's the relevant docs (https://www.postgresql.org/docs/current/datatype-json.html): > The json and jsonb data types accept almost identical sets of values as input. The major practical difference is one of efficiency. The json data type stores an exact copy of the input text, which processing functions must reparse on each execution; while jsonb data is stored in a decomposed binary format that makes it slightly slower to input due to added conversion overhead, but significantly faster to process, since no reparsing is needed. jsonb also supports indexing, which can be a significant advantage. > > Because the json type stores an exact copy of the input text, it will preserve semantically-insignificant white space between tokens, as well as the order of keys within JSON objects. Also, if a JSON object within the value contains the same key more than once, all the key/value pairs are kept. (The processing functions consider the last value as the operative one.) By contrast, jsonb does not preserve white space, does not preserve the order of object keys, and does not keep duplicate object keys. If duplicate keys are specified in the input, only the last value is kept. > > In general, most applications should prefer to store JSON data as jsonb, unless there are quite specialized needs, such as legacy assumptions about ordering of object keys. I'm pretty sure we don't care about key order or whitespace. We do care somewhat about insertion speed, but archived workflows are read much more frequently than written, so a slight reduction in write speed that gives a large improvement in read speed is a good tradeoff. Here's steps to test this: 1. Use argoproj#13715 to generate 100,000 randomized workflows, with https://gist.github.com/MasonM/52932ff6644c3c0ccea9e847780bfd90 as a template: ``` $ time go run ./hack/db fake-archived-workflows --template "@very-large-workflow.yaml" --rows 100000 Using seed 1935828722624432788 Clusters: [default] Namespaces: [argo] Inserted 100000 rows real 18m35.316s user 3m2.447s sys 0m44.972s ``` 2. Run the benchmarks using argoproj#13767: ``` make BenchmarkWorkflowArchive > postgres_before_10000_workflows.txt ``` 3. Run the migration the DB CLI: ``` $ time go run ./hack/db migrate INFO[0000] Migrating database schema clusterName=default dbType=postgres INFO[0000] applying database change change="alter table argo_archived_workflows alter column workflow set data type jsonb using workflow::jsonb" changeSchemaVersion=60 2024/10/17 18:07:42 Session ID: 00001 Query: alter table argo_archived_workflows alter column workflow set data type jsonb using workflow::jsonb Stack: fmt.(*pp).handleMethods@/usr/local/go/src/fmt/print.go:673 fmt.(*pp).printArg@/usr/local/go/src/fmt/print.go:756 fmt.(*pp).doPrint@/usr/local/go/src/fmt/print.go:1208 fmt.Append@/usr/local/go/src/fmt/print.go:289 log.(*Logger).Print.func1@/usr/local/go/src/log/log.go:261 log.(*Logger).output@/usr/local/go/src/log/log.go:238 log.(*Logger).Print@/usr/local/go/src/log/log.go:260 github.com/argoproj/argo-workflows/v3/persist/sqldb.ansiSQLChange.apply@/home/vscode/go/src/github.com/argoproj/argo-workflows/persist/sqldb/ansi_sql_change.go:11 github.com/argoproj/argo-workflows/v3/persist/sqldb.migrate.applyChange.func1@/home/vscode/go/src/github.com/argoproj/argo-workflows/persist/sqldb/migrate.go:295 github.com/argoproj/argo-workflows/v3/persist/sqldb.migrate.applyChange@/home/vscode/go/src/github.com/argoproj/argo-workflows/persist/sqldb/migrate.go:284 github.com/argoproj/argo-workflows/v3/persist/sqldb.migrate.Exec@/home/vscode/go/src/github.com/argoproj/argo-workflows/persist/sqldb/migrate.go:273 main.NewMigrateCommand.func1@/home/vscode/go/src/github.com/argoproj/argo-workflows/hack/db/main.go:50 github.com/spf13/cobra.(*Command).execute@/home/vscode/go/pkg/mod/github.com/spf13/[email protected]/command.go:985 github.com/spf13/cobra.(*Command).ExecuteC@/home/vscode/go/pkg/mod/github.com/spf13/[email protected]/command.go:1117 github.com/spf13/cobra.(*Command).Execute@/home/vscode/go/pkg/mod/github.com/spf13/[email protected]/command.go:1041 main.main@/home/vscode/go/src/github.com/argoproj/argo-workflows/hack/db/main.go:39 runtime.main@/usr/local/go/src/runtime/proc.go:272 runtime.goexit@/usr/local/go/src/runtime/asm_amd64.s:1700 Rows affected: 0 Error: upper: slow query Time taken: 69.12755s Context: context.Background real 1m10.087s user 0m1.541s sys 0m0.410s ``` 2. Re-run the benchmarks: ``` make BenchmarkWorkflowArchive > postgres_after_10000_workflows.txt ``` 4. Compare results using [benchstat](https://pkg.go.dev/golang.org/x/perf/cmd/benchstat): ``` $ benchstat postgres_before_10000_workflows3.txt postgres_after_10000_workflows2.txt goos: linux goarch: amd64 pkg: github.com/argoproj/argo-workflows/v3/test/e2e cpu: 12th Gen Intel(R) Core(TM) i5-12400 │ postgres_before_10000_workflows3.txt │ postgres_after_10000_workflows2.txt │ │ sec/op │ sec/op vs base │ WorkflowArchive/ListWorkflows-12 183.83m ± ∞ ¹ 24.69m ± ∞ ¹ ~ (p=1.000 n=1) ² WorkflowArchive/ListWorkflows_with_label_selector-12 192.71m ± ∞ ¹ 25.87m ± ∞ ¹ ~ (p=1.000 n=1) ² WorkflowArchive/CountWorkflows-12 13.04m ± ∞ ¹ 11.75m ± ∞ ¹ ~ (p=1.000 n=1) ² geomean 77.31m 19.58m -74.68% ¹ need >= 6 samples for confidence interval at level 0.95 ² need >= 4 samples to detect a difference at alpha level 0.05 │ postgres_before_10000_workflows3.txt │ postgres_after_10000_workflows2.txt │ │ B/op │ B/op vs base │ WorkflowArchive/ListWorkflows-12 497.2Ki ± ∞ ¹ 497.5Ki ± ∞ ¹ ~ (p=1.000 n=1) ² WorkflowArchive/ListWorkflows_with_label_selector-12 503.1Ki ± ∞ ¹ 503.9Ki ± ∞ ¹ ~ (p=1.000 n=1) ² WorkflowArchive/CountWorkflows-12 8.972Ki ± ∞ ¹ 8.899Ki ± ∞ ¹ ~ (p=1.000 n=1) ² geomean 130.9Ki 130.7Ki -0.20% ¹ need >= 6 samples for confidence interval at level 0.95 ² need >= 4 samples to detect a difference at alpha level 0.05 │ postgres_before_10000_workflows3.txt │ postgres_after_10000_workflows2.txt │ │ allocs/op │ allocs/op vs base │ WorkflowArchive/ListWorkflows-12 8.373k ± ∞ ¹ 8.370k ± ∞ ¹ ~ (p=1.000 n=1) ² WorkflowArchive/ListWorkflows_with_label_selector-12 8.410k ± ∞ ¹ 8.406k ± ∞ ¹ ~ (p=1.000 n=1) ² WorkflowArchive/CountWorkflows-12 212.0 ± ∞ ¹ 212.0 ± ∞ ¹ ~ (p=1.000 n=1) ³ geomean 2.462k 2.462k -0.03% ¹ need >= 6 samples for confidence interval at level 0.95 ² need >= 4 samples to detect a difference at alpha level 0.05 ³ all samples are equal ``` Signed-off-by: Mason Malone <[email protected]>
Fixes #11121 (comment), Related to #12025 (comment)
This PR fixes the regression introduced in #11121 (comment). In this PR, we don't load the entire workflow to the memory since that causes high memory usage and slow query performance.