nightly-20240117 compute node OOM during sysbench select-random-limits #14634
Comments
cc @chenzl25 @liurenjie1024, could you please take a look?
It may be related to this change: https://github.com/risingwavelabs/kube-bench/pull/356. The intention was to enable the compaction result check only in the longevity test, but it seems that the check is enabled for all tests. @Li0k is taking a look.
I have rolled back the kube-bench configuration and turned on the compaction check in the longevity test.
We still have a compute node OOM with nightly-20240118, @Li0k: https://buildkite.com/risingwave-test/sysbench/builds/660#018d1e60-191c-4a13-a649-2c14b8f40376
The CN OOMed again with nightly-20240122: https://buildkite.com/risingwave-test/sysbench/builds/662#018d32fa-3d7a-4fbe-b766-2dbb47d473a2
The two most recent runs should no longer have anything to do with the compaction check.
This looks weird. Currently, a batch scan with a limit also disables IO prefetch, and the limit has also been pushed down into the scan executor.
Do we have a heap dump?
@chenzl25 s3://test-useast1-mgmt-bucket-archiver/k8s/sysbench-daily-test-20240122/benchmark-risingwave-compute-c-1/6769d727-b756-4d44-b151-a3feecd4d668/2024-1-22/0/
Generating the collapsed heap file: https://buildkite.com/risingwave-test/generate-collapsed-heap-files/builds/18
The problem has persisted over the past few days. We need to take a look. cc @chenzl25
Unfortunately, the heap dump didn't capture the moment of the OOM. Did we enable the memory-usage-triggered automatic dump? There are no auto dump files in the S3 bucket. cc @cyliu0
Yeah, it's enabled by default.
I reproduced a heap dump flame graph from another cluster. How long does this test run before the OOM, @cyliu0?
@chenzl25 According to the buildkite output below, the job started at 21:36:47 and the OOM happened at 21:36:52, so it had run for 5 seconds before the OOM.
nightly-20240125 OOMed again: https://buildkite.com/risingwave-test/sysbench/builds/665#018d426c-c33c-4ad0-bf31-3b8452793391. And there are auto dump files.
Describe the bug
This is similar to issue #13506, which has already been fixed.
buildkite job
Grafana
https://github.com/risingwavelabs/rw-commits-history?tab=readme-ov-file#nightly-20240117
nightly-20240117
2c2085f0b91d9a7252d5aa8d3c54ea7eed85f255 fix: fix incorrect compact task memory estimation (#14624)
95cdfe9c10be95dd5ed2cb950c03b6cbd599c505 feat(frontend): support idle in transaction session timeout (#14566)
8babc53918e13c24577416dcb33544dde74a3852 chore(storage): upgrade config (#14601)
8588a2a12496ba714bb6c50b41ab726613104a3e fix(cdc): resolve shared source in the correct schema (#14606)
1674a7c1f84fe5171c0ab4c40602a7c91d49c6e3 feat(cdc-backfill): support pause and resume (#14590)
573ccc3ff4ee26fbd7f0da1b04cf598dfe5ea015 test(connector): add pg-sink compatible test (#14615)
d0fbfc5cf1c986d03e3984d8a13dbabd3b717085 fix(stream): read exactly a single row if no snapshot read on barrier (#14544)
7cf95e63cd3d16c952da64fd48ee16c6819dbddb refactor(frontend): make frontend compute runtime configurable (#14597)
6fe4a0be5793671ed1e031be9683ef6d1f68b1b9 refactor(common): generate more code on system params with macro (#14596)
352276eb13f6b56fde989b2aafb08d4371b00b00 test: add nodejs client test into ci pipeline (#14614)
4dde9e6e96baa3b068d79e81fab730734694f910 feat(compactor): check result for all compact task (#14521)
8a542ad767597d2fd460eeba4c53a64f3334bc81 feat: implement ALTER SOURCE xx FORMAT xx ENCODE xx (...) (#14057)
3f3ef99a03de603a792715f851d2dc9d93397c60 feat(common): derive Copy for DeltaBTreeMap (#14579)
ec95fe369658d046659cbabe6bfbd4a2b5149984 feat(s3): retry unhandled 503 error (#14562)
217ff0b610506a1d956e1c0e64983d5fabd386b3 test(stream): Test no_shuffle_backfill vs arrangement_backfill performance (#14593)
24c357274bd6ecd9e2b6b204940a6d3c2add8a81 fix(sql-udf): correctly binding arguments for inner calling sql udfs (#14548)
8b16eb7c61ff702b7c8a5c7e3c54da176f3e3308 fix(storage): should use epoch with the maximum offset when get (#14607)
220dff7e3f51a588db151624a3e55ad5717fef5a chore(deps): Bump vergen from 8.2.5 to 8.3.0 (#14608)
2a929eb93b57fd2f4a7cac1485e772d3e16bf9c7 chore(deps): Bump multimap from 0.9.0 to 0.10.0 (#14609)
680a8d59f473d61cc707495bc0198d52623c083d chore(ci): remove llvm cov (#14565)
0a5781192305391ba845b5325432034b886a4b30 refactor: remove OrderedMergeIterator (#14572)
84c9a08c037cd01be78f751d4c4853f55a401326 feat(bench): Add sink bench tool (#14064)
b1cdb980c8502e5d663fdd0557e5f080540a1036 refactor(meta): maintain snapshot of running actors instead of resolving it every time for barrier (#14517)
Error message/log
No response
To Reproduce
No response
Expected behavior
No response
How did you deploy RisingWave?
No response
The version of RisingWave
nightly-20240117
Additional context
Meanwhile, there is a performance degradation for bulk insert: http://metabase.risingwave-cloud.xyz/dashboard/278-sysbench-rw-qps