nightly-20240117 compute node OOM during sysbench select-random-limits #14634

Closed
cyliu0 opened this issue Jan 18, 2024 · 16 comments · Fixed by #14795

@cyliu0
Collaborator

cyliu0 commented Jan 18, 2024

Describe the bug

This is similar to #13506, which has already been fixed.

buildkite job
Grafana

image

https://github.com/risingwavelabs/rw-commits-history?tab=readme-ov-file#nightly-20240117

nightly-20240117

Error message/log

No response

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

nightly-20240117

Additional context

Meanwhile, there is a performance degradation for bulk insert: http://metabase.risingwave-cloud.xyz/dashboard/278-sysbench-rw-qps
image

@cyliu0 cyliu0 added the type/bug and out-of-memory labels Jan 18, 2024
@github-actions github-actions bot added this to the release-1.7 milestone Jan 18, 2024
@fuyufjh
Member

fuyufjh commented Jan 18, 2024

cc @chenzl25 @liurenjie1024, could you please take a look?

@hzxa21
Collaborator

hzxa21 commented Jan 18, 2024

It may be related to this change: https://github.com/risingwavelabs/kube-bench/pull/356

The intention was to enable the compaction result check only for the longevity test, but it seems that the check is enabled for all tests. @Li0k is taking a look.

@Li0k
Contributor

Li0k commented Jan 18, 2024

I have rolled back the kube-bench configuration and turned on the compaction check only in the longevity test.

@cyliu0
Collaborator Author

cyliu0 commented Jan 19, 2024

We still see a compute node OOM with nightly-20240118, @Li0k: https://buildkite.com/risingwave-test/sysbench/builds/660#018d1e60-191c-4a13-a649-2c14b8f40376

@cyliu0
Collaborator Author

cyliu0 commented Jan 23, 2024

@lmatz
Contributor

lmatz commented Jan 23, 2024

The two most recent runs should no longer have anything to do with the compaction check.
cc @chenzl25 @liurenjie1024, could you help take a look?

@lmatz lmatz assigned liurenjie1024 and chenzl25 and unassigned Li0k Jan 23, 2024
@chenzl25
Contributor

This looks weird. Currently, a batch scan with a limit disables IO prefetch as well, and the limit has also been pushed down into the scan executor.
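
To illustrate why an OOM is surprising here, below is a minimal Python sketch of what limit pushdown means for a scan. It is not RisingWave's actual (Rust) scan executor; `scan_with_limit` and `counting_source` are hypothetical names used only for illustration. The point is that once the limit lives inside the scan, the scan stops pulling rows from storage as soon as the limit is reached, so the number of rows held in memory should be bounded by the limit rather than by the table size.

```python
from typing import Dict, Iterable, Iterator, List, Optional


def counting_source(n: int, counter: Dict[str, int]) -> Iterator[int]:
    """Hypothetical storage iterator that records how many rows were read."""
    for i in range(n):
        counter["read"] += 1
        yield i


def scan_with_limit(
    source: Iterable[int], chunk_size: int, limit: Optional[int] = None
) -> Iterator[List[int]]:
    """Yield chunks of rows, stopping reads once `limit` rows are produced."""
    if limit is not None and limit <= 0:
        return
    produced = 0
    chunk: List[int] = []
    for row in source:
        chunk.append(row)
        produced += 1
        # The limit is checked inside the scan, so no further rows are pulled.
        if limit is not None and produced >= limit:
            break
        if len(chunk) >= chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


if __name__ == "__main__":
    counter = {"read": 0}
    chunks = list(
        scan_with_limit(counting_source(1_000_000, counter), chunk_size=2, limit=3)
    )
    print(chunks, counter["read"])
```

With a limit of 3 over a million-row source, the sketch reads exactly 3 rows, which is the behaviour one would expect from a pushed-down limit.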

@chenzl25
Contributor

Do we have any heap dump?

@cyliu0
Collaborator Author

cyliu0 commented Jan 23, 2024

@fuyufjh
Member

fuyufjh commented Jan 24, 2024

@fuyufjh
Member

fuyufjh commented Jan 24, 2024

> Meanwhile, there is a performance degradation for bulk insert

The problem has persisted over the past few days. We need to take a look. cc @chenzl25

image

@fuyufjh
Member

fuyufjh commented Jan 24, 2024

@chenzl25 s3://test-useast1-mgmt-bucket-archiver/k8s/sysbench-daily-test-20240122/benchmark-risingwave-compute-c-1/6769d727-b756-4d44-b151-a3feecd4d668/2024-1-22/0/ https://s3.console.aws.amazon.com/s3/buckets/test-useast1-mgmt-bucket-archiver?region=us-east-1&bucketType=general&prefix=k8s/sysbench-daily-test-20240122/benchmark-risingwave-compute-c-1/6769d727-b756-4d44-b151-a3feecd4d668/2024-1-22/0/&showversions=false

Unfortunately, it didn't capture the moment of the OOM.

Did we enable the automatic heap dump that is triggered by memory usage? There are no auto dump files in the S3 bucket. cc @cyliu0

@cyliu0
Collaborator Author

cyliu0 commented Jan 24, 2024

> Did we enable the memory-usage-triggering automatic dump? There is no auto dump files in S3 bucket. cc. @cyliu0

Yeah. It's enabled by default: ENABLE_MEMORY_PROFILING="true"

@chenzl25
Contributor

I reproduced a heap dump flame graph from another cluster. How long does this test run before the OOM, @cyliu0?
flamegraph

@cyliu0
Collaborator Author

cyliu0 commented Jan 25, 2024

@chenzl25 According to the buildkite output below, the job started at 21:36:47 and the OOM happened at 21:36:52, so it had run for 5 seconds before the OOM.
image
image

@cyliu0
Collaborator Author

cyliu0 commented Jan 26, 2024

nightly-20240125 hit OOM again: https://buildkite.com/risingwave-test/sysbench/builds/665#018d426c-c33c-4ad0-bf31-3b8452793391

This time there are auto dump files:

aws s3 ls s3://test-useast1-mgmt-bucket-archiver/k8s/sysbench-daily-test-20240125/benchmark-risingwave-compute-c-0/39b49e8a-5c15-44f2-ac49-4428581900f7/2024-1-25/0/
2024-01-26 05:23:39       7615 1706217818-heap.1.0.i0.heap.gz
2024-01-26 05:23:39       7994 1706217818-heap.1.0.i0.heap.json
2024-01-26 05:37:04      15379 1706218623-heap.1.1.i1.heap.gz
2024-01-26 05:37:04       7994 1706218623-heap.1.1.i1.heap.json
2024-01-26 05:38:47      16853 1706218726-2024-01-25-21-38-46.auto.heap.gz
2024-01-26 05:38:47       7994 1706218726-2024-01-25-21-38-46.auto.heap.json
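
To line these dumps up against the OOM time, the listing can be parsed with a short Python sketch like the one below. It assumes the leading number in each file name is a Unix timestamp in seconds, which is consistent with the last file (`1706218726` corresponds to the `2024-01-25-21-38-46` in its name, read as UTC). The `parse_listing` helper is hypothetical and not part of any RisingWave tooling.

```python
import re
from datetime import datetime, timezone

# Output of `aws s3 ls` copied from the comment above.
LISTING = """\
2024-01-26 05:23:39       7615 1706217818-heap.1.0.i0.heap.gz
2024-01-26 05:23:39       7994 1706217818-heap.1.0.i0.heap.json
2024-01-26 05:37:04      15379 1706218623-heap.1.1.i1.heap.gz
2024-01-26 05:37:04       7994 1706218623-heap.1.1.i1.heap.json
2024-01-26 05:38:47      16853 1706218726-2024-01-25-21-38-46.auto.heap.gz
2024-01-26 05:38:47       7994 1706218726-2024-01-25-21-38-46.auto.heap.json
"""


def parse_listing(text):
    """Return one dict per listed object with its dump time in UTC."""
    entries = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or unexpected lines
        _date, _time, size, key = parts
        match = re.match(r"(\d+)-", key)
        if match is None:
            continue
        # Assumption: the numeric prefix is a Unix timestamp in seconds.
        dumped_at = datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
        entries.append(
            {
                "key": key,
                "size": int(size),
                "dumped_at_utc": dumped_at,
                "auto": ".auto." in key,
            }
        )
    return entries


if __name__ == "__main__":
    for entry in parse_listing(LISTING):
        if entry["key"].endswith(".heap.gz"):
            kind = "auto" if entry["auto"] else "manual"
            print(entry["dumped_at_utc"].isoformat(), kind, entry["size"], entry["key"])
```

Under that assumption, the auto dump was taken at 21:38:46 UTC on 2024-01-25, and the upload times shown by `aws s3 ls` are eight hours ahead of the embedded timestamps, which suggests the listing was printed in a UTC+8 local time zone.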


7 participants