
Longevity test CN and Meta OOM nightly-20240201 #14944

Closed
huangjw806 opened this issue Feb 2, 2024 · 7 comments

@huangjw806
Contributor

================================================================================
longevity-test Result
================================================================================
Result               FAIL                
Pipeline Message     @Nightly run all nexmark (8 sets of nexmark queries) with 10k throughput
Namespace            reglngvty-20240201-150245
TestBed              medium-arm-3cn-all-affinity
RW Version           nightly-20240201    
Test Start time      2024-02-01 15:12:42 
Test End time        2024-02-02 03:14:56 
Test Queries         nexmark_q0,nexmark_q1,nexmark_q2,nexmark_q3,nexmark_q4,nexmark_q5,nexmark_q6_group_top1,nexmark_q7,nexmark_q8,nexmark_q9,nexmark_q10,nexmark_q12,nexmark_q14,nexmark_q15,nexmark_q16,nexmark_q17,nexmark_q18,nexmark_q19,nexmark_q20,nexmark_q21,nexmark_q22,nexmark_q101,nexmark_q102,nexmark_q103,nexmark_q104,nexmark_q105
Grafana Metric       https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard?orgId=1&var-datasource=Prometheus:%20test-useast1-eks-a&var-namespace=reglngvty-20240201-150245&from=1706800362000&to=1706843696000
Grafana Logs         https://grafana.test.risingwave-cloud.xyz/d/liz0yRCZz1/log-search-dashboard?orgId=1&var-data_source=Logging:%20test-useast1-eks-a&var-namespace=reglngvty-20240201-150245&from=1706800362000&to=1706843696000
Memory Dumps         https://s3.console.aws.amazon.com/s3/buckets/test-useast1-mgmt-bucket-archiver?region=us-east-1&bucketType=general&prefix=k8s/reglngvty-20240201-150245/&showversions=false
Buildkite Job        https://buildkite.com/risingwave-test/longevity-test/builds/1048

================================================================================
Restarted/Crashed Pods Details
================================================================================
Pod crashed/Restarted: benchmark-risingwave-compute-c-0 restart_count:1  phase:Running status:True
Pod crashed/Restarted: benchmark-risingwave-meta-m-0 restart_count:3  phase:Running status:True
@huangjw806 added the type/bug label on Feb 2, 2024
@github-actions added this to the release-1.7 milestone on Feb 2, 2024
@yezizp2012
Member

yezizp2012 commented Feb 2, 2024

Meta was OOM-killed during recovery right after compute-0 restarted. I will take a look later.

@fuyufjh
Member

fuyufjh commented Feb 2, 2024

I also noticed that the average source throughput (5.52 MB/s) is worse than in nightly-20240107 (6.21 MB/s, BuildKite), but this might be caused by the compaction result check.

@yezizp2012
Member


For the meta OOM part: I found that it is caused by auto scaling, which is now enabled by default. Currently, when checking and generating scale plans at the beginning of recovery, auto scaling lists all table fragments twice, which leads to a threefold increase in memory allocation for this step. @shanicky is writing a PR to avoid these two copies and to add some necessary checks to fix this issue.

FYI: you can find all the memory dump files, including those for the compute node, here:
https://buildkite.com/risingwave-test/generate-collapsed-heap-files/builds/32#018d68d3-1d92-4ebb-afe7-d420d57ee672
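To make the allocation pattern concrete, here is a minimal sketch in Rust. All names (`TableFragments`, `MetaCatalog`, `list_table_fragments`, `check_scale_plan`, `generate_scale_plan`) are hypothetical stand-ins rather than the actual meta-service API; the sketch only illustrates why listing all table fragments twice during recovery roughly triples peak memory for this step, and why listing once (or borrowing from the catalog) avoids it.

```rust
// Hypothetical sketch, not the real RisingWave API: illustrates the extra
// allocations caused by listing all table fragments twice during recovery.

#[derive(Clone)]
struct TableFragments {
    // Stand-in for the real fragment graph / actor placement payload.
    actors: Vec<u64>,
}

struct MetaCatalog {
    table_fragments: Vec<TableFragments>,
}

impl MetaCatalog {
    // Each call materializes a full copy of every table's fragments.
    fn list_table_fragments(&self) -> Vec<TableFragments> {
        self.table_fragments.clone()
    }
}

fn check_scale_plan(_fragments: &[TableFragments]) { /* consistency checks */ }
fn generate_scale_plan(_fragments: &[TableFragments]) { /* build reschedule plan */ }

// Problematic shape: two independent listings keep two extra full copies
// alive on top of the catalog itself (~3x peak memory for this step).
fn recovery_with_double_listing(catalog: &MetaCatalog) {
    let for_check = catalog.list_table_fragments(); // copy #1
    check_scale_plan(&for_check);
    let for_plan = catalog.list_table_fragments(); // copy #2, while #1 is still live
    generate_scale_plan(&for_plan);
}

// Fixed shape: borrow the catalog's fragments once and reuse them.
fn recovery_with_single_listing(catalog: &MetaCatalog) {
    check_scale_plan(&catalog.table_fragments);
    generate_scale_plan(&catalog.table_fragments);
}

fn main() {
    let catalog = MetaCatalog {
        table_fragments: vec![TableFragments { actors: vec![1, 2, 3] }],
    };
    recovery_with_double_listing(&catalog);
    recovery_with_single_listing(&catalog);
}
```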

@TennyZhuang
Contributor

TennyZhuang commented Feb 2, 2024

This looks similar to #14613; I don't think #14855 fixed it.

@TennyZhuang
Contributor

Current progress: we found that the memory usage reported by jemalloc's metrics is 2-3 GB smaller than the memory usage observed at the cluster node level.

@TennyZhuang
Contributor

avg_over_time(jemalloc_active_bytes{namespace=~"$namespace",risingwave_name=~"$instance",risingwave_component=~"$component",pod=~"$pod"}[30m])
[chart: smoothed jemalloc_active_bytes]
avg_over_time(jemalloc_resident_bytes{namespace=~"$namespace",risingwave_name=~"$instance",risingwave_component=~"$component",pod=~"$pod"}[30m])
[chart: smoothed jemalloc_resident_bytes]

After smoothing, we found that jemalloc_resident_bytes grows much faster than jemalloc_active_bytes. This may be a clue; I am trying to enable jemalloc's background_thread option.
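For reference, a minimal sketch of how one might turn on jemalloc background threads from a Rust binary, assuming the tikv-jemallocator crate is the global allocator. With background_thread enabled, jemalloc purges dirty pages asynchronously, which should pull resident memory back toward active memory. Note that depending on the crate's prefix features, the config symbol may be `_rjem_malloc_conf` rather than `malloc_conf`, and the same options can instead be passed via the `MALLOC_CONF` / `_RJEM_MALLOC_CONF` environment variable.

```rust
// Sketch assuming tikv-jemallocator as the global allocator, e.g.:
// [dependencies] tikv-jemallocator = "0.5"
// The exact config symbol / env var name depends on the crate's prefix features.
use tikv_jemallocator::Jemalloc;

#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

// jemalloc reads this symbol once at startup. `background_thread:true` makes
// jemalloc spawn background threads that purge dirty pages on a schedule,
// so resident memory tracks active memory more closely.
#[allow(non_upper_case_globals)]
#[export_name = "malloc_conf"]
pub static malloc_conf: &[u8] = b"background_thread:true\0";

fn main() {
    // Allocate and drop a large buffer; with background threads enabled the
    // freed pages are returned to the OS even if the application makes no
    // further allocator calls afterwards.
    let buf: Vec<u8> = vec![0u8; 64 << 20];
    drop(buf);
}
```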


This issue has been open for 60 days with no activity. Could you please update the status? Feel free to continue discussion or close as not planned.

@xxchan closed this as not planned on Aug 5, 2024