
Longevity test CN and Meta OOM nightly-20240201 #14944

Closed
huangjw806 opened this issue Feb 2, 2024 · 7 comments

@huangjw806
Contributor

================================================================================
longevity-test Result
================================================================================
Result               FAIL                
Pipeline Message     @Nightly run all nexmark (8 sets of nexmark queries) with 10k throughput
Namespace            reglngvty-20240201-150245
TestBed              medium-arm-3cn-all-affinity
RW Version           nightly-20240201    
Test Start time      2024-02-01 15:12:42 
Test End time        2024-02-02 03:14:56 
Test Queries         nexmark_q0,nexmark_q1,nexmark_q2,nexmark_q3,nexmark_q4,nexmark_q5,nexmark_q6_group_top1,nexmark_q7,nexmark_q8,nexmark_q9,nexmark_q10,nexmark_q12,nexmark_q14,nexmark_q15,nexmark_q16,nexmark_q17,nexmark_q18,nexmark_q19,nexmark_q20,nexmark_q21,nexmark_q22,nexmark_q101,nexmark_q102,nexmark_q103,nexmark_q104,nexmark_q105
Grafana Metric       https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard?orgId=1&var-datasource=Prometheus:%20test-useast1-eks-a&var-namespace=reglngvty-20240201-150245&from=1706800362000&to=1706843696000
Grafana Logs         https://grafana.test.risingwave-cloud.xyz/d/liz0yRCZz1/log-search-dashboard?orgId=1&var-data_source=Logging:%20test-useast1-eks-a&var-namespace=reglngvty-20240201-150245&from=1706800362000&to=1706843696000
Memory Dumps         https://s3.console.aws.amazon.com/s3/buckets/test-useast1-mgmt-bucket-archiver?region=us-east-1&bucketType=general&prefix=k8s/reglngvty-20240201-150245/&showversions=false
Buildkite Job        https://buildkite.com/risingwave-test/longevity-test/builds/1048

================================================================================
Restarted/Crashed Pods Details
================================================================================
Pod crashed/Restarted: benchmark-risingwave-compute-c-0 restart_count:1  phase:Running status:True
Pod crashed/Restarted: benchmark-risingwave-meta-m-0 restart_count:3  phase:Running status:True
@huangjw806 added the type/bug label on Feb 2, 2024
@github-actions added this to the release-1.7 milestone on Feb 2, 2024
@yezizp2012
Member

yezizp2012 commented Feb 2, 2024

Meta was OOM-killed during recovery right after compute-0 restarted. I will take a look later.

@fuyufjh
Member

fuyufjh commented Feb 2, 2024

I also noticed that the average source throughput (5.52 MB/s) is worse than in nightly-20240107 (6.21 MB/s, BuildKite), but this might be caused by the compaction result check.

@yezizp2012
Member


For the meta OOM part: I found that it is caused by auto scaling, which is now enabled by default. Currently, when checking and generating scale plans at the beginning of recovery, auto scaling lists all table fragments twice, which leads to a threefold increase in memory allocation for this step. @shanicky is writing a PR to avoid these two copies and to add some necessary checks to fix this issue.

FYI: you can find all the memory dump files, including those for the compute node, here:
https://buildkite.com/risingwave-test/generate-collapsed-heap-files/builds/32#018d68d3-1d92-4ebb-afe7-d420d57ee672
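To make the allocation pattern concrete, here is a minimal sketch in Rust. All names (`TableFragments`, `MetaCatalog`, `list_table_fragments`, `check_scale_plan`, `generate_scale_plan`) are hypothetical stand-ins rather than the actual meta-service API; the sketch only illustrates why listing all table fragments twice during recovery roughly triples peak memory for this step, and why listing once (or borrowing from the catalog) avoids it.

```rust
// Hypothetical sketch, not the real RisingWave API: illustrates the extra
// allocations caused by listing all table fragments twice during recovery.

#[derive(Clone)]
struct TableFragments {
    // Stand-in for the real fragment graph / actor placement payload.
    actors: Vec<u64>,
}

struct MetaCatalog {
    table_fragments: Vec<TableFragments>,
}

impl MetaCatalog {
    // Each call materializes a full copy of every table's fragments.
    fn list_table_fragments(&self) -> Vec<TableFragments> {
        self.table_fragments.clone()
    }
}

fn check_scale_plan(_fragments: &[TableFragments]) { /* consistency checks */ }
fn generate_scale_plan(_fragments: &[TableFragments]) { /* build reschedule plan */ }

// Problematic shape: two independent listings keep two extra full copies
// alive on top of the catalog itself (~3x peak memory for this step).
fn recovery_with_double_listing(catalog: &MetaCatalog) {
    let for_check = catalog.list_table_fragments(); // copy #1
    check_scale_plan(&for_check);
    let for_plan = catalog.list_table_fragments(); // copy #2, while #1 is still live
    generate_scale_plan(&for_plan);
}

// Fixed shape: borrow the catalog's fragments once and reuse them.
fn recovery_with_single_listing(catalog: &MetaCatalog) {
    check_scale_plan(&catalog.table_fragments);
    generate_scale_plan(&catalog.table_fragments);
}

fn main() {
    let catalog = MetaCatalog {
        table_fragments: vec![TableFragments { actors: vec![1, 2, 3] }],
    };
    recovery_with_double_listing(&catalog);
    recovery_with_single_listing(&catalog);
}
```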

@TennyZhuang
Contributor

TennyZhuang commented Feb 2, 2024

This looks similar to #14613; I don't think #14855 fixed it.

@TennyZhuang
Contributor

Current progress: we found that the memory usage reported by jemalloc's metrics is 2-3 GB smaller than the memory usage observed at the cluster node level.

@TennyZhuang
Contributor

avg_over_time(jemalloc_active_bytes{namespace=~"$namespace",risingwave_name=~"$instance",risingwave_component=~"$component",pod=~"$pod"}[30m])
[chart: smoothed jemalloc_active_bytes]
avg_over_time(jemalloc_resident_bytes{namespace=~"$namespace",risingwave_name=~"$instance",risingwave_component=~"$component",pod=~"$pod"}[30m])
[chart: smoothed jemalloc_resident_bytes]

After smoothing, we found that jemalloc_resident_bytes grows much faster than jemalloc_active_bytes. This may be a clue; I am trying to enable jemalloc's background_thread option.
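For reference, a minimal sketch of how one might turn on jemalloc background threads from a Rust binary, assuming the tikv-jemallocator crate is the global allocator. With background_thread enabled, jemalloc purges dirty pages asynchronously, which should pull resident memory back toward active memory. Note that depending on the crate's prefix features, the config symbol may be `_rjem_malloc_conf` rather than `malloc_conf`, and the same options can instead be passed via the `MALLOC_CONF` / `_RJEM_MALLOC_CONF` environment variable.

```rust
// Sketch assuming tikv-jemallocator as the global allocator, e.g.:
// [dependencies] tikv-jemallocator = "0.5"
// The exact config symbol / env var name depends on the crate's prefix features.
use tikv_jemallocator::Jemalloc;

#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

// jemalloc reads this symbol once at startup. `background_thread:true` makes
// jemalloc spawn background threads that purge dirty pages on a schedule,
// so resident memory tracks active memory more closely.
#[allow(non_upper_case_globals)]
#[export_name = "malloc_conf"]
pub static malloc_conf: &[u8] = b"background_thread:true\0";

fn main() {
    // Allocate and drop a large buffer; with background threads enabled the
    // freed pages are returned to the OS even if the application makes no
    // further allocator calls afterwards.
    let buf: Vec<u8> = vec![0u8; 64 << 20];
    drop(buf);
}
```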


This issue has been open for 60 days with no activity. Could you please update the status? Feel free to continue discussion or close as not planned.

@xxchan closed this as not planned on Aug 5, 2024