nightly-20240518 weekly test perf degradation #16817

Closed
cyliu0 opened this issue May 20, 2024 · 9 comments

Comments

@cyliu0
Collaborator

cyliu0 commented May 20, 2024

Describe the bug

Perf degradation in the weekly test. These two SKUs run only in the weekly test now.

https://buildkite.com/risingwave-test/tpch-benchmark/builds/1074

https://buildkite.com/risingwave-test/nexmark-benchmark/builds/3708

http://metabase.risingwave-cloud.xyz/question/4966-tpch-q8-bs-medium-1cn-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-371?start_date=2023-09-24

http://metabase.risingwave-cloud.xyz/question/9112-nexmark-q7-rewrite-blackhole-4x-medium-1cn-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-2808?start_date=2023-12-21

+----------------------------------------------------------+--------------+------------+-----------------------------------+---------------------+-----------------------------+-------------------------------+
| BENCHMARK NAME                                           | EXECUTION ID | STATUS     | KEY METRICS                       | FLUCTUATION OF BEST | FLUCTUATION OF LAST 10 DAYS | FLUCTUATION OF LAST EXECUTION |
+----------------------------------------------------------+--------------+------------+-----------------------------------+---------------------+-----------------------------+-------------------------------+
| nexmark-q7-rewrite-blackhole-4x-medium-1cn-affinity      |        28800 | Negative   | avg-source-output-rows-per-second | -28.78%             | -15.22%                     | -26.42%                       |
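| tpch-q8-bs-medium-1cn-affinity                           |        28826 | Negative   | avg-source-output-rows-per-second | -49.35%             | -29.35%                     | -43.13%                       |
+----------------------------------------------------------+--------------+------------+-----------------------------------+---------------------+-----------------------------+-------------------------------+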
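
For reference, here is a minimal sketch of how fluctuation columns like the ones above could be computed. It assumes each column is the relative change of the current throughput against a baseline (the best run, the mean of the last 10 runs, or the previous run). The benchmark tool's actual formula is not shown in this issue, and the function names `fluctuation` and `summarize` are hypothetical.

```python
from statistics import mean


def fluctuation(current: float, baseline: float) -> float:
    """Relative change of `current` against `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (current - baseline) / baseline * 100.0


def summarize(history: list[float], current: float) -> dict[str, float]:
    """Compare the current throughput with three assumed baselines.

    `history` holds earlier throughput values (rows/s), oldest first.
    """
    if not history:
        raise ValueError("history must not be empty")
    return {
        "best": fluctuation(current, max(history)),
        # Assumption: "last 10 days" is approximated by the last 10 samples.
        "last_10": fluctuation(current, mean(history[-10:])),
        "last": fluctuation(current, history[-1]),
    }
```

Under these assumptions, a negative value means the current run is slower than the baseline it is compared with.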

@cyliu0 cyliu0 added the type/bug and type/perf labels May 20, 2024
@github-actions github-actions bot added this to the release-1.10 milestone May 20, 2024
@st1page
Contributor

st1page commented May 20, 2024

For nexmark-q7-rewrite-blackhole-4x-medium-1cn-affinity, this seems to be the same issue as #15142: there is greater imbalance on 5.18 than on 5.12.
The tpch-q8 degradation could be due to other issues.

5.12:
image

5.18:
image

cc @lmatz

@st1page
Contributor

st1page commented May 20, 2024

No conclusion has been reached yet on the cause of the TPCH Q8 performance degradation.
Current observations:

  1. From the perspective of backpressure, the bottleneck occurs at a certain append-only hash join
  2. Both operator cache and block cache show higher miss rates on the 18th compared to the 12th.

rerun 0518 (slow): https://buildkite.com/risingwave-test/tpch-benchmark/builds/1075
test 0514 (fast): https://buildkite.com/risingwave-test/tpch-benchmark/builds/1076
test 0515 ( ) https://buildkite.com/risingwave-test/tpch-benchmark/builds/1077
test 0516 ( ) https://buildkite.com/risingwave-test/tpch-benchmark/builds/1078

@lmatz
Contributor

lmatz commented May 20, 2024

For nexmark q7, the network bandwidth between RW and Kafka is not the same:

Previously:
SCR-20240520-l4q

This time:
SCR-20240520-l4t

https://buildkite.com/risingwave-test/nexmark-benchmark/builds/3708


Name:             benchmark-kafka-0
Namespace:        nexmark-ht-4x-1cn-affinity-weekly-20240518
Command:
  /scripts/setup.sh
State:            Running
  Started:        Sat, 18 May 2024 17:04:06 +0000
Ready:            True
Restart Count:    0
Limits:
  cpu:     8
  memory:  13Gi
Requests:
  cpu:     7
  memory:  13Gi

Hmm, there is a slight chance that Kafka does not have enough resources, although Kafka should be I/O bound rather than CPU bound.

Or the machine is not large enough, so only an "unstable" "up to 12.5Gbps" bandwidth can be achieved; see #15142 (comment)
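
To put the "up to 12.5Gbps" figure in perspective, a rough back-of-the-envelope sketch of the throughput ceiling it implies. The average event size used below (200 bytes) is a made-up value for illustration, not a measured Nexmark number, and the estimate ignores protocol overhead, compression, and any burst limits.

```python
def max_rows_per_second(bandwidth_gbps: float, avg_row_bytes: float) -> float:
    """Upper bound on rows/s that a link of the given bandwidth can carry.

    bandwidth_gbps: link bandwidth in gigabits per second (10^9 bits/s).
    avg_row_bytes:  assumed average size of one event on the wire, in bytes.
    """
    if avg_row_bytes <= 0:
        raise ValueError("avg_row_bytes must be positive")
    bytes_per_second = bandwidth_gbps * 1e9 / 8  # bits -> bytes
    return bytes_per_second / avg_row_bytes


if __name__ == "__main__":
    # Hypothetical 200-byte events over a 12.5 Gbps link.
    print(f"{max_rows_per_second(12.5, 200):,.0f} rows/s")
```

If the observed source throughput is close to such a ceiling, the network is a plausible bottleneck; if it is far below, the cause is more likely elsewhere.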

@st1page
Contributor

st1page commented May 20, 2024

Overall, the images from the 15th and 16th show some performance fluctuation, but I believe the main performance drop is due to a change in nightly-20240517. https://github.com/risingwavelabs/rw-commits-history?tab=readme-ov-file#nightly-20240517

http://metabase.risingwave-cloud.xyz/question/4966-tpch-q8-bs-medium-1cn-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-371?start_date=2024-05-20

@st1page
Contributor

st1page commented May 20, 2024

Unfortunately, the degradation appears randomly: it can happen with the 0516 image and yet not happen with the 0517 image.
http://metabase.risingwave-cloud.xyz/question/4966-tpch-q8-bs-medium-1cn-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-371?start_date=2024-05-20
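
Since the degradation shows up randomly, one way to tell a real regression from run-to-run noise is to repeat each image several times and compare the drop in mean throughput with the spread of the baseline runs. This is a minimal sketch of that idea; the threshold `k` and the sample numbers in the usage note are assumptions, not values taken from these benchmarks.

```python
from statistics import mean, stdev


def coefficient_of_variation(samples: list[float]) -> float:
    """Sample standard deviation divided by the mean (needs >= 2 samples)."""
    return stdev(samples) / mean(samples)


def looks_like_regression(
    baseline: list[float], candidate: list[float], k: float = 2.0
) -> bool:
    """True if the candidate mean is lower than the baseline mean
    by more than k baseline standard deviations."""
    drop = mean(baseline) - mean(candidate)
    return drop > k * stdev(baseline)
```

For example, with a stable baseline such as `[100, 102, 98, 101, 99]`, candidate runs around 70 would be flagged, while with a baseline that already swings between 55 and 100, a candidate mean of 75 would not.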

@st1page
Contributor

st1page commented May 21, 2024

The unstable degradation of q8 happens on nightly-20240512 too.
image

@fuyufjh
Member

fuyufjh commented Jul 10, 2024

cc @cyliu0 Does this problem still occur?

@fuyufjh fuyufjh closed this as not planned Jul 10, 2024
@st1page
Contributor

st1page commented Jul 10, 2024

It appears that the streaming performance of TPCH-Q8 is quite unstable. Until we focus on optimizing streaming performance for TPCH, we will not investigate this in depth.

@cyliu0
Collaborator Author

cyliu0 commented Jul 11, 2024

As @st1page said, the tpch q8 fluctuation is quite big overall. In the weekly test, though, it's relatively stable, and the low points are the release tests for v1.9.2-rt.1 and v1.10.0-rt.1.
image
