
2024-02-18 Nexmark performance regression Source Throughput Imbalance #15142

Open
lmatz opened this issue Feb 20, 2024 · 9 comments
@lmatz added the help wanted, component/streaming, and type/perf labels on Feb 20, 2024
@github-actions bot added this to the release-1.7 milestone on Feb 20, 2024

lmatz commented Feb 20, 2024

https://risingwave-labs.slack.com/archives/C04R6R5236C/p1708300803851039

(screenshot: SCR-20240220-fmd)

These Nexmark runs use a setting with 1 machine for the CN and 1 machine for the compactor.


lmatz commented Feb 20, 2024

For nexmark-q0-blackhole-4x-medium-1cn-affinity, aka the 4X scaling-up setting: http://metabase.risingwave-cloud.xyz/question/11046-nexmark-q0-blackhole-4x-medium-1cn-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-2730?start_date=2024-01-11

02-10: (screenshot: SCR-20240220-ify)

02-17: (screenshot: SCR-20240220-ig7)

For nexmark-q0-blackhole-medium-4cn-1node-affinity, aka the scaling-out setting: http://metabase.risingwave-cloud.xyz/question/11043-nexmark-q0-blackhole-medium-4cn-1node-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-2543?start_date=2024-01-11

02-10: (screenshot: SCR-20240220-iru)

02-17: (screenshot: SCR-20240220-is1)

Scaling out shows an additional problem: CPU usage is uneven across the compute nodes. Looking into it.

But anyway, for the baseline 1cn setting:
http://metabase.risingwave-cloud.xyz/question/36-nexmark-q0-blackhole-medium-1cn-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-169?start_date=2023-08-28

02-10 and 02-17 show no difference.
Besides, q0 is a stateless query that does no computation.

Therefore, considering these two factors, I suspect something in the testing environment is causing this regression. cc: @huangjw806

That said, we cannot completely rule out a root cause in the kernel. Looking into it.


lmatz commented Feb 20, 2024

Just triggered a test with nightly-20240210 to verify whether it is a kernel problem or an environment problem:

https://buildkite.com/risingwave-test/nexmark-benchmark/builds/3084

this is the scaling-out setting:
http://metabase.risingwave-cloud.xyz/question/11043-nexmark-q0-blackhole-medium-4cn-1node-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-2543?start_date=2024-01-11

It does not seem to be a kernel issue. cc: @huangjw806

Both nightly-20240217 and the new ad-hoc nightly-20240210 run I triggered are slower than before: about 3M rows/s versus 3.7M rows/s previously.


lmatz commented Feb 20, 2024

However, I just triggered another test with nightly-20240210 for the scaling-up setting:
http://metabase.risingwave-cloud.xyz/question/12347-nexmark-q0-blackhole-4x-medium-1cn-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-2730?start_date=2024-01-21

https://buildkite.com/risingwave-test/nexmark-benchmark/builds/3086

The throughput goes up to the previous stable number again.

I am confused......

But I think both the environment and the kernel are worth investigating.

Note that both settings once reached an even higher number and then fell back.

@huangjw806

> if there is anything in the testing environment that leads to this regression?

It looks like the test environment is no different.


st1page commented Feb 20, 2024

> For nexmark-q0-blackhole-4x-medium-1cn-affinity, aka the 4X scaling-up setting:

It is because of imbalanced consumption across splits. You can see in the left graph that some splits have higher throughput and their events were consumed very early, so in the second half of the test there are not enough active splits left to reach the maximum throughput.

(screenshot: per-split source throughput)

Maybe with more CN resources, Kafka's bottleneck (AWS EBS) becomes more significant.

Maybe related: #5214
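To make the effect concrete, here is a toy simulation (hypothetical split sizes and rates, not RisingWave code): every split holds the same number of events, but once the per-split consumption rates are skewed, the fast splits drain early and the aggregate throughput drops for the rest of the run, dragging down the average the benchmark reports.

```python
# Toy model of split consumption imbalance; all numbers below are made up.

def aggregate_throughput(events_per_split, rates, dt=1.0, horizon=400):
    """Total rows/s over time; a split contributes only while it still has data."""
    remaining = list(events_per_split)
    series = []
    for _ in range(int(horizon / dt)):
        step_total = 0.0
        for i, rate in enumerate(rates):
            if remaining[i] > 0:
                consumed = min(remaining[i], rate * dt)
                remaining[i] -= consumed
                step_total += consumed
        series.append(step_total / dt)
    return series

# Four splits with the same amount of data each.
events = [2_000_000] * 4
balanced = aggregate_throughput(events, rates=[10_000] * 4)
skewed = aggregate_throughput(events, rates=[16_000, 14_000, 6_000, 4_000])

# Both runs start at 40k rows/s, but the skewed one drains its two fast splits
# around t=125-145s and limps along at ~10k rows/s afterwards.
print("balanced: t=50s ->", balanced[50], " t=150s ->", balanced[150])
print("skewed:   t=50s ->", skewed[50], " t=150s ->", skewed[150])
```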


lmatz commented Feb 20, 2024

Yeah, does the uneven CPU usage across the compute nodes under the scaling-out setting imply that the number of splits assigned to each compute node is uneven?

Marking it high-priority, as it may make a lot of other evaluations difficult to reason about.
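One way to check the Kafka side of this question is to compare how much data each partition of the source topic actually holds: if the generator fills partitions unevenly, per-split throughput will be uneven no matter how splits are assigned to compute nodes. A hedged diagnostic sketch (the broker address and topic name are assumptions, not taken from the benchmark config):

```python
# Diagnostic sketch, not part of the benchmark harness: report how many events
# each partition of the Nexmark topic holds, to spot skew at the Kafka level.
from confluent_kafka import Consumer, TopicPartition

BROKER = "localhost:9092"   # assumption: the benchmark's Kafka bootstrap server
TOPIC = "nexmark-bid"       # assumption: the Nexmark bid topic name

consumer = Consumer({"bootstrap.servers": BROKER, "group.id": "partition-skew-probe"})
metadata = consumer.list_topics(TOPIC, timeout=10)

sizes = {}
for pid in sorted(metadata.topics[TOPIC].partitions):
    # Low/high watermarks bound the offsets currently stored in the partition.
    low, high = consumer.get_watermark_offsets(TopicPartition(TOPIC, pid), timeout=10)
    sizes[pid] = high - low

consumer.close()

avg = sum(sizes.values()) / len(sizes)
for pid, n in sizes.items():
    print(f"partition {pid}: {n} events ({n / avg:.2f}x of the average)")
```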

@lmatz changed the title from "2024-02-18 Nexmark performance regression" to "2024-02-18 Nexmark performance regression Source Throughput Imbalance" on Feb 21, 2024

lmatz commented Feb 21, 2024

Just discussed this with @huangjw806 this afternoon:

The current machine setting:
RW's CN and compactor run on 32c64g (c6i.8xlarge), while Kafka runs on 8c16g (c6i.2xlarge).
(screenshot: SCR-20240221-g7f)

Note that the network bandwidth of the Kafka machine is "up to 12.5" Gbps rather than a sustained 12.5 Gbps. Per my understanding, there are limitations on peak bandwidth: https://stackoverflow.com/questions/71443685/meaning-of-up-to-10-gbps-bandwidth-in-ec2-instances. The burst bandwidth can only be sustained for a certain number of minutes, or under some other opaque rules.

Consider that the peak throughput we get from RW on the dashboard is around 1500 MB/s (sometimes a little over 1500 MB/s), i.e. about 12 Gbps.
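For reference, the back-of-the-envelope numbers behind this concern (a minimal sketch; the sustained-baseline figure is my reading of the AWS c6i specs and should be treated as an assumption):

```python
# Headroom check: the observed source throughput is essentially at the
# c6i.2xlarge's "up to 12.5 Gbps" burst ceiling.
observed_mb_per_s = 1500                      # peak RW source throughput (dashboard)
observed_gbps = observed_mb_per_s * 8 / 1000  # 1500 MB/s ~= 12.0 Gbps

burst_gbps = 12.5      # c6i.2xlarge "up to" network bandwidth
baseline_gbps = 3.125  # assumed c6i.2xlarge sustained baseline (not a measured number)

print(f"observed {observed_gbps:.1f} Gbps vs burst cap {burst_gbps} Gbps "
      f"-> only {burst_gbps - observed_gbps:.1f} Gbps of headroom while bursting")
print(f"assumed sustained baseline {baseline_gbps} Gbps "
      f"-> far below the observed peak once burst credits run out")
```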

We want to rule out the possibility that the imbalance is due to the "up to 12.5 Gbps" limitation.

@huangjw806 is helping get new numbers by upgrading the Kafka machine from c6i.2xlarge to c6i.8xlarge as well.

