
2024-02-18 Nexmark performance regression Source Throughput Imbalance #15142

Open
lmatz opened this issue Feb 20, 2024 · 9 comments
@lmatz added the help wanted, component/streaming, and type/perf labels on Feb 20, 2024
@github-actions bot added this to the release-1.7 milestone on Feb 20, 2024

lmatz commented Feb 20, 2024

https://risingwave-labs.slack.com/archives/C04R6R5236C/p1708300803851039

(screenshot: SCR-20240220-fmd)

These Nexmark runs use a setting with 1 machine for the CN and 1 machine for the compactor.


lmatz commented Feb 20, 2024

For nexmark-q0-blackhole-4x-medium-1cn-affinity, aka the 4X scaling-up setting: http://metabase.risingwave-cloud.xyz/question/11046-nexmark-q0-blackhole-4x-medium-1cn-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-2730?start_date=2024-01-11

02-10: (screenshot: SCR-20240220-ify)

02-17: (screenshot: SCR-20240220-ig7)

For nexmark-q0-blackhole-medium-4cn-1node-affinity, aka the scaling-out setting: http://metabase.risingwave-cloud.xyz/question/11043-nexmark-q0-blackhole-medium-4cn-1node-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-2543?start_date=2024-01-11

02-10: (screenshot: SCR-20240220-iru)

02-17: (screenshot: SCR-20240220-is1)

Scaling out shows an additional problem: CPU usage is uneven across the compute nodes. Looking into it.

But anyway, for the baseline 1cn setting:
http://metabase.risingwave-cloud.xyz/question/36-nexmark-q0-blackhole-medium-1cn-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-169?start_date=2023-08-28

02-10 and 02-17 show no difference.
Besides, q0 is a stateless query that does no computation.

Therefore, considering these two factors, I suspect something in the testing environment is causing this regression. cc: @huangjw806

That said, we cannot completely rule out a root cause in the kernel. Looking into it.


lmatz commented Feb 20, 2024

Just triggered a test with nightly-20240210 to verify whether it is a kernel problem or an environment problem:

https://buildkite.com/risingwave-test/nexmark-benchmark/builds/3084

this is the scaling-out setting:
http://metabase.risingwave-cloud.xyz/question/11043-nexmark-q0-blackhole-medium-4cn-1node-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-2543?start_date=2024-01-11

It does not seem to be a kernel issue. cc: @huangjw806

Both nightly-20240217 and the new ad-hoc nightly-20240210 run I triggered are slower than before: about 3M rows/s versus 3.7M rows/s previously.


lmatz commented Feb 20, 2024

However, I just triggered another test with nightly-20240210 for the scaling-up setting:
http://metabase.risingwave-cloud.xyz/question/12347-nexmark-q0-blackhole-4x-medium-1cn-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-2730?start_date=2024-01-21

https://buildkite.com/risingwave-test/nexmark-benchmark/builds/3086

The throughput goes up to the previous stable number again.

I am confused......

But I think both the environment and the kernel are worth investigating.

Note that both settings once reached an even higher number and then fell back.

@huangjw806

> if there is anything in the testing environment that leads to this regression?

It looks like the test environment is no different.


st1page commented Feb 20, 2024

> For nexmark-q0-blackhole-4x-medium-1cn-affinity, aka the 4X scaling-up setting:

It is because of imbalanced consumption across splits. You can see in the left graph that some splits have higher throughput and their events were consumed very early, so in the second half of the test there are not enough active splits left to reach the maximum throughput.

(screenshot: per-split source throughput)

Maybe with more CN resources, Kafka's bottleneck (AWS EBS) becomes more significant.

Maybe related: #5214
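To make the effect concrete, here is a toy simulation (hypothetical split sizes and rates, not RisingWave code): every split holds the same number of events, but once the per-split consumption rates are skewed, the fast splits drain early and the aggregate throughput drops for the rest of the run, dragging down the average the benchmark reports.

```python
# Toy model of split consumption imbalance; all numbers below are made up.

def aggregate_throughput(events_per_split, rates, dt=1.0, horizon=400):
    """Total rows/s over time; a split contributes only while it still has data."""
    remaining = list(events_per_split)
    series = []
    for _ in range(int(horizon / dt)):
        step_total = 0.0
        for i, rate in enumerate(rates):
            if remaining[i] > 0:
                consumed = min(remaining[i], rate * dt)
                remaining[i] -= consumed
                step_total += consumed
        series.append(step_total / dt)
    return series

# Four splits with the same amount of data each.
events = [2_000_000] * 4
balanced = aggregate_throughput(events, rates=[10_000] * 4)
skewed = aggregate_throughput(events, rates=[16_000, 14_000, 6_000, 4_000])

# Both runs start at 40k rows/s, but the skewed one drains its two fast splits
# around t=125-145s and limps along at ~10k rows/s afterwards.
print("balanced: t=50s ->", balanced[50], " t=150s ->", balanced[150])
print("skewed:   t=50s ->", skewed[50], " t=150s ->", skewed[150])
```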


lmatz commented Feb 20, 2024

Yeah, does the uneven CPU usage across the compute nodes under the scaling-out setting imply that the number of splits assigned to each compute node is uneven?

Marking it high-priority, as it may make a lot of other evaluations difficult to reason about.
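One way to check the Kafka side of this question is to compare how much data each partition of the source topic actually holds: if the generator fills partitions unevenly, per-split throughput will be uneven no matter how splits are assigned to compute nodes. A hedged diagnostic sketch (the broker address and topic name are assumptions, not taken from the benchmark config):

```python
# Diagnostic sketch, not part of the benchmark harness: report how many events
# each partition of the Nexmark topic holds, to spot skew at the Kafka level.
from confluent_kafka import Consumer, TopicPartition

BROKER = "localhost:9092"   # assumption: the benchmark's Kafka bootstrap server
TOPIC = "nexmark-bid"       # assumption: the Nexmark bid topic name

consumer = Consumer({"bootstrap.servers": BROKER, "group.id": "partition-skew-probe"})
metadata = consumer.list_topics(TOPIC, timeout=10)

sizes = {}
for pid in sorted(metadata.topics[TOPIC].partitions):
    # Low/high watermarks bound the offsets currently stored in the partition.
    low, high = consumer.get_watermark_offsets(TopicPartition(TOPIC, pid), timeout=10)
    sizes[pid] = high - low

consumer.close()

avg = sum(sizes.values()) / len(sizes)
for pid, n in sizes.items():
    print(f"partition {pid}: {n} events ({n / avg:.2f}x of the average)")
```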

@lmatz changed the title from "2024-02-18 Nexmark performance regression" to "2024-02-18 Nexmark performance regression Source Throughput Imbalance" on Feb 21, 2024

lmatz commented Feb 21, 2024

Just discussed this with @huangjw806 this afternoon:

The current machine setting:
RW's CN and compactor run on 32c64g (c6i.8xlarge), while Kafka runs on 8c16g (c6i.2xlarge).
(screenshot: SCR-20240221-g7f)

Note that the network bandwidth of the Kafka machine is "up to 12.5" Gbps rather than a sustained 12.5 Gbps. Per my understanding, there are limitations on peak bandwidth: https://stackoverflow.com/questions/71443685/meaning-of-up-to-10-gbps-bandwidth-in-ec2-instances. The burst bandwidth can only be sustained for a certain number of minutes, or under some other opaque rules.

Consider that the peak throughput we get from RW on the dashboard is around 1500 MB/s (sometimes a little over 1500 MB/s), i.e. about 12 Gbps.
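For reference, the back-of-the-envelope numbers behind this concern (a minimal sketch; the sustained-baseline figure is my reading of the AWS c6i specs and should be treated as an assumption):

```python
# Headroom check: the observed source throughput is essentially at the
# c6i.2xlarge's "up to 12.5 Gbps" burst ceiling.
observed_mb_per_s = 1500                      # peak RW source throughput (dashboard)
observed_gbps = observed_mb_per_s * 8 / 1000  # 1500 MB/s ~= 12.0 Gbps

burst_gbps = 12.5      # c6i.2xlarge "up to" network bandwidth
baseline_gbps = 3.125  # assumed c6i.2xlarge sustained baseline (not a measured number)

print(f"observed {observed_gbps:.1f} Gbps vs burst cap {burst_gbps} Gbps "
      f"-> only {burst_gbps - observed_gbps:.1f} Gbps of headroom while bursting")
print(f"assumed sustained baseline {baseline_gbps} Gbps "
      f"-> far below the observed peak once burst credits run out")
```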

We want to rule out the possibility that the imbalance is due to the "up to 12.5 Gbps" limitation.

@huangjw806 is helping get new numbers by upgrading the Kafka machine from c6i.2xlarge to c6i.8xlarge as well.

