2024-02-18 Nexmark performance regression Source Throughput Imbalance #15142
Comments
https://risingwave-labs.slack.com/archives/C04R6R5236C/p1708300803851039
These Nexmark runs use 1 machine for the CN and 1 machine for the compactor.
The scaling-out setting shows an additional problem: CPU usage across the different compute nodes is uneven. Looking into it. But in any case, 02-10 and 02-17 show no difference. Considering these two factors, I am more suspicious that something in the testing environment is causing this regression. cc: @huangjw806. We cannot completely rule out a root cause in the kernel, though; looking into it.
Just triggered a test with https://buildkite.com/risingwave-test/nexmark-benchmark/builds/3084 (this is the scaling-out setting). It does not seem to be an issue with the kernel, cc: @huangjw806.
However, I just triggered another test with https://buildkite.com/risingwave-test/nexmark-benchmark/builds/3086, and the throughput goes back up to the previous stable number. I am confused... But I think both the environment and the kernel are worth investigating. Note that both settings once reached an even higher number and then fell back.
It looks like the test environment is no different.
It is because of imbalanced consumption across splits. As the left graph shows, some splits have higher throughput and their events are consumed very early, so in the second half of the test there are not enough active splits left to reach the peak throughput. Maybe with more CN resources, Kafka's bottleneck (AWS EBS) becomes more significant. Maybe related: #5214. A toy model of the effect is sketched below.
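As a rough illustration (all numbers below are made up, not taken from the benchmark), this sketch shows how aggregate source throughput collapses once the fast splits are drained, even though the total amount of data is unchanged:

```python
# Toy model of split-consumption imbalance (hypothetical per-split sizes and
# rates). Fast splits drain early, so aggregate source throughput drops
# sharply in the second half of the run.
splits = [
    {"events": 1_000_000, "rate": 50_000},  # fast split, drained after ~20 s
    {"events": 1_000_000, "rate": 50_000},  # fast split, drained after ~20 s
    {"events": 1_000_000, "rate": 10_000},  # slow split, lasts ~100 s
    {"events": 1_000_000, "rate": 10_000},  # slow split, lasts ~100 s
]

WINDOW = 20  # seconds per sampling window
for start in range(0, 120, WINDOW):
    consumed = 0
    for s in splits:
        take = min(s["events"], s["rate"] * WINDOW)  # a split stops contributing once empty
        s["events"] -= take
        consumed += take
    print(f"{start:3d}-{start + WINDOW:3d} s: aggregate throughput ~ {consumed / WINDOW:,.0f} events/s")
```

With these assumed numbers, the first window runs at 120k events/s, but after ~20 s only the slow splits remain and the rate drops to 20k events/s, which is the shape seen in the graph.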
Yeah. Marking it high-priority as it may make a lot of other evaluations difficult to reason about.
Just discussed with @huangjw806 this afternoon. The current machine setting uses a c6i.2xlarge for Kafka, so note the Kafka machine's network bandwidth. Considering that the peak throughput we get from RW on the dashboard is around 1500 MB/s (sometimes a little over 1500 MB/s), i.e. 12 Gbps, we want to rule out the possibility that the imbalance is due to the Kafka machine's network bandwidth. @huangjw806 is helping to get new numbers by upgrading the Kafka machine from c6i.2xlarge to c6i.8xlarge; the bandwidth arithmetic is sketched below.
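As a minimal back-of-the-envelope check (the Kafka machine's actual NIC limit is not stated in this thread, so `nic_limit_gbps` below is an assumed placeholder, not a measured number):

```python
# Convert the observed peak source throughput to Gbps and compare it against
# an assumed NIC limit for the Kafka machine, to show why network bandwidth
# is a plausible bottleneck. 1 MB/s is treated as 10^6 bytes/s here.
peak_mb_per_s = 1500                  # observed peak on the RW dashboard
peak_gbps = peak_mb_per_s * 8 / 1000  # MB/s -> Gbps
print(f"peak source throughput ~ {peak_gbps:.1f} Gbps")  # ~12.0 Gbps

nic_limit_gbps = 12.5  # placeholder assumption for the Kafka machine's NIC limit
headroom = (nic_limit_gbps - peak_gbps) / nic_limit_gbps
print(f"headroom vs. assumed NIC limit: {headroom:.0%}")  # ~4%, i.e. close to saturation
```

If the assumed limit is anywhere near the observed 12 Gbps peak, the Kafka machine would be running close to saturation, which is exactly what the c6i.8xlarge upgrade is meant to rule out.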
This issue has been open for 60 days with no activity. Could you please update the status? Feel free to continue discussion or close as not planned.
https://risingwave-labs.slack.com/archives/C04R6R5236C/p1708300808010129
nexmark-q0-blackhole-4x-medium-1cn-affinity, scaling up 4X: http://metabase.risingwave-cloud.xyz/question/12347-nexmark-q0-blackhole-4x-medium-1cn-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-2730?start_date=2024-01-21
nexmark-q0-blackhole-medium-4cn-1node-affinity, scaling out 4X: http://metabase.risingwave-cloud.xyz/question/11043-nexmark-q0-blackhole-medium-4cn-1node-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-2543?start_date=2024-01-11
nexmark-q7-blackhole-medium-1cn-affinity, baseline: http://metabase.risingwave-cloud.xyz/question/1502-nexmark-q7-blackhole-medium-1cn-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-190?start_date=2023-09-08
nexmark-q17-blackhole-4x-medium-1cn-affinity, scaling up 4X: http://metabase.risingwave-cloud.xyz/question/9270-nexmark-q17-blackhole-4x-medium-1cn-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-2767?start_date=2024-01-04
q0 is a stateless query that does 0 computation.
These are affinity settings.