Running the same bench twice does not give stable results, neither locally nor on AWS.
But I'm focusing on AWS at the moment because that feels more important.
Network and disk IO do not go above 5MB, so it feels unlikely that we are hitting limits there.
I've noticed that shotover benches will be off by a certain % for the entire bench, while non-shotover benches will sit around 0.0% and then go up or down from there before returning to 0.0%.
So it looks like shotover is introducing a second kind of noise.
So we should first address the noise without shotover.
The cassandra benches have the bencher set up to only use 1 thread.
Using 2 threads on an m6a.large instance seems to make the noise worse, but maybe using more threads on a larger instance will help?
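For context, here is a minimal sketch of what pinning the bench driver's thread count looks like, assuming the bencher builds its own tokio runtime (this is not windsock's actual setup code, just the general shape):

```rust
// Hypothetical sketch: build the bencher's runtime with an explicit worker
// thread count, so the cassandra benches can stay pinned to 1 thread
// (or be bumped to 2+ when testing larger instances).
fn build_bench_runtime(worker_threads: usize) -> tokio::runtime::Runtime {
    tokio::runtime::Builder::new_multi_thread()
        .worker_threads(worker_threads)
        .enable_all()
        .build()
        .expect("failed to build bench runtime")
}
```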
Maybe I need a better idea of what people have historically found to be stable.
cassandra,compression=none,driver=scylla,operation=read_i64,protocol=v4,shotover=none,topology=single observations

I've now observed that latte is more consistent than windsock; windsock seems to consistently drop performance at 42-44s into the benchmark (resolved).
Not sure if there were other differences in consistency observed.
I attempted to rewrite windsock's bencher to be more like latte, but it did not help.
Either I need to profile the bencher to find out what's going on, or I need to blindly try copying more logic from latte.
cassandra,compression=none,driver=scylla,operation=read_i64,protocol=v4,shotover=standard,topology=cluster3 observations

latte has 10x more throughput than shotover in its default configuration.
latte gets ~60000 OPS while windsock gets ~5000 OPS.
Increasing latte's thread count drops latte's performance.
Wow, I can get numbers similar to latte by setting `--operations-per-second 50000`; as soon as I set `--operations-per-second 55000`, actual OPS drops to 5000.
If I set the bencher OPS to 50000, shotover will meet exactly 50000 OPS.
However, if I set the bencher OPS to 55000, then depending on the run shotover may reach 55000, or it may get stuck at a much lower OPS; I've seen as low as 5000.
If I then set OPS to unlimited it pretty much always runs at 5000 OPS.
Latte doesn't seem to experience this same cliff; it does seem to max out at about the same point that shotover can reach (60000) in its default configuration.
But if I increase the number of concurrent messages to 500 it can hit 80000.
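To make the two knobs being compared concrete, here is a minimal sketch of a fixed-rate load generator with bounded in-flight requests. This is not windsock's or latte's actual code, just the general shape of `--operations-per-second` style throttling plus a concurrent-messages limit:

```rust
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Semaphore;

// Hypothetical sketch of the two knobs: a target rate (ops per second)
// and a cap on concurrent in-flight operations.
async fn run_fixed_rate(ops_per_second: u64, max_concurrent: usize) {
    let limiter = Arc::new(Semaphore::new(max_concurrent));
    let mut ticker =
        tokio::time::interval(Duration::from_nanos(1_000_000_000 / ops_per_second));
    loop {
        ticker.tick().await;
        // Back-pressure: if all permits are in flight the generator stalls
        // here instead of queueing unbounded work.
        let permit = limiter.clone().acquire_owned().await.unwrap();
        tokio::spawn(async move {
            // issue one query here (placeholder)
            drop(permit);
        });
    }
}
```

Whether the cliff comes from the rate limiter, the concurrency cap, or the driver itself is exactly what profiling should tell us.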
shotover=none gives similar throughputs for latte and windsock, but latte is still a bit higher.
Here, increasing the thread count does actually improve latte performance.
Things to try:

- profile the bencher
- run tokio-console on the bencher (see the sketch after this list)
- try updating deps on latte
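For the tokio-console item, enabling it on the bencher should only require installing the console-subscriber layer before the runtime starts, roughly like this (a sketch; it assumes the bencher has a plain `main` and is built with `RUSTFLAGS="--cfg tokio_unstable"`):

```rust
fn main() {
    // Registers the tokio-console instrumentation layer; the bencher must be
    // compiled with RUSTFLAGS="--cfg tokio_unstable" for task data to show up.
    console_subscriber::init();

    // ... build the bench runtime and run the workload as usual
}
```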
However, I think the next step is to add functionality to windsock to allow reusing EC2 instances.
This will eliminate the noise caused by differences in EC2 instances.
I am thinking of an API like this:
```
> # Create the resources required to run the benches specified in FILTER and then store the information required to access those instances to disk
> cargo windsock --store-cloud-resources-to-disk FILTER
Creating AWS resources: CloudResourcesRequired {
    shotover_instance_count: 1,
    docker_instance_count: 3,
    include_shotover_in_docker_instance: false,
}

> # Run the benches once, using the instances created in the previous command.
> cargo windsock --use-cloud-resources-from-disk FILTER
Running "kafka,shotover=standard,size=100KB,topology=cluster3"
...

> # Run the benches a second time, reusing the same instances.
> cargo windsock --use-cloud-resources-from-disk FILTER
Running "kafka,shotover=standard,size=100KB,topology=cluster3"
...

> # Cleanup resources, and also remove the resources-to-disk file to ensure that a --use-cloud-resources-from-disk command would fail early.
> cargo windsock --cleanup-cloud-resources
All AWS throwaway resources have been deleted
```
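As a rough idea of the implementation, the created resources could be serialized to disk by the `--store-cloud-resources-to-disk` run and loaded back by `--use-cloud-resources-from-disk`. The struct fields and file path below are assumptions for illustration, not windsock's actual types:

```rust
use serde::{Deserialize, Serialize};

// Hypothetical shape of the persisted state; field names are assumptions.
#[derive(Serialize, Deserialize)]
struct PersistedCloudResources {
    shotover_instance_ips: Vec<String>,
    docker_instance_ips: Vec<String>,
    include_shotover_in_docker_instance: bool,
}

// Assumed location of the on-disk file.
const RESOURCES_PATH: &str = "target/windsock_cloud_resources.json";

fn store_to_disk(resources: &PersistedCloudResources) -> std::io::Result<()> {
    let json = serde_json::to_string_pretty(resources).expect("serialize resources");
    std::fs::write(RESOURCES_PATH, json)
}

fn load_from_disk() -> std::io::Result<PersistedCloudResources> {
    let json = std::fs::read_to_string(RESOURCES_PATH)?;
    Ok(serde_json::from_str(&json).expect("corrupt resources file"))
}
```

Deleting the file during `--cleanup-cloud-resources` would then make a later `--use-cloud-resources-from-disk` run fail early when the load returns NotFound.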
After that is implemented, it should be easier to evaluate #1360.