Set up benchmarks server #55007
I now run the ASV machine from my home. It is my personal hardware (dedicated to doing nothing but running benchmarks).
The way I inherited it from Tom, we grab the last 5 commits and run benchmarks on those commits if they have not already been run. The benefit of this is that we don't have an ever-growing queue. The downside is that the machine may have no new commits to run and does not look further back in history, so it may sit idle unnecessarily. I've been slowly working toward automating the process of reliably detecting regressions and making/tracking comments on PRs. Currently I'm building a front end in Dash. I don't think this competes with the work as outlined here; I believe both setups are good to have as they each have strengths and weaknesses. |
I had something set up to trigger a runner on OVH to be spun up from GitHub Actions with cirun and run some of the tslibs benchmarks, but there was too much variability (>20%) there too. Here is my branch. Maybe we should try to set up a meeting with everyone who's interested so we can figure out what we have and what needs to be done? |
@lithomas1 - do you know if that was a problem with the setup, or a problem with the benchmark? E.g. at least at face value this appears to me to be a problem with the benchmark: https://asv-runner.github.io/asv-collection/pandas/#arithmetic.Ops2.time_frame_dot |
I think I was just experimenting with a standard cloud VM (b2-7 from OVH) to see how bad the noise was. I didn't tune the system at all back then and we weren't using the bare metal server, so maybe there's room for improvement. I was mainly sharing to show a possible way to connect Github with whatever we use to run the benchmarks. |
I'm trying to understand the current workflow (for the benchmark suite).
Are the results discarded after running? Is this up to date: Benchmark-machine? BTW, thanks @lithomas1 and @rhshadrach for the info. I have been getting familiar with… |
Correct.
For the most part. We now use https://github.com/asv-runner instead of Tom's repo mentioned there. We also no longer publish from the asv-collection repo to the pandas website. The benchmark machine is running in my home and has no access to the wider internet, so only I can "debug" (read: reboot) it. It is still written in Ansible. I plan to rewrite it using Docker. |
Thanks @rhshadrach and @lithomas1 for the information, this is helpful.
Is this a daily cron job? If there are 5 commits for that day, do you know how long it takes? Is it possible that if it takes more than 24h (more than 4.8 hours to run the benchmarks of each commit) the process is still alive when the next day's execution starts? I guess that would add a decent amount of extra noise. We're finishing setting things up; I need to see exactly how long the benchmarks take on the server, but I was thinking of running the last commit from the repo two or three times per day, depending on what's possible. And then, in parallel and on a separate server, keep running the benchmarks for the same commit again and again, and test changes to the machine, asv params or whatever, to see what makes things more stable and faster. I think there are other things that can be done, like:
And I guess other things. Does anyone in the @pandas-dev/pandas-core team have anything they'd really like to see improved in the benchmarks that is doable? As you know, @DeaMariaLeon will spend the next couple of months working on this, besides the work of others. We can surely have a meeting as suggested above; maybe we can gather some feedback here first, and have the meeting in a couple of weeks? |
It is not a cron job. Once a set of commits is finished, the machine will pull from main and run the 5 latest commits, only running those that have not been benchmarked already. If there are no new commits to benchmark, it will sleep for 10 minutes and try again. |
We now have the benchmarks automatically being published here from our benchmarks server: https://pandas.pydata.org/benchmarks/ So far it uses a cron job that every 3 hours executes the last commit (the benchmarks take a bit more than two hours on the server). This is just a first step; we can discuss and implement things in any other way that is more convenient. I also set up another benchmarks server, but that one keeps running the benchmarks for the same commit (the exact same code, but with a different hash). This is to see how much noise we've got on the server. For certain benchmarks it seems to be a lot. We'll be setting up the server to be more stable (CPU isolation...). This other server should help us see whether what we change really has an effect: https://pandas.pydata.org/benchmarks/same-commit/ All of this is work in progress, but I wanted to share it already, in case it's useful. |
On the benchmarks call today, the issue of parsing the current ASV json files into something more usable (e.g. a pandas DataFrame) came up. Wanted to share what I use today: |
On that topic, here is an older notebook where I convert the json file to a DataFrame and did some basic analysis (listing the ones that take the most time): https://nbviewer.org/gist/jorisvandenbossche/0af1c0a20ef187197ecdcfdb3545306a (from #44450 (comment)) |
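In the same spirit, here is a minimal sketch of reading one asv result file into a pandas DataFrame. The field layout (a top-level "results" mapping plus commit_hash/date) is an assumption that varies across asv versions, and the file path in the usage comment is hypothetical:

```python
# Minimal sketch of flattening an asv result JSON into a pandas DataFrame.
# It assumes a top-level "results" mapping of benchmark name to timing data;
# the exact layout differs between asv versions, so treat this as
# illustrative rather than a drop-in parser.
import json

import pandas as pd


def asv_results_to_frame(path):
    with open(path) as f:
        data = json.load(f)

    rows = []
    for name, result in data.get("results", {}).items():
        # Depending on the asv version, `result` can be a scalar, a list of
        # timings (one per parameter combination), or a richer structure.
        timings = result if isinstance(result, list) else [result]
        for i, t in enumerate(timings):
            if isinstance(t, (int, float)):
                rows.append(
                    {
                        "benchmark": name,
                        "param_idx": i,
                        "time_s": t,
                        "commit_hash": data.get("commit_hash"),
                        "date": data.get("date"),
                    }
                )
    return pd.DataFrame(rows)


# df = asv_results_to_frame("results/benchmark-machine/abc1234-existing-env.json")
# df.nlargest(10, "time_s")
```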
I created a repo with what we've got so far on the OVH servers: https://github.com/pandas-dev/pandas-benchmarks We're still quite far from having enough stability in the servers to be happy, but I'll keep updating that repo as we make progress. For now we're going to continue with the following tasks:
We will consider the other proposed topics after we get some progress in the above tasks. |
@datapythonista and @DeaMariaLeon: One issue that recently came up is that of dependencies. I said on the call that we should test against latest versions of e.g. NumPy so that we can help identify regressions there. However we currently face the issue that the dependency versions used are not stored in ASV results (unless they are pinned, which most are not). |
I think this will introduce noise in the plots, as asv will generate a different line for each version of each dependency. But maybe we should create an issue in asv and see what can be done, as this is on their side. |
@rhshadrach are the asv results from your server published anywhere? I can find the rendered html here, but I wonder if you sync the json files somewhere we can access. I guess it's a non-trivial amount of data, but I think it could be useful for others to access, to run your regression detection, export to conbench... If they're not published anywhere, and you want to share them but need somewhere to put them, please let me know (and how much space is needed) and I'm happy to set up something in the OVH infrastructure. |
If you specify multiple versions in the config file, yes. However, if only one version is specified, ASV treats it as a single line (e.g. this example).
I believe they are in the repo you linked to. Direct links: The last bullet above is merely one example of where the JSONs for the benchmarks are stored. You need to navigate the various folder paths that are created whenever we pin a version in the ASV conf file. This is what the code I linked to above does. If there are more JSONs to publish, let me know and I can look into them. |
I've been doing more tests to set up the server with minimum noise, and the results are starting to look promising, except for one thing I'm not sure where it's coming from. In the chart, I'm comparing two different configurations.
In the top chart, the space between the red lines is where we capture the benchmarks from (we leave 0.1 seconds of warm-up where results are discarded, and 10 runs are considered after it). In the bottom histograms I plot the times over 10,000 runs. The purple lines represent the minimum and maximum time. I'll keep having a look; if we manage to remove the period with lower performance, I think we should be able to remove or significantly reduce the warm-up period and reduce the number of repetitions we run each benchmark, while still keeping the number of false positives to a minimum. |
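For reference, a measurement script along these lines could look like the sketch below. The benchmarked operation, the data, the 10,000 repetitions, and the 100-run warm-up cutoff are all placeholders, not the actual script used here:

```python
# Sketch: time one operation many times and look at the distribution, in the
# spirit of the histograms described above. The operation, data size,
# repetition count and warm-up cutoff are placeholders.
import statistics
import time

import numpy as np
import pandas as pd

ser = pd.Series(np.random.default_rng(42).integers(0, 1_000, size=100_000))

timings = []
for _ in range(10_000):
    start = time.perf_counter()
    ser.duplicated(keep="first")  # stand-in for the benchmarked call
    timings.append(time.perf_counter() - start)

steady = timings[100:]  # discard a warm-up window, then summarize the rest
print(f"min={min(steady):.6f}s  max={max(steady):.6f}s  "
      f"median={statistics.median(steady):.6f}s")
```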
Regarding the issue of storing dependencies with asv:
Asv treats it as a single line, but I think we need to remove previous results. |
That looks like multiple lines to me, one after the other. ;) If that's what you'd like to have, I guess creating an issue in asv is the best thing to do, but I'm not sure if it's unmaintained again. Also, it may be worth checking conbench first; we should have a PoC soon, and I have no idea how versions are managed there. Regarding server stability, it seems like the two different speeds in the previous chart were caused by C-states. Disabling C-states completely fixed the issue. I also tried different governors, in particular the userspace one, to see if that brings more stability, but for some reason the variance becomes huge. I think for the server configuration we should be good with what we've got. There is a minimal amount of noise now, and I don't think we'll be able to get rid of it easily. I think the rest of the work is in asv:
I'll share more updates as I have them. |
I let asv run a few iterations with the new server configuration, and the results don't seem to make sense. For the same benchmark used above, algorithms.DuplicatedMaskedArray.time_duplicated(unique=False, keep='first', dtype='int'), we've got the following results: The last 4 commits are the ones that used the latest server configuration (the one that produced the orange line in the chart from the previous comment in the independent tests). They seem quite stable, and the timing is consistent in both cases: 6.9ms for asv and 6.5ms for the script I use to run that benchmark. The small overhead in asv is expected (in my script I avoid a function call, and asv probably has some more complexity in the timing, I don't fully understand its code). But before that we have many instances where the asv timing is 50% of what would be expected, and this seems consistent across many runs. First, I don't think this benchmark can run in around 3ms on this hardware, and even less as an average of 10 iterations; it won't run that fast even in the best-case scenario. Second, it's very suspicious that the benchmark runs in exactly 50% of the expected time in most of these instances, and always at the same value. Based on the other tests there is no randomness in the function that explains that, and, as said, even less so when averaging over 10 repetitions lands at exactly 50% less. I couldn't find the exact problem yet, but I think there is a bug, probably related to this division in the asv code. It seems that, among a lot of other complexity, when timing the benchmark it can be executed a number of times defined by the number variable, and the total time of all runs is then divided by the number of runs to get the average. That's not the only way a benchmark run is repeated, as this will itself be repeated a number of times. I think there is a bug somewhere in all this logic that ends up causing measurements of 50% of the actual time (and some other round fractions). I think the logic was fully changed in a refactoring 5 months ago, and while the code is now better structured and easier to follow, I think the logic of the rounds is buggy and is causing all that variance, with many measurements 50% lower than they should be. I can't fully confirm yet, but I would be surprised if that wasn't the problem. I'll leave that server to continue running as is for longer, to see if the problem still happens, and then set it up to run with the parameters specified above: repeat=3, min_run_count=3, warmup_time=-1, rounds=0, and number=1. |
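To make the suspected problem concrete, this is roughly how a timeit-style measurement works (an illustration only, not asv's actual code): the statement runs number times inside one timed loop and the elapsed total is divided by number, so if the divisor and the real number of executions ever get out of sync, every reported sample is off by a round factor such as 50%.

```python
# Illustration only, not asv's implementation: a timeit-style measurement
# executes the statement `number` times and divides the elapsed total by
# `number`. If the divisor and the real execution count get out of sync
# (e.g. 2 vs 1), every reported sample is off by a round factor.
import time


def measure(func, number=10, repeat=5):
    samples = []
    for _ in range(repeat):
        start = time.perf_counter()
        for _ in range(number):
            func()
        total = time.perf_counter() - start
        samples.append(total / number)  # per-call time for this round
    return samples


print(measure(lambda: sum(range(10_000))))
```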
This reminded me of #40066
|
Interesting, thanks for sharing. I'm not sure I fully understand your issue, as your examples seem to use… Related to my last comment, asv does indeed seem to have problems averaging. The good news is that I already ran a few executions removing the asv repetition system (…). I also tried to just set… After that, depending on how the tests with conbench go, I may implement a benchmark runner from scratch. I already did a prototype, and it seems a very reasonable thing to do in my opinion. Much better than trying to fix asv. @lithomas1 I think something we could do soon is to implement an action for the GitHub comment |
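For anyone following along, the repetition parameters mentioned in this thread (repeat, number, rounds, warmup_time, min_run_count) can be set as class attributes on asv benchmarks. A sketch with a made-up benchmark class follows; the attribute names are asv's, but the values and workload are placeholders, not the configuration being tested on the server:

```python
# Sketch of pinning asv's repetition parameters as benchmark class attributes.
# The attribute names (number, repeat, rounds, warmup_time, min_run_count)
# are asv's; the class, values and workload are placeholders.
import numpy as np
import pandas as pd


class DuplicatedExample:
    number = 1        # one timed execution per sample
    repeat = 3        # three samples per round
    rounds = 1
    warmup_time = 0   # skip the timed warm-up window
    min_run_count = 3

    def setup(self):
        rng = np.random.default_rng(0)
        self.ser = pd.Series(rng.integers(0, 1_000, size=1_000_000))

    def time_duplicated(self):
        self.ser.duplicated(keep="first")
```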
In case this was forgotten: conbench has a runner as well. We don't need to use asv at all if conbench does what we want. |
Does it have a runner for asv benchmarks? Did you try it? |
No, conbench doesn't have a way to run the benchmarks as they are currently written for asv - if that's the question. |
I don’t want to rewrite our benchmarks. Would that be necessary? |
Happy to help out here. Maybe we can set up a self-hosted runner; then modifying the workflow to run on the benchmarks machine would be pretty easy. |
On the topic of replacing asv with conbench I think that's another one that is on the border of requiring a PDEP. @phofl makes a good point about not wanting to rewrite benchmarks, but at the same time ASV has not evolved much in the past few years. |
I agree with @phofl, and I'm not aware of anyone disagreeing, that rewriting our benchmarks shouldn't be considered. Personally I'd be +1 on replacing the current… Regarding conbench, we are not considering at this point replacing the asv interface with conbench. We want to set up a conbench instance with our benchmarks to see if it can be useful. The benchmarks would still be executed by asv and wouldn't be modified. And we can continue to build the asv UI website, even if conbench happens to be useful and we want to have a proper instance of it after the PoC. I personally couldn't make much sense of the conbench interface. It seems like more complex navigation to end up at mostly a plot similar to the asv one but not interactive (see here for example). @jorisvandenbossche it'd be good if you could give the rest of us a demo; I'm not sure if there is more stuff after logging in, or if the interesting part is the batch jobs rather than the UI. |
As Patrick pointed out, it seems like Dask also has some benchmark tooling, in particular to detect regressions. It feels similar to what Richard has, and I assume to what conbench uses to notify about regressions, so probably worth having a look too: https://github.com/coiled/benchmarks (in the file… |
To be clear, my suggestion of looking into conbench doesn't involve rewriting our asv benchmarks (I would also not be in favor of that at this point), but is solely about the web UI. One of the potential areas for improvement that was mentioned was "improve the UX when analyzing the benchmark results". I asked whether this was about improving the ASV web interface. In that context, if we would consider spending a significant amount of effort on the web UI part of ASV, then it might be worth considering conbench as an alternative web UI. While it is true that it has some basic runner as well (for macro-benchmarks, according to the docs), the general idea is that you can run your benchmarks with whatever tool, and then inject those results into conbench. For example, for Apache Arrow most of our benchmarks are written and run by a language-specific tool (google benchmarks for C++, and we also have benchmarks in R, Java, Go), and conbench is used to gather those all into a single web UI (with features to detect regressions and report about them on PRs). Now, I think most of this issue is about running the benchmarks and finding a way to get stable results. So the conbench idea might be better kept for a separate thread. |
I think we all agree on not rewriting our benchmarks. Hopefully the runner is not a big deal; if we sort out the problems with asv, I think asv is good for now. I still think it's worth writing a PoC of a runner without all the asv complexity. First, I'm unsure how much of the time spent running the benchmarks is caused by asv itself. It seems like the difference between one and two repetitions is small, so I wonder if asv itself, and not our benchmarks, is an important bottleneck. Second, the asv code is quite difficult to maintain; if the PoC of a runner seems to work reasonably well, it may leave us in a much better position to continue from there. Or maybe we discover the opposite, that a runner is actually quite complex and it's worth keeping asv. I think it's worth spending a few hours on this to find out. This issue is generic for improving our benchmarks. I think the stability (where we're not far from a very good state) and the runner are an important part, but not too big a part. I think noticing regressions is the main part. The UI is quite relevant for that, but detecting regressions in the CI of each PR could be more important, or detecting them in every commit and automatically creating an issue could be very useful too. If we have that, the current UIs of either or both of asv and conbench may be more than enough. I think it's worth having a parallel conbench instance for our benchmarks. Even if, from what you said, the current UI doesn't seem to be much of an improvement, it has the advantage of being better maintained than asv. And if we want to make the UI better, the asv one is a very complex javascript system, so that seems like a clear advantage. But probably better to discuss further when we have it ready. |
Sorry if I missed this but what's the main compelling reason to consider switching to conbench? It seems like using the runner is mostly off the table (unless someone wants to write a shim to translate our ASV benchmarks into conbench benchmarks). It's also mentioned above that the GUI/GH integration might be a bit better, but I feel like this is kind of subjective. Personally, I'm more interested in the ability to trigger benchmarks from something like conbench manually instead of the current approach with ASV (@rhshadrach would know more, but it looks like the current approach is a mix of Ansible/Airflow which is used to schedule jobs?). This would require either a total rewrite of the benchmarks or the shim that I mentioned previously, though. |
@lithomas1 conbench has an automated alert for regressions, errors or improvements. It can be set up to send a comment to the commit or a PR on github. There is a "bench alert package" that can be used to do that, but it needs to be configured. |
Thanks for clarifying, excited to see what you come up with! |
I've been doing more research on the noise in the benchmarks. Every case seems a bit different, but the main things I see for benchmarks that are noisy in a single run, after making the server stable, are related to the loading of libraries. The idea is the following:
Asv allows a warm-up time specified in seconds, but it doesn't allow discarding the first N runs, which I think would be the most convenient way to get more stable results with a minimum of iterations and total time to run the suite. As the most immediate solution I think we can use 3 or 5 repetitions; asv will take the median. Assuming this pattern of a very slow first iteration and a second one close to the following runs but significantly higher, if we use 3 repetitions we will be using the second as our measurement. If we use 5 repetitions, the median will be the slowest of the 3 last repetitions, the "normal" ones. In practice the difference is not huge. In the chart, the first noisy part is with 1 repetition (each benchmark is run just once). In the final part, the first section with slightly higher values is with 3 repetitions, and the last section is with 5 repetitions. This and more benchmarks can also be seen in this asv UI instance. The times to run the whole benchmark suite on the OVH server (including building pandas, which takes around 4 minutes, but not including calling asv publish):
Based on the numbers above, it looks like there is an overhead of 10 minutes to run the benchmarks (the 4 to build pandas, plus environment creation, benchmark discovery...), and then each run takes 8 minutes. As a next step I'll leave the main benchmarks server in OVH with all the configurations to make it stable, run the benchmarks with 5 repetitions, and we'll set up the conbench instance and action to report regressions (we will continue to have the asv UI). I think we can have this ready in the next few days. Once we reach this milestone, I think we can continue with Thomas' ideas. I think we should be able to get our benchmarks running in around 15 minutes, and we can have 3 dedicated servers for benchmarks; with this we should be able to run the benchmark suite for every commit of every PR and detect any regression before merging. There are some challenges, but I think this is doable. If anyone has thoughts, opinions or any feedback, that would be great. Based on the discussions we already had I think the above plan is what makes the most sense, but I'm happy to discuss further and replan as needed. |
Does this happen for every extension or just some? We have a mixed bag of multi-phase (ex: pd_datetime.c) and single-phase initialization extensions (ex: ujson.c). I'm not sure of all of the impacts of that but PEP 489 mentions differences in how those are treated within sub-interpreters. |
I didn't check in enough detail to know. It'd be great to understand that, but one option may be to simply preload all of them with asv, which may allow running each benchmark just once while keeping stability. Or perhaps it could add value to track the first run independently. I assume in most cases we care about regressions in the algorithms that run after the extension or library is loaded, but I guess it could be good to know if there is something that makes an extension load slower. If people are fine with it, we'll first focus on setting everything up to run the benchmarks with the increased stability and in 50 minutes (on the OVH server I think they currently take around 2.5h per run), and set up the conbench feature to warn us of regressions. Then we'll try to go deeper into the initialization of the extensions and any other noise that is left. |
What is the default number of repetitions? My only concern would be that dropping to 5 will generate more false positives, or benchmarks whose timing varies between runs. I'm operating on the assumption that the ASV authors set it higher for a reason. |
The default is determined by some logic; I think it's now 10, but it's fewer if the benchmarks take more than 0.1 seconds, in which case it stops early. I think in the past the logic was more complex and ran up to 100 times. Every benchmark is different, and how the server is configured makes a huge difference too. Many of our benchmarks are extremely stable and I'm confident about not repeating them at all. Only the ones loading stuff during the benchmark (not in setup) or caching things require repeating, as the first iteration is different. Also, I'm not sure about conbench, but Richard has a script that detects regressions in a smart way that identifies false positives caused by noise. I think going from 10 repetitions to 5 makes zero difference with the server properly configured. And the noise will be minimal in most benchmarks, compared to what we have now. We can also change the parameters or whatever is needed if we do get false positives once the system is in place. |
Before I saw this, I was thinking that you could do fewer repetitions on benchmarks that take longer to run, so if there is a way to configure a repetition count per benchmark (or per set of benchmarks), that might lower the overall time. For example, if we have some benchmark that takes on average 10 seconds to run, we can be safe with 3 repetitions, because the overhead of the first run will be swamped by the total benchmark time. That would save 20 seconds overall by reducing the count from 5 to 3 for that particular benchmark. Add that up over a lot of longer benchmarks, and that could help reduce the overall time. |
As far as I know, what you say is possible @Dr-Irv, and I think it makes sense in a scenario where we run the benchmarks 100 times or some other large number. In the current state we are able to run many benchmarks just once with minimal noise, and the noisy ones are noisy because the first run is slower, as things are being loaded and cached during the benchmark. So we do want to run those at least something like 3 times. In this particular case I wouldn't know how to use a max time for all repetitions of a benchmark in a way that is useful. |
FWIW, I took your last runs and computed the sum of all times of all the benchmarks, which is somewhat equivalent to running each one of them just once, assuming they could be run in sequence with no overhead, and leaving out the memory ones. That total is 170 seconds. So I guess there is a LOT of overhead with asv running each benchmark multiple times. |
That's interesting, thanks for checking @Dr-Irv. I guess most of the difference between the 3 minutes of adding up the benchmark times and the 14 minutes of their run (18 minutes minus 4 minutes of compiling pandas) is in the setup and teardown functions. I guess it's normal that they take time, as the setup will usually build datasets, and allocations are slow. But still, 11 minutes compared to the 3 minutes running the benchmarks seems a lot; I guess the overhead of asv itself may be significant, and there is probably room for improvement. Something else that's probably not very efficient is that all results are kept in memory as the process runs. The final result is a single json file. Changing that to jsonl, csv or something more efficient that is written to disk as benchmarks run could possibly make things a bit faster and lower the memory needs. We can consider recording more times in asv, like timing the setup functions... In the same way asv and pytest report durations, it could make sense for asv to report them for setup (maybe durations in asv already include the setup and teardown times, I don't know). In any case, I'll continue with the low-hanging fruit and set up everything as I proposed above for the first milestone, and then we can go deeper into the rabbit hole of where the time is really being spent, and try to optimize it so we can run the benchmarks fast enough to run as a regular CI job in PRs. |
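As a rough illustration of the "write results as you go" idea (the record fields and file name are made up):

```python
# Sketch of streaming results to a JSON Lines file as benchmarks finish,
# instead of keeping everything in memory for one big JSON at the end.
# The record fields and file name are illustrative.
import json
import time


def append_result(path, benchmark, timing_s, commit_hash):
    record = {
        "benchmark": benchmark,
        "time_s": timing_s,
        "commit_hash": commit_hash,
        "recorded_at": time.time(),
    }
    with open(path, "a") as f:      # one JSON object per line, append-only
        f.write(json.dumps(record) + "\n")


# append_result("results.jsonl",
#               "algorithms.DuplicatedMaskedArray.time_duplicated",
#               0.0069, "abc1234")
```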
The first part of the adapter that takes asv benchmark results and sends them to a conbench dashboard is done. I temporarily installed it on one of the OVH servers with a local setup. It’s still not automatically updating with the new files generated by asv; I only took the 125 most recent files as of last week. For the time being, they have to be read in batches. If someone wants to see one of the graphs: http://15.235.45.255:5000/benchmark-results/0656145ccca870b88000f4a7daf8f338/ I’m just starting to study the conbench alert system. |
This issue is to keep track of the benchmarks project we are about to start. Adding context and some initial ideas, but we'll keep evolving the project as we make progress.
The final goal of our benchmarks system is to detect when some functionality in pandas starts taking longer for no good reason. See this fictional example:
In the example, Series.sum for given data takes 1 millisecond in 2.0, while in 2.1 it takes 5 times that long, probably because of an unexpected side effect, not because we made changes that we considered worth making the function 5 times slower. When we introduce a performance regression like the one above, it is likely that users will end up reporting it via a GitHub issue. But ideally, we would like to detect it much earlier, and never release versions of pandas with those performance regressions. For this reason, we implemented many functions that run pandas functionality with some arbitrary but constant data: our benchmark suite, whose code is in asv_bench/benchmarks. Those benchmarks are implemented using a framework named asv. Running the benchmark suite gives us the time it takes to run each of many pandas functions with data that is consistent over runs. By comparing the executions with two different pandas versions, we can detect performance regressions. In an ideal world, what we'd like is to detect performance regressions in the CI of our PRs. So, before merging a PR, we can see there is a performance regression and make a decision on whether the change is worth making pandas slower. This would be equivalent to what we do with tests or with linting, where we get a red CI when something is not right, and merging doesn't happen until we're happy with the PR.
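For context, the benchmarks in asv_bench/benchmarks follow asv's conventions: a class whose setup method builds the data and whose time_* methods are what asv times, optionally parametrized. A minimal illustrative sketch (not an existing pandas benchmark class):

```python
# Minimal asv-style benchmark, following the conventions used in
# asv_bench/benchmarks. Illustrative only, not an existing pandas benchmark.
import numpy as np
import pandas as pd


class SeriesSum:
    # asv runs every parameter combination, passing it to setup and time_*.
    params = [10_000, 1_000_000]
    param_names = ["size"]

    def setup(self, size):
        rng = np.random.default_rng(0)
        self.ser = pd.Series(rng.standard_normal(size))

    def time_sum(self, size):
        self.ser.sum()
```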
In practice, running consistent benchmarks for every commit in a pull request is not feasible at this point. One of the main reasons is that we would need to run the benchmarks on both main and the PR of interest, which would double the execution time.
For now, we are giving up on executing the benchmarks for every commit in an open PR, and the focus has been on executing them after merging a PR. When running the benchmarks, we run them on a physical server, not a virtual server. The server we've been using was bought by Wes many years ago and was running 24/7 from his home; at some point it was moved to Tom's home. This is still how we run the benchmarks now. Some months ago, OVH Cloud donated credits to pandas for a dedicated server from their cloud for our benchmarks (we also got credits to host the website/docs and for other things we may need). There was some work on setting up the server and improving things, but we didn't complete even the initial work.
There are three main challenges to what would otherwise be a somewhat simple project:
The most common approach to benchmark stability is what is usually called statistical benchmarking. In its simplest form, the idea is that, for a single benchmark run, the function being timed is executed something like 100 times and the mean is taken. ASV does a slightly smarter version of this, where the first few runs (warm-up) are discarded, and the exact number of times a benchmark runs depends on the variance of the first runs. But 100 repetitions is common.
This repetition brings more stability, but obviously makes the second challenge worse, as timing every function 100 times makes the benchmarks 100 times slower. We have a CI job where we run our benchmark suite with just one call for each function, and the timing is very reasonable: the job takes 25 minutes on the CI worker. But the results are discarded, since they are very unstable, both because of the lack of repetition and because of the instability of using a virtual machine / CI worker.
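In code, the simplest form of this statistical approach is just a loop with a warm-up cutoff and an aggregate; the repetition and warm-up counts below are placeholders:

```python
# Simplest form of statistical benchmarking: run the function many times,
# discard a few warm-up runs, and aggregate. The counts are placeholders.
import statistics
import time


def bench(func, repetitions=100, warmup=5):
    timings = []
    for _ in range(warmup + repetitions):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    steady = timings[warmup:]
    return {"mean": statistics.mean(steady),
            "median": statistics.median(steady)}


# Example:
# import numpy as np, pandas as pd
# ser = pd.Series(np.arange(1_000_000))
# print(bench(lambda: ser.sum()))
```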
For now, what we want is:
Once we have a reasonable version of the above, some other things can be considered. Some ideas:
CC: @DeaMariaLeon @lithomas1 @rhshadrach