Set up benchmarks server #55007
I now run the ASV machine from my home. It is my personal hardware (dedicated to doing nothing but running benchmarks).
The way I inherited it from Tom, we grab the last 5 commits and run benchmarks on those commits if they have not already been run. The benefit of this is that we don't have an ever-growing queue. The downside is that the machine may have no new commits to run and does not look further back in history, so it may sit idle unnecessarily. I've been slowly working toward automating the process of reliably detecting regressions and making/tracking comments on PRs. Currently I'm building a front end in Dash. I don't think this competes with the work as outlined here; I believe both setups are good to have as they each have strengths and weaknesses. |
I had something set up to trigger a runner on OVH to be spun up from GitHub Actions with cirun and run some of the tslibs benchmarks, but there was too much variability (>20%) there too. Here is my branch. Maybe we should try to set up a meeting with everyone who's interested so we can figure out what we have and what needs to be done? |
@lithomas1 - do you know if that was a problem with the setup, or a problem with the benchmark? E.g. at least at face value this appears to me to be a problem with the benchmark: https://asv-runner.github.io/asv-collection/pandas/#arithmetic.Ops2.time_frame_dot |
I think I was just experimenting with a standard cloud VM (b2-7 from OVH) to see how bad the noise was. I didn't tune the system at all back then and we weren't using the bare metal server, so maybe there's room for improvement. I was mainly sharing to show a possible way to connect Github with whatever we use to run the benchmarks. |
I'm trying to understand the current workflow (for the benchmark suite).
Are the results discarded after running? Is this up to date: Benchmark-machine? BTW, thanks @lithomas1 and @rhshadrach for the info. I have been getting familiar with… |
Correct.
For the most part. We now use https://github.com/asv-runner instead of Tom's repo mentioned there. We also no longer publish from the asv-collection repo to the pandas website. The benchmark machine is running in my home and has no access to the wider internet, so only I can "debug" (read: reboot) it. It is still written in Ansible. I plan to rewrite it using Docker. |
Thanks @rhshadrach and @lithomas1 for the information, this is helpful.
Is this a daily cron job? If there are 5 commits for that day, do you know how long it takes? Is it possible that if it takes more than 24h (more than 4.8 hours to run the benchmarks of each commit) the process is still alive when the next day's execution starts? I guess that would add a decent amount of extra noise. We're finishing setting things up; I need to see exactly how long the benchmarks take on the server, but I was thinking of running the last commit from the repo two or three times per day, depending on what's possible. And then, in parallel and on a separate server, keep running the benchmarks for the same commit again and again, and test changes to the machine, asv params or whatever, to see what makes things more stable and faster. I think there are other things that can be done, like:
And I guess other things. Does anyone in the @pandas-dev/pandas-core team have anything they'd really like to see improved in the benchmarks that is doable? As you know, @DeaMariaLeon will spend the next couple of months working on this, besides the work of others. We can surely have a meeting as suggested above; maybe we can gather some feedback here first, and have the meeting in a couple of weeks? |
It is not a cron job. Once a set of commits is finished, the machine will pull from main and run the 5 latest commits, only running those that have not been benchmarked already. If there are no new commits to benchmark, it will sleep for 10 minutes and try again. |
We now have the benchmarks automatically being published here from our benchmarks server: https://pandas.pydata.org/benchmarks/ So far it uses a cron job that every 3 hours executes the last commit (the benchmarks take a bit more than two hours on the server). This is just a first step; we can discuss and implement things in any other way that is more convenient. I also set up another benchmarks server, but that one keeps running the benchmarks for the same commit (the exact same code, but with a different hash). This is to see how much noise we've got on the server. For certain benchmarks it seems to be a lot. We'll be setting up the server to be more stable (CPU isolation...). This other server should help us see whether what we change really has an effect: https://pandas.pydata.org/benchmarks/same-commit/ All of this is work in progress, but I wanted to share it already, in case it's useful. |
On the benchmarks call today, the issue of parsing the current ASV json files into something more usable (e.g. a pandas DataFrame) came up. Wanted to share what I use today: |
On that topic, here is an older notebook where I convert the json file to a DataFrame and did some basic analysis (listing the ones that take the most time): https://nbviewer.org/gist/jorisvandenbossche/0af1c0a20ef187197ecdcfdb3545306a (from #44450 (comment)) |
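In the same spirit, here is a minimal sketch of reading one asv result file into a pandas DataFrame. The field layout (a top-level "results" mapping plus commit_hash/date) is an assumption that varies across asv versions, and the file path in the usage comment is hypothetical:

```python
# Minimal sketch of flattening an asv result JSON into a pandas DataFrame.
# It assumes a top-level "results" mapping of benchmark name to timing data;
# the exact layout differs between asv versions, so treat this as
# illustrative rather than a drop-in parser.
import json

import pandas as pd


def asv_results_to_frame(path):
    with open(path) as f:
        data = json.load(f)

    rows = []
    for name, result in data.get("results", {}).items():
        # Depending on the asv version, `result` can be a scalar, a list of
        # timings (one per parameter combination), or a richer structure.
        timings = result if isinstance(result, list) else [result]
        for i, t in enumerate(timings):
            if isinstance(t, (int, float)):
                rows.append(
                    {
                        "benchmark": name,
                        "param_idx": i,
                        "time_s": t,
                        "commit_hash": data.get("commit_hash"),
                        "date": data.get("date"),
                    }
                )
    return pd.DataFrame(rows)


# df = asv_results_to_frame("results/benchmark-machine/abc1234-existing-env.json")
# df.nlargest(10, "time_s")
```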
I created a repo with what we've got so far on the OVH servers: https://github.com/pandas-dev/pandas-benchmarks We're still quite far from having enough stability in the servers to be happy, but I'll keep updating that repo as we make progress. For now we're going to continue with the following tasks:
We will consider the other proposed topics after we get some progress in the above tasks. |
@datapythonista and @DeaMariaLeon: One issue that recently came up is that of dependencies. I said on the call that we should test against latest versions of e.g. NumPy so that we can help identify regressions there. However we currently face the issue that the dependency versions used are not stored in ASV results (unless they are pinned, which most are not). |
I think this will introduce noise in the plots, as asv will generate a different line for each version of each dependency. But maybe we should create an issue in asv and see what can be done, as this is on their side. |
@rhshadrach are the asv results from your server published anywhere? I can find the rendered html here, but I wonder if you sync the json files somewhere we can access. I guess it's a non-trivial amount of data, but I think it could be useful for others to access, to run your regression detection, export to conbench... If they're not published anywhere, and you want to share them but need somewhere to put them, please let me know (and how much space is needed) and I'm happy to set up something in the OVH infrastructure. |
If you specify multiple versions in the config file, yes. However, if only one version is specified, ASV treats it as a single line (e.g. this example).
I believe they are in the repo you linked to. Direct links: The last bullet above is merely one example of where the JSONs for the benchmarks are stored. You need to navigate the various folder paths that are created whenever we pin a version in the ASV conf file. This is what the code I linked to above does. If there are more JSONs to publish, let me know and I can look into them. |
I've been doing more tests to set up the server with minimum noise, and the results are starting to look promising, except for one thing I'm not sure where it's coming from. In the chart, I'm comparing two different configurations.
In the top chart, the space between the red lines is where we capture the benchmarks from (we leave 0.1 seconds of warm-up where results are discarded, and 10 runs are considered after it). In the bottom histograms I plot the times over 10,000 runs. The purple lines represent the minimum and maximum time. I'll keep having a look; if we manage to remove the period with lower performance, I think we should be able to remove or significantly reduce the warm-up period and reduce the number of repetitions we run each benchmark, while still keeping the number of false positives to a minimum. |
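For reference, a measurement script along these lines could look like the sketch below. The benchmarked operation, the data, the 10,000 repetitions, and the 100-run warm-up cutoff are all placeholders, not the actual script used here:

```python
# Sketch: time one operation many times and look at the distribution, in the
# spirit of the histograms described above. The operation, data size,
# repetition count and warm-up cutoff are placeholders.
import statistics
import time

import numpy as np
import pandas as pd

ser = pd.Series(np.random.default_rng(42).integers(0, 1_000, size=100_000))

timings = []
for _ in range(10_000):
    start = time.perf_counter()
    ser.duplicated(keep="first")  # stand-in for the benchmarked call
    timings.append(time.perf_counter() - start)

steady = timings[100:]  # discard a warm-up window, then summarize the rest
print(f"min={min(steady):.6f}s  max={max(steady):.6f}s  "
      f"median={statistics.median(steady):.6f}s")
```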
Regarding the issue of storing dependencies with asv:
Asv treats it as a single line, but I think we need to remove previous results. |
That looks like multiple lines to me, one after the other. ;) If that's what you'd like to have, I guess creating an issue in asv is the best thing to do, but I'm not sure if it's unmaintained again. Also, it may be worth checking conbench first; we should have a PoC soon, and I have no idea how versions are managed there. Regarding server stability, it seems like the two different speeds in the previous chart were caused by C-states. Disabling C-states completely fixed the issue. I also tried different governors, in particular the userspace one, to see if that brings more stability, but for some reason the variance becomes huge. I think for the server configuration we should be good with what we've got. There is a minimal amount of noise now, and I don't think we'll be able to get rid of it easily. I think the rest of the work is in asv:
I'll share more updates as I have them. |
I let asv run a few iterations with the new server configuration, and the results don't seem to make sense. For the same benchmark used above, algorithms.DuplicatedMaskedArray.time_duplicated(unique=False, keep='first', dtype='int'), we've got the following results: The last 4 commits are the ones that used the latest server configuration (the one that produced the orange line in the chart from the previous comment in the independent tests). They seem quite stable, and the timing is consistent in both cases: 6.9ms for asv and 6.5ms for the script I use to run that benchmark. The small overhead in asv is expected (in my script I avoid a function call, and asv probably has some more complexity in the timing, I don't fully understand its code). But before that we have many instances where the asv timing is 50% of what would be expected, and this seems consistent across many runs. First, I don't think this benchmark can run in around 3ms on this hardware, and even less as an average of 10 iterations; it won't run that fast even in the best-case scenario. Second, it's very suspicious that the benchmark runs in exactly 50% of the expected time in most of these instances, and always at the same value. Based on the other tests there is no randomness in the function that explains that, and, as said, even less so when averaging over 10 repetitions lands at exactly 50% less. I couldn't find the exact problem yet, but I think there is a bug, probably related to this division in the asv code. It seems that, among a lot of other complexity, when timing the benchmark it can be executed a number of times defined by the number variable, and the total time of all runs is then divided by the number of runs to get the average. That's not the only way a benchmark run is repeated, as this will itself be repeated a number of times. I think there is a bug somewhere in all this logic that ends up causing measurements of 50% of the actual time (and some other round fractions). I think the logic was fully changed in a refactoring 5 months ago, and while the code is now better structured and easier to follow, I think the logic of the rounds is buggy and is causing all that variance, with many measurements 50% lower than they should be. I can't fully confirm yet, but I would be surprised if that wasn't the problem. I'll leave that server to continue running as is for longer, to see if the problem still happens, and then set it up to run with the parameters specified above: repeat=3, min_run_count=3, warmup_time=-1, rounds=0, and number=1. |
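To make the suspected problem concrete, this is roughly how a timeit-style measurement works (an illustration only, not asv's actual code): the statement runs number times inside one timed loop and the elapsed total is divided by number, so if the divisor and the real number of executions ever get out of sync, every reported sample is off by a round factor such as 50%.

```python
# Illustration only, not asv's implementation: a timeit-style measurement
# executes the statement `number` times and divides the elapsed total by
# `number`. If the divisor and the real execution count get out of sync
# (e.g. 2 vs 1), every reported sample is off by a round factor.
import time


def measure(func, number=10, repeat=5):
    samples = []
    for _ in range(repeat):
        start = time.perf_counter()
        for _ in range(number):
            func()
        total = time.perf_counter() - start
        samples.append(total / number)  # per-call time for this round
    return samples


print(measure(lambda: sum(range(10_000))))
```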
This reminded me of #40066
|
Interesting, thanks for sharing. I'm not sure I fully understand your issue, as your examples seem to use… Related to my last comment, asv does indeed seem to have problems averaging. The good news is that I already ran a few executions removing the asv repetition system (…). I also tried to just set… After that, depending on how the tests with conbench go, I may implement a benchmark runner from scratch. I already did a prototype, and it seems a very reasonable thing to do in my opinion. Much better than trying to fix asv. @lithomas1 I think something we could do soon is to implement an action for the GitHub comment |
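For anyone following along, the repetition parameters mentioned in this thread (repeat, number, rounds, warmup_time, min_run_count) can be set as class attributes on asv benchmarks. A sketch with a made-up benchmark class follows; the attribute names are asv's, but the values and workload are placeholders, not the configuration being tested on the server:

```python
# Sketch of pinning asv's repetition parameters as benchmark class attributes.
# The attribute names (number, repeat, rounds, warmup_time, min_run_count)
# are asv's; the class, values and workload are placeholders.
import numpy as np
import pandas as pd


class DuplicatedExample:
    number = 1        # one timed execution per sample
    repeat = 3        # three samples per round
    rounds = 1
    warmup_time = 0   # skip the timed warm-up window
    min_run_count = 3

    def setup(self):
        rng = np.random.default_rng(0)
        self.ser = pd.Series(rng.integers(0, 1_000, size=1_000_000))

    def time_duplicated(self):
        self.ser.duplicated(keep="first")
```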
In case this was forgotten: conbench has a runner as well. We don't need to use asv at all if conbench does what we want. |
Does it have a runner for asv benchmarks? Did you try it? |
No, conbench doesn't have a way to run the benchmarks as they are currently written for asv - if that's the question. |
I don’t want to rewrite our benchmarks. Would that be necessary? |
Happy to help out here. Maybe we can set up a self-hosted runner; then modifying the workflow to run on the benchmarks machine would be pretty easy. |
On the topic of replacing asv with conbench I think that's another one that is on the border of requiring a PDEP. @phofl makes a good point about not wanting to rewrite benchmarks, but at the same time ASV has not evolved much in the past few years. |
I agree with @phofl, and I'm not aware of anyone disagreeing, that rewriting our benchmarks shouldn't be considered. Personally I'd be +1 on replacing the current… Regarding conbench, we are not considering at this point replacing the asv interface with conbench. We want to set up a conbench instance with our benchmarks to see if it can be useful. The benchmarks would still be executed by asv and wouldn't be modified. And we can continue to build the asv UI website, even if conbench happens to be useful and we want to have a proper instance of it after the PoC. I personally couldn't make much sense of the conbench interface. It seems like more complex navigation to end up at mostly a plot similar to the asv one but not interactive (see here for example). @jorisvandenbossche it'd be good if you could give the rest of us a demo; I'm not sure if there is more stuff after logging in, or if the interesting part is the batch jobs rather than the UI. |
As Patrick pointed out, it seems like Dask also has some benchmark tooling, in particular to detect regressions. It feels similar to what Richard has, and I assume to what conbench uses to notify about regressions, so probably worth having a look too: https://github.com/coiled/benchmarks (in the file… |
To be clear, my suggestion of looking into conbench doesn't involve rewriting our asv benchmarks (I would also not be in favor of that at this point), but is solely about the web UI. One of the potential areas for improvement that was mentioned was "improve the UX when analyzing the benchmark results". I asked whether this was about improving the ASV web interface. In that context, if we would consider spending a significant amount of effort on the web UI part of ASV, then it might be worth considering conbench as an alternative web UI. While it is true that it has some basic runner as well (for macro-benchmarks, according to the docs), the general idea is that you can run your benchmarks with whatever tool, and then inject those results into conbench. For example, for Apache Arrow most of our benchmarks are written and run by a language-specific tool (google benchmarks for C++, and we also have benchmarks in R, Java, Go), and conbench is used to gather those all into a single web UI (with features to detect regressions and report about them on PRs). Now, I think most of this issue is about running the benchmarks and finding a way to get stable results. So the conbench idea might be better kept for a separate thread. |
I think we all agree on not rewriting our benchmarks. Hopefully the runner is not a big deal; if we sort out the problems with asv, I think asv is good for now. I still think it's worth writing a PoC of a runner without all the asv complexity. First, I'm unsure how much of the time spent running the benchmarks is caused by asv itself. It seems like the difference between one and two repetitions is small, so I wonder if asv itself, and not our benchmarks, is an important bottleneck. Second, the asv code is quite difficult to maintain; if the PoC of a runner seems to work reasonably well, it may leave us in a much better position to continue from there. Or maybe we discover the opposite, that a runner is actually quite complex and it's worth keeping asv. I think it's worth spending a few hours on this to find out. This issue is generic for improving our benchmarks. I think the stability (where we're not far from a very good state) and the runner are an important part, but not too big a part. I think noticing regressions is the main part. The UI is quite relevant for that, but detecting regressions in the CI of each PR could be more important, or detecting them in every commit and automatically creating an issue could be very useful too. If we have that, the current UIs of either or both of asv and conbench may be more than enough. I think it's worth having a parallel conbench instance for our benchmarks. Even if, from what you said, the current UI doesn't seem to be much of an improvement, it has the advantage of being better maintained than asv. And if we want to make the UI better, the asv one is a very complex javascript system, so that seems like a clear advantage. But probably better to discuss further when we have it ready. |
Sorry if I missed this but what's the main compelling reason to consider switching to conbench? It seems like using the runner is mostly off the table (unless someone wants to write a shim to translate our ASV benchmarks into conbench benchmarks). It's also mentioned above that the GUI/GH integration might be a bit better, but I feel like this is kind of subjective. Personally, I'm more interested in the ability to trigger benchmarks from something like conbench manually instead of the current approach with ASV (@rhshadrach would know more, but it looks like the current approach is a mix of Ansible/Airflow which is used to schedule jobs?). This would require either a total rewrite of the benchmarks or the shim that I mentioned previously, though. |
@lithomas1 conbench has an automated alert for regressions, errors or improvements. It can be set up to send a comment to the commit or a PR on github. There is a "bench alert package" that can be used to do that, but it needs to be configured. |
Thanks for clarifying, excited to see what you come up with! |
I've been doing more research on the noise in the benchmarks. Every case seems a bit different, but the main things I see for benchmarks that are noisy in a single run, after making the server stable, are related to the loading of libraries. The idea is the following:
Asv allows a warm-up time specified in seconds, but it doesn't allow discarding the first N runs, which I think would be the most convenient way to get more stable results with a minimum of iterations and total time to run the suite. As the most immediate solution I think we can use 3 or 5 repetitions; asv will take the median. Assuming this pattern of a very slow first iteration and a second one close to the following runs but significantly higher, if we use 3 repetitions we will be using the second as our measurement. If we use 5 repetitions, the median will be the slowest of the 3 last repetitions, the "normal" ones. In practice the difference is not huge. In the chart, the first noisy part is with 1 repetition (each benchmark is run just once). In the final part, the first section with slightly higher values is with 3 repetitions, and the last section is with 5 repetitions. This and more benchmarks can also be seen in this asv UI instance. The times to run the whole benchmark suite on the OVH server (including building pandas, which takes around 4 minutes, but not including calling asv publish):
Based on the numbers above, it looks like there is an overhead of 10 minutes to run the benchmarks (the 4 to build pandas, plus environment creation, benchmark discovery...), and then each run takes 8 minutes. As a next step I'll leave the main benchmarks server in OVH with all the configurations to make it stable, run the benchmarks with 5 repetitions, and we'll set up the conbench instance and action to report regressions (we will continue to have the asv UI). I think we can have this ready in the next few days. Once we reach this milestone, I think we can continue with Thomas' ideas. I think we should be able to get our benchmarks running in around 15 minutes, and we can have 3 dedicated servers for benchmarks; with this we should be able to run the benchmark suite for every commit of every PR and detect any regression before merging. There are some challenges, but I think this is doable. If anyone has thoughts, opinions or any feedback, that would be great. Based on the discussions we already had I think the above plan is what makes the most sense, but I'm happy to discuss further and replan as needed. |
Does this happen for every extension or just some? We have a mixed bag of multi-phase (ex: pd_datetime.c) and single-phase initialization extensions (ex: ujson.c). I'm not sure of all of the impacts of that but PEP 489 mentions differences in how those are treated within sub-interpreters. |
I didn't check in enough detail to know. It'd be great to understand that, but one option may be to simply preload all of them with asv, which may allow running each benchmark just once while keeping stability. Or perhaps it could add value to track the first run independently. I assume in most cases we care about regressions in the algorithms that run after the extension or library is loaded, but I guess it could be good to know if there is something that makes an extension load slower. If people are fine with it, we'll first focus on setting everything up to run the benchmarks with the increased stability and in 50 minutes (on the OVH server I think they currently take around 2.5h per run), and set up the conbench feature to warn us of regressions. Then we'll try to go deeper into the initialization of the extensions and any other noise that is left. |
What is the default number of repetitions? My only concern would be that dropping to 5 will generate more false positives, or benchmarks whose timing varies between runs. I'm operating on the assumption that the ASV authors set it higher for a reason. |
The default is determined by some logic; I think it's now 10, but it's fewer if the benchmarks take more than 0.1 seconds, in which case it stops early. I think in the past the logic was more complex and ran up to 100 times. Every benchmark is different, and how the server is configured makes a huge difference too. Many of our benchmarks are extremely stable and I'm confident about not repeating them at all. Only the ones loading stuff during the benchmark (not in setup) or caching things require repeating, as the first iteration is different. Also, I'm not sure about conbench, but Richard has a script that detects regressions in a smart way that identifies false positives caused by noise. I think going from 10 repetitions to 5 makes zero difference with the server properly configured. And the noise will be minimal in most benchmarks, compared to what we have now. We can also change the parameters or whatever is needed if we do get false positives once the system is in place. |
Before I saw this, I was thinking that you could do fewer repetitions on benchmarks that take longer to run, so if there is a way to configure a repetition count per benchmark (or per set of benchmarks), that might lower the overall time. For example, if we have some benchmark that takes on average 10 seconds to run, we can be safe with 3 repetitions, because the overhead of the first run will be swamped by the total benchmark time. That would save 20 seconds overall by reducing the count from 5 to 3 for that particular benchmark. Add that up over a lot of longer benchmarks, and that could help reduce the overall time. |
As far as I know, what you say is possible @Dr-Irv, and I think it makes sense in a scenario where we run the benchmarks 100 times or some other large number. In the current state we are able to run many benchmarks just once with minimal noise, and the noisy ones are noisy because the first run is slower, as things are being loaded and cached during the benchmark. So we do want to run those at least something like 3 times. In this particular case I wouldn't know how to use a max time for all repetitions of a benchmark in a way that is useful. |
FWIW, I took your last runs and computed the sum of all times of all the benchmarks, which is somewhat equivalent to running each one of them just once, assuming they could be run in sequence with no overhead, and leaving out the memory ones. That total is 170 seconds. So I guess there is a LOT of overhead with asv running each benchmark multiple times. |
That's interesting, thanks for checking @Dr-Irv. I guess most of the difference between the 3 minutes of adding up the benchmark times and the 14 minutes of their run (18 minutes minus 4 minutes of compiling pandas) is in the setup and teardown functions. I guess it's normal that they take time, as the setup will usually build datasets, and allocations are slow. But still, 11 minutes compared to the 3 minutes running the benchmarks seems a lot; I guess the overhead of asv itself may be significant, and there is probably room for improvement. Something else that's probably not very efficient is that all results are kept in memory as the process runs. The final result is a single json file. Changing that to jsonl, csv or something more efficient that is written to disk as benchmarks run could possibly make things a bit faster and lower the memory needs. We can consider recording more times in asv, like timing the setup functions... In the same way asv and pytest report durations, it could make sense for asv to report them for setup (maybe durations in asv already include the setup and teardown times, I don't know). In any case, I'll continue with the low-hanging fruit and set up everything as I proposed above for the first milestone, and then we can go deeper into the rabbit hole of where the time is really being spent, and try to optimize it so we can run the benchmarks fast enough to run as a regular CI job in PRs. |
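As a rough illustration of the "write results as you go" idea (the record fields and file name are made up):

```python
# Sketch of streaming results to a JSON Lines file as benchmarks finish,
# instead of keeping everything in memory for one big JSON at the end.
# The record fields and file name are illustrative.
import json
import time


def append_result(path, benchmark, timing_s, commit_hash):
    record = {
        "benchmark": benchmark,
        "time_s": timing_s,
        "commit_hash": commit_hash,
        "recorded_at": time.time(),
    }
    with open(path, "a") as f:      # one JSON object per line, append-only
        f.write(json.dumps(record) + "\n")


# append_result("results.jsonl",
#               "algorithms.DuplicatedMaskedArray.time_duplicated",
#               0.0069, "abc1234")
```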
The first part of the adapter that takes asv benchmark results and sends them to a conbench dashboard is done. I temporarily installed it on one of the OVH servers with a local setup. It’s still not automatically updating with the new files generated by asv; I only took the 125 most recent files as of last week. For the time being, they have to be read in batches. If someone wants to see one of the graphs: http://15.235.45.255:5000/benchmark-results/0656145ccca870b88000f4a7daf8f338/ I’m just starting to study the conbench alert system. |
This issue is to keep track of the benchmarks project we are about to start. Adding context and some initial ideas, but we'll keep evolving the project as we make progress.
The final goal of our benchmarks system is to detect when some functionality in pandas starts taking longer for no good reason. See this fictional example:
In the example, Series.sum for given data takes 1 millisecond in 2.0, while in 2.1 it takes 5 times that long, probably because of an unexpected side effect, not because we made changes that we considered worth making the function 5 times slower. When we introduce a performance regression like the one above, it is likely that users will end up reporting it via a GitHub issue. But ideally, we would like to detect it much earlier, and never release versions of pandas with those performance regressions. For this reason, we implemented many functions that run pandas functionality with some arbitrary but constant data: our benchmark suite, whose code is in asv_bench/benchmarks. Those benchmarks are implemented using a framework named asv. Running the benchmark suite gives us the time it takes to run each of many pandas functions with data that is consistent over runs. By comparing the executions with two different pandas versions, we can detect performance regressions. In an ideal world, what we'd like is to detect performance regressions in the CI of our PRs. So, before merging a PR, we can see there is a performance regression and make a decision on whether the change is worth making pandas slower. This would be equivalent to what we do with tests or with linting, where we get a red CI when something is not right, and merging doesn't happen until we're happy with the PR.
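For context, the benchmarks in asv_bench/benchmarks follow asv's conventions: a class whose setup method builds the data and whose time_* methods are what asv times, optionally parametrized. A minimal illustrative sketch (not an existing pandas benchmark class):

```python
# Minimal asv-style benchmark, following the conventions used in
# asv_bench/benchmarks. Illustrative only, not an existing pandas benchmark.
import numpy as np
import pandas as pd


class SeriesSum:
    # asv runs every parameter combination, passing it to setup and time_*.
    params = [10_000, 1_000_000]
    param_names = ["size"]

    def setup(self, size):
        rng = np.random.default_rng(0)
        self.ser = pd.Series(rng.standard_normal(size))

    def time_sum(self, size):
        self.ser.sum()
```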
In practice, running consistent benchmarks for every commit in a pull request is not feasible at this point. One of the main reasons is that we would need to run the benchmarks on both main and the PR of interest, which would double the execution time.
For now, we are giving up on executing the benchmarks for every commit in an open PR, and the focus has been on executing them after merging a PR. When running the benchmarks, we run them on a physical server, not a virtual server. The server we've been using was bought by Wes many years ago and was running 24/7 from his home; at some point it was moved to Tom's home. This is still how we run the benchmarks now. Some months ago, OVH Cloud donated credits to pandas for a dedicated server from their cloud for our benchmarks (we also got credits to host the website/docs and for other things we may need). There was some work on setting up the server and improving things, but we didn't complete even the initial work.
There are three main challenges to what would otherwise be a somewhat simple project:
The most common approach to benchmark stability is what is usually called statistical benchmarking. In its simplest form, the idea is that, for a single benchmark run, the function being timed is executed something like 100 times and the mean is taken. ASV does a slightly smarter version of this, where the first few runs (warm-up) are discarded, and the exact number of times a benchmark runs depends on the variance of the first runs. But 100 repetitions is common.
This repetition brings more stability, but obviously makes the second challenge worse, as timing every function 100 times makes the benchmarks 100 times slower. We have a CI job where we run our benchmark suite with just one call for each function, and the timing is very reasonable: the job takes 25 minutes on the CI worker. But the results are discarded, since they are very unstable, both because of the lack of repetition and because of the instability of using a virtual machine / CI worker.
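In code, the simplest form of this statistical approach is just a loop with a warm-up cutoff and an aggregate; the repetition and warm-up counts below are placeholders:

```python
# Simplest form of statistical benchmarking: run the function many times,
# discard a few warm-up runs, and aggregate. The counts are placeholders.
import statistics
import time


def bench(func, repetitions=100, warmup=5):
    timings = []
    for _ in range(warmup + repetitions):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    steady = timings[warmup:]
    return {"mean": statistics.mean(steady),
            "median": statistics.median(steady)}


# Example:
# import numpy as np, pandas as pd
# ser = pd.Series(np.arange(1_000_000))
# print(bench(lambda: ser.sum()))
```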
For now, what we want is:
Once we have a reasonable version of the above, some other things can be considered. Some ideas:
CC: @DeaMariaLeon @lithomas1 @rhshadrach