
Scheduled tests failing while running example tests on python 3.8 (Ubuntu) #3415

Closed
arjxn-py opened this issue Oct 6, 2023 · 14 comments · Fixed by #3494
Assignees
Labels
bug Something isn't working priority: high To be resolved as soon as possible

Comments

@arjxn-py
Member

arjxn-py commented Oct 6, 2023

Example tests failing while testing half-cell.ipynb - https://github.com/pybamm-team/PyBaMM/actions/runs/6427005783/job/17452030411#step:15:682

@agriyakhetarpal agriyakhetarpal added bug Something isn't working priority: high To be resolved as soon as possible labels Oct 6, 2023
@kratman
Contributor

kratman commented Oct 6, 2023

I have noticed this on my PRs as well, but it appears to be intermittent. In my one PR it failed for python 3.11 instead of 3.8 like in your example. I have not been able to reproduce it locally either.

@arjxn-py
Member Author

arjxn-py commented Oct 7, 2023

> I have noticed this on my PRs as well, but it appears to be intermittent. In my one PR it failed for python 3.11 instead of 3.8 like in your example. I have not been able to reproduce it locally either.

I have also noticed it failing on 3.9, but now it is consistently failing on 3.8. Potentially a result of #3198.

@kratman
Contributor

kratman commented Oct 7, 2023

I can confirm it is random. If I run this code 3-4 times in the notebook, then I get a failure locally.

import matplotlib.pyplot as plt
import pybamm

model_with_degradation = pybamm.lithium_ion.DFN({
    "working electrode": "positive",
    "SEI": "reaction limited",  # SEI on both electrodes
    "SEI porosity change": "true",
    "particle mechanics": "swelling and cracking",
    "SEI on cracks": "true",
    "lithium plating": "partially reversible",
    "lithium plating porosity change": "true",  # alias for "SEI porosity change"
})
param_GrSi = pybamm.ParameterValues("OKane2022_graphite_SiOx_halfcell")
param_GrSi.update({"SEI reaction exchange current density [A.m-2]": 1.5e-07})
var_pts = {"x_n": 1, "x_s": 5, "x_p": 7, "r_n": 1, "r_p": 30}
exp_degradation = pybamm.Experiment(["Charge at 0.3C until 1.5 V", "Discharge at 0.3C until 0.005 V"])
sim3 = pybamm.Simulation(model_with_degradation, parameter_values=param_GrSi, experiment=exp_degradation, var_pts=var_pts)
sol3 = sim3.solve()
t = sol3["Time [s]"].entries
V = sol3["Voltage [V]"].entries
plt.figure()
plt.plot(t, V)
plt.xlabel("Time [s]")
plt.ylabel("Voltage [V]")
plt.show()

[screenshot: voltage vs. time plot from the notebook run]

I am running python 3.11 on a Mac M2 laptop.

@kratman
Contributor

kratman commented Oct 7, 2023

@agriyakhetarpal, @arjxn-py The solver may be sensitive to initial state. I will try to dig deeper later today.

@agriyakhetarpal
Member

I remember there have been similar issues with CasADi before; perhaps we should also check whether such stochastic behaviour persists outside M-series macOS machines, i.e., on Intel-based macOS systems.

@kratman
Contributor

kratman commented Oct 7, 2023

I will pull down the branch and test on my Linux machine as well.

@kratman
Contributor

kratman commented Oct 7, 2023

Here is what I can confirm from having seen it personally.

Failures from my laptops:

  • Mac M1, python 3.9: Random failure after 3-4 runs
  • Mac M2, python 3.11: Random failure after 3-4 runs
  • Linux mint i5, python 3.9: Random failure after 4-5 runs

Failures in CI:

  • Ubuntu, python 3.8
  • Ubuntu, python 3.11

@kratman
Contributor

kratman commented Oct 7, 2023

I think I have an adjustment that will fix it. I wrote a stress-testing script to make sure the random failures are gone. It will take a few hours to make sure everything is OK. I will make a PR tomorrow if it works.

@kratman kratman self-assigned this Oct 7, 2023
@kratman
Contributor

kratman commented Oct 7, 2023

I tried adjusting the var_pts and solver settings. So far I have not been able to stabilize the notebook. I am going to set this aside for now.

I was testing by running the code I posted above in a loop and counting how many times it failed. All the settings I tried had 6-10% failure rates over 1000 runs of the code.
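A minimal sketch of the stress-testing loop described above: run the notebook cell repeatedly and record the failure rate. `run_notebook_cell` is a hypothetical stand-in for the actual simulation code, not part of PyBaMM.

```python
def failure_rate(run_notebook_cell, n_runs=1000):
    """Run the given callable n_runs times and return the fraction that raised.

    `run_notebook_cell` is a placeholder for the notebook code posted above;
    any exception (e.g. a solver convergence error) counts as a failure.
    """
    failures = 0
    for _ in range(n_runs):
        try:
            run_notebook_cell()
        except Exception:
            failures += 1
    return failures / n_runs
```

With a 6-10% per-run failure probability, a loop of 1000 runs gives a fairly tight estimate of the rate.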

@kratman kratman removed their assignment Oct 7, 2023
@agriyakhetarpal
Member

It's good to know that this failure is random, at least, and that it occurs only in the example tests and not the unit or integration tests (or does it? I haven't noticed yet). I labelled this issue as high priority so that it can be looked at and fixed before the stable release.

@brosaplanella
Member

It was agreed in the dev meeting 30/10/23 that we will wait for the next release before tackling this.

@agriyakhetarpal
Member

agriyakhetarpal commented Nov 1, 2023

I pushed some experiments for this on this branch by running the half-cell models notebook 500 times in a row in a workflow (I tried running them 1000 times first, but that timed out miserably on GitHub Actions runners after six hours 😅). As we discussed in the meeting, it looks like the CasADi solver is sensitive to the initial conditions and fails to converge intermittently for currently unexplained reasons. While I cannot comment on the use of the solver given my inexperience with battery models, setting a fixed random seed when instantiating the Simulation class looks like an appropriate solution, similar to what was done to make the integration test suite deterministic in #2844 and the benchmarks in #3406.

The runs are here:

1. without any changes

https://github.com/agriyakhetarpal/PyBaMM/actions/runs/6709365330

Across Python 3.8–3.11, 500 runs on each, for a total of 2000, there were

  1. Python 3.8 – 72 failures
  2. Python 3.9 – 94 failures
  3. Python 3.10 – 92 failures
  4. Python 3.11 – 83 failures

These results are in line with @kratman's failure rate noted above, though slightly higher: 341 failures out of 2000 runs puts the failure probability close to 0.17. This is without any adjustments to the notebook or the parameter values involved.

2. with all Simulation objects initialised with fixed random seed

https://github.com/agriyakhetarpal/PyBaMM/actions/runs/6709408022

Zero failures across any Python version. All 2000 runs passed.

The logs have been saved to a file and are available as downloadable artifacts on both of the above links.


Notwithstanding such stochastic behaviour, it is reasonable to assume that the Python version is not relevant for this functionality, so we can pool all the results into one statistically significant experiment. Treating the runs as independent, the probability of encountering two failures in two consecutive runs of the code then comes close to 3% (we would obviously need to run more experiments, and this value is going to fluctuate as the number of runs increases to, say, 10,000 and onwards).
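A back-of-the-envelope check of the numbers quoted above: 341 failures in 2000 pooled runs, and the chance of two consecutive failures assuming independent runs.

```python
# Pooled failure probability across all Python versions (72 + 94 + 92 + 83 = 341).
p_fail = 341 / 2000            # ≈ 0.1705
# Probability of two failures in two consecutive independent runs.
p_two_in_a_row = p_fail ** 2   # ≈ 0.029, i.e. close to 3%
print(round(p_fail, 4), round(p_two_in_a_row, 4))
```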

I can write a PR to implement these changes, but I would appreciate some advice and clarifications first as to what they should look like and what sort of API changes should be required (e.g., should we provide this as an internal function in pybamm.util or rename it to something more informative, etc.). The current implementation works but is limited and not complete, I'm not sure whether we should make this behaviour optional using an argument in the class signature (like Simulation(fixed_random_seed=True) which can set to False as default), or implement this PyBaMM-wide in the settings class (or maybe keep it for all solvers). This can also be regarded in future as a precursor for other discussions (#1690). All unit tests are passing locally.

Also, it looks like the name of the object changes: instead of being <bound method Simulation.__init__ of <pybamm.simulation.Simulation object at 0x168a55f10>> it gets modified to be <bound method fix_random_seed_for_class.<locals>.new_init of <pybamm.simulation.Simulation object at 0x107843b50>>. It is not a big deal, but I will try to modify the __name__ and other class attributes using a sort of functools.wraps equivalent, but for classes instead of methods (this SO thread looks promising).
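A hypothetical sketch of the class decorator discussed above (the names `fix_random_seed_for_class` and `new_init` come from the traceback; the body is illustrative, not PyBaMM's actual implementation). Wrapping the replacement `__init__` with `functools.wraps` preserves `__name__` and related attributes, so the bound method no longer reports itself as `fix_random_seed_for_class.<locals>.new_init`.

```python
import functools
import random

import numpy as np


def fix_random_seed_for_class(cls, seed=42):
    """Patch cls.__init__ so every instantiation seeds the RNGs first."""
    original_init = cls.__init__

    @functools.wraps(original_init)  # keeps __name__ == "__init__"
    def new_init(self, *args, **kwargs):
        random.seed(seed)
        np.random.seed(seed)
        original_init(self, *args, **kwargs)

    cls.__init__ = new_init
    return cls
```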

@kratman
Contributor

kratman commented Nov 1, 2023

I think I would make the fixed-random-seed aspect a member function that gets called from "inside" when the solver is set to CasADi. That solves your wrapping issue and makes it easy to yank out if we find a better fix.
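The member-function approach suggested above could look something like the following minimal sketch. The class and method names here are illustrative stand-ins, not PyBaMM's actual API: the simulation seeds the RNGs from inside, just before solving, and only when the CasADi solver is in use.

```python
import random

import numpy as np


class Simulation:
    """Toy stand-in for pybamm.Simulation, illustrating the seeding hook."""

    def __init__(self, solver_name="casadi", seed=42):
        self.solver_name = solver_name
        self.seed = seed

    def _set_random_seed(self):
        # Called from "inside" before solving, and only for the CasADi
        # solver, so the workaround is easy to remove once a proper fix
        # for the intermittent convergence failures exists.
        if self.solver_name == "casadi":
            random.seed(self.seed)
            np.random.seed(self.seed)

    def solve(self):
        self._set_random_seed()
        # Stand-in for the actual solve; with a fixed seed the "random"
        # initial state is now deterministic across runs.
        return np.random.random()
```

Because the seeding lives in one private method, swapping it out later is a one-line change rather than an unwrapping exercise.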

@agriyakhetarpal
Member

Thanks @kratman, that sounds very reasonable, and doing that indeed solved the wrapping issue. It is a minimal change, so it can later be replicated across any other moving parts that tend to fail at random.

I have pushed some changes to my branch and will stress-test the notebook with each Python version one last time (another workflow run passed with zero failures across 2000 runs). I am self-assigning this and will open a PR soon.
