
Scheduled tests failing while running example tests on python 3.8 (Ubuntu) #3415

Closed
arjxn-py opened this issue Oct 6, 2023 · 14 comments · Fixed by #3494
Assignees
Labels
bug Something isn't working priority: high To be resolved as soon as possible

Comments

@arjxn-py
Member

arjxn-py commented Oct 6, 2023

Example tests failing while testing half-cell.ipynb - https://github.com/pybamm-team/PyBaMM/actions/runs/6427005783/job/17452030411#step:15:682

@agriyakhetarpal agriyakhetarpal added bug Something isn't working priority: high To be resolved as soon as possible labels Oct 6, 2023
@kratman
Contributor

kratman commented Oct 6, 2023

I have noticed this on my PRs as well, but it appears to be intermittent. In my one PR it failed for python 3.11 instead of 3.8 like in your example. I have not been able to reproduce it locally either.

@arjxn-py
Member Author

arjxn-py commented Oct 7, 2023

> I have noticed this on my PRs as well, but it appears to be intermittent. In my one PR it failed for python 3.11 instead of 3.8 like in your example. I have not been able to reproduce it locally either.

I have also noticed it failing on 3.9, but now it is consistently failing on 3.8. Potentially a result of #3198.

@kratman
Contributor

kratman commented Oct 7, 2023

I can confirm it is random. If I run this code 3-4 times in the notebook, then I get a failure locally.

import matplotlib.pyplot as plt
import pybamm

model_with_degradation = pybamm.lithium_ion.DFN({
    "working electrode": "positive",
    "SEI": "reaction limited",  # SEI on both electrodes
    "SEI porosity change": "true",
    "particle mechanics": "swelling and cracking",
    "SEI on cracks": "true",
    "lithium plating": "partially reversible",
    "lithium plating porosity change": "true",  # alias for "SEI porosity change"
})
param_GrSi = pybamm.ParameterValues("OKane2022_graphite_SiOx_halfcell")
param_GrSi.update({"SEI reaction exchange current density [A.m-2]": 1.5e-07})
var_pts = {"x_n": 1, "x_s": 5, "x_p": 7, "r_n": 1, "r_p": 30}
exp_degradation = pybamm.Experiment(["Charge at 0.3C until 1.5 V", "Discharge at 0.3C until 0.005 V"])
sim3 = pybamm.Simulation(model_with_degradation, parameter_values=param_GrSi, experiment=exp_degradation, var_pts=var_pts)
sol3 = sim3.solve()
t = sol3["Time [s]"].entries
V = sol3["Voltage [V]"].entries
plt.figure()
plt.plot(t, V)
plt.xlabel("Time [s]")
plt.ylabel("Voltage [V]")
plt.show()

[screenshot: voltage vs. time plot from the notebook run]

I am running python 3.11 on a Mac M2 laptop.

@kratman
Contributor

kratman commented Oct 7, 2023

@agriyakhetarpal, @arjxn-py The solver may be sensitive to initial state. I will try to dig deeper later today.

@agriyakhetarpal
Member

I remember there have been similar issues with CasADi before; perhaps we should also check whether such stochastic behaviour persists outside M-series macOS machines, i.e., on Intel-based macOS systems.

@kratman
Contributor

kratman commented Oct 7, 2023

I will pull down the branch and test on my Linux machine as well.

@kratman
Contributor

kratman commented Oct 7, 2023

Here is what I can confirm from having seen it personally.

Failures from my laptops:

  • Mac M1, python 3.9: Random failure after 3-4 runs
  • Mac M2, python 3.11: Random failure after 3-4 runs
  • Linux mint i5, python 3.9: Random failure after 4-5 runs

Failures in CI:

  • Ubuntu, python 3.8
  • Ubuntu, python 3.11

@kratman
Contributor

kratman commented Oct 7, 2023

I think I have an adjustment that will fix it. I wrote a stress-testing script to make sure the random failures are gone. It will take a few hours to make sure everything is OK. I will make a PR tomorrow if it works.

@kratman kratman self-assigned this Oct 7, 2023
@kratman
Contributor

kratman commented Oct 7, 2023

I tried adjusting the var_pts and solver settings. So far I have not been able to stabilize the notebook. I am going to set this aside for now.

I was testing by running the code I posted above in a loop and counting how many times it failed. All the settings I tried had 6-10% failure rates over 1000 runs of the code.
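A minimal sketch of the stress-testing loop described above: run the notebook cell repeatedly and record the failure rate. `run_notebook_cell` is a hypothetical stand-in for the actual simulation code, not part of PyBaMM.

```python
def failure_rate(run_notebook_cell, n_runs=1000):
    """Run the given callable n_runs times and return the fraction that raised.

    `run_notebook_cell` is a placeholder for the notebook code posted above;
    any exception (e.g. a solver convergence error) counts as a failure.
    """
    failures = 0
    for _ in range(n_runs):
        try:
            run_notebook_cell()
        except Exception:
            failures += 1
    return failures / n_runs
```

With a 6-10% per-run failure probability, a loop of 1000 runs gives a fairly tight estimate of the rate.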

@kratman kratman removed their assignment Oct 7, 2023
@agriyakhetarpal
Member

It's good to know that this failure is random, at least, and that it occurs only in the example tests and not the unit or integration tests (or does it? I haven't noticed yet). I labelled this issue as high priority so that it can be looked at and fixed before the stable release.

@brosaplanella
Member

It was agreed in the dev meeting 30/10/23 that we will wait for the next release before tackling this.

@agriyakhetarpal
Member

agriyakhetarpal commented Nov 1, 2023

I pushed some experiments for this on this branch by running the half-cell models notebook 500 times in a row in a workflow (I tried running them 1000 times first, but that timed out miserably on GitHub Actions runners after six hours 😅). As we discussed in the meeting, it looks like the CasADi solver is sensitive to the initial conditions and fails to converge intermittently for currently unexplained reasons. While I cannot comment on the use of the solver given my inexperience with battery models, setting a fixed random seed when instantiating the Simulation class looks like an appropriate solution, similar to what was done to make the integration test suite deterministic in #2844 and the benchmarks in #3406.

The runs are here:

1. without any changes

https://github.com/agriyakhetarpal/PyBaMM/actions/runs/6709365330

Across Python 3.8–3.11, 500 runs on each, for a total of 2000, there were

  1. Python 3.8 – 72 failures
  2. Python 3.9 – 94 failures
  3. Python 3.10 – 92 failures
  4. Python 3.11 – 83 failures

These results are in line with @kratman's failure rate noted above, though slightly higher: 341 failures out of 2000 runs puts the failure probability close to 0.17. This is without any adjustments to the notebook or the parameter values involved.

2. with all Simulation objects initialised with fixed random seed

https://github.com/agriyakhetarpal/PyBaMM/actions/runs/6709408022

Zero failures across any Python version. All 2000 runs passed.

The logs have been saved to a file and are available as downloadable artifacts on both of the above links.


Notwithstanding such stochastic behaviour, it is reasonable to assume that the Python version is not relevant for this functionality, so we can pool all the results into one statistically significant experiment. Treating the runs as independent, the probability of encountering two failures in two consecutive runs of the code then comes close to 3% (we would obviously need to run more experiments, and this value is going to fluctuate as the number of runs increases to, say, 10,000 and onwards).
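A back-of-the-envelope check of the numbers quoted above: 341 failures in 2000 pooled runs, and the chance of two consecutive failures assuming independent runs.

```python
# Pooled failure probability across all Python versions (72 + 94 + 92 + 83 = 341).
p_fail = 341 / 2000            # ≈ 0.1705
# Probability of two failures in two consecutive independent runs.
p_two_in_a_row = p_fail ** 2   # ≈ 0.029, i.e. close to 3%
print(round(p_fail, 4), round(p_two_in_a_row, 4))
```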

I can write a PR to implement these changes, but I would appreciate some advice and clarifications first as to what they should look like and what sort of API changes should be required (e.g., should we provide this as an internal function in pybamm.util or rename it to something more informative, etc.). The current implementation works but is limited and not complete, I'm not sure whether we should make this behaviour optional using an argument in the class signature (like Simulation(fixed_random_seed=True) which can set to False as default), or implement this PyBaMM-wide in the settings class (or maybe keep it for all solvers). This can also be regarded in future as a precursor for other discussions (#1690). All unit tests are passing locally.

Also, it looks like the name of the object changes: instead of being <bound method Simulation.__init__ of <pybamm.simulation.Simulation object at 0x168a55f10>> it gets modified to be <bound method fix_random_seed_for_class.<locals>.new_init of <pybamm.simulation.Simulation object at 0x107843b50>>. It is not a big deal, but I will try to modify the __name__ and other class attributes using a sort of functools.wraps equivalent, but for classes instead of methods (this SO thread looks promising).
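A hypothetical sketch of the class decorator discussed above (the names `fix_random_seed_for_class` and `new_init` come from the traceback; the body is illustrative, not PyBaMM's actual implementation). Wrapping the replacement `__init__` with `functools.wraps` preserves `__name__` and related attributes, so the bound method no longer reports itself as `fix_random_seed_for_class.<locals>.new_init`.

```python
import functools
import random

import numpy as np


def fix_random_seed_for_class(cls, seed=42):
    """Patch cls.__init__ so every instantiation seeds the RNGs first."""
    original_init = cls.__init__

    @functools.wraps(original_init)  # keeps __name__ == "__init__"
    def new_init(self, *args, **kwargs):
        random.seed(seed)
        np.random.seed(seed)
        original_init(self, *args, **kwargs)

    cls.__init__ = new_init
    return cls
```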

@kratman
Contributor

kratman commented Nov 1, 2023

I think I would make the fixed-random-seed aspect a member function that gets called from "inside" when the solver is set to CasADi. That solves your wrapping issue and makes it easy to yank out if we find a better fix.
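The member-function approach suggested above could look something like the following minimal sketch. The class and method names here are illustrative stand-ins, not PyBaMM's actual API: the simulation seeds the RNGs from inside, just before solving, and only when the CasADi solver is in use.

```python
import random

import numpy as np


class Simulation:
    """Toy stand-in for pybamm.Simulation, illustrating the seeding hook."""

    def __init__(self, solver_name="casadi", seed=42):
        self.solver_name = solver_name
        self.seed = seed

    def _set_random_seed(self):
        # Called from "inside" before solving, and only for the CasADi
        # solver, so the workaround is easy to remove once a proper fix
        # for the intermittent convergence failures exists.
        if self.solver_name == "casadi":
            random.seed(self.seed)
            np.random.seed(self.seed)

    def solve(self):
        self._set_random_seed()
        # Stand-in for the actual solve; with a fixed seed the "random"
        # initial state is now deterministic across runs.
        return np.random.random()
```

Because the seeding lives in one private method, swapping it out later is a one-line change rather than an unwrapping exercise.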

@agriyakhetarpal
Member

Thanks @kratman, that sounds very reasonable, and doing that indeed solved the wrapping issue. It is a minimal change, so it can later be replicated across any other moving parts that tend to fail at random.

I have pushed some changes to my branch and will stress-test the notebook with each Python version one last time (another workflow run passed with zero failures across 2000 runs). I am self-assigning this and will open a PR soon.
