"GC error (probable corruption)" with Enzyme on Julia 1.11 #2070

Closed
gdalle opened this issue Nov 7, 2024 · 10 comments

Comments

@gdalle
Contributor

gdalle commented Nov 7, 2024

Congrats on the huge work for making Enzyme compatible with Julia 1.11!

DI's test suite manages to run much further, but it now hits a weird GC error and the process is aborted. This does not happen on Julia 1.10, using the same Enzyme version (v0.13.14).
Unfortunately, I wasn't able to reproduce it locally. The tests do run on my computer: they fail, but at least they don't crash my REPL.

CI log: https://github.com/JuliaDiff/DifferentiationInterface.jl/actions/runs/11720210493/job/32645063624 from the PR JuliaDiff/DifferentiationInterface.jl#615.

Stack trace:

GC error (probable corruption)
Allocations: 615703113 (Pool: 615692568; Big: 10545); GC: 283
<?#0x7fe15b90c020::<circular reference @-1>>

thread 0 ptr queue:
~~~~~~~~~~ ptr queue top ~~~~~~~~~~
Memory{Float64}(6, 0x7fe159e15830)[0.678816, 0, 0, 0, 0, 0]
==========
Memory{Float64}(6, 0x7fe159e15790)[0.734308, 0.0857732, 0.299033, 0.289682, 0.78841, 0.679977]
==========
~~~~~~~~~~ ptr queue bottom ~~~~~~~~~~

[2994] signal 6 (-6): Aborted
in expression starting at /home/runner/work/DifferentiationInterface.jl/DifferentiationInterface.jl/DifferentiationInterface/test/Back/Enzyme/test.jl:31
pthread_kill at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
raise at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
gc_dump_queue_and_abort at /cache/build/builder-demeter6-6/julialang/julia-master/src/gc.c:2079
gc_mark_outrefs at /cache/build/builder-demeter6-6/julialang/julia-master/src/gc.c:2783 [inlined]
gc_mark_loop_serial_ at /cache/build/builder-demeter6-6/julialang/julia-master/src/gc.c:2944
gc_mark_loop_serial at /cache/build/builder-demeter6-6/julialang/julia-master/src/gc.c:2967
gc_mark_loop at /cache/build/builder-demeter6-6/julialang/julia-master/src/gc.c:3149 [inlined]
_jl_gc_collect at /cache/build/builder-demeter6-6/julialang/julia-master/src/gc.c:3538
ijl_gc_collect at /cache/build/builder-demeter6-6/julialang/julia-master/src/gc.c:3899
maybe_collect at /cache/build/builder-demeter6-6/julialang/julia-master/src/gc.c:922 [inlined]
jl_gc_pool_alloc_inner at /cache/build/builder-demeter6-6/julialang/julia-master/src/gc.c:1325
ijl_gc_pool_alloc_instrumented at /cache/build/builder-demeter6-6/julialang/julia-master/src/gc.c:1383
reshape at ./reshapedarray.jl:60 [inlined]
reshape at ./reshapedarray.jl:127 [inlined]
vec at ./abstractarraymath.jl:41 [inlined]
mat_to_vec at /home/runner/work/DifferentiationInterface.jl/DifferentiationInterface.jl/DifferentiationInterfaceTest/src/scenarios/default.jl:359 [inlined]
fwddiffejulia_mat_to_vec_155874wrap at /home/runner/work/DifferentiationInterface.jl/DifferentiationInterface.jl/DifferentiationInterfaceTest/src/scenarios/default.jl:0
macro expansion at /home/runner/.julia/packages/Enzyme/RvNgp/src/compiler.jl:8305 [inlined]
enzyme_call at /home/runner/.julia/packages/Enzyme/RvNgp/src/compiler.jl:7868 [inlined]
ForwardModeThunk at /home/runner/.julia/packages/Enzyme/RvNgp/src/compiler.jl:7657 [inlined]
autodiff at /home/runner/.julia/packages/Enzyme/RvNgp/src/Enzyme.jl:647 [inlined]
autodiff at /home/runner/.julia/packages/Enzyme/RvNgp/src/Enzyme.jl:537 [inlined]
autodiff at /home/runner/.julia/packages/Enzyme/RvNgp/src/Enzyme.jl:504 [inlined]
pushforward at /home/runner/work/DifferentiationInterface.jl/DifferentiationInterface.jl/DifferentiationInterface/ext/DifferentiationInterfaceEnzymeExt/forward_onearg.jl:58 [inlined]
#12 at /home/runner/work/DifferentiationInterface.jl/DifferentiationInterface.jl/DifferentiationInterface/src/first_order/pullback.jl:184 [inlined]
iterate at ./generator.jl:48 [inlined]
_collect at ./array.jl:800
collect_similar at ./array.jl:709 [inlined]
map at ./abstractarray.jl:3371 [inlined]
_pullback_via_pushforward at /home/runner/work/DifferentiationInterface.jl/DifferentiationInterface.jl/DifferentiationInterface/src/first_order/pullback.jl:183 [inlined]

Any idea what could be the cause?

Related:

@gdalle
Contributor Author

gdalle commented Nov 7, 2024

Apparently it's nondeterministic: I re-ran CI on the same commit and it didn't crash.

@wsmoses
Member

wsmoses commented Nov 8, 2024

Can you post an MWE?

@gdalle
Contributor Author

gdalle commented Nov 8, 2024

As stated:

> Unfortunately, I wasn't able to reproduce it locally. The tests do run on my computer: they fail, but at least they don't crash my REPL.

> Apparently it's nondeterministic: I re-ran CI on the same commit and it didn't crash.

Feel free to close the issue, but at least it's there if someone else encounters it

@wsmoses
Member

wsmoses commented Nov 8, 2024

Sure, but even if you can't reproduce it, the input code that triggered it is still useful (perhaps reduced to the part in question on CI).

@gdalle
Contributor Author

gdalle commented Nov 8, 2024

The code that triggered the error in CI is here: https://github.com/JuliaDiff/DifferentiationInterface.jl/blob/gd/en11/DifferentiationInterface/test/Back/Enzyme/test.jl

But I don't know how to narrow it down further. I tried running that whole file on my computer and it worked fine. It also worked on the second CI run (there are test failures but no crashes).

Besides, the stack trace doesn't help: it says

 [2994] signal 6 (-6): Aborted
in expression starting at /home/runner/work/DifferentiationInterface.jl/DifferentiationInterface.jl/DifferentiationInterface/test/Back/Enzyme/test.jl:31

but that line (31) doesn't contain anything error-worthy at all; it's not even autodiff testing:

https://github.com/JuliaDiff/DifferentiationInterface.jl/blob/ba00a78326d41ac3e85ae3eda27e3f5b5ad94949/DifferentiationInterface/test/Back/Enzyme/test.jl#L26-L31

I'm very puzzled by all this

@wsmoses
Member

wsmoses commented Nov 8, 2024

Yeah, I don't trust debug information during GC errors.

And FWIW, the last several GC issues were found in Julia proper, so it's very possible this is a bug in Julia itself.

In any case, I'd recommend taking your code in CI and removing code until it reliably succeeds. GC errors are often nondeterministic, so you may need to run it several times to see them.

See if you can make a standalone version that fails (probably starting by removing code on CI until it doesn't fail 10 times in a row), or perhaps see whether running the tests locally with Pkg.test() triggers it.
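
The "run it until it doesn't fail 10 times in a row" step could be sketched roughly as follows. This is only an illustration, not something from the Enzyme or DifferentiationInterface code bases: `count_failures` is a hypothetical helper, and the script path in the usage comment is a placeholder for whatever reduced file is being tested. Each run uses a fresh Julia process so that a GC abort kills only the child, not the driver.

```julia
# Sketch: run a command in a fresh process several times and count failures.
# `count_failures` is a hypothetical helper name, not an existing API.
function count_failures(cmd::Cmd; n::Integer=10)
    failures = 0
    for i in 1:n
        # `ignorestatus` keeps `run` from throwing when the child exits with a
        # nonzero code or is killed by a signal (e.g. SIGABRT after a GC error).
        proc = run(ignorestatus(cmd))
        if !success(proc)
            failures += 1
            println("run $i failed: exitcode=$(proc.exitcode), termsignal=$(proc.termsignal)")
        end
    end
    return failures
end

# Example usage (the file name is an assumption; point it at the reduced script):
# julia = Base.julia_cmd()
# count_failures(`$julia --project=. reduced_test.jl`; n=10)
```

A result of zero failures after ten runs does not prove the bug is gone, since the crash is nondeterministic, but it gives a consistent yardstick while removing code.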

@wsmoses
Member

wsmoses commented Nov 16, 2024

Any luck here?

@gdalle
Contributor Author

gdalle commented Nov 16, 2024

Sorry, I didn't spend any time chasing this because there is no telling how many CI cycles I would need to run to find a nondeterministic bug that disappeared the second time. It's not at the top of my priority list, but if the bug shows up again organically I'll let you know.

@gdalle
Contributor Author

gdalle commented Nov 16, 2024

At the moment I'm not running the Enzyme tests on 1.11 at all because of #2071, so once that is fixed there will be more CI iterations with that version.

@wsmoses closed this as not planned on Nov 16, 2024
@wsmoses
Member

wsmoses commented Nov 16, 2024

Okay, going to close for now then, since there's no code that can trigger this at the moment. Feel free to reopen when you have code that errors.
