[ocaml5-issue] Abort / crash on thread_joingraph and thread_createtree under debug runtime #353

Closed
jmid opened this issue Jun 1, 2023 · 4 comments
Labels
ocaml5-issue A potential issue in the OCaml5 compiler/runtime

Comments

@jmid
Collaborator

jmid commented Jun 1, 2023

Yesterday, I saw an abort on src/thread/thread_joingraph using the debug runtime build on trunk:
https://github.com/ocaml-multicore/multicoretests/actions/runs/5135424605/jobs/9240823033

random seed: 24146066
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  100     0.0s Thread.create/join - tak work
[00] file runtime/domain.c; line 1526 ### Assertion failed: (uintnat)dom_st->young_ptr > (uintnat)dom_st->young_trigger
File "src/thread/dune", line 6, characters 7-23:
6 |  (name thread_joingraph)
           ^^^^^^^^^^^^^^^^
(cd _build/default/src/thread && ./thread_joingraph.exe --verbose)
Command got signal ABRT.
[ ]    0    0    0    0 /  100     0.0s Thread.create/join - tak work (generating)

After building a fresh local trunk switch, I was able to reproduce it locally by running the test in a loop:

$ while OCAMLRUNPARAM="v=1,V=1" _build/default/src/thread/thread_joingraph.exe --verbose -s 24146066; do :; done

[... 12 iterations go by without error]

### OCaml runtime: debug mode ###
random seed: 24146066
generated error fail pass / total     time test name
[ ]   79    0    0   79 /  100    66.9s Thread.create/join - tak work[00] file runtime/domain.c; line 1526 ### Assertion failed: (uintnat)dom_st->young_ptr > (uintnat)dom_st->young_trigger
Aborted (core dumped)
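
Since the failure only shows up after a varying number of runs, a bounded variant of the loop above can help record which iteration failed. The following is a minimal sketch, not something from multicoretests: the `run_until_failure` name and the iteration cap are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of multicoretests): rerun a command until it
# fails or a cap is reached, then report the iteration and the exit status.
run_until_failure() {
  max=$1
  shift
  i=1
  while [ "$i" -le "$max" ]; do
    if "$@"; then
      i=$((i + 1))
    else
      # In the else branch, $? still holds the status of the failed command.
      status=$?
      echo "failed on iteration $i with exit status $status"
      return "$status"
    fi
  done
  echo "no failure in $max iterations"
  return 0
}
```

For example, `run_until_failure 100 env OCAMLRUNPARAM="v=1,V=1" _build/default/src/thread/thread_joingraph.exe --verbose -s 24146066` would stop at the first failing run. A process killed by SIGABRT is normally reported by the shell as exit status 134 (128 plus signal number 6).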
@jmid jmid added the ocaml5-issue A potential issue in the OCaml5 compiler/runtime label Jun 1, 2023
@jmid
Collaborator Author

jmid commented Jun 12, 2023

Spotted this again on a Linux debug-runtime run of trunk (5.2):
https://github.com/ocaml-multicore/multicoretests/actions/runs/5224077321/jobs/9431819887

random seed: 381509558
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  100     0.0s Thread.create/join - tak work
[00] file runtime/domain.c; line 1526 ### Assertion failed: (uintnat)dom_st->young_ptr > (uintnat)dom_st->young_trigger
File "src/thread/dune", line 6, characters 7-23:
6 |  (name thread_joingraph)
           ^^^^^^^^^^^^^^^^
(cd _build/default/src/thread && ./thread_joingraph.exe --verbose)
Command got signal ABRT.
[ ]    0    0    0    0 /  100     0.0s Thread.create/join - tak work (generating)

@jmid
Collaborator Author

jmid commented Nov 1, 2023

On the branch https://github.com/ocaml-multicore/multicoretests/tree/unify-thread-domain where I'm playing with a reusable Work module, I can reproduce this pretty reliably:

$ dune build src/thread/thread_joingraph.exe --profile=debug-runtime
$ OCAMLRUNPARAM="s=1024,v=1,V=1" _build/default/src/thread/thread_joingraph.exe -v -s 107463665
### OCaml runtime: debug mode ###
random seed: 107463665
generated error fail pass / total     time test name
[ ]   22    0    0   22 /  200     6.8s Thread.create/join[00] file runtime/domain.c; line 1526 ### Assertion failed: (uintnat)dom_st->young_ptr > (uintnat)dom_st->young_trigger
Aborted (core dumped)

gdb reports the following backtrace:

[00] file runtime/domain.c; line 1526 ### Assertion failed: (uintnat)dom_st->young_ptr > (uintnat)dom_st->young_trigger
[Thread 0x7fffde1bc640 (LWP 2022924) exited]
[Thread 0x7fffdd1ba640 (LWP 2022926) exited]
[Thread 0x7ffff61ec640 (LWP 2022879) exited]
[Thread 0x7fffdc9b9640 (LWP 2022927) exited]

Thread 353 "thread_joingrap" received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff59eb640 (LWP 2022880)]
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140737314207296) at ./nptl/pthread_kill.c:44
44	./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737314207296) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140737314207296) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140737314207296, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff7c42476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff7c287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00005555556e6770 in caml_failed_assert (expr=expr@entry=0x5555556f9310 "(uintnat)dom_st->young_ptr > (uintnat)dom_st->young_trigger", 
    file_os=file_os@entry=0x5555556f8e59 "runtime/domain.c", line=line@entry=1526) at runtime/misc.c:56
#6  0x00005555556cd1d8 in caml_reset_young_limit (dom_st=<optimized out>) at runtime/domain.c:1526
#7  0x00005555556ce43b in caml_poll_gc_work () at runtime/domain.c:1647
#8  0x00005555556eb991 in caml_do_pending_actions_exn () at runtime/signals.c:308
#9  0x00005555556eba91 in caml_process_pending_actions_with_root_exn (root=<optimized out>) at runtime/signals.c:342
#10 0x00005555556ebc52 in caml_process_pending_actions_with_root (root=1) at runtime/signals.c:351
#11 caml_process_pending_actions () at runtime/signals.c:362
#12 <signal handler called>
#13 0x0000555555610f76 in camlWork.tak_1052 () at src/work.ml:41
#14 0x0000555555610f5e in camlWork.tak_1052 () at src/work.ml:41
#15 0x0000555555610f5e in camlWork.tak_1052 () at src/work.ml:41
#16 0x0000555555610f26 in camlWork.tak_1052 () at src/work.ml:41
#17 0x0000555555610f5e in camlWork.tak_1052 () at src/work.ml:41
#18 0x0000555555610f42 in camlWork.tak_1052 () at src/work.ml:41
#19 0x0000555555610f26 in camlWork.tak_1052 () at src/work.ml:41
#20 0x0000555555610f5e in camlWork.tak_1052 () at src/work.ml:41
#21 0x0000555555610f42 in camlWork.tak_1052 () at src/work.ml:41
#22 0x00005555556110b4 in camlWork.run_1125 () at src/work.ml:54
#23 0x000055555564eb07 in camlThread.fun_843 () at thread.ml:48
#24 <signal handler called>
#25 0x00005555556cabc8 in caml_callback_exn (closure=<optimized out>, closure@entry=140737347848384, arg=<optimized out>, arg@entry=1)
    at runtime/callback.c:197
#26 0x00005555556bae19 in caml_thread_start (v=<optimized out>) at st_stubs.c:552
#27 0x00007ffff7c94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#28 0x00007ffff7d26a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

@jmid
Collaborator Author

jmid commented Nov 9, 2023

On a Linux 5.2/trunk debug run of #409 we triggered a related error, this time in thread_createtree:
https://github.com/ocaml-multicore/multicoretests/actions/runs/6800766105/job/18490073637

random seed: 242456814
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s thread_createtree - with Atomic
[00] file runtime/domain.c; line 1621 ### Assertion failed: (uintnat)dom_st->young_ptr > (uintnat)dom_st->young_trigger
File "src/thread/dune", line 14, characters 7-24:
14 |  (name thread_createtree)
            ^^^^^^^^^^^^^^^^^
(cd _build/default/src/thread && ./thread_createtree.exe --verbose)
Command got signal ABRT.
[ ]    0    0    0    0 / 1000     0.0s thread_createtree - with Atomic (generating)

@jmid jmid changed the title [ocaml5-issue] Abort / crash on thread_joingraph under debug runtime [ocaml5-issue] Abort / crash on thread_joingraph and thread_createtree under debug runtime Nov 9, 2023
@jmid
Collaborator Author

jmid commented Nov 16, 2023

This is fixed for trunk in ocaml/ocaml/pull/12742, but it will still show up in 5.1 tests.
