testcase do not exit when run ‘malloc1_threads’ #33

Open
zhu1fei opened this issue Sep 2, 2021 · 2 comments

Comments


zhu1fei commented Sep 2, 2021

Hi,

'malloc1_threads' does not exit when the run ends.
It stops and hangs after printing "average:12343".

will-it-scale# ./malloc1_threads -t 2 -s 1
testcase:malloc/free of 128MB
warmup
min:4940 max:8187 total:13127
min:4866 max:7876 total:12742
min:5815 max:5989 total:11804
min:5658 max:6697 total:12355
min:5861 max:5868 total:11729
min:5803 max:6031 total:11834
measurement
min:5386 max:6957 total:12343
average:12343

will-it-scale# ldd ./malloc1_threads
linux-vdso.so.1 (0x00007ffd97a8a000)
libhwloc.so.0 => /usr/lib/x86_64-linux-gnu/libhwloc.so.0 (0x00007fdf07c69000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fdf07c48000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fdf07a87000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fdf07904000)
libudev.so.1 => /lib/x86_64-linux-gnu/libudev.so.1 (0x00007fdf078de000)
/lib64/ld-linux-x86-64.so.2 (0x00007fdf07d01000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fdf078d4000)

BTW: built with gcc-9.


ryanhrob commented Apr 6, 2023

Hi,

I've just hit this same issue running on an Ampere Altra (arm64) system. It is pretty easy to reproduce. Looking at the backtraces, I think we have a deadlock caused by asynchronous cancellation of the threads.

If the thread being cancelled is inside malloc() at the time of cancellation, it dies while holding the arena lock. Then, during thread shutdown, free() gets called and deadlocks waiting for a lock that will never be released, and the parent thread just waits in pthread_join() forever.
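
For illustration, a stripped-down program with the same shape as the hang looks roughly like this (a sketch I put together, not code from the repository; whether it actually deadlocks on a given run depends on timing and the glibc version):

    /* Sketch of the failure mode, not will-it-scale code.  An asynchronously
       cancellable worker hammers malloc()/free(); if the cancel lands while it
       holds the arena lock, its own thread-shutdown free() deadlocks and the
       main thread hangs in pthread_join(), as in the backtraces below. */
    #include <pthread.h>
    #include <stdlib.h>
    #include <unistd.h>

    static void *worker(void *arg)
    {
            (void)arg;
            pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, NULL);
            for (;;) {
                    /* 1 MB is too big for the tcache, so most iterations go
                       through the arena (and its lock) in malloc()/free(). */
                    void *p = malloc(1024 * 1024);
                    free(p);
            }
            return NULL;
    }

    int main(void)
    {
            pthread_t t;

            pthread_create(&t, NULL, worker, NULL);
            sleep(1);
            pthread_cancel(t);
            pthread_join(t, NULL);  /* may never return */
            return 0;
    }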

I think you could just not bother cleaning up the threads in the threaded case, and just exit the process. That might be a problem for testcase_cleanup(), though, if the test case is potentially still running concurrently?

Thanks,
Ryan

Thread back traces:

(gdb) thread apply all bt

Thread 3 (Thread 0xfffff4e6f100 (LWP 7068) "malloc1_threads" (Exiting)):
#0 futex_wait (private=0, expected=2, futex_word=0xffffe8000030) at ../sysdeps/nptl/futex-internal.h:146
#1 __GI___lll_lock_wait_private (futex=futex@entry=0xffffe8000030) at ./nptl/lowlevellock.c:34
#2 0x0000fffff7e1b384 in _int_free (av=0xffffe8000030, p=0xffffe80008d0, have_lock=0) at ./malloc/malloc.c:4576
#3 0x0000fffff7e1dc84 in __GI___libc_free (mem=mem@entry=0xffffe80008e0) at ./malloc/malloc.c:3391
#4 0x0000fffff7e1dd64 in tcache_thread_shutdown () at ./malloc/malloc.c:3231
#5 __malloc_arena_thread_freeres () at ./malloc/arena.c:1003
#6 0x0000fffff7e1fef8 in __libc_thread_freeres () at ./malloc/thread-freeres.c:44
#7 0x0000fffff7e0d498 in start_thread (arg=0x0) at ./nptl/pthread_create.c:456
#8 0x0000fffff7e75d1c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:79

Thread 2 (Thread 0xfffff567f100 (LWP 7067) "malloc1_threads" (Exiting)):
#0 futex_wait (private=0, expected=2, futex_word=0xfffff0000030) at ../sysdeps/nptl/futex-internal.h:146
#1 __GI___lll_lock_wait_private (futex=futex@entry=0xfffff0000030) at ./nptl/lowlevellock.c:34
#2 0x0000fffff7e1b384 in _int_free (av=0xfffff0000030, p=0xfffff00008d0, have_lock=0) at ./malloc/malloc.c:4576
#3 0x0000fffff7e1dc84 in __GI___libc_free (mem=mem@entry=0xfffff00008e0) at ./malloc/malloc.c:3391
#4 0x0000fffff7e1dd64 in tcache_thread_shutdown () at ./malloc/malloc.c:3231
#5 __malloc_arena_thread_freeres () at ./malloc/arena.c:1003
#6 0x0000fffff7e1fef8 in __libc_thread_freeres () at ./malloc/thread-freeres.c:44
#7 0x0000fffff7e0d498 in start_thread (arg=0x0) at ./nptl/pthread_create.c:456
#8 0x0000fffff7e75d1c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:79

Thread 1 (Thread 0xfffff7ff5440 (LWP 7066) "malloc1_threads"):
#0 __futex_abstimed_wait_common64 (private=128, cancel=true, abstime=0x0, op=265, expected=7067, futex_word=0xfffff567f1d0) at ./nptl/futex-internal.c:57
#1 __futex_abstimed_wait_common (cancel=true, private=128, abstime=0x0, clockid=0, expected=7067, futex_word=0xfffff567f1d0) at ./nptl/futex-internal.c:87
#2 __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0xfffff567f1d0, expected=7067, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=128) at ./nptl/futex-internal.c:139
#3 0x0000fffff7e0ef2c in __pthread_clockjoin_ex (threadid=281474798973184, thread_return=thread_return@entry=0x0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, block=block@entry=true) at ./nptl/pthread_join_common.c:105
#4 0x0000fffff7e0edb0 in ___pthread_join (threadid=<optimized out>, thread_return=thread_return@entry=0x0) at ./nptl/pthread_join.c:24
#5 0x0000aaaaaaaa1a7c in kill_tasks () at main.c:151
#6 main (argc=<optimized out>, argv=<optimized out>) at main.c:427


ryanhrob commented Apr 6, 2023

It's a big old hack, but I've solved the problem by cancelling but not joining the child threads. This means that the child threads should not be running by the time we call testcase_cleanup() in the parent thread. And if the child thread gets deadlocked during cancellation, it will all be cleaned up when the process exits.
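
Roughly, the cancel-without-join change has this shape (an illustrative sketch, not the literal diff; opt_tasks is from main.c, but the threads array name and the extern declarations are stand-ins for however main.c actually tracks the workers):

    #include <pthread.h>

    extern int opt_tasks;           /* number of worker tasks (from main.c) */
    extern pthread_t threads[];     /* stand-in name for the worker handles */

    void kill_tasks(void)
    {
            int i;

            /* Ask every worker to die... */
            for (i = 0; i < opt_tasks; i++)
                    pthread_cancel(threads[i]);

            /* ...but deliberately don't pthread_join() them.  A worker that was
               cancelled while holding a malloc arena lock deadlocks in its own
               shutdown path, and joining it would hang here forever.  Any stuck
               thread gets cleaned up when the whole process exits. */
    }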

For good measure, I've also removed this code from the main thread (we don't want it to deadlock on the same heap lock):

    for (i = 0; i < opt_tasks; i++) {
            hwloc_bitmap_free(args[i].cpuset);
            hwloc_topology_destroy(args[i].topology);
    }
    free(args);

There are a few cases where testcase_cleanup() calls free(), but that only happens for test cases that make direct syscalls, so there should be no risk of deadlock here.

Like I said, massive hack. But it unblocks me and shouldn't impact the results.

heatd added a commit to heatd/will-it-scale that referenced this issue Jun 9, 2023
See issue (antonblanchard#33).

Since asynchronous pthread_cancel on threads that may be holding locks is wrong, we
instead create a separate process that holds the threads. On exit, the
holding process simply exits; the parent wait()s for it and then cleans
up the testcase.

Signed-off-by: Pedro Falcato <[email protected]>
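
In outline, the approach described in that commit looks something like the sketch below (a self-contained illustration, not the actual patch; run_threads() and testcase_cleanup() here are stand-ins for the real will-it-scale functions):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void run_threads(void)
    {
            /* Stand-in: spawn the worker threads and run the measurement. */
            printf("benchmark runs here\n");
    }

    static void testcase_cleanup(void)
    {
            /* Stand-in: per-testcase cleanup done by the parent. */
            printf("cleanup runs here\n");
    }

    int main(void)
    {
            pid_t pid = fork();

            if (pid == 0) {
                    /* The child owns all worker threads.  When measurement ends
                       it simply _exit()s, so nothing ever joins or cancels a
                       thread that may have died holding a malloc lock. */
                    run_threads();
                    _exit(EXIT_SUCCESS);
            }

            /* The parent has no worker threads, so it can safely wait for the
               child and then clean up the testcase. */
            waitpid(pid, NULL, 0);
            testcase_cleanup();
            return 0;
    }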