SEGFAULT sometimes occurs when testing libloading code #41

binkoni · 2018-03-27T06:04:06Z

Operating system is Fedora 27.

uname -r
4.15.4-300.fc27.x86_64

When running cargo test multiple times, the test sometimes fails with a segmentation fault.

#[test]
fn libloading() {
    let lib = libloading::Library::new("libdltest.so").unwrap();
    unsafe {
        let test_fn: libloading::Symbol<unsafe extern fn(i32) -> i32> = lib.get(b"test").unwrap();
        assert!(test_fn(100) == 200);
    }
}

test test::libloading ... error: process didn't exit successfully: `/home/User/test_libloading/target/debug/deps/test_libloading-b3311c80df7834fd libloading` (signal: 11, SIGSEGV: invalid memory reference)

But when this code runs in an executable crate, it seems to run without the segfault, even when I execute it multiple times.

fn main() {
    let lib = libloading::Library::new("libdltest.so").unwrap();
    unsafe {
        let test_fn: libloading::Symbol<unsafe extern fn(i32) -> i32> = lib.get(b"test").unwrap();
        assert!(test_fn(100) == 200);
    }
}

nagisa · 2018-03-27T06:45:55Z

Amusing. Could you try getting a backtrace with a debugger?

binkoni · 2018-03-28T00:57:15Z

I tried debugging with gdb, but gdb was unable to get symbols from the test executable for an unknown reason.
Instead I used valgrind, and it looks like the segfault was caused by some kind of thread safety violation.
Here is the output:

==7268== Memcheck, a memory error detector
==7268== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==7268== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==7268== Command: ./test_libloading-8b92f602ec17e25e
==7268== 
==7268== Thread 2 libloading:
==7268== Jump to the invalid address stated on the next line
==7268==    at 0x6ED5260: ???
==7268==    by 0x524E367: __nptl_deallocate_tsd.part.5 (in /usr/lib64/libpthread-2.26.so)
==7268==    by 0x524F75A: start_thread (in /usr/lib64/libpthread-2.26.so)
==7268==    by 0x579598E: clone (in /usr/lib64/libc-2.26.so)
==7268==  Address 0x6ed5260 is not stack'd, malloc'd or (recently) free'd
==7268== 
==7268== Can't extend stack to 0x402c138 during signal delivery for thread 2:
==7268==   no stack segment
==7268== 
==7268== Process terminating with default action of signal 11 (SIGSEGV)
==7268==  Access not within mapped region at address 0x402C138
==7268==    at 0x6ED5260: ???
==7268==    by 0x524E367: __nptl_deallocate_tsd.part.5 (in /usr/lib64/libpthread-2.26.so)
==7268==    by 0x524F75A: start_thread (in /usr/lib64/libpthread-2.26.so)
==7268==    by 0x579598E: clone (in /usr/lib64/libc-2.26.so)
==7268==  If you believe this happened as a result of a stack
==7268==  overflow in your program's main thread (unlikely but
==7268==  possible), you can try to increase the size of the
==7268==  main thread stack using the --main-stacksize= flag.
==7268==  The main thread stack size used in this run was 8388608.
==7268== Invalid write of size 8
==7268==    at 0x4A2A64A: _vgnU_freeres (vg_preloaded.c:59)
==7268==  Address 0x402cff8 is on thread 2's stack
==7268== 
==7268== 
==7268== Process terminating with default action of signal 11 (SIGSEGV)
==7268==  Access not within mapped region at address 0x402CFF8
==7268==    at 0x4A2A64A: _vgnU_freeres (vg_preloaded.c:59)
==7268==  If you believe this happened as a result of a stack
==7268==  overflow in your program's main thread (unlikely but
==7268==  possible), you can try to increase the size of the
==7268==  main thread stack using the --main-stacksize= flag.
==7268== 
==7268== HEAP SUMMARY:
==7268==     in use at exit: 352 bytes in 3 blocks
==7268==   total heap usage: 23 allocs, 20 frees, 5,984 bytes allocated
==7268== 
==7268== Thread 1:
==7268== 32 bytes in 1 blocks are still reachable in loss record 1 of 3
==7268==    at 0x4C31A1E: calloc (vg_replace_malloc.c:711)
==7268==    by 0x56BA081: __cxa_thread_atexit_impl (in /usr/lib64/libc-2.26.so)
==7268==    by 0x1611C4: register_dtor (fast_thread_local.rs:39)
==7268==    by 0x1611C4: register_dtor<core::cell::RefCell<core::option::Option<std::sys_common::thread_info::ThreadInfo>>> (local.rs:431)
==7268==    by 0x1611C4: get<core::cell::RefCell<core::option::Option<std::sys_common::thread_info::ThreadInfo>>> (local.rs:422)
==7268==    by 0x1611C4: std::sys_common::thread_info::THREAD_INFO::__getit (local.rs:184)
==7268==    by 0x159BE4: try_with<core::cell::RefCell<core::option::Option<std::sys_common::thread_info::ThreadInfo>>,closure,()> (local.rs:374)
==7268==    by 0x159BE4: <std::thread::local::LocalKey<T>>::with (local.rs:288)
==7268==    by 0x160F9B: set (thread_info.rs:46)
==7268==    by 0x160F9B: std::rt::lang_start_internal (rt.rs:51)
==7268==    by 0x11A4A1: std::rt::lang_start (rt.rs:74)
==7268==    by 0x11AC6D: main (lib.rs:0)
==7268== 
==7268== 32 bytes in 1 blocks are still reachable in loss record 2 of 3
==7268==    at 0x4C31A1E: calloc (vg_replace_malloc.c:711)
==7268==    by 0x4E3D7C4: _dlerror_run (in /usr/lib64/libdl-2.26.so)
==7268==    by 0x4E3D140: dlsym (in /usr/lib64/libdl-2.26.so)
==7268==    by 0x16BFAB: _ZN3std3sys4unix4weak5fetch17h0bb2acfe88a8cfddE.llvm.E9BB0996 (weak.rs:78)
==7268==    by 0x15C3D7: get<unsafe extern "C" fn(*const libc::unix::notbsd::linux::other::b64::x86_64::pthread_attr_t) -> usize> (weak.rs:62)
==7268==    by 0x15C3D7: min_stack_size (thread.rs:378)
==7268==    by 0x15C3D7: std::sys::unix::thread::Thread::new (thread.rs:59)
==7268==    by 0x131FD3: std::thread::Builder::spawn (mod.rs:416)
==7268==    by 0x128056: test::run_test::run_test_inner (lib.rs:1424)
==7268==    by 0x12774E: test::run_test (lib.rs:1448)
==7268==    by 0x122936: run_tests<closure> (lib.rs:1108)
==7268==    by 0x122936: test::run_tests_console (lib.rs:929)
==7268==    by 0x11D9D6: test::test_main (lib.rs:267)
==7268==    by 0x11DCA4: test::test_main_static (lib.rs:303)
==7268==    by 0x11AC39: test_libloading::__test::main (lib.rs:1)
==7268== 
==7268== 288 bytes in 1 blocks are possibly lost in loss record 3 of 3
==7268==    at 0x4C31A1E: calloc (vg_replace_malloc.c:711)
==7268==    by 0x4013D66: _dl_allocate_tls (in /usr/lib64/ld-2.26.so)
==7268==    by 0x5250177: pthread_create@@GLIBC_2.2.5 (in /usr/lib64/libpthread-2.26.so)
==7268==    by 0x15C491: std::sys::unix::thread::Thread::new (thread.rs:78)
==7268==    by 0x131FD3: std::thread::Builder::spawn (mod.rs:416)
==7268==    by 0x128056: test::run_test::run_test_inner (lib.rs:1424)
==7268==    by 0x12774E: test::run_test (lib.rs:1448)
==7268==    by 0x122936: run_tests<closure> (lib.rs:1108)
==7268==    by 0x122936: test::run_tests_console (lib.rs:929)
==7268==    by 0x11D9D6: test::test_main (lib.rs:267)
==7268==    by 0x11DCA4: test::test_main_static (lib.rs:303)
==7268==    by 0x11AC39: test_libloading::__test::main (lib.rs:1)
==7268==    by 0x11A4C1: std::rt::lang_start::{{closure}} (rt.rs:74)
==7268== 
==7268== LEAK SUMMARY:
==7268==    definitely lost: 0 bytes in 0 blocks
==7268==    indirectly lost: 0 bytes in 0 blocks
==7268==      possibly lost: 288 bytes in 1 blocks
==7268==    still reachable: 64 bytes in 2 blocks
==7268==         suppressed: 0 bytes in 0 blocks
==7268== 
==7268== For counts of detected and suppressed errors, rerun with: -v
==7268== ERROR SUMMARY: 3 errors from 3 contexts (suppressed: 0 from 0)

nagisa · 2018-03-28T07:12:36Z

Okay, so it seems to be the same (or a very similar) problem as #5, except this time on Linux.

It seems to me that in your case the libdltest.so in question has some thread-local data, which gets allocated when you run test_fn (is it using println!? That uses thread-local data), but the test thread terminates after the library is unloaded -- by which point the destructor code is gone.

Not sure why this does not happen with executables, but it would be nice if you tried the workaround referenced in this comment.
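
The timing described above can be observed with a small std-only sketch. It is not code from this thread: `Guard`, `DTOR_RAN`, and `observe_tls_destructor` are hypothetical names, and no dynamic library is involved. It only shows that a thread-local value's destructor runs while the thread is exiting, after the thread body has already returned.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

/// Set to true by the thread-local value's destructor.
static DTOR_RAN: AtomicBool = AtomicBool::new(false);

/// Hypothetical stand-in for thread-local state created by library code.
struct Guard;

impl Drop for Guard {
    fn drop(&mut self) {
        DTOR_RAN.store(true, Ordering::SeqCst);
    }
}

thread_local! {
    static GUARD: Guard = Guard;
}

/// Touches the thread-local on a fresh thread and reports whether the
/// destructor had run (a) at the end of the thread body and (b) after join.
fn observe_tls_destructor() -> (bool, bool) {
    DTOR_RAN.store(false, Ordering::SeqCst);
    let ran_inside = thread::spawn(|| {
        // The first access initializes the value and registers its destructor.
        GUARD.with(|_| ());
        // Still inside the thread body: the destructor has not run yet.
        DTOR_RAN.load(Ordering::SeqCst)
    })
    .join()
    .unwrap();
    // The thread has fully exited, so its thread-local destructors have run.
    (ran_inside, DTOR_RAN.load(Ordering::SeqCst))
}
```

If the destructor's machine code lived in a library that was unloaded inside the thread body, the call made during thread exit would presumably be the jump to an unmapped address seen in the valgrind output.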

binkoni · 2018-03-28T10:11:05Z

The workaround you mentioned worked, and the segfault doesn't seem to occur anymore. But I don't think test_fn uses thread-local storage, because the function is very simple.
The source code of libdltest.so

jackcmay · 2018-09-22T01:39:00Z

I'm seeing the same issue in both tests and executables when loading/unloading modules in threads on Linux (~16.04.1-Ubuntu SMP Thu Jul 19 09:46:30 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux); it does not reproduce on recent OS X.

rust-lang/rust#28794 captures the issue well, but I think the problem still exists on Linux.

The workarounds mentioned so far won't work when the module must be loaded/unloaded many times from a thread. One alternative workaround is to load the module, spawn the thread, then unload the module after the thread has been cleaned up. This workaround is also not always viable, and really it should be fixed on Linux.
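
As a rough illustration of that ordering (load, spawn, join, then unload), here is a std-only sketch. `Module` is a hypothetical stand-in whose `Drop` models unloading; it does not call the real loader, and the event log exists only to make the order visible.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

type Log = Arc<Mutex<Vec<&'static str>>>;

/// Hypothetical stand-in for a loaded module; dropping it models unloading.
struct Module {
    log: Log,
}

impl Module {
    fn load(log: &Log) -> Module {
        log.lock().unwrap().push("load");
        Module { log: Arc::clone(log) }
    }

    fn call(&self) {
        self.log.lock().unwrap().push("call");
    }
}

impl Drop for Module {
    fn drop(&mut self) {
        self.log.lock().unwrap().push("unload");
    }
}

/// Loads the module on the calling thread, uses it from a worker thread,
/// and unloads it only after the worker has been joined.
fn run_with_module() -> Vec<&'static str> {
    let log: Log = Arc::new(Mutex::new(Vec::new()));
    let module = Module::load(&log);
    // Scoped threads are joined before `scope` returns, so the worker
    // cannot outlive `module`.
    thread::scope(|s| {
        s.spawn(|| module.call());
    });
    log.lock().unwrap().push("thread joined");
    drop(module); // "unload" happens strictly after the join
    let events = log.lock().unwrap().clone();
    events
}
```

`thread::scope` (Rust 1.63+) makes the borrow checker enforce the ordering: the module cannot be dropped while the worker still borrows it.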

Is anyone familiar with movement on this issue from the Linux side?

nagisa · 2018-09-22T06:44:13Z

Can either of you produce a minimal reproducer?

jackcmay · 2018-09-23T22:54:05Z

Try this out: https://github.com/jackcmay/dynamic_module_segfault

nagisa · 2018-09-24T18:05:29Z

@jackcmay This is literally the TLS issue, similar to the one that OS X fixed by explicitly not unloading dynamic libraries if they have any thread locals, implicitly applying RTLD_NODELETE to them. This is the first time I see it being reproduced on Linux though.

I’m fairly confident that you can construct a similarly problematic sample when the loader is C/C++ code (i.e. taking libloading out of the equation), as well as without a dynamic library written in Rust (by, for example, using C++’s thread-local variables, taking Rust’s libstd out of the equation as well). I think you’re best off reporting this against glibc itself.

Just like the OS X counterpart #5, I’m inclined to call this issue not-a-bug in libloading. It is most likely not a bug in rust’s libstd as well.

In your case a viable workaround, if RTLD_NODELETE is not appropriate, could be ensuring that you do not use anything that creates thread-local variables (such as stdio from libstd).

KizzyCode · 2018-12-18T17:32:56Z

Ok, the workaround with RTLD_NODELETE works (the values are from the libc-crate):

// Load and initialize library
#[cfg(target_os = "linux")]
let library: Library = {
    // Load library with `RTLD_NOW | RTLD_NODELETE` to fix a SIGSEGV
    ::libloading::os::unix::Library::open(Some(path.as_ref()), 0x2 | 0x1000)?.into()
};
#[cfg(not(target_os = "linux"))]
let library = Library::new(path.as_ref())?;
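
For reference, the two magic numbers above are `RTLD_NOW` (0x2) and `RTLD_NODELETE` (0x1000). The sketch below spells them out and calls `dlopen` directly through FFI, without libloading. It assumes Linux with glibc (the flag values differ on other platforms), and `open_nodelete` and `close` are hypothetical helper names.

```rust
use std::ffi::CStr;
use std::os::raw::{c_char, c_int, c_void};

// Flag values as defined by glibc on Linux (assumption: Linux only).
const RTLD_NOW: c_int = 0x2;
const RTLD_NODELETE: c_int = 0x1000;

extern "C" {
    fn dlopen(filename: *const c_char, flags: c_int) -> *mut c_void;
    fn dlclose(handle: *mut c_void) -> c_int;
}

/// Opens `path` so that a later `dlclose` leaves its code mapped.
/// Passing `None` requests a handle to the main program.
fn open_nodelete(path: Option<&CStr>) -> Option<*mut c_void> {
    let name = path.map_or(std::ptr::null(), |p| p.as_ptr());
    // SAFETY: `name` is either null or a valid NUL-terminated string.
    // Loading a library runs its initializers; the caller must trust it.
    let handle = unsafe { dlopen(name, RTLD_NOW | RTLD_NODELETE) };
    if handle.is_null() {
        None
    } else {
        Some(handle)
    }
}

/// Releases the handle; with RTLD_NODELETE the library stays mapped.
fn close(handle: *mut c_void) -> bool {
    // SAFETY: `handle` must come from a successful `dlopen` call.
    unsafe { dlclose(handle) == 0 }
}
```

Because the library is never unmapped, destructors registered by its code remain callable at thread exit; the cost is that the library cannot be reloaded from a changed file within the same process.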

Aaron1011 · 2021-02-20T19:49:48Z

The glibc documentation mentions this kind of issue, and how the implementation (supposedly) avoids it: https://sourceware.org/glibc/wiki/Destructor%20support%20for%20thread_local%20variables

I suspect that this may be a problem with the particular shared library being unloaded. In particular, the glibc implementation seems to rely on the function __cxa_thread_atexit_impl being called by the programming language runtime (Rust calls it here)

Aaron1011 · 2021-02-20T20:58:18Z

I've created a standalone C reproducer, using pthread_key_create: https://github.com/Aaron1011/pthread_dlopen

It does the following:

1. Spawn a new thread from main(), and block on it using pthread_join.
2. From the new thread, load a simple shared library, and call a function in it.
3. In the shared library, call pthread_key_create with a destructor function, and call pthread_setspecific with a non-NULL value to force the destructor to actually run on thread exit.
4. Back in the main program (on the thread), call dlclose on the shared library.
5. Return from the thread function.

This causes the following segfault:

[Current thread is 1 (Thread 0x7facc3ff4640 (LWP 701108))]
gef➤  bt
#0  0x00007facc4229129 in ?? ()
#1  0x00007facc41cd411 in __nptl_deallocate_tsd.part.0 () from /usr/lib/libpthread.so.0
#2  0x00007facc41ce2ba in start_thread () from /usr/lib/libpthread.so.0
#3  0x00007facc40f7053 in clone () from /usr/lib/libc.so.6
gef➤  q

It appears that calling dlclose on a shared library does not unregister any pthreads destructors (and presumably, any destructors registered using the same underlying mechanism that pthreads uses).

Aaron1011 · 2021-02-20T22:34:31Z

There's already a glibc bug report for this: https://sourceware.org/bugzilla/show_bug.cgi?id=21032

Unfortunately, it's been open since 2017 without any feedback.

Aaron1011 · 2021-02-21T20:03:24Z

While this is not a bug in libloading, I think it's a major footgun for anyone using this library. As https://github.com/Aaron1011/pthread_dlopen demonstrates, it's not possible to determine ahead of time if a shared library has registered TLS destructors (if it's even possible at all), since they may be registered long after the library is loaded. If the usage of libloading is encapsulated by a safe API, this can easily lead to confusing segfaults without any obvious cause.

In light of this, I think it might be a good idea for libloading to change its default behavior to not unload shared libraries on Linux when a Library is dropped. While this is obviously a hack, the only alternative is for every single consumer of the crate to add in its own special handling for Linux. Even if glibc released a fix tomorrow, older versions of glibc will still be in use for years. Since glibc is almost always dynamically linked, individual crate authors are generally stuck with whatever glibc is in use on the machine the crate ends up running on.

Concretely, I think that Library's Drop implementation should not unload libraries on Linux, unless the user explicitly requests it (e.g. via a new method library.unload_on_drop()). While this would have side effects (increased memory usage, and skipping of __attribute__((destructor)) functions), I think preventing crashes by default is more important.
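
A minimal sketch of what such an opt-in could look like, using a hypothetical stand-in type rather than libloading's actual `Library` (the `unload_on_drop` name comes from the proposal above; everything else is an assumption, and a shared flag stands in for the effect of dlclose):

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical wrapper modelling the proposal; not libloading's real API.
struct Library {
    unload_on_drop: bool,
    /// Stands in for the effect of calling dlclose on the real handle.
    unloaded: Rc<Cell<bool>>,
}

impl Library {
    /// By default, dropping the value leaves the library loaded.
    fn new(unloaded: Rc<Cell<bool>>) -> Self {
        Library { unload_on_drop: false, unloaded }
    }

    /// Opts in to unloading when the value is dropped.
    fn unload_on_drop(mut self) -> Self {
        self.unload_on_drop = true;
        self
    }
}

impl Drop for Library {
    fn drop(&mut self) {
        if self.unload_on_drop {
            self.unloaded.set(true);
        }
    }
}

/// Creates and drops a `Library`, reporting whether it was "unloaded".
fn was_unloaded(opt_in: bool) -> bool {
    let flag = Rc::new(Cell::new(false));
    let lib = Library::new(Rc::clone(&flag));
    let lib = if opt_in { lib.unload_on_drop() } else { lib };
    drop(lib);
    flag.get()
}
```

With this shape, hot-reloading code would call `unload_on_drop()` explicitly, while the default path keeps the library mapped and so avoids the dangling-destructor crash.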

@nagisa What are your thoughts?

nagisa · 2021-02-21T20:14:50Z

I… don't know. Not unloading the libraries will break a fairly common use-case of hot reloading, as glibc's loader will not attempt to (re-)load a library that shares the same name with an already loaded DSO, even if the file contents have changed.

With a recent update of libloading, Library::new and the OS-specific counterparts already became unsafe for fairly similar reasons, so I'm kind of inclined to just consider library loading (and unloading) extremely unsafe and try to document all of the cases people need to consider when they load libraries. And leave it up to them to figure out what makes most sense for their situations.

Aaron1011 · 2021-02-21T20:33:25Z

> Not unloading the libraries will break a fairly common use-case of hot reloading

I may be wrong, but I would think that hot reloading would be less common than other use cases (e.g. opening a typical shared library and calling a function). If that's true, then I think it would make more sense for the (less-typical) hot-reloading code to opt into automatically unloading, instead of (more-typical) code needing to opt out of automatically unloading.

Looking at the issues that are linked here, it looks like several different projects have run into this. I think the fact that the issue occurs in the Drop impl makes this particularly subtle - while unsafe code authors should be aware of the need to audit surrounding safe code, I think it's particularly easy to forget to consider what happens when a (fully initialized) struct is dropped, sometimes behind multiple layers of indirection.

nagisa · 2021-02-21T21:59:14Z

I know that macOS does employ this technique – they don't unload the machine code for libraries that contain thread-local variables, but they are both in a position to selectively apply this, and they are able to make sure this approach of theirs does not end up having unfortunate side effects.

We're not in that kind of position I fear. If we end up doing this, I would be inclined to adjust the library so that we don't implicitly implement Drop at all – we do have close after all. This would help somewhat with issues like #46, which is effectively the same issue in a different context, too.

(… says @nagisa while looking at "Safer bindings around system dynamic library loading primitives" and wondering if they should remove the first word from that sentence...)

1236: Update gfx to 0a201d1c406b5119ec11068293a40e50ec0be4c8 r=kvark a=Aaron1011

Connections: wgpu issue: #246; GFX PR: gfx-rs/gfx#3653; underlying libloading issue: nagisa/rust_libloading#41

Description: Pulls in gfx-rs/gfx#3653, which fixes a segfault when using wgpu from a non-main thread.

Testing: The example in #246 should run successfully. I'm not certain how to add an integration test to the repository.

Co-authored-by: Aaron Hill <[email protected]>

nagisa mentioned this issue Jul 8, 2018

Unloading Rust-made SOs can lead to segfaults in host program rust-lang/rust#52138

Closed

jackcmay mentioned this issue Sep 23, 2018

Thread safety issue when loading dynamic modules on Linux solana-labs/solana#1314

Closed

kurtlawrence mentioned this issue Mar 14, 2019

Linux: out# results hidden after call to println!() kurtlawrence/papyrus#16

Closed

kurtlawrence mentioned this issue Dec 18, 2019

Keep loaded libraries in memory kurtlawrence/papyrus#44

Closed

nagisa added bug duplicate not-our-bug labels Aug 22, 2020

ChrisRx mentioned this issue Feb 18, 2021

Running tests on linux intermittently causes a SIGSEGV wasmCloud/wasmCloud#98

Closed

Aaron1011 mentioned this issue Feb 20, 2021

Running wgpu in a secondary thread segfaults when main thread ends. gfx-rs/wgpu#246

Closed

timothee-haudebourg mentioned this issue Feb 24, 2021

Use RTLD_NODELETE for loading the library timothee-haudebourg/khronos-egl#14

Closed

Aaron1011 mentioned this issue Feb 25, 2021

Update gfx to 0a201d1c406b5119ec11068293a40e50ec0be4c8 gfx-rs/wgpu#1236

Merged

ionut-arm mentioned this issue Dec 7, 2021

pkcs11.open_session_no_callback against Luna Network HSM crashed with SIGSEGV parallaxsecond/rust-cryptoki#72

Closed

follower mentioned this issue Dec 16, 2021

Segmentation fault when thread using dynamically loaded Rust library exits rust-lang/rust#91979

Open

aseyboldt mentioned this issue Apr 27, 2023

Deadlock on Windows when unloading libraries in several threads roualdes/bridgestan#111

Open
