Possible deadlock caught with strace (double FUTEX_WAIT_PRIVATE) #669
Comments
Correcting the source code branch: https://github.com/linea-it/pz-compute/blob/lephare/rail_scripts/rail-estimate#L114
I cannot reproduce this:
I tried also after installing
If there is more setup required, please give us all the instructions needed to reproduce the problem, with all the package versions and additional files.
Can you also give us a stack trace, using gdb, of the place where the hang happens?
Hello, if you have a conda installation available, you may run the
I'll attach the input and calibration files here so that you may execute the program until the end, but it doesn't really matter for triggering the deadlock, which happens before the main estimation process: objectTable_tract_5065_DC2_2_2i_runs_DP0_2_v23_0_1_PREOPS-905_step3_29_20220314T204515Z-part2.hdf5.gz. For the gdb stack trace, do you mean attaching to the process after it hangs?
For the lephare algorithm auxiliary data:
Yeah or run it under
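Attaching to the process after it hangs is fine; for reference, that looks like the following (the pid is just a placeholder):

```
# Attach gdb to the already-running (hung) process by pid
gdb -p <pid-of-rail-estimate>
```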
I still cannot reproduce. The function
but this happens after the point you mention, so I am not sure what can be wrong here. I'm going to need some reproducer in a Dockerfile or your gdb trace
I'm sorry, you're right, I've installed lephare from the master branch. I have asked for a new release, but I think it's not there yet. The original issue is here: lephare-photoz/lephare#178 (comment). So, confirming the exact lephare version used: 0.1.11.dev4+gff46ff8. Since your calls to
There are 3 threads in your program; can you print the stack for all 3 of them? You can switch from thread to thread using the “thread X” command and then type “bt” for every thread.
You can actually just
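For reference, once gdb is attached, the per-thread backtraces can be collected roughly like this:

```
(gdb) info threads         # list the threads and their ids
(gdb) thread 2             # switch to a specific thread
(gdb) bt                   # backtrace of the currently selected thread
(gdb) thread apply all bt  # or dump the backtraces of all threads at once
```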
I've tried reproducing this on a RHEL 7.9 machine, and wasn't able to, either. I get to the same point as @pablogsal reached:
@hdante FWIW, the strace log isn't very useful. strace shows only system calls, and the entire idea behind futexes - "fast userspace mutexes" - is that they typically don't need to make any system calls, and can get all the work they need to do done in userspace. Since there's no system call when there's no contention, we can't figure anything out about what locks are held by what threads at any given point in time from the strace output. I think we really do need a
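To illustrate that point, here is a minimal, hypothetical Python sketch (not taken from this issue): only the contended acquisition has to fall back to the kernel, so only that case can ever show up as a futex system call in an strace log.

```python
import threading
import time

lock = threading.Lock()

# Uncontended: the acquire/release pair can normally complete entirely on the
# userspace fast path, so it has nothing to show under `strace -f -e trace=futex`.
lock.acquire()
lock.release()

# Contended: the second thread tries to take a lock that is already held, so it
# must sleep in the kernel (a futex-style wait on Linux) until the owner
# releases it - and that wait is what strace can actually see.
lock.acquire()
waiter = threading.Thread(target=lock.acquire)
waiter.start()
time.sleep(0.1)  # give the worker time to block on the held lock
lock.release()   # wakes the waiting thread
waiter.join()
```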
Hello, everyone, here follows another gdb backtrace, this time with all 3 threads:
Hello, Matt, yes, I'm afraid you'll need lephare from the master branch for that to work.
The given strace log is useful because I hit the case where the fast userspace path was not taken, and the two futex system calls on separate threads guarantee that they are the cause of the deadlock. Posting only the 2 relevant lines:
I've posted the backtrace for the 3 threads; more library calls are visible now. Meanwhile, I'll try to find the line in the code where the deadlock happens. If the log is not enough, I can make the docker image.
I don't think so - I'm pretty sure only one of those two threads is involved in causing the deadlock (I think these two futex calls correspond to the main thread and the
This was enough to figure out what's going on, though.
It's almost certainly related to the changes we made in #525 - if that theory is right, I believe the problem will go away if you downgrade memray to 1.11.0 - can you try that for me and confirm if it fixes the issue for you? And, if it doesn't, please grab a
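For example, the downgrade can be done with pip (assuming a pip-managed environment rather than a conda package):

```
python -m pip install "memray==1.11.0"
```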
Ok, so to make sure I understood, the proposed explanation is that there are 2 futexes being simultaneously grabbed in reverse order by 2 threads? I'll check with the older version and return. Which version did you use when the program worked?
The program has been working for Pablo and I with the latest version of Memray. There must be something else that's different about your environment than ours - maybe glibc version, or something? - that's stopping us from reproducing it. I was optimistic that using glibc 2.17 would do the trick, but I wasn't able to reproduce it even with that, so there's still some variable we haven't identified. It may even be a plain old race condition, and you're just consistently managing to nail the right timing and we're not...
Yep, it's a lock ordering deadlock - one thread is grabbing a lock inside glibc and then trying to acquire a lock inside memray, and the other thread is grabbing the lock inside memray and then trying to acquire the lock inside glibc.
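For illustration, here is a minimal Python sketch of that class of bug (hypothetical code, not memray's actual internals): each thread takes the same two locks in the opposite order, so each can end up holding one lock while waiting forever for the other.

```python
import threading

loader_lock = threading.Lock()  # stands in for the lock inside glibc's loader
memray_lock = threading.Lock()  # stands in for memray's internal lock

def thread_one():
    with loader_lock:      # grab the "glibc" lock first...
        with memray_lock:  # ...then wait for the "memray" lock
            pass

def thread_two():
    with memray_lock:      # grab the "memray" lock first...
        with loader_lock:  # ...then wait for the "glibc" lock
            pass

t1 = threading.Thread(target=thread_one)
t2 = threading.Thread(target=thread_two)
t1.start()
t2.start()
t1.join()  # may never return: a classic ABBA lock-ordering deadlock
t2.join()
```

Whether a given run actually hangs depends on timing, which would also fit the earlier guess that this may be a race that only some environments hit consistently.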
Yes, memray 1.11.0 works: the function finishes, lephare starts estimation, no deadlocks.
OK, thanks for the confirmation. In that case, you've got a workaround, and we'll figure out how to fix this going forward. Most likely we'll wind up reverting #525 and resurrecting #549 instead - we waffled between those two approaches, but it seems like the latter might be safer, albeit a bit hackier - it forces us to do part of the loader's job... Thanks for your patience in helping us track this down!
Thank you all for the help!
We've released a fix in Memray 1.14 - if you get a chance, @hdante, please make sure that version works for you!
Ok, thank you, I'll test it later this week.
Hello, Pablo, memray 1.14 is working fine with lephare without deadlocks, thank you very much!
Awesome! Thanks for opening the issue!
Is there an existing issue for this?
Current Behavior
Hello, I'm having a possible deadlock when executing memray, and it seems to have appeared in the strace output log. The relevant snippet is here:
The actual program running is rail-estimate, in function load_slow_modules(), from here: https://github.com/linea-it/pz-compute/blob/main/rail_scripts/rail-estimate#L114
That function recursively loads a large number of libraries; I can still find out more precisely the point where it happened, if necessary, and whether some other library could be causing the deadlock instead.
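For reference, a trace like the one attached below can be captured with an invocation along these lines (the exact rail-estimate arguments are specific to the pz-compute pipeline and are only a placeholder here):

```
# Follow child threads/processes (-f) and write all system calls to trace.log
strace -f -o trace.log memray run ./rail-estimate <args...>
```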
Expected Behavior
Without running memray, the load_slow_modules() function works normally and the Python modules are successfully loaded.
Steps To Reproduce
Memray Version
1.13.4
Python Version
3.11
Operating System
Linux
Anything else?
Full strace log attached:
trace.gz