Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure VM is running in signal handler #200

Merged
merged 1 commit into from
Mar 20, 2023

Conversation

ianks
Copy link
Contributor

@ianks ianks commented Mar 16, 2023

As of a few weeks ago, there's been an increase of sigabrt in prod for us. After hunting the issue down a bit with @peterzhu2118, we honed in on a case where stackprof is signaled and ruby_current_vm_ptr is null. This seems to be happening after SIGQUIT'ing a unicorn worker during rb_during_gc.

* thread #1, name = 'ruby', stop reason = signal SIGABRT
  * frame #0: 0x00007f10787f100b libc.so.6`raise + 203
    frame #1: 0x00007f10787d0859 libc.so.6`abort + 299
    frame #2: 0x0000562e2c372d4f ruby`rb_bug_for_fatal_signal.cold at error.c:776:5
    frame #3: 0x0000562e2c4d56cd ruby`sigsegv(sig=<unavailable>, info=<unavailable>, ctx=<unavailable>) at signal.c:964:5
    frame #4: 0x00007f10789cf420 libpthread.so.0`__restore_rt
    frame #5: 0x0000562e2c396b0b ruby`rb_during_gc at gc.c:11006:12
    frame #6: 0x00007f1070f45391 stackprof.so`stackprof_signal_handler at stackprof.c:745:34
    frame #7: 0x00007f1070f45382 stackprof.so`stackprof_signal_handler(sig=<unavailable>, sinfo=<unavailable>, ucontext=<unavailable>) at stackprof.c:720
    frame #8: 0x00007f10789cf420 libpthread.so.0`__restore_rt
    frame #9: 0x00007f1066be70f4 libselinux.so.1`___lldb_unnamed_symbol8$$libselinux.so.1 + 36
    frame #10: 0x00007f10790daf6b ld-linux-x86-64.so.2`___lldb_unnamed_symbol59$$ld-linux-x86-64.so.2 + 523
    frame #11: 0x00007f10787f48a7 libc.so.6`___lldb_unnamed_symbol191$$libc.so.6 + 247
    frame #12: 0x00007f10787f4a60 libc.so.6`exit + 32
    frame #13: 0x00007f10787d208a libc.so.6`__libc_start_main + 250
    frame #14: 0x0000562e2c373fde ruby`_start + 46

Since signal safety is hard to reason about, let's add a sanity check and not assume the Ruby VM is active just because the handler is called.

@ianks ianks force-pushed the ruby-signal-safety branch from 8ebeff1 to b9e762e Compare March 16, 2023 05:15
@ianks
Copy link
Contributor Author

ianks commented Mar 16, 2023

Since ruby_current_vm_ptr is not a public symbol, I had to declare it with as an extern. I think the symbol name has changed in various ruby versions, so it would be best to use stable API to do this check.

Is there any other public API that I can use instead to ensure the VM is running? Otherwise, I’ll plunge forward with detecting the various names of this symbol across Ruby versions.

@ivoanjo
Copy link
Contributor

ivoanjo commented Mar 16, 2023

Perhaps adding an at_exit { StackProf.stop } would be a possible solution? I've ran into this in #157 as well.

@peterzhu2118
Copy link
Contributor

On Ruby 2.4 and before, it should be called ruby_current_vm rather than ruby_current_vm_ptr. But at any rate, we shouldn't rely on ruby_current_vm_ptr being available for extensions because it isn't marked as public (it isn't wrapped within a RUBY_SYMBOL_EXPORT_BEGIN/RUBY_SYMBOL_EXPORT_END region). AFAIK there isn't a public API to determine if the Ruby VM is available.

@ianks
Copy link
Contributor Author

ianks commented Mar 16, 2023

Perhaps adding an at_exit { StackProf.stop } would be a possible solution? I've ran into this in #157 as well.

That may help, and is probably a good change in its own right. I would still like to add this guard though since it improves the async signal safety of the handler so it's tricky to reason about.

@tenderlove
Copy link
Collaborator

Perhaps adding an at_exit { StackProf.stop } would be a possible solution? I've ran into this in #157 as well.

I think stackprof should probably do this by default. Either that or its callees should check for existence of the VM.

We could check for the VM in stackprof, but that seems kind of fraught as the global is private which means it can be renamed at any time and the compiler is free to do whatever it wants with the symbol name.

@ianks
Copy link
Contributor Author

ianks commented Mar 16, 2023

We could check for the VM in stackprof, but that seems kind of fraught as the global is private which means it can be renamed at any time and the compiler is free to do whatever it wants with the symbol name.

Yeah I have similar feelings, and very open to other ways of solving this. Maybe a null check in rb_during_gc would be better?

@peterzhu2118
Copy link
Contributor

Maybe a null check in rb_during_gc would be better?

No I don't think that will solve the underlying problem since the code in stackprof that is crashing will then instead crash a few lines down in places like rb_postponed_job_register_one if the VM doesn't exist.

if (!_stackprof.ignore_gc && rb_during_gc()) {
VALUE mode = rb_gc_latest_gc_info(sym_state);
if (mode == sym_marking) {
_stackprof.unrecorded_gc_marking_samples++;
} else if (mode == sym_sweeping) {
_stackprof.unrecorded_gc_sweeping_samples++;
}
_stackprof.unrecorded_gc_samples++;
rb_postponed_job_register_one(0, stackprof_job_record_gc, (void*)0);
} else {
if (stackprof_use_postponed_job) {
rb_postponed_job_register_one(0, stackprof_job_sample_and_record, (void*)0);
} else {
// Buffer a sample immediately, if an existing sample exists this will
// return immediately
stackprof_buffer_sample();
// Enqueue a job to record the sample
rb_postponed_job_register_one(0, stackprof_job_record_buffer, (void*)0);
}
}

@ianks
Copy link
Contributor Author

ianks commented Mar 17, 2023

Maybe a null check in rb_during_gc would be better?

No I don't think that will solve the underlying problem since the code in stackprof that is crashing will then instead crash a few lines down in places like rb_postponed_job_register_one if the VM doesn't exist.

if (!_stackprof.ignore_gc && rb_during_gc()) {
VALUE mode = rb_gc_latest_gc_info(sym_state);
if (mode == sym_marking) {
_stackprof.unrecorded_gc_marking_samples++;
} else if (mode == sym_sweeping) {
_stackprof.unrecorded_gc_sweeping_samples++;
}
_stackprof.unrecorded_gc_samples++;
rb_postponed_job_register_one(0, stackprof_job_record_gc, (void*)0);
} else {
if (stackprof_use_postponed_job) {
rb_postponed_job_register_one(0, stackprof_job_sample_and_record, (void*)0);
} else {
// Buffer a sample immediately, if an existing sample exists this will
// return immediately
stackprof_buffer_sample();
// Enqueue a job to record the sample
rb_postponed_job_register_one(0, stackprof_job_record_buffer, (void*)0);
}
}

I think you are right.

A good stop gap may be to register a new at_exit proc with our own variable (bool ruby_vm_exited) that we can test against, rather than using the VM*

@ianks ianks force-pushed the ruby-signal-safety branch from b9e762e to 455d76a Compare March 20, 2023 15:59
@ianks ianks changed the title Guard ruby_current_vm_ptr is non-null in signal handler Ensure VM is running in signal handler Mar 20, 2023
@ianks
Copy link
Contributor Author

ianks commented Mar 20, 2023

@tenderlove @peterzhu2118 Thoughts on my updates?

@tenderlove
Copy link
Collaborator

@ianks lgtm! Thanks for the patch!

@ianks
Copy link
Contributor Author

ianks commented Mar 20, 2023

FWIW, I think the failing test may be flaky? I noticed it failing on master as well locally (on 3.2).

@tenderlove
Copy link
Collaborator

FWIW, I think the failing test may be flaky? I noticed it failing on master as well locally (on 3.2).

Ya, it's flaky. I'll merge this and ship it. I think we're trying to measure some GC stuff in the tests and it's not predictable.

@tenderlove tenderlove merged commit eb1db9f into tmm1:master Mar 20, 2023
@ianks ianks deleted the ruby-signal-safety branch March 21, 2023 14:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants