Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

signal handling: User-defined interrupt handlers #49541

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

jpsamaroo
Copy link
Member

@jpsamaroo jpsamaroo commented Apr 27, 2023

Interrupt handling is a tricky problem, not just in terms of implementation, but in terms of desired behavior: when an interrupt is received, which code should handle it? Julia's current answer to this is effectively to throw an InterruptException to the first task to hit a safepoint. While this seems sensible (the code that's running gets interrupted), it only really works for very basic numerical code. In the case that multiple tasks are running concurrently, or when try-catch handlers are registered, this system breaks down, and results in unpredictable behavior. This unpredictable behavior includes:

  • Interrupting background/runtime tasks which don't want to be interrupted, as they do little bits of important work (and are critical to library runtime functionality)
  • Interrupting only one task, when multiple coordinating tasks would want to receive the interrupt to safely terminate a computation
  • Interrupting only one library's task, when multiple libraries really would want to be notified about the interrupt

The above behavior makes it nearly impossible to provide reliable Ctrl-C behavior, and results in very confused users who get stuck hitting Ctrl-C continuously, sometimes getting caught in a hang, sometimes triggering unrelated exception handling code they didn't mean to, sometimes getting a segfault, and very rarely getting the behavior they desire (with unpredictable safety of being able to continue using the active session as intended).

This commit provides an alternative behavior for interrupts which is more predictable: user code may now register tasks as "interrupt handlers", which will be guaranteed to receive an InterruptException whenever the session receives an interrupt signal. Additionally, when any interrupt handlers are registered, no other tasks will receive InterruptExceptions; only the handlers may receive them.

This behavior allows one or more libraries to register handler tasks which will all be concurrently awoken to handle each interrupt and do whatever is necessary to safely interrupt any running code; the extent to which other tasks are interrupted is library-defined. For example, GPU libraries like AMDGPU.jl can register a handler to safely interrupt GPU kernels running on all GPU queues and do resource cleanup. Concurrently, a complex runtime like the scheduler in Dagger.jl can register a handler to interrupt running tasks on other workers when possible.

Todo:

  • Consider whether to allow force mode to still operate with handlers registered Not doing this for now
  • Add @interrupthandler API (suggested by @Seelengrab ) Punting as we don't have a way to register the handler during __init__
  • Add a "kill foreground task" menu option
  • Interrupt the foreground or root task on multi-Ctrl-C in non-REPL mode
  • Address missing logic in jl_task_get_next
  • Add tests
  • Add docs

Fixes #34184
Fixes #19222

@jpsamaroo jpsamaroo added multithreading Base.Threads and related functionality needs tests Unit tests are required for this change needs docs Documentation for this change is required labels Apr 27, 2023
@Seelengrab
Copy link
Contributor

For @interrupthandler - the idea was to have a macro register a function as an interrupt handler directly, on a per-module level. So something like

module Foo

@interrupthandler [SIGINT] function bar()
     # handle SIGINT
end

end

would register Task(bar) as a handler for InterruptException for module Foo. The API leaves room for expansion to other signals, should we want that in the future.

@jpsamaroo jpsamaroo force-pushed the jps/interrupt-handlers branch from cb46a12 to ef6a037 Compare April 27, 2023 16:03
src/threading.h Outdated Show resolved Hide resolved
src/signals-unix.c Outdated Show resolved Hide resolved
base/task.jl Outdated Show resolved Hide resolved
@jpsamaroo jpsamaroo force-pushed the jps/interrupt-handlers branch from 8a9b460 to c969a35 Compare May 5, 2023 18:39
@jpsamaroo jpsamaroo marked this pull request as ready for review May 5, 2023 18:40
@jpsamaroo jpsamaroo removed needs tests Unit tests are required for this change needs docs Documentation for this change is required labels May 5, 2023
src/jl_exported_data.inc Outdated Show resolved Hide resolved
src/jl_exported_funcs.inc Outdated Show resolved Hide resolved
src/signals-unix.c Outdated Show resolved Hide resolved
@jpsamaroo
Copy link
Member Author

jpsamaroo commented May 6, 2023

To make this feature more useful, I've put together a small package, InterruptHandling.jl. This package registers an interrupt handler which, on interrupt, shows a TerminalMenus prompt with a list of libraries/modules to interrupt; it additionally provides the ability for libraries to register their own handlers with InterruptHandling so that they can be selectively interrupted from the prompt.

It's my hope that we can standardize on this package or another like it for a number of reasons:

  • It provides an easy interface for users to interrupt only the libraries which are hanging, misbehaving, etc.
  • It makes it possible to re-enable the legacy interrupt behavior by just unregistering a single handler (as only one interrupt handler is actually registered with the runtime)
  • It can (eventually) provide a flexible interface to choosing different kinds of interrupt behavior, such as printing debugging information, interrupting specific misbehaving tasks, interfacing to a debugger, etc.
  • It can (eventually) provide alternative user interfaces without having to modify the libraries that use it

I would love to know what people think about this approach! 😄

EDIT: This functionality is now built-in to this PR when running in the REPL

@Seelengrab
Copy link
Contributor

I wanted to test how this behaves on multiple interrupts, one after another, but the package does not appear to work on my end:

julia> using InterruptHandling

julia> Base.register_interrupt_handler # I'm on your branch
register_interrupt_handler (generic function with 1 method) 

julia> InterruptHandling.start_root_handler()

julia> ^C

julia> ^C

julia> ^C

julia> ^C

@jpsamaroo
Copy link
Member Author

@Seelengrab sorry, I really need to add some docs; you use InterruptHandling.register!(mod::Module, handler::Task) to register handler, which does not add it to Julia's runtime handler set. Instead, it's added only to InterruptHandling's internal handler set, which will be invoked based on the responses to the menu prompt.

@jpsamaroo
Copy link
Member Author

jpsamaroo commented May 6, 2023

Also, after thinking on this, I realized that we'd probably need InterruptHandling.jl to become part of Base, because otherwise Base and stdlibs wouldn't be able to use it (which I think is important). It only depends on TerminalMenus, although we could also provide a non-TerminalMenus fallback if we don't want a hard dependency (in the event we want to move TerminalMenus out of the sysimage).

EDIT: Implemented in latest push for REPL sessions, with an interrupt-all behavior for non-REPL sessions

@jpsamaroo
Copy link
Member Author

Another suggested feature for discussion: an @interruptible macro which, when applied to a function definition or block of code, registers the current task with a handler that will throw an InterruptException to the task via throwto or some equivalent approach (and the task will be unregistered after exiting the @interruptible block). I'd also want to be able to pass a descriptive String to the macro so that InterruptHandling would be able to show that to identify the code to be interrupted.

@jpsamaroo jpsamaroo changed the title signal handling: User-defined interrupt handlers [RFC] signal handling: User-defined interrupt handlers May 6, 2023
Copy link
Member

@vtjnash vtjnash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think people might care about force-deliver-sigint remaining available, particularly since this usually delays handling much more?

src/task.c Outdated Show resolved Hide resolved
@Seelengrab
Copy link
Contributor

Some issues/PRs I stumbled across while looking for something different, that seem related/solved by this:

@ViralBShah
Copy link
Member

@vtjnash What does it take to get this done?

@jpsamaroo
Copy link
Member Author

I have the integration of the TerminalMenus-based handler working fine locally, but I still need to:

  • Document the new menu-driven behavior
  • Fix force-interrupt
  • Address the current review comment

And then I think this will be good-to-go, if people are OK with it.

@jpsamaroo jpsamaroo force-pushed the jps/interrupt-handlers branch from c969a35 to e76c157 Compare August 2, 2023 20:26
@jpsamaroo
Copy link
Member Author

The latest push adds the fancy TerminalMenus-based interrupt handler for REPL sessions, and moves multi-handler registration into Base (instead of doing this in the runtime). We also now have interrupt-all-handlers behavior when not running the REPL, to which I am planning to extend with something like a force-interrupt on rapid Ctrl-C in succession (like our current force behavior, but quicker to trigger).

There's just a few things to address (see the Todo list in the first comment in this PR), but I think this is just about ready to go!

@jpsamaroo jpsamaroo force-pushed the jps/interrupt-handlers branch from e76c157 to aa243ed Compare August 2, 2023 20:44

const INTERRUPT_HANDLERS_LOCK = Threads.ReentrantLock()
const INTERRUPT_HANDLERS = Dict{Module,Vector{Task}}()
const INTERRUPT_HANDLER_RUNNING = Threads.Atomic{Bool}(false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't these atomics deprecated?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would you use instead? The @atomics interface doesn't provide an alternative for a single but entirely atomically accessed resource, as far as I'm aware.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can turn it into a struct, but there's only ever one instance of it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should make it just a global, but currently those don't support custom atomic annotations (they are always just release/consume)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "clean" solution would be to have a struct with @atomic fields, stored in a global Ref. The constructor would check if the Ref is already assigned and refuse being instantiated otherwise.

@jpsamaroo jpsamaroo changed the title [RFC] signal handling: User-defined interrupt handlers signal handling: User-defined interrupt handlers Aug 3, 2023
@jpsamaroo jpsamaroo force-pushed the jps/interrupt-handlers branch 2 times, most recently from 4013d97 to 2a087a7 Compare August 4, 2023 01:53
@jpsamaroo
Copy link
Member Author

Would people with access to Windows and Mac systems mind helping figure out what's wrong with those sides of the implementation? I have it working quite nicely locally on Linux, but I don't know how best to debug the hangs in CI in other platforms.

Interrupt handling is a tricky problem, not just in terms of
implementation, but in terms of desired behavior: when an interrupt is
received, which code should handle it? Julia's current answer to this is
effectively to throw an `InterruptException` to the first task to hit a
safepoint. While this seems sensible (the code that's running gets
interrupted), it only really works for very basic numerical code. In the case that
multiple tasks are running concurrently, or when try-catch handlers are
registered, this system breaks down, and results in unpredictable
behavior. This unpredictable behavior includes:

- Interrupting background/runtime tasks which don't want to be
  interrupted, as they do little bits of important work (and are
  critical to library runtime functionality)
- Interrupting only one task, when multiple coordinating tasks would
  want to receive the interrupt to safely terminate a computation
- Interrupting only one library's task, when multiple libraries really
  would want to be notified about the interrupt

The above behavior makes it nearly impossible to provide reliable Ctrl-C
behavior, and results in very confused users who get stuck hitting
Ctrl-C continuously, sometimes getting caught in a hang, sometimes
triggering unrelated exception handling code they didn't mean to,
sometimes getting a segfault, and very rarely getting the behavior they
desire (with unpredictable safety of being able to continue using the
active session as intended).

This commit provides an alternative behavior for interrupts which is
more predictable: user code may now register tasks as "interrupt
handlers" (via `Base.register_interrupt_handler`), which will be
guaranteed to receive an `InterruptException` whenever the session
receives an interrupt signal. Additionally, unlike the previous
behavior, no other tasks will receive `InterruptException`s; only
explicitly registered handlers may receive them.

This behavior allows one or more libraries to register handler tasks which
will all be concurrently awoken to handle each interrupt and do whatever
is necessary to safely interrupt any running code; the extent to which
other tasks are interrupted is arbitrary and library-defined. For
example, GPU libraries like AMDGPU.jl can register a handler to safely
interrupt GPU kernels running on all GPU queues and do resource cleanup.
Concurrently, a complex runtime like the scheduler in Dagger.jl can
register a handler to interrupt running tasks on other workers when
possible.

This commit also adds a more convenient interface for when the REPL is
running. When a Ctrl-C is received and the user is not at the REPL
prompt, a TerminalMenus-powered prompt will be shown, where the user
will have a variety of possible actions, including:
- Ignore the interrupt (do nothing)
- Activate all module's interrupt handlers
- Activate a specific module's interrupt handlers
- Disable the interrupt handler (reverting to the old Ctrl-C behavior)
- Exit Julia gracefully (with `exit()`)
- Exit Julia forcefully (with a `ccall` to `abort`)
@ron-wolf
Copy link
Contributor

Thanks so much for your work here @jpsamaroo! I was sent here from fonsp/Pluto.jl#452, and I took an interest in the progress of this pull request. I hope you don't mind my sharing this, but I just made a list of the commits on this branch to help myself track the history of this patch's development. This was called for because of the liberal use of rebases and force pushes on the branch. Hopefully this is useful to you and/or to anyone else who's interested in following the development of this feature. Once again, thank you so much for your hard work; I'm eager to see this merged into master!

@stemann
Copy link

stemann commented Oct 31, 2023

On a related note, likely apropos #6283, I have had a need for cancellable take!/put! on channels - in the context of a Dagger-like architecture with tasks typically awaiting values from channels.

In some cases, these channels represent input from external systems (hardware), so a task (A) might be blocked on take! from a channel indefinitely (e.g., if no user activates the hardware button) - until e.g. another task (B) of the system decides it is time to shut down - in which case, it would be nice to be able to cancel the blocked task (A) to get graceful shutdown.

I was expecting https://github.com/davidanthoff/CancellationTokens.jl to be part of the solution (cf. davidanthoff/CancellationTokens.jl#14): It might be an idea to also peek at the .NET cancellation framework.

@jpsamaroo
Copy link
Member Author

On a related note, likely apropos #6283, I have had a need for cancellable take!/put! on channels - in the context of a Dagger-like architecture with tasks typically awaiting values from channels.

In some cases, these channels represent input from external systems (hardware), so a task (A) might be blocked on take! from a channel indefinitely (e.g., if no user activates the hardware button) - until e.g. another task (B) of the system decides it is time to shut down - in which case, it would be nice to be able to cancel the blocked task (A) to get graceful shutdown.

I was expecting https://github.com/davidanthoff/CancellationTokens.jl to be part of the solution (cf. davidanthoff/CancellationTokens.jl#14): It might be an idea to also peek at the .NET cancellation framework.

This PR is the first necessary step towards robust cancellation - without a safe, predictable interrupt mechanism, there is no way to reliably chain interrupts to cancellation behaviors. Any attempt to implement "good" cancellation support in the language will require this PR, or something like it, to build upon.

@stemann
Copy link

stemann commented Oct 31, 2023

This PR is the first necessary step towards robust cancellation - without a safe, predictable interrupt mechanism, there is no way to reliably chain interrupts to cancellation behaviors. Any attempt to implement "good" cancellation support in the language will require this PR, or something like it, to build upon.

Thanks - looking forward to this landing! 🚀

I will probably be short on time, but let me know if further testing on macOS or Windows is needed.

@jpsamaroo
Copy link
Member Author

I will probably be short on time, but let me know if further testing on macOS or Windows is needed.

I would really appreciate both testing and fixes for those platforms - I don't have easy access to them, and don't understand how they handle signals. Implementing an equivalent jl_deliver_handled_sigint for each (and calling it at the right time) is probably all that's needed for this PR to be ready to go.

@stemann
Copy link

stemann commented Nov 4, 2023

OK - looking into the Windows-part by building on this branch and the work did by @jakubwro in jakubwro/interrupt-handlers (in jakubwro@265eff0).

Disclaimer: I have zero experience with implementing handling of Windows signals :-)

The signal handling for Windows is a bit confusing when comparing to the signal handling for Unix (and Mach):

  1. On Unix, it seems, restore_signals hooks up signal_listener (using a pthread), and jl_install_sigint_handler does nothing (it is empty, except for a TODO comment).
  2. On Windows, it seems,
    1. restore_signals removes the Windows Console Control handler sigint_handler, whereas jl_install_sigint_handler adds sigint_handler.
    2. sigint_handler only implements the Windows Console Control signal CTRL_C_EVENT (there's a couple more), and seems to translate it into the POSIX SIGINT signal - and handles this signal by exiting (with exit code 128 + SIGINT).
    3. jl_install_default_signal_handlers sets up POSIX signal handling with crt_sig_handler.
    4. crt_sig_handler handles SIGINT by exiting (with exit code 130 (128 + SIGINT)).

It sounds like Julia mainly relies on the POSIX signal handling on Windows, and then in some circumstances have the Windows Console Control handling enabled, e.g. on start-up before restore_signals is called?
Edit: It is the other way around: julia_init calls restore_signals first, then jl_install_default_signal_handlers and eventually calls jl_install_sigint_handler, cf. slightly outdated documentation of julia_init - i.e. Windows Console Control signals are disabled during init, but most of the time, it seems both types of signalling are enabled.

I guess those two signal handlers should then call "an equivalent jl_deliver_handled_sigint" instead of exiting?

Not sure how to implement jl_deliver_handled_sigint (I don't currently understand the Julia internals in this regard), but I guess something like the Unix version should work...

And perhaps sigint_handler should handle the remaining Windows Console Control signals?

Seemingly relevant information

There is a peculiarity in Windows Console Control Handler compared to the way POSIX handle CTRL+C (SIGINT) signal. In Windows, a new thread is created by Windows which invoke the registered control handler to process the signal, see: CTRL+C and CTRL+BREAK Signals. Contrary, in POSIX, the OS doesn't run the signal handler in a new thread.

Cf. http://darmawan-salihun.blogspot.com/2017/05/signal-handling-in-windows-console.html?m=1

SIGINT is not supported for any Win32 application. When a CTRL+C interrupt occurs, Win32 operating systems generate a new thread to specifically handle that interrupt. This can cause a single-thread application, such as one in UNIX, to become multithreaded and cause unexpected behavior.

Cf. https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/signal?view=msvc-170#remarks

@stemann
Copy link

stemann commented Nov 4, 2023

Pushed a WIP to stemann@0db38ecstemann/interrupt-handlers, i.e. stemann@7c37470.

Struggled a bit with just compiling (with msys2 provided mingw), but #51682 and workaround from #51740 seems to have helped.

@stemann
Copy link

stemann commented Nov 6, 2023

It might work...?

On Windows, using stemann@7c37470, running while true; yield(); end, and then hitting CTRL+C brings up the terminal menu:

  • “Interrupt all” seems to do nothing.
  • “Interrupt only…” displays an empty list of libraries to interrupt.
  • “Interrupt root task (REPL/script)” yields an InterruptException thrown (by the REPL yield())
  • “Exit Julia” exits with exit code 0.
  • “Force-exit Julia” exits with a SIGABRT in signals-win.c:108 - and exit code is 3

@vtjnash
Copy link
Member

vtjnash commented Nov 6, 2023

That analysis reminds me that packages have had the option of opting into safe signal handling with disable_sigint for years (only a few packages already do). It seems woefully underspecified (e.g. it is not clear that it is a task-local property) and there is no global mechanism or other better options. But would it be helpful to build on top of that here, so that new platform-specific code is not needed?

@jakubwro
Copy link

jakubwro commented Nov 7, 2023

That analysis reminds me that packages have had the option of opting into safe signal handling with disable_sigint for years (only a few packages already do). It seems woefully underspecified (e.g. it is not clear that it is a task-local property) and there is no global mechanism or other better options. But would it be helpful to build on top of that here, so that new platform-specific code is not needed?

This is really interesting, I understand idea here is to run all :interactive tasks inside disable_sigint, except some task that actually knows how to handle interrupt signal. This prototype works for me on 1.9.3 as I would expect. https://gist.github.com/jakubwro/47de35fdf4c875aef54cbae15faa5c11
Will try use this approach in some real life scenario.

@vtjnash
Copy link
Member

vtjnash commented Nov 7, 2023

Yes, roughly so. Though note that you should not be using Base.throwto in that way, as it introduces a data race that may crash the process, and if you are going through the effort of handling InterruptException it is somewhat pointless to throw away that work by re-introducing the issue you intended to fix.

@jakubwro
Copy link

jakubwro commented Nov 8, 2023

Yes, roughly so. Though note that you should not be using Base.throwto in that way, as it introduces a data race that may crash the process, and if you are going through the effort of handling InterruptException it is somewhat pointless to throw away that work by re-introducing the issue you intended to fix.

Right, I was able to reproduce this crash. My idea was to interrupt REPL when ran in interactive session, and do actual cleanup when ran as app without REPL. Still not sure how to do this REPL interrupt correctly.
I implemented similar approach in my package and it seems to work fine. Limitation is that I need to run with exactly one interactive thread and other tasks ran by user on interactive thread pool might interfere with this SIGINT handling logic when not wrapped in disable_sigint. Summarised it in this note: https://jakubwro.github.io/NATS.jl/dev/interrupt_handling/

@vtjnash
Copy link
Member

vtjnash commented Nov 9, 2023

Right. One of the main distinctions of this PR is that it effectively sets disable_sigint on all others tasks, except the ones that are given the task of handling the signal

@jakubwro
Copy link

jakubwro commented Nov 9, 2023

Right. One of the main distinctions of this PR is that it effectively sets disable_sigint on all others tasks, except the ones that are given the task of handling the signal

I did more experimenting with disable_sigint. With a little bit of piracy I have it working for my package regardless --threads configuration. It seems if threadpool is not set, and sticky is true (what @async macro sets) task it is always scheduled on thread 1, so no risk to skip interrupt signal.
https://gist.github.com/jakubwro/03418b75aa0fc5742e6a40a0fe97fcbc
I am pretty convinced now, that changes in the platform specific signal handling are not needed to make this PR mergeable.
I also implemented this pirated method on current Julia master branch, and locally on macOS only one test failed, not sure why, when tried to rerun this single test in REPL, it passed.
Failing test was:

@test_throws ErrorException begin

@davidizzle
Copy link

Issue flag: when using this updated version of Julia and after sending signal interrupts, REPL continuously spawns signal handler: PTLS defer sigint (src/signals-unix.c +399) prints, and at signal interrupt sometimes prints "Deferral not requested??? (src\signal-handling.c, +322)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
multithreading Base.Threads and related functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

fatal: error thrown and no exception handler available. Segfault with SIGINT
9 participants