Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

keeping node alive while workers are working #23092

Open
bakkot opened this issue Dec 6, 2024 · 23 comments
Open

keeping node alive while workers are working #23092

bakkot opened this issue Dec 6, 2024 · 23 comments

Comments

@bakkot
Copy link
Contributor

bakkot commented Dec 6, 2024

Version of emscripten/emsdk:

emcc (Emscripten gcc/clang-like replacement + linker emulating GNU ld) 3.1.73 (ac676d5e437525d15df5fd46bc2c208ec6d376a3)
clang version 20.0.0git (https:/github.com/llvm/llvm-project 1d810ece2b2c8fab77720493864257f0ea3336a9)

Failing command line in full:

emcc  --pre-js pre.js -pthread -s MODULARIZE=1 -s EXPORTED_FUNCTIONS="_spawn" main.cc && node run.mjs

I have a library which is intended to be driven from JS (i.e., no main loop in C, and using -s MODULARIZE=1).

I'm trying to spawn a thread, do some work, and then get the result back in the main thread. I have the following three bits of code, with the first two being the inputs to emcc and the third being the driver:

// pre.js
let spawnTimeout;
let resolve;

Module.async_sum = () => {
  _spawn();
  return new Promise((res) => {
    resolve = res;
  });
};
// main.cc
#include <thread>
#include <stdio.h>
#include <emscripten.h>

extern "C" void spawn() {
  printf("spawning\n");
  std::thread t([] {
    MAIN_THREAD_ASYNC_EM_ASM({
      clearTimeout(spawnTimeout);
    });

    printf("spawned\n");

    double v = 1.0;
    for (int i = 1; i < 1e8; ++i) {
      v += 1.0 / i;
    }
    printf("sum: %f\n", v);
    MAIN_THREAD_ASYNC_EM_ASM({
      resolve($0);
    }, v);
  });
  t.detach();
  EM_ASM({
    spawnTimeout = setTimeout(() => {}, 5000);
  });
}
// run.mjs
import { default as init } from './a.out.js';

let x = await init();

console.log(await x.async_sum());

Running node run.mjs prints

spawning
spawned
Warning: Detected unsettled top-level await at file:///Users/kevin/code/emscripten-thread-test/run.mjs:5
console.log(await x.async_sum());

and then exits, without printing the sum or giving me the result.

I was really hoping it would instead do the work and give me the result. That worked prior to #19073 (or to be more precise, it worked in version 3.1.15; I'm assuming that PR is the relevant change). I assume this is because node exits when the main thread has no tasks queued and the workers are all unref'd.

If I comment out the "clearTimeout" line it works, but then of course I need to wait for the timeout. I can move the clearTimeout later, but then if the worker crashes or is killed (as sometimes happens) it will again need to wait for the timeout.

Is there some way I can get it to stay alive? Maybe keep the workers .ref'd while they're actually doing work?

@kripken
Copy link
Member

kripken commented Dec 11, 2024

ccing people from that PR: @kleisauke @sbc100

@kleisauke
Copy link
Collaborator

Ah, this is indeed due to PR #19073. Calling the internal _emscripten_thread_set_strongref() function would prevent Node.js from exiting.

How about exposing this function in Emscripten's public API? That way, you could do:

--- a/main.cc
+++ b/main.cc
@@ -2,14 +2,11 @@
 #include <thread>
 #include <stdio.h>
 #include <emscripten.h>
+#include <emscripten/threading.h>
 
 extern "C" void spawn() {
     printf("spawning\n");
     std::thread t([] {
-      MAIN_THREAD_ASYNC_EM_ASM({
-        clearTimeout(spawnTimeout);
-      });
-
       printf("spawned\n");
 
       double v = 1.0;
@@ -21,8 +18,6 @@ extern "C" void spawn() {
         resolve($0);
       }, v);
     });
+    emscripten_thread_set_strongref(t.native_handle());
     t.detach();
-    EM_ASM({
-      spawnTimeout = setTimeout(() => {}, 5000);
-    });
 }

I just opened PR #23152 for this.

@bakkot
Copy link
Contributor Author

bakkot commented Dec 13, 2024

Nice! Is there an easy way to try the PR locally? I'd be happy to test it for my use case if so.

@kleisauke
Copy link
Collaborator

kleisauke commented Dec 13, 2024

@bakkot To test that PR with the tip-of-tree compilers, you could use:

$ cd emsdk/
$ ./emsdk install tot emscripten-main-64bit --override-repository emscripten-main-64bit@https://github.com/kleisauke/emscripten/tree/node-strong-ref-helper-public
$ ./emsdk activate tot emscripten-main-64bit
$ source ./emsdk_env.sh

# Verify commit
$ git -C emscripten/main rev-parse HEAD
cfe6160f56273902cf9e52f875ce4083d967c35a

# Invalidate cache
$ emcc --clear-cache

Alternatively, you can test that private function using:

--- a/main.cc
+++ b/main.cc
@@ -3,13 +3,11 @@
 #include <stdio.h>
 #include <emscripten.h>
 
+extern "C" void _emscripten_thread_set_strongref(pthread_t thread);
+
 extern "C" void spawn() {
     printf("spawning\n");
     std::thread t([] {
-      MAIN_THREAD_ASYNC_EM_ASM({
-        clearTimeout(spawnTimeout);
-      });
-
       printf("spawned\n");
 
       double v = 1.0;
@@ -21,8 +19,6 @@ extern "C" void spawn() {
         resolve($0);
       }, v);
     });
+    _emscripten_thread_set_strongref(t.native_handle());
     t.detach();
-    EM_ASM({
-      spawnTimeout = setTimeout(() => {}, 5000);
-    });
 }

However, this causes Node.js to hang indefinitely due the && PROXY_TO_PTHREAD guard here:

#if ENVIRONMENT_MAY_BE_NODE && PROXY_TO_PTHREAD

(i.e. it doesn't .unref() the worker when it's finished)

To fix this, you can temporarily remove that conditional in $EMSDK/upstream/emscripten/src/library_pthread.js (as done in the referenced PR).

@sbc100
Copy link
Collaborator

sbc100 commented Dec 13, 2024

I would hope we could make fix this without exposing a custom API like that, but maybe there is no other option.

Also I wonder if that API should be in the form a pthread_attr_xxx call? (Like we have for emscripten_pthread_attr_settransferredcanvases)?

@sbc100
Copy link
Collaborator

sbc100 commented Dec 13, 2024

Also, it makes me wonder what the corresponding technique would be in the native world. If you return from main with pthreads running does you app always exit? Is there any way to prevent that and keep it alive for async signals from its worker threads?

@sbc100
Copy link
Collaborator

sbc100 commented Dec 13, 2024

Also, can you explain a little more why don't want to use the emscripten_set_timeout / emscripten_clear_timeout approach?

It seems that you could store the result of emscripten_set_timeout globally and then call emscripten_clear_timeout after you call resolve? What is wrong with this approach?

You say "then if the worker crashes or is killed (as sometimes happens) it will again need to wait for the timeout." couldn't you clear the timeout in those cases? Also if there is a possiblity that the thread dies or is killed wouldn't you want a settimeout to fire at some point so you can report that error (e.g. "thread unresponsive!")

@sbc100
Copy link
Collaborator

sbc100 commented Dec 13, 2024

It seems fairly common to use setTimeout / setInterval to keep a node process alive longer than it normally would be: https://stackoverflow.com/questions/23622051/how-to-forcibly-keep-a-node-js-process-from-terminating

@bakkot
Copy link
Contributor Author

bakkot commented Dec 14, 2024

Also, can you explain a little more why don't want to use the emscripten_set_timeout / emscripten_clear_timeout approach?

It seems that you could store the result of emscripten_set_timeout globally and then call emscripten_clear_timeout after you call resolve? What is wrong with this approach?

Because then the timer won't get cleared if the thread dies before resolving. And in my application the timer needs to be extremely long, because the work done in the thread may take a very long time (hours).

You say "then if the worker crashes or is killed (as sometimes happens) it will again need to wait for the timeout." couldn't you clear the timeout in those cases?

If there's an easy way to do that, sure, that would work too. I didn't see one offhand.

Also if there is a possiblity that the thread dies or is killed wouldn't you want a settimeout to fire at some point so you can report that error (e.g. "thread unresponsive!")

I have a have a process.on('exit', handler) which prints an error when the main thread exits while something is waiting on work from a worker thread. A timeout isn't suitable for this because the timer would need to be unreasonably long.

@sbc100
Copy link
Collaborator

sbc100 commented Dec 14, 2024

Also, can you explain a little more why don't want to use the emscripten_set_timeout / emscripten_clear_timeout approach?
It seems that you could store the result of emscripten_set_timeout globally and then call emscripten_clear_timeout after you call resolve? What is wrong with this approach?

Because then the timer won't get cleared if the thread dies before resolving. And in my application the timer needs to be extremely long, because the work done in the thread may take a very long time (hours).

But that setInterval can go on a long a you want.. forever if you want.

How does your application detect the thread "dying"?

You say "then if the worker crashes or is killed (as sometimes happens) it will again need to wait for the timeout." couldn't you clear the timeout in those cases?

If there's an easy way to do that, sure, that would work too. I didn't see one offhand.

Also if there is a possiblity that the thread dies or is killed wouldn't you want a settimeout to fire at some point so you can report that error (e.g. "thread unresponsive!")

I have a have a process.on('exit', handler) which prints an error when the main thread exits while something is waiting on work from a worker thread. A timeout isn't suitable for this because the timer would need to be unreasonably long.

If you have a setInterval or setTimeout going on the main thread then the process.on('exit', handler) should not be called.

You would do something like this:

var intervalID = setInterval(() => {
  if (workIsDone) {
    clearInterval(intervalId);
  }
}, 100);

You would also write the same thing in C/C++ using emscripten_set_interval/emscripten_clear_interval.

@bakkot
Copy link
Contributor Author

bakkot commented Dec 14, 2024

But that setInterval can go on a long a you want.. forever if you want.

I'm fine with having a long timer. I just don't want to have a long timer if it's possible that no work is actually happening, which can happen if the worker thread exits without clearing the timer.

How does your application detect the thread "dying"?

The on('exit') handler, like I said.

If you have a setInterval or setTimeout going on the main thread then the process.on('exit', handler) should not be called.

... Right, which is part of why I'm trying to avoid a setInterval or setTimeout. If I have a timer going then the main thread will be kept alive. That is exactly what I want to not happen when the worker dies.


I am trying to find a solution that does not require me to have a long-running setInterval or setTimeout so that I do not run into the possibility of the main thread being kept alive while no work is happening. With the above PR, that works great: while the worker is actually doing work it is kept alive, and when it stops the main thread exits cleanly. If the worker dies then the main thread will still exit and will trigger my process.on('exit') handler so I know it crashed and can report the error. Trying to do the same thing with timers would require me to have some way of detecting the worker being killed other than the exit handler, and I don't know of a straightforward way to do that.

Also, timers are a bit of a hack here. Workers which are doing work you care about are supposed to be strongly ref'd in order to keep the main thread from exiting.

@sbc100
Copy link
Collaborator

sbc100 commented Dec 14, 2024

Also, timers are a bit of a hack here. Workers which are doing work you care about are supposed to be strongly ref'd in order to keep the main thread from exiting.

While this is true for node Workers its not true for pthreads and POSIX. When then main thread exits here the semantics are that it brings the whole application down, including any threads that might be doing work. So we have conflicting requirements.

Perhaps using -sWASM_WORKERS would be a way around this since they don't have this semantics and we don't use unref there.

@sbc100
Copy link
Collaborator

sbc100 commented Dec 14, 2024

The on('exit') handler, like I said.

I would hope we can find a better way to detect the thread dying. I assume we are talking about the pthread exiting of its own accord and the worker being returned to pool? Assuming you want stick with pthreads, one way to do would be call pthread_tryjoin_np in your setInterval. You would also need to avoid detaching the pthread.

@bakkot
Copy link
Contributor Author

bakkot commented Dec 14, 2024

While this is true for node Workers its not true for pthreads and POSIX. When then main thread exits here the semantics are that it brings the whole application down, including any threads that might be doing work. So we have conflicting requirements.

Sure, but the main thread in this case is in node, not POSIX; I'm just calling into emscripten's output as a library. And it isn't really intentionally exiting the main thread: the code on that thread is in the middle of an await. It's just that node kills the main thread when there are no tasks queued if you've unref'd all the workers, which you're not supposed to do if you actually care about the work they're doing.

Perhaps using -sWASM_WORKERS would be a way around this since they don't have this semantics and we don't use unref there.

I'm currently relying on MAIN_THREAD_ASYNC_EM_ASM to get values out of the worker, which the docs say can't be done with WASM_WORKERS, so I don't know if that would work. (Incidentally there's a cut-off sentence at the end of the linked section of the docs.)

pthread_tryjoin_np

It's awkward to retrofit that into the C++ threads I'm currently using, but I could see if I can make std::async work, I guess. Still, I'd much prefer to just mark the thread as being used while it's actually being used.

@sbc100
Copy link
Collaborator

sbc100 commented Dec 14, 2024

How about this solution:

We already have a mechanism for signally that the main thread is not actually dead and should the runtime should be kept going. We have emscripten_exit_with_live_runtime and emscripten_runtime_keepalive_push and emscripten_runtime_keepalive_pop.

How about if said: active pthreads are not enough to the runtime alive on their own, but if the runtime is otherwise being kept alive then they are?

In other words which should more accurately reflect that fact "then the main thread exits, pthreads should be killed".

Then all you would need is add emscripten_runtime_keepalive_push to the end your main function. No new API would be needed.

@bakkot
Copy link
Contributor Author

bakkot commented Dec 14, 2024

Then all you would need is add emscripten_runtime_keepalive_push to the end your main function. No new API would be needed.

What do you mean by "main function"? I'm writing a library, not an application. The library is intended to be consumed by JavaScript programmers. The interface for users of the library is that they call async functions exported by the library, and when those functions finish doing work the promise they returned settles with the result of the call just like any other async function, and this doesn't block the main thread. And also when their program finishes node exits.

I don't want to tell users of this library that they have to call some other function sometimes. Users ideally should not have to know that the library is using emscripten at all.

@sbc100
Copy link
Collaborator

sbc100 commented Dec 14, 2024

If you don't have a main function and you are writing a library then your runtime should stay alive until you explicitly call exit().

In this configuration I think all pthread should also keep the program alive until the call to exit().

@sbc100
Copy link
Collaborator

sbc100 commented Dec 14, 2024

In other words, I think all your pthreads should be strong referenced. I think we can figure out how to make that happen without exposing a new API.

@bakkot
Copy link
Contributor Author

bakkot commented Dec 14, 2024

In this configuration I think all pthread should also keep the program alive until the call to exit().

What call to exit()? There isn't a call to exit() anywhere.

@sbc100
Copy link
Collaborator

sbc100 commented Dec 14, 2024

In this configuration I think all pthread should also keep the program alive until the call to exit().

What call to exit()? There isn't a call to exit() anywhere.

In that case the runtime would never exit, yes.

@sbc100
Copy link
Collaborator

sbc100 commented Dec 14, 2024

In the library configuration I think its reasonable for pthreads to keep node alive.

@bakkot
Copy link
Contributor Author

bakkot commented Dec 14, 2024

Assuming you mean pthreads which are actively doing work and not just workers in the pool, that sounds good to me.

@sbc100
Copy link
Collaborator

sbc100 commented Dec 14, 2024

Yes, we are only talking about "active" threads here. Inactive workers in the pool should never keep anything alive.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants