Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementation leaks still there (2.0.0) #51

Open
Zinggi opened this issue Aug 21, 2020 · 6 comments
Open

Implementation leaks still there (2.0.0) #51

Zinggi opened this issue Aug 21, 2020 · 6 comments

Comments

@Zinggi
Copy link
Contributor

Zinggi commented Aug 21, 2020

Oh no, I made it worse 😱

It looks like my PR that was supposed to get rid of implementation leaks introduced some new ones.
I should have tested my changes thoroughly with our production code...

It looks like my GenServer that calls NodeJS.call(..) is sometimes getting back the messages sent by Port:

{:ok, [message: {#Port<0.2131>, {:data, {:eol, []}}}, module: NodeJS.Worker, name: #PID<0.1656.0>]}

Also, I'm still getting the timeout messages in my mailbox:

** (FunctionClauseError) no function clause matching in CallerGenServer.handle_info/2
    (foo 1.1.0) lib/foo/callerGenServer.ex:91: CallerGenServer.handle_info({#Reference<0.221168823.2908225537.225316>, {:error, :timeout}}, nil)
    (stdlib 3.13) gen_server.erl:680: :gen_server.try_dispatch/4
    (stdlib 3.13) gen_server.erl:756: :gen_server.handle_msg/6
    (stdlib 3.13) proc_lib.erl:226: :proc_lib.init_p_do_apply/3

I suspect some race condition with timeouts...

Also, we might need to add a wrapper GenServer that catches all these messages such that the caller is truly isolated from the worker process.


Unfortunately I wont have much time to investigate exactly when and how it happens, as I'm soon leaving for vacations.
Otherwise I would solve it myself and contribute the solution.
I feel obliged to fix it, but I won't have the time (earliest after ca. 4 weeks).

I'm terribly sorry about that.

@Zinggi Zinggi changed the title Implementation leaks still there Implementation leaks still there (2.0) Aug 21, 2020
@Zinggi Zinggi changed the title Implementation leaks still there (2.0) Implementation leaks still there (2.0.0) Aug 21, 2020
@oohnoitz
Copy link
Contributor

@Zinggi could you create a test repo that can easily reproduce this issue?

@Zinggi
Copy link
Contributor Author

Zinggi commented Aug 21, 2020

Yeah, that's the least I can do.
Hope I can make it reproducible...

@Zinggi
Copy link
Contributor Author

Zinggi commented Aug 21, 2020

@oohnoitz
Alright, I managed to reproduce the issue 😌 :
https://github.com/Zinggi/elixir-nodejs-implementation-leaks

Sample log output:

before call
result
{:error, "Call timed out."}
after call

17:54:37.361 [error] GenServer Leaks.GenServer terminating
** (FunctionClauseError) no function clause matching in Leaks.GenServer.handle_info/2
    (elixir_nodejs_leaks 0.1.0) lib/leaks/gen_server.ex:20: Leaks.GenServer.handle_info({#Reference<0.4098332406.3456892929.143995>, {:error, :timeout}}, nil)
    (stdlib 3.13) gen_server.erl:680: :gen_server.try_dispatch/4
    (stdlib 3.13) gen_server.erl:756: :gen_server.handle_msg/6
    (stdlib 3.13) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
Last message: {#Reference<0.4098332406.3456892929.143995>, {:error, :timeout}}
State: nil
before call
result
{:ok, 42}
after call

17:54:38.575 [error] [message: {#Port<0.8>, {:data, {:eol, []}}}, module: NodeJS.Worker, name: #PID<0.186.0>]

17:54:38.575 [error] [message: {#Port<0.8>, {:data, {:eol, '__elixirnodejs__UOSBsDUP6bp9IF5__[true,42]'}}}, module: NodeJS.Worker, name: #PID<0.186.0>]
before call
result
{:error, "Call timed out."}
after call

17:54:40.612 [error] GenServer Leaks.GenServer terminating
** (FunctionClauseError) no function clause matching in Leaks.GenServer.handle_info/2
    (elixir_nodejs_leaks 0.1.0) lib/leaks/gen_server.ex:20: Leaks.GenServer.handle_info({#Reference<0.4098332406.3456892932.142055>, {:error, :timeout}}, nil)
    (stdlib 3.13) gen_server.erl:680: :gen_server.try_dispatch/4
    (stdlib 3.13) gen_server.erl:756: :gen_server.handle_msg/6
    (stdlib 3.13) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
Last message: {#Reference<0.4098332406.3456892932.142055>, {:error, :timeout}}
State: nil
before call
result
{:ok, 42}
after call

17:54:41.817 [error] [message: {#Port<0.8>, {:data, {:eol, []}}}, module: NodeJS.Worker, name: #PID<0.186.0>]

17:54:41.817 [error] [message: {#Port<0.8>, {:data, {:eol, '__elixirnodejs__UOSBsDUP6bp9IF5__[true,42]'}}}, module: NodeJS.Worker, name: #PID<0.186.0>]
before call
result
{:error, "Call timed out."}


...

Hope it helps.

I might still be able to reply while on vacation, but I won't be able to do any coding

@oohnoitz
Copy link
Contributor

Had some time to do a bit more digging. It's definitely a timeout issue, with a few other things. Timeous are definitely awkward with the way GenServer works and still a bit confusing for me.

Even though we have a timeout specified for GenServer.call, it doesn't prevent the underlying process to continue running until there is an actual response. The port/process still remains running until there is a response, which means that we can be returning an error early, but we'll still send the response back at a much later time. This would explain why we get another message in the inbox with the resolved value from the GenServer.call we should have ignored.

https://hexdocs.pm/elixir/GenServer.html#call/3-timeouts has some details about the caller doing this. Though, we don't really want to have developers handle the discarding of that message in their code. Would need to mess with some other timeout values and combinations to be sure.

I have this working branch: 20f3c47 which does seem to solve the issue? But would need some more testing to make sure this is a good solution.

@Zinggi
Copy link
Contributor Author

Zinggi commented Aug 26, 2020

Thanks for looking into it.
I can't really test it out for some time, but that commit looks like it could work.

However, maybe we also need a timeout on the JS side, since without it our next calls to the same worker would be blocked without canceling the previous call. This could be done with Promise.race and a setTimeout

@Zinggi
Copy link
Contributor Author

Zinggi commented Sep 21, 2020

Hey there
Just wanted to let you know that I'm back from vacations and that I'll look into it again once I find time.
Hopefully sometimes this week 🤞

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants