potential for token.Wait() to hang indefinitely #613
I think I will switch to using token.WaitTimeout() instead of token.Wait().
We do need to take another look at …
Unfortunately, I have encountered this issue before. After I upgraded to v1.3.5, it appeared again the day before yesterday.
@MattBrittan, why doesn't token.Wait() redirect to token.WaitTimeout() with a default timeout?
Because that's not what historical users would expect; this is quite an old library and we really don't want to introduce changes that might break existing code. If you want a timeout then use token.WaitTimeout(). Please try …
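For reference, publishing with the timeout variant looks roughly like this (a minimal sketch; the broker URL, topic, payload, and five-second timeout are placeholder values):

```go
package main

import (
	"log"
	"time"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

func main() {
	opts := mqtt.NewClientOptions().AddBroker("tcp://broker.example.com:1883")
	client := mqtt.NewClient(opts)
	if t := client.Connect(); t.Wait() && t.Error() != nil {
		log.Fatal(t.Error())
	}

	token := client.Publish("example/topic", 1, false, "payload")
	// WaitTimeout returns false if the token has not completed within
	// the duration, so the caller regains control instead of blocking.
	if !token.WaitTimeout(5 * time.Second) {
		log.Println("publish did not complete within 5s")
	} else if token.Error() != nil {
		log.Println("publish failed:", token.Error())
	}
}
```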
@MattBrittan, thanks for your response. Will WaitTimeout() cause a performance issue when publishing messages at high concurrency?
@apolloyang2017 well, if the library were changed to do this by default, the performance impact would be the same.
I wonder if the unexpected behavior I'm seeing is related to this issue:
At this point, I would expect the client to send the message and …
@nkvoll - there have been a number of releases since this issue was raised. The last release, 1.4.2, fundamentally changes the way the connection status is managed, so it's really difficult to say if the two are related (as no logs were provided, I don't know if the connection was being lost). Can you please raise a new issue and provide your test code (it's useful to see exactly what options you are using, etc.) and I'll try to add a test for this circumstance (checking the token in this one might do the job).
Significant changes (including a revamp of status handling) have been made to the library since this issue was reported; I'd appreciate it if anyone still encountering the original issue could add a comment.
@MattBrittan We are seeing behavior similar to what @nkvoll was seeing back in 2022:
We believe this only happens when the device running the mqtt client loses and regains its network connection, but we are not 100% positive this is the root of the issue with the client. We are using the most up-to-date version, v1.4.3. Here is a partial stack trace:
To work around this, we are wrapping the call to Publish as such:
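(The original snippet is not shown here; the following is a hypothetical reconstruction of such a wrapper, consistent with the QoS 0 question below. The package name, function name, signature, and ten-second timeout are all illustrative, not the commenter's actual code:)

```go
package mqttutil // hypothetical helper package

import (
	"fmt"
	"time"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

// publishWithTimeout is a hypothetical reconstruction of the wrapper
// described above; names and the timeout value are illustrative.
func publishWithTimeout(c mqtt.Client, topic string, qos byte, payload interface{}) error {
	token := c.Publish(topic, qos, false, payload)
	if qos == 0 {
		// Skip waiting for QoS 0 messages, which have no acknowledgement.
		return token.Error()
	}
	if !token.WaitTimeout(10 * time.Second) {
		return fmt.Errorf("publish to %q timed out", topic)
	}
	return token.Error()
}
```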
Is this check for QoS 0 expected? Should … Thanks!
@jismithc - can you please raise this as a new issue (I really should close this one off, as I believe the original issue was fixed by the change to the way the connection status is managed). With QoS 0 messages the token will be closed when the message is transmitted, an attempt to publish it fails, or …
I recently ran into an issue where my paho-based message publisher got into a state where it was blocked forever.
The process appears to have hung on the token.Wait() call after Publish() was called. Here is a thread dump that shows the goroutine blocked on Wait():
My application is currently using paho 1.3.5.
Looking at the code, the Publish() method checks whether the connection is down at the top of the method. If the connection is down, it marks the token as completed (via flowComplete() or setError(); setError() calls flowComplete()) and returns the token. This all happens around line 690 in the code.
However, it looks like there is the possibility that the connection is good at the top of the method, but by the time the goroutine makes it down to line 724 (which checks the connection status again) the connection is either "connecting" or "reconnecting". When this happens, nothing marks the token as complete, and the caller blocks indefinitely on token.Wait().
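To make the race concrete, here is a self-contained toy model of the flow described above (this is not the actual paho source; it just mirrors the two status checks and shows the path on which the token is never completed):

```go
package main

import "fmt"

// token models a paho token: Wait() blocks until complete is closed.
type token struct{ complete chan struct{} }

func (t *token) flowComplete() { close(t.complete) }

// publish mirrors the described v1.3.5 flow: one status check at the
// top, packet construction, then a second check before sending.
func publish(statusAtTop, statusAtSend string) *token {
	t := &token{complete: make(chan struct{})}
	if statusAtTop == "disconnected" {
		t.flowComplete() // top-of-method check completes the token
		return t
	}
	// ... packet construction happens here; the status can change ...
	switch statusAtSend {
	case "connecting", "reconnecting":
		// The message is stored for resend, but nothing calls
		// flowComplete() or setError(), so token.Wait() blocks forever.
	default:
		t.flowComplete() // stand-in for handing the packet to the network loop
	}
	return t
}

func main() {
	t := publish("connected", "reconnecting") // status flipped mid-call
	select {
	case <-t.complete:
		fmt.Println("token completed")
	default:
		fmt.Println("token never completes; Wait() would hang")
	}
}
```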
It looks like the tokens could be marked as complete by the messageids.cleanUp() method:
This method appears to be called when the client explicitly calls Disconnect():
https://github.com/eclipse/paho.mqtt.golang/blob/v1.3.5/client.go#L483
In my case, the client should run indefinitely, so it will not be calling Disconnect().
messageids.cleanUp() is also called here.
However, the logic here appears to be such that messageids.cleanUp() will not be called if paho is trying to reconnect. In my case, c.options.AutoReconnect is true, c.options.CleanSession is true, and I suspect c.connectionStatus() > connecting is true as well (I don't have a way to prove this is actually the case for the stack dump provided). This appears to mean that nothing calls messageids.cleanUp(), as the small sketch below illustrates.
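A tiny runnable evaluation of the gating just described (a paraphrase of the described logic, not the actual library source; the variable names stand in for the real options and status check):

```go
package main

import "fmt"

func main() {
	autoReconnect := true         // c.options.AutoReconnect
	cleanSession := true          // c.options.CleanSession
	statusAboveConnecting := true // c.connectionStatus() > connecting

	// cleanUp() only runs when the client is not going to reconnect.
	reconnect := autoReconnect && statusAboveConnecting
	if cleanSession && !reconnect {
		fmt.Println("cleanUp() runs: outstanding tokens are completed")
	} else {
		fmt.Println("cleanUp() skipped: orphaned tokens stay pending")
	}
}
```

With the reporter's configuration this prints the second branch, which is why nothing ever flushes the token left incomplete by the Publish() race.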
To boil all of this down: it appears possible for the connection status to change between the top of Publish() and the bottom; when this happens, the token is never marked as completed, and the goroutine hangs on token.Wait().
I believe this explains why I saw my publisher hang on token.Wait(). However, I could easily be wrong here. :)
I can attach the paho related go routine stack traces if that would be helpful.
I see that master has changed the logic around the call to c.messageIds.cleanUp(). Should cleanUp() get called now when a connection has reconnected? I will upgrade to the latest release and see if I notice the error again (although it was very rare).