Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closing TCP connection after SSL_ERROR_SYSCALL #7

Open
botanegg opened this issue Feb 12, 2023 · 6 comments
Open

Closing TCP connection after SSL_ERROR_SYSCALL #7

botanegg opened this issue Feb 12, 2023 · 6 comments

Comments

@botanegg
Copy link
Contributor

botanegg commented Feb 12, 2023

I've test library with mumble-server based on mumblevoip/mumble-server:v1.4.287

After I have connected (and sent Ping messages to server) I got an SSL_ERROR_SYSCALL in TLS.cpp and then got Shutdown and closed TCP.
You need to be connected a lot of time (from 10 minutes to 10 hours) to reproduce
https://github.com/mumble-voip/libmumble/blob/master/src/TLS.cpp#L184-L190

top of stacktrace:

#0  mumble::SocketTLS::interpretLibCode (this=0x55555566b760, code=0, processed=false, remaining=true) at /PATH/libmumble/src/TLS.cpp:194
#1  0x00007ffff7f72682 in mumble::SocketTLS::read (this=0x55555566b760, buf=...) at /PATH/libmumble/src/TLS.cpp:161
#2  0x00007ffff7f2b21a in mumble::Connection::P::read(gsl::span<std::byte, 18446744073709551615ul>, bool, std::function<bool ()>) (this=0x55555566b760, buf=..., wait=false, halt=...) at /PATH/libmumble/src/Connection.cpp:156
#3  0x00007ffff7f2ac81 in mumble::Connection::process(bool, std::function<bool ()>) (this=0x5555556193c0, wait=false, halt=...) at /PATH/libmumble/src/Connection.cpp:101
#4  0x00007ffff7f591ef in operator() (__closure=0x7ffff0028dc0, event=...) at /PATH/libmumble/src/Peer.cpp:299
#5  0x00007ffff7f59ac1 in operator() (__closure=0x7ffff0028dc0, i=0) at /PATH/libmumble/3rdparty/quickpool/quickpool.hpp:928
#6  0x00007ffff7f5a35b in quickpool::loop::Worker<quickpool::ThreadPool::parallel_for_each<gsl::span<mumble::Monitor::Event>, 

I check ERR_get_error() is 0 and errno is 11 inside SSL_ERROR_SYSCALL case

auto sslError = ERR_get_error();
auto err = errno;

After my little patch a have no random disconnect anymore

diff --git a/src/TLS.cpp b/src/TLS.cpp
index 8ea8176..136f4db 100644
--- a/src/TLS.cpp
+++ b/src/TLS.cpp
@@ -182,6 +182,9 @@ uint32_t SocketTLS::pending() const {
 		case SSL_ERROR_WANT_WRITE:
 			return WaitOut;
 		case SSL_ERROR_SYSCALL:
+			if (ERR_get_error() == 0 && errno == 11) {
+				return Retry;
+			}
 			m_closed = true;
 
 			if (!processed) {

But I'm not sure this is enough and this is idiomatic code for handling ssl problems

Related links from so:
https://stackoverflow.com/questions/13686398/ssl-read-failing-with-ssl-error-syscall-error
https://stackoverflow.com/questions/13554691/errno-11-resource-temporarily-unavailable

@botanegg
Copy link
Contributor Author

botanegg commented Feb 12, 2023

After a lot of tests of #8 PR

I found a reproduce steps (do it on same machine):

  1. run libmumble based client and connect to server
  2. open mumble-client
  3. open server list, add same server to favorites
  4. close server list
  5. open server list
  6. repeat 4 and 5 over and over again (on my machine 5-6 times is enough)
  7. libmumble based client will return shutdown code in next line https://github.com/mumble-voip/libmumble/blob/master/src/TLS.cpp#L188

ps. sometimes you need to connect once to server via mumble-client before open-close loop

I think it is another bug besides EAGAIN handle

@botanegg
Copy link
Contributor Author

Also
High ping to server - reduces reproducibility
Set threads to 1 in code = client.startTCP(peerFeedback(), 1); - reduces reproducibility

@davidebeatrici
Copy link
Member

Thank you very much for the detailed report!

I would expect OpenSSL to handle EAGAIN, but according to your findings it doesn't seem to be the case.

If your pull request effectively fixes the issue, I'm going to merge it right away.

@botanegg
Copy link
Contributor Author

pr #8 just reduces reproducibility but not completely solve the problem
I think it can be merged but it not solve all the problems in this place

@botanegg
Copy link
Contributor Author

OK
Next point is thread safety
Seems like SSL objects can be used via multiple threads without any mutex
When we used this way we got "Requesting crypt-nonce resync" in server logs, and then drop the connection

@davidebeatrici
Copy link
Member

Now that both #8 and #9 were merged, are there any issues left (that you discovered)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants