CCT-176: Handle incorrectly closed TLS connections #3358
Conversation
Marked as draft, waiting for final Candlepin verification. The solution works, but the underlying issue may be caused by the TLS 1.3 half-close policy instead.
Force-pushed from 0850499 to 0bd1693
LGTM, thanks (and it also simplifies things a bit)
I will leave it to @jirihnidek for an additional review.
To be honest: I do not like this change. TLS should be closed at the end of communication. If the server does something that does not follow the TLS specification, then it should be fixed on the server side. Doing such a hack on the client side is not a good idea.
@nikosmoum could it be fixed on the server side?
If I understand correctly, this isn't a hack; it conforms to the TLS 1.3 specification.
We need to find a solution that works for both TLS 1.2 and 1.3. My alternative approach would be a 1-2 second timeout: try our best to close the connection, but if we don't hear from the server, just tear it down.
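A minimal sketch of that fallback, assuming a plain `ssl.SSLSocket`; the function name and the 2-second value are illustrative, not the PR's code:

```python
import ssl

def best_effort_close(sock: ssl.SSLSocket) -> None:
    """Send close_notify, but never wait more than ~2 s for the server's reply."""
    sock.settimeout(2.0)     # illustrative 1-2 s budget from the comment above
    try:
        raw = sock.unwrap()  # sends our close_notify and waits for the peer's reply
    except OSError:          # covers ssl.SSLError and socket timeouts
        sock.close()         # server never answered: tear it down anyway
    else:
        raw.close()          # clean duplex close; unwrap() returned the plain socket
```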
TLS 1.3 uses a half-close policy, while TLS 1.2 and earlier use a duplex-close policy. For applications that depend on the duplex-close policy, there may be compatibility issues when upgrading to TLS 1.3. (Source) We can force the duplex-close policy again on the server side for TLS 1.3. @jirihnidek
The specification says it clearly.
How can I test this issue with Quarkus? Is there any feature branch, or is it already in main? I still do not understand one thing: if the server side does not respond to `close_notify`, …
We know the RTT for each connection. It would not be necessary to wait a fixed time.
I will send you an example.
The TLS 1.3 specification does not say that the server should do nothing when it receives `close_notify`. We use HTTPS and a REST API. There is zero chance that the server will have to send anything asynchronously to the client.
Force-pushed from 0bd1693 to 01b2277
Right, I forgot about it. The function had to grow a bit, but I think it is still clear what is happening there.
Force-pushed from a543df2 to 377eb57
I still think that the server should do it properly... I have a few comments on your PR.
```python
raise TimeoutError(f"Did not get response in {timeout_time}s")

signal.signal(signalnum=signal.SIGALRM, handler=on_timeout)
signal.alarm(timeout_time)
```
This is not the best approach, because this code is also used for `rhsm.service`. It means that `TimeoutError` would be raised even though the connection was closed properly...
```python
try:
    self.__conn.sock.unwrap()
except (ssl.SSLError, TimeoutError) as err:
    log.debug(f"TLS connection could not be closed: {err}")
```
The alarm should be canceled in the `else` statement using `signal.alarm(0)`. See: https://man7.org/linux/man-pages/man2/alarm.2.html
Also, I am not sure how this will work with the multi-threaded `rhsm.service`.
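A sketch of the suggested shape, reusing the names from the diff above (`on_timeout` and `timeout_time` are the PR's; the `else` branch is the reviewer's suggested addition):

```python
signal.signal(signalnum=signal.SIGALRM, handler=on_timeout)
signal.alarm(timeout_time)
try:
    self.__conn.sock.unwrap()
except (ssl.SSLError, TimeoutError) as err:
    log.debug(f"TLS connection could not be closed: {err}")
else:
    # Cancel the pending alarm so it cannot fire after a clean close.
    signal.alarm(0)
```

(A `finally: signal.alarm(0)` would also cancel the alarm on the `ssl.SSLError` path, where it may still be pending.)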
I would also add the original debug message to the `else` statement.
After a bit of investigation I found out that connection reuse is disabled for the D-Bus server, so this code will never run in a multi-threaded context.
subscription-manager/src/rhsmlib/dbus/server.py, lines 84 to 88 in 47d2729:
```python
# Do not allow reusing connection, because server uses multiple threads. If reusing connection
# was used, then two threads could try to use the connection in the almost same time.
# It means that one thread could send request and the second one could send second request
# before the first thread received response. This could lead to raising exception CannotSendRequest.
connection.REUSE_CONNECTION = False
```
I think that this code could still be run in a multi-threaded context.
When connection reuse is disabled, it only means that each REST API request uses its own TLS connection, but the connection still has to be closed.
Correct me if I'm wrong, but I believe there could be multiple threads doing REST API calls when two apps are using `rhsm.service` concurrently.
The whole setup does not seem thread-safe. `__conn` is a class attribute, so it would be shared between the threads. If we want to make it thread-safe, we'd need to change the architecture of the whole connection reuse.
I'm still not sure this PR makes sense. We're not handling "the sad path" in so many other places that this seems unnecessary -- we know the initial cause will be fixed server-side and this patch will not be executed in real conditions.
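One possible direction for such a rearchitecture, sketched purely as an assumption (nothing below exists in the PR), is to keep the connection in thread-local storage instead of a shared class attribute:

```python
import threading

class ConnectionHolder:
    """Illustrative only: thread-local connection storage (not the PR's code)."""

    _local = threading.local()

    def get_connection(self):
        # Each thread sees its own `conn` attribute, so two worker threads
        # can never race on one shared connection object.
        conn = getattr(self._local, "conn", None)
        if conn is None:
            conn = self._open_connection()
            self._local.conn = conn
        return conn

    def _open_connection(self):
        raise NotImplementedError  # stands in for the real TLS connection setup
```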
```python
log.debug(
    f"Latest response time was {response_time:.5f}s, "
    f"smoothed response time is {self.smoothed_rt:.5f}s"
)
```
Not related to this issue, but I don't care ;-)
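For context, `smoothed_rt` is presumably an exponentially weighted moving average of measured response times; the formula and the 0.9 smoothing factor below are illustrative assumptions, not the PR's actual code:

```python
class ResponseTimeTracker:
    """Illustrative EWMA of response times (assumed, not the PR's implementation)."""

    def __init__(self, alpha=0.9):
        self.alpha = alpha       # weight given to the accumulated history
        self.smoothed_rt = None  # no measurement yet

    def update(self, response_time):
        if self.smoothed_rt is None:
            self.smoothed_rt = response_time
        else:
            self.smoothed_rt = (
                self.alpha * self.smoothed_rt + (1 - self.alpha) * response_time
            )
        return self.smoothed_rt
```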
src/rhsm/connection.py (Outdated)
```python
# See RHEL-17345.
log.debug(f"Closing TLS connection {self.__conn.sock}")

response_time: float = self.smoothed_rt or 0.5
```
Why `0.5`? Shouldn't we move it to some constant?
I am not sure. In theory this can't even happen: not having `smoothed_rt` means we haven't got any connection to close.
If it should be a constant, where? Is it important enough to be in a config file?
An option in the configuration file would be overkill. I would just create a constant in `connection.py` with a comment. I don't like magic numbers in the code without any explanation.
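A sketch of that suggestion; the constant name and comment wording are illustrative, not the PR's final code:

```python
# Fallback response time in seconds, used only when no request has been timed
# yet (smoothed_rt is still None). Named instead of a bare 0.5 so the magic
# number is explained in one place. See RHEL-17345.
DEFAULT_RESPONSE_TIME: float = 0.5

def effective_response_time(smoothed_rt):
    return smoothed_rt or DEFAULT_RESPONSE_TIME
```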
Sounds good.
@jirihnidek Well, since the patch only complicates things, and we know the server is capable of closing the connection properly, should we close this PR?
No, I think that we should continue to work on this PR. Remember that the server should be reliable when clients go crazy... and clients should also be reliable when the server goes crazy. It is good to consider timeouts at every moment of communication between client and server.
Force-pushed from 377eb57 to eab5b64
* Card ID: CCT-176
* Card ID: RHEL-17345

Some servers, like Quarkus, do not send the `close_notify` alert before closing their connection by default. This would cause subscription-manager to freeze until the TLS connection timeout was reached. This patch ensures we set our own timeout equal to 3x the response time of the connection, to kill the connection if we don't get a response back.
Force-pushed from eab5b64 to ec1b1ae
Some servers, like Quarkus, do not send the `close_notify` alert before closing their connection by default. This would cause subscription-manager to freeze until the TLS connection timeout was reached.

This patch ensures we set our own timeout equal to 3x the response time of the connection, to kill the connection if we don't get a response back.
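Putting the pieces from the review together, a sketch of how the 3x-response-time timeout plausibly fits around the close (the names reuse the snippets above; the exact assembly is an assumption, not a verbatim copy of the merged code):

```python
import logging
import signal
import ssl

log = logging.getLogger(__name__)

DEFAULT_RESPONSE_TIME: float = 0.5  # fallback when no request has been timed yet

def close_tls_connection(sock: ssl.SSLSocket, smoothed_rt) -> None:
    """Close a TLS connection, giving up if the server never answers close_notify."""
    response_time = smoothed_rt or DEFAULT_RESPONSE_TIME
    # signal.alarm() only takes whole seconds, so round the 3x budget up to >= 1.
    timeout_time = max(1, round(3 * response_time))

    def on_timeout(signum, frame):
        raise TimeoutError(f"Did not get response in {timeout_time}s")

    signal.signal(signalnum=signal.SIGALRM, handler=on_timeout)
    signal.alarm(timeout_time)
    try:
        sock.unwrap()  # send close_notify and wait for the server's reply
    except (ssl.SSLError, TimeoutError) as err:
        log.debug(f"TLS connection could not be closed: {err}")
    else:
        signal.alarm(0)  # clean close: cancel the pending alarm
        log.debug("TLS connection closed")
```

Note that `signal.signal()` may only be called from the main thread, which is why the multi-threaded `rhsm.service` concern raised in the review matters.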