
CCT-176: Handle incorrectly closed TLS connections #3358

Closed
wants to merge 1 commit

Conversation

m-horky
Contributor

@m-horky m-horky commented Nov 24, 2023

  • Card ID: CCT-176
  • Card ID: RHEL-17345

Some servers, like Quarkus, do not send the close_notify alert before
closing their connection by default. This would cause
subscription-manager to freeze until the TLS connection timeout was
reached.

This patch ensures we set our own timeout equal to 3x the response time
of the connection, to kill the connection if we don't get a response
back.

@m-horky m-horky requested a review from jirihnidek November 24, 2023 14:17
@cnsnyder cnsnyder requested a review from a team November 24, 2023 14:18

github-actions bot commented Nov 24, 2023

Coverage

Coverage (computed on Fedora latest)

File                  Stmts   Miss   Cover
rhsm/connection.py     1035    472     54%
TOTAL                 18240   4632     74%

Tests: 2640 · Skipped: 14 💤 · Failures: 0 ❌ · Errors: 0 🔥 · Time: 44.958s ⏱️

@m-horky m-horky marked this pull request as draft November 27, 2023 09:04
@m-horky
Contributor Author

m-horky commented Nov 27, 2023

Marked as draft, waiting for final Candlepin verification. The solution works, but the underlying issue may be caused by the TLS 1.3 half-close policy instead.

@m-horky m-horky force-pushed the mhorky/CCT-176_tls-teardown-handshake branch from 0850499 to 0bd1693 Compare November 30, 2023 13:26
@m-horky m-horky marked this pull request as ready for review November 30, 2023 13:29
Contributor

@ptoscano ptoscano left a comment


LGTM, thanks (and it also simplifies things a bit)

I will leave it to @jirihnidek for an additional review.

Contributor

@jirihnidek jirihnidek left a comment


To be honest: I do not like this change. TLS should be closed properly at the end of communication. If the server does something that does not follow the TLS specification, then it should be fixed on the server side. Doing such a hack on the client side is not a good idea.

@nikosmoum could it be fixed on server side?

@nikosmoum
Member

> To be honest: I do not like this change. TLS should be closed properly at the end of communication. If the server does something that does not follow the TLS specification, then it should be fixed on the server side. Doing such a hack on the client side is not a good idea.
>
> @nikosmoum could it be fixed on the server side?

If I understand correctly, this isn't a hack, and it is conforming to the TLS 1.3 specification.
@bakajstep can you verify this, or am I wrong here?

@m-horky
Contributor Author

m-horky commented Dec 13, 2023

We need to find a solution that works for both TLS 1.2 and 1.3. My alternative approach to this would be a 1-2 second timeout: try our best to close the connection, but if we don't hear from the server, just tear it down.

@bakajstep
Member

bakajstep commented Dec 13, 2023

> > To be honest: I do not like this change. TLS should be closed properly at the end of communication. If the server does something that does not follow the TLS specification, then it should be fixed on the server side. Doing such a hack on the client side is not a good idea.
> >
> > @nikosmoum could it be fixed on the server side?
>
> If I understand correctly, this isn't a hack, and it is conforming to the TLS 1.3 specification. @bakajstep can you verify this, or am I wrong here?

TLS 1.3 uses a half-close policy, while TLS 1.2 and earlier use a duplex-close policy. For applications that depend on the duplex-close policy, there may be compatibility issues when upgrading to TLS 1.3. Source

We can force the duplex-close policy again on the server side for TLS 1.3. @jirihnidek

@jirihnidek
Contributor

jirihnidek commented Dec 14, 2023

The specification says it clearly; close_notify has to be supported: https://datatracker.ietf.org/doc/html/rfc8446#section-6.1

Each party MUST send a "close_notify" alert before closing its write side of the connection, unless it has already sent some error alert.

How can I test this issue with Quarkus? Is there any feature branch or is it already in main?

I still do not understand one thing. If the server side does not respond to close_notify with close_notify but only closes the TCP connection, then we should not notice any timeout on the client side... That is the current wrong behavior of the Candlepin server (Candlepin does not respond to close_notify with close_notify; it just closes the TCP connection).

@jirihnidek
Contributor

> We need to find a solution that works for both TLS 1.2 and 1.3. My alternative approach to this would be a 1-2 second timeout: try our best to close the connection, but if we don't hear from the server, just tear it down.

We know the RTT for each connection. It would not be necessary to wait a fixed amount of time.
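For reference, a smoothed response time of the kind the later snippets call `smoothed_rt` is typically an exponentially weighted moving average, in the spirit of TCP's SRTT (RFC 6298) — a sketch with a hypothetical helper:

```python
def smooth_rt(previous, sample, alpha=0.125):
    """Exponentially weighted moving average of response times.

    Hypothetical helper: `previous` is the smoothed value so far
    (None before the first request), `sample` is the latest measured
    response time in seconds. alpha=1/8 mirrors TCP's SRTT weighting.
    """
    if previous is None:
        return sample
    return (1 - alpha) * previous + alpha * sample
```

With a smoothed value like this in hand, the teardown timeout from the PR description is simply three times it, rather than a fixed wait.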

@bakajstep
Member

> The specification says it clearly; close_notify has to be supported: https://datatracker.ietf.org/doc/html/rfc8446#section-6.1
>
> > Each party MUST send a "close_notify" alert before closing its write side of the connection, unless it has already sent some error alert.
>
> How can I test this issue with Quarkus? Is there any feature branch or is it already in main?
>
> I still do not understand one thing. If the server side does not respond to close_notify with close_notify but only closes the TCP connection, then we should not notice any timeout on the client side... That is the current wrong behavior of the Candlepin server (Candlepin does not respond to close_notify with close_notify; it just closes the TCP connection).

I will send you an example.

@jirihnidek
Contributor

The TLS 1.3 specification does not say that the server should do nothing when close_notify is received. It says that sending close_notify immediately is not enforced, because truncation of data could happen in TLS versions prior to 1.3. When the server side has nothing to send and it receives close_notify, it is logical to also send close_notify to the client.

We use HTTPS and a REST API. There is zero chance that the server will have to send anything asynchronously to the client.

@m-horky m-horky force-pushed the mhorky/CCT-176_tls-teardown-handshake branch from 0bd1693 to 01b2277 Compare December 19, 2023 12:56
@m-horky
Contributor Author

m-horky commented Dec 19, 2023

> We know the RTT for each connection.

Right, I forgot about it. The function had to grow a bit, but I think it is still clear what is happening there.

@m-horky m-horky force-pushed the mhorky/CCT-176_tls-teardown-handshake branch 2 times, most recently from a543df2 to 377eb57 Compare December 19, 2023 13:16
@m-horky m-horky changed the title CCT-176: Do not wait until the server closes the TLS connection CCT-176: Handle incorrectly closed TLS connections Dec 19, 2023
Contributor

@jirihnidek jirihnidek left a comment


I still think that the server should do it properly... I have a few comments on your PR.

raise TimeoutError(f"Did not get response in {timeout_time}s")

signal.signal(signalnum=signal.SIGALRM, handler=on_timeout)
signal.alarm(timeout_time)
Contributor


This is not the best approach, because this code is also used by rhsm.service. It means that a TimeoutError could be raised even though the connection was closed properly...

try:
self.__conn.sock.unwrap()
except (ssl.SSLError, TimeoutError) as err:
log.debug(f"TLS connection could not be closed: {err}")
Contributor


The alarm should be canceled in the else statement using signal.alarm(0). See: https://man7.org/linux/man-pages/man2/alarm.2.html

Also, I am not sure how it will work with the multi-threaded rhsm.service.
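One thread-safe alternative to SIGALRM (which only works in the main thread) is a per-socket timeout: ssl.SSLSocket inherits settimeout(), and the timeout applies to the blocking read inside unwrap(). A hypothetical sketch, written against a duck-typed socket for illustration:

```python
def close_tls(sock, timeout: float) -> None:
    """Tear down a TLS connection without hanging, thread-safely.

    Hypothetical sketch: `sock` is an ssl.SSLSocket (or anything
    duck-typed like one). settimeout() bounds the blocking read in
    unwrap(), so a server that never answers our close_notify makes
    unwrap() raise socket.timeout instead of blocking forever.
    """
    sock.settimeout(timeout)
    try:
        sock.unwrap()  # send close_notify, wait for the server's
    except OSError:    # ssl.SSLError and socket.timeout both land here
        pass           # server closed the TCP connection uncleanly
    finally:
        sock.close()
```

Because the timeout lives on the socket itself, no process-wide signal handler is involved, so concurrent threads each tearing down their own connection do not interfere with one another.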

Contributor


I would also add the original debug message to the else statement.

Contributor Author


After a bit of investigation, I found out that connection reuse is disabled for the D-Bus server, so this code will never run in a multi-threaded context.

# Do not allow reusing connection, because server uses multiple threads. If reusing connection
# was used, then two threads could try to use the connection in the almost same time.
# It means that one thread could send request and the second one could send second request
# before the first thread received response. This could lead to raising exception CannotSendRequest.
connection.REUSE_CONNECTION = False

Contributor


I think that this code could run in a multi-threaded context.

When connection reuse is disabled, it only means that each REST API request uses its own TLS connection, but the connection still has to be closed.

Correct me if I'm wrong, but I believe there could be more threads doing REST API calls when two apps use rhsm.service concurrently.

Contributor Author


The whole setup does not seem thread-safe. The __conn is a class attribute; it would be shared between the threads. If we want to make it thread-safe, we'd need to change the architecture of the whole connection-reuse mechanism.

I'm still not sure this PR makes sense. We're not handling "the sad path" in so many other places that this seems unnecessary -- we know the initial cause will be fixed server-side, and this patch will not be executed in real conditions.

log.debug(
f"Latest response time was {response_time:.5f}s, "
f"smoothed response time is {self.smoothed_rt:.5f}s"
)
Contributor


Not related to this issue, but I don't care ;-)

# See RHEL-17345.
log.debug(f"Closing TLS connection {self.__conn.sock}")

response_time: float = self.smoothed_rt or 0.5
Contributor


Why 0.5? Shouldn't we move it to some constant?

Contributor Author


I am not sure. In theory this can't even happen; not having smoothed_rt means we haven't had any connection to close.
If it should be a constant, where? Is it important enough to be in a config file?

Contributor


An option in the configuration file would be overkill. I would just create a constant in connection.py with a comment. I don't like magic numbers in the code without any explanation.
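The reviewer's suggestion (a named, commented constant instead of a bare 0.5) might look like this — the constant name and helper are hypothetical, not taken from the actual patch:

```python
#: Fallback response time (seconds) used when no request has been
#: timed yet. In practice smoothed_rt should always be set by the
#: time we close a connection; this only guards the degenerate case.
DEFAULT_RESPONSE_TIME: float = 0.5


def teardown_timeout(smoothed_rt):
    """Timeout for the TLS teardown: 3x the smoothed response time,
    per the PR description. Hypothetical helper for illustration."""
    return 3 * (smoothed_rt or DEFAULT_RESPONSE_TIME)
```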

Contributor Author


Sounds good.

@m-horky
Contributor Author

m-horky commented Jan 9, 2024

> server should do it properly

@jirihnidek Well, since the patch only complicates things, and we know the server is capable of closing the connection properly, should we close this PR?

@jirihnidek
Contributor

jirihnidek commented Jan 9, 2024

> > server should do it properly
>
> @jirihnidek Well, since the patch only complicates things, and we know the server is capable of closing the connection properly, should we close this PR?

No, I think that we should continue to work on this PR. Remember that the server should stay reliable when clients go crazy... and clients should also stay reliable when the server goes crazy.

It is good to consider timeouts at every moment of communication between client and server.

@m-horky m-horky force-pushed the mhorky/CCT-176_tls-teardown-handshake branch from 377eb57 to eab5b64 Compare January 11, 2024 12:01
* Card ID: CCT-176
* Card ID: RHEL-17345

Some servers, like Quarkus, do not send the `close_notify` alert before
closing their connection by default. This would cause
subscription-manager to freeze until the TLS connection timeout was
reached.

This patch ensures we set our own timeout equal to 3x the response time
of the connection, to kill the connection if we don't get a response
back.
@m-horky m-horky force-pushed the mhorky/CCT-176_tls-teardown-handshake branch from eab5b64 to ec1b1ae Compare January 11, 2024 16:41
@m-horky m-horky closed this Jan 22, 2024
@m-horky m-horky deleted the mhorky/CCT-176_tls-teardown-handshake branch May 21, 2024 11:05