
CCT-176: Handle incorrectly closed TLS connections #3358

Closed
wants to merge 1 commit

Conversation

m-horky
Contributor

@m-horky m-horky commented Nov 24, 2023

  • Card ID: CCT-176
  • Card ID: RHEL-17345

Some servers, like Quarkus, do not send the close_notify alert before
closing their connection by default. This would cause
subscription-manager to freeze until the TLS connection timeout was
reached.

This patch ensures we set our own timeout equal to 3x the response time
of the connection, to kill the connection if we don't get a response
back.

@m-horky m-horky requested a review from jirihnidek November 24, 2023 14:17
@cnsnyder cnsnyder requested a review from a team November 24, 2023 14:18

github-actions bot commented Nov 24, 2023

Coverage

Coverage (computed on Fedora latest)

File                  Stmts   Miss   Cover
rhsm/connection.py     1035    472     54%
TOTAL                 18240   4632     74%

Tests: 2640 · Skipped: 14 💤 · Failures: 0 ❌ · Errors: 0 🔥 · Time: 44.958s ⏱️

@m-horky m-horky marked this pull request as draft November 27, 2023 09:04
@m-horky
Contributor Author

m-horky commented Nov 27, 2023

Marked as draft, waiting for final Candlepin verification. The solution works, but the underlying issue may be caused by the TLS 1.3 half-close policy instead.

@m-horky m-horky force-pushed the mhorky/CCT-176_tls-teardown-handshake branch from 0850499 to 0bd1693 Compare November 30, 2023 13:26
@m-horky m-horky marked this pull request as ready for review November 30, 2023 13:29
Contributor

@ptoscano ptoscano left a comment


LGTM, thanks (and it also simplifies things a bit)

I will leave it to @jirihnidek for an additional review.

Contributor

@jirihnidek jirihnidek left a comment


To be honest: I do not like this change. TLS should be closed properly at the end of communication. If the server does something that does not follow the TLS specification, then it should be fixed on the server side. Doing such a hack on the client side is not a good idea.

@nikosmoum could it be fixed on server side?

@nikosmoum
Member

> To be honest: I do not like this change. TLS should be closed properly at the end of communication. If the server does something that does not follow the TLS specification, then it should be fixed on the server side. Doing such a hack on the client side is not a good idea.
>
> @nikosmoum could it be fixed on the server side?

If I understand correctly, this isn't a hack, and it is conforming to the TLS 1.3 specification.
@bakajstep can you verify this, or am I wrong here?

@m-horky
Contributor Author

m-horky commented Dec 13, 2023

We need to find a solution that works for both TLS 1.2 and 1.3. My alternative approach to this would be a 1-2 second timeout: try our best to close the connection, but if we don't hear from the server, just tear it down.

@bakajstep
Member

bakajstep commented Dec 13, 2023

> > To be honest: I do not like this change. TLS should be closed properly at the end of communication. If the server does something that does not follow the TLS specification, then it should be fixed on the server side. Doing such a hack on the client side is not a good idea.
> >
> > @nikosmoum could it be fixed on the server side?
>
> If I understand correctly, this isn't a hack, and it is conforming to the TLS 1.3 specification. @bakajstep can you verify this, or am I wrong here?

TLS 1.3 uses a half-close policy, while TLS 1.2 and earlier use a duplex-close policy. For applications that depend on the duplex-close policy, there may be compatibility issues when upgrading to TLS 1.3. Source

We can force the duplex-close policy again on the server side for TLS 1.3. @jirihnidek

@jirihnidek
Contributor

jirihnidek commented Dec 14, 2023

The specification says it clearly; close_notify has to be supported: https://datatracker.ietf.org/doc/html/rfc8446#section-6.1

Each party MUST send a "close_notify" alert before closing its write side of the connection, unless it has already sent some error alert.

How can I test this issue with Quarkus? Is there any feature branch or is it already in main?

I still do not understand one thing. If the server side does not respond to close_notify with close_notify but only closes the TCP connection, then we should not notice any timeout on the client side... That is the current wrong behavior of the Candlepin server (Candlepin does not respond to close_notify with close_notify; it just closes the TCP connection).

@jirihnidek
Contributor

> We need to find a solution that works for both TLS 1.2 and 1.3. My alternative approach to this would be a 1-2 second timeout: try our best to close the connection, but if we don't hear from the server, just tear it down.

We know the RTT for each connection. It would not be necessary to wait a fixed amount of time.
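For reference, a smoothed response time of the kind the later snippets call `smoothed_rt` is typically an exponentially weighted moving average, in the spirit of TCP's SRTT (RFC 6298) — a sketch with a hypothetical helper:

```python
def smooth_rt(previous, sample, alpha=0.125):
    """Exponentially weighted moving average of response times.

    Hypothetical helper: `previous` is the smoothed value so far
    (None before the first request), `sample` is the latest measured
    response time in seconds. alpha=1/8 mirrors TCP's SRTT weighting.
    """
    if previous is None:
        return sample
    return (1 - alpha) * previous + alpha * sample
```

With a smoothed value like this in hand, the teardown timeout from the PR description is simply three times it, rather than a fixed wait.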

@bakajstep
Member

> The specification says it clearly; close_notify has to be supported: https://datatracker.ietf.org/doc/html/rfc8446#section-6.1
>
> > Each party MUST send a "close_notify" alert before closing its write side of the connection, unless it has already sent some error alert.
>
> How can I test this issue with Quarkus? Is there any feature branch or is it already in main?
>
> I still do not understand one thing. If the server side does not respond to close_notify with close_notify but only closes the TCP connection, then we should not notice any timeout on the client side... That is the current wrong behavior of the Candlepin server (Candlepin does not respond to close_notify with close_notify; it just closes the TCP connection).

I will send you an example.

@jirihnidek
Contributor

The TLS 1.3 specification does not say that the server should do nothing when close_notify is received. It says that sending close_notify immediately is not enforced, because truncation of data could happen in TLS versions prior to 1.3. When the server side has nothing to send and it receives close_notify, it is logical to also send close_notify to the client.

We use HTTPS and a REST API. There is zero chance that the server will have to send anything asynchronously to the client.

@m-horky m-horky force-pushed the mhorky/CCT-176_tls-teardown-handshake branch from 0bd1693 to 01b2277 Compare December 19, 2023 12:56
@m-horky
Contributor Author

m-horky commented Dec 19, 2023

> We know the RTT for each connection.

Right, I forgot about it. The function had to grow a bit, but I think it is still clear what is happening there.

@m-horky m-horky force-pushed the mhorky/CCT-176_tls-teardown-handshake branch 2 times, most recently from a543df2 to 377eb57 Compare December 19, 2023 13:16
@m-horky m-horky changed the title CCT-176: Do not wait until the server closes the TLS connection CCT-176: Handle incorrectly closed TLS connections Dec 19, 2023
Contributor

@jirihnidek jirihnidek left a comment


I still think that the server should do it properly... I have a few comments on your PR.

raise TimeoutError(f"Did not get response in {timeout_time}s")

signal.signal(signalnum=signal.SIGALRM, handler=on_timeout)
signal.alarm(timeout_time)
Contributor


This is not the best approach, because this code is also used by rhsm.service. It means that a TimeoutError could be raised even though the connection was closed properly...

try:
self.__conn.sock.unwrap()
except (ssl.SSLError, TimeoutError) as err:
log.debug(f"TLS connection could not be closed: {err}")
Contributor


The alarm should be canceled in the else statement using signal.alarm(0). See: https://man7.org/linux/man-pages/man2/alarm.2.html

Also, I am not sure how it will work with the multi-threaded rhsm.service.
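One thread-safe alternative to SIGALRM (which only works in the main thread) is a per-socket timeout: ssl.SSLSocket inherits settimeout(), and the timeout applies to the blocking read inside unwrap(). A hypothetical sketch, written against a duck-typed socket for illustration:

```python
def close_tls(sock, timeout: float) -> None:
    """Tear down a TLS connection without hanging, thread-safely.

    Hypothetical sketch: `sock` is an ssl.SSLSocket (or anything
    duck-typed like one). settimeout() bounds the blocking read in
    unwrap(), so a server that never answers our close_notify makes
    unwrap() raise socket.timeout instead of blocking forever.
    """
    sock.settimeout(timeout)
    try:
        sock.unwrap()  # send close_notify, wait for the server's
    except OSError:    # ssl.SSLError and socket.timeout both land here
        pass           # server closed the TCP connection uncleanly
    finally:
        sock.close()
```

Because the timeout lives on the socket itself, no process-wide signal handler is involved, so concurrent threads each tearing down their own connection do not interfere with one another.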

Contributor


I would also add the original debug message to the else statement.

Contributor Author


After a bit of investigation, I found out that connection reuse is disabled for the D-Bus server, so this code will never run in a multi-threaded context.

# Do not allow reusing connection, because server uses multiple threads. If reusing connection
# was used, then two threads could try to use the connection in the almost same time.
# It means that one thread could send request and the second one could send second request
# before the first thread received response. This could lead to raising exception CannotSendRequest.
connection.REUSE_CONNECTION = False

Contributor


I think that this code could run in a multi-threaded context.

When connection reuse is disabled, it only means that each REST API request uses its own TLS connection, but the connection still has to be closed.

Correct me if I'm wrong, but I believe there could be more threads doing REST API calls when two apps use rhsm.service concurrently.

Contributor Author


The whole setup does not seem thread-safe. The __conn is a class attribute; it would be shared between the threads. If we want to make it thread-safe, we'd need to change the architecture of the whole connection-reuse mechanism.

I'm still not sure this PR makes sense. We're not handling "the sad path" in so many other places that this seems unnecessary -- we know the initial cause will be fixed server-side, and this patch will not be executed in real conditions.

log.debug(
f"Latest response time was {response_time:.5f}s, "
f"smoothed response time is {self.smoothed_rt:.5f}s"
)
Contributor


Not related to this issue, but I don't care ;-)

# See RHEL-17345.
log.debug(f"Closing TLS connection {self.__conn.sock}")

response_time: float = self.smoothed_rt or 0.5
Contributor


Why 0.5? Shouldn't we move it to some constant?

Contributor Author


I am not sure. In theory this can't even happen; not having smoothed_rt means we haven't had any connection to close.
If it should be a constant, where? Is it important enough to be in a config file?

Contributor


An option in the configuration file would be overkill. I would just create a constant in connection.py with a comment. I don't like magic numbers in the code without any explanation.
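The reviewer's suggestion (a named, commented constant instead of a bare 0.5) might look like this — the constant name and helper are hypothetical, not taken from the actual patch:

```python
#: Fallback response time (seconds) used when no request has been
#: timed yet. In practice smoothed_rt should always be set by the
#: time we close a connection; this only guards the degenerate case.
DEFAULT_RESPONSE_TIME: float = 0.5


def teardown_timeout(smoothed_rt):
    """Timeout for the TLS teardown: 3x the smoothed response time,
    per the PR description. Hypothetical helper for illustration."""
    return 3 * (smoothed_rt or DEFAULT_RESPONSE_TIME)
```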

Contributor Author


Sounds good.

@m-horky
Contributor Author

m-horky commented Jan 9, 2024

> server should do it properly

@jirihnidek Well, since the patch only complicates things, and we know the server is capable of closing the connection properly, should we close this PR?

@jirihnidek
Contributor

jirihnidek commented Jan 9, 2024

> > server should do it properly
>
> @jirihnidek Well, since the patch only complicates things, and we know the server is capable of closing the connection properly, should we close this PR?

No, I think that we should continue to work on this PR. Remember that the server should stay reliable when clients go crazy... and clients should also stay reliable when the server goes crazy.

It is good to consider timeouts at every moment of communication between client and server.

@m-horky m-horky force-pushed the mhorky/CCT-176_tls-teardown-handshake branch from 377eb57 to eab5b64 Compare January 11, 2024 12:01
* Card ID: CCT-176
* Card ID: RHEL-17345

Some servers, like Quarkus, do not send the `close_notify` alert before
closing their connection by default. This would cause
subscription-manager to freeze until the TLS connection timeout was
reached.

This patch ensures we set our own timeout equal to 3x the response time
of the connection, to kill the connection if we don't get a response
back.
@m-horky m-horky force-pushed the mhorky/CCT-176_tls-teardown-handshake branch from eab5b64 to ec1b1ae Compare January 11, 2024 16:41
@m-horky m-horky closed this Jan 22, 2024
@m-horky m-horky deleted the mhorky/CCT-176_tls-teardown-handshake branch May 21, 2024 11:05