How to properly handle connection issues and reconnecting? (request for comments) #1660

cerlestes · 2024-06-18T18:51:37Z

Hello everyone!

First off, thanks a lot to every contributor of this repository; it's a great library that has helped us out tremendously in multiple projects. Secondly, I hope it's okay that I'm using an issue to open a discussion. I'd like to gather some insights from people who are more knowledgable than I am about OPC-UA and this library, hoping that I'll be able to contribute a well-rounded feature out of this discussion in the future.

The topic is handling connection issues and reconnecting properly. Right now, whenever our application loses the connection to the OPC-UA server, for example because the PLC config changed and it's reloading the server, we're reconnecting the client from our application once we try to interact with a node and it fails (we catch the UaError and simply try connecting up to a few times). This was fine until subscriptions came into play. With subscriptions, I'm really having a hard time finding the proper way to detect issues, reconnect and restart the subscriptions.

I've found the Client._monitor_server_loop() method, which is started as a task into Client._monitor_server_task. Once the connection dies, it'll inform the subscriptions of the BadShutdown. This seems to be about the only way to be informed about a connection issue other than emulating that behaviour externally to the client, polling and catching errors when they are raised. Another method of detecting connection issues is the Client.check_connection() method. But again, this method must be polled from the application external to the client.

I think ideally the client itself should provide a mechanism to allow applications to react to connection issues and states in general, i.e. callback when the client lost the connection. On top of that, it should then implement an optional reconnect mechanism that, when enabled, automatically attempts to reconnect upon losing connection, including restoring any subscriptions.

My current proposal would be the following:

Add three asyncio.Event instances Client.connected, Client.disconnected, Client.failed. These events are set() when the respective connection state is reached and clear()-ed when the respectice state is left. This would allow application code to simply await client.connected.wait() before each interaction with the client. It would also allow to run error handler tasks once the connection fails with await client.failed.wait().
Maybe add a set of methods Client.add_connected_callback(), Client.add_disconnected_callback(), Client.add_failed_callback() to register callback functions which are called once the respective state is reached.
Add a new optional parameter to Client() which could be as simple as auto_reconnect: bool = False.
Whenever the auto_reconnect is enabled, an additional task Client._auto_reconnect_task will be created by the client upon connecting, which continously calls Client.check_connection() similiar to how the Client._monitor_server_loop() works, and in case of an error automatically tries connecting the client again.
Probably a bit more configuration is required for that feature, so maybe add a dataclass AutoReconnectSettings. The following settings come to mind:
- How often to try reconnecting before giving up
- How long to wait in between connection attempts (maybe with exponential backoff?)
- Whether or not to restart the subscriptions after reestablishing the connection
Maybe even allow deeper customization by pulling the reconnection logic into its own class ClientReconnectHandler, which would implement a simple strategy pattern to allow interchangeable reconnection mechanisms, providing a ExponentialBackoffReconnectHandler by default. The parameter could then have the signature of auto_reconnect: bool | ClientReconnectHandler = False, applying a default handler with default values when simply set to True.

I'd love to hear what you guys think about this and how you would approach this. Maybe someone has already implemented a similiar reconnect mechanism and would like to share their thoughts, I'd greatly appreciate that.

Thanks a lot!

The text was updated successfully, but these errors were encountered:

cerlestes · 2024-06-27T17:09:35Z

A small update after a week of trying to work on this:

I've implemented the solution I've stated above in an adapter class of our application code. It worked absolutely fine when I tried it on my Windows machine by cutting the SSH forwarding to the PLC network and reconnecting it; the client would pick up on this and restablish connection just fine. But when we tried running it from a Linux host, where we'd sporadically blackhole the PLC IP address via nftables to simulate packet loss, the application would sometimes end up hanging ad infinitum when trying to connect the same Client instance again after it had lost its connection. I couldn't find the solution to this problem; I don't want to point fingers at somebody else's code, but it seemed like the bug is either somewhere in the UaClient/UASocketProtocol or even deeper within Python. The UaClient calls asyncio.wait_for() when connecting, but this timeout never fires; the function simply hangs forever. Adding another asyncio.wait_for() around it in our application code didn't work either, which sounds to me like the whole thread is getting stuck somewhere deep down the rabbit hole.

But I've found an alternative solution that isn't a downgrade and works perfectly fine: simply recreate the Client instance when reconnecting. So the workflow is as follows:

Wait for connection loss using await client._monitor_server_task
Try to disconnect the existing instance to clean it up, this will always raise Exceptions as expected
Create a new Client instance
Connect the new instance
Rewrire the subscriptions and everything else to work with the new instance

I think the same workflow could be adapted to the Client class itself, so that it would simply replace its internal UaClient instance when reconnecting.

We'll put this method into 24/7 test for a week or two now and if that succeeds, we'll roll it out to two applications on the field. I'll report back if further adjustments were required to achieve a robust reconnect mechanism.

oroulet · 2024-07-01T15:09:37Z

@cerlestes thanks for a good analysis and also some interesting propositions.
I see we could recreate the ua_client object in a task at connection loose, but then we would also need to remember all the subscriptions that client has (or even worse, check with server if the subscription still exists and try to reconnect them), sounds quite hard to make it work reliably in all cases

btw we had a similar issue at work and I just implemented that one: #1670 maybe that also helps you

oroulet · 2024-07-01T15:17:41Z

but your last proposition is the async way of doing the same as a callback. We just need to make public that monitor_server_task() and it does the same

cerlestes · 2024-07-01T17:02:53Z

@oroulet Nice commit, we would have needed that a few weeks ago 🤣 It's basically the same solution that I arrived at, only that my solution was external to the library. So +1 to that feature, looking forward to it releasing soon 👍 Maybe make it an async function though and await it?

Re subscriptions: I don't think it's a lot of work to get them working again; maybe I don't know enough about how they work to see the effects it would have on more complex cases though. In my adapter class, I simply try to delete the subscriptions by their ID once I reconnect, and then recreate them. This works absolutely fine for our use case. Maybe there are use-cases where it'd be preferred to keep the subscriptions if they still exist on the server, so that solution might not be the best way to go ahead, but it'd work. I'd implement the reconnect mechanism on the Client, not on UaClient, so the library would simply need to remember the values that subscriptions were created with in the Client and then it's as easy as calling create_subscription() again with those values after reconnecting.

PS: the reconnect mechanism I've described above has been running all weekend now, experiencing a controlled connection loss about every 3 minutes and reconnecting successfully afterwards for almost 1500 times already. So it seems to be working fine and stable.

oroulet · 2024-07-01T17:44:09Z

I made it async now

marcvanR · 2024-08-15T13:21:15Z

Hi,

First of all, thanks a lot for this library, it helps us a lot!

This discussions interested me, but now I'm not sure how to implement this properly, is there some documentation somewhere that shows how to implement this??

I'm currently polling the client.check_connection() which raises an error which I catch to recreate the client, but what you describe sounds a bit less blunt.

Also I use the same strategy wrt subscriptions as @cerlestes describes, but now I am curious what the pitfalls are with that approach.

Thanks again!

AndreasHeine · 2024-08-15T13:52:05Z

@oroulet @cerlestes @marcvanR

just for clarification:

https://reference.opcfoundation.org/Core/Part4/v104/docs/6.7

currently the problem is that the transport, session and subscription are to tightly coupled...

opcua-asyncio/asyncua/common/session_interface.py

Lines 218 to 225 in 1fe61df

    
               async def transfer_subscriptions(self, params: ua.TransferSubscriptionsParameters) -> List[ua.TransferResult]: 
        
                   ''' 
        
                   https://reference.opcfoundation.org/Core/Part4/v104/5.13.7/ 
        
                   This Service is used to transfer a Subscription and its MonitoredItems from one Session to another. 
        
                   For example, a Client may need to reopen a Session and then transfer its Subscriptions to that Session. 
        
                   It may also be used by one Client to take over a Subscription from another Client by transferring the Subscription to its Session. 
        
                   '''

Transport Reconnect must be independent of Session and Subscription!
One Client can have multiple Sessions and one Subscription is kind bound to One Session but can be transferred!

Transfer-Subscription must be implemented!

opcua-asyncio/asyncua/client/ua_client.py

Lines 810 to 813 in 1fe61df

    
           async def transfer_subscriptions(self, params: ua.TransferSubscriptionsParameters) -> List[ua.TransferResult]: 
        
               # Subscriptions aren't bound to a Session and can be transferred! 
        
               # https://reference.opcfoundation.org/Core/Part4/v104/5.13.7/ 
        
               raise NotImplementedError

what i saw in the source code of ua_client is the crux and it would need to be refactored to enable lousily coupling:

session class with all Services (Read/Write/Browse see "AbstractSession" in session_interface.py) which need to get the client injected and not the other way around so a client can have multiple sessions...

thats a lot of work and i have almost 0 time... i am sorry i started with the AbstractSession stuff but couldn't bring it much further!

edit:

and of course it would break almost every existing client -> 2.0.0

edit2:

the Node-Class need to be removed or reworked as well... almost forgot to mention that

oroulet · 2024-08-17T17:56:43Z

honestly I do not even have the time to try to understand. At work we use the check_connection() and reconnect when there is an issue. It works perfectly for us, and I do not feel the need for anything more.

AndreasHeine · 2024-08-17T18:29:15Z

in simple words ua_client needs to be split into a Session-Class (implements session_interface) and a Client-Class (implements session less stuff e.g. Connect / Reconnect of the Transport and SecureChannel related stuff.

so if you interact with a server e.g. Read you use the Session-Class-Instance which got the Client-Class-Instance injected...

currently its structurly not possible to use the client according how spec describes its features... currently you will lose monitored item data thru not reuse the session and not transfer subscription to the new session 🤷

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to properly handle connection issues and reconnecting? (request for comments) #1660

How to properly handle connection issues and reconnecting? (request for comments) #1660

cerlestes commented Jun 18, 2024 •

edited

Loading

cerlestes commented Jun 27, 2024 •

edited

Loading

oroulet commented Jul 1, 2024

oroulet commented Jul 1, 2024 •

edited

Loading

cerlestes commented Jul 1, 2024

oroulet commented Jul 1, 2024

marcvanR commented Aug 15, 2024

AndreasHeine commented Aug 15, 2024 •

edited

Loading

oroulet commented Aug 17, 2024

AndreasHeine commented Aug 17, 2024

How to properly handle connection issues and reconnecting? (request for comments) #1660

How to properly handle connection issues and reconnecting? (request for comments) #1660

Comments

cerlestes commented Jun 18, 2024 • edited Loading

cerlestes commented Jun 27, 2024 • edited Loading

oroulet commented Jul 1, 2024

oroulet commented Jul 1, 2024 • edited Loading

cerlestes commented Jul 1, 2024

oroulet commented Jul 1, 2024

marcvanR commented Aug 15, 2024

AndreasHeine commented Aug 15, 2024 • edited Loading

oroulet commented Aug 17, 2024

AndreasHeine commented Aug 17, 2024

cerlestes commented Jun 18, 2024 •

edited

Loading

cerlestes commented Jun 27, 2024 •

edited

Loading

oroulet commented Jul 1, 2024 •

edited

Loading

AndreasHeine commented Aug 15, 2024 •

edited

Loading