-
Notifications
You must be signed in to change notification settings - Fork 366
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
How to properly handle connection issues and reconnecting? (request for comments) #1660
Comments
A small update after a week of trying to work on this: I've implemented the solution I've stated above in an adapter class of our application code. It worked absolutely fine when I tried it on my Windows machine by cutting the SSH forwarding to the PLC network and reconnecting it; the client would pick up on this and restablish connection just fine. But when we tried running it from a Linux host, where we'd sporadically blackhole the PLC IP address via nftables to simulate packet loss, the application would sometimes end up hanging ad infinitum when trying to connect the same Client instance again after it had lost its connection. I couldn't find the solution to this problem; I don't want to point fingers at somebody else's code, but it seemed like the bug is either somewhere in the But I've found an alternative solution that isn't a downgrade and works perfectly fine: simply recreate the Client instance when reconnecting. So the workflow is as follows:
I think the same workflow could be adapted to the We'll put this method into 24/7 test for a week or two now and if that succeeds, we'll roll it out to two applications on the field. I'll report back if further adjustments were required to achieve a robust reconnect mechanism. |
@cerlestes thanks for a good analysis and also some interesting propositions. btw we had a similar issue at work and I just implemented that one: #1670 maybe that also helps you |
but your last proposition is the async way of doing the same as a callback. We just need to make public that monitor_server_task() and it does the same |
@oroulet Nice commit, we would have needed that a few weeks ago 🤣 It's basically the same solution that I arrived at, only that my solution was external to the library. So +1 to that feature, looking forward to it releasing soon 👍 Maybe make it an async function though and await it? Re subscriptions: I don't think it's a lot of work to get them working again; maybe I don't know enough about how they work to see the effects it would have on more complex cases though. In my adapter class, I simply try to delete the subscriptions by their ID once I reconnect, and then recreate them. This works absolutely fine for our use case. Maybe there are use-cases where it'd be preferred to keep the subscriptions if they still exist on the server, so that solution might not be the best way to go ahead, but it'd work. I'd implement the reconnect mechanism on the PS: the reconnect mechanism I've described above has been running all weekend now, experiencing a controlled connection loss about every 3 minutes and reconnecting successfully afterwards for almost 1500 times already. So it seems to be working fine and stable. |
I made it async now |
Hi, First of all, thanks a lot for this library, it helps us a lot! This discussions interested me, but now I'm not sure how to implement this properly, is there some documentation somewhere that shows how to implement this?? I'm currently polling the Also I use the same strategy wrt subscriptions as @cerlestes describes, but now I am curious what the pitfalls are with that approach. Thanks again! |
just for clarification: https://reference.opcfoundation.org/Core/Part4/v104/docs/6.7 currently the problem is that the transport, session and subscription are to tightly coupled... opcua-asyncio/asyncua/common/session_interface.py Lines 218 to 225 in 1fe61df
Transport Reconnect must be independent of Session and Subscription! Transfer-Subscription must be implemented! opcua-asyncio/asyncua/client/ua_client.py Lines 810 to 813 in 1fe61df
what i saw in the source code of ua_client is the crux and it would need to be refactored to enable lousily coupling: session class with all Services (Read/Write/Browse see "AbstractSession" in session_interface.py) which need to get the client injected and not the other way around so a client can have multiple sessions... thats a lot of work and i have almost 0 time... i am sorry i started with the AbstractSession stuff but couldn't bring it much further! edit: and of course it would break almost every existing client -> 2.0.0 edit2: the Node-Class need to be removed or reworked as well... almost forgot to mention that |
honestly I do not even have the time to try to understand. At work we use the check_connection() and reconnect when there is an issue. It works perfectly for us, and I do not feel the need for anything more. |
in simple words ua_client needs to be split into a Session-Class (implements session_interface) and a Client-Class (implements session less stuff e.g. Connect / Reconnect of the Transport and SecureChannel related stuff. so if you interact with a server e.g. Read you use the Session-Class-Instance which got the Client-Class-Instance injected... currently its structurly not possible to use the client according how spec describes its features... currently you will lose monitored item data thru not reuse the session and not transfer subscription to the new session 🤷 |
Hello everyone!
First off, thanks a lot to every contributor of this repository; it's a great library that has helped us out tremendously in multiple projects. Secondly, I hope it's okay that I'm using an issue to open a discussion. I'd like to gather some insights from people who are more knowledgable than I am about OPC-UA and this library, hoping that I'll be able to contribute a well-rounded feature out of this discussion in the future.
The topic is handling connection issues and reconnecting properly. Right now, whenever our application loses the connection to the OPC-UA server, for example because the PLC config changed and it's reloading the server, we're reconnecting the client from our application once we try to interact with a node and it fails (we catch the UaError and simply try connecting up to a few times). This was fine until subscriptions came into play. With subscriptions, I'm really having a hard time finding the proper way to detect issues, reconnect and restart the subscriptions.
I've found the
Client._monitor_server_loop()
method, which is started as a task intoClient._monitor_server_task
. Once the connection dies, it'll inform the subscriptions of theBadShutdown
. This seems to be about the only way to be informed about a connection issue other than emulating that behaviour externally to the client, polling and catching errors when they are raised. Another method of detecting connection issues is theClient.check_connection()
method. But again, this method must be polled from the application external to the client.I think ideally the client itself should provide a mechanism to allow applications to react to connection issues and states in general, i.e. callback when the client lost the connection. On top of that, it should then implement an optional reconnect mechanism that, when enabled, automatically attempts to reconnect upon losing connection, including restoring any subscriptions.
My current proposal would be the following:
asyncio.Event
instancesClient.connected
,Client.disconnected
,Client.failed
. These events areset()
when the respective connection state is reached andclear()
-ed when the respectice state is left. This would allow application code to simplyawait client.connected.wait()
before each interaction with the client. It would also allow to run error handler tasks once the connection fails withawait client.failed.wait()
.Client.add_connected_callback()
,Client.add_disconnected_callback()
,Client.add_failed_callback()
to register callback functions which are called once the respective state is reached.Client()
which could be as simple asauto_reconnect: bool = False
.auto_reconnect
is enabled, an additional taskClient._auto_reconnect_task
will be created by the client upon connecting, which continously callsClient.check_connection()
similiar to how theClient._monitor_server_loop()
works, and in case of an error automatically tries connecting the client again.AutoReconnectSettings
. The following settings come to mind:ClientReconnectHandler
, which would implement a simple strategy pattern to allow interchangeable reconnection mechanisms, providing aExponentialBackoffReconnectHandler
by default. The parameter could then have the signature ofauto_reconnect: bool | ClientReconnectHandler = False
, applying a default handler with default values when simply set toTrue
.I'd love to hear what you guys think about this and how you would approach this. Maybe someone has already implemented a similiar reconnect mechanism and would like to share their thoughts, I'd greatly appreciate that.
Thanks a lot!
The text was updated successfully, but these errors were encountered: