[mesh][v4.4] While in fixed root, root stops accepting children until reboot (IDFGH-11699) #12806
Comments
i added this piece of code:
with these changes, i was able to trigger the issue by leaving 4 devices on for a couple of days in fixed root.
which means the devices are disconnecting without ever reaching the connected state -> should never happen! on the children i see
|
We have faced the same issue on ESP-IDF v5.1. Did you resolve it? At that time we found two causes: |
@dspworks-swaroop the code i added above allows you to recognize the issue and do a workaround. |
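The detection idea mentioned here (children disconnecting repeatedly without ever reaching the connected state) can be sketched roughly as follows. This is not the code from the comment above, which is not reproduced in this thread; it is a minimal, self-contained illustration. The names `stuck_detector_t`, `stuck_detector_feed` and the limit of 10 are hypothetical. In a real application the two event values would presumably be fed from the `MESH_EVENT_CHILD_CONNECTED` and `MESH_EVENT_CHILD_DISCONNECTED` handlers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical threshold: number of child disconnects in a row, with no
 * successful connect in between, tolerated before the root is flagged. */
#define STUCK_DISCONNECT_LIMIT 10

/* Simplified stand-ins for the mesh child events (assumed mapping). */
typedef enum {
    CHILD_EVT_CONNECTED,
    CHILD_EVT_DISCONNECTED,
} child_evt_t;

typedef struct {
    uint32_t disconnects_in_a_row;
} stuck_detector_t;

static void stuck_detector_init(stuck_detector_t *d)
{
    d->disconnects_in_a_row = 0;
}

/* Feed one child event. Returns true when the caller should run its
 * recovery action (for example restarting the mesh or the root). */
static bool stuck_detector_feed(stuck_detector_t *d, child_evt_t evt)
{
    if (evt == CHILD_EVT_CONNECTED) {
        /* A child actually reached the connected state: root is healthy. */
        d->disconnects_in_a_row = 0;
        return false;
    }

    d->disconnects_in_a_row++;
    if (d->disconnects_in_a_row >= STUCK_DISCONNECT_LIMIT) {
        /* Reset so recovery is requested once per burst, not on every event. */
        d->disconnects_in_a_row = 0;
        return true;
    }
    return false;
}
```

The counter only looks at the ordering of events, so it does not depend on timing; the right limit is a tuning question and depends on how many children normally retry at once.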
For the root node, when a child connects, only when the 4-way handshake is completed, the For the child node, the log
indicates that the connection was interrupted because the root node switched to non-fixed root. |
mmm, but then why does it happen only after days of running in fixed root? it feels like some invalid condition is reached if you "try long enough". Also the fact that you just need to restart the ROOT to recover the whole system is strange.
Can we print the result of the handshake to see if that is the issue? |
I think the |
but if the issue was the child, it would not fix itself by rebooting the root. I'll try the additional prints. |
@zhangyanjiaoesp
EDIT: EDIT2:
|
We currently can't retrigger the issue because we experience continuous
Investigating, this is triggered by WDT_MWDT1,
What value do you recommend for WIFI interrupts? Apparently 300ms is not enough. |
https://docs.espressif.com/projects/esp-idf/en/latest/esp32/api-reference/system/wdts.html?highlight=watchdog |
we have our tick at 1ms, so the issue is that the wifi interrupts sometimes take 300ms to run. |
with 500ms the system is not resetting anymore, but we will continue to investigate, 300ms for an interrupt means something is wrong. |
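For reference, a sketch of the corresponding project configuration, assuming the 300ms and 500ms values discussed here refer to the interrupt watchdog timeout (the `CONFIG_ESP_INT_WDT_TIMEOUT_MS` Kconfig option). The exact option the project changed is not shown in this thread, so treat this as an assumption:

```
# sdkconfig.defaults fragment (assumed: interrupt watchdog is the one being tuned)
CONFIG_ESP_INT_WDT=y
# Raised from 300 to 500 ms as tried in this thread
CONFIG_ESP_INT_WDT_TIMEOUT_MS=500
```

Raising the timeout only hides long interrupt latency; it does not explain why an interrupt path would take that long.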
i think the increase in interrupt latency is caused by enabling the LWIP debugging options. Stopping the children from spamming connections somehow helped with this, but we are still testing |
Adding the following code allowed us to recover the functionality in the field. It still feels like a workaround tho. |
@KonssnoK , yes, we know that restarting the root node is a workaround. Can you provide the log with supplicant debug log enabled? Or can you provide the sniffer packets when the issue happens? |
@zhangyanjiaoesp sadly most of these issues occur in the field, where the device log settings cannot be changed. I'm currently seeing this problem in an installation with 17 devices where periodically we receive disconnections from the Wifi. These are some logs from the devices:
apparently the workaround of resetting the mesh just makes the mesh build-up more unstable (which makes sense), but on the other hand it avoids blocking the device completely. |
@zhangyanjiaoesp I am trying to make an example that would work with devkits, so you could reproduce locally. My problem is that of course there is no modem available. Otherwise, i need another way to create a secondary connection. Another option i'm thinking of is an Ethernet bridge, but that would require more work. Meanwhile I'll use our device+devkits to try to trigger the issue |
while working on this issue i'm able to trigger the same
as seen in #13212 240625dev1_8.txt the devices are working in fixed root and connecting to a 4th device that has the modem attached |
with current library 0627 everything gets stuck in new ways, we'll have to continue once we have a final fix for #13212 |
@zhangyanjiaoesp question: while in fixed root i see a lot of (I'm using library 0702 for now)
|
This log is expected, you don't need to care about it. If necessary, I can provide you a wifi lib with only fixes, so that the debug log will not affect your testing. |
ok sure. |
hello @zhangyanjiaoesp , According to the previously reported changes i see these logs a lot
Do you have any idea why a device should continuously disconnect from the parent without even trying to connect? i will try to set up the devices again and see if it's possible to replicate |
Also could you tell me when We have some devices failing on setting the parent with such value
And how is it possible that i see devices with layer cap 1 and current layer 5 ? Can you explain again what
indicate? Our code for selection of layers is available here and based on your example: |
small update: |
@zhangyanjiaoesp I do not understand why, but each time that a node connects to this fixed root node, it does the following:
I think i can force the children to avoid the WIFI reconnection when receiving the fixed root event after a connection, but i'm not sure this will also cover the transition back to dynamic root. also,
|
@KonssnoK I just returned from the National Day holiday, and I will review your question this week. |
thanks @zhangyanjiaoesp I think now this specific issue should be solved because i force the devices without a modem to stay in fixed root (not look for wifi) until they receive from the network a dynamic root event. But please, yes, there are still a lot of questions in my previous comments :) |
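The policy described in this comment (stay in fixed root and do not look for wifi until the network announces dynamic root) could look roughly like the sketch below. It is a self-contained illustration, not the project's code: `fixed_root_hold_t`, the `NET_EVT_*` values and the function names are hypothetical. It also assumes that a later fixed-root announcement puts the node back into the hold state. On ESP-IDF the mode change would presumably arrive through `MESH_EVENT_ROOT_FIXED` and its `is_fixed` flag.

```c
#include <stdbool.h>

/* Simplified stand-ins for the events a non-modem node reacts to
 * (hypothetical names, not ESP-IDF identifiers). */
typedef enum {
    NET_EVT_ROOT_FIXED,          /* network announced fixed-root mode   */
    NET_EVT_ROOT_DYNAMIC,        /* network announced dynamic-root mode */
    NET_EVT_PARENT_CONNECTED,
    NET_EVT_PARENT_DISCONNECTED,
} net_evt_t;

typedef struct {
    bool fixed_root;             /* true: do not look for the router */
} fixed_root_hold_t;

/* Nodes without a modem start in fixed root, as described in the comment. */
static void fixed_root_hold_init(fixed_root_hold_t *h)
{
    h->fixed_root = true;
}

/* True when the node is allowed to scan for / connect to the router. */
static bool fixed_root_hold_may_scan_router(const fixed_root_hold_t *h)
{
    return !h->fixed_root;
}

/* Apply one event and report whether router scanning is allowed afterwards. */
static bool fixed_root_hold_on_event(fixed_root_hold_t *h, net_evt_t evt)
{
    switch (evt) {
    case NET_EVT_ROOT_FIXED:
        h->fixed_root = true;    /* assumption: re-enter the hold state */
        break;
    case NET_EVT_ROOT_DYNAMIC:
        h->fixed_root = false;   /* only this event releases the hold */
        break;
    default:
        /* Parent connect/disconnect never changes the mode, so a lost
         * parent alone does not trigger a wifi reconnection attempt. */
        break;
    }
    return fixed_root_hold_may_scan_router(h);
}
```

The point of the latch is that connection churn on its own cannot move a node out of fixed root; only an explicit announcement from the network does.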
I couldn't find the instance where the return value is 0x1. Could you describe the circumstances under which you returned such a value?
|
|
Hi @zhangyanjiaoesp , I am a bit confused now:
In our logs we see devices reporting in the scan result values of This is also the case where we get error 0x1 as result of calling
Maybe layer_cap and layer2_cap are:
|
sorry for the incorrect explanation regarding
the explanation for
The comments in the |
@zhangyanjiaoesp i can understand the first comment and it seems in line with what i reported above. Could you check? Also, |
@KonssnoK |
@KonssnoK |
I don't have a direct way to reproduce it because it's coming only from field devices. So, if you want to replicate, i think you just have to use our code in fixed root and decrease the number of layers.
|
@KonssnoK I will try to test, should I use the branch KonssnoK@f42e22c ? |
yes that is in line with our current implementation, but i would take |
@KonssnoK Have you turned on the WiFi log? Normally, when an error is returned, the corresponding log will also be printed. |
@zhangyanjiaoesp as far as i know there is currently no way to redirect wifi/mesh logs to a function that is not ESP_LOGX. Meanwhile we started migrating to 5.3 |
@KonssnoK I can't reproduce the issue locally, and since you can't see the log, adding debug logs for you to test would not be helpful. I will review the code again. |
v4.4.6-176-g84a3442f5d
Hello @zhangyanjiaoesp ,
we are seeing a strange issue on installations where the fixed root is being used (which means no WIFI and connection to the LTE).
After some time (days), the ROOT node (which we force) stops accepting any connection from children.
As soon as the ROOT is rebooted, the connection from the children is restored.
What we see in the logs of the ROOT is the following:
As you can see there are multiple disconnections of children without any connection in the middle.
What is missing is the reconnection of the children, which instead simply display
My setup is: