Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[mesh][v4.4] While in fixed root, root stops accepting children until reboot (IDFGH-11699) #12806

Open
3 tasks done
KonssnoK opened this issue Dec 15, 2023 · 42 comments
Open
3 tasks done
Assignees
Labels
Status: In Progress Work is in progress

Comments

@KonssnoK
Copy link
Contributor

Answers checklist.

  • I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
  • I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
  • I have searched the issue tracker for a similar issue and not found a similar issue.

General issue report

v4.4.6-176-g84a3442f5d

Hello @zhangyanjiaoesp ,
we are seeing a strange issue on installations where the fixed root is being used (which means no WIFI and connection to the LTE).

After some time (days), the ROOT node (which we force), stops accepting any connection from children.
As soon as the ROOT is rebooted, the connection from the children is restored.

What we see in the logs of the ROOT is the following:

I (09:00:54.166) mesh_hand: Checking wifi connectivity (count 1927)
I (09:00:59.653) mesh_main: <MESH_EVENT_SCAN_DONE>number:54
I (09:01:03.296) network: Routing table update
I (57852802) wifi:new:<1,1>, old:<1,0>, ap:<1,1>, sta:<1,1>, prof:1
I (57852803) wifi:station: 7c:df:a1:e2:a3:04 join, AID=1, bgn, 40U
I (09:01:10.727) mesh_main: <MESH_EVENT_PS_CHILD_DUTY>cidx:0, 7CDFA1E2A304, duty:0
I (57852828) wifi:station: 7c:df:a1:e2:a3:04 leave, AID = 1, bss_flags is 658530, bss:0x3d8371a4
I (57852829) wifi:new:<1,0>, old:<1,1>, ap:<1,1>, sta:<1,1>, prof:1
I (09:01:10.751) mesh_main: <MESH_EVENT_CHILD_DISCONNECTED>aid:1, 7CDFA1E2A304
I (57861042) wifi:new:<1,1>, old:<1,0>, ap:<1,1>, sta:<1,1>, prof:1
I (57861043) wifi:station: 7c:df:a1:e2:7b:84 join, AID=1, bgn, 40U
I (09:01:18.967) mesh_main: <MESH_EVENT_PS_CHILD_DUTY>cidx:0, 7CDFA1E27B84, duty:0
I (57861067) wifi:station: 7c:df:a1:e2:7b:84 leave, AID = 1, bss_flags is 658530, bss:0x3d8371a4
I (57861069) wifi:new:<1,0>, old:<1,1>, ap:<1,1>, sta:<1,1>, prof:1
I (09:01:18.990) mesh_main: <MESH_EVENT_CHILD_DISCONNECTED>aid:1, 7CDFA1E27B84
I (09:01:23.778) processor: send status... (60000ms)
W (57865862) wifi:Haven't to connect to a suitable AP now!
I (09:01:24.166) mesh_hand: Checking wifi connectivity (count 1928)[LOCAL]
I (09:01:29.651) mesh_main: <MESH_EVENT_SCAN_DONE>number:53
I (09:01:33.296) network: Routing table update
I (09:01:54.166) mesh_hand: Checking wifi connectivity (count 1929)[LOCAL]
I (09:01:59.651) mesh_main: <MESH_EVENT_SCAN_DONE>number:59
I (09:02:03.296) network: Routing table update
I (09:02:23.778) processor: send status... (60000ms)
W (57925862) wifi:Haven't to connect to a suitable AP now!
I (09:02:24.166) mesh_hand: Checking wifi connectivity (count 1930)[LOCAL]
I (09:02:29.651) mesh_main: <MESH_EVENT_SCAN_DONE>number:57
I (57932917) wifi:new:<1,1>, old:<1,0>, ap:<1,1>, sta:<1,1>, prof:1
I (57932919) wifi:station: 7c:df:a1:e0:74:40 join, AID=1, bgn, 40U
I (09:02:30.841) mesh_main: <MESH_EVENT_PS_CHILD_DUTY>cidx:0, 7CDFA1E07440, duty:0
I (57932941) wifi:station: 7c:df:a1:e0:74:40 leave, AID = 1, bss_flags is 658530, bss:0x3d8371a4    
I (57932942) wifi:new:<1,0>, old:<1,1>, ap:<1,1>, sta:<1,1>, prof:1
I (09:02:30.866) mesh_main: <MESH_EVENT_CHILD_DISCONNECTED>aid:1, 7CDFA1E07440
I (09:02:33.296) network: Routing table update
I (09:02:54.166) mesh_hand: Checking wifi connectivity (count 1931)[LOCAL]
I (09:02:59.651) mesh_main: <MESH_EVENT_SCAN_DONE>number:55
I (09:03:03.296) network: Routing table update
I (57971079) wifi:new:<1,1>, old:<1,0>, ap:<1,1>, sta:<1,1>, prof:1
I (57971081) wifi:station: 7c:df:a1:e2:7b:84 join, AID=1, bgn, 40U
I (09:03:09.003) mesh_main: <MESH_EVENT_PS_CHILD_DUTY>cidx:0, 7CDFA1E27B84, duty:0
I (57971107) wifi:station: 7c:df:a1:e2:7b:84 leave, AID = 1, bss_flags is 658530, bss:0x3d8371a4
I (57971109) wifi:new:<1,0>, old:<1,1>, ap:<1,1>, sta:<1,1>, prof:1
I (09:03:09.031) mesh_main: <MESH_EVENT_CHILD_DISCONNECTED>aid:1, 7CDFA1E27B84
I (09:03:23.778) processor: send status... (60000ms)
W (57985862) wifi:Haven't to connect to a suitable AP now!
I (09:03:24.166) mesh_hand: Checking wifi connectivity (count 1932)[LOCAL]

As you can see there are multiple disconnections of children without any connection in the middle.

What is missing, is the reconnection of the children, which instead display simply

I (55277345) wifi:state: run -> init (0)
I (55277347) wifi:new:<1,0>, old:<1,1>, ap:<1,1>, sta:<1,1>, prof:1
W (16:28:15.742) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 8 WIFI_REASON_ASSOC_LEAVE

My setup is:

  • one device connected to LTE, acting as ROOT
  • 3 devices connecting to it, scanning every 30 seconds
@espressif-bot espressif-bot added the Status: Opened Issue is new label Dec 15, 2023
@github-actions github-actions bot changed the title [mesh][v4.4] While in fixed root, root stops accepting children until reboot [mesh][v4.4] While in fixed root, root stops accepting children until reboot (IDFGH-11699) Dec 15, 2023
@KonssnoK
Copy link
Contributor Author

i added this piece of code:

static void mesh_track_child_connect(mesh_event_child_connected_t* child_connected)
{
    // Keep track of connected children
    bool found = false;
    int8_t firstfreeslot = -1;
    for (int i = 0; i < CONFIG_MESH_AP_CONNECTIONS; ++i) {
        if (self->children[i].mac == ((child_connected->mac[4] << 8) | (child_connected->mac[5]))) {
            found = true;
            self->children[i].connection_count++;
            break;
        } else if (!self->children[i].mac && firstfreeslot < 0) {
            firstfreeslot = i;
        }
    }
    if (!found) {
        // Add
        if (firstfreeslot >= 0) {
            self->children[firstfreeslot].mac = ((child_connected->mac[4] << 8) | (child_connected->mac[5]));
            self->children[firstfreeslot].connection_count = 1;
        } else {
            LOGE(TAG, "Child connection with no free slots! "MACSTR_COMPACT, MAC2STR(child_connected->mac));
            for (int i = 0; i < CONFIG_MESH_AP_CONNECTIONS; ++i) {
                LOGW(TAG, "%04X %d", self->children[firstfreeslot].mac, self->children[firstfreeslot].connection_count);
            }
        }
    }
}

static void mesh_track_child_disconnect(mesh_event_child_disconnected_t* child_disconnected)
{
    // We keep track of connected children and check for strange behavior
    bool found = false;
    for (int i = 0; i < CONFIG_MESH_AP_CONNECTIONS; ++i) {
        if (self->children[i].mac == ((child_disconnected->mac[4] << 8) | (child_disconnected->mac[5]))) {
            found = true;
            self->children[i].connection_count--;
            if (!self->children[i].connection_count) {
                // Remove from list
                self->children[i].mac = 0;
            } else if (self->children[i].connection_count < -2) {
                // We saw this happening when the ROOT stops accepting children for no reason.
                // A reboot solved the issue -> workaround
                LOGE(TAG, "Negative children count! Reboot...");
                os_thread_sleep(100);
                esp_restart();
            }
            break;
        }
    }
    if (!found) {
        LOGW(TAG, "Disconnection from non registered child! "MACSTR_COMPACT, MAC2STR(child_disconnected->mac));
    }
}


        case MESH_EVENT_CHILD_CONNECTED: {
            mesh_event_child_connected_t* child_connected = (mesh_event_child_connected_t*)msg.data;
            ESP_LOGI(TAG, "<MESH_EVENT_CHILD_CONNECTED>aid:%d, "MACSTR_COMPACT"",
                child_connected->aid,
                MAC2STR(child_connected->mac));

            mesh_track_child_connect(child_connected);

            break;
        }
        case MESH_EVENT_CHILD_DISCONNECTED: {
            mesh_event_child_disconnected_t* child_disconnected = (mesh_event_child_disconnected_t*)msg.data;
            ESP_LOGI(TAG, "<MESH_EVENT_CHILD_DISCONNECTED>aid:%d, "MACSTR_COMPACT"",
                child_disconnected->aid,
                MAC2STR(child_disconnected->mac));

            mesh_track_child_disconnect(child_disconnected);
            break;
        }

with these changes, i was able to trigger the issue by leaving 4 devices on for a couple of days in fixed root.
The root reports:


I (19:36:58.687) mesh_main: <MESH_EVENT_PS_CHILD_DUTY>cidx:0, 7CDFA1E27B84, duty:0
I (111769170) wifi:station: 7c:df:a1:e2:7b:84 leave, AID = 1, bss_flags is 658530, bss:0x3d83b4b0
I (111769171) wifi:new:<1,0>, old:<1,1>, ap:<1,1>, sta:<1,1>, prof:1
I (19:36:58.715) mesh_main: <MESH_EVENT_CHILD_DISCONNECTED>aid:1, 7CDFA1E27B84
W (19:36:58.721) mesh_main: Disconnection from non registered child! 7CDFA1E27B84
I (19:37:07.252) ntp: SNTP notified change state to synching - new tval sec 1702838227 usec 735796
I (19:37:14.155) network: Routing table update
I (19:37:21.940) mesh_hand: Checking wifi connectivity (count 3725)[LOCAL]
I (19:37:27.512) mesh_main: <MESH_EVENT_SCAN_DONE>number:52
I (19:37:30.405) processor: send status... (60000ms)
W (111800513) wifi:Haven't to connect to a suitable AP now!
I (19:37:44.532) network: Routing table update
I (111818067) wifi:new:<1,1>, old:<1,0>, ap:<1,1>, sta:<1,1>, prof:1
I (111818069) wifi:station: 7c:df:a1:e0:74:40 join, AID=1, bgn, 40U
I (19:37:48.096) mesh_main: <MESH_EVENT_PS_CHILD_DUTY>cidx:0, 7CDFA1E07440, duty:0
I (111818091) wifi:station: 7c:df:a1:e0:74:40 leave, AID = 1, bss_flags is 658530, bss:0x3d83add8   
I (111818092) wifi:new:<1,0>, old:<1,1>, ap:<1,1>, sta:<1,1>, prof:1
I (19:37:48.120) mesh_main: <MESH_EVENT_CHILD_DISCONNECTED>aid:1, 7CDFA1E07440
W (19:37:48.125) mesh_main: Disconnection from non registered child! 7CDFA1E07440
I (19:37:52.197) mesh_hand: Checking wifi connectivity (count 3726)[LOCAL]
I (19:37:57.685) mesh_main: <MESH_EVENT_SCAN_DONE>number:41
I (19:38:14.532) network: Routing table update
I (19:38:22.197) mesh_hand: Checking wifi connectivity (count 3727)[LOCAL]
I (19:38:27.685) mesh_main: <MESH_EVENT_SCAN_DONE>number:51
I (111859501) wifi:new:<1,1>, old:<1,0>, ap:<1,1>, sta:<1,1>, prof:1
I (111859504) wifi:station: 7c:df:a1:e2:a3:04 join, AID=1, bgn, 40U
I (19:38:29.536) mesh_main: <MESH_EVENT_PS_CHILD_DUTY>cidx:0, 7CDFA1E2A304, duty:0
I (111859527) wifi:station: 7c:df:a1:e2:a3:04 leave, AID = 1, bss_flags is 658530, bss:0x3d83add8   
I (111859528) wifi:new:<1,0>, old:<1,1>, ap:<1,1>, sta:<1,1>, prof:1
I (19:38:29.556) mesh_main: <MESH_EVENT_CHILD_DISCONNECTED>aid:1, 7CDFA1E2A304
W (19:38:29.561) mesh_main: Disconnection from non registered child! 7CDFA1E2A304
I (19:38:30.534) processor: send status... (60000ms)
W (111860513) wifi:Haven't to connect to a suitable AP now!
I (19:38:44.532) network: Routing table update
I (111879272) wifi:new:<1,1>, old:<1,0>, ap:<1,1>, sta:<1,1>, prof:1
I (111879274) wifi:station: 7c:df:a1:e2:7b:84 join, AID=1, bgn, 40U
I (19:38:49.301) mesh_main: <MESH_EVENT_PS_CHILD_DUTY>cidx:0, 7CDFA1E27B84, duty:0
I (111879305) wifi:station: 7c:df:a1:e2:7b:84 leave, AID = 1, bss_flags is 658530, bss:0x3d83add8
I (111879306) wifi:new:<1,0>, old:<1,1>, ap:<1,1>, sta:<1,1>, prof:1
I (19:38:49.333) mesh_main: <MESH_EVENT_CHILD_DISCONNECTED>aid:1, 7CDFA1E27B84
W (19:38:49.339) mesh_main: Disconnection from non registered child! 7CDFA1E27B84

which means the devices are disconnecting without ever reaching the connected state -> should never happen!

on the chidren i see

I (20:14:16.087) mesh_hand: Checking wifi connectivity (count 0)[LOCAL]
I (20:14:21.569) mesh_main: <MESH_EVENT_SCAN_DONE>number:10
W (20:14:21.570) mesh_hand: <MESH>7CDFA1FFA515, layer:1/5, assoc:0/1, 0, ch:1
W (20:14:21.573) mesh_hand: [7] rssi:-66, parent_rssi:-120 root_rssi:-120 Type 1
W (20:14:21.580) mesh_hand: Parent found[1][0] L1 -66dB
W (20:14:21.587) mesh_hand: scan_done: found 1, fixed_root 1, has_parent 0
W (20:14:21.594) mesh_hand: <NEWPARENT>7CDFA1FFA515, layer:1/5, assoc:0/1, 0, channel:1, rssi:-66
I (20:14:25.762) processor: send status... (60000ms)
W (110349281) wifi:Haven't to connect to a suitable AP now!
I (110351023) wifi:new:<1,1>, old:<1,1>, ap:<1,1>, sta:<1,1>, prof:1
I (110351024) wifi:state: init -> auth (b0)
I (110351033) wifi:state: auth -> assoc (0)
I (110351052) wifi:state: assoc -> run (10)
W (20:14:27.537) mesh_main: <MESH_EVENT_ROOT_FIXED>not fixed
W (20:14:27.539) mesh_hand: Triggering DYNAMIC MESH handover
I (110351060) wifi:state: run -> init (0)
I (110351063) wifi:new:<1,0>, old:<1,1>, ap:<1,1>, sta:<1,1>, prof:1
W (20:14:27.559) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 8 WIFI_REASON_ASSOC_LEAVE
W (110351078) wifi:scan number is 0

@andy-danieal
Copy link

We have faced same issue on ESP IDF v5.1. Did you resolve that issue?

That time faced two reason:-
mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 5 - WIFI_REASON_ASSOC_TOOMANY
mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 10 - WIFI_REASON_NO_AP_FOUND

@KonssnoK
Copy link
Contributor Author

@dspworks-swaroop the code i added above allows you to recognize the issue and do a workaround.
I am waitinf for ESP to answer for a solution.

@zhangyanjiaoesp
Copy link
Collaborator

@KonssnoK

For the root node, when there is a child connects, only when the 4-way handshake completed, the MESH_EVENT_CHILD_CONNECTED can be generated. In your logs, the wifi:station: 7c:df:a1:e2:a3:04 join, AID=1, bgn, 40U log only indicates the association has finished, there is no MESH_EVENT_CHILD_CONNECTED, it means the 4-way handshake fails.

For the child node, the log

I (110351023) wifi:new:<1,1>, old:<1,1>, ap:<1,1>, sta:<1,1>, prof:1
I (110351024) wifi:state: init -> auth (b0)
I (110351033) wifi:state: auth -> assoc (0)
I (110351052) wifi:state: assoc -> run (10)
W (20:14:27.537) mesh_main: <MESH_EVENT_ROOT_FIXED>not fixed
W (20:14:27.539) mesh_hand: Triggering DYNAMIC MESH handover
I (110351060) wifi:state: run -> init (0)
I (110351063) wifi:new:<1,0>, old:<1,1>, ap:<1,1>, sta:<1,1>, prof:1
W (20:14:27.559) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 8 WIFI_REASON_ASSOC_LEAVE

indicates that the connection was interrupted because the root node switched to a non-fixed.

@espressif-bot espressif-bot added Status: In Progress Work is in progress and removed Status: Opened Issue is new labels Dec 19, 2023
@KonssnoK
Copy link
Contributor Author

@KonssnoK

For the root node, when there is a child connects, only when the 4-way handshake completed, the MESH_EVENT_CHILD_CONNECTED can be generated. In your logs, the wifi:station: 7c:df:a1:e2:a3:04 join, AID=1, bgn, 40U log only indicates the association has finished, there is no MESH_EVENT_CHILD_CONNECTED, it means the 4-way handshake fails.

For the child node, the log

I (110351023) wifi:new:<1,1>, old:<1,1>, ap:<1,1>, sta:<1,1>, prof:1
I (110351024) wifi:state: init -> auth (b0)
I (110351033) wifi:state: auth -> assoc (0)
I (110351052) wifi:state: assoc -> run (10)
W (20:14:27.537) mesh_main: <MESH_EVENT_ROOT_FIXED>not fixed
W (20:14:27.539) mesh_hand: Triggering DYNAMIC MESH handover
I (110351060) wifi:state: run -> init (0)
I (110351063) wifi:new:<1,0>, old:<1,1>, ap:<1,1>, sta:<1,1>, prof:1
W (20:14:27.559) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 8 WIFI_REASON_ASSOC_LEAVE

indicates that the connection was interrupted because the root node switched to a non-fixed.

mmm, but then why does it happen only after days of running in fixed root?

it feels like some invalid condition is reached if you "try long enough".

Also the fact that you just need to restart the ROOT to recover the whole system is strange.

In your logs, the wifi:station: 7c:df:a1:e2:a3:04 join, AID=1, bgn, 40U log only indicates the association has finished, there is no MESH_EVENT_CHILD_CONNECTED, it means the 4-way handshake fails.

Can we print the result of the handshake to see if that is the issue?

@zhangyanjiaoesp
Copy link
Collaborator

In your logs, the wifi:station: 7c:df:a1:e2:a3:04 join, AID=1, bgn, 40U log only indicates the association has finished, there is no MESH_EVENT_CHILD_CONNECTED, it means the 4-way handshake fails.

Can we print the result of the handshake to see if that is the issue?

I think the MESH_EVENT_CHILD_CONNECTED event on the root node is caused by voluntarily leaving of the child node. When the child node has the I (110351060) wifi:state: run -> init (0) log, it will send disassocation frame to the parent, then the root node will have the disconnect event.

@zhangyanjiaoesp
Copy link
Collaborator

You can capture packets between the devices, if the 4-way handshake succeed, there will be 4 EAPOL frames.
MicrosoftTeams-image (3)

Or you can enable the supplicant debug log (Component config -> Wi-Fi -> Print debug message from WPA Supplicant) to see EAPOL frame interaction.

@KonssnoK
Copy link
Contributor Author

In your logs, the wifi:station: 7c:df:a1:e2:a3:04 join, AID=1, bgn, 40U log only indicates the association has finished, there is no MESH_EVENT_CHILD_CONNECTED, it means the 4-way handshake fails.

Can we print the result of the handshake to see if that is the issue?

I think the MESH_EVENT_CHILD_CONNECTED event on the root node is caused by voluntarily leaving of the child node. When the child node has the I (110351060) wifi:state: run -> init (0) log, it will send disassocation frame to the parent, then the root node will have the disconnect event.

but if the issue was the child, it would not fix itself by rebooting the root.
Maybe the logs about disconnection are related to the child leaving, but still it is the root causing the problem 🤔

I'll try the additional prints.

@KonssnoK
Copy link
Contributor Author

KonssnoK commented Dec 19, 2023

@zhangyanjiaoesp
sadly enabling
Print debug message from WPA Supplicant
CONFIG_WPA_DEBUG_PRINT=y
completely blocks the WIFI task.
I thought it was a simple "watchdog too short" issue, but instead removing the watchdog the core running wifi (1 in our case) is completely stuck.


I (2228) wifi:wifi firmware version: 4647edf
I (2228) wifi:wifi certification version: v7.0
I (2228) wifi:config NVS flash: enabled
I (2229) wifi:config nano formating: disabled
I (2233) wifi:Init data frame dynamic rx buffer num: 32
I (2238) wifi:Init static rx mgmt buffer num: 5
I (2242) wifi:Init management short buffer num: 32
I (2247) wifi:Init static tx buffer num: 16
I (2251) wifi:Init tx cache buffer num: 32
I (2254) wifi:Init static tx FG buffer num: 2
I (2258) wifi:Init static rx buffer size: 1600
I (2263) wifi:Init static rx buffer num: 10
I (2266) wifi:Init dynamic rx buffer num: 32
I (00:00:01.262) wifi_init: rx ba win: 6
I (00:00:01.266) wifi_init: tcpip mbox: 32
I (00:00:01.271) wifi_init: udp mbox: 6
I (00:00:01.275) wifi_init: tcp mbox: 6
I (2286) mesh: mesh is not inited
I (00:00:01.280) wifi_init: tcp tx win: 5744
I (00:00:01.288) wifi_init: tcp rx win: 5744
I (00:00:01.293) wifi_init: tcp mss: 1416
I (00:00:01.297) wifi_init: WiFi/LWIP prefer SPIRAM
I (00:00:01.303) wifi_init: WiFi IRAM OP enabled
I (00:00:01.308) wifi_init: WiFi RX IRAM OP enabled
I (2323) wifi:Set ps type: 0

I (00:00:01.318) phy_init: phy_version 620,ec7ec30,Sep  5 2023,13:49:13
I (2380) wifi:mode : sta (7c:df:a1:ff:a5:14)
I (2381) wifi:enable tsf

EDIT:
the watchdog issue seems to be related to another change we are going through related to interrupts. i will re-enable the logs.

EDIT2:
The app is now periodically rebooting on MWDT1... investigating...

I (90848) mesh: <MESH_NWK_LOOK_FOR_NETWORK>need_scan:0x3, need_scan_router:0x1, look_for_nwk_count:30     
I (92560) mesh: [FIND][ch:0]AP:0, otherID:0, MAP:0, idle:0, candidate:0, root:0[00:00:00:00:00:00]<scan router>
I (92561) mesh: [FIND:30]fail to find a network, channel:0, cfg<channel:0, router:KI, 00:00:00:00:00:00>  

I (92569) mesh: <MESH_NWK_LOOK_FOR_NETWORK>need_scan:0x3, need_scan_router:0x0, look_for_nwk_count:31
W (21) boot.esp32s3: PRO CPU has been reset by WDT. 8
W (22) boot.esp32s3: APP CPU has been reset by WDT. 8
I (451) cpu_start: Multicore app

@KonssnoK
Copy link
Contributor Author

We currently can't retrigger the issue because we experience continuous

W (21) boot.esp32s3: PRO CPU has been reset by WDT. 8
W (22) boot.esp32s3: APP CPU has been reset by WDT. 8

Investigating, this is triggered by WDT_MWDT1,
which is used by

CONFIG_ESP_INT_WDT=y
CONFIG_ESP_INT_WDT_TIMEOUT_MS=300
CONFIG_ESP_INT_WDT_CHECK_CPU1=y

What value do you recommend for WIFI interrupts? Apparently 300ms is not enough.

@zhangyanjiaoesp
Copy link
Collaborator

What value do you recommend for WIFI interrupts? Apparently 300ms is not enough.

https://docs.espressif.com/projects/esp-idf/en/latest/esp32/api-reference/system/wdts.html?highlight=watchdog
The doc say The timeout should always at least twice longer than the period between FreeRTOS ticks (see [CONFIG_FREERTOS_HZ] . There is no clear recommendation value, you can set it to 600ms and have a try.

@KonssnoK
Copy link
Contributor Author

we have our tick at 1ms, so the issue is that the wifi interrupts sometimes takes 300ms to run.
it seems quite a big time for an interrupt.
we ll investigate more

@KonssnoK
Copy link
Contributor Author

with 500ms the system is not resetting anymore, but we will continue to investigate, 300ms for an interrupt means something is wrong.
I am currently trying to reproduce the issue with the added option CONFIG_WPA_DEBUG_PRINT=y

@KonssnoK
Copy link
Contributor Author

i think the increase in latency in the interrupts is given by enabling debugging options of LWIP.

Stopping the children from spamming connections somehow helped on this topic, but we are still testing

@KonssnoK
Copy link
Contributor Author

KonssnoK commented Mar 4, 2024

Adding the following code
#12806 (comment)

allowed us to recover the functionality on the field. It still feels like a workaround tho.

@zhangyanjiaoesp
Copy link
Collaborator

Adding the following code #12806 (comment)

allowed us to recover the functionality on the field. It still feels like a workaround tho.

@KonssnoK , yes, we know the restart root node is a workaround. Can you provide the log with supplicant debug log enabled? Or can you provide the sniffer packets when the issue happens?

@espressif-bot espressif-bot added Status: Opened Issue is new and removed Status: In Progress Work is in progress labels Mar 5, 2024
@espressif-bot espressif-bot added the Awaiting Response awaiting a response from the author label Mar 18, 2024
@espressif-bot espressif-bot added Status: Reviewing Issue is being reviewed and removed Status: Opened Issue is new labels Jun 19, 2024
@KonssnoK
Copy link
Contributor Author

@zhangyanjiaoesp sadly most of these issues occur on the field where devices logs cannot be changed.

I'm currently seeing this problem in a installation with 17 devices where periodically we receive disconnections from the Wifi.

These are some logs from the devices:

[4] Disconnection from non registered child! AC17540D4A7C
[3] Negative children count! Reset mesh...
[3] Mesh reset requested.
[15] STA: Send err 0x4004 ESP_ERR_MESH_NOT_START
[1] Disconnection from non registered child! AC17540D400C
[6] <MESH_EVENT_PARENT_DISCONNECTED>reason: 8 WIFI_REASON_ASSOC_LEAV
[3] MQTT_EVENT_ERROR 32792 78 0 1 0
[3] MQTT task stopped after 0ms
[9] <MESH_EVENT_PARENT_CONNECTED>layer:2-->2, parent:AC17540CC861<la
[9] <MESH_EVENT_TODS_REACHABLE>reachable:1
[3] GOT IP from mesh_sta
[1] STA: Send err 0x400A ESP_ERR_MESH_TIMEOUT
[11] Negative children count! Reset mesh...
[7] Mesh reset requested.
[9] Disconnection from non registered child! AC17540D4A7C
[15] STA: Send err 0x4004 ESP_ERR_MESH_NOT_START
[3] Disconnection from non registered child! AC17540D415C
[20] <MESH_EVENT_PARENT_DISCONNECTED>reason: 8 WIFI_REASON_ASSOC_LEAV
[3] MQTT_EVENT_ERROR 32792 78 0 1 0
[4] MQTT task stopped after 0ms
[25] <MESH_EVENT_PARENT_CONNECTED>layer:2-->2, parent:AC17540CC861<la
[24] <MESH_EVENT_TODS_REACHABLE>reachable:1
[4] GOT IP from mesh_sta
[1] MQTT_EVENT_ERROR 0 0 0 1 113
[6] STA: Send err 0x400A ESP_ERR_MESH_TIMEOUT
[1] QUEUEISR FAILED mesh_send_message
[1] <MESH_EVENT_PARENT_DISCONNECTED>reason: 15 WIFI_REASON_4WAY_HAND
[1] <MESH_EVENT_PARENT_CONNECTED>layer:0-->2, parent:AC17540CC861<la

apparently the workaround of resetting the mesh makes just the mesh build up more unstable (which makes sense) but on the other side, it avoids blocking the device completely.
i will try to replicate this issue again locally as soon as you are able to replicate the other issues in #13212

@KonssnoK
Copy link
Contributor Author

@zhangyanjiaoesp I am trying to make an example that would work with devkits, so you could reproduce locally.

My problem is that of course there is no modem available.
Do you have a board with a modem you can use?

Otherwise, i need another way to create a secondary connection
Is it possible to connect to a secondary wifi as STA while the AP is configured?

Another option i'm thinking of is a ethernet bridge, but that would require more work.

Meanwhile I'll use our device+devkits to try to trigger the issue

@KonssnoK
Copy link
Contributor Author

KonssnoK commented Jun 25, 2024

while working on this issue i'm able to trigger the same

W (505703) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:461, no_wnd_count:0, timeout_count:178

as seen in #13212
even if i'm using wifi_lib_s3_0625.zip (wifi firmware version: f736b07)

240625dev1_8.txt
240625dev2_8.txt
240625dev3_8.txt

the devices are working in fixed root and connecting to a 4th device that has the modem attached

@KonssnoK
Copy link
Contributor Author

while working on this issue i'm able to trigger the same

W (505703) mesh: [mesh_schedule.c,3131] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:461, no_wnd_count:0, timeout_count:178

as seen in #13212 even if i'm using wifi_lib_s3_0625.zip (wifi firmware version: f736b07)

240625dev1_8.txt 240625dev2_8.txt 240625dev3_8.txt

the devices are working in fixed root and connecting to a 4th device that has the modem attached

with current library 0627 everything gets stucks in new ways, we'll have to continue once we have a final fix for #13212

@KonssnoK
Copy link
Contributor Author

KonssnoK commented Jul 5, 2024

@zhangyanjiaoesp question:

while in fixed root i see a lot of
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
is this expected?

(I'm using library 0702 for now)

>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (514427) ping: 64 bytes from 8.8.8.8 icmp_seq=364 ttl=109 time=113 ms
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (515547) ping: 64 bytes from 8.8.8.8 icmp_seq=365 ttl=109 time=121 ms
I (515857) mesh_main: Tried to publish layer:2 IP:10.0.0.3
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
I (515857) mesh_mqtt: sent publish returned msg_id=12366
I (516017) mesh_mqtt: MQTT_EVENT_PUBLISHED, msg_id=12366
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (516667) ping: 64 bytes from 8.8.8.8 icmp_seq=366 ttl=109 time=87 ms
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (517777) ping: 64 bytes from 8.8.8.8 icmp_seq=367 ttl=109 time=111 ms
I (517867) mesh_main: Tried to publish layer:2 IP:10.0.0.3
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
I (517867) mesh_mqtt: sent publish returned msg_id=21058
I (518017) mesh_mqtt: MQTT_EVENT_PUBLISHED, msg_id=21058
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (518897) ping: 64 bytes from 8.8.8.8 icmp_seq=368 ttl=109 time=109 ms
I (519877) mesh_main: Tried to publish layer:2 IP:10.0.0.3
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
I (519877) mesh_mqtt: sent publish returned msg_id=44283
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (520057) ping: 64 bytes from 8.8.8.8 icmp_seq=369 ttl=109 time=136 ms
I (520097) mesh_mqtt: MQTT_EVENT_PUBLISHED, msg_id=44283
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
W (521237) ping: 64 bytes from 8.8.8.8 icmp_seq=370 ttl=109 time=171 ms
I (521877) mesh_main: Tried to publish layer:2 IP:10.0.0.

@zhangyanjiaoesp
Copy link
Collaborator

while in fixed root i see a lot of
>>>mesh_xon_process_disconnected,2638, (1, 1, 0)
is this expected?

This log is expected, you don't need to care about it. If necessary, I can provide you a wifi lib with only fixes, so that the debug log will not affect your testing.

@KonssnoK
Copy link
Contributor Author

KonssnoK commented Jul 5, 2024

ok sure.

@espressif-bot espressif-bot added Status: In Progress Work is in progress and removed Status: Reviewing Issue is being reviewed Awaiting Response awaiting a response from the author labels Jul 31, 2024
@KonssnoK
Copy link
Contributor Author

KonssnoK commented Oct 1, 2024

hello @zhangyanjiaoesp ,
i fear this bug has not been solved yet.
i'm seeing multiple devices with the latest version of the esp library ( 3167a3a612831ee2b66073cfc7877bac70cf3eb2 ) which are completely offline.
Only the root is able to remain online.

According to the previously reported changes i see these logs a lot

message: "Disconnection from non registered child! AC17540DF874"

message: "Negative children count! Reset mesh..."

Do you have any idea why a device should continuosuly disconnect from the parent without even trying to connect?

i will try to setup devices again and see if it's possible to replicate

@KonssnoK
Copy link
Contributor Author

KonssnoK commented Oct 1, 2024

Also could you tell me when esp_mesh_set_parent can return 1, which is an undocumented value ?

We have some devices failing on setting the parent with such value

[1] <MESH>AC1754175F65, layer:5/1, assoc:1/6, 0, ch:1
[1] <NEWPARENT>AC1754175F65, layer:5/1, assoc:1/6, 0, channel:1, rss
[1] Set parent err 0x1

And how is it possible that i see devices with layer cap 1 and current layer 5 ?

Can you explain again what

layer_cap
layer2_cap

indicate?

Our code for selection of layers is available here and based on your example:
b0ce83f

@KonssnoK
Copy link
Contributor Author

KonssnoK commented Oct 1, 2024

small update:
apparently an infinite loop where the devices never switch back to fixed root selection happens.
We rely on the devices to check for WIFIs by themselves, but in order to do that fixed root nodes have to disconnect and try to connect to the found wifi.
We currently do this on each device while it should be only the root device to do it, because the root device can use the STA freely since it's connected to PPP.
On top, if the wifi password of the found wifi is wrong, devices will fail to connect but try to keep the dynamic mesh instead of the fixed root one.

241001_dev1.txt
241001_dev3.txt
241001_dev2.txt

@KonssnoK
Copy link
Contributor Author

KonssnoK commented Oct 8, 2024

@zhangyanjiaoesp
so i managed to reproduce the issue by forcing one node to be in fixed root.

I do not understand why, but each time that a node connects to this fixed root node, it does the following:

  • the node finds a wifi with a wrong password, this is our input to move out of fixed root and try to reconnect to wifi
  • the device moves to auto-detection
  • the device finds the root
  • the root reports the fixed-root
  • the device fixes the root and starts scanning wifis again to find its peers to connect to in fixed root
  • loop restarts

I think i can force the children avoid the WIFI reconnection when receiving the fixed root event after a connection, but i'm not sure this will cover also the transitioning back to dynamic root.

also,
can you confirm that when connected to a parent, fixed root nodes can see only mesh devices when scanning for wifis?

I (6144302) mesh_hand: [22]KI, d2:0f:7d:8a:8b:29, channel:11, rssi:-38
W (6144312) mesh_hand: Triggering DYNAMIC MESH handover
I (6144322) mesh: [IO]enable self-organizing, search parent<reconnect>
I (6144322) mesh: mesh_schedule.c 2533 null args
I (6144332) mesh: [IO]enable self-organizing, search parent<adaptive>
W (6144332) mesh_hand: scan_done: found 0, fixed_root 0, has_parent 0, forced 0
I (6144332) mesh: <MESH_NWK_SCAN_REQ_PASSIVE>unexpected, unprocessed previous scan by parent selection
W (6144342) mesh_hand: Found IDLE devices and no good connection. Wait.
I (6144362) mesh_main: <MESH_EVENT_ROOT_FIXED>not fixed
W (6144362) mesh_hand: Triggering DYNAMIC MESH handover
I (6144372) mesh: <WIFI_EVENT_SCAN_DONE>status:fail, num:0, id:238
W (6144382) wifi:scan number is 0
I (6144382) mesh: [SCAN][ch:1]AP:0, other(ID:0, RD:0), MAP:0, idle:0, candidate:0, root:0, topMAP:0[c:2,i:2][00:00:00:00:00:00]<>
I (6144392) mesh: [FAIL][1]root:0, fail:1, normal:0, <pre>backoff:0

I (6145002) mesh: [SCAN][ch:1]AP:19, other(ID:0, RD:0), MAP:3, idle:2, candidate:1, root:1, topMAP:0[c:2,i:2][00:00:00:00:00:00]<>
I (6145002) mesh: 7383[selection]try rssi_threshold:-78, backoff times:0, max:5<-78,-82,-85>
I (6145012) mesh: [DONE]connect to parent:ESPM_E29ED4, channel:1, rssi:-50, 7c:df:a1:e2:9e:d5[layer:1, assoc:0], my_vote_num:0/voter_num:0, rc[00:00:00:00:00:00/-22/3]
I (6145202) wifi:new:<1,1>, old:<1,0>, ap:<1,1>, sta:<1,1>, prof:1
W (6145242) ping: From 8.8.8.8 icmp_seq=2385 timeout
I (6145772) wifi:state: init -> auth (b0)
I (6145782) wifi:state: auth -> assoc (0)
I (6145792) wifi:state: assoc -> run (10)
I (6145802) mesh: <MESH_NWK_MIE_CHANGE><><><><><><ROOT-FIXED><>
I (6145802) mesh_main: <MESH_EVENT_ROOT_FIXED>fixed
I (6145802) mesh: <MESH_NWK_MIE_CHANGE><><><><ROOT ADDR><><><>
I (6145802) mesh: <MESH_NWK_ROOT_ADDR>from assoc, layer:2, root_addr:7c:df:a1:e2:9e:d5, root_cap:2
I (6145812) wifi:connected with ESPM_E29ED4, aid = 2, channel 1, 40U, bssid = 7c:df:a1:e2:9e:d5
I (6145822) wifi:security: WPA2-PSK, phy: bgn, rssi: -50
W (6145802) mesh_hand: Triggering FIXED ROOT handover
I (6145862) wifi:pm start, type: 0

I (6145862) wifi:set rx beacon pti, rx_bcn_pti: 0, bcn_timeout: 0, mt_pti: 25000, mt_time: 10000
I (6145862) wifi:AP's beacon interval = 307200 us, DTIM period = 1
I (6145862) wifi:state: run -> init (0)
I (6145882) wifi:pm stop, total sleep time: 0 us / 20718 us

I (6145882) wifi:new:<1,0>, old:<1,1>, ap:<1,1>, sta:<1,1>, prof:1
I (6145992) mesh: [IO]enable self-organizing, search parent<reconnect>
I (6145992) mesh: mesh_schedule.c 2533 null args
I (6145992) mesh: <MESH_NWK_PARENT_DISCONNECTED>already disconnected, ignore it
I (6145992) mesh: [IO]disable self-organizing<stop reconnect>
I (6146002) mesh: <MESH_NWK_ROOT_ADDR>idle, layer:2, root_addr:7c:df:a1:e2:9e:d5, conflict_roots.num:0<>
I (6146012) mesh_main: <MESH_EVENT_ROOT_ADDRESS>root address:7c:df:a1:e2:9e:d5
I (6146012) mesh: <MESH_NWK_PARENT_CONNECTED>should not happen, err:0xffffffff
I (6146022) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason:8
I (6146022) mesh: [wifi]disconnected reason:8(assoc leave), continuous:0/max:12, non-root, vote(,stopped)<><>

@zhangyanjiaoesp
Copy link
Collaborator

zhangyanjiaoesp commented Oct 9, 2024

@KonssnoK I just returned from the National Day holiday, and I will review your question this week.

@KonssnoK
Copy link
Contributor Author

KonssnoK commented Oct 9, 2024

@KonssnoK I just returned from the National Day holiday, and I will review your question this week.

thanks @zhangyanjiaoesp
i pushed all changes to
KonssnoK@cb5df0c
KonssnoK@f42e22c

I think now this specific issue should be solved because i force the devices without a modem to stay in fixed root (not look for wifi) until they receive from the network a dynamic root event.

But please, yes, there are still a lot of questions in my previous comments :)

@zhangyanjiaoesp
Copy link
Collaborator

@KonssnoK

Also could you tell me when esp_mesh_set_parent can return 1, which is an undocumented value ?

I couldn't find the instance where the return value is 0x1. Could you describe the circumstances under which you returned such a value?

Can you explain again what

layer_cap
layer2_cap

indicate?

layer_cap indicates the max layer, and layer2_cap indicates the current capacity of the layer2 node

@zhangyanjiaoesp
Copy link
Collaborator

hello @zhangyanjiaoesp , i fear this bug has not been solved yet. i'm seeing multiple devices with the latest version of the esp library ( 3167a3a612831ee2b66073cfc7877bac70cf3eb2 ) which are completely offline. Only the root is able to remain online.

According to the previously reported changes i see these logs a lot

message: "Disconnection from non registered child! AC17540DF874"

message: "Negative children count! Reset mesh..."

Do you have any idea why a device should continuosuly disconnect from the parent without even trying to connect?

i will try to setup devices again and see if it's possible to replicate

@KonssnoK I believe this comment has already explained.

@KonssnoK
Copy link
Contributor Author

KonssnoK commented Oct 15, 2024

Hi @zhangyanjiaoesp ,
sorry for the delay.

I am a bit confused now:

layer_cap indicates the max layer, and layer2_cap indicates the current capacity of the layer2 node

In our logs we see devices reporting in the scan result values of
layer = 5
layer_cap = 1
layer2_cap = 0
So the device is current at layer 5 even tho it has a max layer of 1 (???) and the second layer has no capacity

This is also the case where we get error 0x1 as result of calling esp_mesh_set_parent

[1] <NEWPARENT>AC175415BA25, layer:5/1, assoc:1/6, 0, channel:1, rss
[1] Set parent err 0x1

Maybe layer_cap and layer2_cap are:

  • layer_cap : available slots in the current reported layer (1 slot on the 5 layer)
  • layer2_cap : available slots in the second layer (no capacity) - i still don't understand why only layer2 has a dedicated value

@zhangyanjiaoesp
Copy link
Collaborator

@KonssnoK

sorry for the incorrect explanation regarding layer_cap. the correct explanation is here:

uint8_t layer_cap; /**< layer_cap = max_layer - layer, indicates the number of remaining available layers of the mesh network */

the explanation for layer2_cap is here:

uint16_t layer2_cap; /**< the capacity of layer2 node, total child numbers plus 1, layer2 node updates layer2_cap and self_cap, root sets this to 0 */

The comments in the esp_mesh_internal.h file have been recently updated, and you can trust that they are correct.

@KonssnoK
Copy link
Contributor Author

@zhangyanjiaoesp i can understand the first comment and it seems in line with what i reported above.
But the second one it's not... In the log above we have a layer2_cap of 0, which means "0 total children + 1" according to the comment reported.
This doesn't feel right, I would expect layer2_cap to be the same as layer_cap
layer2_cap = totalcapacity_layer2 - connected_children_layer2

Could you check?

Also,
any idea why esp_mesh_set_parent would fail with 0x1 in this case?

@zhangyanjiaoesp
Copy link
Collaborator

@KonssnoK
I have already checked, and the description is correct. Please pay attention to the complete explanation that follows: this value will be at least 1 only when the current node is a layer 2 node; otherwise, it will be 0.

@zhangyanjiaoesp
Copy link
Collaborator

@KonssnoK
For the esp_mesh_set_parent return 0x1, can you provide me a case to reproduce it? I will add some debug log.

@KonssnoK
Copy link
Contributor Author

@KonssnoK For the esp_mesh_set_parent return 0x1, can you provide me a case to reproduce it? I will add some debug log.

I don't have a direct way to reproduce it because it's coming only from field devices.
What i can see is that the device is in fixed root, it is scanning and finds someone to connect at layer 5 when the maximum layer is 6.

So, if you want to replicate, i think you just have to use our code in fixed root and decrease the number of layers.

[3] <NEWPARENT>AC175413CEAD, layer:5/1, assoc:0/6, 0, channel:1, rss
[3] Set parent err 0x1

[3] <NEWPARENT>AC175413CEA5, layer:5/1, assoc:0/6, 0, channel:1, rss
[3] Set parent err 0x1

@zhangyanjiaoesp
Copy link
Collaborator

@KonssnoK I will try to test, should I use the branch KonssnoK@f42e22c ?

@KonssnoK
Copy link
Contributor Author

yes that is in line with our current implementation, but i would take
KonssnoK@cb5df0c
to have also the fix for wrong password

@zhangyanjiaoesp
Copy link
Collaborator

@KonssnoK Have you turned on the WiFi log? Normally, when an error is returned, the corresponding log will also be printed.

@KonssnoK
Copy link
Contributor Author

@zhangyanjiaoesp as far as i know there is currently no way to redirect wifi/mesh logs to a function that is not ESP_LOGX.
These errors are happening in the field, so we cannot see the logs there.

Meanwhile we started migrating to 5.3

@zhangyanjiaoesp
Copy link
Collaborator

@KonssnoK I can't reproduce the issue locally, since you can't see the log, then adding debug log for you to test would not be helpful. I will review the code again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: In Progress Work is in progress
Projects
None yet
Development

No branches or pull requests

4 participants