Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lilygo RTL433 runs fine, then reboots every couple of minutes then hangs... #2092

Open
puterboy opened this issue Oct 21, 2024 · 13 comments
Open

Comments

@puterboy
Copy link
Contributor

I am running 2 different Lilygo rtl433 ESP32 devices.
They had been running fine for the last couple of weeks.
Then recently, the one downstairs, started the following behavior.

  1. Run fine for 1 or more days
  2. Start cycle of rebooting every 1-10 minutes (for about 10-20 times)
  3. Freeze/hang (requiring physical reboot)

The upstairs device seems completely stable.
Both devices never seem to get below about 2K free rtl_433 stack space nor do they run out of memory.

It's unlikely to be a power issue as I have a LiIon battery pack backup on the Lilygo device.
More generally, I don't think it's the hardware or firmware per-se as swapping the devices caused the swapped device to start exhibiting the above behavior.

Any ideas on what could be causing this?

  • Why does instability arise only after a couple of days?
  • Why does it suddenly start repeatedly rebooting every couple of minutes?
  • Why does it suddenly freeze and not even reboot?
  • Why does it affect only the device in one location?

The only thing I can think of is that a new device or device reading comes online that is only received at the downstairs location -- and somehow that triggers a bug or even a stack overflow (before it gets reported to the MQTT broker)...

@1technophile @NorthernMan54 have you seen anything like this?

@NorthernMan54
Copy link
Collaborator

Need to see the serial output

@puterboy
Copy link
Contributor Author

puterboy commented Oct 27, 2024

OK...

N: [ OMG->MQTT ] topic: home-crap/OMG_lilygo_rtl_433_ESP-2/RTL_433toMQTT/TFA-303221/1/219 msg: {"model":"TFA-303221","id":219,"channel":1,"battery_ok":1,"temperature_C":16.4,"humidity":65,"sendmode":0,"mic":"CRC","protocol":"TFA Dostmann 30.3221.02 T/H Outdoor Sensor","rssi":-94,"duration":206998}
T: isAdupl?
T: Enqueue JSON
T: Queue length: 1
T: Min ind: 6
T: store code : 21 / 381081797
T: Col: val/timestamp
T: mem code : 194 / 381053903
T: mem code : 191 / 381056961
T: mem code : 103 / 381057973
T: mem code : 30 / 381076050
T: mem code : 218 / 381080207
T: mem code : 235 / 381081571
T: mem code : 21 / 381081797
T: mem code : 8 / 381036885
T: mem code : 83 / 381036930
T: mem code : 196 / 381045816
T: mem code : 30 / 381049598
T: mem code : 183 / 381052358
T: isAdupl?
T: no pub. dupl
T: Dequeue JSON
CORRUPT HEAP: Bad head at 0x3ffd9cf8. Expected 0xabba1234 got 0x00000020

assert failed: multi_heap_free multi_heap_poisoning.c:253 (head != NULL)

Backtrace: 0x40083d4d:0x3ffb2430 0x4008de3d:0x3ffb2450 0x40093d71:0x3ffb2470 0x400939c7:0x3ffb25a0 0x40084211:0x3ffb25c0 0x40093da1:0x3ffb25e0 0x40136115:0x3ffb2600 0x40136125:0x3ffb2620 0x400dbd49:0x3ffb2640 0x400dc4bf:0x3ffb26d0 0x400dc63e:0x3ffb26f0 0x400eac39:0x3ffb27d0 0x40139445:0x3ffb2810

Then a few minutes later: (note log level reverted to 'Notice')

N: [ OMG->MQTT ] topic: homeassistant-crap/sensor/LaCrosse-TX141THBv2-0-216-temperature_C/config msg: {"stat_t":"+/+/RTL_433toMQTT/LaCrosse-TX141THBv2/0/216","dev_cla":"temperature","unit_of_meas":"°C","name":"Temperature","uniq_id":"LaCrosse-TX141THBv2-0-216-temperature_C","val_tpl":"{{ value_json.temperature_C | is_defined }}","stat_cla":"measurement","device":{"ids":["LaCrosse-TX141THBv2-0-216"],"cns":[["mac","LaCrosse-TX141THBv2-0-216"]],"mdl":"LaCrosse-TX141THBv2","name":"LaCrosse-TX141THBv2-0-216","via_device":"OMG_lilygo_rtl_433_ESP-2"}}
CORRUPT HEAP: Bad head at 0x3ffba1f8. Expected 0xabba1234 got 0x0000001c

assert failed: multi_heap_free multi_heap_poisoning.c:253 (head != NULL)


Backtrace: 0x40083d4d:0x3ffb2430 0x4008de3d:0x3ffb2450 0x40093d71:0x3ffb2470 0x400939c7:0x3ffb25a0 0x40084211:0x3ffb25c0 0x40093da1:0x3ffb25e0 0x40136115:0x3ffb2600 0x40136125:0x3ffb2620 0x400dbd49:0x3ffb2640 0x400dc4bf:0x3ffb26d0 0x400dc63e:0x3ffb26f0 0x400eac39:0x3ffb27d0 0x40139445:0x3ffb2810

Then again:

N: [ OMG->MQTT ] topic: homeassistant-crap/sensor/LaCrosse-TX141THBv2-0-216-temperature_C/config msg: {"stat_t":"+/+/RTL_433toMQTT/LaCrosse-TX141THBv2/0/216","dev_cla":"temperature","unit_of_meas":"°C","name":"Temperature","uniq_id":"LaCrosse-TX141THBv2-0-216-temperature_C","val_tpl":"{{ value_json.temperature_C | is_defined }}","stat_cla":"measurement","device":{"ids":["LaCrosse-TX141THBv2-0-216"],"cns":[["mac","LaCrosse-TX141THBv2-0-216"]],"mdl":"LaCrosse-TX141THBv2","name":"LaCrosse-TX141THBv2-0-216","via_device":"OMG_lilygo_rtl_433_ESP-2"}}
CORRUPT HEAP: Bad head at 0x3ffba1f8. Expected 0xabba1234 got 0x00000024

assert failed: multi_heap_free multi_heap_poisoning.c:253 (head != NULL)


Backtrace: 0x40083d4d:0x3ffb2430 0x4008de3d:0x3ffb2450 0x40093d71:0x3ffb2470 0x400939c7:0x3ffb25a0 0x40084211:0x3ffb25c0 0x40093da1:0x3ffb25e0 0x40136115:0x3ffb2600 0x40136125:0x3ffb2620 0x400dbd49:0x3ffb2640 0x400dc4bf:0x3ffb26d0 0x400dc63e:0x3ffb26f0 0x400eac39:0x3ffb27d0 0x40139445:0x3ffb2810

etc.

So somehow there is a heap corruption...

Note that the last reported stack low water mark was 936 bytes before the first crash and then about 3600 bytes before the subsequent crashes.

So, if it is exhausting stack, it must be doing that while trying to decode the current message so it doesn't get reported via the RFtoMQTT routine.

Again, it can run fine for hours or even days (in this case it ran fine for 4 days (!) before I start seeing the heap corruption. But once it occurs, then it seems to recur and reboot typically every few minutes for 5-10 or more times before resuming stability. In this case it rebooted about 6 times over the course of about 2.5 hours.

Any ideas what may be causing the heap corruption and/or how to troubleshoot?

The only thing I can think of is that there is some sensor that reports intermittently (maybe it's only turned on every once in a while) and which consumes excessive stack causing the heap corruption.
Interestingly, I can't say for sure, but it seems that the crashes typically occur between 6 and 10AM in the morning. I don't think it's one of my known sensors because all mine seem to be working fine (and have been doing so for quite a while)

@1technophile
Copy link
Owner

Maybe try to increase again the stack, and if this is still the problem we should check if there is a way to limit the decoder consumption and avoid such crashes.

@puterboy
Copy link
Contributor Author

That seems though to be a bit "brute force"
Is there any way to log which decoder is being called and how much stack it uses?
That way I could see what is causing the problem...

@puterboy
Copy link
Contributor Author

I like your idea of limiting decoder consumption and perhaps logging any time a decoder tries to use more than that amount so that we can know the name of the decoder and how much stack it sought to consume...

@NorthernMan54
Copy link
Collaborator

Add monitor_filters = esp32_exception_decoder to your configuration in platformio.ini

It should show exactly where the error occurred

@puterboy
Copy link
Contributor Author

Add monitor_filters = esp32_exception_decoder to your configuration in platformio.ini

It should show exactly where the error occurred

I did this btw last time I had crashes and it pointed to increasing the stack size but happy to enable it again to see if I get the same error.

@puterboy
Copy link
Contributor Author

I received another half dozen reboots this morning with monitor_filters = esp32_exception_decoder but the serial log error messages seem to be the same as before.

N: [ OMG->MQTT ] topic: home-crap/OMG_lilygo_rtl_433_ESP-2/RTL_433toMQTT/Ambientweather-F007TH/2/178 msg: {"model":"Ambientweather-F007TH","id":178,"channel":2,"battery_ok":1,"temperature_C":16.61111,"humidity":41,"mic":"CRC","protocol":"Ambient Weather F007TH, TFA 30.3208.02, SwitchDocLabs F016TH temperature sensor","rssi":-53,"duration":189001}
CORRUPT HEAP: Bad head at 0x3ffba2f8. Expected 0xabba1234 got 0xffffffff

assert failed: multi_heap_free multi_heap_poisoning.c:253 (head != NULL)


Backtrace: 0x40083d4d:0x3ffb2430 0x4008de3d:0x3ffb2450 0x40093d71:0x3ffb2470 0x400939c7:0x3ffb25a0 0x40084211:0x3ffb25c0 0x40093da1:0x3ffb25e0 0x40136115:0x3ffb2600 0x40136125:0x3ffb2620 0x400dbd49:0x3ffb2640 0x400dc4bf:0x3ffb26d0 0x400dc63e:0x3ffb26f0 0x400eac39:0x3ffb27d0 0x40139445:0x3ffb2810




ELF file SHA256: cc2ead56a1306f41

AND

CORRUPT HEAP: Bad head at 0x3ffba1f8. Expected 0xabba1234 got 0x00000024

assert failed: multi_heap_free multi_heap_poisoning.c:253 (head != NULL)


Backtrace: 0x40083d4d:0x3ffb2430 0x4008de3d:0x3ffb2450 0x40093d71:0x3ffb2470 0x400939c7:0x3ffb25a0 0x40084211:0x3ffb25c0 0x40093da1:0x3ffb25e0 0x40136115:0x3ffb2600 0x40136125:0x3ffb2620 0x400dbd49:0x3ffb2640 0x400dc4bf:0x3ffb26d0 0x400dc63e:0x3ffb26f0 0x400eac39:0x3ffb27d0 0x40139445:0x3ffb2810




ELF file SHA256: cc2ead56a1306f41

etc.

So maybe it's not a decoder issue???? Or at least not like the ones causing #2043 where I got an error saying Debug exception reason: Stack canary watchpoint triggered (rtl_433_Decoder) which was fixed by increasing stack size.

Any ideas on how to debug this further?
(of course I could just try increasing stack size further to see if that helps but I would like to get to the root of the issue)

@NorthernMan54
Copy link
Collaborator

Oh, If you run a build and upload it to the board, then run the monitor it will convert the backtrace addresses to actual lines of code, so we can pin point the issue.

@puterboy
Copy link
Contributor Author

I did re-build and re-uploaded (using platformio) with monitor_filters = esp32_exception_decoder

@NorthernMan54
Copy link
Collaborator

That's weird, as it should have given a longer back trace. If you build a different project afterwards, it does break the feature.

@puterboy
Copy link
Contributor Author

I will try to rebuild again...

@puterboy
Copy link
Contributor Author

I am not seeing anything in the backtrace other than what I reported above...

Meanwhile it ran fine for 2 weeks without any crash or reboot. Then today for about 90 minutes it rebooted every minute before stabilizing.
I'm pretty sure it is being triggered by a wireless sensor that either appears periodically or periodically has an output state that causes the crash.

My evidence is as follows:

  1. My other Lilygo RTL433 was not triggering when on the 3rd floor but then also started triggering every minute when I brought it down to the first floor where my main Lilygo RTL433 is located -- so it seems to be location dependent (and not an issue with the device itself)
  2. Removing the antenna caused the reboots to stop -- so it seems to be triggered by receiving a sensor signal rather than independently by the device or code itself.

I just increased the stack size to 15000 to see if that solves the problem -- but unfortunately it's not crashing now so I have to wait until the bad sensor or sensor message comes online again...

Stay tuned...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants