Bootloop #196

Open
LacsapOV opened this issue Feb 2, 2023 · 34 comments
@LacsapOV

LacsapOV commented Feb 2, 2023

Wemos D1 reboots about every 10 seconds

Reboot log
2023-02-02 19:48:17 - reboot cause: Exception (2) - Access to invalid address (28)
ESP register contents: epc1=0x40241688, epc2=0x00000000, epc3=0x00000000, excvaddr=0x144c0000, depc=0x00000000
2023-02-02 19:48:07 - reboot cause: Exception (2) - Access to invalid address (28)
ESP register contents: epc1=0x40241688, epc2=0x00000000, epc3=0x00000000, excvaddr=0x144c0000, depc=0x00000000
2023-02-02 19:47:57 - reboot cause: Exception (2) - Access to invalid address (28)
ESP register contents: epc1=0x40241688, epc2=0x00000000, epc3=0x00000000, excvaddr=0x144c0000, depc=0x00000000

Firmware Version
0.10.0+eeeb22c

PIC Firmware Version
6.4

Settings
{
"hostname": "OTGW",
"MQTTenable": true,
"MQTTbroker": "192.168.2.13",
"MQTTbrokerPort": 1883,
"MQTTuser": "",
"MQTTpasswd": "",
"MQTTtoptopic": "otgw",
"MQTThaprefix": "homeassistant",
"MQTTuniqueid": "otgw",
"MQTTOTmessage": true,
"MQTTharebootdetection": true,
"NTPenable": true,
"NTPtimezone": "Europe/Amsterdam",
"NTPhostname": "pool.ntp.org",
"LEDblink": true,
"GPIOSENSORSenabled": true,
"GPIOSENSORSpin": 13,
"GPIOSENSORSinterval": 20,
"S0COUNTERenabled": false,
"S0COUNTERpin": 12,
"S0COUNTERdebouncetime": 80,
"S0COUNTERpulsekw": 1000,
"S0COUNTERinterval": 60,
"OTGWcommandenable": false,
"OTGWcommands": "GW=1",
"GPIOOUTPUTSenabled": false,
"GPIOOUTPUTSpin": 16,
"GPIOOUTPUTStriggerBit": 0
}
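The reboot log entries in this report can be compared mechanically to check whether every crash happens at the same place. The following is a minimal sketch, assuming the log format shown above; the function names `parse_registers` and `distinct_crash_sites` are made up for illustration and are not part of the OTGW firmware.

```python
import re

# Two entries copied from the reboot log in this report.
LOG = """\
2023-02-02 19:48:17 - reboot cause: Exception (2) - Access to invalid address (28)
ESP register contents: epc1=0x40241688, epc2=0x00000000, epc3=0x00000000, excvaddr=0x144c0000, depc=0x00000000
2023-02-02 19:48:07 - reboot cause: Exception (2) - Access to invalid address (28)
ESP register contents: epc1=0x40241688, epc2=0x00000000, epc3=0x00000000, excvaddr=0x144c0000, depc=0x00000000
"""

# Matches "name=0xXXXXXXXX" pairs for the registers printed in the log.
REGISTER = re.compile(r"(epc1|epc2|epc3|excvaddr|depc)=0x([0-9a-fA-F]{8})")


def parse_registers(log_text):
    """Return one {register: int} dict per 'ESP register contents' line."""
    entries = []
    for line in log_text.splitlines():
        if line.startswith("ESP register contents:"):
            entries.append(
                {name: int(value, 16) for name, value in REGISTER.findall(line)}
            )
    return entries


def distinct_crash_sites(entries):
    """Return the set of (epc1, excvaddr) pairs seen across all entries."""
    return {(e["epc1"], e["excvaddr"]) for e in entries}


entries = parse_registers(LOG)
sites = distinct_crash_sites(entries)
for epc1, excvaddr in sorted(sites):
    print(f"epc1=0x{epc1:08x} excvaddr=0x{excvaddr:08x}")
```

A single distinct pair means every reboot faults at the same instruction on the same address, which points at a deterministic bug rather than random corruption.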

@DaveDavenport
Collaborator

It looks like the flashing went wrong.
Can you remove the Wemos from the board and reflash it using a USB cable directly from a PC?

@LacsapOV
Author

LacsapOV commented Feb 2, 2023

It looks like the flashing went wrong. Can you remove the Wemos from the board and reflash it using a USB cable directly from a PC?

Thank you Dave.
That's how I did it the first time around. I tried again but no luck; the issue remains.
It does not reboot when it's connected to the PC, only when connected to the board.

@DaveDavenport
Collaborator

DaveDavenport commented Feb 2, 2023

That is odd. What nodo-shop board version do you have?

(I wouldn't expect to see this error if the problem were with the board or power supply.)

Did you try a full flash erase before flashing? I normally do this when having odd issues with an ESP8266.

@LacsapOV
Author

LacsapOV commented Feb 2, 2023

That is odd. What nodo-shop board version do you have?

(I wouldn't expect to see this error if the problem were with the board or power supply.)

Did you try a full flash erase before flashing? I normally do this when having odd issues with an ESP8266.

The latest; I got it on Wednesday, soldered and ready to go.

I tried your suggestion of a full flash erase. I even dropped the baud rate to a lower value; same issue.
And finally I took another Wemos D1 mini (Adafruit) I had lying around; still the same issue. :-)

@DaveDavenport
Collaborator

Weird. It recently worked fine for me.

@JvHummel

JvHummel commented Feb 2, 2023

Hi,

Ended up here after having the same issue. Updated from 0.9.5 to 0.10.

2023-02-02 23:49:12 - reboot cause: Exception (2) - Access to invalid address (28)
ESP register contents: epc1=0x40241688, epc2=0x00000000, epc3=0x00000000, excvaddr=0x144c0000, depc=0x00000000
2106-02-07 07:28:19 - reboot cause: Exception (2) - Access to invalid address (28)
ESP register contents: epc1=0x40241688, epc2=0x00000000, epc3=0x00000000, excvaddr=0x144c0000, depc=0x00000000
2023-02-02 23:48:53 - reboot cause: Exception (2) - Access to invalid address (28)
ESP register contents: epc1=0x40241688, epc2=0x00000000, epc3=0x00000000, excvaddr=0x144c0000, depc=0x00000000

etc....

Took the OTGW out from the OpenTherm bus and the system seems stable now.

@DaveDavenport
Collaborator

https://www.espressif.com/sites/default/files/documentation/esp8266_reset_causes_and_common_fatal_exception_causes_en.pdf

So it sounds like we are following an invalid pointer.
I wonder why this happens, but not for everybody.
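To make the "invalid pointer" reading concrete, here is a small sketch that names the exception cause and checks where the two reported addresses fall. The cause numbers are the standard Xtensa ones described in the Espressif document linked above (only a few are listed). The address ranges in `REGIONS` are approximate and are an assumption for illustration, as are the helper names `classify` and `describe`.

```python
# A few Xtensa exception causes relevant to ESP8266 crashes (not the full list).
EXCEPTION_CAUSES = {
    0: "IllegalInstruction",
    3: "LoadStoreError",
    6: "IntegerDivideByZero",
    9: "LoadStoreAlignment",
    28: "LoadProhibited",
    29: "StoreProhibited",
}

# Approximate ESP8266 address ranges as (start, end_exclusive, name).
# These boundaries are assumptions for illustration only.
REGIONS = [
    (0x3FFE8000, 0x40000000, "data RAM"),
    (0x40000000, 0x40010000, "boot ROM"),
    (0x40100000, 0x40108000, "instruction RAM"),
    (0x40200000, 0x40300000, "memory-mapped flash"),
]


def classify(address):
    """Return the name of the region containing address, if any."""
    for start, end, name in REGIONS:
        if start <= address < end:
            return name
    return "not in a listed region"


def describe(cause, epc1, excvaddr):
    """Summarise a crash from its cause number and register values."""
    name = EXCEPTION_CAUSES.get(cause, "unknown")
    return (
        f"cause {cause} ({name}): code at 0x{epc1:08x} [{classify(epc1)}] "
        f"accessed 0x{excvaddr:08x} [{classify(excvaddr)}]"
    )


print(describe(28, 0x40241688, 0x144C0000))
```

Under these assumptions, `epc1` points into code running from flash while `excvaddr` lies outside every listed region, which is consistent with a load through a bad pointer.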

@JvHummel

JvHummel commented Feb 2, 2023

Since the resets stop happening for me when I disconnect the OpenTherm connection, I'd wager it must be something device-specific, i.e. the thermostat or boiler sends data the firmware doesn't like.

I'm headed to bed for today but if you're interested, I'll try connecting boiler/thermostat separately tomorrow and see if either of them triggers the issue.

@DaveDavenport
Collaborator

That would be useful, also please report your setup (boiler/thermostat).

@rvdbreemen rvdbreemen added the bug Something isn't working label Feb 3, 2023
@rvdbreemen rvdbreemen added this to the 0.11.0 milestone Feb 3, 2023
@Roos-AID
Contributor

Roos-AID commented Feb 3, 2023

Can you change the setting "GPIOSENSORSenabled": true to false?
What type of sensors are connected to GPIO 13?

@JvHummel

JvHummel commented Feb 3, 2023

Hi Dave,

Yes, of course. The following is my setup:

  • Boiler: Intergas Kombi Kompakt HReco 30
  • Thermostat: Honeywell ChronoTherm Touch Modulation

Just did some tests.

  • Connecting the OTGW to only my thermostat produced a stable system, i.e. no bootloops.
  • Connecting the OTGW to only my boiler, also produced a stable system, i.e. no bootloops.
  • Connecting the OTGW between my boiler and thermostat produced the bootloop.

I reset the power on the boiler and OTGW every time I performed a test, so message intervals from any module shouldn't affect the tests, if my reasoning is correct.

I should also provide my settings; they are as follows.

{
"hostname": "OTGW",
"MQTTenable": false,
"MQTTbroker": "192.168.178.120",
"MQTTbrokerPort": 1883,
"MQTTuser": "xxx",
"MQTTpasswd": "xxx",
"MQTTtoptopic": "OTGW",
"MQTThaprefix": "homeassistant",
"MQTTuniqueid": "otgw-XXXXXXXXXXXX",
"MQTTOTmessage": false,
"MQTTharebootdetection": true,
"NTPenable": true,
"NTPtimezone": "Europe/Amsterdam",
"NTPhostname": "pool.ntp.org",
"LEDblink": true,
"GPIOSENSORSenabled": false,
"GPIOSENSORSpin": 13,
"GPIOSENSORSinterval": 20,
"S0COUNTERenabled": false,
"S0COUNTERpin": 12,
"S0COUNTERdebouncetime": 80,
"S0COUNTERpulsekw": 1000,
"S0COUNTERinterval": 60,
"OTGWcommandenable": false,
"OTGWcommands": "GW=1",
"GPIOOUTPUTSenabled": false,
"GPIOOUTPUTSpin": 16,
"GPIOOUTPUTStriggerBit": 0
}

The system was stable with 0.9.5 but since I run PIC FW 6.4, I figured I'd update to 0.10 since the changelog mentions improved compatibility.

If there's anything else I can provide or try out, please let me know.

@Roos-AID
Contributor

Roos-AID commented Feb 3, 2023

Thanks. This one has no GPIO sensors attached, so we can rule that out. I have seen a problem where OneWire detecting a strange device caused this, but with "GPIOSENSORSenabled": false that code is not executed.

I have tested with the Honeywell ChronoTherm Touch Modulation as well, but different boiler. No problem there.

I think we need at least a telnet trace or, better, a trace of the OpenTherm traffic with OTMonitor.

Suggestion: can we give a version compiled with 2.7.4 a try?

@hvxl

hvxl commented Feb 3, 2023

As this seems to be caused by some specific data the ESP receives from the PIC, it may be interesting to see what the PIC is sending. With at least hardware v2.3 and later, it is possible to power the board from a USB port of the PC and receive the serial data there, in addition to having a Wemos installed on the OTGW. Running a terminal emulator (or even OTmonitor) on the USB port may provide some valuable insights.

@JvHummel

JvHummel commented Feb 3, 2023

@hvxl Can you confirm that this will work? Based on the manual for HW rev. 2.3, it does not seem to be meant for this use case: "Do not connect a Micro USB cable to the WeMos D1 Mini while it is connected to the gateway!", so I am a bit wary of destroying my new toy 😉

@Roos-AID Does 2.7.4 refer to a version of a library or the core, or something like that? Either way, I'd be happy to try.

@DaveDavenport
Collaborator

I think @hvxl is talking about the USB port on the main board, not the one on the Wemos.

@Roos-AID
Contributor

Roos-AID commented Feb 3, 2023

You can view the debug log via telnet to the device's IP address. Alternatively, use OTMonitor and connect to port 25238.

If you use telnet, open the session before you connect power; otherwise you might miss the first messages.
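For a device that keeps rebooting, a capture script that retries the connection may have a better chance than a manual session. The sketch below is only an illustration: `capture_with_retry` is a made-up helper, the `connect` parameter exists purely so the function can be exercised without a network, and nothing here is part of the OTGW firmware or OTMonitor. With reboots every 10 seconds it may still fail to connect in time.

```python
import socket
import time


def capture_with_retry(host, port, attempts=30, delay=1.0, max_bytes=4096,
                       connect=socket.create_connection):
    """Retry connecting; return bytes read from the first successful connection.

    Returns b"" if every connection attempt fails.
    """
    for _ in range(attempts):
        try:
            conn = connect((host, port), timeout=2.0)
        except OSError:
            # The device is probably mid-reboot; wait briefly and try again.
            time.sleep(delay)
            continue
        with conn:
            chunks = []
            received = 0
            while received < max_bytes:
                try:
                    data = conn.recv(1024)
                except OSError:
                    break  # timeout or dropped connection, e.g. after a reboot
                if not data:
                    break  # peer closed the connection
                chunks.append(data)
                received += len(data)
            return b"".join(chunks)
    return b""
```

A call such as `capture_with_retry("192.168.2.50", 25238)` would try the OTMonitor port mentioned above; the IP address in that call is a placeholder.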

@hvxl

hvxl commented Feb 3, 2023

Sorry, I should have been clearer. Yes, what I meant was to power the OTGW board from a USB port on the PC.

@Roos-AID It's not possible to connect telnet before connecting the power. With the ESP rebooting every 10 seconds, there is hardly any chance to connect via TCP at all. That's why I suggested monitoring via USB.

@JvHummel

JvHummel commented Feb 3, 2023

Hi all,

I did a USB/TTY readout as @hvxl suggested.
I've attached a log file. Hopefully it can shed some light on the situation.

putty.log

@LacsapOV
Author

LacsapOV commented Feb 3, 2023

In my case it's connected to a Honeywell Chronotherm Touch Modulation and an Atlantic Loria heat pump.
GPIO is also off after I flashed it again.

According to the OTGW documentation, error 03 suggests a voltage issue. That could have explained my issue, since my board is new, but not JvHummel's.

I'll do my best to dump a log.

@JvHummel

JvHummel commented Feb 3, 2023

@LacsapOV My board is also new, soldered it just 2 nights ago :) But shipped with v0.9.5. My theory is that nodo-shop batch pre-programs them ahead of time.

Anyhow, you are right that it doesn't explain why 0.9.5 was stable for me.

@hvxl

hvxl commented Feb 3, 2023

When I replay that I also get exception 28:

--------------- CUT HERE FOR EXCEPTION DECODER ---------------

Exception (28):
epc1=0x40241688 epc2=0x00000000 epc3=0x00000000 excvaddr=0x144c0000 depc=0x00000000

>>>stack>>>

ctx: cont
sp: 3ffffc50 end: 3fffffc0 offset: 0190
3ffffde0:  4028616c 0000001a 3ffffe84 3fff3690  
3ffffdf0:  00000001 144c0000 3ffffe84 402371fb  
3ffffe00:  00000001 3fff5678 3fff3670 4023722d  
3ffffe10:  3fff36cc 3fff5678 3fff3670 4020cebb  
3ffffe20:  34303742 30353036 00090030 70460500  
3ffffe30:  05460701 00000000 00000f66 00000000  
3ffffe40:  00000046 00000000 00000006 3ffe9477  
3ffffe50:  3ffe9480 3ffe9fcd 00000000 3fff3a88  
3ffffe60:  00000000 4bc6a7f0 472b020c 3fff5768  
3ffffe70:  00000000 00000000 2d000000 2d2d2d2d  
3ffffe80:  002d2d2d 00000000 001a001f 00000000  
3ffffe90:  00000000 2d2d2d2d 00000000 b65d1ae8  
3ffffea0:  0000002a ff000000 3fff4c08 00000200  
3ffffeb0:  3fff5678 3fff3690 3fff3670 4021ab1e  
3ffffec0:  34303742 30353036 00090030 70460500  
3ffffed0:  05460701 00000000 00000f66 00000000  
3ffffee0:  00000046 00000000 00000006 3ffe9477  
3ffffef0:  3ffe9480 3ffe9fcd 3fffff2c b65d1ae8  
3fffff00:  0000002a 3fff3630 3ffffee0 00000009  
3fffff10:  40000200 00000000 67617373 00000000  
3fffff20:  00000000 30323030 00000000 b65d1ae8  
3fffff30:  3fff57d4 3fff341c 00000010 3fff3414  
3fffff40:  3fff57d4 3fff341c 3fff4c08 4021b108  
3fffff50:  3fff6844 40221738 030207e7 3fff3a88  
3fffff60:  3fff3b40 3fff3b70 3fff3ba0 4021d29f  
3fffff70:  3fff3b40 3fff3b70 3fff3ba0 4021dcde  
3fffff80:  00000000 00000000 00000001 401004a8  
3fffff90:  3fffdad0 00000000 3fff5d84 3fff5d98  
3fffffa0:  3fffdad0 00000000 3fff5d84 40238460  
3fffffb0:  feefeffe feefeffe 3ffe86a8 401013b1  
<<<stack<<<

--------------- CUT HERE FOR EXCEPTION DECODER ---------------

@DaveDavenport
Collaborator

DaveDavenport commented Feb 3, 2023

Ah, nice. We should be able to translate that back into a readable backtrace (if we have the original ELF file).

(I think there is a plugin for this: https://github.com/me-no-dev/EspExceptionDecoder)

@hvxl

hvxl commented Feb 3, 2023

Strangely I can't reproduce the issue on v0.10.0rc5 or when I compile v0.10.0 myself.

@rvdbreemen
Owner

Which core are you compiling with, 3.0.2 or 2.7.4?

@DaveDavenport
Collaborator

I cannot reproduce with 3.0.2 or 2.7.4.

@hvxl

hvxl commented Feb 4, 2023

I tried both. However, comparing the reported debug information, I seem to end up with a different binary than you:

Firmware Version	0.10.0+36108cf
Free Heap Mem (bytes)	13488
Max. Free Mem (bytes)	12464
Arduino Core Version	3.0.2
Espressif SDK Version	2.2.2-dev(38a443e)
CPU speed (MHz)		160
Sketch Size (bytes)	601072
Sketch Free (bytes)	1495040
Flash ID		001620C2
Flash Chip Size (MB)	4
Real Flash Chip (MB)	4
LittleFSsize		1
Flash Chip Speed (MHz)	40
Flash Mode		DIO
Board Type		WEMOS_D1MINI

Firmware version is probably different because I didn't use autoinc-semver. Heap usage changes dynamically. But I expected the sketch size to be the same. There's probably a difference in the libraries we use. I have the impression your "How to compile the OTGW firmware" wiki page is not current.

Did you manage to run the stack trace through the exception decoder?

@DaveDavenport
Collaborator

I think we need the original ELF file to decode the stack trace.

@DaveDavenport
Collaborator

I can reproduce the crash with the release build, by the way:

--------------- CUT HERE FOR EXCEPTION DECODER ---------------

Exception (28):
epc1=0x40241688 epc2=0x00000000 epc3=0x00000000 excvaddr=0x144c0000 depc=0x00000000

>>>stack>>>

ctx: cont
sp: 3ffffc50 end: 3fffffc0 offset: 0190
3ffffde0:  4028616c 0000001a 3ffffe84 3fff3690
3ffffdf0:  00000001 144c0000 3ffffe84 402371fb
3ffffe00:  00000001 3fff5678 3fff3670 4023722d
3ffffe10:  3fff36cc 3fff5678 3fff3670 4020cebb
3ffffe20:  34303742 30353036 00090030 70460500
3ffffe30:  05460701 00000000 000159eb 00000000
3ffffe40:  00000046 00000000 00000006 3ffe9477
3ffffe50:  3ffe9480 3ffe9fcd 00000000 3fff59b8
3ffffe60:  00000000 4bc6a7f0 0c49ba5e 3fff5768
3ffffe70:  00000000 00000000 2d000000 2d2d2d2d
3ffffe80:  002d2d2d 00000000 001a001f 00000000
3ffffe90:  00000000 2d2d2d2d 00000000 4637f0eb
3ffffea0:  0000002a ff000000 3fff3670 00000200
3ffffeb0:  3fff5678 3fff3690 3fff3670 4021ab1e
3ffffec0:  34303742 30353036 00090030 70460500
3ffffed0:  05460701 00000000 000159eb 00000000
3ffffee0:  00000046 00000000 00000006 3ffe9477
3ffffef0:  3ffe9480 3ffe9fcd 3fffff2c 3fff58ac
3fffff00:  0000002a 00000000 3ffffee0 00000009
3fffff10:  80000200 00000001 00000010 4010158c
3fffff20:  00000000 3fff341c 00000000 4637f0eb
3fffff30:  3fff57d4 3fff341c 00000010 3fff3414
3fffff40:  3fff57d4 3fff341c 3fff4c08 4021b108
3fffff50:  3fff6844 40221738 040207e7 3fff3a88
3fffff60:  3fff3b40 3fff3b70 3fff3ba0 4021d29f
3fffff70:  3fff3b40 3fff3b70 3fff3ba0 4021dcde
3fffff80:  00000000 00000000 00000001 401004a8
3fffff90:  3fffdad0 00000000 3fff5d84 3fff5d98
3fffffa0:  3fffdad0 00000000 3fff5d84 40238460
3fffffb0:  feefeffe feefeffe 3ffe86a8 401013b1
<<<stack<<<

--------------- CUT HERE FOR EXCEPTION DECODER ---------------

@DaveDavenport
Collaborator

This backtrace is not correct as far as I can tell, so I really need the original ELF file:

x106-elf-gcc/3.0.4-gcc10.3-1757bed/bin/xtensa-lx106-elf-addr2line  build/esp8266.esp8266.d1_mini/OTGW-firmware.ino.elf dump.txt
Exception Cause: 28  [LoadProhibited: A load referenced a page mapped with an attribute that does not permit loads]

0x40241688: _ungetc_r at /workdir/repo/newlib/newlib/libc/stdio/ungetc.c:202
0x4028616c: etharp_output at ??:?
0x402371fb: _ZN12experimentalL11_SPICommandEjjjjjPjjj$constprop$0 at /home/qball/Programming/Other/OTGW-firmware/arduino/packages/esp8266/hardware/esp8266/3.0.2/cores/esp8266/core_esp8266_spi_utils.cpp:89
0x4023722d: _ZN12experimentalL11_SPICommandEjjjjjPjjj$constprop$0 at /home/qball/Programming/Other/OTGW-firmware/arduino/packages/esp8266/hardware/esp8266/3.0.2/cores/esp8266/core_esp8266_spi_utils.cpp:102
0x4020cebb: startTelnet() at /home/qball/Programming/Other/OTGW-firmware/networkStuff.h:167
0x4021ab1e: updateSetting(char const*, char const*) at /home/qball/Programming/Other/OTGW-firmware/settingStuff.ino:268
0x4010158c: pm_rtc_clock_cali_trig at ??:?
0x4021b108: handleMQTTcallback(char*, unsigned char*, unsigned int) at /home/qball/Programming/Other/OTGW-firmware/MQTTstuff.ino:142
0x40221738: DallasTemperature::calculateTemperature(unsigned char const*, unsigned char*) at /home/qball/Programming/Other/OTGW-firmware/libraries/DallasTemperature/DallasTemperature.cpp:638
0x4021d29f: OTGWSerial::processorToString() at /home/qball/Programming/Other/OTGW-firmware/src/libraries/OTGWSerial/OTGWSerial.cpp:945
0x4021dcde: OTGWUpgrade::stateMachine(unsigned char const*, int) at /home/qball/Programming/Other/OTGW-firmware/src/libraries/OTGWSerial/OTGWSerial.cpp:697
0x401004a8: esp_schedule at ??:?
0x40238460: ClientContext::state() const at /home/qball/Programming/Other/OTGW-firmware/arduino/packages/esp8266/hardware/esp8266/3.0.2/libraries/ESP8266WiFi/src/include/ClientContext.h:370
0x401013b1: timer1_isr_handler at /home/qball/Programming/Other/OTGW-firmware/arduino/packages/esp8266/hardware/esp8266/3.0.2/cores/esp8266/core_esp8266_timer.cpp:43
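The addresses fed to addr2line above are the stack words that look like code addresses. The sketch below shows that filtering step on a few lines of the posted dump. It is an illustration only: `code_addresses` is a made-up helper, and the `CODE_START`/`CODE_END` bounds are an assumed approximation of the ESP8266 code range, not values taken from the decoder tools.

```python
import re

# The first four lines and the last line of the stack dump posted in this thread.
DUMP = """\
3ffffde0:  4028616c 0000001a 3ffffe84 3fff3690
3ffffdf0:  00000001 144c0000 3ffffe84 402371fb
3ffffe00:  00000001 3fff5678 3fff3670 4023722d
3ffffe10:  3fff36cc 3fff5678 3fff3670 4020cebb
3fffffb0:  feefeffe feefeffe 3ffe86a8 401013b1
"""

# One dump line: a stack address, a colon, then 32-bit hex words.
STACK_LINE = re.compile(r"^([0-9a-f]{8}):\s+((?:[0-9a-f]{8}\s*)+)$")

# Assumed bounds: start of instruction RAM to the end of the mapped flash window.
CODE_START = 0x40100000
CODE_END = 0x40300000


def code_addresses(dump_text):
    """Return stack words that fall inside the assumed code address range."""
    addresses = []
    for line in dump_text.splitlines():
        match = STACK_LINE.match(line.strip())
        if not match:
            continue
        for word in match.group(2).split():
            value = int(word, 16)
            if CODE_START <= value < CODE_END:
                addresses.append(value)
    return addresses


for address in code_addresses(DUMP):
    print(f"0x{address:08x}")
```

Every word in that range is only a candidate: some are stale values left on the stack rather than real return addresses, and each one resolves to the right function only when the ELF file matches the flashed binary exactly.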

@LacsapOV
Author

LacsapOV commented Feb 5, 2023

I've got mine up and running and stable. I compiled it myself using the steps in the documentation. It seems the binary linked in the installation documentation is faulty.

I did have an issue with the AceTime version. The documentation states version 1.9.0, but that version is missing the function unixSeconds64, so I updated to the latest 1.x branch.

@rvdbreemen
Owner

Glad to hear that a new build does work. It's the same conclusion we are reaching in the firmware chat on Discord.

What was the issue you ran into, so I can correct it? Also, AceTime needs the latest version, and Brian is very actively improving his library.

@JvHummel

JvHummel commented Feb 7, 2023

Can confirm that doing a build myself and flashing it remedies the bootloops.

@rvdbreemen
Owner

@JvHummel thanks for confirming that. I just built a new release, 0.10.1. Would you be so kind as to test it? It's in the beta channel on Discord.

@JvHummel

JvHummel commented Feb 8, 2023

Good evening Robert, was just messing with a DS18B20 so had my OTGW out anyway. Good timing. Flashed 0.10.1-beta+7b22d7d and connected it to boiler/thermostat. No bootloops!

@rvdbreemen rvdbreemen moved this to In Progress in OTGW ESP firmware backlog Feb 10, 2023
@rvdbreemen rvdbreemen modified the milestones: 0.11.0, 0.10.1 Feb 10, 2023
@rvdbreemen rvdbreemen moved this from In Progress to Done in OTGW ESP firmware backlog Feb 10, 2023
6 participants