Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Coredump failed on overflowed tasks due to excessive stack usage of panic handler (IDFGH-12163) #13219

Open
andylinpersonal opened this issue Feb 20, 2024 · 6 comments
Labels
Status: Opened Issue is new Type: Bug bugs in IDF

Comments

@andylinpersonal
Copy link
Contributor

andylinpersonal commented Feb 20, 2024

Title

Core dump failed on overflowed tasks due to excessive stack usage of panic handler

IDF version.

release/v5.1 7380f96
release/v5.2 93ea06f
master c460e1c

Espressif SoC revision.

esp32s3: v0.1
esp32c3: v0.4
esp32c6: v0.0

Operating System used.

Linux

How did you build your project?

Command line with idf.py

Development Kit.

esp32c3: LuatOS CORE-ESP32-C3 (custom esp32-c3 with DIO mode flash)
esp32c6: esp32-c6-devkitc-1-n8
esp32s3: esp32-s3-devkitc-1-n32r8v

Power Supply used.

USB

What is the expected behavior?

Core dump should be reliably performed, even for the overflowed tasks.

Following is the normal core dump flow:

  1. [Exception occurred]
  2. Run the exception handler on the current stack.
  3. Fatal exception -> invoke panic handler.
  4. If core dump is enabled, invoke the core dump function.
  5. If core dump stack is enabled, swap the stack to the pre-allocated one.
  6. Perform real core dump stuff.
  7. Swap back the stack to the panic handler's one.
  8. Reset.

What is the actual behavior?

  • espcoredump crashed randomly when the original task was overflowed.

  • After reserving a large space before the stack, most configurations can perform core dump successfully.

    • To proof and check the memory region polluted by the overflowed panic handler and the core dump function, I have placed a second canary watchpoint below the first one. It seems that reserving some space under the end of stack could be a possible workaround as the second watchpoint will not be triggered duing core dump if the space is sufficiently large.

      • If the reserved space is way too small, we'll see this:
      Guru Meditation Error: Core  0 panic'ed (Double exception).
      
      • If the reserved space is slightly smaller than required space, we'll see this:
      Guru Meditation Error: Core  0 panic'ed (Unhandled debug exception).
      Debug exception reason: Watchpoint 0 triggered
      
    Original stack:
    
    |  ---  stack top   ---
    |  ====================
    |  <-   stack body   ->
    |
    |    [1st canary WP]
    |  ====================
    V  ---  stack end   ---
    
    Reserve some space under the stack end as workaround:
    
    G  ---  stack top   ---
    R  ====================
    O  <-   stack body   ->
    W
    T    [1st canary WP]
    H  ====================
    |  ---  stack end   ---
    |  ====================
    |  <- Reserved space ->
    |
    |    [2nd canary WP]
    |   (only work on s3)
    |  ====================
    V  --- end of rsvd  ---
    
  • After loading the core dump data by idf.py coredump-debug and examine with gdb, we can find the reserved space was modified by the panic handler and core dump code.

  • Possible fix:

    • Ensure _xt_panic and _panic_handler only do the minimum works and run the real panic handler on a dedicated emergency stack, just like the core dump do.
  • Additional note for S3, -O0:

    • Core dump cannot even continue with enlarged stack. What's happened here?

Steps to reproduce.

  • Reproduce:

    1. Enable some options in the sdkconfig.defaults.
      1. Enable core dump and save to flash.
        • CONFIG_ESP32_ENABLE_COREDUMP_TO_FLASH=y
      2. Enable canary watchpoint at the end of stack to catch the overflow event.
        • CONFIG_FREERTOS_WATCHPOINT_END_OF_STACK=y
      3. Enable core dump stack for core dump code.
        • CONFIG_ESP_COREDUMP_STACK_SIZE=1536
        • CONFIG_ESP32_CORE_DUMP_STACK_SIZE=1536
      4. Adjust the optimization level.
    2. Input crash and carefully overflow the victim task by recursively calling a small function.
    3. We can see the espcoredump crashed again :(
  • Examine:

    1. Reserve extra space before the stack by adding #define ENABLE_RESERVED_SPACE 1 to main/main.c
    2. Crash the task by input crash.
    3. Check the overflowed stask
    • Using gdb to show the overflowed task stack:
      1. Load the dumped image by ESP_ROM_ELF_DIR=~/.espressif/tools/esp-rom-elfs/20230320/ idf.py coredump-debug.
      2. Enter p task_stack and see the content of the stack in gdb.
      3. We can see the reserved space was modified by the core dump process.
      4. For S3: if the reserved size is too small, the 2nd WP will be triggered during the core dump process.
    • Hash the reserved area and compare before/after the panic handler:
      1. Hash and save the reserved space.
      2. Store the hash value to the RTC memory or global variable above the task_stack.
      3. Hash the reserved space again after the panic handler.
        • For IDF v5.3, user can put the checking code into esp_core_dump_write_internal()
      4. The hash value will be mismatched if the panic handler invaded the reserved space.

Debug Logs.

  • Log messages without workaround: (Excerpted from *without-workaround.log)

    • v5.1, S3, -Og:

      ELF file SHA256: b47fc05821d133ea
      
      I (6381) esp_core_dump_flash: Save core dump to flash...
      I (6381) esp_core_dump_common: Backing up stack @ 0x3fc959f0 and use core dump stack @ 0x3fc98780
      I (6382) esp_core_dump_flash: Erase flash 16384 bytes @ 0x110000
      Guru Meditation Error: Core  0 panic'ed (InstructionFetchError). Exception was unhandled.
      
      Core  0 register dump:
      PC      : 0x3fcef74c  PS      : 0x00060734  A0      : 0x82007e12  A1      : 0x3fc986f0
      A2      : 0x3fcef74c  A3      : 0x00110000  A4      : 0x00004000  A5      : 0x00004000
      A6      : 0x00110000  A7      : 0x00000002  A8      : 0x80377fc9  A9      : 0x3fc98660
      A10     : 0x3fc98700  A11     : 0x3fc948cf  A12     : 0xffffffff  A13     : 0x3fc98666
      A14     : 0x00000014  A15     : 0xfffffff9  SAR     : 0x00000005  EXCCAUSE: 0x00000002
      EXCVADDR: 0x3fcef74c  LBEG    : 0x400570e8  LEND    : 0x400570f3  LCOUNT  : 0x00000000
      
      
      Backtrace: 0x3fcef749:0x3fc986f0 |<-CORRUPTED
      
    • v5.2, C6, -Og:

      ELF file SHA256: c4031c967
      
      I (319) esp_core_dump_flash: Save core dump to flash...
      I (320) esp_core_dump_common: Backing up stack @ 0x4080d6b0 and use core dump stack @ 0x408100d0
      I (324) esp_core_dump_flash: Erase flash 12288 bytes @ 0x110000
      Guru Meditation Error: Core  0 panic'ed (Instruction access fault). Exception was unhandled.
      
      Core  0 register dump:
      MEPC    : 0x4080d570  RA      : 0x4080d570  SP      : 0x40810070  GP      : 0x4080c100
      TP      : 0x40805848  T0      : 0x4002289a  T1      : 0xffffffe0  T2      : 0x00000001
      S0/FP   : 0x00000000  S1      : 0x00000000  A0      : 0x4080d61c  A1      : 0x00110000
      A2      : 0x00003000  A3      : 0x4080ae8c  A4      : 0x40018ebe  A5      : 0x4080d570
      A6      : 0x80000000  A7      : 0x00000010  S2      : 0x00000000  S3      : 0x00003000
      S4      : 0x00000000  S5      : 0x00000000  S6      : 0x00000000  S7      : 0x00000000
      S8      : 0x00000000  S9      : 0x00000000  S10     : 0x00000000  S11     : 0x00000000
      T3      : 0x00000000  T4      : 0x00000000  T5      : 0x00000000  T6      : 0x00000000
      MSTATUS : 0x00001801  MTVEC   : 0x40800001  MCAUSE  : 0x00000001  MTVAL   : 0x4080d570
      MHARTID : 0x00000000
      
    • v5.2, S3, -Os:

      ELF file SHA256: 4aa6107ff
      
      I (8005) esp_core_dump_flash: Save core dump to flash...
      
      // Stucked here ...
      
      ESP-ROM:esp32s3-20210327
      Build:Mar 27 2021
      
      // And reset by RTCWDT
      rst:0x9 (RTCWDT_SYS_RST),boot:0x18 (SPI_FAST_FLASH_BOOT)
      
    • v5.3, C6, -O0:

      ELF file SHA256: 9f80fd8fa
      
      I (1589) esp_core_dump_flash: Save core dump to flash...
      I (1589) esp_core_dump_common: Backing up stack @ 0x40817560 and use core dump stack @ 0x4081a2f0
      I (1593) esp_core_dump_flash: Erase flash 12288 bytes @ 0x110000
      Guru Meditation Error: Core  0 panic'ed (Load access fault). Exception was unhandled.
      
      Core  0 register dump:
      MEPC    : 0x408048a6  RA      : 0x40804890  SP      : 0x4081a220  GP      : 0x40817500
      TP      : 0x4080c1cc  T0      : 0x4002289a  T1      : 0xffffffe0  T2      : 0x00000000
      S0/FP   : 0x4081a280  S1      : 0x4080d3b4  A0      : 0x00000000  A1      : 0x00110000
      A2      : 0x00003000  A3      : 0x00110000  A4      : 0x408174dc  A5      : 0xffffffff
      A6      : 0x80000000  A7      : 0x00000010  S2      : 0x00021340  S3      : 0x00000000
      S4      : 0x38be5e00  S5      : 0x00000000  S6      : 0x00000000  S7      : 0x00000000
      S8      : 0x00000000  S9      : 0x00000000  S10     : 0x00000000  S11     : 0x00000000
      T3      : 0x00000000  T4      : 0x00000000  T5      : 0x00000000  T6      : 0x00000000
      MSTATUS : 0x00001801  MTVEC   : 0x40800001  MCAUSE  : 0x00000005  MTVAL   : 0x00000017
      MHARTID : 0x00000000
      
    • v5.3, C6, -Og:

      ELF file SHA256: 8f8b05fe3
      
      I (10180) esp_core_dump_flash: Save core dump to flash...
      I (10181) esp_core_dump_common: Backing up stack @ 0x4080c330 and use core dump stack @ 0x4080ee00
      I (10185) esp_core_dump_flash: Erase flash 12288 bytes @ 0x110000
      Guru Meditation Error: Core  0 panic'ed (Illegal instruction). Exception was unhandled.
      
      Core  0 register dump:
      MEPC    : 0xfffffffe  RA      : 0x408028a8  SP      : 0x4080ed90  GP      : 0x4080c300
      TP      : 0x408031b4  T0      : 0x4002289a  T1      : 0xffffffe0  T2      : 0x00000001
      S0/FP   : 0x00000000  S1      : 0x4080ee44  A0      : 0x4080c290  A1      : 0x00110000
      A2      : 0x00003000  A3      : 0x4080b094  A4      : 0xffffffff  A5      : 0x4080c1e0
      A6      : 0x80000000  A7      : 0x00000010  S2      : 0x00110000  S3      : 0x00003000
      S4      : 0x00000000  S5      : 0x00000000  S6      : 0x00000000  S7      : 0x00000000
      S8      : 0x00000000  S9      : 0x00000000  S10     : 0x00000000  S11     : 0x00000000
      T3      : 0x00000000  T4      : 0x00000000  T5      : 0x00000000  T6      : 0x00000000
      MSTATUS : 0x00001801  MTVEC   : 0x40800001  MCAUSE  : 0x00000002  MTVAL   : 0x00000000
      MHARTID : 0x00000000
      
    • v5.3, S3, -O0:

      ELF file SHA256: 6343cb008
      
      I (5490) esp_core_dump_flash: Save core dump to flash...
      I (5490) esp_core_dump_common: Backing up stack @ 0x3fc9f1c0 and use core dump stack @ 0x3fca20a0
      I (5491) esp_core_dump_flash: Erase flash 20480 bytes @ 0x110000
      Guru Meditation Error: Core  0 panic'ed (IllegalInstruction). Exception was unhandled.
      
      Core  0 register dump:
      PC      : 0x6d75645f  PS      : 0x00060134  A0      : 0x8200c80c  A1      : 0x3fca1fa0
      A2      : 0x3fc9f31c  A3      : 0x00110000  A4      : 0x00005000  A5      : 0x00005000
      A6      : 0x00110000  A7      : 0x3fca1fa0  A8      : 0x80379796  A9      : 0x3fca1fd0
      A10     : 0x3fca1fd0  A11     : 0x3c037c7f  A12     : 0xffffffff  A13     : 0x3fca1f46
      A14     : 0x00000014  A15     : 0xfffffff9  SAR     : 0x00000005  EXCCAUSE: 0x00000000
      EXCVADDR: 0x00000000  LBEG    : 0x400570e8  LEND    : 0x400570f3  LCOUNT  : 0x00000000
      
      
      Backtrace: 0x6d75645c:0x3fca1fa0 |<-CORRUPTED
      
  • Log messages with workaround:

    • See attachment file with-workaround/${IDF_VER}-${TARGET}-${OPT}-log-with-workaround.log.
  • Stack content of overflowed task with workaround applied:

    • v5.3, C6, -O0

      [Current thread is 1 (process 1082239012)]
      add symbol table from file "~/.espressif/tools/esp-rom-elfs/20230320/esp32c6_rev0_rom.elf" at
              .text_addr = 0x400017ec
      Reading symbols from ~/.espressif/tools/esp-rom-elfs/20230320/esp32c6_rev0_rom.elf...
      (No debugging symbols found in ~/.espressif/tools/esp-rom-elfs/20230320/esp32c6_rev0_rom.elf)
      (gdb) p task_stack
      
      // Overflowed for (RESERVED_BYTES - 104) bytes...
      // In this case 896 - 104 = 792 bytes
      
      $1 = '\000' <repeats 104 times>, "\001\000\000\000\237w\201@", '\000' <repeats 12 times>, "\001", '\000' <repeats 15 times>, "\260w\201@\000\000\000\000\000\000\000\000\001\000\000\000\317w\201 \000\000\000\000\000\000\000\000\320w\201@\220\017\000B", '\000' <repeats 12 times>, "\340w\201 \000\000\000\000\000\000\000\000\000"...
      
    • Other dumped stacks have been excerpted to with-workaround/${IDF_VER}-${TARGET}-${OPT}-corredump-brief.log

  • Missing logs of C6, release/v5.2 due to v5.2 bootloader crashes on startup (IDFGH-12141) #13197.

More Information.

  • Related bug:

    • ESP32-S3 may stuck during swapping the stack in espcordump regardless of ESP-IDF version when processing the overflowed task. Not yet found the possible cause.
  • Followings are the test code.

main/main.c

#include <driver/usb_serial_jtag.h>
#include <esp_vfs_dev.h>
#include <esp_idf_version.h>

#include <freertos/FreeRTOS.h>
#include <freertos/task.h>

#include <stdio.h>
#include <string.h>

#define ENABLE_RESERVED_SPACE 1

#if ENABLE_RESERVED_SPACE
#if ESP_IDF_VERSION_MAJOR == 5 && ESP_IDF_VERSION_MINOR == 1

#ifdef CONFIG_IDF_TARGET_ESP32C3
#if CONFIG_COMPILER_OPTIMIZATION_NONE
#define RESERVED_BYTES 896
#elif CONFIG_COMPILER_OPTIMIZATION_DEFAULT
#define RESERVED_BYTES 640
#elif CONFIG_COMPILER_OPTIMIZATION_SIZE
#define RESERVED_BYTES 640
#endif
#endif

#ifdef CONFIG_IDF_TARGET_ESP32S3
#if CONFIG_COMPILER_OPTIMIZATION_NONE
// Stucked during core dump
#define RESERVED_BYTES 1536
#elif CONFIG_COMPILER_OPTIMIZATION_DEFAULT
#define RESERVED_BYTES 1024
#elif CONFIG_COMPILER_OPTIMIZATION_SIZE
#define RESERVED_BYTES 1024
#endif
#endif


#elif ESP_IDF_VERSION_MAJOR == 5 && ESP_IDF_VERSION_MINOR == 2

#if CONFIG_IDF_TARGET_ESP32C6
#if CONFIG_COMPILER_OPTIMIZATION_NONE
#define RESERVED_BYTES 896
#elif CONFIG_COMPILER_OPTIMIZATION_DEBUG
#define RESERVED_BYTES 640
#elif CONFIG_COMPILER_OPTIMIZATION_SIZE
#define RESERVED_BYTES 640
#endif
#endif

#ifdef CONFIG_IDF_TARGET_ESP32S3
#if CONFIG_COMPILER_OPTIMIZATION_NONE
// Stucked during core dump
#define RESERVED_BYTES 1536
#elif CONFIG_COMPILER_OPTIMIZATION_DEBUG
// Stucked during core dump
#define RESERVED_BYTES 1024
#elif CONFIG_COMPILER_OPTIMIZATION_SIZE
// Stucked during core dump
#define RESERVED_BYTES 1024
#endif
#endif

#elif ESP_IDF_VERSION_MAJOR == 5 && ESP_IDF_VERSION_MINOR == 3

#if CONFIG_IDF_TARGET_ESP32C3
#if CONFIG_COMPILER_OPTIMIZATION_NONE
#define RESERVED_BYTES 896
#elif CONFIG_COMPILER_OPTIMIZATION_DEBUG
#define RESERVED_BYTES 640
#elif CONFIG_COMPILER_OPTIMIZATION_SIZE
#define RESERVED_BYTES 640
#endif
#endif

#if CONFIG_IDF_TARGET_ESP32C6
#if CONFIG_COMPILER_OPTIMIZATION_NONE
#define RESERVED_BYTES 896
#elif CONFIG_COMPILER_OPTIMIZATION_DEBUG
#define RESERVED_BYTES 640
#elif CONFIG_COMPILER_OPTIMIZATION_SIZE
#define RESERVED_BYTES 640
#endif
#endif

#ifdef CONFIG_IDF_TARGET_ESP32S3
#if CONFIG_COMPILER_OPTIMIZATION_NONE
// Stucked during core dump
#define RESERVED_BYTES 1536
#elif CONFIG_COMPILER_OPTIMIZATION_DEBUG
// Stucked during core dump
#define RESERVED_BYTES 1024
#elif CONFIG_COMPILER_OPTIMIZATION_SIZE
// Stucked during core dump
#define RESERVED_BYTES 1024
#endif
#endif

#endif // ESP_IDF_VERSION_MAJOR == 5 && ESP_IDF_VERSION_MINOR == 3
#else  // !ENABLE_RESERVED_SPACE
#define RESERVED_BYTES 0
#endif // ENABLE_RESERVED_SPACE

#define CANARY_WP_SIZE 64
#define STACK_BYTES 3072

StaticTask_t task_tcb;
TaskHandle_t task_handle = NULL;
/**
 * Reserve RESERVED_BYTES bytes of extra space for panic handler and espcoredump before replacing the stack
 * Add COREDUMP_DRAM_ATTR to save the whole stack to coredump.
 */
COREDUMP_DRAM_ATTR __attribute__((aligned(CANARY_WP_SIZE))) StackType_t task_stack[STACK_BYTES + RESERVED_BYTES];

#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Winfinite-recursion"
__attribute__((noreturn)) NOINLINE_ATTR void crashme(void) { crashme(); }
#pragma GCC diagnostic pop

void task_func(void *) {
  fprintf(stderr, "Previous frame: %p:%p\n", __builtin_extract_return_addr(__builtin_return_address(0)), esp_cpu_get_sp());

  char buf[32] = {};
  fputs("$ \n", stderr);
  while (true) {
    if (fgets(buf, sizeof(buf), stdin)) {
      fprintf(stderr, "$ %s", buf);
      if (strnstr(buf, "crash", sizeof(buf))) {
        break;
      }
    }

    memset(buf, 0, sizeof(buf));
    vTaskDelay(pdMS_TO_TICKS(1000));
  }

  // Should be triggered on S3 when the reserved space exhausted
#if RESERVED_BYTES >= CANARY_WP_SIZE && ENABLE_RESERVED_SPACE
  assert(esp_cpu_set_watchpoint(0, &task_stack, CANARY_WP_SIZE,  ESP_CPU_WATCHPOINT_STORE) == ESP_OK);
#endif
  crashme();
}

void app_main(void) {
  // Disable buffering on stdin
  setvbuf(stdin, NULL, _IONBF, 0);

  // Install USB-SERIAL-JTAG driver for interrupt-driven reads and writes
  usb_serial_jtag_driver_config_t usb_serial_jtag_config =
      USB_SERIAL_JTAG_DRIVER_CONFIG_DEFAULT();
  assert(usb_serial_jtag_driver_install(&usb_serial_jtag_config) == ESP_OK);

  // Tell vfs to use usb-serial-jtag driver
  esp_vfs_usb_serial_jtag_use_driver();

  fprintf(stderr, "stack@%p-%p\n", &task_stack[0], (const char*)&task_stack[0] + sizeof(task_stack));

  task_handle = xTaskCreateStaticPinnedToCore(
      task_func, "outer", STACK_BYTES, NULL, CONFIG_PTHREAD_TASK_PRIO_DEFAULT, &task_stack[RESERVED_BYTES], &task_tcb, 0);
}

main/CMakeLists.txt

idf_component_register(SRCS "main.c")

CMakeLists.txt

cmake_minimum_required(VERSION 3.16)
include($ENV{IDF_PATH}/tools/cmake/project.cmake)
project(tester)

sdkconfig.defaults

CONFIG_ESP_CONSOLE_USB_SERIAL_JTAG=y
CONFIG_FREERTOS_WATCHPOINT_END_OF_STACK=y

CONFIG_ESP32_ENABLE_COREDUMP_TO_FLASH=y

# Optionally enable this to print more details.
# CONFIG_LOG_DEFAULT_LEVEL_DEBUG=y

# -O0
CONFIG_COMPILER_OPTIMIZATION_NONE=y
# -Og (Default)
# CONFIG_COMPILER_OPTIMIZATION_DEBUG=y
# CONFIG_COMPILER_OPTIMIZATION_DEFAULT=y
# -Os
# CONFIG_COMPILER_OPTIMIZATION_SIZE=y

# CONFIG_COMPILER_OPTIMIZATION_ASSERTIONS_ENABLE=y

# Enable and allocate a dedicated stack for coredump code to handle the overflowed stacks safely.
CONFIG_ESP_COREDUMP_STACK_SIZE=1536
CONFIG_ESP32_CORE_DUMP_STACK_SIZE=1536

@andylinpersonal andylinpersonal added the Type: Bug bugs in IDF label Feb 20, 2024
@andylinpersonal
Copy link
Contributor Author

Extra log files:
logs.tar.gz

@andylinpersonal andylinpersonal changed the title Coredump failed on overflowed tasks due to excess stack usage of panic handler Coredump failed on overflowed tasks due to excessive stack usage of panic handler Feb 20, 2024
@espressif-bot espressif-bot added the Status: Opened Issue is new label Feb 20, 2024
@github-actions github-actions bot changed the title Coredump failed on overflowed tasks due to excessive stack usage of panic handler Coredump failed on overflowed tasks due to excessive stack usage of panic handler (IDFGH-12163) Feb 20, 2024
@andylinpersonal
Copy link
Contributor Author

Additional note for ESP32-S3 with -O0:

  1. Coredump cannot even continue with enlarged stack
  • With a bit of esp_rom_printf, it seems like S3 hanged near esp_core_dump_replace_sp() and xthal_window_spill(). At least esp_rom_printf cannot print anything anymore after xthal_window_spill().

@andylinpersonal
Copy link
Contributor Author

  • Update some log files of v5.3 to c460e1c.
  • Add another examination method.

test.tar.gz

@andylinpersonal
Copy link
Contributor Author

Any update? Thanks.

@erhankur
Copy link
Collaborator

@andylinpersonal Thanks for reporting the issue. We have some findings regarding stack replacement on both xtensa and riscv chips. I will wrap up what we did and explain here next week.

Regarding overflow scenarios, I am sorry, I couldn't find a time to investigate yet. Will check it soon.

@AxelLin
Copy link
Contributor

AxelLin commented Jun 6, 2024

@erhankur Any follow up of #13219 (comment) ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Opened Issue is new Type: Bug bugs in IDF
Projects
None yet
Development

No branches or pull requests

4 participants