drivers: ethernet: enc28j60: disable/enable interrupts to avoid races #82529

zkrx · 2024-12-04T08:42:12Z

Description

We have multiple custom STM32-based boards that we use with ethernet. Long story short, they are plugged in on a switch and we can control them from a computer (Linux), through ethernet. Each board uses an SPI ENC28J60 chip for ethernet communication.

We encountered a rare bug where, sometimes, our computer would completely lose ethernet communication to one or a few of those boards. From the Zephyr console, we could see that the neighbor tables were stale. The boards were completely deaf. Only restarting the affected boards would restore ethernet connectivity. Reproducing the issue can take days or even weeks.

Analysis

I spent quite some time trying to reproduce with GDB. It was not easy. When I succeeded, I could see that the GPIO used as the ENC28J60 interrupt pin was correctly configured, and that it's input register value showed 0. The ENC28J60 interrupt is active low (edge triggered), so it should have triggered an interrupt on the STM32 and subsequently called the interrupt handler. It did not and the RX thread was still stuck waiting on its semaphore.

I then proceeded to look at several ENC28J60 registers. Breaking into eth_enc28j60_read_reg and reading the content of the EIR register (EIR: ETHERNET INTERRUPT REQUEST (FLAG) REGISTER) showed 0x49: PKTIF | TXIF | RXERIF. RXERIF is expected (RX buffer is full, we never read). TXIF is a TX completed interrupt, expected also. PKTIF is expected since we received packets. It looks like somehow we missed an interrupt.

Commit 1

At first, I suspected an errata. The ENC28J60 has an errata that says that we cannot rely on the PKTIF bit to determine if data is pending in the RX buffer:

"The Receive Packet Pending Interrupt Flag
(EIR.PKTIF) does not reliably/accurately report
the status of pending packets."

"In the Interrupt Service Routine, if it is unknown if
a packet is pending and the source of the interrupt
is unknown, switch to Bank 1 and check the value
in EPKTCNT.
If polling to see if a packet is pending, check the
value in EPKTCNT."

The driver already contains a workaround for this errata that checks EPKTCNT inside of eth_enc28j60_rx(). But upon receiving an interrupt, we still check the PKTIF flag in the RX thread before calling eth_enc28j60_rx(). In my view, this defeats the purpose of the errata workaround. I addressed this in the first commit and reran tests. While I believe this should fix a potential errata-related bug, this did not, however, fix my original missing interrupt issue.

Commit 2

Continuing my investigation, I realized that there is a small race window in the way that we handle the interrupt. Let me paste my 2nd commit message as an explanation:

    Currently, there is a small race window where we can miss an interrupt.
    Right after we're done reading the RX buffer but just before decrementing
    the RX counter to zero, the ENC28J60 may receive a packet. The chip will
    raise an interrupt, but the line is still asserted. That means that the
    callback will not be invoked since it is edge-triggered.
    
    To avoid that, disable interrupts on the chip itself before processing
    the RX buffer.
    
    In fact, the ENC28J60 datasheet specifically says:
    
            "After an interrupt occurs, the host controller should
            clear the global enable bit for the interrupt pin before
            servicing the interrupt. Clearing the enable bit will
            cause the interrupt pin to return to the non-asserted
            state (high). Doing so will prevent the host controller
            from missing a falling edge should another interrupt
            occur while the immediate interrupt is being serviced.
            After the interrupt has been serviced, the global enable
            bit may be restored. If an interrupt event occurred while
            the previous interrupt was being processed, the act of
            resetting the global enable bit will cause a new falling
            edge on the interrupt pin to occur."
    
    This is also what is being done in the Linux driver [1].
    
    [1] https://elixir.bootlin.com/linux/v6.11.2/source/drivers/net/ethernet/microchip/enc28j60.c#L1126

In the 2nd commit, I disable the global interrupt bit on the ENC28J60 before processing the interrupt. We no longer experience the issue on our boards.

The enc28j60 errata sheet says: "The Receive Packet Pending Interrupt Flag (EIR.PKTIF) does not reliably/accurately report the status of pending packets." "In the Interrupt Service Routine, if it is unknown if a packet is pending and the source of the interrupt is unknown, switch to Bank 1 and check the value in EPKTCNT. If polling to see if a packet is pending, check the value in EPKTCNT." A workaround has already been implemented inside of eth_enc28j60_rx(). But checking PKTIF before calling eth_enc28j60_rx() completely defeats the purpose of the workaround. Do not check it. Moreover, clearing ENC28J60_BIT_EIR_PKTIF is useless since it is automatically cleared once all packets are read. So remove that check and clarify comment. Also please refer to the Linux driver [1]. [1] https://elixir.bootlin.com/linux/v6.11.2/source/drivers/net/ethernet/microchip/enc28j60.c#L1090 Signed-off-by: Xavier Ruppen <[email protected]>

Currently, there is a small race window where we can miss an interrupt. Right after we're done reading the RX buffer but just before decrementing the RX counter to zero, the ENC28J60 may receive a packet. The chip will raise an interrupt, but the line is still asserted. That means that the callback will not be invoked since it is edge-triggered. To avoid that, disable interrupts on the chip itself before processing the RX buffer. In fact, the ENC28J60 datasheet specifically says: "After an interrupt occurs, the host controller should clear the global enable bit for the interrupt pin before servicing the interrupt. Clearing the enable bit will cause the interrupt pin to return to the non-asserted state (high). Doing so will prevent the host controller from missing a falling edge should another interrupt occur while the immediate interrupt is being serviced. After the interrupt has been serviced, the global enable bit may be restored. If an interrupt event occurred while the previous interrupt was being processed, the act of resetting the global enable bit will cause a new falling edge on the interrupt pin to occur." This is also what is being done in the Linux driver [1]. [1] https://elixir.bootlin.com/linux/v6.11.2/source/drivers/net/ethernet/microchip/enc28j60.c#L1126 Signed-off-by: Xavier Ruppen <[email protected]>

jukkar · 2024-12-04T15:13:30Z

Exceptionally good analysis, kudos to that 👍

zkrx added 2 commits December 3, 2024 15:18

zephyrbot added the area: Ethernet label Dec 4, 2024

zephyrbot requested review from decsny, lmajewski and pdgendt December 4, 2024 08:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

drivers: ethernet: enc28j60: disable/enable interrupts to avoid races #82529

drivers: ethernet: enc28j60: disable/enable interrupts to avoid races #82529

zkrx commented Dec 4, 2024

jukkar commented Dec 4, 2024

drivers: ethernet: enc28j60: disable/enable interrupts to avoid races #82529

Are you sure you want to change the base?

drivers: ethernet: enc28j60: disable/enable interrupts to avoid races #82529

Conversation

zkrx commented Dec 4, 2024

Description

Analysis

Commit 1

Commit 2

jukkar commented Dec 4, 2024