Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

drivers: ethernet: enc28j60: disable/enable interrupts to avoid races #82529

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

zkrx
Copy link
Contributor

@zkrx zkrx commented Dec 4, 2024

Description

We have multiple custom STM32-based boards that we use with ethernet. Long story short, they are plugged in on a switch and we can control them from a computer (Linux), through ethernet. Each board uses an SPI ENC28J60 chip for ethernet communication.

We encountered a rare bug where, sometimes, our computer would completely lose ethernet communication to one or a few of those boards. From the Zephyr console, we could see that the neighbor tables were stale. The boards were completely deaf. Only restarting the affected boards would restore ethernet connectivity. Reproducing the issue can take days or even weeks.

Analysis

I spent quite some time trying to reproduce with GDB. It was not easy. When I succeeded, I could see that the GPIO used as the ENC28J60 interrupt pin was correctly configured, and that it's input register value showed 0. The ENC28J60 interrupt is active low (edge triggered), so it should have triggered an interrupt on the STM32 and subsequently called the interrupt handler. It did not and the RX thread was still stuck waiting on its semaphore.

I then proceeded to look at several ENC28J60 registers. Breaking into eth_enc28j60_read_reg and reading the content of the EIR register (EIR: ETHERNET INTERRUPT REQUEST (FLAG) REGISTER) showed 0x49: PKTIF | TXIF | RXERIF. RXERIF is expected (RX buffer is full, we never read). TXIF is a TX completed interrupt, expected also. PKTIF is expected since we received packets. It looks like somehow we missed an interrupt.

Commit 1

At first, I suspected an errata. The ENC28J60 has an errata that says that we cannot rely on the PKTIF bit to determine if data is pending in the RX buffer:

"The Receive Packet Pending Interrupt Flag
(EIR.PKTIF) does not reliably/accurately report
the status of pending packets."

"In the Interrupt Service Routine, if it is unknown if
a packet is pending and the source of the interrupt
is unknown, switch to Bank 1 and check the value
in EPKTCNT.
If polling to see if a packet is pending, check the
value in EPKTCNT."

The driver already contains a workaround for this errata that checks EPKTCNT inside of eth_enc28j60_rx(). But upon receiving an interrupt, we still check the PKTIF flag in the RX thread before calling eth_enc28j60_rx(). In my view, this defeats the purpose of the errata workaround. I addressed this in the first commit and reran tests. While I believe this should fix a potential errata-related bug, this did not, however, fix my original missing interrupt issue.

Commit 2

Continuing my investigation, I realized that there is a small race window in the way that we handle the interrupt. Let me paste my 2nd commit message as an explanation:

    Currently, there is a small race window where we can miss an interrupt.
    Right after we're done reading the RX buffer but just before decrementing
    the RX counter to zero, the ENC28J60 may receive a packet. The chip will
    raise an interrupt, but the line is still asserted. That means that the
    callback will not be invoked since it is edge-triggered.
    
    To avoid that, disable interrupts on the chip itself before processing
    the RX buffer.
    
    In fact, the ENC28J60 datasheet specifically says:
    
            "After an interrupt occurs, the host controller should
            clear the global enable bit for the interrupt pin before
            servicing the interrupt. Clearing the enable bit will
            cause the interrupt pin to return to the non-asserted
            state (high). Doing so will prevent the host controller
            from missing a falling edge should another interrupt
            occur while the immediate interrupt is being serviced.
            After the interrupt has been serviced, the global enable
            bit may be restored. If an interrupt event occurred while
            the previous interrupt was being processed, the act of
            resetting the global enable bit will cause a new falling
            edge on the interrupt pin to occur."
    
    This is also what is being done in the Linux driver [1].
    
    [1] https://elixir.bootlin.com/linux/v6.11.2/source/drivers/net/ethernet/microchip/enc28j60.c#L1126

In the 2nd commit, I disable the global interrupt bit on the ENC28J60 before processing the interrupt. We no longer experience the issue on our boards.

The enc28j60 errata sheet says:

	"The Receive Packet Pending Interrupt Flag
	(EIR.PKTIF) does not reliably/accurately report
	the status of pending packets."

	"In the Interrupt Service Routine, if it is unknown if
	a packet is pending and the source of the interrupt
	is unknown, switch to Bank 1 and check the value
	in EPKTCNT.
	If polling to see if a packet is pending, check the
	value in EPKTCNT."

A workaround has already been implemented inside of eth_enc28j60_rx().
But checking PKTIF before calling eth_enc28j60_rx() completely defeats
the purpose of the workaround. Do not check it.

Moreover, clearing ENC28J60_BIT_EIR_PKTIF is useless since it is
automatically cleared once all packets are read. So remove that check
and clarify comment.

Also please refer to the Linux driver [1].

[1] https://elixir.bootlin.com/linux/v6.11.2/source/drivers/net/ethernet/microchip/enc28j60.c#L1090

Signed-off-by: Xavier Ruppen <[email protected]>
Currently, there is a small race window where we can miss an interrupt.
Right after we're done reading the RX buffer but just before decrementing
the RX counter to zero, the ENC28J60 may receive a packet. The chip will
raise an interrupt, but the line is still asserted. That means that the
callback will not be invoked since it is edge-triggered.

To avoid that, disable interrupts on the chip itself before processing
the RX buffer.

In fact, the ENC28J60 datasheet specifically says:

	"After an interrupt occurs, the host controller should
	clear the global enable bit for the interrupt pin before
	servicing the interrupt. Clearing the enable bit will
	cause the interrupt pin to return to the non-asserted
	state (high). Doing so will prevent the host controller
	from missing a falling edge should another interrupt
	occur while the immediate interrupt is being serviced.
	After the interrupt has been serviced, the global enable
	bit may be restored. If an interrupt event occurred while
	the previous interrupt was being processed, the act of
	resetting the global enable bit will cause a new falling
	edge on the interrupt pin to occur."

This is also what is being done in the Linux driver [1].

[1] https://elixir.bootlin.com/linux/v6.11.2/source/drivers/net/ethernet/microchip/enc28j60.c#L1126

Signed-off-by: Xavier Ruppen <[email protected]>
@jukkar
Copy link
Member

jukkar commented Dec 4, 2024

Exceptionally good analysis, kudos to that 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants