drivers: ethernet: enc28j60: disable/enable interrupts to avoid races #82529
+21
−7
Description
We have multiple custom STM32-based boards that we use with Ethernet. Long story short, they are plugged into a switch and we control them from a Linux computer over Ethernet. Each board uses an SPI ENC28J60 chip for Ethernet communication.
We encountered a rare bug where our computer would sometimes completely lose Ethernet communication with one or several of those boards. From the Zephyr console, we could see that the neighbor tables were stale. The boards were completely deaf. Only restarting the affected boards would restore Ethernet connectivity. Reproducing the issue can take days or even weeks.
Analysis
I spent quite some time trying to reproduce the issue with GDB. It was not easy. When I succeeded, I could see that the GPIO used as the ENC28J60 interrupt pin was correctly configured, and that its input register value showed 0. The ENC28J60 interrupt is active low (edge triggered), so it should have triggered an interrupt on the STM32 and subsequently called the interrupt handler. It did not, and the RX thread was still stuck waiting on its semaphore.
I then proceeded to look at several ENC28J60 registers. Breaking into `eth_enc28j60_read_reg` and reading the content of the `EIR` register (`EIR: ETHERNET INTERRUPT REQUEST (FLAG) REGISTER`) showed `0x49`: `PKTIF | TXIF | RXERIF`. `RXERIF` is expected (RX buffer is full, we never read). `TXIF` is a TX completed interrupt, expected also. `PKTIF` is expected since we received packets. It looks like somehow we missed an interrupt.
Commit 1
At first, I suspected an errata. The ENC28J60 has an errata that says that we cannot rely on the `PKTIF` bit to determine if data is pending in the RX buffer.
The driver already contains a workaround for this errata that checks `EPKTCNT` inside of `eth_enc28j60_rx()`. But upon receiving an interrupt, we still check the `PKTIF` flag in the RX thread before calling `eth_enc28j60_rx()`. In my view, this defeats the purpose of the errata workaround. I addressed this in the first commit and reran tests. While I believe this should fix a potential errata-related bug, this did not, however, fix my original missing interrupt issue.
Commit 2
Continuing my investigation, I realized that there is a small race window in the way that we handle the interrupt. Let me paste my 2nd commit message as an explanation:
In the 2nd commit, I disable the global interrupt bit on the ENC28J60 before processing the interrupt. We no longer experience the issue on our boards.
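To make the failure mode easier to reason about, here is a minimal host-side sketch of the kind of race described above. It is not driver code. It assumes, based on my reading of the ENC28J60 datasheet's interrupt section, that the active-low INT pin stays asserted for as long as any flag is pending while the global enable bit (`INTIE`) is set, and that clearing `INTIE` releases the pin. The struct, the helper names, and the way flags are cleared are all hypothetical simplifications. The sketch also decodes the `0x49` value read from `EIR`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* EIR flag bits as laid out in the ENC28J60 datasheet. */
#define EIR_RXERIF 0x01
#define EIR_TXERIF 0x02
#define EIR_TXIF   0x08
#define EIR_LINKIF 0x10
#define EIR_DMAIF  0x20
#define EIR_PKTIF  0x40

/* Render an EIR value as "FLAG | FLAG | ...", highest bit first. */
static void eir_describe(uint8_t eir, char *out, size_t len)
{
	static const struct { uint8_t bit; const char *name; } flags[] = {
		{ EIR_PKTIF, "PKTIF" },   { EIR_DMAIF, "DMAIF" },
		{ EIR_LINKIF, "LINKIF" }, { EIR_TXIF, "TXIF" },
		{ EIR_TXERIF, "TXERIF" }, { EIR_RXERIF, "RXERIF" },
	};

	assert(len > 0);
	out[0] = '\0';
	for (size_t i = 0; i < sizeof(flags) / sizeof(flags[0]); i++) {
		if (!(eir & flags[i].bit)) {
			continue;
		}
		if (out[0] != '\0') {
			strncat(out, " | ", len - strlen(out) - 1);
		}
		strncat(out, flags[i].name, len - strlen(out) - 1);
	}
}

/* Hypothetical model of the chip's INT logic plus the MCU edge detector. */
struct enc_model {
	uint8_t eir;        /* pending interrupt flags */
	bool intie;         /* global interrupt enable (EIE.INTIE) */
	bool pin_high;      /* current level of the active-low INT pin */
	int falling_edges;  /* edges latched by the MCU (assumed edge triggered) */
};

/* Recompute the pin level and count a falling edge if one occurred. */
static void update_pin(struct enc_model *m)
{
	bool high = !(m->intie && m->eir != 0);

	if (m->pin_high && !high) {
		m->falling_edges++;
	}
	m->pin_high = high;
}

static void raise_flags(struct enc_model *m, uint8_t f) { m->eir |= f; update_pin(m); }
static void clear_flags(struct enc_model *m, uint8_t f) { m->eir &= (uint8_t)~f; update_pin(m); }
static void set_intie(struct enc_model *m, bool on) { m->intie = on; update_pin(m); }

/* Service one interrupt while a second event arrives mid-service.
 * Returns how many falling edges the MCU saw in total.
 */
static int falling_edges_seen(bool mask_during_service)
{
	struct enc_model m = { .eir = 0, .intie = true, .pin_high = true };

	raise_flags(&m, EIR_TXIF);       /* first event: edge wakes the thread */

	if (mask_during_service) {
		set_intie(&m, false);    /* pin returns high while we work */
	}
	uint8_t snapshot = m.eir;        /* thread reads EIR once */
	raise_flags(&m, EIR_PKTIF);      /* packet arrives after the read */
	clear_flags(&m, snapshot);       /* only the flags we saw get handled */
	if (mask_during_service) {
		set_intie(&m, true);     /* pending PKTIF pulls the pin low again */
	}
	return m.falling_edges;
}
```

In this model, `falling_edges_seen(false)` returns 1: the pin never goes back high, so the late `PKTIF` produces no new edge and the thread would keep waiting. `falling_edges_seen(true)` returns 2, because re-enabling `INTIE` with a flag still pending generates a fresh falling edge. `eir_describe(0x49, ...)` yields `PKTIF | TXIF | RXERIF`, matching the value seen in GDB.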