Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

better softnpu management command reliability #570

Merged
merged 4 commits into from
Dec 1, 2023

Conversation

rcgoodfellow
Copy link
Contributor

@rcgoodfellow rcgoodfellow commented Nov 20, 2023

This PR does two things

  1. Makes UART FIFO length configurable and sets softnpu's management UART FIFO length to 10 instead of 1.
  2. Updates the softnpu management message reader to not attempt to deserialize messages where one of the following is true
    a. The softnpu management preamble has not been recognized.
    b. The message is not long enough to contain a preamble plus actual data.

These changes were spurred by transient issues adding routes in Falcon topologies. I traced this down to corrupted softnpu management command payloads coming through the UART. The corrupted payloads were easy to reproduce. With the changes in this PR, I'm no longer able to get corrupted payloads through the UART.

@rcgoodfellow
Copy link
Contributor Author

Something peculiar about the missing bytes. Missing bytes would happen intermittently on a message-by-message basis e.g. I could not detect a pattern there. However, when there was a missing byte in a message, it was often the 2nd byte, which felt odd.

@rcgoodfellow rcgoodfellow changed the title better softnpu management command reliability better softnpu management command reliability, p9fs read count fix Nov 24, 2023
@rcgoodfellow rcgoodfellow force-pushed the softnpu-mgmt-reliability branch from afe9d74 to 836afad Compare November 25, 2023 19:31
@rcgoodfellow rcgoodfellow changed the title better softnpu management command reliability, p9fs read count fix better softnpu management command reliability Nov 25, 2023
@rcgoodfellow rcgoodfellow force-pushed the softnpu-mgmt-reliability branch from 836afad to 7d3e4ee Compare November 25, 2023 20:29
@pfmooney
Copy link
Collaborator

The illumos issue regarding the lost byte(s) on the UART has been written up in 16088. The changes to the propolis UART to expand the FIFO sizing should not go in as-is. (We may someday improve the emulation to support 16550-compatible FIFO behavior)

Do you want to keep the other changes to the softNPU logic contained in this PR?

@rcgoodfellow
Copy link
Contributor Author

Do you want to keep the other changes to the softNPU logic contained in this PR?

Yeah, I do.

Once the illumos patch goes through and makes its way to stlouis, I'll cut a new Helios base image for Falcon. Then I'll revert the FIFO changes on this branch and we can get the remaining updates merged.

@pfmooney
Copy link
Collaborator

pfmooney commented Dec 1, 2023

With the UART fixes merged into both illumos-gate and stlouis, can we pare this down to only the softNPU fixes so it can be merged?

@rcgoodfellow rcgoodfellow merged commit da5ac5a into master Dec 1, 2023
10 checks passed
@pfmooney pfmooney deleted the softnpu-mgmt-reliability branch December 1, 2023 21:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants