Error/hang detection #123
Conversation
I'm putting this up for an early evaluation of the idea. Do you agree this is the right way to go? Some notes from me:
Thanks in advance for feedback!
> * We could add locking, but I feel like it's not worth it because if we miss an error because of interleaving threads, it's not a big deal.
> * We could make the code platform-specific and use atomic operations, but, again, not worth it for the same reason.
Both "tasks" run in the same thread (through the same work queue), so we should be safe in this respect.
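The lock-free counter discussed above could look something like the following minimal sketch. The names (`mender_err_count_net_*`) and the limit value are illustrative assumptions, not the actual API from this PR; the point is that no locking or atomics are needed because both the code incrementing the counter and the code checking it run on the same work queue.

```c
/* Hypothetical sketch of a single-threaded network-error counter.
 * Identifiers and the limit are illustrative, not the PR's real API.
 * No locking: increment and check both run on the same work queue,
 * so the counter is never touched concurrently. */
#include <stdbool.h>
#include <stdint.h>

#define NET_ERROR_LIMIT 5 /* assumed threshold */

static uint8_t net_error_count = 0;

/* Called whenever a network operation fails. */
void mender_err_count_net_inc(void) {
    if (net_error_count < UINT8_MAX) { /* saturate, don't wrap */
        net_error_count++;
    }
}

/* Called after a successful network operation. */
void mender_err_count_net_reset(void) {
    net_error_count = 0;
}

/* Called by the work function to decide whether to reset networking. */
bool mender_err_count_net_check(void) {
    return net_error_count >= NET_ERROR_LIMIT;
}
```

Even if an increment were ever lost to interleaving, the only consequence would be one extra iteration before the limit is hit, which matches the "not a big deal" reasoning above.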
> * Network issues are a good first example, what else should we try to detect?
I don't know...
> * Should we return `MENDER_OK` to the scheduler while waiting for a reboot so that we have a (better) chance to detect that the reboot is not happening and we should fail the deployment? (introducing a `pending_reboot` counter)
Something along these lines, I think. Returning `MENDER_DONE` means no further re-schedule, right? So from then on we don't have any chance to detect these problems.
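The `pending_reboot` counter idea could be sketched roughly as follows. The enum values, the function name, and the limit are simplified assumptions for illustration; the key point is that returning `MENDER_OK` keeps the work scheduled, which is what gives us a chance to notice the reboot never happened.

```c
/* Sketch of the pending-reboot watchdog discussed above. mender_err_t,
 * the return values, and the limit are simplified assumptions here. */
#include <stdbool.h>

typedef enum { MENDER_OK, MENDER_DONE, MENDER_FAIL } mender_err_t;

#define PENDING_REBOOT_LIMIT 3 /* assumed iterations to wait */

static unsigned pending_reboot_count = 0;

/* Returning MENDER_OK keeps the work scheduled, so we get called again
 * and can notice that the requested reboot never happened. */
mender_err_t deployment_work_function(bool reboot_requested) {
    if (reboot_requested) {
        if (++pending_reboot_count > PENDING_REBOOT_LIMIT) {
            /* Reboot did not happen in time: fail the deployment and
             * resume normal operation (details omitted). */
            return MENDER_FAIL;
        }
        return MENDER_OK; /* keep being rescheduled while waiting */
    }
    pending_reboot_count = 0;
    return MENDER_DONE; /* done, no further re-schedule */
}
```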
> * Should the limits be build-configurable?
If we take better care of the type casting of the limit (ref. comment below), then I would say yes, because it is cheap and opens up more use cases. We don't have user feedback yet, but I can imagine systems where connections are expected to fail and a tolerance of, say, > 100 errors is wanted, but also the opposite: systems where a single failure is already enough to cancel an update 🤷
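One cheap way to make the limit build-configurable, sketched below under the assumption of a hypothetical `CONFIG_MENDER_NET_ERROR_LIMIT` option (the name is made up for illustration): a compile-time default that the build system can override, with the comparison done in one explicit unsigned type to sidestep the casting concern.

```c
/* Sketch: a build-configurable error limit with a compile-time default,
 * overridable with e.g. -DCONFIG_MENDER_NET_ERROR_LIMIT=1.
 * The option name is hypothetical. */
#include <stdbool.h>
#include <stdint.h>

#ifndef CONFIG_MENDER_NET_ERROR_LIMIT
#define CONFIG_MENDER_NET_ERROR_LIMIT 5
#endif

/* Counter type wide enough for any sane configured limit; comparing in
 * a single unsigned type avoids the type-casting pitfall mentioned in
 * the review. */
static uint16_t net_error_count;

static inline bool net_error_limit_reached(void) {
    return net_error_count >= (uint16_t)CONFIG_MENDER_NET_ERROR_LIMIT;
}
```

This covers both extremes mentioned above: a build with the limit set to 1 cancels the update on the first failure, while a tolerant build can set it well above 100.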
> * Should this functionality be optional?
The motivation being saving ROM space? I would not make it optional, as we believe this is a safety mechanism to have for robust updates, but I would still make the limit configurable so that it can be set to 1 (see above).
> Thanks in advance for feedback!
You are welcome 🍻
…rk_function()

This tells the scheduler that this work is done and should not be scheduled again.

Ticket: MEN-7555
Changelog: none
Signed-off-by: Vratislav Podzimek <[email protected]>
Instead of being stuck in the non-working setup.

Ticket: MEN-7555
Changelog: none
Signed-off-by: Vratislav Podzimek <[email protected]>
If a reboot was requested but hasn't come within a pre-defined number of iterations, we need to fail the deployment and resume normal operation. This also means we need to tell the scheduler to run the work function even if it is just waiting for a reboot and has nothing to do other than checking whether it has been waiting for too long.

Ticket: MEN-7555
Changelog: none
Signed-off-by: Vratislav Podzimek <[email protected]>
Nice. I haven't tested it because of the current problems with downloading artifacts, but it looks good to me.
Same here 😓
Very nice 🚀 Should the `mender_err_count_*` functions be deduped, with an enum selecting which counter to operate on instead? Might save some static memory?
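The dedup suggestion above could be sketched like this: a single set of counter functions parameterized by an enum, with the per-counter state and limits in small arrays. All names and limit values are illustrative assumptions, not the PR's actual API.

```c
/* Sketch of enum-parameterized error counters, replacing per-counter
 * mender_err_count_* function copies. Names and limits are
 * illustrative, not the actual API. */
#include <stdbool.h>
#include <stdint.h>

typedef enum {
    MENDER_ERR_COUNT_NET,
    MENDER_ERR_COUNT_REBOOT,
    MENDER_ERR_COUNT_MAX, /* number of counters */
} mender_err_counter_t;

static uint8_t err_counts[MENDER_ERR_COUNT_MAX];
static const uint8_t err_limits[MENDER_ERR_COUNT_MAX] = {
    [MENDER_ERR_COUNT_NET]    = 5, /* assumed defaults */
    [MENDER_ERR_COUNT_REBOOT] = 3,
};

void mender_err_count_inc(mender_err_counter_t c) {
    if (err_counts[c] < UINT8_MAX) { /* saturate, don't wrap */
        err_counts[c]++;
    }
}

void mender_err_count_reset(mender_err_counter_t c) {
    err_counts[c] = 0;
}

bool mender_err_count_exceeded(mender_err_counter_t c) {
    return err_counts[c] >= err_limits[c];
}
```

One code path instead of N near-identical function triples, and the state shrinks to two small arrays, which is where the static-memory saving would come from.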
Detection of network errors and an attempt to fix them by resetting the network, plus detection of being stuck waiting for a reboot and recovering from it by failing the deployment and resuming normal operation.