Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error/hang detection #123

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

vpodzime
Copy link
Collaborator

@vpodzime vpodzime commented Nov 26, 2024

Detection of network errors and an attempt to fix them by network reset plus detection of being stuck waiting for a reboot and getting away from it by failing the deployment and resuming normal operation.

@vpodzime
Copy link
Collaborator Author

vpodzime commented Nov 26, 2024

I'm putting this up for an early evaluation of the idea. Do you agree this is the right way to go? Some notes from me:

  • We could add locking, but I feel like it's not worth it because if we miss an error because of interleaving threads, it's not a big deal.
  • We could make the code platform-specific and use atomic operations, but, again, not worth it for the same reason.
  • Network issues are a good first example, what else should we try to detect?
  • Should we return MENDER_OK to the scheduler while waiting for a reboot so that we have a (better) chance to detect that the reboot is not happening and we should fail the deployment? (introducing a pending_reboot counter)
  • Should the limits be build-configurable?
  • Should this functionality be optional?

Thanks in advance for feedback!

Copy link
Collaborator

@lluiscampos lluiscampos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

* We could add locking, but I feel like it's not worth it because if we miss an error because of interleaving threads, it's not a big deal.
* We could make the code platform-specific and use atomic operations, but, again, not worth it for the same reason.

Both "tasks" run in the same thread (through the same workqueue), we should be safe in this aspect.

* Network issues are a good first example, what else should we try to detect?

I don't know...

* Should we return `MENDER_OK` to the scheduler while waiting for a reboot so that we have a (better) chance to detect that the reboot is not happening and we should fail the deployment? (introducing a `pending_reboot` counter)

Something along these lines, I think. Returning MENDER_DONE means no further re-schedule, right? So from there on we don't have any chance to detect these problems.

* Should the limits be build-configurable?

If we take better care of the type casting of the limit (ref comment below) then I would say yes because it is cheap and opens up for more user cases. We don't have user feecback yet, but I can imagine systems where the connections are expected to fail and maybe they want like > 100 error tolerance but also the opposite systems where a single failure is already enough to cancel an update 🤷

* Should this functionality be optional?

The motivation being for saving ROM space? I would not leave it optional, as we believe this is a safety mechanism to have for robust updates, but still make the limit configurable so that can be set to 1 (see above)

Thanks in advance for feedback!

You are welcome 🍻

core/src/mender-error-counters.c Show resolved Hide resolved
…rk_function()

This tells the scheduler that this work is done and should not be
scheduled again.

Ticket: MEN-7555
Changelog: none
Signed-off-by: Vratislav Podzimek <[email protected]>
@vpodzime vpodzime force-pushed the master-error_counters branch 3 times, most recently from 923c4a0 to c4b318c Compare November 28, 2024 14:08
Instead of being stuck in the non-working setup.

Ticket: MEN-7555
Changelog: none
Signed-off-by: Vratislav Podzimek <[email protected]>
@vpodzime vpodzime changed the title WIP: error detection Error/hang detection Nov 28, 2024
If reboot was requested, but it hasn't come in a pre-defined
number of iterations to wait, we need to fail the deployment and
resume normal operation.

This also means we need to tell the scheduler to run the work
function even if it is just waiting for a reboot and has nothing
to do. Nothing else than checking if it has not been waiting for
too long.

Ticket: MEN-7555
Changelog: none
Signed-off-by: Vratislav Podzimek <[email protected]>
Copy link
Collaborator

@danielskinstad danielskinstad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, I haven't tested it because of the current problems with downloading artifacts, but it looks good to me

@vpodzime
Copy link
Collaborator Author

Nice, I haven't tested it because of the current problems with downloading artifacts

Same here 😓

Copy link
Collaborator

@larsewi larsewi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice 🚀 Should the mender_err_count_* functions be deduped and instead instead have an enum for which counter to operate on? Might save some static memory?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants