Error/hang detection #123

vpodzime · 2024-11-26T16:49:36Z

Detects network errors and attempts to fix them with a network reset. Also detects being stuck waiting for a reboot and recovers by failing the deployment and resuming normal operation.
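For readers skimming the thread, the network half of the idea can be sketched roughly as below. This is a minimal illustration, not the code in this PR: the `sketch_` names, the limit of 3, and the stubbed reset function are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limit, chosen only for this example. */
#define SKETCH_NET_ERRORS_LIMIT 3

/* The counter is a uint8_t, so the limit has to fit in it. */
static_assert(SKETCH_NET_ERRORS_LIMIT >= 1 && SKETCH_NET_ERRORS_LIMIT <= UINT8_MAX,
              "limit must fit the counter type");

static uint8_t  net_errors = 0; /* network errors seen since the last success or reset */
static unsigned net_resets = 0; /* how many times the stub below was called */

/* Stand-in for a platform-specific network reset (assumption). */
static void sketch_network_reset(void) {
    net_resets++;
}

/* Record one failed network operation.
 * Returns true if the limit was reached and a network reset was triggered. */
bool sketch_net_error_add(void) {
    if (net_errors < UINT8_MAX) {
        net_errors++; /* saturate instead of wrapping around */
    }
    if (net_errors >= SKETCH_NET_ERRORS_LIMIT) {
        sketch_network_reset();
        net_errors = 0;
        return true;
    }
    return false;
}

/* A successful network operation clears the counter. */
void sketch_net_error_clear(void) {
    net_errors = 0;
}
```

With the limit of 3 assumed here, two failures are tolerated and the third one triggers the reset; any success in between starts the count over.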

vpodzime · 2024-11-26T16:53:56Z

I'm putting this up for an early evaluation of the idea. Do you agree this is the right way to go? Some notes from me:

* We could add locking, but I feel like it's not worth it because if we miss an error because of interleaving threads, it's not a big deal.
* We could make the code platform-specific and use atomic operations, but, again, not worth it for the same reason.
* Network issues are a good first example, what else should we try to detect?
* Should we return `MENDER_OK` to the scheduler while waiting for a reboot so that we have a (better) chance to detect that the reboot is not happening and we should fail the deployment? (introducing a `pending_reboot` counter)
* Should the limits be build-configurable?
* Should this functionality be optional?

Thanks in advance for feedback!

lluiscampos

* We could add locking, but I feel like it's not worth it because if we miss an error because of interleaving threads, it's not a big deal.
* We could make the code platform-specific and use atomic operations, but, again, not worth it for the same reason.

Both "tasks" run in the same thread (through the same workqueue), we should be safe in this aspect.

* Network issues are a good first example, what else should we try to detect?

I don't know...

* Should we return `MENDER_OK` to the scheduler while waiting for a reboot so that we have a (better) chance to detect that the reboot is not happening and we should fail the deployment? (introducing a `pending_reboot` counter)

Something along these lines, I think. Returning MENDER_DONE means no further re-schedule, right? So from there on we don't have any chance to detect these problems.
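To make the scheduling point concrete, here is a rough sketch of a work function that keeps getting scheduled while a reboot is pending and gives up after a fixed number of iterations. Everything in it is an assumption for illustration: the `sketch_` types and names, the state flags, and the wait limit of 5 are not taken from the PR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical return codes mirroring the OK/DONE distinction discussed above. */
typedef enum {
    SKETCH_OK,   /* schedule the work function again */
    SKETCH_DONE  /* do not schedule the work function again */
} sketch_ret_t;

/* Hypothetical number of iterations to wait for a reboot. */
#define SKETCH_REBOOT_WAIT_LIMIT 5

/* Keep the limit below UINT8_MAX so the counter cannot wrap before the check. */
static_assert(SKETCH_REBOOT_WAIT_LIMIT >= 1 && SKETCH_REBOOT_WAIT_LIMIT < UINT8_MAX,
              "wait limit must fit the counter type");

static bool    reboot_requested  = false;
static bool    deployment_failed = false;
static uint8_t pending_reboot    = 0; /* iterations spent waiting for the reboot */

sketch_ret_t sketch_work_function(void) {
    if (reboot_requested) {
        /* Nothing to do except check that we have not waited too long. */
        if (++pending_reboot > SKETCH_REBOOT_WAIT_LIMIT) {
            deployment_failed = true;  /* give up on this deployment */
            reboot_requested  = false; /* resume normal operation */
            pending_reboot    = 0;
        }
        /* Returning SKETCH_DONE here would stop rescheduling, and the
         * missing reboot could never be detected. */
        return SKETCH_OK;
    }

    /* Normal operation (checking for deployments, etc.) would go here. */
    return SKETCH_OK;
}
```

The trade-off is that the work function runs periodically even when it has nothing to do but count, which seems acceptable given how cheap the check is.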

* Should the limits be build-configurable?

If we take better care of the type casting of the limit (ref comment below) then I would say yes, because it is cheap and opens up more use cases. We don't have user feedback yet, but I can imagine systems where connections are expected to fail and users want an error tolerance above 100, and also the opposite: systems where a single failure is already enough to cancel an update 🤷
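One possible shape for a build-configurable limit that also handles the type concern is sketched below. The macro name, the default of 10, and the `uint8_t` counter type are assumptions for the example; the project may well prefer a different configuration mechanism.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical build-time knob; override with e.g. -DSKETCH_NET_ERRORS_LIMIT=1 */
#ifndef SKETCH_NET_ERRORS_LIMIT
#define SKETCH_NET_ERRORS_LIMIT 10
#endif

/* Assumed counter type for this sketch. */
typedef uint8_t sketch_counter_t;

/* Reject limits that would not survive the cast to the counter type,
 * so a value such as 300 fails the build instead of silently truncating. */
static_assert((SKETCH_NET_ERRORS_LIMIT) >= 1 && (SKETCH_NET_ERRORS_LIMIT) <= UINT8_MAX,
              "SKETCH_NET_ERRORS_LIMIT must be in the range 1..255");

/* True once the counter has reached the configured limit. */
static inline bool sketch_limit_reached(sketch_counter_t count) {
    return count >= (sketch_counter_t)(SKETCH_NET_ERRORS_LIMIT);
}
```

Setting the limit to 1 covers the "single failure cancels the update" case, while a wider counter type would be needed for tolerances above 255.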

* Should this functionality be optional?

Is the motivation to save ROM space? I would not make it optional, as we believe this is a safety mechanism needed for robust updates, but I would still make the limit configurable so that it can be set to 1 (see above).

Thanks in advance for feedback!

You are welcome 🍻

If reboot was requested, but it hasn't come in a pre-defined number of iterations to wait, we need to fail the deployment and resume normal operation. This also means we need to tell the scheduler to run the work function even if it is just waiting for a reboot and has nothing to do. Nothing else than checking if it has not been waiting for too long. Ticket: MEN-7555 Changelog: none Signed-off-by: Vratislav Podzimek <[email protected]>

danielskinstad

Nice, I haven't tested it because of the current problems with downloading artifacts, but it looks good to me

vpodzime · 2024-11-29T14:03:07Z

Nice, I haven't tested it because of the current problems with downloading artifacts

Same here 😓

larsewi

Very nice 🚀 Should the `mender_err_count_*` functions be deduped and instead have an enum for which counter to operate on? Might save some static memory?
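A rough sketch of what the enum-based variant could look like is given below. The names, the two counters, and their limits are hypothetical and only illustrate the shape of the suggestion; whether it actually saves memory would need to be measured.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical counter identifiers. */
typedef enum {
    SKETCH_COUNTER_NET,
    SKETCH_COUNTER_REBOOT,
    SKETCH_COUNTER_COUNT /* number of counters, keep last */
} sketch_counter_id_t;

/* One array of counters and one array of limits instead of per-counter globals. */
static uint8_t       counters[SKETCH_COUNTER_COUNT];
static const uint8_t limits[SKETCH_COUNTER_COUNT] = {
    [SKETCH_COUNTER_NET]    = 3, /* hypothetical limits */
    [SKETCH_COUNTER_REBOOT] = 5,
};

/* Increment the given counter; returns true once its limit has been reached. */
bool sketch_counter_add(sketch_counter_id_t id) {
    assert((unsigned)id < SKETCH_COUNTER_COUNT);
    if (counters[id] < UINT8_MAX) {
        counters[id]++; /* saturate instead of wrapping around */
    }
    return counters[id] >= limits[id];
}

/* Clear the given counter, e.g. after a successful operation. */
void sketch_counter_reset(sketch_counter_id_t id) {
    assert((unsigned)id < SKETCH_COUNTER_COUNT);
    counters[id] = 0;
}
```

A single pair of functions replaces one add/reset pair per counter, at the cost of an extra argument at each call site.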

vpodzime requested review from lluiscampos, larsewi and danielskinstad November 26, 2024 16:49

lluiscampos reviewed Nov 27, 2024

chore: Return MENDER_DONE when waiting for reboot in mender_client_wo…

9d45251

…rk_function() This tells the scheduler that this work is done and should not be scheduled again. Ticket: MEN-7555 Changelog: none Signed-off-by: Vratislav Podzimek <[email protected]>

vpodzime force-pushed the master-error_counters branch 3 times, most recently from 923c4a0 to c4b318c Compare November 28, 2024 14:08

chore: Try to detect too many network issues and reset network

50fdd39

Instead of being stuck in the non-working setup. Ticket: MEN-7555 Changelog: none Signed-off-by: Vratislav Podzimek <[email protected]>

vpodzime force-pushed the master-error_counters branch from c4b318c to 67793c0 Compare November 28, 2024 16:24

vpodzime changed the title ~~WIP: error detection~~ Error/hang detection Nov 28, 2024

vpodzime requested a review from lluiscampos November 28, 2024 16:28

vpodzime force-pushed the master-error_counters branch from 67793c0 to 830dc33 Compare November 28, 2024 16:30

danielskinstad approved these changes Nov 29, 2024

larsewi approved these changes Nov 29, 2024
