Do not fail job on error in GlobalJobPreLoad #15
Conversation
… server connection timeout. Fix ynput#14
NOTE: This should be tested on Linux machines too.
@fabiaserra - any chance you could roll this out on Linux machines and test?
Yeah, this is the exact same change I did on my fork when we started discussing it, but I haven't been able to do a test yet to confirm it's working. I will try soon.
BTW, please bump the version of the Deadline plugin.
By the way - there may be cases where failing the job still makes sense. E.g. this check:

```python
if ayon_publish_job == "1" and ayon_render_job == "1":
    raise RuntimeError(
        "Misconfiguration. Job couldn't be both render and publish."
    )
```

Or this one:

```python
if not exe_list:
    raise RuntimeError(
        "Path to AYON executable not configured."
        "Please set it in Ayon Deadline Plugin."
    )
```

These will always fail - since the value is set on the job or the Deadline plugin and the result will be the same for all machines. So it may make sense to fail the job then?

Maybe this:

```python
if not all(add_kwargs.values()):
    raise RuntimeError((
        "Missing required env vars: AYON_PROJECT_NAME,"
        " AYON_FOLDER_PATH, AYON_TASK_NAME, AYON_APP_NAME"
    ))
```

may also make sense to always fail, since it should behave quite similarly across the workers/machines?

There are also cases where it may make sense to directly mark the Worker as bad for the job. For example this:

```python
exe_list = get_ayon_executable()
exe = FileUtils.SearchFileList(exe_list)
if not exe:
    raise RuntimeError((
        "Ayon executable was not found in the semicolon "
        "separated list \"{}\"."
        "The path to the render executable can be configured"
        " from the Plugin Configuration in the Deadline Monitor."
    ).format(exe_list))
```

This may fail per worker, depending on whether it can find the exe at any of the paths. There is a high likelihood that that machine will not find it on the next run either?

As such - we may do a follow-up PR or issue to be more specific with our raised errors. For example, a dedicated error for when we should fail the job:

```python
class AYONJobConfigurationError(RuntimeError):
    """An error which, when raised, means the full job should fail
    and retrying by other machines will be worthless.

    This may be the case if e.g. not all the required env vars are configured
    to inject the AYON environment.
    """
```

Or a dedicated error for when we should mark the Worker as bad:

```python
class AYONWorkerBadForJobError(RuntimeError):
    """When raised, the worker will be marked bad for the current job.

    This should be raised when we know that the machine will most likely
    also fail on subsequent tries.
    """
```

However - a server timeout should allow the job to just error and let it requeue with the same worker, so it can try again.
I can confirm this works on Linux.
Works for me on Windows - @kalisp @iLLiCiTiT ready for merge when you've decided next steps. :)
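As a concrete illustration of the dedicated error types proposed earlier in this conversation, here is a minimal, hypothetical sketch of how they could be dispatched from GlobalJobPreLoad. This is not part of this PR: `inject_ayon_environment()` stands in for the actual environment-injection logic, and the `RepositoryUtils.FailJob()`, `GetJob()` and `LogWarning()` calls reflect one reading of the Deadline scripting API, so they should be verified against the plugin code.

```python
from Deadline.Scripting import RepositoryUtils


class AYONJobConfigurationError(RuntimeError):
    """The whole job is misconfigured; retrying on other workers is pointless."""


class AYONWorkerBadForJobError(RuntimeError):
    """This worker will most likely keep failing on this job."""


def __main__(deadlinePlugin):
    try:
        inject_ayon_environment(deadlinePlugin)  # hypothetical helper
    except AYONJobConfigurationError:
        # The misconfiguration is identical for every worker, so fail the whole
        # job instead of letting other machines retry pointlessly.
        RepositoryUtils.FailJob(deadlinePlugin.GetJob())
        raise
    except AYONWorkerBadForJobError as exc:
        # This machine is unlikely to succeed on a retry; a follow-up could mark
        # the worker bad for the job here. For now, surface the error so
        # Deadline's "mark bad after X errors" setting kicks in sooner.
        deadlinePlugin.LogWarning(str(exc))
        raise
    # Any other exception (e.g. a server connection timeout) simply propagates,
    # erroring this task only and letting it requeue - possibly on the same worker.
```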
Makes sense. I am wondering if we should handle exceptions with more granularity, because for example the code reading the json might fail because the file doesn't exist on that particular machine (but if that wasn't caught by the process handling above, we have a bug somewhere). This is putting responsibility on Deadline (which is fine), I am merely pointing out that the Exception being caught is too broad.
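To illustrate the granularity point about the json reading - a small sketch, not code from this PR; the function name and path handling are purely illustrative:

```python
import json


def read_job_metadata(path):
    """Read a json file the pre-load depends on, distinguishing failure modes.

    A missing file on this particular machine and a corrupt file are reported
    separately, instead of being swallowed by a broad `except Exception`.
    """
    try:
        with open(path, "r") as stream:
            return json.load(stream)
    except FileNotFoundError:
        raise RuntimeError(
            "Metadata file '{}' does not exist on this worker.".format(path)
        )
    except json.JSONDecodeError as exc:
        raise RuntimeError(
            "Metadata file '{}' is not valid JSON: {}".format(path, exc)
        )
```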
A client also ended up reporting this worked for them and actually helped them - so, on to merging! :)
Changelog Description
Do not force the full job to fail - it may be just that this machine has a server connection timeout.
This allows a single machine to fail - yet others to continue in the queue.
This also allows a single machine to try again - since it'll use Deadline's default "mark bad after X errors" setting to decide when a worker is 'bad'.
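For context, the behavior change can be summarized with a rough sketch - this is not the actual diff; the helper name and the `FailJob()` call only approximate the previous logic and reuse the names from the sketch earlier in the conversation:

```python
# Before (approximate): any error while injecting the AYON environment forced
# the whole job to fail, so a single worker's server connection timeout could
# stop every other machine in the queue.
try:
    inject_ayon_environment(deadlinePlugin)  # hypothetical helper
except Exception:
    RepositoryUtils.FailJob(deadlinePlugin.GetJob())
    raise

# After (approximate): the error simply propagates. Only this task errors and
# requeues, other workers keep rendering, and Deadline's "mark bad after X
# errors" setting decides when this worker should stop retrying.
inject_ayon_environment(deadlinePlugin)
```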
Additional info
Fix #14
Testing notes: