
Do not fail job on error in GlobalJobPreLoad #15

Merged

Conversation

BigRoy
Contributor

@BigRoy BigRoy commented Jul 25, 2024

Changelog Description

Do not force the full job to fail - it may just be that this machine has a server connection timeout.
This allows a single machine to fail - yet others to continue in the queue.
This also allows a single machine to try again - since it'll use Deadline's default "mark bad after X errors" setting to decide when a worker is 'bad'.
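
For context, a minimal sketch of the intended behaviour (not the exact repository code - it assumes the pre-load wraps the environment injection in a helper along the lines of inject_ayon_environment, and that the previous behaviour failed the whole job via RepositoryUtils.FailJob):

from Deadline.Scripting import RepositoryUtils

def __main__(deadlinePlugin):
    try:
        inject_ayon_environment(deadlinePlugin)
    except Exception:
        # Before: the whole job was failed, taking every task with it.
        # RepositoryUtils.FailJob(deadlinePlugin.GetJob())
        # After: just re-raise so only this task on this worker errors.
        # Deadline's "mark bad after X errors" setting then decides when
        # the worker is considered bad, while other workers keep dequeuing.
        raise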

Additional info

Fix #14

Testing notes:

  1. Submit job to Deadline
  2. Turn off your AYON server (or do whatever to make the GlobalJobPreLoad fail)
  3. It should not instantly mark the full job and all its tasks "failed"

@BigRoy BigRoy added the type: enhancement Improvement of existing functionality or minor addition label Jul 25, 2024
@iLLiCiTiT
Member

NOTE: This should be tested on Linux machines too.

@BigRoy BigRoy requested review from kalisp and iLLiCiTiT July 25, 2024 18:17
@BigRoy
Contributor Author

BigRoy commented Jul 25, 2024

@fabiaserra - any chance you could roll this out on Linux machines and test?

@fabiaserra

> @fabiaserra - any chance you could roll this out on Linux machines and test?

Yeah, this is the exact same change I made on my fork when we started discussing it, but I haven't yet been able to run a test to confirm it's working. I will try soon.

@iLLiCiTiT
Member

BTW bump the version of the Deadline plugin please.

@BigRoy BigRoy self-assigned this Jul 25, 2024
@BigRoy
Contributor Author

BigRoy commented Jul 25, 2024

By the way - there may be cases where failing the full job still makes sense. E.g. this check:

    if ayon_publish_job == "1" and ayon_render_job == "1":
        raise RuntimeError(
            "Misconfiguration. Job couldn't be both render and publish."
        )

Or this check, if the AyonExecutable is not configured at all:

    if not exe_list:
        raise RuntimeError(
            "Path to AYON executable not configured."
            "Please set it in Ayon Deadline Plugin."
        )

Both will always fail regardless of the worker - since the misconfiguration lives on the job or the Deadline plugin, the result will be the same for every machine. So it may make sense to fail the job then?

Maybe this:

        if not all(add_kwargs.values()):
            raise RuntimeError((
                "Missing required env vars: AYON_PROJECT_NAME,"
                " AYON_FOLDER_PATH, AYON_TASK_NAME, AYON_APP_NAME"
            ))

This one may also make sense to always fail, since it should behave much the same across the workers/machines?
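
For those job-level misconfigurations, something along these lines could fail the whole job explicitly (a sketch only - fail_whole_job is a hypothetical helper; RepositoryUtils.FailJob is Deadline's scripting call for failing a job and deadlinePlugin is the object handed to GlobalJobPreLoad):

from Deadline.Scripting import RepositoryUtils

def fail_whole_job(deadlinePlugin, message):
    # The misconfiguration lives on the job/plugin settings, so every worker
    # would hit the exact same error - fail the job once instead of letting
    # each worker burn through its error count.
    RepositoryUtils.FailJob(deadlinePlugin.GetJob())
    raise RuntimeError(message)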


There are also cases where it may make sense to directly mark the Worker as bad for the job.

For example this:

        exe_list = get_ayon_executable()
        exe = FileUtils.SearchFileList(exe_list)

        if not exe:
            raise RuntimeError((
                "Ayon executable was not found in the semicolon "
                "separated list \"{}\". "
                "The path to the render executable can be configured"
                " from the Plugin Configuration in the Deadline Monitor."
            ).format(exe_list))

This may fail per worker, depending on whether that machine can find the executable at any of the listed paths.

There is a high likelihood that the machine will not find it on the next run either, so we could mark the worker "bad" for the job using RepositoryUtils.AddBadSlaveForJob...
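
Roughly like this (a sketch - the helper name is hypothetical, and it assumes the worker name and job are available from the plugin object passed to GlobalJobPreLoad):

from Deadline.Scripting import RepositoryUtils

def mark_worker_bad_for_job(deadlinePlugin):
    # Exclude this worker from the job so its tasks requeue on other
    # workers instead of this machine failing over and over.
    RepositoryUtils.AddBadSlaveForJob(
        deadlinePlugin.GetSlaveName(), deadlinePlugin.GetJob()
    )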

As such - we may do a follow-up PR or issue to be more specific with our raised errors. For example raising a dedicated error for when we should fail the job.

class AYONJobConfigurationError(RuntimeError):
    """Raised when we know the full job should fail because retrying on
    other machines would be worthless.

    This may be the case if e.g. not all of the required env vars are
    configured to inject the AYON environment.
    """

Or a dedicated error when we should mark the Worker as bad:

class AYONWorkerBadForJobError(RuntimeError):
    """When raised, the worker will be marked bad for the current job.

    This should be raised when we know that the machine will most likely
    also fail on subsequent tries.
    """

However - a server timeout should just let the task error and requeue, potentially on the same worker, so it can try again.
So most errors attributed to not being able to reach the server itself should not trigger such a hard failure.
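
Putting the two together, a follow-up could dispatch on the error type inside GlobalJobPreLoad - a sketch under the assumptions above (the injection helper name is illustrative; the RepositoryUtils calls are Deadline's standard scripting ones):

from Deadline.Scripting import RepositoryUtils

def __main__(deadlinePlugin):
    try:
        inject_ayon_environment(deadlinePlugin)
    except AYONJobConfigurationError:
        # No worker can succeed with a misconfigured job - fail it once.
        RepositoryUtils.FailJob(deadlinePlugin.GetJob())
        raise
    except AYONWorkerBadForJobError:
        # This machine will most likely keep failing - exclude it from the
        # job and let the tasks requeue on other workers.
        RepositoryUtils.AddBadSlaveForJob(
            deadlinePlugin.GetSlaveName(), deadlinePlugin.GetJob()
        )
        raise
    # Anything else (e.g. a server connection timeout) simply errors this
    # task; Deadline requeues it and applies "mark bad after X errors".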

@fabiaserra

I can confirm this works on Linux.

@BigRoy
Contributor Author

BigRoy commented Jul 25, 2024

Works for me on Windows - @kalisp @iLLiCiTiT ready for merge when you've decided next steps. :)

Member

@antirotor antirotor left a comment


Makes sense. I am wondering if we should handle exceptions with more granularity, because for example the code reading the JSON might fail because the file doesn't exist on that particular machine (but if that wasn't caught by the process handling above, we have a bug somewhere), among other reasons. This is putting responsibility on Deadline (which is fine), I am merely pointing out that the Exception being caught is too broad.
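
For illustration, more granular handling around such a JSON read could look roughly like this (a sketch only - the function and path variable are hypothetical, not the plugin's actual code):

import json

def read_job_metadata(metadata_path):
    """Read a JSON metadata file, distinguishing the expected failure modes."""
    try:
        with open(metadata_path, "r") as stream:
            return json.load(stream)
    except FileNotFoundError:
        # Missing file on this particular machine - likely an environment
        # or storage issue rather than a job misconfiguration.
        raise RuntimeError(
            "Metadata file not found: {}".format(metadata_path)
        )
    except json.JSONDecodeError as exc:
        # The file exists but its content is invalid - this points at a bug
        # or a corrupted write and will fail the same way on every worker.
        raise RuntimeError(
            "Invalid JSON in {}: {}".format(metadata_path, exc)
        )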

@BigRoy
Contributor Author

BigRoy commented Jul 29, 2024

A client also ended up reporting this worked for them and actually helped them - so, on to merging! :)

@BigRoy BigRoy merged commit 67fce57 into ynput:develop Jul 29, 2024
1 check passed
@BigRoy BigRoy deleted the enhancement/globaljobpreload_do_not_fail_job branch July 29, 2024 22:23
Labels
type: enhancement Improvement of existing functionality or minor addition
Development

Successfully merging this pull request may close these issues.

Failure with inject_environment should fail only task which triggered it
5 participants