
Is it possible to give a maximum runtime per job? #91

Closed
jbusecke opened this issue Aug 22, 2023 · 8 comments

@jbusecke
Contributor

I have found several instances where jobs on Dataflow get 'stuck', meaning they run for much longer than expected.
So far this seems to be an intermittent problem, but when I am submitting hundreds of jobs, cancelling them manually becomes tedious.
Is there a way to specify, at the point of submission to Dataflow, a maximum runtime after which a job is force-killed? Presumably the user would want to manage this on a per-job basis. For instance, my CMIP6 recipes will rarely take longer than a few minutes, but e.g. @cisaacstern's climsim recipe often takes a few days.

@cisaacstern
Member

Looks like there isn't a Dataflow feature for this, see https://stackoverflow.com/a/51410414. The responder there is Google's Dataflow product manager, so I take the response to be pretty authoritative, albeit a few years old. The suggested workaround probably wouldn't be desirable with the GitHub Action, as it would require the Action's runner process to stay blocked until the Dataflow job timed out. But perhaps the general approach of a supervisor ("zombie killer" 🧟) process is transferable. Admittedly, the idea of an additional operational layer is less appealing than a simple Beam pipeline option, but in this case there doesn't seem to be an alternative. A lightweight implementation could be a GitHub Workflow running a "zombie hunting" cron job? 🤔

@yuvipanda
Collaborator

BTW, I cancel all running jobs on Dataflow with the following command:

❯ gcloud dataflow jobs list --status=active --format=json | jq -r ".[].id" | xargs -L1 gcloud dataflow jobs cancel --region=us-central1

@cisaacstern
Member

That's a helpful command, @yuvipanda!

Is it correct to understand that a custom list of job IDs could be passed as well (instead of all jobs) by doing something like the following?

gcloud dataflow jobs cancel --region=us-central1 $SPACE_DELIMITED_JOBID_LIST

@jbusecke
Contributor Author

A lightweight implementation could be a GitHub Workflow running a "zombie hunting" cron job? 🤔

Interesting idea. I think if we can make the threshold specific per recipe module or recipe ID, that could work; otherwise it will be hard to find a sensible 'time_after_job_is_a_zombie' value.

@yuvipanda
Collaborator

@cisaacstern yes, you can filter as you wish. If we were to stick to bash, the easiest approach would be to filter in the jq step, which is pretty powerful. But if this were to be systematized, the easiest route is to read the output of that JSON command into Python, filter however you need, and then cancel the matching jobs.
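
For the systematized route, a minimal sketch might look like the following, assuming the gcloud CLI is installed and authenticated on the machine running it. The name-prefix filter is just a hypothetical example, and the JSON field names should be checked against the actual gcloud output:

import json
import subprocess

REGION = "us-central1"  # region used in the commands above

def list_active_jobs():
    """Return the active Dataflow jobs as parsed JSON."""
    result = subprocess.run(
        ["gcloud", "dataflow", "jobs", "list",
         "--status=active", f"--region={REGION}", "--format=json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)

def cancel_jobs(job_ids):
    """Cancel the given Dataflow job IDs (no-op if the list is empty)."""
    if job_ids:
        subprocess.run(
            ["gcloud", "dataflow", "jobs", "cancel",
             f"--region={REGION}", *job_ids],
            check=True,
        )

# Hypothetical filter: cancel only jobs whose name starts with a given prefix.
to_cancel = [j["id"] for j in list_active_jobs()
             if j.get("name", "").startswith("some-recipe-prefix")]
cancel_jobs(to_cancel)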

@jbusecke
Contributor Author

Implemented a zombie-killer GitHub workflow here. Works like a charm.
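
For reference, the core logic of such a zombie killer could be sketched roughly as below (an illustration, not the actual linked workflow): list active jobs, compare each job's age against a per-recipe maximum runtime, and cancel anything over its limit. The per-recipe thresholds, the region, and the creation-time field name are all assumptions to be checked against the real jobs' JSON.

import json
import subprocess
from datetime import datetime, timezone

REGION = "us-central1"  # assumed region

# Hypothetical per-recipe maximum runtimes, keyed by job-name prefix.
MAX_RUNTIME_HOURS = {
    "cmip6": 1,     # CMIP6 recipes rarely take more than a few minutes
    "climsim": 96,  # climsim can legitimately run for days
}
DEFAULT_MAX_HOURS = 24

def max_hours_for(name):
    """Look up the allowed runtime for a job based on its name prefix."""
    for prefix, hours in MAX_RUNTIME_HOURS.items():
        if name.startswith(prefix):
            return hours
    return DEFAULT_MAX_HOURS

def age_hours(job):
    """Age of a job in hours; verify the exact timestamp field in your gcloud JSON."""
    created = datetime.fromisoformat(job["createTime"].replace("Z", "+00:00"))
    return (datetime.now(timezone.utc) - created).total_seconds() / 3600

jobs = json.loads(subprocess.run(
    ["gcloud", "dataflow", "jobs", "list", "--status=active",
     f"--region={REGION}", "--format=json"],
    check=True, capture_output=True, text=True,
).stdout)

zombies = [j["id"] for j in jobs if age_hours(j) > max_hours_for(j.get("name", ""))]

if zombies:
    subprocess.run(
        ["gcloud", "dataflow", "jobs", "cancel", f"--region={REGION}", *zombies],
        check=True,
    )
else:
    # Exit cleanly when there is nothing to cancel (see the follow-up comment below).
    print("No zombie jobs to cancel.")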

@jbusecke
Contributor Author

jbusecke commented Sep 1, 2023

A bit off-topic, but I wonder if there is a way to end the job more gracefully when there are no IDs to cancel (example)?

@jbusecke
Contributor Author

jbusecke commented Sep 7, 2023

I separated #46 out, and will close this now.

@jbusecke closed this as completed on Sep 7, 2023.