Is it possible to give a maximum runtime per job? #91
Comments
Looks like there isn't a Dataflow feature for this, see https://stackoverflow.com/a/51410414. The responder there is Google's Dataflow product manager, so I take the response to be pretty authoritative, albeit a few years old. The suggested workaround probably wouldn't be desirable using the GitHub Action, as it would require the Action's runner process to stay blocked from closing until the Dataflow job timed out. But perhaps the general approach of a supervisor ("zombie killer" 🧟) process is transferable. Admittedly the idea of an additional operational layer is less appealing than a simple Beam pipeline option, but in this case it doesn't seem like there's an alternative. A lightweight implementation could be a GitHub Workflow running a "zombie hunting" cron job? 🤔
BTW, I cancel all running jobs on Dataflow with the following command:
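A plausible version of such a command, using the gcloud CLI (the region and the `value(id)` format key are assumptions, not details from the comment above):

```bash
# Plausible sketch: cancel every active Dataflow job in the project.
# Assumes an authenticated gcloud; --region and the value(id) format key
# may need adjusting for your setup.
gcloud dataflow jobs list \
  --status=active \
  --region=us-central1 \
  --format="value(id)" \
  | xargs gcloud dataflow jobs cancel --region=us-central1
```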
That's a helpful command, @yuvipanda! Is it correct to understand that a custom list of job IDs could be passed as well (instead of just all jobs) by doing something like the following?
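For illustration, such a call might look like the sketch below; the job IDs and region are hypothetical placeholders:

```bash
# Hypothetical job IDs shown for illustration; `gcloud dataflow jobs cancel`
# accepts one or more JOB_ID arguments.
gcloud dataflow jobs cancel \
  2023-01-01_00_00_00-1111111111111111111 \
  2023-01-01_00_00_00-2222222222222222222 \
  --region=us-central1
```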
Interesting idea. I think that could work if we can make it configurable per recipe module or recipe ID; otherwise it will be hard to find a sensible 'time_after_job_is_a_zombie' value.
@cisaacstern yes, you can filter as you wish. If we were to stick to bash, the easiest would be to filter in the list command.
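As a sketch of that kind of bash-side filtering (the name pattern and region are assumptions), gcloud's `--filter` flag can narrow the listing before piping the IDs on to `cancel`:

```bash
# Sketch: list only active jobs whose name starts with a given recipe prefix,
# then reuse the same xargs-to-cancel pattern as above.
gcloud dataflow jobs list \
  --status=active \
  --region=us-central1 \
  --filter="name~^cmip6" \
  --format="value(id)"
```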
Implemented a zombie killer gh workflow here. Works like a charm.
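The linked workflow isn't reproduced here; a minimal sketch of such a step, assuming an authenticated gcloud, GNU `date`, and placeholder values for the region and the 24-hour threshold, could be:

```bash
# Sketch of a zombie-hunting step: cancel active Dataflow jobs created more
# than 24 hours ago. REGION and the threshold are placeholders; `date -d` is
# the GNU form available on ubuntu runners.
REGION="us-central1"
CUTOFF="$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)"

# Note: `gcloud dataflow jobs cancel` requires at least one JOB_ID, so this
# pipeline errors when nothing matches (see the graceful-exit question below).
gcloud dataflow jobs list \
  --status=active \
  --region="${REGION}" \
  --created-before="${CUTOFF}" \
  --format="value(id)" \
  | xargs gcloud dataflow jobs cancel --region="${REGION}"
```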
A bit off-topic, but I wonder if there is a way to more gracefully end the job if there are no IDs to cancel (example)?
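One possible approach (a sketch, not the linked example): collect the IDs first and exit cleanly with a log message when there is nothing to cancel.

```bash
# Sketch: skip the cancel step (and exit successfully) when no jobs match.
zombie_ids="$(gcloud dataflow jobs list \
  --status=active \
  --region=us-central1 \
  --created-before="$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --format='value(id)')"

if [ -z "${zombie_ids}" ]; then
  echo "No zombie jobs found; nothing to cancel."
  exit 0
fi

# Word splitting of the ID list into separate arguments is intentional here.
gcloud dataflow jobs cancel ${zombie_ids} --region=us-central1
```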
I separated #46 out, and will close this now.
I have found several instances where jobs on Dataflow get 'stuck', meaning they run for much longer than expected.
So far this seems to be an intermittent problem, but when I am submitting hundreds of jobs, cancelling them manually becomes tedious.
Is there a way to give a time after which a job is force-killed, set at the point of submission to Dataflow? Presumably the user would want to manage this on a per-job basis. For instance, my CMIP6 recipes will rarely take longer than a few minutes, but e.g. @cisaacstern's climsim recipe often takes a few days.