-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix for job manager crash: Unable to contact slurm controller #255
Conversation
Just was thinking that this pr might not work in production when it crashes during a running job. Because This did not happen because of the fix in this pr but something else happened. The squeue command was not failing but was just not outputting anything. This was in the log:
|
So this fix will keep the bot from crashing but the output of the bot in the pr could be wrong if slurm crashes when the bot is handling jobs at that moment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd put the line https://github.com/laraPPr/eessi-bot-software-layer/blob/d458da17fd20a1aaed17132e929c96cb847b2ca5/eessi_bot_job_manager.py#L684 into a try-except
block. In the except
clause I'd just let the bot end the iteration with code as in https://github.com/laraPPr/eessi-bot-software-layer/blob/d458da17fd20a1aaed17132e929c96cb847b2ca5/eessi_bot_job_manager.py#L740..L746 and then let it continue
to the next iteration in the main loop (maybe with an entry in the log file).
Alternatively (better/easier/cleaner), we move the code that lets the bot pause for poll_interval
seconds to the start of the loop (but only execute it from the 2nd iteration to not pause the bot in the first iteration already) and increment the loop counter i
before continue
in the except
clause.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update @laraPPr !
I added two small suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice. Thanks @laraPPr !
No description provided.