fix for job manager crash: Unable to contact slurm controller #255

Merged
trz42 merged 6 commits into EESSI:develop
Feb 22, 2024

laraPPr
Collaborator

@laraPPr laraPPr commented Feb 21, 2024

No description provided.

@laraPPr laraPPr marked this pull request as draft February 21, 2024 13:17
@laraPPr
Collaborator Author

laraPPr commented Feb 21, 2024

I was just thinking that this PR might not work in production when Slurm crashes while a job is running, because current_jobs would then be empty. I think I have already run into this case here.

That earlier case was not caused by the fix in this PR; something else happened. The squeue command did not fail, it simply produced no output.

This was in the log:

[20240201-T13:44:00] run_cmd(): Result for running '/usr/bin/squeue --clusters=ALL --long --noheader --user=vsc46128' in 'None
           stdout 'shinx                28189575     shinx bot-buil vsc46128  PENDING       0:00 1-00:00:00      1 (Resources)
'
           stderr ''
           exit code 0
[20240201-T13:44:00] job manager main loop: current_jobs='28189575'
[20240201-T13:44:00] job manager main loop: new_jobs=''
[20240201-T13:44:00] job manager main loop: running_jobs=''
[20240201-T13:44:00] job manager main loop: finished_jobs=''
[20240201-T13:44:00] job manager main loop: sleep 60 seconds
[20240201-T13:45:00] job manager main loop: iteration 16
[20240201-T13:45:00] job manager main loop: known_jobs='28189575'
[20240201-T13:45:00] run_subprocess(): 'get_current_jobs(): squeue command' by running '/usr/bin/squeue --clusters=ALL --long --noheader --user=vsc46128' in directory '/kyukon/scratch/gent/461/vsc46128/EESSI/eessi-bot-software-layer'
[20240201-T13:45:01] run_cmd(): Result for running '/usr/bin/squeue --clusters=ALL --long --noheader --user=vsc46128' in 'None
           stdout ''
           stderr ''
           exit code 0
[20240201-T13:45:01] job manager main loop: current_jobs=''
[20240201-T13:45:01] job manager main loop: new_jobs=''
[20240201-T13:45:01] job manager main loop: running_jobs=''
[20240201-T13:45:01] job manager main loop: finished_jobs='28189575'
[20240201-T13:45:01] process_finished_job(): os.rename(/scratch/gent/vo/000/gvo00002/vsc46128/software_bot/jobs/submitted/28189575,/scratch/gent/vo/000/gvo00002/vsc46128/software_bot/jobs/finished/28189575)
[20240201-T13:45:01] Found metadata file at /scratch/gent/vo/000/gvo00002/vsc46128/software_bot/jobs/finished/28189575/_bot_job28189575.result
[20240201-T13:45:01] process_finished_job(): finished job 28189575
########
comment_description: <details><summary>:cry: FAILURE _(click triangle for details)_</summary><dl><dt>_Details_</dt><dd>:white_check_mark: job output file <code>slurm-28189575.out</code><br/>:x: found message matching <code>ERROR: </code><br/>:white_check_mark: no message matching <code>FAILED: </code><br/>:white_check_mark: no message matching <code> required modules missing:</code><br/>:x: no message matching <code>No missing installations</code><br/></dd><dt>_Artefacts_</dt><dd><details><summary><code> ScaFaCoS/1.0.4</code></summary></details></dd></dl></details>
########

[20240201-T13:45:01] Found metadata file at /scratch/gent/vo/000/gvo00002/vsc46128/software_bot/jobs/finished/28189575/_bot_job28189575.metadata
[20240201-T13:45:01] process_finished_job(): pr comment id 1921220010
[20240201-T13:45:04] job manager main loop: sleep 60 seconds

[20240201-T13:47:04] run_cmd(): Result for running '/usr/bin/squeue --clusters=ALL --long --noheader --user=vsc46128' in 'None
           stdout 'shinx                28189579     shinx INTERACT vsc46128  RUNNING       0:17   1:00:00      1 node4223.shinx.os
'
           stderr ''
           exit code 0
[20240201-T13:47:04] job manager main loop: current_jobs='28189579'
[20240201-T13:47:04] job manager main loop: new_jobs='28189579'
[20240201-T13:47:04] run_subprocess(): 'process_new_job(): scontrol command' by running '/usr/bin/scontrol --oneliner show jobid 28189579 --clusters=shinx' in directory '/kyukon/scratch/gent/461/vsc46128/EESSI/eessi-bot-software-layer'
[20240201-T13:47:05] run_cmd(): Result for running '/usr/bin/scontrol --oneliner show jobid 28189579 --clusters=shinx' in 'None
           stdout 'JobId=28189579 JobName=INTERACTIVE UserId=vsc46128(2546128) GroupId=vsc46128(2546128) MCS_label=N/A Priority=1388 Nice=0 Account=gvo00002 QOS=normal JobState=RUNNING Reason=None Dependency=(null) Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0 RunTime=00:00:18 TimeLimit=01:00:00 TimeMin=N/A SubmitTime=2024-02-01T13:46:47 EligibleTime=2024-02-01T13:46:47 AccrueTime=Unknown StartTime=2024-02-01T13:46:47 EndTime=2024-02-01T14:46:47 Deadline=N/A SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-02-01T13:46:47 Scheduler=Main Partition=shinx AllocNode:Sid=gligar07:130300 ReqNodeList=(null) ExcNodeList=(null) NodeList=node4223.shinx.os BatchHost=node4223.shinx.os NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:* TRES=cpu=1,mem=1980M,node=1,billing=1 Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=* MinCPUsNode=1 MinMemoryCPU=1980M MinTmpDiskNode=0 Features=(null) DelayBoot=00:00:00 OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null) Command=(null) WorkDir=/kyukon/scratch/gent/vo/000/gvo00002/vsc46128/software_bot/jobs/2024.01/pr_4/event_e32a3ed0-bace-11ee-964e-d76fd2f1478e/run_000/RHEL8_zen4-ib-bot/install Comment=stdout=/kyukon/scratch/gent/vo/000/gvo00002/vsc46128/software_bot/jobs/2024.01/pr_4/event_e32a3ed0-bace-11ee-964e-d76fd2f1478e/run_000/RHEL8_zen4-ib-bot/install/slurm-28189579.out  Power= 
'
           stderr ''
           exit code 0
[20240201-T13:47:05] process_new_job(): work dir of job 28189579: '/kyukon/scratch/gent/vo/000/gvo00002/vsc46128/software_bot/jobs/2024.01/pr_4/event_e32a3ed0-bace-11ee-964e-d76fd2f1478e/run_000/RHEL8_zen4-ib-bot/install'
[20240201-T13:47:05] No metadata file found at /kyukon/scratch/gent/vo/000/gvo00002/vsc46128/software_bot/jobs/2024.01/pr_4/event_e32a3ed0-bace-11ee-964e-d76fd2f1478e/run_000/RHEL8_zen4-ib-bot/install/_bot_job28189579.metadata.
[20240201-T13:47:05] No metadata file found at /kyukon/scratch/gent/vo/000/gvo00002/vsc46128/software_bot/jobs/2024.01/pr_4/event_e32a3ed0-bace-11ee-964e-d76fd2f1478e/run_000/RHEL8_zen4-ib-bot/install/_bot_job28189579.metadata for job 28189579, so skipping it
[20240201-T13:47:05] job manager main loop: running_jobs=''
[20240201-T13:47:05] job manager main loop: finished_jobs=''
[20240201-T13:47:05] job manager main loop: sleep 60 seconds
[20240201-T13:48:05] job manager main loop: iteration 19
[20240201-T13:48:05] job manager main loop: known_jobs=''
[20240201-T13:48:05] run_subprocess(): 'get_current_jobs(): squeue command' by running '/usr/bin/squeue --clusters=ALL --long --noheader --user=vsc46128' in directory '/kyukon/scratch/gent/461/vsc46128/EESSI/eessi-bot-software-layer' 
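
The log above is consistent with the job manager deriving job states by comparing the job IDs it already knows with the IDs that squeue currently reports. A minimal sketch of that comparison follows. It is an assumption inferred from the log lines, and the helper name `classify_jobs` is hypothetical rather than taken from `eessi_bot_job_manager.py`.

```python
def classify_jobs(known_jobs, current_jobs):
    """Split job IDs into new and finished ones by comparing two polls.

    Hypothetical helper for illustration only; the bot's real code may differ.
    """
    known = set(known_jobs)
    current = set(current_jobs)
    # Jobs squeue reports that were not seen before are treated as new.
    new_jobs = sorted(current - known)
    # Jobs seen before that squeue no longer reports are treated as finished.
    finished_jobs = sorted(known - current)
    return new_jobs, finished_jobs


# Iteration 16 in the log: squeue exited with code 0 but printed nothing.
new_jobs, finished_jobs = classify_jobs(["28189575"], [])
print(new_jobs, finished_jobs)
```

Under this assumption, an empty but successful squeue result makes every known job look finished, which matches `finished_jobs='28189575'` in the log even though the job was still pending one minute earlier.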

@laraPPr laraPPr marked this pull request as ready for review February 21, 2024 13:30
@laraPPr
Collaborator Author

laraPPr commented Feb 21, 2024

So this fix will keep the bot from crashing, but the bot's output in the PR could be wrong if Slurm crashes while the bot is handling jobs.

Contributor

@trz42 trz42 left a comment

I'd put the line https://github.com/laraPPr/eessi-bot-software-layer/blob/d458da17fd20a1aaed17132e929c96cb847b2ca5/eessi_bot_job_manager.py#L684 into a try-except block. In the except clause, I'd just let the bot end the iteration with code as in https://github.com/laraPPr/eessi-bot-software-layer/blob/d458da17fd20a1aaed17132e929c96cb847b2ca5/eessi_bot_job_manager.py#L740..L746 and then let it continue with the next iteration of the main loop (maybe with an entry in the log file).

Alternatively (better/easier/cleaner), we could move the code that lets the bot pause for poll_interval seconds to the start of the loop (executing it only from the second iteration onwards, so the bot does not pause before its first iteration) and increment the loop counter i before the continue in the except clause.
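
The alternative described above could look roughly like the following sketch. All names (`run_main_loop`, `get_current_jobs`, `process_jobs`, `log`) and the use of `RuntimeError` as the failure signal are assumptions made for illustration; this is not the code that was merged.

```python
import time


def run_main_loop(get_current_jobs, process_jobs, poll_interval,
                  max_iterations, log=print):
    """Poll for jobs, skipping an iteration when the scheduler cannot be queried.

    Hypothetical sketch; names and error type are not from the bot's code.
    """
    i = 0
    known_jobs = {}
    while max_iterations < 0 or i < max_iterations:
        # Pause at the start of the loop, but not before the first iteration.
        if i > 0:
            time.sleep(poll_interval)
        # Incrementing here has the same effect as incrementing before `continue`.
        i += 1
        try:
            current_jobs = get_current_jobs()
        except RuntimeError as err:
            # Assumed failure mode: the squeue wrapper raises when Slurm
            # cannot be contacted. known_jobs is left untouched.
            log(f"iteration {i}: could not get current jobs ({err}); skipping")
            continue
        known_jobs = process_jobs(known_jobs, current_jobs)
    return known_jobs
```

Because the sleep sits at the top of the loop, the `continue` in the except clause cannot bypass it, so a controller outage does not turn into a tight retry loop. Leaving `known_jobs` unchanged on failure also means no job is reclassified based on a query that never succeeded.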

Contributor

@trz42 trz42 left a comment

Thanks for the update @laraPPr !

I added two small suggestions.

Contributor

@trz42 trz42 left a comment

Nice. Thanks @laraPPr !

@trz42 trz42 merged commit b03baa1 into EESSI:develop Feb 22, 2024
7 checks passed