Can't find vertex on LC cluster #1305

Open
jameshcorbett opened this issue Oct 2, 2024 · 2 comments

@jameshcorbett
Member

The following messages are being repeated fairly regularly on an LC cluster, with different job IDs.

[ +14.021503] sched-fluxion-resource[0]: run_remove: dfu_traverser_t::remove (id=345577048527356928): add_or_update: couldn't find vertex in graph for cluster1029.
[ +14.022637] sched-fluxion-resource[0]: unpack_rank: failed unpacking rank for cluster1029.
[ +14.022648] sched-fluxion-resource[0]: add_or_update: couldn't find vertex in graph for cluster1030.
[ +14.022653] sched-fluxion-resource[0]: unpack_rank: failed unpacking rank for cluster1030.
[ +14.022658] sched-fluxion-resource[0]: add_or_update: couldn't find vertex in graph for cluster1036.
[ +14.022663] sched-fluxion-resource[0]: unpack_rank: failed unpacking rank for cluster1036.
[ +14.022668] sched-fluxion-resource[0]: add_or_update: couldn't find vertex in graph for cluster1004.
[ +14.022672] sched-fluxion-resource[0]: unpack_rank: failed unpacking rank for cluster1004.
[ +14.022678] sched-fluxion-resource[0]: add_or_update: couldn't find vertex in graph for cluster1015.
[ +14.022683] sched-fluxion-resource[0]: unpack_rank: failed unpacking rank for cluster1015.
[ +14.022688] sched-fluxion-resource[0]: add_or_update: couldn't find vertex in graph for cluster1010.
[ +14.022692] sched-fluxion-resource[0]: unpack_rank: failed unpacking rank for cluster1010.
[ +14.022697] sched-fluxion-resource[0]: add_or_update: couldn't find vertex in graph for cluster1003.
[ +14.022703] sched-fluxion-resource[0]: unpack_rank: failed unpacking rank for cluster1003.
[ +14.022707] sched-fluxion-resource[0]: add_or_update: couldn't find vertex in graph for cluster1002.
[ +14.022723] sched-fluxion-resource[0]: unpack_rank: failed unpacking rank for cluster1002.
[ +14.022731] sched-fluxion-resource[0]: add_or_update: couldn't find vertex in graph for cluster1031.
[ +14.022735] sched-fluxion-resource[0]: unpack_rank: failed unpacking rank for cluster1031.
[ +14.022739] sched-fluxion-resource[0]: add_or_update: couldn't find vertex in graph for cluster1023.
[ +14.022743] sched-fluxion-resource[0]: unpack_rank: failed unpacking rank for cluster1023.
[ +14.022747] sched-fluxion-resource[0]: add_or_update: couldn't find vertex in graph for cluster1039.
[ +14.022751] sched-fluxion-resource[0]: unpack_rank: failed unpacking rank for cluster1039.
[ +14.022755] sched-fluxion-resource[0]: add_or_update: couldn't find vertex in graph for cluster1013.
[ +14.022758] sched-fluxion-resource[0]: unpack_rank: failed unpacking rank for cluster1013.
[ +14.022762] sched-fluxion-resource[0]: add_or_update: couldn't find vertex in graph for cluster1040.
[ +14.022768] sched-fluxion-resource[0]: unpack_rank: failed unpacking rank for cluster1040.
[ +14.022772] sched-fluxion-resource[0]: add_or_update: couldn't find vertex in graph for cluster1032.
[ +14.022775] sched-fluxion-resource[0]: unpack_rank: failed unpacking rank for cluster1032.
[ +14.022779] sched-fluxion-resource[0]: add_or_update: couldn't find vertex in graph for cluster1014.
[ +14.022783] sched-fluxion-resource[0]: unpack_rank: failed unpacking rank for cluster1014.
[ +14.022787] sched-fluxion-resource[0]: add_or_update: couldn't find vertex in graph for cluster1041.
[ +14.022790] sched-fluxion-resource[0]: unpack_rank: failed unpacking rank for cluster1041.
[ +14.022794] sched-fluxion-resource[0]: add_or_update: couldn't find vertex in graph for cluster1037.
[ +14.022798] sched-fluxion-resource[0]: unpack_rank: failed unpacking
[ +14.022808] sched-fluxion-resource[0]: partial_cancel_request_cb: remove fails due to match error (id=345577048527356928): Invalid argument
[ +14.023561] sched-fluxion-qmanager[0]: remove: .free RPC partial cancel failed for jobid 345577048527356928: Invalid argument
[ +14.023577] sched-fluxion-qmanager[0]: jobmanager_free_cb: remove (queue=pdev id=345577048527356928): Invalid argument

Based on the JGF in the KVS, the vertices it can't find should be in the graph.

Interestingly, the job wasn't even using most of those nodes:

flux jobs 345577048527356928
       JOBID QUEUE    USER     NAME       ST NTASKS NNODES     TIME INFO
 foXVU52MaqD pdev     user1 run.sh      F      8      8   1.563m cluster[1027,1032-1033,1035-1039]
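
To check which of the hosts named in the errors actually belong to the job, the comparison can be scripted. The following is a minimal sketch, not part of Flux: the helper names (`expand_hostlist`, `missing_vertex_hosts`, `split_by_job`) are hypothetical, and the hostlist expansion only handles the simple single-bracket form shown in the `flux jobs` output above.

```python
import re

def expand_hostlist(hostlist):
    """Expand a simple 'prefix[a,b-c]' hostlist into a set of hostnames.

    Assumption: only the single-bracket form is handled, not the full
    hostlist grammar.
    """
    match = re.fullmatch(r"([^\[\]]+)\[([\d,-]+)\]", hostlist)
    if match is None:
        return {hostlist}  # plain hostname, nothing to expand
    prefix, ranges = match.groups()
    hosts = set()
    for part in ranges.split(","):
        first, _, last = part.partition("-")
        last = last or first  # a single number is a range of one
        for number in range(int(first), int(last) + 1):
            # Preserve zero padding based on the width of the range start.
            hosts.add(f"{prefix}{number:0{len(first)}d}")
    return hosts

def missing_vertex_hosts(log_text):
    """Collect hostnames named in "couldn't find vertex" log messages."""
    return set(re.findall(r"couldn't find vertex in graph for (\w+)", log_text))

def split_by_job(log_text, nodelist):
    """Return (reported hosts in the job, reported hosts outside the job)."""
    reported = missing_vertex_hosts(log_text)
    job_hosts = expand_hostlist(nodelist)
    return reported & job_hosts, reported - job_hosts

# Two lines taken from the log excerpt above, for illustration.
log_excerpt = """\
add_or_update: couldn't find vertex in graph for cluster1029.
add_or_update: couldn't find vertex in graph for cluster1032.
"""
in_job, not_in_job = split_by_job(log_excerpt, "cluster[1027,1032-1033,1035-1039]")
print("in job:", sorted(in_job))
print("not in job:", sorted(not_in_job))
```

Run against the full excerpt above, a check like this would flag cluster1032, cluster1036, cluster1037 and cluster1039 as appearing in both the errors and the job's nodelist, so the overlap may be partial rather than empty.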

@zekemorton @milroy any thoughts?

@garlick
Member

garlick commented Oct 2, 2024

Apologies if this is obvious, but just in case it isn't: that system is configured with match-format = "rv1_nosched" and has an R containing JGF. It was recently discovered that the R was not right, so it was regenerated and the scheduler was reloaded with the system idle. Well, almost idle: we discovered a stuck job in housekeeping from before this morning, which was causing flux resource list to report "duplicate allocation in housekeeping". The stuck nodes were rebooted before 9AM and that report stopped. However, the messages above still appear to be popping up.

Note also that the job_manager_free() remove error is sometimes reported as a protocol error and other times as an invalid argument, as above.

@jameshcorbett
Member Author

After changing the match format to rv1 and restarting Flux, errors are still occurring:

Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: run_remove: dfu_traverser_t::remove (id=364473021286581248): add_or_update: couldn't find vertex in graph for cluster1012.
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: unpack_rank: failed unpacking rank for cluster1012.
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: add_or_update: couldn't find vertex in graph for cluster1016.
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: unpack_rank: failed unpacking rank for cluster1016.
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: mod_agfilter: planner_multi_reduce_span returned -1.
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: rack1.
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: Invalid argument
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: mod_agfilter: planner_multi_reduce_span returned -1.
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: rack2.
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: Invalid argument
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: mod_agfilter: planner_multi_reduce_span returned -1.
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: rack1.
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: Invalid argument
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: mod_agfilter: planner_multi_reduce_span returned -1.
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: rack2.
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: Invalid argument
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: mod_agfilter: planner_multi_reduce_span returned -1.
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: rack3.
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: Invalid argument
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: upd_agfilter: planner_multi_add_span returned -1.
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: Invalid argument
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: upd_agfilter: planner_multi_add_span returned -1.
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: Invalid argument
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: upd_agfilter: planner_multi_add_span returned -1.
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: Invalid argument
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: upd_agfilter: planner_multi_add_span returned -1.
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: Invalid argument
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: upd_agfilter: planner_multi_add_span returned -1.
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: Invalid argument
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: upd_agfilter: planner_multi_add_span returned -1.
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: Invalid argument
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: upd_agfilter: planner_multi_add_span returned -1.
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: Invalid argument
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: upd_agfilter: planner_multi_add_span returned -1.
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: Invalid argument
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: upd_agfilter: planner_multi_add_span returned -1.
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: Invalid argument
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: upd_agfilter: planner_multi_add_span returned -1.
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: Invalid argument
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: mod_agfilter: planner_multi_reduce_span returned -1.
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: rack5.
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: Invalid argument
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: mod_agfilter: planner_multi_reduce_span returned -1.
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: rack6.
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: Invalid argument
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: mod_agfilter: planner_multi_reduce_span returned -1.
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: rack7.
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: Invalid argument
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: upd_agfilter: planner_multi_add_span returned -1.
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: Invalid argument
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: upd_agfilter: planner_multi_add_span returned -1.
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: Invalid argument
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: upd_agfilter: planner_multi_add_span returned -1.
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: Invalid argument
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: upd_agfilter: planner_multi_add_span returned -1.
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: Invalid argument
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: upd_agfilter: planner_multi_add_span returned -1.
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: Invalid argument
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: upd_agfilter: planner_multi_add_span returned -1.
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: Invalid argument
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-resource.err[0]: partial_cancel_request_cb: remove fails due to match error (id=364473021286581248): Invalid argument
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-qmanager.err[0]: remove: .free RPC partial cancel failed for jobid 364473021286581248: Invalid argument
Oct 15 09:42:56 cluster1 flux[1877160]: sched-fluxion-qmanager.err[0]: jobmanager_free_cb: remove (queue=pdev id=364473021286581248): Invalid argument
