Users have been seeing this error in their batch job logs frequently on elcap:
Jun 26 12:53:18.225473 PDT sched-fluxion-resource.err[0]: match_multi_request_cb: match failed due to match error (id=578897838080): Invalid argument
0.045s: job.exception type=alloc severity=0 alloc denied due to type="match error"
Here's some data from one such job:
timestamps of the alloc and start events:
1719464345.070602 alloc
1719464354.300049 start
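(For reference, these presumably came from the job eventlog; something like the following would pull them out, with the awk filter being just one way to slice it:)

```
$ flux job eventlog f24YW13Tx4Rm | awk '$2 == "alloc" || $2 == "start" {print $1, $2}'
1719464345.070602 alloc
1719464354.300049 start
```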
starttime and expiration in R:
# flux job info f24YW13Tx4Rm R | jq .execution.starttime
1719460761
# flux job info f24YW13Tx4Rm R | jq .execution.expiration
1719461361
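Spelling out the arithmetic (values copied from above; the shell wrapper is purely illustrative):

```
starttime=1719460761   # .execution.starttime from R
expiration=1719461361  # .execution.expiration from R
alloc=1719464345       # alloc event time, fractional part dropped

echo $((expiration - starttime))  # 600  -> R was only valid for 10 minutes
echo $((alloc - starttime))       # 3584 -> alloc landed ~1h after R's starttime
echo $((alloc - expiration))      # 2984 -> alloc landed ~50min after R expired
```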
The precise error was:
Jun 26 22:00:10.967952 PDT sched-fluxion-resource.err[0]: match_multi_request_cb: match failed due to match error (id=791666491392): Invalid argument
0.043s: job.exception type=alloc severity=0 alloc denied due to type="match error"
where
$ date +%s --date="Jun 26 22:00:10.967952 PDT"
1719464410
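which puts the failed match itself well past R's validity window:

```
$ echo $((1719464410 - 1719461361))   # error time minus R's expiration
3049
```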
Some things to note:
The timestamps in R are off: the alloc event came 3584s after the starttime in R, and 2984s after the expiration, so the job had already expired by the time it ran. It sounds like an expired R could cause the match request failure above, so the question is how the timestamps got this stale.
If the scheduler is stuck in one of its resource match death spirals, could that delay the alloc response to the job manager for this long? Could that explain the behavior here?
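One rough way to gauge that (a sketch only; flux jobs, flux job eventlog, and flux job info are standard flux-core commands, but the jq/awk plumbing here is an illustrative assumption) would be to survey recent jobs for the gap between R's starttime and the alloc event; healthy jobs should show a gap near zero:

```
# Sketch: report the alloc-vs-starttime gap for recent jobs. Large
# positive gaps would be consistent with a long-delayed alloc response.
for id in $(flux jobs -a -n -o "{id}" | head -50); do
    alloc=$(flux job eventlog "$id" 2>/dev/null \
            | awk '$2 == "alloc" {print int($1); exit}')
    t0=$(flux job info "$id" R 2>/dev/null | jq .execution.starttime)
    [ -n "$alloc" ] && [ -n "$t0" ] && echo "$id gap=$((alloc - t0))s"
done
```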