match_multi_request_cb: match failed due to match error (id=578897838080): Invalid argument #1229

Open
grondo opened this issue Jun 27, 2024 · 1 comment

Comments

@grondo

grondo commented Jun 27, 2024

Users have been seeing this error in their batch job logs frequently on elcap:

Jun 26 12:53:18.225473 PDT sched-fluxion-resource.err[0]: match_multi_request_cb: match failed due to match error (id=578897838080): Invalid argument
0.045s: job.exception type=alloc severity=0 alloc denied due to type="match error"

Here's some data from one such job:

timestamp of alloc and start events:

1719464345.070602 alloc
1719464354.300049 start

starttime and expiration in R:

# flux job info f24YW13Tx4Rm R | jq .execution.starttime
1719460761
# flux job info f24YW13Tx4Rm R | jq .execution.expiration
1719461361

The precise error was:

Jun 26 22:00:10.967952 PDT sched-fluxion-resource.err[0]: match_multi_request_cb: match failed due to match error (id=791666491392): Invalid argument
0.043s: job.exception type=alloc severity=0 alloc denied due to type="match error"

where

$ date +%s --date="Jun 26 22:00:10.967952 PDT"
1719464410

Some things to note:

The timestamps in R are off: the alloc event occurred 3584s after the starttime in R (and 2984s after the expiration in R), so the job was already expired by the time it ran. It sounds like this could cause the match request failure above, so the question is how this could happen.
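For reference, the offsets can be checked with a few lines of Python. This is only arithmetic on the values pasted above; the function and key names are made up for illustration and are not part of any Flux API.

```python
# Values copied from the job data in this issue (Unix epoch seconds).
ALLOC_EVENT = 1719464345.070602   # timestamp of the alloc event
R_STARTTIME = 1719460761          # .execution.starttime in R
R_EXPIRATION = 1719461361         # .execution.expiration in R


def r_offsets(alloc_event, r_starttime, r_expiration):
    """Return whole-second offsets between the alloc event and the R window.

    Hypothetical helper for illustration only.
    """
    alloc = int(alloc_event)  # drop the fractional part
    return {
        "r_duration": r_expiration - r_starttime,
        "alloc_after_starttime": alloc - r_starttime,
        "alloc_after_expiration": alloc - r_expiration,
    }


offsets = r_offsets(ALLOC_EVENT, R_STARTTIME, R_EXPIRATION)
for name, seconds in offsets.items():
    print(f"{name}: {seconds}s")
```

A positive `alloc_after_expiration` means the validity window recorded in R had already closed when the alloc event was posted.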

If the scheduler is stuck in one of its resource match death spirals, could that delay the alloc response to the job manager for this long? Could that explain the behavior here?

@grondo

grondo commented Sep 6, 2024

Users are seeing this error again in a slightly different scenario. Example:

  1. user submits batch job with 30m timelimit
  2. job is allocated an R with expiration of starttime (now) + 1800 = E1
  3. job-exec module gives small grace time of ~ expiration + (now - starttime) = E2
  4. fluxion initializes with graph end time of E1
  5. job gets stuck or is slower than expected
  6. after about 30m, a delayed flux run or flux start is issued, at some time between E1 and E2
  7. instead of a timeout exception, the job submission fails with match_multi_request_cb: match failed due to match error
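The window in the steps above can be sketched as follows. This is a rough model of the suspected sequence, not how job-exec or fluxion actually compute these values: the function name, its parameters, and the numbers in the example call are all hypothetical, and E2 follows the approximate formula from step 3.

```python
def classify_request(t, starttime, duration, exec_start):
    """Place a request time t relative to E1 and E2 (all values in seconds).

    Hypothetical model for illustration:
      E1 = starttime + duration          (expiration in R, fluxion graph end)
      E2 = E1 + (exec_start - starttime) (approximate job-exec grace deadline)
    """
    e1 = starttime + duration
    e2 = e1 + (exec_start - starttime)
    if t < e1:
        return "before E1"
    if t <= e2:
        return "between E1 and E2"
    return "after E2"


# Made-up numbers: 30m time limit, job-exec starts the job 10s after starttime,
# and a delayed flux run is issued 1805s after starttime.
print(classify_request(t=1805, starttime=0, duration=1800, exec_start=10))
```

With these made-up numbers the request lands between E1 and E2, which is the case where the job is still running but the graph end time has already passed.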

I will see if I can reproduce this scenario.

I'm also not sure why the duration-validator isn't rejecting the job from step 6 if this is indeed the sequence of events.
