double booking observed on cray system #1170
Information pulled for later debugging, partially redacted; contact me for an original.
The context for the two exceptions is important:
Note that the cancel is sent 343ms before the start of foy5qJFL9Qs. Somehow we're getting the cancel and releasing the resources before the epilog-finish and free events.
That is odd. I checked earlier, and it does not appear that fluxion subscribes to the […]
Are the timestamps milliseconds? Looking at some of the raw logs they seem to be seconds. Comparing with other logs, the cancel appears to have occurred around 5 minutes before the second job started. Does the log from exception 2 mean it was a user-initiated cancel, or is it coming from the job manager? I ask because there may be a path around the typical one:

flux-sched/resource/modules/resource_match.cpp, line 2023 in c8e03f8

which is a callback corresponding to the […]
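As a quick sanity check on the units (a sketch, not from the issue; `JOBID` is a placeholder): eventlog timestamps are floating-point seconds since the Unix epoch, so feeding one to `date` should print a plausible wall-clock time.

```sh
# Pull the first timestamp from the job's eventlog (JOBID is a placeholder).
TS=$(flux job info "$JOBID" eventlog | head -n 1 | jq -r '.timestamp')

# Seconds since the epoch should print a sensible date here;
# milliseconds would land far in the future.
date -d "@${TS%.*}"
```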
Is it possible we're getting the cancel event through qmanager from schedutil `sched.cancel` somehow? That code path does free up the resources, so it would actually produce this result. I was just digging through the RFC and looking at sched-simple, and it looks like it's a bit ill-defined what cancel is supposed to do when the job is already allocated. The RFC says something like "some of these stages may not be necessary and the job enters cleanup"; fluxion clears the job out immediately, while sched-simple removes it from the queue if it's waiting for resources and otherwise ignores the event entirely. It was logged as a user-initiated cancel, but there might be some other way to generate that which I don't know about. That said, I haven't found the path by which that would get invoked; it ties back quite far, but I haven't found the last link to either the job-exception event or a watch on the event log.
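One way to poke at this from the outside (a sketch, not from the issue; the sleep length is arbitrary, and epilog events only appear on systems with an epilog configured) is to raise a cancel-type exception on a job that is already running and watch the ordering of the exception, free, and clean events, plus anything the scheduler logs:

```sh
# Start a job that will sit in RUN state, and wait for it to start.
JOBID=$(flux submit -N1 -n1 sleep 300)
flux job wait-event "$JOBID" start

# Raise a cancel-type exception against the running job.
flux job raise --type=cancel "$JOBID"

# Watch the remaining events; the question is whether free shows up
# before the epilog/cleanup events or after them.
flux job wait-event -v "$JOBID" clean

# Scheduler-side log messages end up in the broker log.
flux dmesg | grep -i sched | tail
```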
Ah, and you're right, they're seconds. I clearly need some sleep; I'll look back in on this first thing in the morning, after I literally sleep on it.
The cancel request to the scheduler is supposed to cancel the […]. In this case, the […]
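For contrast, here is a rough way to exercise the pending case the cancel request is meant to cover (a sketch, not from the issue; it assumes a one-node test instance so the second exclusive job has to queue). The cancelled job should show an exception and clean event but never an alloc or free.

```sh
# Occupy the only node so the next job's alloc request stays pending.
RUNNING=$(flux submit -N1 --exclusive sleep 300)
flux job wait-event "$RUNNING" start

# This job queues behind the first; its alloc request is still outstanding.
PENDING=$(flux submit -N1 --exclusive hostname)

# Cancel the pending job; the scheduler should only drop the alloc request.
flux cancel "$PENDING"
flux job wait-event "$PENDING" clean

# Expect exception + clean, but no alloc/free, in the pending job's eventlog.
flux job info "$PENDING" eventlog | jq -r '.name'
flux cancel "$RUNNING"
```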
I was wondering if somehow a user-initiated cancel is bypassing qmanager and directly invoking […]
The cancel request to the scheduler might be cancelling the allocation. Examining the complex control flow starting with the code at:

flux-sched/qmanager/modules/qmanager.cpp, line 511 in c8e03f8
The control flow enters […]. Assuming the job was in […], from there it will eventually be popped in the […] immediately before it's canceled.

The EASY queue policy inherits from […]. I hope that combination of markdown and URLs isn't too confusing, and that I haven't misunderstood the […]
That's the correct origin in qmanager. However, in […]
Ok, I don't have a full answer, but here's at least some further tests and analysis. I started with this flow:

```sh
flux submit --wait-event epilog-start -N 1 -c 4 -n 1 /non-existant/path > id
flux job raise $(cat id)
flux submit --wait -N 1 --exclusive -n 1 hostname > id2
J1_EXCEPTION=$(flux job info $(cat id) eventlog | grep exception | tail -n 1 | jq -c '.timestamp')
echo JOB 1
flux job info $(cat id) eventlog | jq -c ".timestamp -= $J1_EXCEPTION"
echo JOB 2
flux job info $(cat id2) eventlog | jq -c ".timestamp -= $J1_EXCEPTION"
```

It produces the same flow of events that we see in the eventlog, except that it does not overlap. Clearly it's not that simple. If I add an explicit […]. I'm trying to come up with how to actually replicate the cancel; I'm getting the impression that this was a […]
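One variant worth trying (a sketch built on the flow above, not from the issue) swaps the raise for an explicit cancel on a running job, since the open question is whether an explicit cancel takes a different path through the scheduler:

```sh
# Same shape as the flow above, but cancel the running job explicitly
# instead of raising a generic exception, then submit the second job
# immediately and compare the eventlogs on the same timebase.
flux submit --wait-event start -N 1 -c 4 -n 1 sleep 300 > id
flux cancel $(cat id)
flux submit --wait -N 1 --exclusive -n 1 hostname > id2

J1_EXCEPTION=$(flux job info $(cat id) eventlog | grep exception | tail -n 1 | jq -c '.timestamp')
flux job info $(cat id) eventlog | jq -c ".timestamp -= $J1_EXCEPTION"
flux job info $(cat id2) eventlog | jq -c ".timestamp -= $J1_EXCEPTION"
```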
I don't think this was a […], i.e. the […]
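For reference, the exception's type, severity, userid, and note can be pulled straight from the eventlog (a sketch; the job ID is a placeholder for the job in question):

```sh
# Dump the exception event's context, including the (possibly empty) note.
flux job info "$JOBID" eventlog | jq -c 'select(.name == "exception") | .context'
```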
Yup, note's empty, so it must have been a raise. I'm getting progressively less sure how we got to this place, unless we ended up with a race condition. I can easily enough go in and make it so the cancel RPC doesn't clear resource allocations, but if we can't repro, I'm not sure that's what it is.
I've been banging on this every way I can think of, and so far I can't reproduce the scenario without a resource.cancel RPC coming from somewhere. The best I have right now is that the interface between qmanager and resource doesn't differentiate between […]. It should be relatively straightforward to split this so a cancel RPC has no code path to the cancel sched RPC. Pending a way to explicitly reproduce it, that's the most useful thing I can think of to do. At the least, it will make sure that the cancel RPC on qmanager only ever cancels a pending job.
Problem: two jobs were allocated the same resources.
The jobs in question are foxZW969KtT and foy5qJFL9Qs, both of which got allocations on cray8292 (rank 8048). Thanks @adambertsch for catching this. This was actually detected when `flux resource list` threw a duplicate allocation error (because the job manager now builds the allocated resource set from all the R fragments assigned to running jobs, and that failed). You can use

```sh
flux job info <jobid> R | jq
```

to view the resources assigned to each. On rank 8048, cores 80-95 were assigned to foxZW969KtT and cores 0-95 were assigned to foy5qJFL9Qs. Here is the scheduler config (excerpted from `flux config get | jq`).
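A concrete way to see the overlap (a sketch; this relies on the RFC 20 R_lite layout) is to pull R_lite for both jobs and compare the core sets on rank 8048:

```sh
# Show rank/core assignments for each job; the rank 8048 overlap
# (cores 80-95 vs 0-95) is visible directly in R_lite.
for job in foxZW969KtT foy5qJFL9Qs; do
    echo "== $job =="
    flux job info "$job" R | jq -c '.execution.R_lite[] | {rank, core: .children.core}'
done
```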