Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[dv, dvsim] Add job queue/lic wait timeout in addition to run timeout #24817

Merged
merged 1 commit into from
Oct 29, 2024

Conversation

venkatk-ot
Copy link
Contributor

The timeout_mins specified in the test list json file or on dvsim command
line is intended to be a timeout on the wall time elapsed after the job
has started running.It is not intended to count in the time spent in waiting for a compute node and/or a license to become available. Despite best efforts to specify license resource strings, the availability of a compute node doesn't always guarantee the availability of a license as this scheme only works when all jobs from all projects running on the cluster adhere to this license requesting scheme which sadly can't be enforced in a fool proof manner.
In the proposed scheme, the file size as a proxy to determine if the job has really started running. While this is not one hundred percent accurate, it prevents false timeouts on jobs/tests that haven't even had a chance to start running because of unavailability of compute or license
resources. The file size threshold to determine whether the job has
started running has been chosen after carefully analyzing log files of jobs executed successfuly and jobs that were killed while waiting

@johngt
Copy link

johngt commented Oct 18, 2024

Assigning to @luismarques to forward assign. @rswarbrick is away.
There other person that may be able to look into this is @hcallahan-lowrisc . Whoever gets to it first and can check on it please remove the other from the reviewers.

util/dvsim/NcLauncher.py Outdated Show resolved Hide resolved
util/dvsim/NcLauncher.py Outdated Show resolved Hide resolved
util/dvsim/NcLauncher.py Show resolved Hide resolved
util/dvsim/NcLauncher.py Outdated Show resolved Hide resolved
util/dvsim/NcLauncher.py Outdated Show resolved Hide resolved
util/dvsim/NcLauncher.py Outdated Show resolved Hide resolved
util/dvsim/NcLauncher.py Outdated Show resolved Hide resolved
util/dvsim/NcLauncher.py Outdated Show resolved Hide resolved
util/dvsim/NcLauncher.py Outdated Show resolved Hide resolved
util/dvsim/NcLauncher.py Outdated Show resolved Hide resolved
@venkatk-ot venkatk-ot force-pushed the nc_updates branch 5 times, most recently from 9ee2375 to 0418975 Compare October 23, 2024 14:05
Copy link
Contributor

@rswarbrick rswarbrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple more small comments, but this makes sense to me.

Because GitHub "helpfully" collapses long lists of review comments, it's worth mentioning that I've got a note about the gui term on line 171.

util/dvsim/NcLauncher.py Outdated Show resolved Hide resolved
util/dvsim/NcLauncher.py Outdated Show resolved Hide resolved
Copy link
Contributor

@rswarbrick rswarbrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some really minor comments, but this looks good to me.

util/dvsim/NcLauncher.py Outdated Show resolved Hide resolved
util/dvsim/NcLauncher.py Outdated Show resolved Hide resolved
util/dvsim/NcLauncher.py Outdated Show resolved Hide resolved
The timeout_mins specified in the test list json file or on dvsim command
 line is intended to be a timeout on the wall time elapsed after the job
has started running.It is not intended to count in the time spent in
waiting for a compute node and/or a license to become available.
Despite best efforts to specify license resource strings, the
availability of a compute node doesn't always guarantee the availability
of a license as this scheme only works when all jobs from all projects
running on the cluster adhere to this license requesting scheme which
sadly can't be enforced in a fool proof manner.
In the proposed scheme, the file size as a proxy to determine if the job
has really started running. While this is not one hundred percent
accurate, it prevents false timeouts on jobs/tests that haven't even had
a chance to start running because of unavailability of compute or license
 resources. The file size threshold to determine whether the job has
started running has been chosen after carefully analyzing log files of
jobs executed successfuly and jobs that were killed while waiting

Signed-off-by: Venkat Krishnan <[email protected]>
Copy link
Contributor

@rswarbrick rswarbrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good, thanks!

@moidx moidx merged commit c721723 into lowRISC:master Oct 29, 2024
41 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants