-
Notifications
You must be signed in to change notification settings - Fork 792
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[dv, dvsim] Add job queue/lic wait timeout in addition to run timeout #24817
Conversation
3be4254
to
b589300
Compare
Assigning to @luismarques to forward assign. @rswarbrick is away. |
9ee2375
to
0418975
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple more small comments, but this makes sense to me.
Because GitHub "helpfully" collapses long lists of review comments, it's worth mentioning that I've got a note about the gui term on line 171.
0418975
to
a1a7f74
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some really minor comments, but this looks good to me.
The timeout_mins specified in the test list json file or on dvsim command line is intended to be a timeout on the wall time elapsed after the job has started running.It is not intended to count in the time spent in waiting for a compute node and/or a license to become available. Despite best efforts to specify license resource strings, the availability of a compute node doesn't always guarantee the availability of a license as this scheme only works when all jobs from all projects running on the cluster adhere to this license requesting scheme which sadly can't be enforced in a fool proof manner. In the proposed scheme, the file size as a proxy to determine if the job has really started running. While this is not one hundred percent accurate, it prevents false timeouts on jobs/tests that haven't even had a chance to start running because of unavailability of compute or license resources. The file size threshold to determine whether the job has started running has been chosen after carefully analyzing log files of jobs executed successfuly and jobs that were killed while waiting Signed-off-by: Venkat Krishnan <[email protected]>
a1a7f74
to
dd71989
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks really good, thanks!
The timeout_mins specified in the test list json file or on dvsim command
line is intended to be a timeout on the wall time elapsed after the job
has started running.It is not intended to count in the time spent in waiting for a compute node and/or a license to become available. Despite best efforts to specify license resource strings, the availability of a compute node doesn't always guarantee the availability of a license as this scheme only works when all jobs from all projects running on the cluster adhere to this license requesting scheme which sadly can't be enforced in a fool proof manner.
In the proposed scheme, the file size as a proxy to determine if the job has really started running. While this is not one hundred percent accurate, it prevents false timeouts on jobs/tests that haven't even had a chance to start running because of unavailability of compute or license
resources. The file size threshold to determine whether the job has
started running has been chosen after carefully analyzing log files of jobs executed successfuly and jobs that were killed while waiting