Document instance IDs and placeholder usage better #645

Merged: 1 commit, Dec 4, 2023
12 changes: 8 additions & 4 deletions docs/jobs/failure.md
@@ -26,11 +26,15 @@ Sometimes a worker might crash while it is executing some task. In that case the
different worker and the task will begin executing from the beginning.

In order to let the executed application know that the same task is being executed repeatedly, HyperQueue assigns each
-execution a separate **Instance id**. It is a 32b non-negative number that identifies each (re-)execution of a task.
+execution a separate **Instance ID**. It is a 32b non-negative number that identifies each (re-)execution of a task.

-It is guaranteed that a newer execution of a task will have a larger instance id, however HyperQueue explicitly
-**does not** guarantee any specific values or differences between two ids. Each instance id is valid only for a particular
-task. Two different tasks may have the same instance id.
+It is guaranteed that a newer execution of a task will have a larger instance ID, however HyperQueue explicitly
+**does not** guarantee any specific values or differences between two IDs. Each instance ID is valid only for a particular
+task. Two different tasks may have the same instance ID.
+
+Instance IDs can be useful e.g. when a task is restarted, and you want to distinguish the output of the first execution
+and the restarted execution (by default, HQ will overwrite the standard output/error file of the first execution). You
+can instead create a separate stdout/stderr file for each task execution using the [instance ID placeholder](jobs.md#placeholders).
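
Editorial sketch (not part of the diff): because only the ordering of instance IDs is guaranteed, a task that wants to detect a re-execution cannot compare against a fixed value such as 0. One possible approach, assuming the task's working directory survives the restart, is to remember the last instance ID seen. The state file name and the `is_restart` helper are hypothetical; only the `HQ_*` variables come from the documentation.

```bash
#!/usr/bin/env bash
# Sketch: detect a re-execution by remembering the last seen instance ID.
# The ":-0" fallbacks only allow the script to run outside of HyperQueue.
set -eu

instance_id="${HQ_INSTANCE_ID:-0}"
# Hypothetical state file, one per task so that tasks do not clash.
state_file="${STATE_FILE:-./last-instance-${HQ_JOB_ID:-0}-${HQ_TASK_ID:-0}}"

is_restart() {
    # True if an earlier execution recorded a smaller instance ID.
    [ -f "$state_file" ] && [ "$(cat "$state_file")" -lt "$instance_id" ]
}

if is_restart; then
    echo "re-execution detected (previous instance $(cat "$state_file"), current $instance_id)"
else
    echo "no earlier execution recorded (instance $instance_id)"
fi

# Record the current instance ID for any later re-execution.
echo "$instance_id" > "$state_file"
```

The comparison uses `-lt` rather than equality with a specific number, which matches the guarantee that a newer execution has a larger instance ID while specific values are unspecified.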

## Task array failures
By default, when a single task of a [task array](arrays.md) fails, the computation of the job will continue.
37 changes: 22 additions & 15 deletions docs/jobs/jobs.md
@@ -110,12 +110,12 @@ $ hq submit --env KEY1=VAL1 --env KEY2=VAL2 ...

Each executed task will also automatically receive the following environment variables:

-| Variable name | Explanation |
-|------------------|-------------------------------------------------------------------|
-| `HQ_JOB_ID` | Job id |
-| `HQ_TASK_ID` | Task id |
-| `HQ_INSTANCE_ID` | [Instance id](failure.md#task-restart) |
-| `HQ_RESOURCE_...`| A set of variables related to allocated [resources](resources.md) |
+| Variable name | Explanation |
+|-------------------|-------------------------------------------------------------------|
+| `HQ_JOB_ID` | Job id |
+| `HQ_TASK_ID` | Task id |
+| `HQ_INSTANCE_ID` | [Instance id](failure.md#task-restart) |
+| `HQ_RESOURCE_...` | A set of variables related to allocated [resources](resources.md) |
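
Editorial sketch (not part of the diff): a task script could read the variables from the table to tag its log lines. The `hq_label` helper and the `unknown` fallback are hypothetical and only serve to keep the snippet runnable outside of HyperQueue.

```bash
#!/usr/bin/env bash
# Sketch: build a log prefix from the variables HyperQueue sets for each task.
hq_label() {
    # "unknown" is a hypothetical fallback for runs outside of HyperQueue.
    printf 'job=%s task=%s instance=%s' \
        "${HQ_JOB_ID:-unknown}" "${HQ_TASK_ID:-unknown}" "${HQ_INSTANCE_ID:-unknown}"
}

echo "[$(hq_label)] task starting"
```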

### Time management
You can specify two time-related parameters when submitting a job. They will be applied to each task of the submitted job.
@@ -186,16 +186,23 @@ Placeholders are enclosed in curly braces (`{}`) and prefixed with a percent (`%

You can use the following placeholders:

-| Placeholder | Will be replaced by | Available for |
-|------------------|-----------------------------------------------|----------------------------------|
-| `%{JOB_ID}` | Job ID | `stdout`, `stderr`, `cwd`, `log` |
-| `%{TASK_ID}` | Task ID | `stdout`, `stderr`, `cwd` |
-| `%{INSTANCE_ID}` | [Instance ID](failure.md#task-restart) | `stdout`, `stderr`, `cwd` |
-| `%{SUBMIT_DIR}` | Directory from which the job was submitted. | `stdout`, `stderr`, `cwd`, `log` |
-| `%{CWD}` | Working directory of the task. | `stdout`, `stderr` |
-| `%{SERVER_UID}` | Server unique ID (a string of length 6)[^uid] | `stdout`, `stderr`, `cwd`, `log` |
+| Placeholder | Will be replaced by | Available for |
+|------------------|---------------------------------------------|----------------------------------|
+| `%{JOB_ID}` | Job ID | `stdout`, `stderr`, `cwd`, `log` |
+| `%{TASK_ID}` | Task ID | `stdout`, `stderr`, `cwd` |
+| `%{INSTANCE_ID}` | [Instance ID](failure.md#task-restart) | `stdout`, `stderr`, `cwd` |
+| `%{SUBMIT_DIR}` | Directory from which the job was submitted. | `stdout`, `stderr`, `cwd`, `log` |
+| `%{CWD}` | Working directory of the task. | `stdout`, `stderr` |
+| `%{SERVER_UID}` | Unique server ID. | `stdout`, `stderr`, `cwd`, `log` |

-[^uid] Server generates a random `SERVER_UID` string every time a new server is started (`hq server start`).
+`SERVER_UID` is a random string that is unique for each new server execution (each `hq server start` gets a separate value).
+
+As an example, if you wanted to include the [Instance ID](failure.md#task-restart) in the `stdout` path (to
+distinguish the individual outputs of restarted tasks), you can use placeholders like this:
+
+```bash
+$ hq submit --stdout '%{CWD}/job-%{JOB_ID}/%{TASK_ID}-%{INSTANCE_ID}.stdout' ...
+```
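
Editorial sketch (not part of the diff): to see what paths such a template could produce, the substitution can be imitated locally with `sed`. This is only an illustration of the effect, not HyperQueue's implementation. The `expand_template` helper and all values passed to it are hypothetical, and values containing `|` or `&` would need escaping first.

```bash
#!/usr/bin/env bash
# Sketch: imitate placeholder substitution to preview the resulting paths.
expand_template() {
    # Arguments (hypothetical example inputs):
    #   $1 template, $2 working directory, $3 job ID, $4 task ID, $5 instance ID
    printf '%s\n' "$1" | sed \
        -e "s|%{CWD}|$2|g" \
        -e "s|%{JOB_ID}|$3|g" \
        -e "s|%{TASK_ID}|$4|g" \
        -e "s|%{INSTANCE_ID}|$5|g"
}

template='%{CWD}/job-%{JOB_ID}/%{TASK_ID}-%{INSTANCE_ID}.stdout'

# Two executions of the same task get different files. The instance IDs 0 and 1
# are made up; HyperQueue only guarantees that the later one is larger.
expand_template "$template" /tmp/work 5 2 0   # prints /tmp/work/job-5/2-0.stdout
expand_template "$template" /tmp/work 5 2 1   # prints /tmp/work/job-5/2-1.stdout
```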

## State
At any moment in time, each task and job has a specific *state* that represents what is currently happening to it. You