New "resubmit" mechanism #649

Merged
merged 6 commits into from
Dec 12, 2023
16 changes: 16 additions & 0 deletions docs/jobs/arrays.md
@@ -100,3 +100,19 @@ If `items.json` contained this content:
```
then HyperQueue would create two tasks, one with `HQ_ENTRY` set to `{"batch_size": 4, "learning_rate": 0.01}`
and the other with `HQ_ENTRY` set to `{"batch_size": 8, "learning_rate": 0.001}`.

### Combining `--each-line`/`--from-json` with `--array`

The `--each-line` or `--from-json` option can be combined with the `--array` option.
In that case, only the selected subset of lines or JSON elements will be submitted.
If `--array` defines an ID that exceeds the number of lines in the file (or the number of elements in JSON), then the ID is silently removed.


For example:

```commandline
$ hq submit --each-line input.txt --array "2, 8-10"
```

If `input.txt` has sufficiently many lines, this creates an array job with four tasks: one for the 3rd line of the file and three for the 9th to 11th lines
(note that the first line has ID 0). `--from-json` works analogously.
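
The selection rule described above can be illustrated with a small Python sketch. It is not HyperQueue's implementation: the helper names `parse_array_spec` and `select_entries` are hypothetical, and the parser only handles the plain `N` and `N-M` forms used in the example.

```python
def parse_array_spec(spec: str) -> list[int]:
    """Expand a spec such as "2, 8-10" into [2, 8, 9, 10] (ranges are inclusive)."""
    ids = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-", 1)
            ids.extend(range(int(start), int(end) + 1))
        else:
            ids.append(int(part))
    return ids


def select_entries(lines: list[str], spec: str) -> dict[int, str]:
    """Map each task ID to its line (ID 0 is the first line).

    IDs that point past the end of the input are dropped without an error,
    mirroring the "silently removed" behavior described in the text.
    """
    return {i: lines[i] for i in parse_array_spec(spec) if i < len(lines)}


# A 12-line input: "line 1" .. "line 12"
lines = [f"line {n}" for n in range(1, 13)]
print(select_entries(lines, "2, 8-10"))
# With only 10 lines, ID 10 has no matching line and is dropped.
print(select_entries(lines[:10], "2, 8-10"))
```

With the 12-line input, the four selected entries are the 3rd, 9th, 10th and 11th lines; with the 10-line input only three tasks remain.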
43 changes: 28 additions & 15 deletions docs/jobs/failure.md
@@ -1,29 +1,42 @@
In distributed systems, failure is inevitable. This section describes how HyperQueue handles various types of failures
and how you can affect its behavior.

## Resubmitting jobs
When a job fails or is canceled, you might want to submit it again, without the need to pass all the original
parameters. You can achieve this using **resubmit**:
## Resubmitting array jobs
When a job fails or is canceled, you can submit it again.
However, in the case of [task arrays](arrays.md), different tasks may end in different states, and often we want to
recompute only the tasks with a specific status (e.g. failed tasks).

```bash
$ hq job resubmit <job-id>
Using the following combination of commands, you can recompute only the failed tasks. Let us assume that we want to recompute
all failed tasks in job 5:

```commandline
$ hq submit --array=`hq job task-ids 5 --filter=failed` ./my-computation
```
It works as follows: the command `hq job task-ids 5 --filter=failed` returns the IDs of the failed tasks of job `5`, and we pass
them to the `--array` parameter, which starts tasks only for the given IDs.

If we want to recompute all failed tasks and all canceled tasks we can do it as follows:

```commandline
$ hq submit --array=`hq job task-ids 5 --filter=failed,canceled` ./my-computation
```
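
To make the filtering step concrete, here is a minimal Python sketch of what "select the task IDs with a given status and hand them to `--array`" amounts to. It does not call HyperQueue: the `task_states` mapping, the helper names and the compact `1-3,7` output format are all assumptions made for illustration, and the actual output format of `hq job task-ids` is not shown in this document.

```python
def ids_with_status(task_states: dict[int, str], wanted: set[str]) -> list[int]:
    """Return the sorted IDs of tasks whose state is one of `wanted`."""
    return sorted(tid for tid, state in task_states.items() if state in wanted)


def to_array_spec(ids: list[int]) -> str:
    """Collapse sorted IDs into a compact string, e.g. [1, 2, 3, 7] -> "1-3,7"."""
    parts = []
    start = prev = None
    for tid in ids:
        if start is None:
            start = prev = tid
        elif tid == prev + 1:
            prev = tid  # extend the current run of consecutive IDs
        else:
            parts.append(str(start) if start == prev else f"{start}-{prev}")
            start = prev = tid
    if start is not None:
        parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


# Hypothetical final states of the tasks of one array job
task_states = {
    0: "finished", 1: "failed", 2: "failed", 3: "failed",
    4: "canceled", 5: "finished", 7: "failed",
}
print(to_array_spec(ids_with_status(task_states, {"failed"})))
print(to_array_spec(ids_with_status(task_states, {"failed", "canceled"})))
```

Filtering on several statuses at once simply widens the set of selected IDs, which is the effect of `--filter=failed,canceled` in the command above.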

It will create a new job that has the same configuration as the job with the entered job id.
Note that this also works with `--each-line` or `--from-json`, for example:

```commandline
# Original computation
$ hq submit --each-line=input.txt ./my-computation

This is especially useful for [task arrays](arrays.md). By default, `resubmit` will submit all tasks of the original job;
however, you can specify only a subset of tasks based on their [state](jobs.md#task-state):

```bash
$ hq job resubmit <job-id> --status=failed,canceled
# Resubmitting failed tasks
$ hq submit --each-line=input.txt --array=`hq job task-ids last --filter=failed` ./my-computation
```

Using this command you can resubmit e.g. only the tasks that have failed, without the need to recompute all tasks of
a large task array.

## Task restart
Sometimes a worker might crash while it is executing some task. In that case the server will reschedule that task to a
different worker and the task will begin executing from the beginning.

Sometimes a worker might crash while it is executing some task. In that case the server will automatically
reschedule that task to a different worker and the task will begin executing from the beginning.

In order to let the executed application know that the same task is being executed repeatedly, HyperQueue assigns each
execution a separate **Instance ID**. It is a 32-bit non-negative number that identifies each (re-)execution of a task.
@@ -43,7 +56,7 @@ You can change this behavior with the `--max-fails=<X>` option of the `submit` c
If specified, once more than `X` tasks fail, the rest of the job's tasks that were not completed yet will be canceled.

For example:
```bash
```commandline
$ hq submit --array 1-1000 --max-fails 5 ...
```
This will create a task array with `1000` tasks. Once `5` or more tasks fail, the remaining uncompleted tasks of the job