Add documentation of the tag system (#326)
* start tag docs

* add tags documentation
mattrasmus authored Oct 18, 2023
1 parent 7883038 commit bcc3370
Showing 3 changed files with 128 additions and 0 deletions.
1 change: 1 addition & 0 deletions docs/source/_static/redun-tags.svg
1 change: 1 addition & 0 deletions docs/source/index.md
@@ -55,6 +55,7 @@ executors
scheduler
config
db
tags
spark
typing
console
126 changes: 126 additions & 0 deletions docs/source/tags.md
@@ -0,0 +1,126 @@
---
tocdepth: 3
---

# Tags

redun provides **tags** for annotating workflow [data provenance](design.md#provenance-and-call-graphs) with additional metadata. Tags are represented as key-value pairs, where the value can be any JSON datatype (int, float, string, boolean, None, list, dict). By default, redun automatically adds tags to Executions and Jobs to record metadata such as the redun version, username, git commit, and AWS Batch Job id. In the example below, we have a redun Execution with tags like `redun.version=0.16.0` and `user=rasmus`. Some tags, like `git_commit`, can be used to automatically construct links to GitHub.

<img src="_static/console-execution.svg" width="100%"/>

Tags are helpful for querying data provenance. For example, using the `redun log` command, we can query for all Executions run by user `rasmus`:

```sh
redun log --tag user=rasmus

Recent executions:

ion=0.16.1, user=rasmus, [email protected])
Exec 6dc981d7 [ DONE ] 2023-08-11 14:44:09: run hello_world.py main (git_commit=b3faf931667ffd18f797334eb1feef9b464ea600, [email protected]:insitro/redun.git, project=redun.examples.hello_world, redun.version=0.16.1, user=rasmus, use
[email protected])
Exec 86b72927 [ DONE ] 2023-06-15 09:27:47: run hello_world.py main --greet Hi4 (git_commit=7ec9cb747aac9bb5bfbe6b518c9eab71296ffcc2, [email protected]:insitro/redun.git, project=redun.examples.hello_world, redun.version=0.15.0, user
=rasmus)
Exec 022e2cb6 [ DONE ] 2023-06-15 09:26:12: run hello_world.py main --greet Hi3 (git_commit=7ec9cb747aac9bb5bfbe6b518c9eab71296ffcc2, [email protected]:insitro/redun.git, project=redun.examples.hello_world, redun.version=0.15.0, user
=rasmus)
Exec 525eb514 [ DONE ] 2023-06-15 09:25:49: run hello_world.py main --greet Hi2 (git_commit=7ec9cb747aac9bb5bfbe6b518c9eab71296ffcc2, [email protected]:insitro/redun.git, project=redun.examples.hello_world, redun.version=0.15.0, user
=rasmus)
```
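Conceptually, a tag query keeps the records that carry a matching key-value pair. The following is a minimal sketch in plain Python, using two Execution ids and tags taken from the output above. The dictionary layout and the `with_tag()` helper are assumptions made for illustration; they say nothing about how redun stores or indexes tags.

```python
# Hypothetical in-memory records; redun itself keeps this data in its database.
executions = [
    {"id": "6dc981d7", "tags": [("redun.version", "0.16.1"), ("user", "rasmus")]},
    {"id": "86b72927", "tags": [("redun.version", "0.15.0"), ("user", "rasmus")]},
]


def with_tag(records, key, value):
    """Return the ids of records that carry the tag key=value."""
    return [record["id"] for record in records if (key, value) in record["tags"]]


print(with_tag(executions, "user", "rasmus"))
print(with_tag(executions, "redun.version", "0.15.0"))
```

Both records match `user=rasmus`, while only the older Execution matches `redun.version=0.15.0`.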

## Tag structure

In general, users can add their own tags and attach them to any part of the recorded CallGraph (e.g. Execution, Job, Task, Value) in order to record helpful metadata and improve the overall data provenance.

<img src="_static/redun-tags.svg" width="100%"/>

The figure above illustrates the [CallGraph data structure](design.md#provenance-and-call-graphs) that redun records while running a workflow. Briefly, when a user starts a new workflow execution, it is recorded as an Execution node (top right) along with a tree of Job nodes (right), one for each task that is run during the workflow. For each Job, we record a CallNode (blue boxes) that connects additional information such as the related Task, argument Values, and result Value.

As seen above, Tag nodes (green boxes) attach to a variety of node types. This allows users to record metadata specific to a particular Task, Value, Job, or Execution. Each tag has the following three properties:

- **Entity:** the id of the node the tag describes.
- **Key:** a string naming the kind of information the tag records (e.g. `user`, `git_commit`, `aws_batch_job`).
- **Value:** a JSON datatype describing the tag value.

The key-value approach is similar to other tagging designs such as [AWS tags](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Using_Tags.html), and the three-part structure deliberately follows the [Entity, Attribute, Value](https://en.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%80%93value_model) design in order to provide significant flexibility in modeling metadata.
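To make the three-part structure concrete, here is a minimal sketch in plain Python. The `Tag` class and its field names are hypothetical and chosen only to mirror the Entity, Key, and Value properties listed above; this is not redun's internal schema.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Tag:
    """Hypothetical stand-in for one tag record (not redun's actual class)."""

    entity_id: str  # id of the Execution, Job, Task, or Value being described
    key: str        # kind of information, e.g. "user" or "git_commit"
    value: Any      # any JSON-compatible value


# Two tags describing the same entity (an Execution id from the earlier example).
tags = [
    Tag("6dc981d7", "user", "rasmus"),
    Tag("6dc981d7", "redun.version", "0.16.1"),
]

# Each tag is its own (entity, key, value) row, so a new key needs no schema change.
keys = sorted(tag.key for tag in tags if tag.entity_id == "6dc981d7")
print(keys)
```

The design choice this illustrates is that metadata lives in rows rather than columns: recording a new kind of fact means adding another row with a new key.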

## Adding tags

### Adding execution tags

To add custom tags to an execution, add the `--tag` option when using `redun run`. For example, to add tags describing the kind of project and dataset being used in the workflow, we could use:

```sh
redun run --tag project=my-analysis --tag dataset=my-dataset workflow.py main
```

Tags are written in a `{key}={value}` format. For convenience, string values don't need to be quoted. Other JSON datatypes are written as usual (but remember to quote or escape them so the shell does not interpret them):

```sh
redun run --tag cost=3.20 --tag final=true --tag flags='["a","b","c"]' workflow.py main
```
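The quoting rule above can be pictured as "try JSON first, fall back to a plain string". The sketch below is a hypothetical `parse_tag()` helper written for illustration; it is an assumption about how such a format can be read, not redun's actual parser.

```python
import json
from typing import Any, Tuple


def parse_tag(text: str) -> Tuple[str, Any]:
    """Hypothetical helper: split 'key=value' and decode the value as JSON if possible."""
    # Split on the first '=' only, so values may themselves contain '='.
    key, sep, raw = text.partition("=")
    if not sep:
        raise ValueError(f"expected key=value, got {text!r}")
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        value = raw  # an unquoted string is kept as plain text
    return key, value


# The shell removes the single quotes before the program sees the argument.
for arg in ["cost=3.20", "final=true", 'flags=["a","b","c"]', "project=my-analysis"]:
    print(parse_tag(arg))
```

Under this sketch, `cost=3.20` becomes a float, `final=true` a boolean, `flags=[...]` a list, and `project=my-analysis` stays a string because it is not valid JSON.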

You may also use multiple tags of the same key. Here, we record multiple `source` tags:

```sh
redun run --tag source=my-data-source-1 --tag source=my-data-source-2 workflow.py main
```
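Repeated keys are the reason a collection of tags is better modeled as a list of pairs than as a dictionary. The short plain-Python sketch below uses the tag values from the commands above; the variable names are illustrative only.

```python
from collections import defaultdict

# Tags as (key, value) pairs, including two tags that share the key "source".
cli_tags = [
    ("source", "my-data-source-1"),
    ("source", "my-data-source-2"),
    ("project", "my-analysis"),
]

# A plain dict keeps only the last value seen for each key.
as_dict = dict(cli_tags)

# Grouping into lists preserves every value recorded under a key.
by_key = defaultdict(list)
for key, value in cli_tags:
    by_key[key].append(value)

print(as_dict["source"])
print(by_key["source"])
```

The dictionary silently drops `my-data-source-1`, while the grouped form retains both sources.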

If you are running a workflow programmatically, Execution tags can be added by passing `tags` as a list of key-value tuples to the [`Scheduler.run()` method](redun.scheduler.Scheduler.run):

```py
from redun import Scheduler

scheduler = Scheduler(config=...)
result = scheduler.run(main(), tags=[("cost", 3.2), ("final", True)])
```

### Adding task tags

Tags can be added to Tasks using the `tags` task option in the `@task` decorator:

```py
from redun import task

@task(tags=[("version", "prod"), ("team", "my-team")])
def my_task(x: int, y: int) -> int:
    ...  # task body goes here
```

Jobs will also automatically inherit the tags of their Tasks.

### Adding value tags

To add tags to a Value, pass the value through [`apply_tags()`](redun.scheduler.apply_tags), which returns the value unchanged but adds the provided tags during data provenance recording. For example, consider the following code:

```py
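from redun import task
from redun.scheduler import apply_tags

@task
def add(a: int, b: int) -> int:
    c = a + b
    # apply_tags() hands `c` back unchanged and records the tag on the result Value.
    return apply_tags(c, tags=[("version", "dev")])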
```

Here, the task `add()` adds two numbers as usual, but also tags the result Value `c` with `version=dev`.

### Adding job tags

Jobs automatically inherit any tags added to their corresponding Tasks. To add further tags at runtime, use the `job_tags` argument of the [`apply_tags()`](redun.scheduler.apply_tags) task:

```py
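from redun import task
from redun.scheduler import apply_tags

@task
def add(a: int, b: int) -> int:
    cost = compute_cost(...)  # placeholder for user code that computes the value to record
    c = a + b
    return apply_tags(c, job_tags=[("cost", cost)])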
```
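The two sources of Job tags described in this section can be pictured as a simple combination. The sketch below is plain Python with a hypothetical `effective_job_tags()` helper; the ordering and the lack of de-duplication are assumptions for illustration, not a description of redun's recording logic.

```python
from typing import Any, List, Tuple

TagList = List[Tuple[str, Any]]


def effective_job_tags(task_tags: TagList, job_tags: TagList) -> TagList:
    """Hypothetical helper: inherited Task tags followed by runtime Job tags."""
    return list(task_tags) + list(job_tags)


# Tags declared on the Task via the decorator, as in the task tags example.
task_tags = [("version", "prod"), ("team", "my-team")]
# Tags supplied at runtime through job_tags.
runtime_tags = [("cost", 3.2)]

print(effective_job_tags(task_tags, runtime_tags))
```

In this picture, a Job for a Task tagged `version=prod` and `team=my-team` that also records `cost=3.2` at runtime carries all three tags.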

Execution tags can also be added at runtime using `apply_tags(value, execution_tags=[(key, value), ...])`.

## Modifying tags after execution

Tags can also be added, removed, and updated after a workflow execution using the `redun tag` command. See `redun tag --help` for full details:

```sh
# Add a new tag `key=value` on record with id prefix `abcd1234`.
redun tag add abcd1234 key=value

# Remove the tag `key=value` from record with id prefix `abcd1234`.
redun tag rm abcd1234 key=value

# Update tag with key `key` on record with id prefix `abcd1234` with a new value `value`.
redun tag update abcd1234 key=value
```
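As a rough mental model of the three subcommands, the sketch below applies the same operations to an in-memory dictionary. All function names are hypothetical, and one behavior is an assumption: `tag_update()` here replaces every existing value stored under the key. The real command also resolves id prefixes and persists changes to the redun database, so consult `redun tag --help` for the authoritative behavior.

```python
from typing import Any, Dict, List, Tuple

TagStore = Dict[str, List[Tuple[str, Any]]]


def tag_add(store: TagStore, record_id: str, key: str, value: Any) -> None:
    """Attach key=value to the record, keeping any tags it already has."""
    store.setdefault(record_id, []).append((key, value))


def tag_rm(store: TagStore, record_id: str, key: str, value: Any) -> None:
    """Remove only the exact key=value pair from the record."""
    store[record_id] = [t for t in store.get(record_id, []) if t != (key, value)]


def tag_update(store: TagStore, record_id: str, key: str, value: Any) -> None:
    """Assumption: drop all existing values for `key`, then store the new one."""
    kept = [t for t in store.get(record_id, []) if t[0] != key]
    store[record_id] = kept + [(key, value)]


store: TagStore = {}
tag_add(store, "abcd1234", "stage", "draft")
tag_add(store, "abcd1234", "owner", "rasmus")
tag_update(store, "abcd1234", "stage", "final")
tag_rm(store, "abcd1234", "owner", "rasmus")
print(store)
```

After these four calls, the record carries a single tag, `stage=final`.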
