Commit

Merge pull request #1383 from opensafely/replace-cohort-extractor-with-ehrql

Update Actions section to reference ehrQL
inglesp authored Nov 16, 2023
2 parents 835cf4b + 35de709 commit 76431c7
Showing 5 changed files with 36 additions and 51 deletions.
4 changes: 2 additions & 2 deletions docs/actions-intro.md
@@ -2,9 +2,9 @@ Analytic code can be divided up into logical units. You might have a script which

In OpenSAFELY, each logical unit is called an _action_. Actions can be scripts, Jupyter notebook generators, or specialised functions provided by the framework.

An OpenSAFELY project must refer to its actions in a [_pipeline_](actions-pipelines.md). This is a file called `project.yaml` which defines all the actions in a project, how they should be run, and how their outputs should be saved.

* Every pipeline will start with [_cohortextractor_](actions-cohortextractor.md) as its first action, to convert the study definition into an actual analysis-ready dataset based on dummy or real data.
* Every pipeline will start with an [_ehrQL_](/ehrql/) action, to generate an analysis-ready dataset of real or dummy data.
* You can create custom [scripted actions](actions-scripts.md) in a number of other coding languages and choose from (or create your own) [reusable actions](actions-reusable.md).

Dividing your analysis into actions and describing them in a pipeline has a few purposes:
44 changes: 15 additions & 29 deletions docs/actions-pipelines.md
@@ -5,7 +5,7 @@ This section covers how to develop, run, and test your code to ensure it will work

## Project pipelines

The [cohortextractor](actions-cohortextractor.md) section describes how to make an action which generates dummy datasets based on the instructions [defined in your `study_definition.py` script](study-def.md).
The [ehrQL](/ehrql/how-to/dummy-data.md) documentation describes how to make an action which generates dummy datasets based on the instructions defined in your `dataset_definition.py` script.
These dummy datasets are the basis for developing the analysis code that will eventually be passed to the server to run on real datasets.
The code can be written and run on your local machine using whatever development set up you prefer (e.g., developing R in RStudio).
However, it's important to ensure that this code will run successfully in OpenSAFELY's secure environment too, using the specific language and package versions that are installed there. To do this, you should use the project pipeline.
@@ -30,33 +30,33 @@ A simple example of a `project.yaml` is as follows:
```yaml
version: "3.0"

# Ignore this `expectations` block. It is required but not used, and will be removed in future versions.
expectations:
population_size: 1000

actions:

generate_study_population:
run: cohortextractor:latest generate_cohort --study-definition study_definition --output-format csv.gz
generate_dataset:
run: ehrql:v0 generate-dataset analysis/dataset_definition.py --output output/dataset.csv.gz
outputs:
highly_sensitive:
cohort: output/input.csv.gz
dataset: output/dataset.csv.gz

run_model:
run: stata-mp:latest analysis/model.do
needs: [generate_study_population]
needs: [generate_dataset]
outputs:
moderately_sensitive:
model: models/cox-model.txt
figure: figures/survival-plot.png
```

This example declares the pipeline `version`, the `population_size` for the dummy data, and two actions, `generate_study_population` and `run_model`.
This example declares the pipeline `version`, and two actions: `generate_dataset` and `run_model`.

You only need to change `version` if you want to take advantage of features of newer versions of the pipeline framework.

The `generate_study_population` action will create the highly sensitive `input.csv.gz` dataset.
The `generate_dataset` action will create the highly sensitive `dataset.csv.gz` dataset.
It will be dummy data when run locally, and will be based on real data from the OpenSAFELY database when run in the secure environment.
The `run_model` action will run a Stata script called `model.do` based on the `input.csv.gz` created by the previous action.
The `run_model` action will run a Stata script called `model.do` based on the `dataset.csv.gz` created by the previous action.
It will output two moderately sensitive files `cox-model.txt` and `survival-plot.png`, which can be checked and released if appropriate.


@@ -65,7 +65,7 @@ In general, actions are composed as follows:

* Each action must be named using a valid YAML key (you won't go wrong with letters, numbers, and underscores) and must be unique.
* Each action must include a `run` key which includes an officially-supported command and a version (which at present is usually just `latest`).
* The `cohortextractor` command has the same options as described in the [cohortextractor section](actions-cohortextractor.md).
* The `ehrql` command has the same options as described in the [ehrQL reference](/ehrql/reference/cli/#generate-dataset).
* The `python`, `r`, and `stata-mp` commands provide a locked-down execution environment that can take one or more `inputs` which are passed to the code.
* Each action must include an `outputs` key with at least one output, classified as either `highly_sensitive` or `moderately_sensitive`:
* `highly_sensitive` outputs are considered potentially highly-disclosive, and are never intended for publishing outside the secure environment. This includes all data at the pseudonymised patient-level. Outputs labelled highly_sensitive will not be visible to researchers.
@@ -93,7 +93,7 @@ When writing and running your pipeline, note that:

* If one or more dependencies of an action have not been run (i.e., their outputs do not exist), then those dependency actions will be run first. If a dependency has changed but has not been re-run (so its outputs are out-of-date with the changes), the dependency will not be run automatically, and the dependent actions will be run using the out-of-date outputs.

* The ordering of columns may not be consistent between the dummy data and the TPP/EMIS backend. You should avoid referring to index integer positions and instead use the index / column names. Using index / column names will be more robust to different versions of cohortextractor and will also avoid problems caused by index integer positions changing as columns are added/removed.
* The ordering of columns may not be consistent between the dummy data and the TPP/EMIS backend. You should avoid referring to index integer positions and instead use the index / column names. Using index / column names will be more robust to different versions of ehrQL and will also avoid problems caused by index integer positions changing as columns are added/removed.
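The advice above can be sketched with the standard-library `csv` module. This is a minimal illustration, not from the original docs; the column names and inline data are hypothetical stand-ins for a real `output/dataset.csv.gz` extract:

```python
import csv
import io

# Hypothetical stand-in for a small dataset extract; in a real study this
# would be the file produced by the generate_dataset action.
raw = "patient_id,age,region\n1,34,North\n2,57,South\n"

# Fragile: positional access breaks if the column order differs between
# the dummy data and the TPP/EMIS backend.
rows = list(csv.reader(io.StringIO(raw)))
ages_by_position = [row[1] for row in rows[1:]]

# Robust: name-based access is unaffected by column order.
records = list(csv.DictReader(io.StringIO(raw)))
ages_by_name = [record["age"] for record in records]

assert ages_by_position == ages_by_name == ["34", "57"]
```

The same principle applies in pandas (`df["age"]` rather than `df.iloc[:, 1]`) and in R (`df$age` rather than `df[[2]]`).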

## Running your code locally

@@ -113,10 +113,10 @@ For `opensafely run` to work:
To run the first action in the example above, using dummy data, you can use:

```bash
opensafely run generate_study_population
opensafely run generate_dataset
```

This will generate the `input.csv.gz` file as explained in the [cohortextractor](actions-cohortextractor.md) section.
This will generate the `dataset.csv.gz` file as explained in the [ehrQL](/ehrql/) documentation.

To run the second action you can use:

@@ -127,7 +127,7 @@ opensafely run run_model
It will create the two files as specified in the `analysis/model.do` script.

To force the dependencies to be run you can use for example `opensafely run run_model --force-run-dependencies`, or `-f` for short.
This will ensure for example that both the `run_model` and `generate_study_population` actions are run, even if `input.csv.gz` already exists.
This will ensure for example that both the `run_model` and `generate_dataset` actions are run, even if `dataset.csv.gz` already exists.

To run all actions, you can use a special `run_all` action which is created for you (no need to define it in your `project.yaml`):
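The fenced example elided from the diff at this point presumably mirrors the earlier commands; a minimal sketch:

```shell
opensafely run run_all
```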

@@ -182,22 +182,8 @@ Outputs labelled `highly_sensitive` are not visible.

No data should ever be published from the Level 3 server. Access is only for permitted users, for the purpose of debugging problems in the secure environment.

Highly sensitive outputs can be seen in `E:/high_privacy/workspaces/<WORKSPACE_NAME>`. This includes a directory called `metadata`, containing log files for each action e.g. `generate_cohorts.log`, `run_model.log`.
Highly sensitive outputs can be seen in `E:/high_privacy/workspaces/<WORKSPACE_NAME>`. This includes a directory called `metadata`, containing log files for each action e.g. `generate_dataset.log`, `run_model.log`.

Moderately sensitive outputs can be seen in `E:/FILESFORL4/workspaces/<WORKSPACE_NAME>`.


## Running your code manually in the server

This is only possible for people with Level 3 access. You'll want to refer to [instructions for interacting with OpenSAFELY via the secure server](https://github.com/opensafely/server-instructions/blob/master/docs/Server-side%20how-to.md) (in restricted access repo).

The live environment is set up via a wrapper script; instead of `cohortextractor`, you should run `/e/bin/actionrunner.sh`.
For example, to run `run_model` on the Level 3 server, against the `full` database, you'd type:

```bash
/e/bin/actionrunner.sh run full run_model tpp
```



---8<-- 'includes/glossary.md'
16 changes: 8 additions & 8 deletions docs/actions-reusable.md
@@ -1,6 +1,6 @@
Like [scripted actions](actions-scripts.md), reusable actions are logical units of analytic code.
However, whereas a scripted action is written to solve a problem for one study and must be copied-and-pasted to solve a similar problem for another study, a reusable action is written to solve a problem for several studies *without copying-and-pasting between them*.
This makes reusable actions ideal for tasks that must be completed by several studies, such as joining cohorts or producing deciles charts.
This makes reusable actions ideal for tasks that must be completed by several studies, such as joining datasets or producing deciles charts.

## Running reusable actions

@@ -10,21 +10,21 @@ Consider the following extract from a study's *project.yaml*:

```yaml
actions:
generate_my_cohort:
run: cohortextractor:latest generate_cohort --output-format=csv.gz
generate_dataset:
run: ehrql:v0 generate-dataset analysis/dataset_definition.py --output output/dataset.csv.gz
outputs:
highly_sensitive:
cohort: output/input.csv.gz
dataset: output/dataset.csv.gz

run_a_reusable_action:
# We will run version `v1.0.0` of the reusable action called `a_reusable_action`.
# The reusable action accepts an argument; in this case, a path to a file.
run: a_reusable_action:v1.0.0 output/input.csv.gz
run: a_reusable_action:v1.0.0 output/dataset.csv.gz
# The reusable action accepts a configuration option;
# in this case, an output format.
config:
output-format: PNG
needs: [generate_my_cohort]
needs: [generate_dataset]
outputs:
moderately_sensitive:
output: output/output_from_a_reusable_action.png
@@ -39,8 +39,8 @@ The `config` property, which is optional, describes configuration options.

!!! note
If you're thinking about developing a reusable action, then start by creating a new study within the [`opensafely`](https://github.com/opensafely) organisation that encapsulates the problem.
As a minimum, the study should [extract](actions-cohortextractor.md) and operate on a cohort:
indeed, the code that operates on the cohort *is* the reusable action.
As a minimum, the study should [extract](/ehrql/) and operate on a dataset:
indeed, the code that operates on the dataset *is* the reusable action.

At this point, you should [open an issue](https://github.com/opensafely-actions/.github/issues).
Below, we describe how to convert the study into a reusable action.
22 changes: 11 additions & 11 deletions docs/actions-scripts.md
@@ -42,55 +42,55 @@ These types of outputs are considered potentially highly-disclosive, should not
Pseudonymised patient-level outputs tend to be large, so it is important that the right file formats are used for these data files. The wrong formats can waste disk space, execution time, and server memory. The specific formats used vary with language ecosystem, but they should always be compressed.

!!! note
The template sets up `cohortextractor` command to produce `csv.gz` outputs.
The template sets up the `ehrql` command to produce `csv.gz` outputs.
This is the current recommended output format, as CSV files compress well,
and this both reduces storage requirements and improves job execution times
on the backend.

If you need to view the raw CSV data locally, you can unzip with `opensafely unzip input.csv.gz`.
If you need to view the raw CSV data locally, you can unzip with `opensafely unzip dataset.csv.gz`.



=== "Python"

```python
# read compressed CSV output from cohortextractor
pd.read_csv("output/input.csv.gz")
# read compressed CSV output from ehrql
pd.read_csv("output/dataset.csv.gz")

# write compressed feather file
df.to_feather("output/model.feather", compression="zstd")

# read feather file, decompressed automatically
pd.read_feather("output/input.feather")
pd.read_feather("output/dataset.feather")
```

=== "R"

```r
# read compressed CSV output from cohortextractor
df <- readr::read_csv("output/input.csv.gz")
# read compressed CSV output from ehrql
df <- readr::read_csv("output/dataset.csv.gz")

# write a compressed feather file
arrow::write_feather(df, "output/model.feather", compression = "zstd")

# read a feather file, decompressed automatically
df <- arrow::read_feather("output/input.feather")
df <- arrow::read_feather("output/dataset.feather")
```

=== "Stata"

```stata
// Stata cannot handle compressed CSV files directly, so unzip first to a plain CSV file
// the unzipped file will be discarded when the action finishes.
!gunzip output/input.csv.gz
!gunzip output/dataset.csv.gz
// now import the uncompressed CSV using delimited
import delimited using output/input.csv
import delimited using output/dataset.csv

// save in compressed dta.gz format
gzsave output/model.dta.gz

// load a compressed .dta.gz file
gzload output/input.dta.gz
gzload output/dataset.dta.gz

```

1 change: 0 additions & 1 deletion mkdocs.yml
Original file line number Diff line number Diff line change
@@ -42,7 +42,6 @@ nav:
- Actions:
- Overview: actions-intro.md
- The project pipeline: actions-pipelines.md
- The cohortextractor action: actions-cohortextractor.md
- Scripted actions: actions-scripts.md
- Reusable actions: actions-reusable.md
- Jobs site: jobs-site.md

0 comments on commit 76431c7
