Update documentation to make guidance around data clearer #1351

Merged
8 changes: 4 additions & 4 deletions docs/actions-pipelines.md
@@ -68,9 +68,9 @@ In general, actions are composed as follows:
* The `cohortextractor` command has the same options as described in the [cohortextractor section](actions-cohortextractor.md).
* The `python`, `r`, and `stata-mp` commands provide a locked-down execution environment that can take one or more `inputs` which are passed to the code.
* Each action must include an `outputs` key with at least one output, classified as either `highly_sensitive` or `moderately_sensitive`
* `highly_sensitive` outputs are considered potentially highly-disclosive, and are never intended for publishing outside the secure environment
* `moderately_sensitive` outputs are automatically copied to the secure review area for redaction (otherwise known as [Level 4](security-levels.md)) and potentially for publication back to GitHub.
* Outputs should be separated onto different lines, each with a unique 'key', but related outputs can be combined using a wildcard (`*`). E.g.:
* `highly_sensitive` outputs are considered potentially highly-disclosive, and are never intended for publishing outside the secure environment. This includes all pseudonymised patient-level data. Outputs labelled `highly_sensitive` will not be visible to researchers.
* `moderately_sensitive` outputs are considered non-disclosive (providing the appropriate [statistical disclosure controls](releasing-files.md) have been applied) and are automatically copied to the secure review area (otherwise known as [Level 4](security-levels.md)). This includes aggregated patient-data outputs such as summary tables, summary statistics and the outputs from statistical models. For a full list check the [allowed file types subsection](releasing-files.md).
* Outputs should be separated onto different lines, each with a unique 'key', but related outputs can be combined using a wildcard (`*`). Note that when using a wildcard, it is extremely important to ensure that no `highly_sensitive` outputs are included. E.g.:
```yaml
outputs:
moderately_sensitive:
@@ -174,7 +174,7 @@ After your project has been executed via the [jobs site](jobs-site.md), its outp

Users with permission to access Level 4 can view output files that are labelled as _moderately sensitive_; they can also view automatically created log files of the run for debugging purposes.

For security reasons, they will be in a different directory than if you had run locally. For the TPP backend, outputs labelled `moderately_sensitive` in the `project.yaml` will be saved in `D:/Level4Files/workspaces/<NAME_OF_YOUR_WORKSPACE>`. These outputs can be [reviewed on the server](releasing-files.md) and released via GitHub if they are deemed non-disclosive.
For security reasons, they will be in a different directory than if you had run locally. For the TPP backend, outputs labelled `moderately_sensitive` in the `project.yaml` will be saved in `D:/Level4Files/workspaces/<NAME_OF_YOUR_WORKSPACE>`. These outputs can be [reviewed on the server](releasing-files.md) and released if they are deemed non-disclosive.

Outputs labelled `highly_sensitive` are not visible.
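
The wildcard caution in the outputs guidance can be checked locally before a job is submitted. The following is a minimal sketch, not part of OpenSAFELY: the function name, the dictionary shape, and the file paths are all hypothetical, and it uses Python's `fnmatch` (where `*` also matches `/`), which may not match exactly how the platform expands patterns.

```python
from fnmatch import fnmatch


def wildcard_overlaps(moderately_sensitive, highly_sensitive_paths):
    """Return (key, pattern, path) for every highly sensitive path that a
    moderately sensitive pattern would also match."""
    overlaps = []
    for key, pattern in moderately_sensitive.items():
        for path in highly_sensitive_paths:
            if fnmatch(path, pattern):
                overlaps.append((key, pattern, path))
    return overlaps


# Hypothetical declarations, mirroring the shape of an `outputs` block.
moderately_sensitive = {
    "tables": "output/tables/*.csv",
    "everything": "output/*.csv",  # too broad: also matches patient-level files
}
highly_sensitive_paths = ["output/input.csv", "output/input.csv.gz"]

overlaps = wildcard_overlaps(moderately_sensitive, highly_sensitive_paths)
for key, pattern, path in overlaps:
    print(f"'{key}' ({pattern}) would also match highly sensitive file {path}")
```

Any reported overlap suggests narrowing the pattern, for example by writing aggregated outputs to their own subdirectory.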

37 changes: 24 additions & 13 deletions docs/actions-scripts.md
@@ -26,13 +26,20 @@ This helps with:
## Reading and Writing Outputs

Scripted actions can read and write output files that are saved in the workspace. These generally fall into two categories:
* large files of `highly_sensitive` data for use by other actions
* smaller `moderately_sensitive` outputs for review and release
* large pseudonymised patient-level files of `highly_sensitive` data for use by other actions
* smaller `moderately_sensitive` aggregated patient-data (non patient-level data) files for review and release


### Large `highly_sensitive` output files

It is important that the right file formats are used for large data files. The wrong formats can waste disk space, execution time, and server memory. The specific formats used vary with language ecosystem, but they should always be compressed.
Outputs should be classed as `highly_sensitive` if they are:

- Pseudonymised patient-level outputs derived from queries run against Level 1 and 2 data, i.e., a specific study dataset generated by a [study definition](study-def.md) or [dataset_definition](https://docs.opensafely.org/ehrql/).
- Pseudonymised patient-level intermediate outputs for a study, derived from queries run against Level 3 data, i.e., a processed study dataset with certain filters/formatting applied.

These types of outputs are considered potentially highly-disclosive, and are never intended for publishing outside the secure environment. Outputs labelled `highly_sensitive` will not be visible to researchers.

Pseudonymised patient-level outputs tend to be large, so it is important that the right file formats are used for them. The wrong formats can waste disk space, execution time, and server memory. The specific formats used vary with language ecosystem, but they should always be compressed.
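
As an illustration of a compressed format, the sketch below writes and reads a `csv.gz` file using only the Python standard library. The helper names, the file name, and the data are made up for the example; in practice a data-frame library would usually handle this.

```python
import csv
import gzip
import os
import tempfile


def write_csv_gz(path, header, rows):
    """Write a header and rows to a gzip-compressed CSV file."""
    # "wt" opens the gzip stream in text mode; newline="" lets csv control line endings.
    with gzip.open(path, "wt", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def read_csv_gz(path):
    """Read a gzip-compressed CSV file back into a list of rows (all values as strings)."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as f:
        return list(csv.reader(f))


# Made-up data, purely for illustration.
header = ["patient_id", "age", "sex"]
rows = [[i, 40 + (i % 50), "F" if i % 2 else "M"] for i in range(10_000)]

tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "input.csv.gz")
write_csv_gz(gz_path, header, rows)

round_trip = read_csv_gz(gz_path)
print(f"{len(round_trip) - 1} rows stored in {os.path.getsize(gz_path)} bytes")
```

Note that values come back as strings when read with the `csv` module, so numeric columns need converting before analysis.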

!!! note
The template sets up `cohortextractor` command to produce `csv.gz` outputs.
@@ -89,16 +96,16 @@ It is important that the right files formats are used for large data files. The

### Smaller `moderately_sensitive` output files

These outputs are marked as `moderately_sensitive` in your `project.yaml`, and are available to view with [Level 4 access](level-4-server.md). Outputs can be:
* aggregate summary data
* images
* log files for debugging action code
Files that are labelled `moderately_sensitive` should only ever be aggregated data such as summary tables, images, and the outputs from statistical models. These files will be available to view with [Level 4 access](level-4-server.md). These (and the corresponding automatically created log files of each action/script) will be the only output files that users have access to. Users do not have unfettered access to any patient-level data and only see aggregated outputs derived from their analysis code, which satisfies the GDPR principle of confidentiality.

Due to the fact that Level 4 files need to be reviewed, there are various restrictions placed on sizes and formats of files that can be released
#### File type restrictions for `moderately_sensitive` outputs
There are restrictions on the types of file that are transferred to Level 4. This is to reduce the risk of making pseudonymised patient-level data available for researchers to view.

#### File format restrictions
If a file labelled as `moderately_sensitive` is not one of the allowed file types below, it will be replaced on Level 4 with a `.txt` file with the same filename, which explains why the file was not allowed on Level 4.

These are restricted so that reviewers can properly examine the outputs on the secure server.
**File format**

These are restricted to types of file that are likely to contain summary data, rather than patient-level data, and so that reviewers can properly examine the outputs on the secure server.

| Type | Formats |
| --- | --- |
@@ -107,13 +114,17 @@ These are restricted so that reviewers can properly examine the outputs on the s
| Images | `.png`, `.jpeg`, `.svgz` |
| Reports | `.html`, `.pdf` |

#### File size restrictions
**File size**

There is a maximum file size of 32 MB to:
There is a maximum file size of 16 MB to:

* limit the amount of data that can be accessed via Level 4
* prevent large patient-level data files being accessed via Level 4
* allow a thorough review of the outputs in a reasonable time

**Files with `patient_id` in the header**

Any CSV file with a `patient_id` header will not be moved to Level 4.

Contributor @bloodearnest commented on Nov 1, 2023:

"not be made available in level 4"?

Contributor reply:

thanks, done
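
The three Level 4 restrictions in this hunk (file type, the 16 MB size limit, and the `patient_id` header rule) can be approximated with a local pre-check. This is only a sketch under stated assumptions, not the platform's implementation: the function name is hypothetical, the extension set contains only the formats visible in this hunk plus `.csv` (inferred from the CSV rule), and 1 MB is assumed to mean 1024 * 1024 bytes.

```python
import csv
import os
import tempfile

# Assumption: "16 MB" is interpreted as 16 * 1024 * 1024 bytes.
MAX_SIZE_BYTES = 16 * 1024 * 1024

# Assumption: only the extensions visible in this section; the full table
# has more rows, so pass the real list in when using this.
EXAMPLE_ALLOWED_EXTENSIONS = {".csv", ".png", ".jpeg", ".svgz", ".html", ".pdf"}


def level4_problems(path, allowed_extensions=EXAMPLE_ALLOWED_EXTENSIONS,
                    max_size=MAX_SIZE_BYTES):
    """Return a list of reasons why a file might not be made available on Level 4."""
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in allowed_extensions:
        problems.append(f"file type {ext or '(none)'} is not in the allowed list")
    if os.path.getsize(path) > max_size:
        problems.append("file is larger than the maximum size")
    if ext == ".csv":
        # Only the first row is read, so large files are not loaded into memory.
        with open(path, newline="", encoding="utf-8") as f:
            header = next(csv.reader(f), [])
        if "patient_id" in (column.strip() for column in header):
            problems.append("CSV header contains patient_id")
    return problems


# Made-up example files.
tmp_dir = tempfile.mkdtemp()
summary_path = os.path.join(tmp_dir, "summary.csv")
with open(summary_path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows([["age_band", "count"], ["40-49", 120]])

patient_level_path = os.path.join(tmp_dir, "cohort.csv")
with open(patient_level_path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows([["patient_id", "age"], [1, 45]])

for path in (summary_path, patient_level_path):
    print(os.path.basename(path), level4_problems(path) or "no problems found")
```

A clean result here does not mean a file is safe to release; outputs still need statistical disclosure controls and review by output checkers.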


## Execution environments

OpenSAFELY currently supports Stata, Python, and R for statistical analysis.
2 changes: 1 addition & 1 deletion docs/jobs-site.md
@@ -36,7 +36,7 @@ graph TD

Once outputs have been produced by running _jobs_ from within a _Workspace_, there are several stages they must go through before being made publicly available:

1. **Outputs on the [Level 4 server](level-4-server.md)**. These are outputs marked as `moderately_sensitive` in the `project.yaml` file and are only viewable when logged into the Level 4 server. These outputs have to be [reviewed by our output checking team](releasing-files.md#3-how-are-files-reviewed) before they can leave the server.
1. **Outputs on the [Level 4 server](level-4-server.md)**. These are aggregated patient-data (non patient-level data) outputs marked as `moderately_sensitive` in the `project.yaml` file and are only viewable when logged into the Level 4 server. These outputs have to be [reviewed by our output checking team](releasing-files.md#3-how-are-files-reviewed) before they can leave the server.
2. **Released outputs**. These are analysis outputs that have been reviewed for any [disclosivity issues](releasing-files.md#types-of-disclosure) and released from the Level 4 server by the output checking team to the relevant _Workspace_ on the Jobs site. These are only viewable if you have the correct permissions for the _Project_ the _Workspace_ belongs to.
3. **Draft public outputs**. Released outputs can only be shared with close collaborators of your projects ([refer to the examples of who this could include](https://www.opensafely.org/policies-for-researchers/#all-datasets-sharing)). To be shared more widely, they have to first be approved by NHS England. Once approved, and if you have the correct jobs site permissions, you can create draft public outputs for approval.
4. **Published outputs**. Once approved, draft public outputs are made publicly available to view by anyone through the _Workspace_ they belong to.
11 changes: 7 additions & 4 deletions docs/security-levels.md
@@ -50,9 +50,9 @@ Data is held within the EHR vendor's secure environment on the OpenSAFELY server
Data processor staff working at the EHR vendor and a small and restricted number of OpenSAFELY platform developers. Similar to level 1 above, researchers can query this pseudonymised external data, but only indirectly: they write their study analysis code away from the source data, in GitHub, and the OpenSAFELY service automates the execution of the study code against the external data. Only the aggregated results of their study are made available back to the researchers in Level 4 (see below).

## Level 3 [NHS England are data controllers of the data]
This level includes the pseudonymised intermediate outputs: the specific study dataset derived from queries run against Level 1 and 2 data, i.e. anything that is generated by a [study definition](study-def.md).
At this level, data is typically stored as a pseudonymised patient-level (rather than event-level) extract. It includes all pseudonymised patient-level outputs derived from queries run against Level 1 and 2 data, i.e., a specific study dataset generated by a [study definition](study-def.md) or [dataset_definition](https://docs.opensafely.org/ehrql/). It also includes all of the pseudonymised patient-level intermediate outputs for a study derived from queries run against Level 3 data, i.e., a processed study dataset where certain filters/formatting have been applied.

The level 3 data is typically stored as a pseudonymised patient level (rather than event level) extract.
As the data stored at this level is still patient-level, access is restricted to a small number of OpenSAFELY staff for data quality assessment and debugging.

### Where is this data held?
Data is held within the EHR vendor's secure environment on the OpenSAFELY server (same as level 2).
Expand All @@ -61,8 +61,11 @@ Data is held within the EHR vendor's secure environment on the OpenSAFELY server
This is the same as Level 2.

## Level 4 [NHS England are data controllers of the data]
This level includes tables, figures, and other structured files produced as a result of the analysis of the Level 3 data, for example summary statistics and statistical models.
Following strict disclosivity checks and redactions, and dual review by trained output-checkers, files can be released out of the server for further processing and public consumption.
This level includes aggregated patient-data (non patient-level data) derived from queries run against Level 3 data, such as summary tables, summary statistics and the outputs from statistical models. It also includes the automatically created log files of each action/script corresponding to each file, for debugging purposes.

This is the only level that OpenSAFELY users have access to in order to view their aggregated data/results/log files; users do not have unfettered access to any patient-level data and only see aggregated outputs derived from their analysis code, which satisfies the GDPR principle of confidentiality. Researchers are able to use this level to check that the appropriate statistical disclosure controls have been applied to any files intended for release out of the server.

Access to this level is secured via VPN access to a remote desktop. No files are released from the secure environment without undergoing dual independent checking by trained output-checkers for disclosure issues (see the [Safe Outputs section](releasing-files.md)).

### Where is this data held?
Data is held within the EHR vendor's secure environment on a specific server, separate from the Level 2 and 3 server.