diff --git a/docs/actions-pipelines.md b/docs/actions-pipelines.md
index c629a5b90..15a13abdf 100644
--- a/docs/actions-pipelines.md
+++ b/docs/actions-pipelines.md
@@ -69,8 +69,8 @@ In general, actions are composed as follows:
 * The `python`, `r`, and `stata-mp` commands provide a locked-down execution environment that can take one or more `inputs` which are passed to the code.
 * Each action must include an `outputs` key with at least one output, classified as either `highly_sensitive` or `moderately_sensitive`
 * `highly_sensitive` outputs are considered potentially highly-disclosive, and are never intended for publishing outside the secure environment. This includes all data at the pseudonymised patient-level. Outputs labelled highly_sensitive will not be visible to researchers.
- * `moderately_sensitive` outputs are considered non-disclosive (providing the appropriate [statistical disclosure controls](releasing-files.md) have been applied) and are automatically copied to the secure review area (otherwise known as [Level 4](security-levels.md)). This includes aggregated patient-data outputs such as summary tables, summary statistics and the outputs from statistical models. For a full list check the [allowed file types subsection](releasing-files.md).
 * Outputs should be separated onto different lines, each with a unique 'key', but related outputs can be combined using a wildcard (`*`). Note, when using a wildcard, it is extremely important to ensure that no `highly_sensitive` outputs are included. E.g.:
+ * `moderately_sensitive` outputs **should never include patient-level data**, only data that is considered non-disclosive. This includes aggregated patient-data outputs such as summary tables, summary statistics and the outputs from statistical models. For a full list check the [allowed file types subsection](releasing-files.md). The appropriate [statistical disclosure controls](releasing-files.md) should have been applied to these files. They are copied to the secure review area (otherwise known as [Level 4](security-levels.md)).
 ```yaml
 outputs:
   moderately_sensitive:
diff --git a/docs/actions-scripts.md b/docs/actions-scripts.md
index 741307768..fdb3035ca 100644
--- a/docs/actions-scripts.md
+++ b/docs/actions-scripts.md
@@ -27,17 +27,17 @@ This helps with:
 Scripted actions can read and write output files that are saved in the workspace. These generally fall into two categories:
 * large pseudonymised patient-level files of `highly_sensitive` data for use by other actions
-* smaller `moderately_sensitive` aggregated patient-data (non patient-level data) files for review and release
+* smaller `moderately_sensitive` aggregated patient-data (this should **never** be patient-level data) files for review and release
 ### Large `highly_sensitive` output files
-Outputs should be classed as `highly_sensitive` if they are:
+Outputs labelled `highly_sensitive` will not be visible to researchers. This is a [deliberate design feature of OpenSAFELY](https://www.opensafely.org/about/), intended to reduce the risk of disclosure of sensitive information. Outputs should **always** be classed as `highly_sensitive` if they are:
 - Pseudonymised patient-level outputs derived from queries run against Level 1 and 2 data, i.e., a specific study dataset generated by a [study definition](study-def.md) or [dataset_definition](https://docs.opensafely.org/ehrql/).
 - Pseudonymised patient-level intermediate outputs for a study derived from queries run against Level 3 data which output pseudonymised patient-level data i.e., a processed study dataset with certain filters/formatting applied.
-These types of outputs are considered potentially highly-disclosive, and are never intended for publishing outside the secure environment. Outputs labelled highly_sensitive will not be visible to researchers.
+These types of outputs are considered potentially highly-disclosive, should not be pushed to Level 4, and are never intended for publishing outside the secure environment.
 Pseudonymised patient-level outputs tend to be large in size and therefore it is important that the right file formats are used for these large data files. The wrong formats can waste disk space, execution time, and server memory. The specific formats used vary with language ecosystem, but they should always be compressed.
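
As a rough illustration of the compression point in the last paragraph, the Python sketch below writes patient-level rows to a gzip-compressed CSV using only the standard library. The function names (`write_compressed_csv`, `read_compressed_csv`) and the column names in the usage note are hypothetical and are not part of OpenSAFELY; this is one possible approach under those assumptions, not the project's prescribed format.

```python
import csv
import gzip
from pathlib import Path


def write_compressed_csv(rows, fieldnames, path):
    """Write dict rows to a gzip-compressed CSV file and return its path.

    Hypothetical helper for illustration only.
    """
    path = Path(path)
    # Create the output directory if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    # "wt" opens the gzip stream in text mode; newline="" is what the csv module expects.
    with gzip.open(path, "wt", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return path


def read_compressed_csv(path):
    """Read a gzip-compressed CSV file back into a list of dicts (all values are strings)."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

A call such as `write_compressed_csv(rows, ["patient_id", "age"], "output/dataset.csv.gz")` would produce a file that, being patient-level, belongs under `highly_sensitive` in the action's `outputs`.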