Merge pull request #1459 from alan-turing-institute/data_integrity
Add data preparation guidance (including data integrity)
JimMadge authored May 15, 2023
2 parents ebdac52 + 06f7d8d commit 208e223
Showing 2 changed files with 74 additions and 1 deletion.
73 changes: 73 additions & 0 deletions docs/processes/data_ingress.md
@@ -7,6 +7,79 @@
The Data Safe Haven has various technical controls to ensure data security.
However, the processes and contractual agreements that the **Dataset Provider** agrees to are equally important.

## Preparing data

This section has some recommendations for preparing input data for the Data Safe Haven.

### Avoid archives

The input data is presented to researchers on a read-only filesystem.
This means that researchers are unable to extract archives in-place.
Instead, they would have to extract them to a read-write space within the environment.
This unnecessarily duplicates the data and increases the risk of losing integrity, as the extracted copies can be modified (intentionally or accidentally).
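
For example, rather than uploading a compressed archive, it could be extracted locally and the extracted files uploaded in its place (a minimal sketch; `data.tar.gz` and the `extracted` directory are hypothetical names):

```console
# Extract the archive locally, then upload the contents of 'extracted/' instead of the archive itself
mkdir -p extracted
tar -xzf data.tar.gz -C extracted/
```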

### Avoid name clashes

In the recommended upload process there is no protection against overwriting files.
It is therefore important to avoid uploading files with the same pathname, as later uploads will replace existing files.

To help avoid name clashes, use a unique name for each data set you upload.
For example, if the data sets are single files, use unique file names.
If data sets consist of multiple files, collect them in uniquely named directories.

If multiple data providers are uploading data for a single work package, each provider should use a uniquely named directory, or prefix their file names with a unique identifier.
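
As an illustrative sketch (the provider and data set names here are hypothetical), a layout like the following keeps every path unique and avoids clashes:

```console
provider_a/survey_2021/responses.csv
provider_a/survey_2022/responses.csv
provider_b/imaging/scan_001.nii
provider_b/imaging/scan_002.nii
```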

### Describe the data

Explaining the structure and format of the data will help researchers work effectively.
It is a good idea to upload a plain text file explaining the directory structure, file formats, data columns, meaning of special terms, _etc._
Researchers will find this file alongside the data and can easily read it using tools inside the environment.
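
For example, a short description file (a hypothetical sketch, not a required format) might look like:

```text
Data set: survey_2021
Layout:   survey_2021/responses.csv - one row per respondent
Columns:  id        - anonymised respondent identifier
          q1 to q10 - responses coded 1 (strongly disagree) to 5 (strongly agree)
Notes:    a value of -9 means the question was not answered
```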

### Data integrity

You will want to ensure that researchers have the correct data and that they can verify this.
We recommend using [checksums](https://www.redhat.com/sysadmin/hashing-checksums) to do this.

A checksum is a short string computed in a one-way process from some data.
A small change in the data (even a single bit) will result in a different checksum.
We can therefore use checksums to verify that data has not been changed.
In the Data Safe Haven this is useful for verifying that the data inside the environment is complete and correct, confirming that it has not been modified or corrupted during transfer.
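
As a small illustration (the file name is hypothetical), changing even a single character of a file produces a completely different checksum:

```console
echo "hello world" > example.txt
md5sum example.txt   # note the checksum printed for the original content
echo "jello world" > example.txt
md5sum example.txt   # a one-character change gives a completely different checksum
```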

We recommend considering the hashing algorithms MD5 and SHA-256, available as the `md5sum` and `sha256sum` commands.
Both are commonly built into operating systems and are included in the Data Safe Haven.
MD5 is fast and sufficient for integrity checks.
SHA-256 is slower but more secure; it better protects against malicious modification.
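
Both commands are used in the same way; for a hypothetical file `data.csv`:

```console
md5sum data.csv      # prints a 32-character (128-bit) MD5 checksum followed by the file name
sha256sum data.csv   # prints a 64-character (256-bit) SHA-256 checksum followed by the file name
```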

You can generate a checksum file which can later be used to verify the integrity of the files.
If you upload this file alongside the data, researchers will be able to independently verify data integrity within the environment.

Here is how to generate a checksum file using `md5sum` for a data set stored in a directory called `data`.

```console
find ./data/ -type f,l -exec md5sum {} + > hashes.txt
```

`find` searches the `data` directory for regular files and symbolic links (`-type f,l`).
It runs the checksum command `md5sum` on all matching paths (`-exec md5sum {} +`).
Finally, the checksums are written to a file called `hashes.txt` (`> hashes.txt`).

The data can then be checked against these checksums.

```console
md5sum -c hashes.txt
```

If any file has changed, the command will return a non-zero exit code (an error).
Failing files are listed as `<filename>: FAILED` in the output and can be easily identified using `grep`:

```console
md5sum -c hashes.txt | grep FAILED
```

To use the SHA-256 algorithm, replace `md5sum` with `sha256sum` in the commands above.
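
For example, the full workflow with SHA-256 would be:

```console
find ./data/ -type f,l -exec sha256sum {} + > hashes.txt
sha256sum -c hashes.txt
```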

## Bringing data into the environment

```{attention}
2 changes: 1 addition & 1 deletion docs/roles/investigator/data_ingress.md
@@ -1,4 +1,4 @@
(role_investigator_egress)=
(role_investigator_ingress)=

# Data ingress process
