
Commit

chore: Merge branch 'develop' into new-feature/2066-file-attribute-character-limit-when-adding-via-admin-pages
luistoptal committed Nov 14, 2024
2 parents 7511249 + adde9c7 commit 1effb4d
Showing 76 changed files with 2,180 additions and 1,100 deletions.
2 changes: 1 addition & 1 deletion .gitlab-ci.yml
@@ -64,7 +64,7 @@ image: docker:$DOCKER_VERSION


include:
- template: Security/SAST.gitlab-ci.yml
- template: Jobs/SAST.gitlab-ci.yml
- template: Jobs/Container-Scanning.gitlab-ci.yml
- local: "ops/pipelines/gigadb-build-jobs.yml"
- local: "ops/pipelines/gigadb-test-jobs.yml"
13 changes: 10 additions & 3 deletions CHANGELOG.md
@@ -2,13 +2,20 @@

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).


## Unreleased

- Feat #2066: Extend max length for file attribute values to 150
- Fix #2042: Batch deletion of file attributes and samples
- Feat #701: Code refactoring to separate upload status transitions and notifications to prepare for upload status overhaul
- Security #1867: Update the gitlab static application security testing (SAST) job using the Semgrep-based analyzer

## v4.4.0 - 2024-11-13 - ea1a37cc9 -

- Fix #2066: Max length for attribute value set to 1000 in file admin form
- Feat #1968: Add curators' manual for operating tools on the bastion server and improve tools usage
- Feat #1750: Switch to Guzzle instead of cURL (preliminary work to prepare for DataCite schema upgrade)
- Fix #2042: Batch deletion of file attributes and samples to make deleting files from the admin dashboard faster

## v4.3.9 - 2024-10-28 - 961f7821a -
## v4.3.9 - 2024-10-28 - 961f7821a - 2024-11-06

- Fix #1838: switch datepicker format to yyyy-mm-dd
- Feat #1768: Alphabetically sorted dataset author dropdown options in adminDatasetAuthor form
3 changes: 3 additions & 0 deletions data/gigadb_testdata/file_attributes.csv
@@ -1 +1,4 @@
id,file_id,attribute_id,value,unit_id
10669,453,605,d30b8b3549777953aeec9c82e8ac8265,
10670,454,605,da3aa9c474329f45a5f1053e1e99cc0d,
10671,457,605,35850810fcf14328b9811029b5a0d5b9,
263 changes: 263 additions & 0 deletions docs/curators/CURATOR_TOOLS_BASTION.md
@@ -0,0 +1,263 @@
# Using customized tools on the production bastion server

## Overview

![Tool Overview](./overview.png 'Overview of tools on bastion server')

New datasets are uploaded into GigaDB using Excel spreadsheets. The bastion server provides a set of command-line tools which implement the above workflow for ingesting Excel spreadsheets and performing post-upload operations.

## 1. datasetUpload

After you have logged into the bastion server (bastion.gigadb.host) using SSH, you can begin the process of Excel spreadsheet ingestion into GigaDB.
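A typical login looks like the following (a sketch; the username is illustrative, use your own bastion account):
```
$ ssh peterl@bastion.gigadb.host
[peterl@ip-10-99-0-88 ~]$
```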

Dataset metadata is added into [Excel template file version 19](https://github.com/gigascience/gigadb-website/blob/develop/gigadb/app/tools/excel-spreadsheet-uploader/template/GigaDBUpload-template_v19.xls). This Excel file needs to be placed in the `uploadDir` directory:
```
# Your home directory can be referred to using ~
[peterl@ip-10-99-0-88 ~]$ ls ~
uploadDir
```

Excel files can be uploaded into `uploadDir` using the `sftp` tool or [FileZilla](https://filezilla-project.org).
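For example, a command-line `sftp` session might look like this (a sketch; the spreadsheet filename is illustrative):
```
$ sftp peterl@bastion.gigadb.host
sftp> cd uploadDir
sftp> put GigaDBUpload_v19_mydataset.xls
sftp> quit
```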

> [!TIP]
> For testing purposes, download a test Excel file into `uploadDir` using this command: `curl -L -o "./uploadDir/GigaDBUpload_v18_102498_TRR_202311_02_Cell_Clustering_Spatial_Transcriptomics.xls" "https://drive.google.com/uc?export=download&id=129j3ikdSojNVpvZPnBefoOA2Uz6OusHR"`

The Excel file can then be ingested using the `datasetUpload` script in `/usr/local/bin`:
```
[peterl@ip-10-99-0-88 ~]$ sudo datasetUpload
Done.
```

If the ingestion process has been successful then you should see the above output. In addition, the Excel file will have disappeared from the `uploadDir` folder and there will be two log files:
```
[peterl@ip-10-99-0-88 ~]$ ls uploadDir/
java.log javac.log
```

Looking at `uploadDir/java.log` will help confirm the upload:
```
[peterl@ip-10-99-0-88 ~]$ tail uploadDir/java.log
Insert false: insert into file_attributes select 674872, 538971, 572, 'MIT', null where not exists ( select null from file_attributes where id = 674872 );
>>>>>>>About to exec sqlTemp...
execution time: 130
**End success: GigaDBUpload_v18_102498_TRR_202311_02_Cell_Clustering_Spatial_Transcriptomics.xls
```

You should also check the corresponding dataset admin page at `https://gigadb.org/adminDataset/update/id/<dataset_id>`, which you can find by entering the dataset's DOI, e.g. 102498, into the DOI column header on the /adminDataset/admin page.

Also, look at the dataset's samples and files in the relevant dataset samples and dataset files admin pages.

> [!TIP]
> If there is a problem with Excel file ingestion then you will see the following output when running `datasetUpload`:
```
[peterl@ip-10-99-0-88 ~]$ datasetUpload
Spreadsheet cannot be uploaded, please check logs!
Done.
```

> Do as the output message suggests by checking `tail uploadDir/java.log`:
```
publisher test OK? true
contentXXX: Genomics
target: dataset_type
content: Genomics
values: [Genomic, Metagenomic, Epigenomic, Proteomic, Transcriptomic, Metabolomic, Neuoscience, Bioinformatics, Workflow, Software, Imaging, Network-Analysis, Genome-Mapping, ElectroEncephaloGraphy(EEG), Metadata, Metabarcoding, Virtual-Machine, Climate, Ecology, Lipidomic, Phenotyping]
relation test OK? false
email test OK? false
attribute_id test OK? false
author_name test OK? false
project test OK? false
image test OK? false
file_type test OK? false
latest date 2024-3-6
Finished validation OK? false
End error 1: GigaDBUpload_v18_GIGA-D-23-00109-Koref4K.xls
fillTable output: true
validation output: false
[GigaDBUpload_v18_GIGA-D-23-00109-Koref4K.xls]
```

> In the above example error, the dataset_type is wrongly spelt as `Genomics`, which breaks the ingestion process and therefore needs to be corrected.

## 2. createReadme

> [!IMPORTANT]
> To execute the `createReadme` command, change directory to the dataset's user dropbox directory located at `/share/dropbox/`:
```
[peterl@ip-10-99-0-95 ~]$ cd /share/dropbox/user5
```

From this user dropbox directory, a readme file for the dataset can be created using the `createReadme` script by calling it with a DOI:
```
[peterl@ip-10-99-0-142 user5]$ pwd
/share/dropbox/user5
[peterl@ip-10-99-0-88 user5]$ sudo createReadme --doi 102498
```

A `readme_<doi>.txt` file will appear in the `/share/dropbox/user5` directory.
```
[peterl@ip-10-99-0-142 user5]$ ls
DLPFC_69_72_VNS_results.csv E2_VNS_Ground_Truth.csv readme_102498.txt
```

To create the readme file and copy it into Wasabi, extra parameters need to be provided:
```
[peterl@ip-10-99-0-88 user5]$ sudo createReadme --doi 102498 --wasabi --apply --use-live-data
```

The readme file will then have been uploaded into the correct dataset directory in the Wasabi live bucket, and its file size and MD5 value will also have been updated in the database.
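As an optional sanity check, the readme's MD5 value and size can be computed locally with standard tools and compared against what the file admin page shows (a sketch; assumes GNU coreutils are available on the bastion):
```
# MD5 checksum of the generated readme
[peterl@ip-10-99-0-88 user5]$ md5sum readme_102498.txt
# File size in bytes
[peterl@ip-10-99-0-88 user5]$ stat -c %s readme_102498.txt
```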

## 3. calculateChecksumSizes

> [!IMPORTANT]
> To execute the `calculateChecksumSizes` command, change directory to the dataset's user dropbox directory located at `/share/dropbox/`:
```
[peterl@ip-10-99-0-95 ~]$ cd /share/dropbox/user5
```

`$doi.md5` and `$doi.filesizes` provide the information used to update dataset files with MD5 values and file sizes in the database. These two files can be generated from the user5 dropbox directory:
```
# Provide DOI number as a parameter
[peterl@ip-10-99-0-95 user5]$ sudo calculateChecksumSizes 102498
Created 102498.md5
Created 102498.filesizes
```

Check the contents of the two files:
```
[peterl@ip-10-99-0-95 user5]$ more 102498.filesizes
5124 ./readme_102498.txt
301 ./DLPFC_69_72_VNS_results.csv
332 ./E2_VNS_Ground_Truth.csv
[peterl@ip-10-99-0-95 user5]$ more 102498.md5
2b74aa5af1b67e48f0317748cbfdf310 ./readme_102498.txt
dc1feb8af3b8c02b0b615e968b87786d ./DLPFC_69_72_VNS_results.csv
b5a7e0953d1581077c13818153371918 ./E2_VNS_Ground_Truth.csv
```
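For reference, roughly equivalent information can be produced with standard tools, which can be handy when cross-checking the generated files (a sketch only; the actual script may order or format entries differently):
```
# MD5 checksums for every file in the dropbox (similar to 102498.md5)
[peterl@ip-10-99-0-95 user5]$ find . -type f -exec md5sum {} +
# File size in bytes followed by the path (similar to 102498.filesizes)
[peterl@ip-10-99-0-95 user5]$ find . -type f -printf '%s %p\n'
```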

## 4. filesMetaToDb

> [!IMPORTANT]
> To execute the `filesMetaToDb` command, change directory to the dataset's user dropbox directory located at `/share/dropbox/`:
```
[peterl@ip-10-99-0-95 ~]$ cd /share/dropbox/user5
```

The `filesMetaToDb` script can use `102498.filesizes` and `102498.md5` to update file metadata in the database from the user dropbox folder:
```
[peterl@ip-10-99-0-95 ~]$ sudo filesMetaToDb 102498
Updating md5 checksum values as file attributes for 102498
Number of changes: 3
Updating file sizes for 102498
Number of changes: 3
Updated file metadata for 102498 in database
```

You should check the file admin pages of the files associated with this dataset to see whether MD5 values and file sizes are visible.

## 5. Go to dataset admin page on gigadb.org

With the post-upload operations complete, go back to the page at `https://gigadb.org/adminDataset/update/id/<dataset_id>` to continue curation work on the dataset. You can find this link by entering the dataset's DOI, e.g. 102498, into the DOI column header on the /adminDataset/admin page.

On the dataset admin page, you will be able to create a mockup page in order to preview the final dataset view page with the information that was added to the database in the previous steps.

## 6. `transfer` - copy dataset files into Wasabi

When all files in a dataset have been finalised and curated, they can be copied into Wasabi using the `transfer` tool. The path to the user dropbox directory is provided as the `--sourcePath` parameter, and the value of the `--doi` parameter is the DOI of the dataset. The `--wasabi` flag tells the `transfer` tool to copy files into Wasabi storage. The `--apply` flag takes the `transfer` tool out of dry-run mode so that files are actually transferred into Wasabi storage from the user dropbox directory.
```
[peterl@ip-10-99-0-95 user5]$ transfer --doi 102498 --sourcePath /share/dropbox/user5/ --wasabi --apply
```
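Because `--apply` is what takes the tool out of dry-run mode, you may want to run the same command without it first to preview the transfer (assuming the dry run reports what would be copied without modifying anything):
```
[peterl@ip-10-99-0-95 user5]$ transfer --doi 102498 --sourcePath /share/dropbox/user5/ --wasabi
```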

## 7. Housekeeping of user dropboxes of published datasets

After dataset files have been copied into Wasabi, the files should also be backed up into S3 Glacier:
```
[peterl@ip-10-99-0-95 user5]$ transfer --doi 102498 --sourcePath /share/dropbox/user5/ --backup --apply
More details about copying files to s3 bucket, please refer to: /var/log/gigadb/transfer.log
```

Confirm files have been backed up in S3 Glacier:
```
[peterl@ip-10-99-0-56 ~]$ tail /var/log/gigadb/transfer.log
2024/11/06 07:43:03 INFO : Start copying files from staging to s3
2024/11/06 07:43:04 INFO : 102498.filesizes: Copied (new)
2024/11/06 07:43:04 INFO : 102498.md5: Copied (new)
2024/11/06 07:43:04 INFO : DLPFC_69_72_VNS_results.csv: Copied (new)
2024/11/06 07:43:04 INFO : E2_VNS_Ground_Truth.csv: Copied (new)
2024/11/06 07:43:04 INFO : readme_102498.txt: Copied (new)
2024/11/06 07:43:04 INFO : Executed: rclone copy --s3-no-check-bucket --s3-profile aws-transfer /share/dropbox/user5/ gigadb-datasetfiles:gigadb-datasetfiles-backup/staging/pub/10.5524/102001_103000/102498 --log-file /var/log/gigadb/transfer.log --log-level INFO --stats-log-level DEBUG >> /var/log/gigadb/transfer.log
2024/11/06 07:43:04 INFO : Successfully copied files to s3 bucket for DOI: 102498
```

After you have confirmed the files are safely stored in Wasabi and Glacier, and the manuscript has been published, the `user` and `user.orig` dropbox directories should be deleted to save storage space.
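No dedicated housekeeping tool is described here; assuming the directories are removed manually, the commands would look something like the following for the user5 dropbox (double-check the Wasabi and Glacier copies before deleting anything):
```
# Example only: adjust the dropbox name to the dataset you are working on
[peterl@ip-10-99-0-95 ~]$ sudo rm -rf /share/dropbox/user5 /share/dropbox/user5.orig
```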

## `postUpload`: a wrapper script to create readme file and update file metadata in database

> [!IMPORTANT]
> To execute the `postUpload` command, change directory to the dataset's associated user dropbox directory located at `/share/dropbox/`:
```
[peterl@ip-10-99-0-95 ~]$ cd /share/dropbox/user5
```

There is a script called `postUpload` which calls `createReadme`, `calculateChecksumSizes` and `filesMetaToDb` in turn so that these three tools do not have to be executed manually one after another:
```
# Ensure you are in the dropbox directory
[peterl@ip-10-99-0-88 ~]$ pwd
/share/dropbox/user5
[peterl@ip-10-99-0-88 ~]$ sudo postUpload --doi 102498 --dropbox user5
Creating README file for 102498
[DOI]
10.5524/102498
...
[Comments]
[End]
Created readme file and uploaded it to Wasabi gigadb-website/staging bucket directory
Creating dataset metadata files for 102498
Created 102498.md5
Created 102498.filesizes
Updating file sizes and MD5 values in database for 102498
Updating md5 checksum values as file attributes for 102498
Number of changes: 3
Updating file sizes for 102498
Number of changes: 3
Updated file metadata for 102498 in database
```
> [!TIP]
> Take note of the number of changes made by the MD5 and file size update tool. This number should equal the number of files listed in the metadata files.

To ensure the `postUpload` script has worked, you should perform checks using the dataset and file admin pages to see whether the dataset metadata are correctly stored in the database.

## `compare`: How to compare files in the user dropbox with the files in the dataset spreadsheet

Discrepancies between the state of the filesystem in a user dropbox and the list of files in the dataset spreadsheet will cause errors when the spreadsheet is processed and when file metadata are saved to the database at a later stage of the process. It may also be necessary to curate the actual files in user dropboxes for conformity to guidelines or for organisational purposes, which can introduce such discrepancies.

To avoid this problem, it is important to reconcile both file lists regularly. To help with that, there is a command available on the bastion server, called `compare`, which compares the list of files in the dataset spreadsheet with the list of files on the filesystem. By default, the command reports any discrepancies in both directions.

### How to use the tool

Open the dataset spreadsheet you are working on at the files list tab

Copy the list of files and paste it into a text file that you save as DOI_xls.txt (where DOI is to be replaced by the real DOI of the dataset you are working on)

Upload the text file listing all the files from the spreadsheet to your home directory on the bastion server using your preferred method (FileZilla, scp, …)
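For example, with `scp` (a sketch; assumes the file was saved locally as 100006_xls.txt and that your bastion account is peterl):
```
$ scp 100006_xls.txt peterl@bastion.gigadb.host:~/
```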

Connect to the bastion with SSH

Change directory to the user dropbox associated with the dataset spreadsheet you are working on
```
$ cd /share/dropbox/user0
```

Remember where you saved the DOI_xls.txt file? Pass it as the first argument to the `compare` command, and pass the current directory (`.`) as the second argument:
```
$ compare /home/rija/100006_xls.txt .
```

If there are any discrepancies between the files listed in the spreadsheet and the files in the user dropbox, that command will output them.

> The command won't show the full list of files; it only outputs differences in either direction. If you want to see the full list of files from the spreadsheet, with the ones missing from the user dropbox highlighted, pass the `-v` parameter as the final argument to the command:
```
$ compare /home/rija/100006_xls.txt . -v
```
Binary file added docs/curators/overview.png
