
Commit

chore: Merge branch 'develop' into new-feature/2066-file-attribute-character-limit-when-adding-via-admin-pages
luistoptal committed Nov 14, 2024
2 parents 7511249 + adde9c7 commit 1effb4d
Showing 76 changed files with 2,180 additions and 1,100 deletions.
2 changes: 1 addition & 1 deletion .gitlab-ci.yml
@@ -64,7 +64,7 @@ image: docker:$DOCKER_VERSION


include:
- template: Security/SAST.gitlab-ci.yml
- template: Jobs/SAST.gitlab-ci.yml
- template: Jobs/Container-Scanning.gitlab-ci.yml
- local: "ops/pipelines/gigadb-build-jobs.yml"
- local: "ops/pipelines/gigadb-test-jobs.yml"
13 changes: 10 additions & 3 deletions CHANGELOG.md
@@ -2,13 +2,20 @@

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).


## Unreleased

- Feat #2066: Extend max length for file attribute values to 150
- Fix #2042: Batch deletion of file attributes and samples
- Feat #701: Code refactoring to separate upload status transitions and notifications to prepare for upload status overhaul
- Security #1867: Update the gitlab static application security testing (SAST) job using the Semgrep-based analyzer

## v4.4.0 - 2024-11-13 - ea1a37cc9 -

- Fix #2066: Max length for attribute value set to 1000 in file admin form
- Feat #1968: Add curators' manual for operating tools on the bastion server and improve tools usage
- Feat #1750: Switch to Guzzle instead of cURL (preliminary work to prepare for DataCite schema upgrade)
- Fix #2042: Batch deletion of file attributes and samples to make deleting files from the admin dashboard faster

## v4.3.9 - 2024-10-28 - 961f7821a -
## v4.3.9 - 2024-10-28 - 961f7821a - 2024-11-06

- Fix #1838: switch datepicker format to yyyy-mm-dd
- Feat #1768: Alphabetically sorted dataset author dropdown options in adminDatasetAuthor form
3 changes: 3 additions & 0 deletions data/gigadb_testdata/file_attributes.csv
@@ -1 +1,4 @@
id,file_id,attribute_id,value,unit_id
10669,453,605,d30b8b3549777953aeec9c82e8ac8265,
10670,454,605,da3aa9c474329f45a5f1053e1e99cc0d,
10671,457,605,35850810fcf14328b9811029b5a0d5b9,
263 changes: 263 additions & 0 deletions docs/curators/CURATOR_TOOLS_BASTION.md
@@ -0,0 +1,263 @@
# Using customized tools on the production bastion server

## Overview

![Tool Overview](./overview.png 'Overview of tools on bastion server')

New datasets are uploaded into GigaDB using Excel spreadsheets. The bastion server provides a set of command-line tools which implement the above workflow for ingesting Excel spreadsheets and performing post-upload operations.

## 1. datasetUpload

After you have logged into the bastion server (bastion.gigadb.host) using SSH, you can begin the process of Excel spreadsheet ingestion into GigaDB.
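A typical login looks like the following (a sketch; the username is illustrative, use your own bastion account):
```
$ ssh peterl@bastion.gigadb.host
[peterl@ip-10-99-0-88 ~]$
```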

Dataset metadata is added into [Excel template file version 19](https://github.com/gigascience/gigadb-website/blob/develop/gigadb/app/tools/excel-spreadsheet-uploader/template/GigaDBUpload-template_v19.xls). This Excel file needs to be placed in the `uploadDir` directory:
```
# Your home directory can be referred to using ~
[peterl@ip-10-99-0-88 ~]$ ls ~
uploadDir
```

Excel files can be uploaded into `uploadDir` using the `sftp` tool or [FileZilla](https://filezilla-project.org).
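For example, a command-line `sftp` session might look like this (a sketch; the spreadsheet filename is illustrative):
```
$ sftp peterl@bastion.gigadb.host
sftp> cd uploadDir
sftp> put GigaDBUpload_v19_mydataset.xls
sftp> quit
```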

> [!TIP]
> For testing purposes, download a test Excel file into `uploadDir` using this command: `curl -L -o "./uploadDir/GigaDBUpload_v18_102498_TRR_202311_02_Cell_Clustering_Spatial_Transcriptomics.xls" "https://drive.google.com/uc?export=download&id=129j3ikdSojNVpvZPnBefoOA2Uz6OusHR"`

The Excel file can then be ingested using the `datasetUpload` script in `/usr/local/bin`:
```
[peterl@ip-10-99-0-88 ~]$ sudo datasetUpload
Done.
```

If the ingestion process has been successful then you should see the above output. In addition, the Excel file will have disappeared from the `uploadDir` folder and there will be two log files:
```
[peterl@ip-10-99-0-88 ~]$ ls uploadDir/
java.log javac.log
```

Looking at `uploadDir/java.log` will help confirm the upload:
```
[peterl@ip-10-99-0-88 ~]$ tail uploadDir/java.log
Insert false: insert into file_attributes select 674872, 538971, 572, 'MIT', null where not exists ( select null from file_attributes where id = 674872 );
>>>>>>>About to exec sqlTemp...
execution time: 130
**End success: GigaDBUpload_v18_102498_TRR_202311_02_Cell_Clustering_Spatial_Transcriptomics.xls
```

You should also check the corresponding dataset admin page at `https://gigadb.org/adminDataset/update/id/<dataset_id>`, which you can find by entering the dataset's DOI, e.g. 102498, into the DOI column header on the /adminDataset/admin page.

Also, look at the dataset's samples and files in the relevant dataset samples and dataset files admin pages.

> [!TIP]
> If there is a problem with Excel file ingestion then you will see the following output when running `datasetUpload`:
```
[peterl@ip-10-99-0-88 ~]$ datasetUpload
Spreadsheet cannot be uploaded, please check logs!
Done.
```

> Do as the output message suggests by checking `tail uploadDir/java.log`:
```
publisher test OK? true
contentXXX: Genomics
target: dataset_type
content: Genomics
values: [Genomic, Metagenomic, Epigenomic, Proteomic, Transcriptomic, Metabolomic, Neuoscience, Bioinformatics, Workflow, Software, Imaging, Network-Analysis, Genome-Mapping, ElectroEncephaloGraphy(EEG), Metadata, Metabarcoding, Virtual-Machine, Climate, Ecology, Lipidomic, Phenotyping]
relation test OK? false
email test OK? false
attribute_id test OK? false
author_name test OK? false
project test OK? false
image test OK? false
file_type test OK? false
latest date 2024-3-6
Finished validation OK? false
End error 1: GigaDBUpload_v18_GIGA-D-23-00109-Koref4K.xls
fillTable output: true
validation output: false
[GigaDBUpload_v18_GIGA-D-23-00109-Koref4K.xls]
```

> In the above example error, the dataset_type is wrongly spelt as `Genomics`, which breaks the ingestion process and therefore needs to be corrected.

## 2. createReadme

> [!IMPORTANT]
> To execute the `createReadme` command, change directory to the dataset's user dropbox directory located at `/share/dropbox/`:
```
[peterl@ip-10-99-0-95 ~]$ cd /share/dropbox/user5
```

From this user dropbox directory, a readme file for the dataset can be created using the `createReadme` script by calling it with a DOI:
```
[peterl@ip-10-99-0-142 user5]$ pwd
/share/dropbox/user5
[peterl@ip-10-99-0-88 user5]$ sudo createReadme --doi 102498
```

A `readme_<doi>.txt` file will appear in the `/share/dropbox/user5` directory.
```
[peterl@ip-10-99-0-142 user5]$ ls
DLPFC_69_72_VNS_results.csv E2_VNS_Ground_Truth.csv readme_102498.txt
```

To create the readme file and copy it into Wasabi, extra parameters need to be provided:
```
[peterl@ip-10-99-0-88 user5]$ sudo createReadme --doi 102498 --wasabi --apply --use-live-data
```

The readme file will then have been uploaded into the correct dataset directory in the Wasabi live bucket, and its file size and MD5 value will also have been updated in the database.
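As an optional sanity check, the readme's MD5 value and size can be computed locally with standard tools and compared against what the file admin page shows (a sketch; assumes GNU coreutils are available on the bastion):
```
# MD5 checksum of the generated readme
[peterl@ip-10-99-0-88 user5]$ md5sum readme_102498.txt
# File size in bytes
[peterl@ip-10-99-0-88 user5]$ stat -c %s readme_102498.txt
```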

## 3. calculateChecksumSizes

> [!IMPORTANT]
> To execute the `calculateChecksumSizes` command, change directory to the dataset's user dropbox directory located at `/share/dropbox/`:
```
[peterl@ip-10-99-0-95 ~]$ cd /share/dropbox/user5
```

`$doi.md5` and `$doi.filesizes` provide the information used to update dataset files with MD5 values and file sizes in the database. These two files can be generated from the user5 dropbox directory:
```
# Provide DOI number as a parameter
[peterl@ip-10-99-0-95 user5]$ sudo calculateChecksumSizes 102498
Created 102498.md5
Created 102498.filesizes
```

Check the contents of the two files:
```
[peterl@ip-10-99-0-95 user5]$ more 102498.filesizes
5124 ./readme_102498.txt
301 ./DLPFC_69_72_VNS_results.csv
332 ./E2_VNS_Ground_Truth.csv
[peterl@ip-10-99-0-95 user5]$ more 102498.md5
2b74aa5af1b67e48f0317748cbfdf310 ./readme_102498.txt
dc1feb8af3b8c02b0b615e968b87786d ./DLPFC_69_72_VNS_results.csv
b5a7e0953d1581077c13818153371918 ./E2_VNS_Ground_Truth.csv
```
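For reference, roughly equivalent information can be produced with standard tools, which can be handy when cross-checking the generated files (a sketch only; the actual script may order or format entries differently):
```
# MD5 checksums for every file in the dropbox (similar to 102498.md5)
[peterl@ip-10-99-0-95 user5]$ find . -type f -exec md5sum {} +
# File size in bytes followed by the path (similar to 102498.filesizes)
[peterl@ip-10-99-0-95 user5]$ find . -type f -printf '%s %p\n'
```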

## 4. filesMetaToDb

> [!IMPORTANT]
> To execute the `filesMetaToDb` command, change directory to the dataset's user dropbox directory located at `/share/dropbox/`:
```
[peterl@ip-10-99-0-95 ~]$ cd /share/dropbox/user5
```

The `filesMetaToDb` script can use `102498.filesizes` and `102498.md5` to update file metadata in the database from the user dropbox folder:
```
[peterl@ip-10-99-0-95 ~]$ sudo filesMetaToDb 102498
Updating md5 checksum values as file attributes for 102498
Number of changes: 3
Updating file sizes for 102498
Number of changes: 3
Updated file metadata for 102498 in database
```

You should check the file admin pages of the files associated with this dataset to see whether MD5 values and file sizes are visible.

## 5. Go to dataset admin page on gigadb.org

With the post-upload operations complete, go back to the page at `https://gigadb.org/adminDataset/update/id/<dataset_id>` to continue curation work on the dataset. You can find this link by entering the dataset's DOI, e.g. 102498, into the DOI column header on the /adminDataset/admin page.

On the dataset admin page, you will be able to create a mockup page in order to preview the final dataset view page with the information that was added to the database in the previous steps.

## 6. `transfer` - copy dataset files into Wasabi

When all files in a dataset have been finalised and curated, they can be copied into Wasabi using the `transfer` tool. The path to the user dropbox directory is provided as the `--sourcePath` parameter, and the value of the `--doi` parameter is the DOI of the dataset. The `--wasabi` flag tells the `transfer` tool to copy files into Wasabi storage. The `--apply` flag takes the `transfer` tool out of dry-run mode so that files are actually transferred into Wasabi storage from the user dropbox directory.
```
[peterl@ip-10-99-0-95 user5]$ transfer --doi 102498 --sourcePath /share/dropbox/user5/ --wasabi --apply
```
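Because `--apply` is what takes the tool out of dry-run mode, you may want to run the same command without it first to preview the transfer (assuming the dry run reports what would be copied without modifying anything):
```
[peterl@ip-10-99-0-95 user5]$ transfer --doi 102498 --sourcePath /share/dropbox/user5/ --wasabi
```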

## 7. Housekeeping of user dropboxes of published datasets

After dataset files have been copied into Wasabi, the files should also be backed up into S3 Glacier:
```
[peterl@ip-10-99-0-95 user5]$ transfer --doi 102498 --sourcePath /share/dropbox/user5/ --backup --apply
More details about copying files to s3 bucket, please refer to: /var/log/gigadb/transfer.log
```

Confirm files have been backed up in S3 Glacier:
```
[peterl@ip-10-99-0-56 ~]$ tail /var/log/gigadb/transfer.log
2024/11/06 07:43:03 INFO : Start copying files from staging to s3
2024/11/06 07:43:04 INFO : 102498.filesizes: Copied (new)
2024/11/06 07:43:04 INFO : 102498.md5: Copied (new)
2024/11/06 07:43:04 INFO : DLPFC_69_72_VNS_results.csv: Copied (new)
2024/11/06 07:43:04 INFO : E2_VNS_Ground_Truth.csv: Copied (new)
2024/11/06 07:43:04 INFO : readme_102498.txt: Copied (new)
2024/11/06 07:43:04 INFO : Executed: rclone copy --s3-no-check-bucket --s3-profile aws-transfer /share/dropbox/user5/ gigadb-datasetfiles:gigadb-datasetfiles-backup/staging/pub/10.5524/102001_103000/102498 --log-file /var/log/gigadb/transfer.log --log-level INFO --stats-log-level DEBUG >> /var/log/gigadb/transfer.log
2024/11/06 07:43:04 INFO : Successfully copied files to s3 bucket for DOI: 102498
```

After you have confirmed the files are safely stored in Wasabi and Glacier, and the manuscript has been published, the `user` and `user.orig` dropbox directories should be deleted to save storage space.
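No dedicated housekeeping tool is described here; assuming the directories are removed manually, the commands would look something like the following for the user5 dropbox (double-check the Wasabi and Glacier copies before deleting anything):
```
# Example only: adjust the dropbox name to the dataset you are working on
[peterl@ip-10-99-0-95 ~]$ sudo rm -rf /share/dropbox/user5 /share/dropbox/user5.orig
```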

## `postUpload`: a wrapper script to create readme file and update file metadata in database

> [!IMPORTANT]
> To execute the `postUpload` command, change directory to the dataset's associated user dropbox directory located at `/share/dropbox/`:
```
[peterl@ip-10-99-0-95 ~]$ cd /share/dropbox/user5
```

There is a script called `postUpload` which calls `createReadme`, `calculateChecksumSizes` and `filesMetaToDb` in turn so that these three tools do not have to be executed manually one after another:
```
# Ensure you are in the dropbox directory
[peterl@ip-10-99-0-88 ~]$ pwd
/share/dropbox/user5
[peterl@ip-10-99-0-88 ~]$ sudo postUpload --doi 102498 --dropbox user5
Creating README file for 102498
[DOI]
10.5524/102498
...
[Comments]
[End]
Created readme file and uploaded it to Wasabi gigadb-website/staging bucket directory
Creating dataset metadata files for 102498
Created 102498.md5
Created 102498.filesizes
Updating file sizes and MD5 values in database for 102498
Updating md5 checksum values as file attributes for 102498
Number of changes: 3
Updating file sizes for 102498
Number of changes: 3
Updated file metadata for 102498 in database
```
> [!TIP]
> Take note of the number of changes made by the MD5 and file size update tool. This number should equal the number of files listed in the metadata files.

To ensure the `postUpload` script has worked, you should perform checks using the dataset and file admin pages to see whether the dataset metadata are correctly stored in the database.

## `compare`: How to compare files in the user dropbox with the files in the dataset spreadsheet

Discrepancies between the state of the filesystem in a user dropbox and the list of files in the dataset spreadsheet will cause errors when the spreadsheet is processed and when file metadata are saved to the database at a later stage of the process. It may also be necessary to curate the actual files in user dropboxes for conformity to guidelines or for organisational purposes, which can introduce such discrepancies.

To avoid this problem, it is important to reconcile both file lists regularly. To help with that, there is a command available on the bastion server, called `compare`, which compares the list of files in the dataset spreadsheet with the list of files on the filesystem. By default, the command reports any discrepancies in both directions.

### How to use the tool

Open the dataset spreadsheet you are working on at the files list tab

Copy the list of files and paste it into a text file that you save as DOI_xls.txt (where DOI is to be replaced by the real DOI of the dataset you are working on)

Upload the text file listing all the files from the spreadsheet to your home directory on the bastion server using your preferred method (FileZilla, scp, …)
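For example, with `scp` (a sketch; assumes the file was saved locally as 100006_xls.txt and that your bastion account is peterl):
```
$ scp 100006_xls.txt peterl@bastion.gigadb.host:~/
```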

Connect to the bastion with SSH

Change directory to the user dropbox associated with the dataset spreadsheet you are working on
```
$ cd /share/dropbox/user0
```

Remember where you saved the DOI_xls.txt file? Pass it as the first argument to the `compare` command, and pass the current directory (`.`) as the second argument:
```
$ compare /home/rija/100006_xls.txt .
```

If there are any discrepancies between the files listed in the spreadsheet and the files in the user dropbox, that command will output them.

> The command won't show the full list of files; it only outputs differences in either direction. If you want to see the full list of files from the spreadsheet, with the ones missing from the user dropbox highlighted, pass the `-v` parameter as the final argument to the command:
```
$ compare /home/rija/100006_xls.txt . -v
```
Binary file added docs/curators/overview.png
