
Vm ww viloca automation #17

Open: wants to merge 118 commits into base `master`.

Commits
fc94140
[WIP] initial restructuring from S3C
Jun 8, 2023
3abb8e6
wastewater-only modifications
Jun 8, 2023
b7277ce
First version of VM-based wastewater automation
Sep 28, 2023
a1cc858
VM automation + viloca autorun
Oct 3, 2023
b10f167
production Dockerfile
Oct 3, 2023
b419933
mirrors now reside on Euler
Oct 4, 2023
761d684
sort samples relies on Euler downloads
Oct 4, 2023
556fc71
added storing names of processed batches for uploader
Oct 5, 2023
a361056
[FIX] partial fix FGCZ connection
Oct 5, 2023
8dc3b33
[FIX] upload call
Oct 5, 2023
e7a8eca
[WIP] new euler-based architecture
Nov 6, 2023
79b98a1
[WIP] sync fix
Nov 6, 2023
23bbfe8
fix sync_fgcz batman
Nov 6, 2023
06ee538
cleanup
Nov 6, 2023
3a5c94e
fix conda
Nov 6, 2023
fd315f4
fix batman vars
Nov 6, 2023
1d55153
[WIP] paths fix
Nov 6, 2023
3e35b11
fix conda
Nov 6, 2023
8ff5849
sync setup
Nov 7, 2023
f73afce
fix viloca
Nov 7, 2023
7413ec2
development
Nov 8, 2023
d92adfe
debugging configuration
Nov 9, 2023
41c30b1
[WIP] uploader migration
Nov 13, 2023
468f8a6
uploader migration
Nov 13, 2023
74ed6d8
uploader - cleanup
Nov 13, 2023
8e93e62
fix paths
Nov 13, 2023
0608f70
fix sync
Nov 13, 2023
1f7eea3
fix regular expression
Nov 13, 2023
ad628a6
fix config
Nov 13, 2023
e462335
fix batch check
Nov 14, 2023
254ea25
activate duplicate check
Nov 14, 2023
9217eb8
fix movedatafiles
Nov 14, 2023
119f4fa
add quotas for uploader
Nov 14, 2023
ef26c4d
removed debug entrypoint
Nov 14, 2023
17d32a2
bug fixes
Nov 15, 2023
2c7c7be
fix paths
Nov 15, 2023
09904df
disable mail
Nov 15, 2023
374d578
including migration to slurm
Nov 15, 2023
bb9f55d
fix batman.sh functions
Nov 15, 2023
5135aa2
fix movedatafiles
Nov 15, 2023
2335f82
Merging latest master into VM branch
DrYak Nov 16, 2023
d82ecc3
updated paths
Nov 20, 2023
16c2031
fix paths
Nov 20, 2023
e41d6cc
fix configuration
Nov 20, 2023
ac47274
fix paths
Nov 20, 2023
f71fb03
fix configuration
Nov 20, 2023
7af9f03
detangled backup from processing
Nov 20, 2023
4b99046
moving queue_uploads remotely
Nov 20, 2023
e137277
status tracking bug fixes
Nov 22, 2023
86730f4
fix viloca paths
Nov 23, 2023
8934947
bug fix viloca
Nov 24, 2023
4276343
bug fix run detection VILOCA
Nov 24, 2023
d23a1c2
multiple fixes uploader code
Nov 24, 2023
a5c61bb
update uploader paths
Dec 1, 2023
3cab86a
updated uploader variables
Dec 1, 2023
c609bc0
added listing viloca samples
Dec 1, 2023
e04c0d4
added completion check and rerun VILOCA
Dec 1, 2023
ccaa35a
fixed viloca status tracking
Dec 4, 2023
7e9aa89
add amplicon_coverage step
Dec 6, 2023
d687415
added conda environment yaml files
Dec 6, 2023
20fca02
fix amplicon cov paths
Dec 6, 2023
cef0088
fix typo
Dec 6, 2023
0785ad0
fix typo
Dec 6, 2023
13f1899
restore container functionality after testing
Dec 6, 2023
810ad71
added manual amplicon_cov options
Dec 6, 2023
327333f
fix paths
Dec 6, 2023
cb3dd4d
fix manual amplicon_cov functions
Dec 6, 2023
fc45ded
removing unused loop
Dec 6, 2023
02c2d2e
fix variables
Dec 6, 2023
423de67
adapting carillon to new amplicon cov syntax
Dec 6, 2023
61905d9
fix typo
Dec 6, 2023
4ba0ed9
renamed status files for clarity
Dec 7, 2023
2e4752a
adding hostname to sync output
Dec 7, 2023
8b090c5
fix directories viloca
Dec 8, 2023
08c18ca
implemented uploader at VM level
Dec 15, 2023
f0cc6ef
[WIP] uploader metadata fixes
Dec 18, 2023
7dfc60c
- added backup system on BSSE folder
Jan 18, 2024
c6a46bd
cleanup viloca command
Jan 18, 2024
8f49fd3
bug fixes viloca run
Jan 18, 2024
87aefe3
backups setup and fixes
Feb 5, 2024
2fb43d4
[fix] exception for malformed json stat files
Feb 5, 2024
26dc529
activate backups
Feb 5, 2024
0b72f3a
fix amplicon cov backup status test
Feb 5, 2024
1ce30d9
fix viloca restart logic
Feb 5, 2024
f7580ca
bug fixes
Feb 7, 2024
11603c4
Documentation of VM annotation
Feb 8, 2024
e5367e2
add reproducibility information to statusfiles
Feb 9, 2024
3b9b137
pangolin version and output saved to log files
Feb 9, 2024
6fb487d
fix uploader strain name building
Feb 19, 2024
1e9ef58
bug fixes
Feb 29, 2024
64c9ec5
updating settings
Mar 5, 2024
176e150
fix batman amplicon_coverage
Mar 5, 2024
a6ce337
update blacklist
Mar 11, 2024
243e1f4
bug fix on parsing older sample names
Mar 11, 2024
661de70
added processing of project p34560
Mar 26, 2024
6cb83cf
added blacklisting to uploader
Apr 2, 2024
9572e05
update configuration
Apr 8, 2024
efc7d2b
add full year processing commands
Apr 15, 2024
936f15b
update settings
May 13, 2024
9c2f66c
[FIX] adapt to new FGCZ filename conventions
May 14, 2024
68c7cfc
[FIX] adapt to new FGCZ filenames
May 14, 2024
14acf71
[FIX] fixing FGCZ filename changes
May 14, 2024
c29169a
[FIX] fixed uploader switch
May 14, 2024
7468f52
[FEAT] added blacklist for uploader
May 23, 2024
ade057f
[FEAT] ftp/https fgcz sync choice
Jun 12, 2024
8ca47c9
[FEAT] aviti sequencing conditional
Jun 12, 2024
60cafb1
[WIP] aviti automation changes
Jun 26, 2024
8a2c5f9
[FIX] uploader does not attempt uploads if no sample is available aft…
Jun 26, 2024
ea79adc
[FEAT] sortsamples for Aviti sequencing
Jul 3, 2024
f6be5a4
[FIX] aviti sequencing
Jul 3, 2024
2212780
[FEAT] branching automation for Aviti sequencing
Jul 8, 2024
95c7ab8
[FIX] https/ftp download fix
Jul 10, 2024
ec71ec7
[FIX] fix ftp/https switch
Jul 10, 2024
cc00e81
[FIX] fix conditional stopping uploads for empty submissions
Jul 10, 2024
6ff48b5
[FIX] restore FTP downloads with new folder structure
Jul 23, 2024
63220bf
[FIX] bug fix --recent option
Aug 12, 2024
d5bf3d9
[FIX] bug fixing
Aug 12, 2024
423cfc3
[FEAT] aviti processing
Oct 28, 2024
39 changes: 4 additions & 35 deletions .gitignore
@@ -1,36 +1,5 @@
quick_install.sh
Miniconda3-latest-Linux-x86_64.sh
miniconda3
openbis-downloads
bfabric-downloads
sftp-health2030
sftp-viollier
token.pickle
credentials.json
snake-envs
samples*
working/samples
working/samples.tsv
working/lsf.*
working/references/*.fasta.*
working/references/*.log
working/references/*.benchmark
working/variants/
working/qa.csv
working/qa_report.html
working/qa_vals.csv
garbage/
status/
._*
.snapshot
.ipynb_checkpoints

#logs
slurm-*
.snakemake
cluster_logs


# vim backup files:
ssh
.*.sw?
.swp
secrets
pangolin_src/fgcz-gstore.uzh.ch
uploader/__pycache__
27 changes: 27 additions & 0 deletions Dockerfile
@@ -0,0 +1,27 @@
FROM debian:buster-slim

RUN addgroup --gid 1029 bs-pangolin && adduser --ingroup bs-pangolin --uid 542576 bs-pangolin

WORKDIR /root

RUN mkdir /home/bs-pangolin/.ssh
RUN chown -R bs-pangolin:bs-pangolin /home/bs-pangolin/.ssh

RUN apt-get update && apt-get install -y vim wget lftp rsync gawk ssh git gpg expect

USER bs-pangolin
WORKDIR /app/
RUN mkdir -p setup
COPY pangolin_src/setup /app/setup
RUN /app/setup/setup.sh

USER root
COPY --chown=bs-pangolin:bs-pangolin pangolin_src /app/pangolin_src
COPY --chown=bs-pangolin:bs-pangolin uploader /app/uploader
USER bs-pangolin

WORKDIR /app/pangolin_src



ENTRYPOINT ["/app/pangolin_src/entrypoint.sh"]
123 changes: 123 additions & 0 deletions README.md
@@ -28,3 +28,126 @@ for the code used in wastewater analysis, see:
Short Oral presentation done at the [ECCVID conference in September 2020](https://www.escmid.org/research_projects/escmid_conferences/past_escmid_conferences/eccvid/):

- https://youtu.be/BJ-un88CT9A

## Automation

The automation is a collection of scripts connected by a loop that is configured to monitor, log, and kickstart all necessary processes. The automation is completely dockerized.

The automation runs on VM `wisedb.nexus.ethz.ch`. Access is granted by the VM administrators and uses the ETH LDAP credentials.

The docker container dedicated to running the automation is named `revseq-revseq-1` and runs indefinitely in an internal loop.

### Setup

#### Pre-requisites
- [Docker engine](https://docs.docker.com/engine/install/)

#### Configuration
Automation configuration relies on the file `pangolin_src/config/server.conf`. The file contains all necessary variables for the automation, fully commented for clarity.

Please refer directly to the file to define directories and automation behaviors as necessary.

#### Resources and secrets
The automation relies on a set of resources and secrets to successfully connect to external services. The necessary files are as follows:
- a `resource` directory with the files:
  - `config`, the ssh config file with an entry for the FGCZ SFTP server
  - `id_ed25519_spsp_uploads.pub`, the public key used to connect to SPSP for the uploads
  - `id_ed25519_wisedb_backups.pub`, the public key used to connect to bs-bewi08 for the backups
  - `id_ed25519_wisedb.pub`, the main public key used by the container to connect to external services
  - `id_euler_ed25519.pub`, the public key used to connect to Euler's rsync daemon
  - `known_hosts`, the ssh file containing the accepted host keys of the FGCZ SFTP server, Euler and bs-bewi08. The remaining hosts are added automatically during container deployment
- a `secrets` directory with the files:
  - `bs-pangolin@d@bs-bewi08`, a file with the credentials to log in to bs-bewi08
  - `[email protected]`, a file with the credentials to log in to Euler
  - `ABC9FC14AAC952E7767FD14A48B70E724BAFE0A3.asc`, the GPG key provided by SPSP for the uploads. Please refer to the [SendCrypt manual](https://gitlab.sib.swiss/clinbio/sendcrypt/sendcrypt-cli/-/tree/main?ref_type=heads) for further details
  - `bs-pangolin_spsp-uploads_gpg-key.gpg`, the GPG key created locally for the SPSP uploads. Please refer to the [SendCrypt manual](https://gitlab.sib.swiss/clinbio/sendcrypt/sendcrypt-cli/-/tree/main?ref_type=heads) for further details
  - `default.sendcrypt-profile`, the profile to use for the SPSP uploads
  - `fgcz-gstore.uzh.ch`, a file with the credentials to log in to the FGCZ SFTP server
  - `gpg_key_secrets`, a file containing the password of the local GPG key for the SPSP uploads
  - `id_ed25519_spsp_uploads`, the private key to connect to SPSP for the uploads
  - `id_ed25519_wisedb`, the main private key used to connect to external services
  - `id_ed25519_wisedb_backups`, the private key to connect to bs-bewi08 for the backups
  - `id_euler_ed25519`, the private key to connect to Euler
  - `rsync.pass.euler`, the password to access the rsync daemon on Euler
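Before the first deployment it can be useful to verify that these files are in place. A minimal pre-flight check could look like the sketch below; the helper name and the example file lists are illustrative, not part of the repository:

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: verify that a resource/secret directory
# contains the expected files before starting the container.
check_files() {
    local dir=$1; shift
    local missing=0 f
    for f in "$@"; do
        if [ ! -e "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example usage (file names taken from the lists above):
# check_files resource config known_hosts id_ed25519_wisedb.pub
# check_files secrets id_ed25519_wisedb rsync.pass.euler
```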

#### Docker configuration

docker-compose needs to be adapted to the host machine filesystem for a successful deployment. The sections `services-pangolin-build-context`, `service-pangolin-volumes` and `secrets` rely on host absolute paths and are currently configured for the dedicated VM.
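As an orientation, the host-specific sections have roughly the following shape. All paths below are placeholders standing in for the real VM layout, not actual values:

```yaml
services:
  pangolin:
    build:
      context: /host/path/to/repo              # services-pangolin-build-context
    volumes:
      - /host/path/to/status:/app/status       # service-pangolin-volumes
secrets:
  fgcz-gstore.uzh.ch:
    file: /host/path/to/secrets/fgcz-gstore.uzh.ch
```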

### Deployment

It is suggested to deploy the automation using `docker-compose up --detach`.

### Steps

The automation loop is started by `quasimodo.sh`, which is in charge of running `carillon.sh` on every loop iteration. The `carillon.sh` code can be divided into multiple phases.

1. Data sync:
   - Provided the sync is activated in the configuration, the script runs a forced command on Euler via `batman.sh`, starting an rsync job that mirrors any new raw data from the FGCZ SFTP server to the Euler directory `bfabric-downloads`
   - The exit status of the sync procedure is downloaded from Euler and checked for success before the next procedure runs
   - The script runs a forced command on Euler to execute `sort_samples_bfabric.py` through `belfry.sh sortsamples --recent`. The command checks the consistency and completeness of the synced plates and creates the directory structure required by V-pipe
2. V-pipe run check:
   - The automation checks whether any V-pipe run is ongoing on Euler and logs the status
   - If there is no ongoing V-pipe run, the results are backed up to bs-bewi08
   - If there is no ongoing V-pipe run, the latest batch successfully analysed by V-pipe is added to the queue of samples to upload to SPSP
3. Start V-pipe run:
   - If the previous steps detected and successfully handled a new full plate to analyse, and there is no ongoing V-pipe run on Euler, the automation submits a new V-pipe job on Euler to analyse the new data and logs the new submission
4. VILOCA run check:
   - The automation checks whether any VILOCA run is ongoing on Euler and logs the status
   - If the previous VILOCA run is detected to have ended with a TIMEOUT, the VILOCA snakemake directory is unlocked and the run is retried
   - If there is no ongoing VILOCA run, the results are backed up to bs-bewi08
5. Start VILOCA run:
   - If the V-pipe logs show a new batch successfully analysed by V-pipe and there is no ongoing VILOCA run on Euler, the automation submits a new VILOCA job on Euler using a forced command to analyse the new data and logs the new submission
6. Upload samples to SPSP:
   - The automation keeps track of the number of files submitted to SPSP and the estimated total size uploaded per day. As a first step, before attempting any upload, the automation checks whether the quotas set in the configuration have been reached
   - If no quota is reached and the uploads are activated, the automation triggers the upload scripts
   - The scripts take as input a full list of samples to upload and a list of samples that have already been uploaded
   - A temporary list of samples to upload is generated and used to download the related cram files from Euler
   - Using `uploader/submission_metadata.py`, an SPSP metadata line is created for each sample to submit. Metadata lines are sanity checked before submission and the procedure fails if provided with unexpected samples
   - The directory containing the cram files and the metadata is submitted to SPSP using the provided SendCrypt tool
   - The temporary folders created by SendCrypt are deleted
   - The metadata and logs from the submission are archived and backed up to bs-bewi08
7. Amplicon coverage:
   - If the V-pipe logs show a new batch successfully analysed by V-pipe, the automation triggers the scripts that compute the amplicon coverage using forced commands
   - The results (a csv table of the coverage per amplicon and a heatmap visualization of the table) are backed up to bs-bewi08
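The phase ordering above can be sketched as follows. Every function here is a hypothetical stub standing in for the corresponding `carillon.sh` logic, not the real implementation:

```shell
#!/usr/bin/env bash
# Illustrative sketch of one carillon.sh loop iteration; all functions
# are stubs standing in for the real phases.
set -euo pipefail

sync_from_fgcz()         { echo "synced"; }           # phase 1: data sync
vpipe_running()          { return 1; }                # phase 2: run check (stub: not running)
backup_results()         { echo "backed up"; }
start_vpipe()            { echo "vpipe submitted"; }  # phase 3
check_and_start_viloca() { echo "viloca checked"; }   # phases 4-5
upload_to_spsp()         { echo "uploaded"; }         # phase 6 (quota permitting)
amplicon_coverage()      { echo "coverage done"; }    # phase 7

carillon_iteration() {
    sync_from_fgcz
    if ! vpipe_running; then
        backup_results
        start_vpipe
    fi
    check_and_start_viloca
    upload_to_spsp
    amplicon_coverage
}

carillon_iteration
```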

# Processing
## Sample collection
Samples from the WWTPs under surveillance are collected by Eawag over the course of a week, usually from Tuesday afternoon to the following Tuesday morning. The frequency of collection and the number of WWTPs under surveillance depend on the current terms of agreement. Eawag prepares the samples and ships them to the sequencing center (FGCZ) on Tuesday.

## Sequencing
FGCZ receives the samples and completes the library preparation and the sequencing. The sequencing platform and library prep kit change depending on the available technologies and the scientific evidence on the detection efficiency for SARS-CoV-2.

## Raw data retrieval
Raw sequencing results are delivered by FGCZ to their SFTP server. The automation regularly checks the server for new data to process. If new data is available, the entire delivery is mirrored on Euler, checked for consistency with the metadata, and V-pipe is run on the samples. The automation differentiates between Aviti and NextSeq sequencing and runs V-pipe with platform-specific settings.

## Lollipop (temporary)
The current automated V-pipe analysis does not include the Lollipop deconvolution steps. Until those steps are integrated into the automated V-pipe run, Lollipop needs to be run manually.

To do so, a user is required to wait for the automated Slurm email confirming that a run on a new batch has successfully completed, then run the script `cowabunga.sh` to hardlink the necessary data to the dedicated Lollipop directory. More specifically, the two commands to run are:
- `./cowabunga.sh autoaddwastewater year` to list all batches received in the current year and update the sample list with the samples from the new batches
- `./cowabunga.sh bring_results` to hardlink the necessary V-pipe output to the Lollipop directory `work-vp-test` in preparation for the Lollipop run

After running `cowabunga.sh`, the user can navigate to the working directory `work-vp-test` and run Lollipop using the command `sbatch vpipe-test.sbatch`.
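The manual sequence can be bundled into a small wrapper, sketched here under the assumption that it is run from the directory containing `cowabunga.sh`; the wrapper itself is illustrative and not part of the repository:

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the manual Lollipop steps described above.
manual_lollipop() {
    local year=${1:-$(date +%Y)}          # default to the current year
    ./cowabunga.sh autoaddwastewater "$year" &&
    ./cowabunga.sh bring_results &&
    ( cd work-vp-test && sbatch vpipe-test.sbatch )
}
```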

## Postprocessing
Lollipop results need to be postprocessed to generate and visualize the curves, as well as share the results with BAG.

As with the automated V-pipe step, Lollipop reports the status of a run through automated Slurm emails. If a Lollipop run is successful, the post-processing can start.

All postprocessing happens on a dedicated machine called `jupyterhub05`. The machine runs a Jupyter notebook server pre-configured with the necessary kernels; kernel definitions are available on the server. The necessary notebooks are `ww_cov_uploader_V-pipe` and `ww_cov_SwitzerlandMap`.

- `ww_cov_uploader_V-pipe` loads the deconvolution and tallymut results, processing them for plotting and upload
  - A user first needs to run the notebook up to the cell `Multiple choice time`. The curve previews should then be shared for review before proceeding
  - If the review finds abnormalities, the cell `db.rollback()` should be run to cancel any update to the cov-spectrum database
  - If the review finds no abnormalities, the cell `db.rollback()` should be *skipped* and the remainder of the notebook can be run, committing the changes to the cov-spectrum database and uploading the results to Polybox for BAG and for the public
- `ww_cov_SwitzerlandMap` loads the deconvolution and tallymut results, processing them into cake plots with the average relative abundances for the week at each WWTP on a map of Switzerland
  - The plot needs to be included in the email reports. The plot caption can be copy-pasted from previous weeks unless the method or the WWTPs change

After running both notebooks, the email report can be written. The mandatory content is as follows:
- Last collection date available for each WWTP. This information is provided as output of the early cells of the notebook `ww_cov_uploader_V-pipe`
- Comment on the current status, mentioning the dominant variant (if any) and the observable relative abundance trends
- Signature
- Switzerland map plot with its caption

The email report must be reviewed by an additional person before submission. Submission should be done by sending the email to the dedicated mailing list.
1 change: 0 additions & 1 deletion V-pipe
Submodule V-pipe deleted from e8be5f
1 change: 0 additions & 1 deletion V-pipe-test
Submodule V-pipe-test deleted from e8be5f