This directory contains the scripts needed to run the experiments with artificial mutations.
The experiment consists of running the workflow with 192 different inputs. For testing purposes, a subset of the inputs may be used. To run the experiment, please follow these steps.
To simplify running the experiment, the repository contains a helper script, experiment_helper.py. All of its available options may be listed with `python3 experiment_helper.py --help`.

- Change to the experiment directory:

  ```
  cd experiments-with-artificial-mutations
  ```
- The identifiers of the inputs are listed in all-experiment-names.txt. Decide which inputs to run the experiment with, and copy the list with e.g.

  ```
  cp -i all-experiment-names.txt experiment-names.txt
  ```

  Possibly remove some of the lines in order to run the experiment with fewer inputs.
- Do one of the following:
  - Download prepared indices needed to run the experiment as follows:
    - Create a list of the compressed index URLs with

      ```
      python3 experiment_helper.py --print-index-urls --experiment-list experiment-names.txt > index-urls.txt
      ```

    - Download the files with e.g.

      ```
      wget --content-disposition --trust-server-names -i index-urls.txt
      ```

    - Extract the contents of the archives with e.g.

      ```
      ls *.tar.bz2 | while read x; do pbzip2 -d -c "$x" | tar x; done
      ```

      The indices should be automatically placed in a subdirectory called indices. The downloaded .tar.bz2 files are not needed after this step.
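As a quick sanity check, the extracted indices can be compared against the experiment list. The sketch below assumes one entry under indices/ per experiment identifier; adjust the path if the layout differs:

```shell
# Report any experiment identifier from the list whose extracted index is missing
while read -r name; do
    [ -e "indices/$name" ] || echo "missing index: $name"
done < experiment-names.txt
```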
  - Download A2M inputs and generate the indices as follows:
    - Create a list of the corresponding input files with

      ```
      python3 experiment_helper.py --print-index-input-urls --experiment-list experiment-names.txt > index-input-urls.txt
      ```

    - Download the files with e.g.

      ```
      wget --content-disposition --trust-server-names -i index-input-urls.txt
      ```

    - Extract the contents of the archives to a subdirectory called a2m.
    - Get a list of commands to generate the indices from experiment_helper.py. These may be piped directly to the shell with e.g.

      ```
      python3 experiment_helper.py --print-indexing-commands --experiment-list experiment-names.txt --snakemake-arguments '--cores 32 --conda-prefix ../conda-env --resources mem_mb=16000' | bash -x -e
      ```

      Alternatively, since some of the steps of the workflow have not been parallelised, the commands may be written to a file and executed with e.g. GNU Parallel:

      ```
      python3 experiment_helper.py ... > index-commands.txt
      parallel -j16 < index-commands.txt
      ```
- Download the reads used in the experiment and extract them; please see the commands below. The compressed FASTQ files should be automatically placed in a subdirectory called genreads. (In addition to the separate read files, some parts of the workflow require all the reads in one file. The file is automatically generated as part of the workflow, but we also provide the generated files.)

  ```
  wget https://cs.helsinki.fi/group/gsa/panvc-founders/e-coli-experiment/reads/genreads-cov10.tar
  wget https://cs.helsinki.fi/group/gsa/panvc-founders/e-coli-experiment/reads/genreads-cov20.tar
  wget https://cs.helsinki.fi/group/gsa/panvc-founders/e-coli-experiment/reads/genreads-cov10-renamed.tar
  wget https://cs.helsinki.fi/group/gsa/panvc-founders/e-coli-experiment/reads/genreads-cov20-renamed.tar
  tar xf genreads-cov10.tar
  tar xf genreads-cov20.tar
  tar xf genreads-cov10-renamed.tar
  tar xf genreads-cov20-renamed.tar
  ```
- Download sequences-truth.tar.gz and extract. The plain text files should be automatically placed in a subdirectory called sequences-truth.

  ```
  wget https://cs.helsinki.fi/group/gsa/panvc-founders/e-coli-experiment/sequences-truth.tar.gz
  tar xzf sequences-truth.tar.gz
  ```
- Download e.coli.fa.gz and extract. Some of our tools require the sequence part of the FASTA not to contain any newlines; we have modified the file accordingly.

  ```
  wget https://cs.helsinki.fi/group/gsa/panvc-founders/e-coli-experiment/e.coli.fa.gz
  gunzip e.coli.fa.gz
  ```
- Run the variant calling workflow. To this end, get a list of commands from experiment_helper.py. These may be piped directly to the shell with e.g.

  ```
  python3 experiment_helper.py --print-variant-calling-commands --experiment-list experiment-names.txt --snakemake-arguments '--cores 32 --conda-prefix ../conda-env --resources mem_mb=16000' | bash -x -e
  ```
- Generate the predicted sequences from the variants. As the process is rather I/O intensive, we recommend using one core with Snakemake:

  ```
  python3 experiment_helper.py --print-predicted-sequence-generation-commands --experiment-list experiment-names.txt --snakemake-arguments '--cores 1 --conda-prefix ../conda-env' | bash -x -e
  ```
- Compare the predicted sequences to the truth with Edlib:

  ```
  python3 experiment_helper.py --print-sequence-comparison-commands --experiment-list experiment-names.txt --snakemake-arguments '--cores 32 --conda-prefix ../conda-env --resources mem_mb=16000' | bash -x -e
  ```
- Run

  ```
  ./summarize-edlib-scores.sh
  ```

  to create a summary of the calculated scores in TSV format.
The generated files are placed in subdirectories as listed in the following table.
Result | Directory |
---|---|
Edit distances from the truth | edlib-scores |
Predicted sequences | predicted-sequences/experiment-identifier/predicted.workflow.variant-caller.txt |
Variants called with the PanVC workflow | call/experiment-identifier/ext_vc/pg_variants.variant-caller.vcf |
Variants called with the baseline workflow | call/experiment-identifier/baseline_vc/variants.variant-caller.vcf |
The following archives contain the reads used in the experiment in gzip-compressed FASTQ format. (Hence the archives themselves have not been compressed again.)
Reads | Coverage | Note |
---|---|---|
genreads-cov10.tar | 10 | |
genreads-cov20.tar | 20 | |
genreads-cov10-renamed.tar | 10 | All reads in one file |
genreads-cov20-renamed.tar | 20 | All reads in one file |
The following archives contain the actual (not predicted) variants in the generated samples. The identifier of the removed sample is SAMPLE0 in all cases.
Description | File |
---|---|
Samples removed in the experiments | variants-truth.tar.gz |
All samples | variants-all.tar.bz2 |
The following archives contain the actual sequences of the samples that were removed in the experiments.
Sequences |
---|
sequences-truth.tar.gz |
Archives that contain the pregenerated indices have been listed in e-coli-indices.md.
All sequences in one archive |
---|
founder-sequences-a2m.tar.bz2 |
Individual sequence files have been listed in e-coli-index-inputs.md.