Db params #121

Merged
merged 12 commits into from
Nov 22, 2024
12 changes: 1 addition & 11 deletions .github/workflows/ci.yml
@@ -35,19 +35,9 @@ jobs:
with:
version: "${{ matrix.NXF_VER }}"

- - name: Download the NCBI taxdump database
- run: |
- mkdir ncbi_taxdump
- curl -L https://ftp.ncbi.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz | tar -C ncbi_taxdump -xzf -
-
- - name: Download the BUSCO lineage database
- run: |
- mkdir busco_database
- curl -L https://tolit.cog.sanger.ac.uk/test-data/resources/busco/blobtoolkit.GCA_922984935.2.2023-08-03.lineages.tar.gz | tar -C busco_database -xzf -
-
- name: Run pipeline with test data
# You can customise CI pipeline run tests as required
# For example: adding multiple test runs with different parameters
# Remember that you can parallelise this by using strategy.matrix
run: |
- nextflow run ${GITHUB_WORKSPACE} -profile test,docker --taxdump $PWD/ncbi_taxdump --busco $PWD/busco_database --outdir ./results
+ nextflow run ${GITHUB_WORKSPACE} -profile test,docker --outdir ./results
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -3,7 +3,7 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

- ## [[0.7.0](https://github.com/sanger-tol/blobtoolkit/releases/tag/0.7.0)] – Psyduck – [2024-10-02]
+ ## [[0.7.0](https://github.com/sanger-tol/blobtoolkit/releases/tag/0.7.0)] – Psyduck – [2024-11-20]

The pipeline is now considered to be a complete and suitable replacement for the Snakemake version.

@@ -13,6 +13,7 @@ The pipeline is now considered to be a complete and suitable replacement for the
to indicate in the samplesheet whether the reads are paired or single.
- Updated the Blastn settings to allow 7 days runtime at most, since that
covers 99.7% of the jobs.
+ - Allow database inputs to be optionally compressed (`.tar.gz`)

### Software dependencies

Binary file removed assets/test/mMelMel3.1.buscogenes.dmnd
Binary file not shown.
Binary file removed assets/test/mMelMel3.1.buscoregions.dmnd
Binary file not shown.
Binary file removed assets/test/nt_mMelMel3.1/nt_mMelMel3.1.ndb
Binary file not shown.
Binary file removed assets/test/nt_mMelMel3.1/nt_mMelMel3.1.nhr
Binary file not shown.
Binary file removed assets/test/nt_mMelMel3.1/nt_mMelMel3.1.nin
Binary file not shown.
Binary file removed assets/test/nt_mMelMel3.1/nt_mMelMel3.1.nog
Binary file not shown.
Binary file removed assets/test/nt_mMelMel3.1/nt_mMelMel3.1.nos
Binary file not shown.
Binary file removed assets/test/nt_mMelMel3.1/nt_mMelMel3.1.not
Binary file not shown.
Binary file removed assets/test/nt_mMelMel3.1/nt_mMelMel3.1.nsq
Binary file not shown.
Binary file removed assets/test/nt_mMelMel3.1/nt_mMelMel3.1.ntf
Binary file not shown.
Binary file removed assets/test/nt_mMelMel3.1/nt_mMelMel3.1.nto
Binary file not shown.
Binary file removed assets/test/nt_mMelMel3.1/taxonomy4blast.sqlite3
Binary file not shown.
Binary file removed assets/test_full/gfLaeSulp1.1.buscogenes.dmnd
Binary file not shown.
Binary file removed assets/test_full/gfLaeSulp1.1.buscoregions.dmnd
Binary file not shown.
Binary file not shown.
Binary file removed assets/test_full/nt_gfLaeSulp1.1/nt_gfLaeSulp1.1.nhr
Binary file not shown.
Binary file removed assets/test_full/nt_gfLaeSulp1.1/nt_gfLaeSulp1.1.nin
Binary file not shown.
Binary file removed assets/test_full/nt_gfLaeSulp1.1/nt_gfLaeSulp1.1.nog
Binary file not shown.
Binary file removed assets/test_full/nt_gfLaeSulp1.1/nt_gfLaeSulp1.1.nos
Binary file not shown.
Binary file removed assets/test_full/nt_gfLaeSulp1.1/nt_gfLaeSulp1.1.not
Binary file not shown.
Binary file removed assets/test_full/nt_gfLaeSulp1.1/nt_gfLaeSulp1.1.nsq
Binary file not shown.
Binary file not shown.
Binary file removed assets/test_full/nt_gfLaeSulp1.1/nt_gfLaeSulp1.1.nto
Binary file not shown.
Binary file not shown.
10 changes: 5 additions & 5 deletions conf/test.config
@@ -30,11 +30,11 @@ params {
taxon = "Meles meles"

// Databases
- taxdump = "/lustre/scratch123/tol/resources/taxonomy/latest/new_taxdump"
- busco = "/lustre/scratch123/tol/resources/nextflow/busco/blobtoolkit.GCA_922984935.2.2023-08-03"
- blastp = "${projectDir}/assets/test/mMelMel3.1.buscogenes.dmnd"
- blastx = "${projectDir}/assets/test/mMelMel3.1.buscoregions.dmnd"
- blastn = "${projectDir}/assets/test/nt_mMelMel3.1"
+ taxdump = "https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz"
+ busco = "https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/resources/blobtoolkit.GCA_922984935.2.2023-08-03.tar.gz"
+ blastp = "https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/resources/mMelMel3.1.buscogenes.dmnd.tar.gz"
+ blastx = "https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/resources/mMelMel3.1.buscoregions.dmnd.tar.gz"
+ blastn = "https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/resources/nt_mMelMel3.1.tar.gz"

// Need to be set to avoid overfilling /tmp
use_work_dir_as_temp = true
8 changes: 4 additions & 4 deletions conf/test_full.config
@@ -25,11 +25,11 @@ params {
taxon = "Laetiporus sulphureus"

// Databases
- taxdump = "/lustre/scratch123/tol/resources/taxonomy/latest/new_taxdump"
+ taxdump = "https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz"
  busco = "/lustre/scratch123/tol/resources/busco/latest"
- blastp = "${projectDir}/assets/test_full/gfLaeSulp1.1.buscogenes.dmnd"
- blastx = "${projectDir}/assets/test_full/gfLaeSulp1.1.buscoregions.dmnd"
- blastn = "${projectDir}/assets/test_full/nt_gfLaeSulp1.1"
+ blastp = "https://tolit.cog.sanger.ac.uk/test-data/Laetiporus_sulphureus/resources/gfLaeSulp1.1.buscogenes.dmnd.tar.gz"
+ blastx = "https://tolit.cog.sanger.ac.uk/test-data/Laetiporus_sulphureus/resources/gfLaeSulp1.1.buscoregions.dmnd.tar.gz"
+ blastn = "https://tolit.cog.sanger.ac.uk/test-data/Laetiporus_sulphureus/resources/nt_gfLaeSulp1.1.tar.gz"

// Need to be set to avoid overfilling /tmp
use_work_dir_as_temp = true
10 changes: 5 additions & 5 deletions conf/test_raw.config
@@ -31,11 +31,11 @@ params {
taxon = "Meles meles"

// Databases
- taxdump = "/lustre/scratch123/tol/resources/taxonomy/latest/new_taxdump"
- busco = "/lustre/scratch123/tol/resources/nextflow/busco/blobtoolkit.GCA_922984935.2.2023-08-03"
- blastp = "${projectDir}/assets/test/mMelMel3.1.buscogenes.dmnd"
- blastx = "${projectDir}/assets/test/mMelMel3.1.buscoregions.dmnd"
- blastn = "${projectDir}/assets/test/nt_mMelMel3.1/"
+ taxdump = "https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz"
+ busco = "https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/resources/blobtoolkit.GCA_922984935.2.2023-08-03.tar.gz"
+ blastp = "https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/resources/mMelMel3.1.buscogenes.dmnd.tar.gz"
+ blastx = "https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/resources/mMelMel3.1.buscoregions.dmnd.tar.gz"
+ blastn = "https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/resources/nt_mMelMel3.1.tar.gz"

// Need to be set to avoid overfilling /tmp
use_work_dir_as_temp = true
28 changes: 27 additions & 1 deletion docs/usage.md
@@ -78,15 +78,20 @@ The BlobToolKit pipeline can be run in many different ways. The default way requ

It is a good idea to put a date suffix for each database location so you know at a glance whether you are using the latest version. We are using the `YYYY_MM` format as we do not expect the databases to be updated more frequently than once a month. However, feel free to use `DATE=YYYY_MM_DD` or a different format if you prefer.

Note that all input databases may be optionally passed directly to the pipeline compressed as `.tar.gz`, and the pipeline will handle decompression.
The instructions below show how to build each input database in _two_ forms: decompressed _and_ compressed. You may not need to do both. Select the one that is most appropriate for how you want to use the pipeline.
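
To illustrate what "the pipeline will handle decompression" means in practice, here is a minimal shell sketch of the same idea: accept either a directory or a `.tar.gz` archive and return a usable directory. This is not the pipeline's implementation (this PR adds the nf-core `untar` module for that purpose); `resolve_db` and the demo paths are hypothetical names used only for illustration.

```bash
set -eu

# Hypothetical helper (not part of the pipeline): print a usable database
# directory for either a plain directory or a .tar.gz archive.
resolve_db() {
    local input="$1" workdir="$2"
    case "$input" in
        *.tar.gz)
            # Compressed input: unpack into workdir and report that location
            mkdir -p "$workdir"
            tar -xzf "$input" -C "$workdir"
            printf '%s\n' "$workdir"
            ;;
        *)
            # Already decompressed: use the path as-is
            printf '%s\n' "$input"
            ;;
    esac
}

# Demo with a throwaway stand-in archive (no real database is downloaded)
demo=$(mktemp -d)
mkdir -p "$demo/taxdump_demo"
echo "1|root" > "$demo/taxdump_demo/names.dmp"
tar -czf "$demo/taxdump_demo.tar.gz" -C "$demo/taxdump_demo" .

resolve_db "$demo/taxdump_demo" "$demo/unpacked"          # directory passes through unchanged
resolve_db "$demo/taxdump_demo.tar.gz" "$demo/unpacked"   # archive is unpacked first
```

Either form ends up as a directory path, which is why the choice between the decompressed and compressed builds below is mostly a matter of storage and transfer convenience.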

#### 1. NCBI taxdump database

Create the database directory, retrieve and decompress the NCBI taxonomy:

```bash
DATE=2024_10
TAXDUMP=/path/to/databases/taxdump_${DATE}
TAXDUMP_TAR=/path/to/databases/taxdump_${DATE}.tar.gz
mkdir -p "$TAXDUMP"
- curl -L ftp://ftp.ncbi.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz | tar -xzf - -C "$TAXDUMP"
+ curl -L ftp://ftp.ncbi.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz -o $TAXDUMP_TAR
+ tar -xzf $TAXDUMP_TAR -C "$TAXDUMP"
```
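
Before passing the compressed taxdump to the pipeline, it can be worth confirming that the download is a readable archive. The following is a minimal sketch, assuming the archive should contain at least `names.dmp` and `nodes.dmp`; `check_taxdump_archive` is a hypothetical helper, and the demo builds tiny stand-in archives rather than downloading anything.

```bash
set -eu

# Hypothetical check: succeed only if the archive lists every expected file.
check_taxdump_archive() {
    local archive="$1" listing member
    listing=$(tar -tzf "$archive")
    for member in names.dmp nodes.dmp; do
        # Accept both "names.dmp" and "./names.dmp" style entries
        if ! printf '%s\n' "$listing" | grep -Eq "^(\./)?${member}\$"; then
            echo "missing ${member} in ${archive}" >&2
            return 1
        fi
    done
}

# Demo: one complete and one incomplete stand-in archive
demo=$(mktemp -d)
touch "$demo/names.dmp" "$demo/nodes.dmp"
tar -czf "$demo/good.tar.gz" -C "$demo" names.dmp nodes.dmp
tar -czf "$demo/bad.tar.gz" -C "$demo" names.dmp

check_taxdump_archive "$demo/good.tar.gz" && echo "good archive: ok"
check_taxdump_archive "$demo/bad.tar.gz" 2>/dev/null || echo "bad archive: rejected"
```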

#### 2. NCBI nucleotide BLAST database
@@ -96,6 +101,7 @@ Create the database directory and move into the directory:
```bash
DATE=2024_10
NT=/path/to/databases/nt_${DATE}
NT_TAR=/path/to/databases/nt_${DATE}.tar.gz
mkdir -p $NT
cd $NT
```
@@ -113,6 +119,11 @@ done
wget "https://ftp.ncbi.nlm.nih.gov/blast/db/v5/taxdb.tar.gz" &&
tar xf taxdb.tar.gz -C $NT &&
rm taxdb.tar.gz

# Compress and cleanup
cd ..
tar -cvzf $NT_TAR $NT
rm -r $NT
```

#### 3. UniProt reference proteomes database
@@ -126,6 +137,7 @@ Create the database directory and move into the directory:
```bash
DATE=2024_10
UNIPROT=/path/to/databases/uniprot_${DATE}
UNIPROT_TAR=/path/to/databases/uniprot_${DATE}.tar.gz
mkdir -p $UNIPROT
cd $UNIPROT
```
@@ -152,6 +164,12 @@ diamond makedb -p 16 --in reference_proteomes.fasta.gz --taxonmap reference_prot
# clean up
mv extract/{README,STATS} .
rm -r extract
rm -r $TAXDUMP

# Compress final database and cleanup
cd ..
tar -cvzf $UNIPROT_TAR $UNIPROT
rm -r $UNIPROT
```
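
One detail worth checking in the compress step: when GNU `tar` is given an absolute path such as `$UNIPROT`, it records the full directory path (minus the leading `/`) inside the archive. If you would rather have the archive contain only the top-level directory name, a sketch using `-C` follows. The temporary `DB_ROOT` and the placeholder file are assumptions made for the demo; whether the pipeline depends on the stored path layout is not stated here.

```bash
set -eu

# Throwaway stand-in for /path/to/databases (hypothetical layout)
DB_ROOT=$(mktemp -d)
DATE=2024_10
UNIPROT="$DB_ROOT/uniprot_${DATE}"
UNIPROT_TAR="$DB_ROOT/uniprot_${DATE}.tar.gz"
mkdir -p "$UNIPROT"
echo "placeholder" > "$UNIPROT/placeholder.dmnd"

# -C switches to the parent directory first, so the archive stores
# "uniprot_2024_10/..." rather than the full absolute path
tar -czf "$UNIPROT_TAR" -C "$(dirname "$UNIPROT")" "$(basename "$UNIPROT")"

# List the archive to confirm the stored paths before deleting the original
listing=$(tar -tzf "$UNIPROT_TAR")
printf '%s\n' "$listing"
```

Listing the archive before running `rm -r` on the source directory is a cheap safeguard against deleting the only good copy.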

#### 4. BUSCO databases
@@ -161,6 +179,7 @@ Create the database directory and move into the directory:
```bash
DATE=2024_10
BUSCO=/path/to/databases/busco_${DATE}
BUSCO_TAR=/path/to/databases/busco_${DATE}.tar.gz
mkdir -p $BUSCO
cd $BUSCO
```
@@ -181,6 +200,13 @@ If you have [GNU parallel](https://www.gnu.org/software/parallel/) installed, yo
find v5/data -name "*.tar.gz" | parallel "cd {//}; tar -xzf {/}"
```

Finally, re-compress and clean up the files:

```bash
tar -cvzf $BUSCO_TAR $BUSCO
rm -r $BUSCO
```

## Changes from Snakemake to Nextflow

### Commands
5 changes: 5 additions & 0 deletions modules.json
@@ -87,6 +87,11 @@
"installed_by": ["modules"],
"patch": "modules/nf-core/seqtk/subseq/seqtk-subseq.diff"
},
+ "untar": {
+ "branch": "master",
+ "git_sha": "666652151335353eef2fcd58880bcef5bc2928e1",
+ "installed_by": ["modules"]
+ },
"windowmasker/mkcounts": {
"branch": "master",
"git_sha": "32cac29d4a92220965dace68a1fb0bb2e3547cac",
18 changes: 8 additions & 10 deletions modules/local/generate_config.nf
@@ -10,13 +10,11 @@ process GENERATE_CONFIG {
val taxon_query
val busco_lin
path lineage_tax_ids
- tuple val(meta2), path(blastn)
  val reads
- // The following are passed as "val" because we just want to know the full paths. No staging necessary
- val blastp_path
- val blastx_path
- val blastn_path
- val taxdump_path
+ tuple val(meta2), path(blastp)
+ tuple val(meta3), path(blastx)
+ tuple val(meta4), path(blastn)
+ tuple val(meta5), path(taxdump)

output:
tuple val(meta), path("*.yaml") , emit: yaml
@@ -43,10 +41,10 @@ process GENERATE_CONFIG {
$accession_params \\
--nt $blastn \\
$input_reads \\
- --blastp ${blastp_path} \\
- --blastx ${blastx_path} \\
- --blastn ${blastn_path} \\
- --taxdump ${taxdump_path} \\
+ --blastp ${blastp} \\
+ --blastx ${blastx} \\
+ --blastn ${blastn} \\
+ --taxdump ${taxdump} \\
--output_prefix ${prefix}

cat <<-END_VERSIONS > versions.yml
7 changes: 7 additions & 0 deletions modules/nf-core/untar/environment.yml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

84 changes: 84 additions & 0 deletions modules/nf-core/untar/main.nf

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

49 changes: 49 additions & 0 deletions modules/nf-core/untar/meta.yml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
