Merge pull request #114 from sanger-tol/chrom_view
Support for grid view (and a few other things)
muffato authored Oct 2, 2024
2 parents e0bf8ba + 1909646 commit 102dbf4
Showing 35 changed files with 325 additions and 358 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/linting.yml
@@ -19,10 +19,10 @@ jobs:
- uses: actions/setup-node@v4

- name: Install editorconfig-checker
run: npm install -g editorconfig-checker
run: npm install -g editorconfig-checker@3.0.2

- name: Run ECLint check
run: editorconfig-checker -exclude README.md $(find .* -type f | grep -v '.git\|.py\|.md\|json\|yml\|yaml\|html\|css\|work\|.nextflow\|build\|nf_core.egg-info\|log.txt\|Makefile')
run: editorconfig-checker -exclude README.md $(find .* -type f | grep -v '.git\|.py\|.md\|json\|yml\|yaml\|html\|css\|work\|.nextflow\|build\|nf_core.egg-info\|log.txt\|Makefile\|.sqlite3')

Prettier:
runs-on: ubuntu-latest
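
To see what the extended `grep -v` filter in the ECLint step lets through, here is a rough Python approximation. It assumes GNU grep's basic-regex behaviour, where `\|` means alternation and an unescaped `.` matches any character; the file paths are hypothetical. One consequence worth noticing is that `.py` also excludes a path such as `copy.nf`.

```python
import re

# Alternatives copied from the grep -v expression in the workflow step.
# GNU grep's basic syntax writes alternation as "\|"; Python's re uses "|".
EXCLUDE = re.compile(
    r".git|.py|.md|json|yml|yaml|html|css|work|.nextflow|build"
    r"|nf_core.egg-info|log.txt|Makefile|.sqlite3"
)


def keep(paths):
    """Return the paths that the grep -v filter would pass on to the checker."""
    return [p for p in paths if not EXCLUDE.search(p)]


# Hypothetical paths, for illustration only.
candidates = ["./.editorconfig", "./db/taxonomy.sqlite3", "./copy.nf", "./main.nf"]
print(keep(candidates))
```

Escaping the dots (for example `\.py`) would make the filter stricter, but that is a design choice the commit does not make.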
5 changes: 5 additions & 0 deletions .nf-core.yml
@@ -12,13 +12,18 @@ lint:
- LICENCE
- lib/NfcoreTemplate.groovy
- CODE_OF_CONDUCT.md
- assets/sendmail_template.txt
- assets/email_template.html
- assets/email_template.txt
- assets/nf-core-blobtoolkit_logo_light.png
- docs/images/nf-core-blobtoolkit_logo_light.png
- docs/images/nf-core-blobtoolkit_logo_dark.png
- .github/ISSUE_TEMPLATE/bug_report.yml
- .github/PULL_REQUEST_TEMPLATE.md
- .github/workflows/linting.yml
- .github/workflows/branch.yml
- .github/CONTRIBUTING.md
- .github/workflows/linting_comment.yml
multiqc_config:
- report_comment
nextflow_config:
23 changes: 16 additions & 7 deletions CHANGELOG.md
@@ -3,6 +3,15 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [[0.7.0](https://github.com/sanger-tol/blobtoolkit/releases/tag/0.7.0)] – Psyduck – [2024-10-02]

The pipeline is now considered to be a complete and suitable replacement for the Snakemake version.

- Fetch information about the chromosomes of the assemblies. Used to power
"grid plots".
- Fill in accurate read information in the blobDir. Users are now required
to indicate in the samplesheet whether the reads are paired or single.

## [[0.6.0](https://github.com/sanger-tol/blobtoolkit/releases/tag/0.6.0)] – Bellsprout – [2024-09-13]

The pipeline has now been validated for draft (unpublished) assemblies.
@@ -87,13 +96,13 @@ The pipeline has now been validated on dozens of genomes, up to 11 Gbp.

Note, since the pipeline is using Nextflow DSL2, each process will be run with its own [Biocontainer](https://biocontainers.pro/#/registry). This means that on occasion it is entirely possible for the pipeline to be using different versions of the same tool. However, the overall software dependency changes compared to the last release have been listed below for reference. Only `Docker` or `Singularity` containers are supported, `conda` is not supported.

| Dependency | Old version | New version |
| ----------- | ------------- | ------------- |
| blobtoolkit | 4.3.3 | 4.3.9 |
| blast | 2.14.0 | 2.15.0 |
| multiqc | 1.17 and 1.18 | 1.20 and 1.21 |
| samtools | 1.18 | 1.19.2 |
| seqtk | 1.3 | 1.4 |
| Dependency | Old version | New version |
| ----------- | ------------- | ----------------- |
| blobtoolkit | 4.3.3 | 4.3.9 |
| blast | 2.14.0 | 2.15.0 and 2.14.1 |
| multiqc | 1.17 and 1.18 | 1.20 and 1.21 |
| samtools | 1.18 | 1.19.2 |
| seqtk | 1.3 | 1.4 |

> **NB:** Dependency has been **updated** if both old and new version information is present. </br> **NB:** Dependency has been **added** if just the new version information is present. </br> **NB:** Dependency has been **removed** if version information isn't present.
12 changes: 9 additions & 3 deletions CITATIONS.md
@@ -12,6 +12,10 @@
## Pipeline tools

- [BLAST+](https://blast.ncbi.nlm.nih.gov/doc/blast-help/downloadblastdata.html)

> Camacho, Christiam, et al. “BLAST+: architecture and applications.” BMC Bioinformatics, vol. 10, no. 421, Dec. 2009, https://doi.org/10.1186/1471-2105-10-421
- [BlobToolKit](https://github.com/blobtoolkit/blobtoolkit)

> Challis, Richard, et al. “BlobToolKit – Interactive Quality Assessment of Genome Assemblies.” G3 Genes|Genomes|Genetics, vol. 10, no. 4, Apr. 2020, pp. 1361–74, https://doi.org/10.1534/g3.119.400908.
@@ -26,9 +30,7 @@
- [Fasta_windows](https://github.com/tolkit/fasta_windows)

- [GoaT](https://goat.genomehubs.org)

> Challis, Richard, et al. “Genomes on a Tree (GoaT): A versatile, scalable search engine for genomic and sequencing project metadata across the eukaryotic tree of life.” Wellcome Open Research, vol. 8, no. 24, 2023, https://doi.org/10.12688/wellcomeopenres.18658.1.
> Brown, Max, et al. "Fasta_windows v0.2.3". GitHub, 2021. https://github.com/tolkit/fasta_windows
- [Minimap2](https://github.com/lh3/minimap2)

@@ -42,6 +44,10 @@

> Danecek, Petr, et al. “Twelve Years of SAMtools and BCFtools.” GigaScience, vol. 10, no. 2, Jan. 2021, https://doi.org/10.1093/gigascience/giab008.
- [SeqTK](https://github.com/lh3/seqtk)

> Li, Heng. "SeqTK v1.4" GitHub, 2023, https://github.com/lh3/seqtk
## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)
23 changes: 12 additions & 11 deletions README.md
@@ -20,8 +20,8 @@ It takes a samplesheet of BAM/CRAM/FASTQ/FASTA files as input, calculates genome
4. Run BUSCO ([`busco`](https://busco.ezlab.org/))
5. Extract BUSCO genes ([`blobtoolkit/extractbuscos`](https://github.com/blobtoolkit/blobtoolkit))
6. Run Diamond BLASTp against extracted BUSCO genes ([`diamond/blastp`](https://github.com/bbuchfink/diamond))
7. Run BLASTx against sequences with no hit ([`blast/blastn`](https://www.ncbi.nlm.nih.gov/books/NBK131777/))
8. Run BLASTn against sequences still with not hit ([`blast/blastx`](https://www.ncbi.nlm.nih.gov/books/NBK131777/))
7. Run BLASTx against sequences with no hit ([`diamond/blastx`](https://github.com/bbuchfink/diamond))
8. Run BLASTn against sequences still with no hit ([`blast/blastn`](https://www.ncbi.nlm.nih.gov/books/NBK131777/))
9. Count BUSCO genes ([`blobtoolkit/countbuscos`](https://github.com/blobtoolkit/blobtoolkit))
10. Generate combined sequence stats across various window sizes ([`blobtoolkit/windowstats`](https://github.com/blobtoolkit/blobtoolkit))
11. Imports analysis results into a BlobDir dataset ([`blobtoolkit/blobdir`](https://github.com/blobtoolkit/blobtoolkit))
@@ -37,13 +37,17 @@ First, prepare a samplesheet with your input data that looks as follows:
`samplesheet.csv`:

```csv
sample,datatype,datafile
mMelMel3,hic,GCA_922984935.2.hic.mMelMel3.cram
mMelMel1,illumina,GCA_922984935.2.illumina.mMelMel1.cram
mMelMel3,ont,GCA_922984935.2.ont.mMelMel3.cram
sample,datatype,datafile,library_layout
mMelMel3,hic,GCA_922984935.2.hic.mMelMel3.cram,PAIRED
mMelMel1,illumina,GCA_922984935.2.illumina.mMelMel1.cram,PAIRED
mMelMel3,ont,GCA_922984935.2.ont.mMelMel3.cram,SINGLE
```

Each row represents an aligned file. Rows with the same sample identifier are considered technical replicates. The datatype refers to the sequencing technology used to generate the underlying raw data and follows a controlled vocabulary (`ont`, `hic`, `pacbio`, `pacbio_clr`, `illumina`). The aligned read files can be generated using the [sanger-tol/readmapping](https://github.com/sanger-tol/readmapping) pipeline.
Each row represents an aligned file.
Rows with the same sample identifier are considered technical replicates.
The datatype refers to the sequencing technology used to generate the underlying raw data and follows a controlled vocabulary (`ont`, `hic`, `pacbio`, `pacbio_clr`, `illumina`).
The library layout indicates whether the reads are paired or single.
The aligned read files can be generated using the [sanger-tol/readmapping](https://github.com/sanger-tol/readmapping) pipeline.
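
As a quick illustration of the four-column layout described above, the following sketch parses the example samplesheet with Python's standard `csv` module and groups rows by sample, which is how technical replicates are identified. The helper name `group_by_sample` is made up for this example and is not part of the pipeline.

```python
import csv
import io
from collections import defaultdict

# The example samplesheet from this README, including the new column.
SAMPLESHEET = """\
sample,datatype,datafile,library_layout
mMelMel3,hic,GCA_922984935.2.hic.mMelMel3.cram,PAIRED
mMelMel1,illumina,GCA_922984935.2.illumina.mMelMel1.cram,PAIRED
mMelMel3,ont,GCA_922984935.2.ont.mMelMel3.cram,SINGLE
"""


def group_by_sample(handle):
    """Map each sample identifier to its (datatype, library_layout) pairs."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle):
        groups[row["sample"]].append((row["datatype"], row["library_layout"]))
    return dict(groups)


replicates = group_by_sample(io.StringIO(SAMPLESHEET))
for sample, entries in replicates.items():
    print(sample, entries)
```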

Now, you can run the pipeline using:

@@ -77,9 +81,8 @@ sanger-tol/blobtoolkit was written in Nextflow by [Alexander Ramos Diaz](https:/

We thank the following people for their assistance in the development of this pipeline:

<!-- If applicable, make list of people who have also contributed -->

- [Guoying Qi](https://github.com/gq1)
- [Bethan Yates](https://github.com/BethYates)

## Contributions and Support

@@ -89,8 +92,6 @@ If you would like to contribute to this pipeline, please see the [contributing g

If you use sanger-tol/blobtoolkit for your analysis, please cite it using the following doi: [10.5281/zenodo.7949058](https://doi.org/10.5281/zenodo.7949058)

<!-- Add bibliography of tools and data used in your pipeline -->

An extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file.

This pipeline uses code and infrastructure developed and maintained by the [nf-core](https://nf-co.re) community, reused here under the [MIT license](https://github.com/nf-core/tools/blob/master/LICENSE).
5 changes: 5 additions & 0 deletions assets/schema_input.json
@@ -23,6 +23,11 @@
"type": "string",
"pattern": "^\\S+\\.(bam|cram|fa|fa.gz|fasta|fasta.gz|fq|fq.gz|fastq|fastq.gz)$",
"errorMessage": "Data file for reads cannot contain spaces and must be BAM/CRAM/FASTQ/FASTA"
},
"library_layout": {
"type": "string",
"pattern": "^(SINGLE|PAIRED)$",
"errorMessage": "The only valid layouts are SINGLE and PAIRED"
}
},
"required": ["datafile", "datatype", "sample"]
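
The new `library_layout` rule in the schema can be tried out with a small Python check. This is an approximation only: JSON Schema patterns follow the ECMA-262 regex dialect rather than Python's, so `re.fullmatch` is used here to keep the anchors strict. The function name `layout_ok` is hypothetical.

```python
import re

# Pattern copied from the "library_layout" property in schema_input.json.
LAYOUT_PATTERN = re.compile(r"^(SINGLE|PAIRED)$")


def layout_ok(value):
    """Return True if the value would satisfy the schema pattern."""
    return LAYOUT_PATTERN.fullmatch(value) is not None


for candidate in ("PAIRED", "SINGLE", "paired", "SINGLE "):
    print(repr(candidate), layout_ok(candidate))
```

The match is case-sensitive and does not tolerate surrounding whitespace, so values such as `paired` are rejected.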
10 changes: 5 additions & 5 deletions assets/test/samplesheet.csv
@@ -1,5 +1,5 @@
sample,datatype,datafile
mMelMel3,hic,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/hic/GCA_922984935.2.subset.unmasked.hic.mMelMel3.cram
mMelMel1,illumina,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/illumina/GCA_922984935.2.subset.unmasked.illumina.mMelMel1.cram
mMelMel2,illumina,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/illumina/GCA_922984935.2.subset.unmasked.illumina.mMelMel2.cram
mMelMel3,ont,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/ont/GCA_922984935.2.subset.unmasked.ont.mMelMel3.cram
sample,datatype,datafile,library_layout
mMelMel3,hic,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/hic/GCA_922984935.2.subset.unmasked.hic.mMelMel3.cram,PAIRED
mMelMel1,illumina,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/illumina/GCA_922984935.2.subset.unmasked.illumina.mMelMel1.cram,PAIRED
mMelMel2,illumina,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/illumina/GCA_922984935.2.subset.unmasked.illumina.mMelMel2.cram,PAIRED
mMelMel3,ont,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/ont/GCA_922984935.2.subset.unmasked.ont.mMelMel3.cram,SINGLE
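
Existing three-column samplesheets need the same change as the test files above. The sketch below appends a `library_layout` column using per-datatype defaults. Those defaults are an assumption inferred from the test samplesheets in this commit (`pacbio_clr` does not appear in them and is a guess), and `add_library_layout` is a hypothetical helper, so check the result against how your libraries were actually sequenced.

```python
import csv
import io

# Assumed defaults, inferred from the test samplesheets in this commit.
# pacbio_clr is not shown there; SINGLE is a guess.
DEFAULT_LAYOUT = {
    "hic": "PAIRED",
    "illumina": "PAIRED",
    "ont": "SINGLE",
    "pacbio": "SINGLE",
    "pacbio_clr": "SINGLE",
}


def add_library_layout(old_csv_text):
    """Return the samplesheet text with a library_layout column appended."""
    reader = csv.DictReader(io.StringIO(old_csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=["sample", "datatype", "datafile", "library_layout"],
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        # Raises KeyError for a datatype outside the assumed vocabulary.
        row["library_layout"] = DEFAULT_LAYOUT[row["datatype"]]
        writer.writerow(row)
    return out.getvalue()


old = "sample,datatype,datafile\nmMelMel3,ont,reads.cram\n"
print(add_library_layout(old))
```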
8 changes: 4 additions & 4 deletions assets/test/samplesheet_raw.csv
@@ -1,4 +1,4 @@
sample,datatype,datafile
mMelMel1,illumina,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/genomic_data/mMelMel1/illumina/31231_3#1_subset.cram
mMelMel2,illumina,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/genomic_data/mMelMel2/illumina/31231_4#1_subset.cram
mMelMel3,hic,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/genomic_data/mMelMel3/hic-arima2/35528_2#1_subset.cram
sample,datatype,datafile,library_layout
mMelMel1,illumina,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/genomic_data/mMelMel1/illumina/31231_3#1_subset.cram,PAIRED
mMelMel2,illumina,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/genomic_data/mMelMel2/illumina/31231_4#1_subset.cram,PAIRED
mMelMel3,hic,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/genomic_data/mMelMel3/hic-arima2/35528_2#1_subset.cram,PAIRED
10 changes: 5 additions & 5 deletions assets/test/samplesheet_s3.csv
@@ -1,5 +1,5 @@
sample,datatype,datafile
mMelMel3,hic,https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/hic/GCA_922984935.2.subset.unmasked.hic.mMelMel3.cram
mMelMel1,illumina,https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/illumina/GCA_922984935.2.subset.unmasked.illumina.mMelMel1.cram
mMelMel2,illumina,https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/illumina/GCA_922984935.2.subset.unmasked.illumina.mMelMel2.cram
mMelMel3,ont,https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/ont/GCA_922984935.2.subset.unmasked.ont.mMelMel3.cram
sample,datatype,datafile,library_layout
mMelMel3,hic,https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/hic/GCA_922984935.2.subset.unmasked.hic.mMelMel3.cram,PAIRED
mMelMel1,illumina,https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/illumina/GCA_922984935.2.subset.unmasked.illumina.mMelMel1.cram,PAIRED
mMelMel2,illumina,https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/illumina/GCA_922984935.2.subset.unmasked.illumina.mMelMel2.cram,PAIRED
mMelMel3,ont,https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/ont/GCA_922984935.2.subset.unmasked.ont.mMelMel3.cram,SINGLE
6 changes: 3 additions & 3 deletions assets/test_full/full_samplesheet.csv
@@ -1,3 +1,3 @@
sample,datatype,datafile
gfLaeSulp1,hic,/lustre/scratch123/tol/resources/nextflow/test-data/Laetiporus_sulphureus/analysis/gfLaeSulp1.1/read_mapping/hic/GCA_927399515.1.unmasked.hic.gfLaeSulp1.cram
gfLaeSulp1,pacbio,/lustre/scratch123/tol/resources/nextflow/test-data/Laetiporus_sulphureus/analysis/gfLaeSulp1.1/read_mapping/pacbio/GCA_927399515.1.unmasked.pacbio.gfLaeSulp1.cram
sample,datatype,datafile,library_layout
gfLaeSulp1,hic,/lustre/scratch123/tol/resources/nextflow/test-data/Laetiporus_sulphureus/analysis/gfLaeSulp1.1/read_mapping/hic/GCA_927399515.1.unmasked.hic.gfLaeSulp1.cram,PAIRED
gfLaeSulp1,pacbio,/lustre/scratch123/tol/resources/nextflow/test-data/Laetiporus_sulphureus/analysis/gfLaeSulp1.1/read_mapping/pacbio/GCA_927399515.1.unmasked.pacbio.gfLaeSulp1.cram,SINGLE
24 changes: 21 additions & 3 deletions bin/check_samplesheet.py
@@ -45,11 +45,17 @@ class RowChecker:
"ont",
)

VALID_LAYOUTS = (
"SINGLE",
"PAIRED",
)

def __init__(
self,
sample_col="sample",
type_col="datatype",
file_col="datafile",
layout_col="library_layout",
**kwargs,
):
"""
@@ -62,11 +68,14 @@ def __init__(
the read data (default "datatype").
file_col (str): The name of the column that contains the file path for
the read data (default "datafile").
layout_col (str): The name of the column that contains the layout of the
library (i.e. "PAIRED" or "SINGLE").
"""
super().__init__(**kwargs)
self._sample_col = sample_col
self._type_col = type_col
self._file_col = file_col
self._layout_col = layout_col
self._seen = set()
self.modified = []

@@ -82,6 +91,7 @@ def validate_and_transform(self, row):
self._validate_sample(row)
self._validate_type(row)
self._validate_file(row)
self._validate_layout(row)
self._seen.add((row[self._sample_col], row[self._file_col]))
self.modified.append(row)

@@ -94,7 +104,7 @@ def _validate_sample(self, row):

def _validate_type(self, row):
"""Assert that the data type matches expected values."""
if not any(row[self._type_col] for datatype in self.VALID_DATATYPES):
if row[self._type_col] not in self.VALID_DATATYPES:
raise AssertionError(
f"The datatype is unrecognized: {row[self._type_col]}\n"
f"It should be one of: {', '.join(self.VALID_DATATYPES)}"
@@ -114,6 +124,14 @@ def _validate_data_format(self, filename):
f"It should be one of: {', '.join(self.VALID_FORMATS)}"
)

def _validate_layout(self, row):
"""Assert that the library layout matches expected values."""
if row[self._layout_col] not in self.VALID_LAYOUTS:
raise AssertionError(
f"The library layout is unrecognized: {row[self._layout_col]}\n"
f"It should be one of: {', '.join(self.VALID_LAYOUTS)}"
)
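
The one-line change to `_validate_type` earlier in this file is easy to overlook, and `_validate_layout` relies on the same membership test. The sketch below contrasts the old and new conditions in isolation. The datatype tuple is taken from the vocabulary listed in the README, and the two function names are invented for this comparison.

```python
# Vocabulary as listed in the README; assumed to match VALID_DATATYPES.
VALID_DATATYPES = ("hic", "pacbio", "pacbio_clr", "illumina", "ont")


def old_check_rejects(value):
    # Old condition: the generator yields `value` itself once per datatype,
    # so any(...) only tests whether `value` is truthy, never what it equals.
    return not any(value for datatype in VALID_DATATYPES)


def new_check_rejects(value):
    # New condition: a plain membership test against the allowed values.
    return value not in VALID_DATATYPES


print(old_check_rejects("nanopore"))  # False: the typo slips through
print(new_check_rejects("nanopore"))  # True: the typo is caught
print(old_check_rejects(""))          # True: only empty values were rejected
```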

def validate_unique_samples(self):
"""
Assert that the combination of sample name and aligned filename is unique.
@@ -178,7 +196,7 @@ def check_samplesheet(file_in, file_out):
This function checks that the samplesheet follows the following structure,
see also the `blobtoolkit samplesheet`_::
sample,datatype,datafile
sample,datatype,datafile,library_layout
sample1,hic,/path/to/file1.cram,PAIRED
sample1,pacbio,/path/to/file2.cram,SINGLE
sample1,ont,/path/to/file3.cram,SINGLE
@@ -187,7 +205,7 @@ def check_samplesheet(file_in, file_out):
https://raw.githubusercontent.com/sanger-tol/blobtoolkit/main/assets/test/samplesheet.csv
"""
required_columns = {"sample", "datatype", "datafile"}
required_columns = {"sample", "datatype", "datafile", "library_layout"}
# See https://docs.python.org/3.9/library/csv.html#id3 to read up on `newline=""`.
with file_in.open(newline="") as in_handle:
reader = csv.DictReader(in_handle, dialect=sniff_format(in_handle))
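
Because `required_columns` now includes `library_layout`, a samplesheet written for an earlier release lacks a required header. How the script reports this is not visible in the diff, so the following is only a sketch of one way to detect the gap with `csv.DictReader`; `missing_columns` is a hypothetical helper.

```python
import csv
import io

REQUIRED_COLUMNS = {"sample", "datatype", "datafile", "library_layout"}


def missing_columns(csv_text):
    """Return the required column names absent from the header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # fieldnames is None for an empty file, hence the fallback to [].
    return REQUIRED_COLUMNS - set(reader.fieldnames or [])


old_style = "sample,datatype,datafile\nsample1,hic,/path/to/file1.cram\n"
print(missing_columns(old_style))
```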