
v0.3 - Pipeline now validated on five real genomes #88

Merged 42 commits on Feb 9, 2024

Commits
- `275c048` Fixed the conditional flowing (muffato, Jan 8, 2024)
- `af953d8` We shouldn't be using this old taxonomy database (muffato, Jan 9, 2024)
- `26ad7a6` We also need the taxdb files for the NT database (muffato, Jan 9, 2024)
- `79d9bb5` File numbers now have three digits (muffato, Jan 9, 2024)
- `903ebc0` bugfix: removed extra brackets that were only needed in Snakemake rules (muffato, Jan 10, 2024)
- `04ea5e3` bugfix: the headers are separated with regular whitespace, not tabs (muffato, Jan 10, 2024)
- `023a041` Updated SAMTOOLS_FASTA (muffato, Jan 10, 2024)
- `1d7fe41` We want *all* the reads in the output file (muffato, Jan 10, 2024)
- `7729e20` Correctly pick up the name of the blast database (muffato, Jan 10, 2024)
- `e0f2486` Started the CHANGELOG (muffato, Jan 10, 2024)
- `7046a75` Allow FastQ files as input (muffato, Jan 10, 2024)
- `e54bc85` Skip the conversion to fasta if the input is fastq (muffato, Jan 10, 2024)
- `60c2f1d` Like in the genome-note pipeline having a duplicated profile entry ca… (muffato, Jan 10, 2024)
- `363d796` Support samplesheets created by nf-core/fetchngs (muffato, Jan 11, 2024)
- `74b60dd` Support input read-sets being paired-end (muffato, Jan 11, 2024)
- `a18fe52` [lint] black (muffato, Jan 12, 2024)
- `7527e7a` Updated CAT_CAT module that preserves the file extension (muffato, Jan 12, 2024)
- `b99101b` Report all BUSCOs and in the right order (muffato, Jan 15, 2024)
- `9dc30da` Removed "--update-plot" as it is not used by default in the Snakemake… (muffato, Jan 16, 2024)
- `48ebdca` Bumped up the number of retries to better accommodate larger genomes (muffato, Jan 18, 2024)
- `9e19765` Bumped up the version of blobtoolkit because of a bug in windowsstat (muffato, Jan 18, 2024)
- `a815df0` Updated these two modules from upstream (muffato, Jan 18, 2024)
- `ac42a3a` Updated the CHANGELOG (muffato, Jan 19, 2024)
- `864d3bc` Documented the decision tree (muffato, Jan 19, 2024)
- `4c27f52` Renamed the variables to make the code easier to understand (muffato, Jan 19, 2024)
- `8dc4f0c` Not used (muffato, Jan 19, 2024)
- `6789b6f` Fixed the link (muffato, Jan 19, 2024)
- `5e7d53a` Let's make it happen today (muffato, Jan 19, 2024)
- `9a0c6a0` Updated the documentation (muffato, Jan 19, 2024)
- `1fbde37` Explained the purpose of this release (muffato, Jan 19, 2024)
- `b8d3ae0` [lint] prettier (muffato, Jan 19, 2024)
- `50d2a71` Revert "Let's make it happen today" (muffato, Jan 19, 2024)
- `5388d77` Pin the version of prettier to match the nf-core dependency we have (muffato, Jan 22, 2024)
- `1dbbe9e` Need to make an exception for this file (muffato, Jan 22, 2024)
- `1303a69` bugfix: if only 1 BUSCO is provided (only lineage with a match), the … (muffato, Jan 25, 2024)
- `8d2ebed` Allow chosing the output format (png vs svg) (muffato, Feb 7, 2024)
- `2a3eca7` typo (muffato, Feb 7, 2024)
- `c4c5324` Changed the output directory of the images (muffato, Feb 7, 2024)
- `1f4284b` Because only one format can be chosen, they have to be marked as opti… (muffato, Feb 9, 2024)
- `fcfdb10` Moved the blobdir a level below, as per our convention (muffato, Feb 7, 2024)
- `a105dd3` Lower case, please (muffato, Feb 9, 2024)
- `ff87d76` Pin the version of nf-core (muffato, Feb 9, 2024)
4 changes: 2 additions & 2 deletions .github/workflows/linting.yml
@@ -32,7 +32,7 @@ jobs:
- uses: actions/setup-node@v4

- name: Install Prettier
run: npm install -g prettier
run: npm install -g prettier@3.1.0

- name: Run Prettier --check
run: prettier --check ${GITHUB_WORKSPACE}
@@ -84,7 +84,7 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install nf-core
pip install nf-core==2.11
- name: Run nf-core lint
env:
1 change: 1 addition & 0 deletions .nf-core.yml
@@ -17,6 +17,7 @@ lint:
- docs/images/nf-core-blobtoolkit_logo_dark.png
- .github/ISSUE_TEMPLATE/bug_report.yml
- .github/PULL_REQUEST_TEMPLATE.md
- .github/workflows/linting.yml
multiqc_config:
- report_comment
nextflow_config:
30 changes: 30 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,36 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [[0.3.0](https://github.com/sanger-tol/blobtoolkit/releases/tag/0.3.0)] – Poliwag – [2024-01-XX]

The pipeline has now been validated on five genomes, all under 100 Mbp: a
sponge, a platyhelminth, and three fungi.

### Enhancements & fixes

- Fixed the conditional runs of blastn
- Fixed the generation of the no-hit list
- Fixed the conversion of the unaligned input files to Fasta
- Fixed the documentation about preparing the NT database
- Fixed the detection of the NT database in the nf-core module
- The pipeline now supports samplesheets generated by the
[nf-core/fetchngs](https://nf-co.re/fetchngs) pipeline by passing the
`--fetchngs_samplesheet true` option.
- FastQ files can bypass the conversion to Fasta
- Fixed missing BUSCO results from the blobdir (only 1 BUSCO was loaded)
- Fixed the default category used to colour the blob plots
- Fixed the output directory of the images
- Added an option to select the format of the images (PNG or SVG)

### Parameters

| Old parameter | New parameter |
| ------------- | ---------------------- |
| | --fetchngs_samplesheet |
| | --image_format |

> **NB:** Parameter has been **updated** if both old and new parameter information is present. </br> **NB:** Parameter has been **added** if just the new parameter information is present. </br> **NB:** Parameter has been **removed** if new parameter information isn't present.
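
A minimal, illustrative way to combine the two new parameters (a sketch only: the other options shown follow standard nf-core conventions, and any further required settings are described in the usage documentation):

```bash
# samplesheet.csv is assumed to be the output of an nf-core/fetchngs run
nextflow run sanger-tol/blobtoolkit \
    -profile docker \
    --input samplesheet.csv \
    --fetchngs_samplesheet true \
    --image_format svg \
    --outdir results
```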

## [[0.2.0](https://github.com/sanger-tol/blobtoolkit/releases/tag/0.2.0)] – Pikachu – [2023-12-22]

### Enhancements & fixes
22 changes: 2 additions & 20 deletions README.md
@@ -13,19 +13,6 @@

**sanger-tol/blobtoolkit** is a bioinformatics pipeline that can be used to identify and analyse non-target DNA in eukaryotic genomes. It takes a samplesheet and aligned CRAM files as input, calculates genome statistics, coverage and completeness information per window, and combines them into a TSV file to create a BlobDir dataset and static plots.

<!--
Complete this sentence with a 2-3 sentence summary of what types of data the pipeline ingests, a brief overview of the
major pipeline sections and the types of output it produces. You're giving an overview to someone new
to nf-core here, in 15-20 seconds. For an example, see https://github.com/nf-core/rnaseq/blob/master/README.md#introduction
-->

<!-- Include a figure that guides the user through the major workflow steps. Many nf-core
workflows use the "tube map" design for that. See https://nf-co.re/docs/contributing/design_guidelines#examples for examples. -->

<!-- # ![sanger-tol/blobtoolkit](https://raw.githubusercontent.com/sanger-tol/blobtoolkit/main/docs/images/sanger-tol-blobtoolkit_workflow.png) -->

<!-- Fill in short bullet-pointed list of the default steps in the pipeline -->

1. Calculate genome statistics in windows ([`fastawindows`](https://github.com/tolkit/fasta_windows))
2. Calculate Coverage ([`blobtk/depth`](https://github.com/blobtoolkit/blobtk))
3. Fetch associated BUSCO lineages ([`goat/taxonsearch`](https://github.com/genomehubs/goat-cli))
@@ -44,9 +31,6 @@
> [!NOTE]
> If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data.

<!-- Describe the minimum required steps to execute the pipeline, e.g. how to prepare samplesheets.
Explain what rows and columns represent. For instance (please edit as appropriate): -->

First, prepare a samplesheet with your input data that looks as follows:

`samplesheet.csv`:
@@ -58,12 +42,10 @@
mMelMel1,illumina,GCA_922984935.2.illumina.mMelMel1.cram
mMelMel3,ont,GCA_922984935.2.ont.mMelMel3.cram
```

Each row represents an aligned file. Rows with the same sample identifier are considered technical replicates. The datatype refers to the sequencing technology used to generate the underlying raw data and follows a controlled vocabulary (ont, hic, pacbio, pacbio_clr, illumina). The aligned read files can be generated using the [sanger-tol/readmapping](https://github.com/sanger-tol/readmapping) pipeline.
Each row represents an aligned file. Rows with the same sample identifier are considered technical replicates. The datatype refers to the sequencing technology used to generate the underlying raw data and follows a controlled vocabulary (`ont`, `hic`, `pacbio`, `pacbio_clr`, `illumina`). The aligned read files can be generated using the [sanger-tol/readmapping](https://github.com/sanger-tol/readmapping) pipeline.

Now, you can run the pipeline using:

<!-- update the following command to include all required parameters for a minimal example -->

```bash
nextflow run sanger-tol/blobtoolkit \
-profile <docker/singularity/.../institute> \
@@ -86,7 +68,7 @@
For more details, please refer to the [usage documentation](https://pipelines.to

## Pipeline output

<!-- To see the the results of a test run with a full size dataset refer to the [results](https://pipelines.tol.sanger.ac.uk/blobtoolkit/results) tab on the sanger-tol website pipeline page. --> For more details about the output files and reports, please refer to the [output documentation](https://pipelines.tol.sanger.ac.uk/blobtoolkit/output).
For more details about the output files and reports, please refer to the [output documentation](https://pipelines.tol.sanger.ac.uk/blobtoolkit/output).

## Credits

252 changes: 252 additions & 0 deletions bin/check_fetchngs_samplesheet.py
@@ -0,0 +1,252 @@
#!/usr/bin/env python


"""Provide a command line tool to validate and transform tabular samplesheets."""


import argparse
import csv
import logging
import sys
from collections import Counter
from pathlib import Path

logger = logging.getLogger()


class RowChecker:
"""
Define a service that can validate and transform each given row.

Attributes:
modified (list): A list of dicts, where each dict corresponds to a previously
validated and transformed row. The order of rows is maintained.

"""

VALID_FORMATS = (".fastq.gz",)

def __init__(
self,
accession_col="run_accession",
model_col="instrument_model",
platform_col="instrument_platform",
library_col="library_strategy",
file1_col="fastq_1",
file2_col="fastq_2",
**kwargs,
):
"""
Initialize the row checker with the expected column names.

Args:
accession_col (str): The name of the column that contains the accession name
(default "run_accession").
model_col (str): The name of the column that contains the model name
of the instrument (default "instrument_model").
platform_col (str): The name of the column that contains the platform name
of the instrument (default "instrument_platform").
library_col (str): The name of the column that contains the strategy of the
preparation of the library (default "library_strategy").
            file1_col (str): The name of the column that contains the first (or only)
                file path for the read data (default "fastq_1").
            file2_col (str): The name of the column that contains the second file path
                for the paired-end read data (default "fastq_2").
"""
super().__init__(**kwargs)
self._accession_col = accession_col
self._model_col = model_col
self._platform_col = platform_col
self._library_col = library_col
self._file1_col = file1_col
self._file2_col = file2_col
self._seen = set()
self.modified = []

def validate_and_transform(self, row):
"""
Perform all validations on the given row.

Args:
row (dict): A mapping from column headers (keys) to elements of that row
(values).

"""
self._validate_accession(row)
self._validate_file(row)
self._seen.add((row[self._accession_col], row[self._file1_col]))
self.modified.append(row)

def _validate_accession(self, row):
"""Assert that the run accession name exists."""
if len(row[self._accession_col]) <= 0:
raise AssertionError("Run accession is required.")

def _validate_file(self, row):
"""Assert that the datafile is non-empty and has the right format."""
if len(row[self._file1_col]) <= 0:
raise AssertionError("Data file is required.")
self._validate_data_format(row[self._file1_col])
if row[self._file2_col]:
self._validate_data_format(row[self._file2_col])

def _validate_data_format(self, filename):
"""Assert that a given filename has one of the expected FASTQ extensions."""
if not any(filename.endswith(extension) for extension in self.VALID_FORMATS):
raise AssertionError(
f"The data file has an unrecognized extension: {filename}\n"
f"It should be one of: {', '.join(self.VALID_FORMATS)}"
)

def validate_unique_accessions(self):
"""
Assert that the combination of accession name and aligned filename is unique.

        In addition to the validation, also rename all accessions to have a suffix of _T{n}, where n is the
        running count of rows sharing the same accession but with different FASTQ files, e.g., multiple runs per experiment.
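
        For example (illustrative accession), "SRR000001" appearing on three rows with
        different FASTQ files would be renamed to "SRR000001_T1", "SRR000001_T2" and
        "SRR000001_T3".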

"""
if len(self._seen) != len(self.modified):
raise AssertionError("The pair of accession and file name must be unique.")
seen = Counter()
for row in self.modified:
accession = row[self._accession_col]
seen[accession] += 1
row[self._accession_col] = f"{accession}_T{seen[accession]}"


def read_head(handle, num_lines=10):
"""Read the specified number of lines from the current position in the file."""
lines = []
for idx, line in enumerate(handle):
if idx == num_lines:
break
lines.append(line)
return "".join(lines)


def sniff_format(handle):
"""
Detect the tabular format.

Args:
handle (text file): A handle to a `text file`_ object. The read position is
expected to be at the beginning (index 0).

Returns:
csv.Dialect: The detected tabular format.

.. _text file:
https://docs.python.org/3/glossary.html#term-text-file

"""
peek = read_head(handle)
handle.seek(0)
sniffer = csv.Sniffer()
dialect = sniffer.sniff(peek)
return dialect


def check_samplesheet(file_in, file_out):
"""
Check that the tabular samplesheet has the structure expected by sanger-tol pipelines.

    Validate the general shape of the table, expected columns, and each row. Also rename
    each run accession with a ``_T{n}`` suffix so that repeated accessions stay unique.

    Args:
file_in (pathlib.Path): The given tabular samplesheet. The format can be either
CSV, TSV, or any other format automatically recognized by ``csv.Sniffer``.
file_out (pathlib.Path): Where the validated and transformed samplesheet should
be created; always in CSV format.

Example:
This function checks that the samplesheet follows the following structure,
see also the `blobtoolkit samplesheet`_::

sample,datatype,datafile
sample1,hic,/path/to/file1.cram
sample1,pacbio,/path/to/file2.cram
sample1,ont,/path/to/file3.cram

.. _blobtoolkit samplesheet:
https://raw.githubusercontent.com/sanger-tol/blobtoolkit/main/assets/test/samplesheet.csv

"""
required_columns = {
"run_accession",
"instrument_model",
"instrument_platform",
"library_strategy",
"fastq_1",
"fastq_2",
}
# See https://docs.python.org/3.9/library/csv.html#id3 to read up on `newline=""`.
with file_in.open(newline="") as in_handle:
reader = csv.DictReader(in_handle, dialect=sniff_format(in_handle))
# Validate the existence of the expected header columns.
if not required_columns.issubset(reader.fieldnames):
req_cols = ", ".join(required_columns)
logger.critical(f"The sample sheet **must** contain these column headers: {req_cols}.")
sys.exit(1)
# Validate each row.
checker = RowChecker()
for i, row in enumerate(reader):
try:
checker.validate_and_transform(row)
except AssertionError as error:
logger.critical(f"{str(error)} On line {i + 2}.")
sys.exit(1)
checker.validate_unique_accessions()
header = list(reader.fieldnames)
# See https://docs.python.org/3.9/library/csv.html#id3 to read up on `newline=""`.
with file_out.open(mode="w", newline="") as out_handle:
writer = csv.DictWriter(out_handle, header, delimiter=",")
writer.writeheader()
for row in checker.modified:
writer.writerow(row)


def parse_args(argv=None):
"""Define and immediately parse command line arguments."""
parser = argparse.ArgumentParser(
description="Validate and transform a tabular samplesheet.",
        epilog="Example: python check_fetchngs_samplesheet.py samplesheet.csv samplesheet.valid.csv",
)
parser.add_argument(
"file_in",
metavar="FILE_IN",
type=Path,
help="Tabular input samplesheet in CSV or TSV format.",
)
parser.add_argument(
"file_out",
metavar="FILE_OUT",
type=Path,
help="Transformed output samplesheet in CSV format.",
)
parser.add_argument(
"-l",
"--log-level",
help="The desired log level (default WARNING).",
choices=("CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG"),
default="WARNING",
)
parser.add_argument(
"-v",
"--version",
action="version",
version="%(prog)s 1.0.0",
)
return parser.parse_args(argv)


def main(argv=None):
"""Coordinate argument parsing and program execution."""
args = parse_args(argv)
logging.basicConfig(level=args.log_level, format="[%(levelname)s] %(message)s")
if not args.file_in.is_file():
logger.error(f"The given input file {args.file_in} was not found!")
sys.exit(2)
args.file_out.parent.mkdir(parents=True, exist_ok=True)
check_samplesheet(args.file_in, args.file_out)


if __name__ == "__main__":
sys.exit(main())
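
A hypothetical invocation of the new helper script, assuming it is run from the repository root on a samplesheet produced by nf-core/fetchngs (file names here are placeholders):

```bash
# Writes a validated CSV with _T{n}-suffixed run accessions; exits non-zero if a
# required column is missing or a data file does not end in .fastq.gz
python bin/check_fetchngs_samplesheet.py fetchngs_samplesheet.csv samplesheet.valid.csv --log-level INFO
```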
2 changes: 2 additions & 0 deletions bin/check_samplesheet.py
@@ -27,6 +27,8 @@ class RowChecker:
VALID_FORMATS = (
".cram",
".bam",
".fastq",
".fastq.gz",
)

VALID_DATATYPES = (
4 changes: 2 additions & 2 deletions bin/nohitlist.sh
@@ -8,8 +8,8 @@ E=$4

# find ids of sequences with no hits in the blastx search
grep '>' $fasta | \
grep -v -w -f <(awk -v evalue="$E" '{{if($14<{evalue}){{print $1}}}}' $blast | sort | uniq) | \
cut -f1 | sed 's/>//' > $prefix.nohit.txt
grep -v -w -f <(awk -v evalue="$E" '{if($14<evalue){print $1}}' $blast | sort | uniq) | \
awk '{print $1}' | sed 's/>//' > $prefix.nohit.txt
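
The corrected `awk` filter can be sanity-checked in isolation with made-up input, where column 14 stands in for the blast e-value and only hits below the threshold should contribute their query ID:

```bash
# Expected output: "seq1" (its e-value 1e-30 is below the 1e-25 threshold; 0.5 is not)
printf 'seq1 a a a a a a a a a a a a 1e-30\nseq2 a a a a a a a a a a a a 0.5\n' \
    | awk -v evalue="1e-25" '{if($14<evalue){print $1}}'
```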


