Merge pull request #145 from sanger-tol/public_dev
Release 2.0
BethYates authored Oct 10, 2024
2 parents 2208ff8 + 0ef1ea9 commit 73375e6
Showing 49 changed files with 3,402 additions and 56 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/linting.yml
@@ -19,7 +19,7 @@ jobs:
- uses: actions/setup-node@v3

- name: Install editorconfig-checker
-run: npm install -g editorconfig-checker
+run: npm install -g editorconfig-checker@3.0.2

- name: Run ECLint check
run: editorconfig-checker -exclude README.md $(find .* -type f | grep -v '.git\|.py\|.md\|cff\|json\|yml\|yaml\|html\|css\|work\|.nextflow\|build\|nf_core.egg-info\|log.txt\|Makefile')
@@ -32,7 +32,7 @@ jobs:
- uses: actions/setup-node@v3

- name: Install Prettier
-run: npm install -g prettier
+run: npm install -g prettier@3.1.0

- name: Run Prettier --check
run: prettier --check ${GITHUB_WORKSPACE}
1 change: 1 addition & 0 deletions .nf-core.yml
@@ -20,3 +20,4 @@ lint:
multiqc_config:
- report_comment
actions_ci: false
+template_strings: False
30 changes: 30 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,36 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+## [[2.0.0](https://github.com/sanger-tol/genomenote/releases/tag/2.0.0)] - English Cocker Spaniel [2024-10-10]
+
+### Enhancements & fixes
+
+- New genome_metadata subworkflow to fetch metadata linked to the genome assembly from various sources (COPO, GoaT, GBIF, ENA, NCBI). The options `--assembly`, `--biosample_wgs`, `--biosample_hic` and `--biosample_rna` specify what metadata to fetch and process.
+- Now outputs a partially completed genome note document based on a template file which contains placeholder parameters. These placeholders are replaced with data generated by the pipeline. The template file to use can be specified using the `--note_template` option.
+- Added the `--write_to_portal` option to write a set of key-value data parameters to a Genome Notes database.
+- Added the `--upload_higlass_data` option to automatically upload the Hi-C Map to a kubernetes hosted Hi-Glass server.
+- Bugfix: don't rely on fasta file name to correctly set assembly accession needed for use with `ncbi datasets`.
+- Bugfix: ensure meta.id is used consistently.
+
+### Parameters
+
+| Old parameter | New parameter |
+| ------------- | -------------------------- |
+| | --assembly |
+| | --biosample_wgs |
+| | --biosample_hic |
+| | --biosample_rna |
+| | --write_to_portal |
+| | --genome_notes_api |
+| | --note_template |
+| | --upload_higlass_data |
+| | --higlass_url |
+| | --higlass_deployment_name |
+| | --higlass_namespace |
+| | --higlass_kubeconfig |
+| | --higlass_upload_directory |
+| | --higlass_data_project_dir |
+
## [[1.2.2](https://github.com/sanger-tol/genomenote/releases/tag/1.2.2)] - Pyrenean Mountain Dog (patch 2) - [2024-09-10]

### Enhancements & fixes
12 changes: 10 additions & 2 deletions CITATION.cff
@@ -2,16 +2,24 @@
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
-title: sanger-tol/genomenote v1.2.2
+title: sanger-tol/genomenote v2.0.0
message: >-
If you use this software, please cite it using the
metadata from this file.
type: software
authors:
+- given-names: Sandra
+family-names: Babiyre
+affiliation: Wellcome Sanger Institute
+orcid: "https://orcid.org/0009-0004-7773-7008"
- given-names: Tyler
family-names: Chafin
affiliation: Wellcome Sanger Institute
orcid: "https://orcid.org/0000-0001-8687-5905"
+- given-names: Chau
+family-names: Duong
+affiliation: Wellcome Sanger Institute
+orcid: "https://orcid.org/0009-0001-0649-2291"
- given-names: Matthieu
family-names: Muffato
affiliation: Wellcome Sanger Institute
@@ -38,5 +46,5 @@ identifiers:
repository-code: "https://github.com/sanger-tol/genomenote"
license: MIT
commit: TODO
-version: 1.2.2
+version: 2.0.0
date-released: "2022-10-07"
23 changes: 14 additions & 9 deletions README.md
@@ -17,14 +17,15 @@

<!--![sanger-tol/genomenote workflow](https://raw.githubusercontent.com/sanger-tol/genomenote/main/docs/images/sanger-tol-genomenote_workflow.png)-->

-1. Summary statistics ([`NCBI datasets summary genome accession`](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/command-line/datasets/summary/genome/datasets_summary_genome_accession/))
-2. Convert alignment to BED ([`samtools view`](https://www.htslib.org/doc/samtools-view.html), [`bedtools bamtobed`](https://bedtools.readthedocs.io/en/latest/content/tools/bamtobed.html))
-3. Filter BED ([`GNU sort`](https://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html), [`filter bed`](https://raw.githubusercontent.com/sanger-tol/genomenote/main/bin/filter_bed.sh))
-4. Contact maps ([`Cooler cload`](https://cooler.readthedocs.io/en/latest/cli.html#cooler-cload-pairs), [`Cooler zoomify`](https://cooler.readthedocs.io/en/latest/cli.html#cooler-zoomify), [`Cooler dump`](https://cooler.readthedocs.io/en/latest/cli.html#cooler-dump))
-5. Genome completeness ([`NCBI API`](https://www.ncbi.nlm.nih.gov/datasets/docs/v1/reference-docs/rest-api/), [`BUSCO`](https://busco.ezlab.org))
-6. Consensus quality and k-mer completeness ([`FASTK`](https://github.com/thegenemyers/FASTK), [`MERQURY.FK`](https://github.com/thegenemyers/MERQURY.FK))
-7. Collated summary table ([`createtable`](bin/create_table.py))
-8. Present results and visualisations ([`MultiQC`](http://multiqc.info/), [`R`](https://www.r-project.org/))
+1. Fetches genome metadata from [ENA](https://www.ebi.ac.uk/ena/browser/api/#/ENA_Browser_Data_API), [NCBI](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/rest-api), and [GoaT](https://goat.genomehubs.org/api-docs/)
+2. Summary statistics ([`NCBI datasets summary genome accession`](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/command-line/datasets/summary/genome/datasets_summary_genome_accession/))
+3. Convert alignment to BED ([`samtools view`](https://www.htslib.org/doc/samtools-view.html), [`bedtools bamtobed`](https://bedtools.readthedocs.io/en/latest/content/tools/bamtobed.html))
+4. Filter BED ([`GNU sort`](https://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html), [`filter bed`](https://raw.githubusercontent.com/sanger-tol/genomenote/main/bin/filter_bed.sh))
+5. Contact maps ([`Cooler cload`](https://cooler.readthedocs.io/en/latest/cli.html#cooler-cload-pairs), [`Cooler zoomify`](https://cooler.readthedocs.io/en/latest/cli.html#cooler-zoomify), [`Cooler dump`](https://cooler.readthedocs.io/en/latest/cli.html#cooler-dump))
+6. Genome completeness ([`NCBI API`](https://www.ncbi.nlm.nih.gov/datasets/docs/v1/reference-docs/rest-api/), [`BUSCO`](https://busco.ezlab.org))
+7. Consensus quality and k-mer completeness ([`FASTK`](https://github.com/thegenemyers/FASTK), [`MERQURY.FK`](https://github.com/thegenemyers/MERQURY.FK))
+8. Collated summary table ([`createtable`](bin/create_table.py))
+9. Present results and visualisations ([`MultiQC`](http://multiqc.info/), [`R`](https://www.r-project.org/))

## Usage

@@ -52,6 +53,9 @@ nextflow run sanger-tol/genomenote \
-profile <docker/singularity/.../institute> \
--input samplesheet.csv \
--fasta genome.fasta \
+--assembly GCA_922984935.2 \
+--bioproject PRJEB49353 \
+--biosample SAMEA7524400 \
--outdir <OUTDIR>
```

@@ -69,8 +73,9 @@ sanger-tol/genomenote was originally written by [Priyanka Surana](https://github
We thank the following people for their assistance in the development of this pipeline:

- [Matthieu Muffato](https://github.com/muffato)
+- [Beth Yates](https://github.com/BethYates)
- [Shane McCarthy](https://github.com/mcshane) and [Yumi Sims](https://github.com/yumisims) for providing software and algorithm guidance.
-- [Cibin Sadasivan Baby](https://github.com/cibinsb) and [Beth Yates](https://github.com/BethYates) for providing reviews.
+- [Cibin Sadasivan Baby](https://github.com/cibinsb) for providing reviews.

## Contributions and Support

9 changes: 9 additions & 0 deletions assets/genome_metadata_template.csv
@@ -0,0 +1,9 @@
#File_source,File_type,Url,Output_type
ENA,Assembly,https://www.ebi.ac.uk/ena/browser/api/xml/ASSEMBLY_ACCESSION,xml
ENA,Bioproject,https://www.ebi.ac.uk/ena/browser/api/xml/BIOPROJECT_ACCESSION,xml
ENA,Biosample,https://www.ebi.ac.uk/ena/browser/api/xml/BIOSAMPLE_ACCESSION,xml
ENA,Taxonomy,https://www.ebi.ac.uk/ena/browser/api/xml/TAXONOMY_ID,xml
NCBI,Assembly,https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/ASSEMBLY_ACCESSION/dataset_report?filters.exclude_atypical=false&filters.assembly_version=current&chromosomes=1&chromosomes=2&chromosomes=3&chromosomes=X&chromosomes=Y&chromosomes=M,json
NCBI,Taxonomy,https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=TAXONOMY_ID,xml
GOAT,Assembly,http://goat.genomehubs.org/api/v2/record?recordId=ASSEMBLY_ACCESSION&result=assembly&taxonomy=ncbi,json
COPO,Biosample,https://copo-project.org/api/sample/biosampleAccession/BIOSAMPLE_ACCESSION?standard=tol&return_type=json,json
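The `Url` column of `genome_metadata_template.csv` carries placeholder tokens such as `ASSEMBLY_ACCESSION` and `TAXONOMY_ID`. How the pipeline resolves them is not visible in this part of the diff, so the following is only a minimal sketch that assumes plain string substitution. The function name `fill_urls` and the taxonomy ID `0000` are hypothetical; the assembly accession is the one from the README usage example.

```python
import csv
import io

# Two rows copied from the template above; the committed file has eight.
TEMPLATE = """#File_source,File_type,Url,Output_type
ENA,Assembly,https://www.ebi.ac.uk/ena/browser/api/xml/ASSEMBLY_ACCESSION,xml
ENA,Taxonomy,https://www.ebi.ac.uk/ena/browser/api/xml/TAXONOMY_ID,xml
"""


def fill_urls(template_text, values):
    """Return (source, file_type, url, output_type) tuples with tokens replaced.

    `values` maps a placeholder token to the accession that replaces it.
    Plain substring replacement is an assumption of this sketch.
    """
    rows = []
    for row in csv.reader(io.StringIO(template_text)):
        # Skip blank lines and the commented header line.
        if not row or row[0].startswith("#"):
            continue
        source, file_type, url, output_type = row
        for token, value in values.items():
            url = url.replace(token, value)
        rows.append((source, file_type, url, output_type))
    return rows


# "0000" is a made-up taxonomy ID used purely for illustration.
resolved = fill_urls(
    TEMPLATE,
    {"ASSEMBLY_ACCESSION": "GCA_922984935.2", "TAXONOMY_ID": "0000"},
)
for source, file_type, url, output_type in resolved:
    print(source, file_type, url, output_type)
```

Keeping the substitution separate from any HTTP request makes the resolved URLs easy to inspect before anything is fetched.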
Binary file added assets/genome_note_template.docx
Binary file not shown.
34 changes: 34 additions & 0 deletions assets/genome_note_template.xml
@@ -0,0 +1,34 @@
<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article>
<article>
<body>
<sec>
<title>Species taxonomy</title>
<p>{{ TAX_STRING }};
<italic>{{ GENUS }}</italic>;
<italic>{{ GENUS_SPECIES }}</italic> ($TAXONOMY_AUTHORITY) (NCBI:txid{{ NCBI_TAXID }}) {{ TEST_NOT_REPLACED }}.
</p>
</sec>
<sec>
<table>
<thead>
<tr>
<th align="center" valign="top">INSDC accession</th>
<th align="center" valign="top">Chromosome</th>
<th align="center" valign="top">Length (Mb)</th>
<th align="center" valign="top">GC%</th>
</tr>
</thead>
<tbody>
{% for chromosome in CHR_TABLE %}
<tr>
<td align="left" valign="top">{{ chromosome.get('Accession') }}</td>
<td align="center" valign="top">{{ chromosome.get('Chromosome') }}</td>
<td align="center" valign="top">{{ chromosome.get('Length') }}</td>
<td align="center" valign="top">{{ chromosome.get('GC') }}</td>
</tr>
{% endfor %}
</tbody>
</table>
</sec>
</body>
</article>
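The `{{ ... }}` and `{% for %}` markers in this template resemble Jinja-style syntax, and the CHANGELOG says placeholders are replaced with data generated by the pipeline. The rendering code itself is not part of this excerpt. As a rough illustration only, the sketch below fills simple `{{ NAME }}` placeholders with the standard library and leaves unknown names untouched; it does not handle the `{% for %}` loop. The function name `fill_placeholders` and the taxon ID `0000` are hypothetical.

```python
import re

# Matches simple placeholders such as "{{ GENUS }}" (word characters only).
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def fill_placeholders(text, params):
    """Replace {{ NAME }} with params[NAME]; keep placeholders with no value."""

    def substitute(match):
        name = match.group(1)
        return str(params[name]) if name in params else match.group(0)

    return PLACEHOLDER.sub(substitute, text)


line = (
    "<italic>{{ GENUS_SPECIES }}</italic> "
    "(NCBI:txid{{ NCBI_TAXID }}) {{ TEST_NOT_REPLACED }}."
)
# The species name comes from the test samplesheet; the taxon ID is made up.
filled = fill_placeholders(line, {"GENUS_SPECIES": "Ceramica pisi", "NCBI_TAXID": "0000"})
print(filled)
```

Leaving unmatched placeholders in place, rather than blanking them, keeps gaps in the generated note visible to whoever completes it by hand.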
7 changes: 3 additions & 4 deletions assets/samplesheet.csv
@@ -1,5 +1,4 @@
sample,datatype,datafile
-uoEpiScrs1,pacbio,https://tolit.cog.sanger.ac.uk/test-data/Epithemia_sp._CRS-2021b/genomic_data/uoEpiScrs1/pacbio/m64228e_220617_134154.ccs.bc1015_BAK8B_OA--bc1015_BAK8B_OA.rmdup.subset.bam
-uoEpiScrs1,pacbio,https://tolit.cog.sanger.ac.uk/test-data/Epithemia_sp._CRS-2021b/genomic_data/uoEpiScrs1/pacbio/m64016e_220621_193126.ccs.bc1008_BAK8A_OA--bc1008_BAK8A_OA.rmdup.subset.bam
-uoEpiScrs1c,hic,https://tolit.cog.sanger.ac.uk/test-data/Epithemia_sp._CRS-2021b/analysis/uoEpiScrs1.1/read_mapping/hic/GCA_946965045.1.unmasked.hic.uoEpiScrs1.subsampled.cram
-uoEpiScrs1b,hic,https://tolit.cog.sanger.ac.uk/test-data/Epithemia_sp._CRS-2021b/analysis/uoEpiScrs1.1/read_mapping/hic/GCA_946965045.1.unmasked.hic.uoEpiScrs1.subsampled.bam
+ilCerPisi1,pacbio,https://tolit.cog.sanger.ac.uk/test-data/Ceramica_pisi/genomic_data/ilCerPisi1/pacbio/m84047_230817_174414_s3.ccs.bc2048.subsampled.bam
+ilCerPisi1,pacbio,https://tolit.cog.sanger.ac.uk/test-data/Ceramica_pisi/genomic_data/ilCerPisi1/pacbio/m64097e_230309_154741.ccs.bc1012_BAK8A_OA--bc1012_BAK8A_OA.subsampled.bam
+ilCerPisi1,hic,https://tolit.cog.sanger.ac.uk/test-data/Ceramica_pisi/analysis/ilCerPisi1.1/read_mapping/hic/GCA_963859965.1.unmasked.hic.ilCerPisi2.subsampled.cram
145 changes: 145 additions & 0 deletions bin/check_parameters.py
@@ -0,0 +1,145 @@
#!/usr/bin/env python3

import os
import sys
import requests
import argparse


def parse_args(args=None):
    Description = "Use the genome assembly accession to fetch additional information on genome from ENA"
    Epilog = "Example usage: python check_parameters.py --assembly --wgs_biosample --output"

    parser = argparse.ArgumentParser(description=Description, epilog=Epilog)
    parser.add_argument("--assembly", required=True, help="The INSDC accession for the assembly")
    parser.add_argument("--wgs_biosample", required=True, help="The biosample accession for the WGS data")
    parser.add_argument("--hic_biosample", required=False, help="The biosample accession for the Hi-C data")
    parser.add_argument("--rna_biosample", required=False, help="The biosample accession for the RNASeq data")
    parser.add_argument("--output", required=True, help="Output file path")
    return parser.parse_args()


def make_dir(path):
    if len(path) > 0:
        os.makedirs(path, exist_ok=True)


def fetch_assembly_data(assembly, wgs_biosample, hic_biosample, rna_biosample, output_file):
    url = f"https://www.ebi.ac.uk/ena/portal/api/search?query=assembly_set_accession%3D%22{assembly}%22&result=assembly&fields=assembly_set_accession%2Ctax_id%2Cscientific_name%2Cstudy_accession&limit=0&download=true&format=json"
    response = requests.get(url)

    if response.status_code == 200:
        assembly_data = response.json()
        taxon_id = assembly_data[0].get("tax_id", None)
        species = assembly_data[0].get("scientific_name", None).replace(" ", "_")
        study = assembly_data[0].get("study_accession", None)
        params = [assembly, species, taxon_id]
        header = ["assembly", "species", "taxon_id"]

        if study:
            study_url = f"https://www.ebi.ac.uk/ena/portal/api/search?query=study_accession%3D%22{study}%22&result=study&fields=parent_study_accession&limit=0&download=true&format=json"
            study_response = requests.get(study_url)

            if study_response.status_code == 200:
                study_data = study_response.json()
                studies = study_data[0].get("parent_study_accession").split(";")
                params.append(studies[0])
                header.append("bioproject")

            else:
                raise AssertionError(f"Could not determine the Bioproject linked to this assembly {assembly}\n")
        else:
            raise AssertionError(f"Could not determine the Bioproject linked to this assembly {assembly}\n")

        # Validate wgs_biosample
        wgs_url = f"https://www.ebi.ac.uk/ena/portal/api/search?query=sample_accession%3D%22{wgs_biosample}%22&result=sample&fields=sample_accession%2Ctax_id&limit=0&download=true&format=json"
        wgs_response = requests.get(wgs_url)

        if wgs_response.status_code == 200:
            wgs_data = wgs_response.json()
            tax_id = wgs_data[0].get("tax_id")

            if tax_id != taxon_id:
                raise AssertionError(
                    f"The WGS biosample taxon id: {tax_id} does not match the assembly taxon id: {taxon_id}\n"
                )
            else:
                params.append(wgs_biosample)
                header.append("wgs_biosample")

        else:
            raise AssertionError(f"The WGS biosample id: {wgs_biosample} could not be retrieved from ENA\n")

        # Validate hic_biosample
        if hic_biosample and hic_biosample != "null":
            print(hic_biosample)
            hic_url = f"https://www.ebi.ac.uk/ena/portal/api/search?query=sample_accession%3D%22{hic_biosample}%22&result=sample&fields=sample_accession%2Ctax_id&limit=0&download=true&format=json"
            hic_response = requests.get(hic_url)

            if hic_response.status_code == 200:
                hic_data = hic_response.json()
                hic_tax_id = hic_data[0].get("tax_id")

                if hic_tax_id != taxon_id:
                    raise AssertionError(
                        f"The Hi-C biosample taxon id: {hic_tax_id} does not match the assembly taxon id: {taxon_id}\n"
                    )
                else:
                    header.append("hic_biosample")
                    params.append(hic_biosample)

            else:
                raise AssertionError(f"The Hi-C biosample id: {hic_biosample} could not be retrieved from ENA\n")
        else:
            header.append("hic_biosample")
            params.append("null")

        # Validate rna_biosample
        if rna_biosample and rna_biosample != "null":
            rna_url = f"https://www.ebi.ac.uk/ena/portal/api/search?query=sample_accession%3D%22{rna_biosample}%22&result=sample&fields=sample_accession%2Ctax_id&limit=0&download=true&format=json"
            rna_response = requests.get(rna_url)

            if rna_response.status_code == 200:
                rna_data = rna_response.json()
                rna_tax_id = rna_data[0].get("tax_id")

                if rna_tax_id != taxon_id:
                    raise AssertionError(
                        f"The RNASeq biosample taxon id: {rna_tax_id} does not match the assembly taxon id: {taxon_id}\n"
                    )
                else:
                    header.append("rna_biosample")
                    params.append(rna_biosample)

            else:
                raise AssertionError(f"The RNASeq biosample id: {rna_biosample} could not be retrieved from ENA\n")

        else:
            header.append("rna_biosample")
            params.append("null")

        with open(output_file, "w") as fout:
            # Write header
            fout.write(",".join(header) + "\n")
            fout.write(",".join(params) + "\n")

        return output_file
    else:
        raise AssertionError(f"The assembly accession: {assembly} was not found\n")


def main(args=None):
    args = parse_args(args)
    hic_biosample = args.hic_biosample
    rna_biosample = args.rna_biosample
    fetch_assembly_data(
        args.assembly,
        args.wgs_biosample,
        hic_biosample,
        rna_biosample,
        args.output,
    )


if __name__ == "__main__":
    sys.exit(main())
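`check_parameters.py` repeats the same ENA lookup and taxon comparison for the WGS, Hi-C and RNASeq biosamples. The sketch below shows one way that shared logic could be factored out; it is not part of the commit. The helper names `sample_query_url` and `check_taxon` are hypothetical, the sketch raises `ValueError` where the script raises `AssertionError`, and it adds an empty-result check that the script does not have. It makes no network request, so the records passed to `check_taxon` are made-up stand-ins for a parsed ENA response.

```python
from urllib.parse import urlencode

ENA_SEARCH = "https://www.ebi.ac.uk/ena/portal/api/search"


def sample_query_url(biosample):
    """Build the ENA portal search URL used to look up one biosample."""
    query = {
        "query": f'sample_accession="{biosample}"',
        "result": "sample",
        "fields": "sample_accession,tax_id",
        "limit": 0,
        "download": "true",
        "format": "json",
    }
    # urlencode percent-encodes '=', '"' and ',' in the values.
    return f"{ENA_SEARCH}?{urlencode(query)}"


def check_taxon(records, expected_tax_id, label):
    """Return the sample's tax_id, or raise if it is missing or mismatched.

    `records` is the parsed JSON list from ENA; `label` names the data type
    (for example "WGS") in error messages.
    """
    if not records:
        raise ValueError(f"The {label} biosample could not be retrieved from ENA")
    tax_id = records[0].get("tax_id")
    if tax_id != expected_tax_id:
        raise ValueError(
            f"The {label} biosample taxon id: {tax_id} does not match "
            f"the assembly taxon id: {expected_tax_id}"
        )
    return tax_id


# The accession is the one from the README example; the tax IDs are made up.
print(sample_query_url("SAMEA7524400"))
print(check_taxon([{"tax_id": "0000"}], "0000", "WGS"))
```

Building the query with `urlencode` avoids hand-writing escapes such as `%3D%22` in each URL, and a single comparison helper keeps the three error messages consistent.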