sanger-tol · tkchafin · Aug 15, 2024
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -31,7 +31,7 @@ jobs:
         uses: actions/checkout@v3
 
       - name: Install Nextflow
-        uses: nf-core/setup-nextflow@v1
+        uses: nf-core/setup-nextflow@v2
         with:
           version: "${{ matrix.NXF_VER }}"
 

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,6 +3,41 @@
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [[2.1.0](https://github.com/sanger-tol/genomenote/releases/tag/2.1.0)] - Pembroke Welsh Corgi [2024-12-11]
+
+### Enhancements & fixes
+
+- New annotation_statistics subworkflow, which runs BUSCO in protein mode and generates basic statistics on the annotated gene set when a GFF3 file of gene annotations is provided via the `--annotation_set` option.
+- The genome_metadata subworkflow now queries Ensembl's GraphQL API to determine if Ensembl has released gene annotation for the assembly being processed.
+- Updated modules and removed Anaconda channels
+- Removed merquryfk completeness metric
+
+### Parameters
+
+| Old parameter | New parameter    |
+| ------------- | ---------------- |
+|               | --annotation_set |
+
+> **NB:** Parameter has been **updated** if both old and new parameter information is present. </br> **NB:** Parameter has been **added** if just the new parameter information is present. </br> **NB:** Parameter has been **removed** if new parameter information isn't present.
+
+### Software dependencies
+
+Note, since the pipeline is using Nextflow DSL2, each process will be run with its own [Biocontainer](https://biocontainers.pro/#/registry). This means that on occasion it is entirely possible for the pipeline to be using different versions of the same tool. However, the overall software dependency changes compared to the last release have been listed below for reference. Only `Docker` or `Singularity` containers are supported, `conda` is not supported.
+
+| Dependency  | Old version                              | New version                              |
+| ----------- | ---------------------------------------- | ---------------------------------------- |
+| `agat`      |                                          | 1.4.0                                    |
+| `bedtools`  | 2.30.0                                   | 2.31.1                                   |
+| `busco`     | 5.5.0                                    | 5.7.1                                    |
+| `cooler`    | 0.8.11                                   | 0.9.2                                    |
+| `fastk`     | 427104ea91c78c3b8b8b49f1a7d6bbeaa869ba1c | 666652151335353eef2fcd58880bcef5bc2928e1 |
+| `gffread`   |                                          | 0.12.7                                   |
+| `merquryfk` | d00d98157618f4e8d1a9190026b19b471055b22e |                                          |
+| `multiqc`   | 1.14                                     | 1.25.1                                   |
+| `samtools`  | 1.17                                     | 1.21                                     |
+
+> **NB:** Dependency has been **updated** if both old and new version information is present. </br> **NB:** Dependency has been **added** if just the new version information is present. </br> **NB:** Dependency has been **removed** if new version information isn't present.
+
 ## [[2.0.0](https://github.com/sanger-tol/genomenote/releases/tag/2.0.0)] - English Cocker Spaniel [2024-10-10]
 
 ### Enhancements & fixes

diff --git a/CITATION.cff b/CITATION.cff
@@ -8,8 +8,8 @@ message: >-
     metadata from this file.
 type: software
 authors:
-    - given-names: Sandra
-      family-names: Babiyre
+    - given-names: Sandra Ruth
+      family-names: Babirye
       affiliation: Wellcome Sanger Institute
       orcid: "https://orcid.org/0009-0004-7773-7008"
     - given-names: Tyler

diff --git a/CITATIONS.md b/CITATIONS.md
@@ -12,6 +12,10 @@
 
 ## Pipeline tools
 
+- [AGAT](https://github.com/NBISweden/AGAT)
+
+  > Dainat J. AGAT: Another Gff Analysis Toolkit to handle annotations in any GTF/GFF format. (Version v1.4.0). Zenodo. https://www.doi.org/10.5281/zenodo.3552717
+
 - [BedTools](https://bedtools.readthedocs.io/en/latest/)
 
   > Quinlan, Aaron R., and Ira M. Hall. “BEDTools: A Flexible Suite of Utilities for Comparing Genomic Features.” Bioinformatics, vol. 26, no. 6, 2010, pp. 841–842., https://doi.org/10.1093/bioinformatics/btq033.
@@ -30,6 +34,10 @@
 
 - [FastK](https://github.com/thegenemyers/FASTK)
 
+- [GFFREAD](https://github.com/gpertea/gffread)
+
+  > Pertea G and Pertea M. "GFF Utilities: GffRead and GffCompare [version 1; peer review: 3 approved]". F1000Research 2020, 9:304 https://doi.org/10.12688/f1000research.23297.1
+
 - [MerquryFK](https://github.com/thegenemyers/MERQURY.FK)
 
 - [MultiQC](https://multiqc.info)
@@ -48,9 +56,9 @@
 
 ## Software packaging/containerisation tools
 
-- [Anaconda](https://anaconda.com)
+- [Conda](https://conda.org/)
 
-  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.
+  > conda contributors. conda: A system-level, binary package and environment manager running on all major operating systems and platforms. Computer software. https://github.com/conda/conda
 
 - [Bioconda](https://bioconda.github.io)
 

diff --git a/README.md b/README.md
@@ -4,7 +4,7 @@
 [![Cite with Zenodo](http://img.shields.io/badge/DOI-10.5281/zenodo.7949384-1073c8?labelColor=000000)](https://doi.org/10.5281/zenodo.7949384)
 
 [![Nextflow](https://img.shields.io/badge/nextflow%20DSL2-%E2%89%A522.10.1-23aa62.svg)](https://www.nextflow.io/)
-[![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/)
+[![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=conda)](https://docs.conda.io/en/latest/)
 [![run with docker](https://img.shields.io/badge/run%20with-docker-0db7ed?labelColor=000000&logo=docker)](https://www.docker.com/)
 [![run with singularity](https://img.shields.io/badge/run%20with-singularity-1d355c.svg?labelColor=000000)](https://sylabs.io/docs/)
 [![Launch on Nextflow Tower](https://img.shields.io/badge/Launch%20%F0%9F%9A%80-Nextflow%20Tower-%234256e7)](https://tower.nf/launch?pipeline=https://github.com/sanger-tol/genomenote)
@@ -13,7 +13,7 @@
 
 ## Introduction
 
-**sanger-tol/genomenote** is a bioinformatics pipeline that takes aligned HiC reads, creates contact maps and chromosomal grid using Cooler, and display on a [HiGlass server](https://genome-note-higlass.tol.sanger.ac.uk/app). The pipeline also collates (1) assembly information, statistics and chromosome details from NCBI datasets, (2) genome completeness from BUSCO, (3) consensus quality and k-mer completeness from MerquryFK, and (4) HiC primary mapped percentage from samtools flagstat.
+**sanger-tol/genomenote** is a bioinformatics pipeline that takes aligned HiC reads, creates contact maps and a chromosomal grid using Cooler, and displays them on a [HiGlass server](https://genome-note-higlass.tol.sanger.ac.uk/app). The pipeline also collates (1) assembly information, statistics and chromosome details from NCBI datasets, (2) genome completeness from BUSCO, (3) consensus quality and k-mer completeness from MerquryFK, (4) HiC primary mapped percentage from samtools flagstat and, optionally, (5) annotation statistics from AGAT and BUSCO. The pipeline combines the calculated statistics and collated assembly metadata with a template document to output a genome note document.
 
 <!--![sanger-tol/genomenote workflow](https://raw.githubusercontent.com/sanger-tol/genomenote/main/docs/images/sanger-tol-genomenote_workflow.png)-->
 
@@ -25,7 +25,9 @@
 6. Genome completeness ([`NCBI API`](https://www.ncbi.nlm.nih.gov/datasets/docs/v1/reference-docs/rest-api/), [`BUSCO`](https://busco.ezlab.org))
 7. Consensus quality and k-mer completeness ([`FASTK`](https://github.com/thegenemyers/FASTK), [`MERQURY.FK`](https://github.com/thegenemyers/MERQURY.FK))
 8. Collated summary table ([`createtable`](bin/create_table.py))
-9. Present results and visualisations ([`MultiQC`](http://multiqc.info/), [`R`](https://www.r-project.org/))
+9. Optionally calculate annotation statistics and completeness ([`AGAT`](https://github.com/NBISweden/AGAT), [`BUSCO`](https://busco.ezlab.org))
+10. Combine calculated statistics and assembly metadata with a template file to produce a genome note document
+11. Present results and visualisations ([`MultiQC`](http://multiqc.info/), [`R`](https://www.r-project.org/))
 
 ## Usage
 

diff --git a/assets/genome_note_template.docx b/assets/genome_note_template.docx
diff --git a/bin/combine_parsed_data.py b/bin/combine_parsed_data.py
@@ -21,6 +21,7 @@
     ("COPO_BIOSAMPLE_HIC", "copo_biosample_hic_file"),
     ("COPO_BIOSAMPLE_RNA", "copo_biosample_rna_file"),
     ("GBIF_TAXONOMY", "gbif_taxonomy_file"),
+    ("ENSEMBL_ANNOTATION", "ensembl_annotation_file"),
 ]
 
 
@@ -42,6 +43,7 @@ def parse_args(args=None):
     parser.add_argument("--copo_biosample_hic_file", help="Input parsed COPO HiC biosample file.", required=False)
     parser.add_argument("--copo_biosample_rna_file", help="Input parsed COPO RNASeq biosample file.", required=False)
     parser.add_argument("--gbif_taxonomy_file", help="Input parsed GBIF taxonomy file.", required=False)
+    parser.add_argument("--ensembl_annotation_file", help="Input parsed Ensembl annotation file.", required=False)
     parser.add_argument("--out_consistent", help="Output file.", required=True)
     parser.add_argument("--out_inconsistent", help="Output file.", required=True)
     parser.add_argument("--version", action="version", version="%(prog)s 1.0")

diff --git a/bin/combine_statistics_data.py b/bin/combine_statistics_data.py
@@ -8,7 +8,8 @@
 
 files = [
     ("CONSISTENT", "in_consistent"),
-    ("STATISITCS", "in_statistics"),
+    ("GENOME_STATISTICS", "in_genome_statistics"),
+    ("ANNOTATION_STATISITCS", "in_annotation_statistics"),
 ]
 
 
@@ -19,7 +20,13 @@ def parse_args(args=None):
     parser = argparse.ArgumentParser(description=Description, epilog=Epilog)
     parser.add_argument("--in_consistent", help="Input consistent params file.", required=True)
     parser.add_argument("--in_inconsistent", help="Input consistent params file.", required=True)
-    parser.add_argument("--in_statistics", help="Input parsed genome statistics params file.", required=True)
+    parser.add_argument("--in_genome_statistics", help="Input parsed genome statistics params file.", required=True)
+    parser.add_argument(
+        "--in_annotation_statistics",
+        help="Input parsed annotation statistics params file.",
+        required=False,
+        default=None,
+    )
     parser.add_argument("--out_consistent", help="Output file.", required=True)
     parser.add_argument("--out_inconsistent", help="Output file.", required=True)
     parser.add_argument("--version", action="version", version="%(prog)s 1.0")
@@ -36,7 +43,7 @@ def process_file(file_in, file_type, params, param_sets):
         reader = csv.reader(infile)
 
         for row in reader:
-            if row[0] == "#paramName":
+            if row[0].startswith("#"):
                 continue
 
             key = row.pop(0)
@@ -95,7 +102,10 @@ def main(args=None):
     params_inconsistent = {}
 
     for file in files:
-        (params, param_sets) = process_file(getattr(args, file[1]), file[0], params, param_sets)
+        if file[0] == "ANNOTATION_STATISITCS" and args.in_annotation_statistics is None:
+            continue
+        else:
+            (params, param_sets) = process_file(getattr(args, file[1]), file[0], params, param_sets)
 
     for key in params.keys():
         value_set = {v for v in params[key]}
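
Review note: the switch from an exact `#paramName` match to `startswith("#")` matters because the new annotation statistics CSV begins with `# KEY: description` comment lines before its header row. The snippet below is a minimal sketch of that reading behaviour, not the pipeline's code: the helper name `read_params` and the sample values are made up for illustration, and the blank-row guard is an extra assumption that the patched `process_file` does not include.

```python
import csv
import io


def read_params(handle):
    """Collect key/value rows from a params CSV, skipping '#' comment and header rows."""
    params = {}
    for row in csv.reader(handle):
        # Skip blank lines and any row whose first field starts with '#'
        if not row or row[0].startswith("#"):
            continue
        key = row.pop(0)
        params[key] = row[0] if row else ""
    return params


# Made-up sample mimicking the layout written by write_to_csv (values are illustrative)
sample = io.StringIO(
    "# PCG: The number of protein coding genes\n"
    "#paramName,paramValue\n"
    "PCG,12345\n"
    "NCG,678\n"
)
print(read_params(sample))
```

With the old exact match, the `# PCG: ...` description line would have been treated as a data row.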

diff --git a/bin/extract_annotation_statistics_info.py b/bin/extract_annotation_statistics_info.py
@@ -0,0 +1,154 @@
+#!/usr/bin/env python3
+import re
+import csv
+import sys
+import argparse
+import json
+
+
+# Extract CDS information from mrna and transcript sections
+def extract_cds_info(file):
+    # Define regex patterns for different statistics
+    patterns = {
+        "TRANSC_MRNA": re.compile(r"Number of mrna\s+(\d+)"),
+        "PCG": re.compile(r"Number of gene\s+(\d+)"),
+        "CDS_PER_GENE": re.compile(r"mean mrnas per gene\s+([\d.]+)"),
+        "EXONS_PER_TRANSC": re.compile(r"mean exons per mrna\s+([\d.]+)"),
+        "CDS_LENGTH": re.compile(r"mean mrna length \(bp\)\s+([\d.]+)"),
+        "EXON_SIZE": re.compile(r"mean exon length \(bp\)\s+([\d.]+)"),
+        "INTRON_SIZE": re.compile(r"mean intron in cds length \(bp\)\s+([\d.]+)"),
+    }
+
+    # Initialize a dictionary to store content for different sections
+    section_content = {"mrna": "", "transcript": ""}
+
+    # Variable to keep track of the current section being processed
+    current_section = None
+
+    with open(file, "r") as f:
+        lines = f.read().splitlines()  # read all lines in the file
+
+    for line in lines:
+        line = line.strip()  # Remove any leading/trailing whitespace including newline characters
+
+        if "---------------------------------- mrna ----------------------------------" in line:
+            current_section = "mrna"  # Switch to 'mrna' section
+        elif "---------------------------------- transcript ----------------------------------" in line:
+            current_section = "transcript"  # Switch to 'transcript' section
+        elif "----------" in line:
+            current_section = None  # End of current section
+        elif current_section:
+            section_content[current_section] += (
+                line + " "
+            )  # Accumulate content for the current section, separate lines by a space
+
+    cds_info = {}
+
+    for label, pattern in patterns.items():
+        text_to_search = section_content["mrna"] if label != "EXONS_PER_TRANSC" else section_content["transcript"]
+        match = re.search(pattern, text_to_search)
+        if match:
+            cds_info[label] = match.group(1)
+
+    return cds_info
+
+
+# Function to extract the number of non-coding genes from the second file
+def extract_non_coding_genes(file):
+    non_coding_genes = {"ncrna_gene": 0}
+
+    with open(file, "r") as f:
+        for line in f:
+            parts = line.split()
+            if len(parts) < 2:
+                continue
+
+            gene_type = parts[0]
+            try:
+                count = int(parts[1])
+            except ValueError:
+                continue
+
+            if gene_type in non_coding_genes:
+                non_coding_genes[gene_type] += count
+
+    NCG = sum(non_coding_genes.values())
+    return {"NCG": NCG}
+
+
+# Extract the one_line_summary from a BUSCO JSON file
+def extract_busco_results(busco_stats_file):
+    try:
+        with open(busco_stats_file, "r") as file:
+            busco_data = json.load(file)
+            # Extract the one_line_summary from the results section
+            one_line_summary = busco_data.get("results", {}).get("one_line_summary")
+            if one_line_summary:
+                # Use regex to extract everything after the first colon
+                match = re.search(r':\s*"(.*)"', one_line_summary)
+                if match:
+                    one_line_summary = match.group(1)  # Get text after the colon
+            return {"BUSCO_PROTEIN_SCORES": one_line_summary} if one_line_summary else {}
+    except (json.JSONDecodeError, FileNotFoundError) as e:
+        print(f"Error loading BUSCO JSON file: {e}")
+        return {}
+
+
+# Function to write the extracted data to a CSV file
+def write_to_csv(data, output_file, busco_stats_file):
+    busco_results = extract_busco_results(busco_stats_file)
+
+    descriptions = {
+        "TRANSC_MRNA": "The number of transcribed mRNAs",
+        "PCG": "The number of protein coding genes",
+        "NCG": "The number of non-coding genes",
+        "CDS_PER_GENE": "The average number of coding transcripts per gene",
+        "EXONS_PER_TRANSC": "The average number of exons per transcript",
+        "CDS_LENGTH": "The average length of coding sequence",
+        "EXON_SIZE": "The average length of a coding exon",
+        "INTRON_SIZE": "The average length of a coding intron",
+        "BUSCO_PROTEIN_SCORES": "BUSCO results summary from running BUSCO in protein mode",
+    }
+
+    with open(output_file, "w", newline="") as csvfile:
+        writer = csv.writer(csvfile)
+
+        # Write descriptions at the top of the CSV file
+        for key, description in descriptions.items():
+            csvfile.write(f"# {key}: {description}\n")
+
+        # Write the Variable and Value columns header
+        writer.writerow(["#paramName", "paramValue"])
+
+        # Write the data
+        for key, value in data.items():
+            writer.writerow([key, value])
+
+        # Add the BUSCO results summary
+        for key, value in busco_results.items():
+            writer.writerow([key, value])
+
+
+# Main function to take input files and output file as arguments
+def main():
+    Description = "Parse the outputs of agat_spstatistics, buscoproteins and agat_sqstatbasic to extract relevant annotation statistics."
+    Epilog = (
+        "Example usage: python extract_annotation_statistics_info.py <basic_stats> <other_stats> <busco_stats> <output>"
+    )
+
+    parser = argparse.ArgumentParser(description=Description, epilog=Epilog)
+    parser.add_argument("basic_stats", help="Input txt file with basic_feature_statistics.")
+    parser.add_argument("other_stats", help="Input txt file with other_feature_statistics.")
+    parser.add_argument("busco_stats", help="Input JSON file for the BUSCO statistics.")
+    parser.add_argument("output", help="Output file.")
+    parser.add_argument("--version", action="version", version="%(prog)s 1.0")
+    args = parser.parse_args()
+
+    cds_info = extract_cds_info(args.other_stats)
+    non_coding_genes = extract_non_coding_genes(args.basic_stats)
+    data = {**cds_info, **non_coding_genes}
+    write_to_csv(data, args.output, args.busco_stats)
+
+
+if __name__ == "__main__":
+    sys.exit(main())
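
Review note: `extract_cds_info` relies on dashed section headers to decide which block of the statistics report each regex is applied to. The sketch below shows the same idea in a generalised form, assuming the header layout implied by the script (a word between two runs of dashes). It is a variation for illustration, not the script's exact logic: `split_sections`, `SECTION_HEADER`, and the report excerpt with its numbers are all hypothetical.

```python
import re

# Assumed header shape: a run of dashes, a section name, another run of dashes
SECTION_HEADER = re.compile(r"^-{10,}\s*(\w+)\s*-{10,}$")


def split_sections(text):
    """Group report lines under the name found in each dashed section header."""
    sections = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        header = SECTION_HEADER.match(line)
        if header:
            current = header.group(1)
            sections[current] = []
        elif current and line:
            sections[current].append(line)
    return sections


# Made-up excerpt in the style the script expects; the numbers are illustrative only
report = """
---------------------------------- mrna ----------------------------------
Number of gene                               100
Number of mrna                               150
mean exons per mrna                          5.2
---------------------------------- transcript ----------------------------------
Number of gene                               20
"""

sections = split_sections(report)
# Apply one of the script's patterns to the joined 'mrna' section only
match = re.search(r"Number of mrna\s+(\d+)", " ".join(sections["mrna"]))
print(match.group(1))
```

Matching the header with a regex rather than a fixed-width string avoids a silent miss if the number of dashes in the report changes, at the cost of accepting any single-word section name.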