2.1.0 release #161

Open
wants to merge 148 commits into
base: main
148 commits
7dc9838
added two new nf-core modules 'agat/sptatistics and agat/sqstatbasic'
SandraBabirye Aug 15, 2024
0548d90
added two new nf-core modules 'agat/sptatistics and agat/sqstatbasic'
SandraBabirye Aug 15, 2024
cfafbaa
added new subworkflow to obtain annotation summary statistics
SandraBabirye Aug 15, 2024
ddc3edc
added params.annotation_set
SandraBabirye Aug 15, 2024
3bd733a
added documentation for params.annotation_set
SandraBabirye Aug 15, 2024
131b7b8
added annotation_stats as specific output directory
SandraBabirye Aug 15, 2024
e98a375
allocated resources for the two new processes 'AGAT-SPSTATISTICS and …
SandraBabirye Aug 15, 2024
5a2abe7
added new subworkflow and params.annotation_set
SandraBabirye Aug 15, 2024
c35db05
added path to the test annotation file
SandraBabirye Aug 15, 2024
2fc43e8
modified the file after running prettier
SandraBabirye Aug 15, 2024
a8dbd78
added a new subworkflow 'ANNOTATION_STATS)
SandraBabirye Aug 16, 2024
da91c57
Edited annotation_statistics.nf file
SandraBabirye Aug 16, 2024
65b463e
edited the pattern for --annotation_set to include gff rather than gff3
SandraBabirye Aug 16, 2024
85ac505
edited the pattern for --annotation_set to include gff rather than gff3
SandraBabirye Aug 16, 2024
7e201c7
Merge pull request #135 from sanger-tol/annotation_statistics
BethYates Aug 16, 2024
26bd58c
added a python script that extracts the relevant annotation statistic…
SandraBabirye Aug 29, 2024
fe3c78e
added a new local module to extract annotation statistics information
SandraBabirye Aug 29, 2024
fd0d992
edited the annotation_stats subworkflow to include the new local module
SandraBabirye Aug 29, 2024
e6c6778
adjusted the allocated memory for agat_spstatistics and agat_sqstatba…
SandraBabirye Aug 29, 2024
ead4298
edited the annotation_stats subworkflow to include the new local module
SandraBabirye Aug 29, 2024
a52bc5d
edited the annotation_stats subworkflow to include the new local module
SandraBabirye Aug 29, 2024
054dade
edited the annotation_stats subworkflow and removed .out
SandraBabirye Aug 29, 2024
0e316cd
edited the modules.confif file
SandraBabirye Aug 29, 2024
5902545
edited the file
SandraBabirye Aug 29, 2024
0f8b9ae
edited the file
SandraBabirye Aug 29, 2024
5b6db7a
edited file
SandraBabirye Aug 29, 2024
fe44659
edited file
SandraBabirye Aug 29, 2024
d715b26
edited file
SandraBabirye Aug 29, 2024
9b21ccf
edited file
SandraBabirye Aug 29, 2024
5735681
edited file
SandraBabirye Aug 29, 2024
ee54ce0
edited file
SandraBabirye Aug 29, 2024
af5ae87
edited file
SandraBabirye Aug 29, 2024
0b47a55
edited file
SandraBabirye Aug 29, 2024
9f6d570
edited file
SandraBabirye Aug 29, 2024
8276619
edited file
SandraBabirye Aug 29, 2024
71db7fb
edited file
SandraBabirye Aug 29, 2024
c442041
edited file
SandraBabirye Aug 29, 2024
65ce7b4
edited file
SandraBabirye Aug 29, 2024
8fc6d6a
edited file
SandraBabirye Aug 29, 2024
a86d691
edited file
SandraBabirye Aug 29, 2024
2993e06
edited file
SandraBabirye Aug 29, 2024
1e5066f
edited the input channels
SandraBabirye Aug 29, 2024
38edfb1
created a tuple to extract meta id and file path
SandraBabirye Aug 29, 2024
db4f0a2
edited output channel
SandraBabirye Aug 29, 2024
269e1f3
removal of the full publishDir block for both of the nf-core AGAT pro…
SandraBabirye Aug 29, 2024
5ac5e4b
edited the output channel
SandraBabirye Aug 29, 2024
54c908d
edited the output file name
SandraBabirye Aug 29, 2024
c40cb4f
edited the output file name
SandraBabirye Aug 29, 2024
1f10c83
edited the output file and input channels
SandraBabirye Sep 3, 2024
886fc6e
edited the output file
SandraBabirye Sep 3, 2024
601abc7
edited the file permisions
SandraBabirye Sep 3, 2024
b0c4490
edited the annotation_statistics.nf file
SandraBabirye Sep 3, 2024
2fcca2c
Fix code formatting with Black
SandraBabirye Sep 3, 2024
77c6d5e
Fix EditorConfig linting issues
SandraBabirye Sep 3, 2024
fd7ca06
Fix EditorConfig linting issues
SandraBabirye Sep 3, 2024
09630ed
Fix EditorConfig linting issues
SandraBabirye Sep 3, 2024
f94429e
Fix EditorConfig linting issues
SandraBabirye Sep 3, 2024
093c653
Fix EditorConfig linting issues
SandraBabirye Sep 3, 2024
f2d479a
Fix EditorConfig linting issues
SandraBabirye Sep 3, 2024
6f12169
Fix EditorConfig linting issues
SandraBabirye Sep 3, 2024
b053f15
edited the ch_versions for AGAT_SPSTATISTICS
SandraBabirye Sep 4, 2024
57eec24
Merge pull request #136 from SandraBabirye/annotation_statistics
BethYates Sep 4, 2024
e0f5a8c
added gffread nf-core module
SandraBabirye Sep 18, 2024
b339b7a
added busco to run in protein mode
SandraBabirye Sep 18, 2024
c3336b2
added input channel ch_fasta to subworkflow ANNOTATION_STATS
SandraBabirye Sep 18, 2024
e7e15f6
edited the input channels for the BUSCO process
SandraBabirye Sep 18, 2024
18f07a8
added the protein mode
SandraBabirye Sep 18, 2024
e69d60e
added the lineage_db as input channel for the ANNOTATION_STATS subwor…
SandraBabirye Sep 18, 2024
ce859c0
edited the input channels for BUSCO
SandraBabirye Sep 18, 2024
c64153e
added an input channel for busco stats in the local module
SandraBabirye Sep 18, 2024
9f6870e
edited input channels for the ANNOTATION_STATS subworkflow
SandraBabirye Sep 18, 2024
1813b2d
edited input channels for the ANNOTATION_STATS subworkflow
SandraBabirye Sep 18, 2024
0e9fdbc
edited input channels for the ANNOTATION_STATS subworkflow
SandraBabirye Sep 18, 2024
554f1c5
edited the file to extract fasta file path
SandraBabirye Sep 18, 2024
6a39117
edited the file to extract fasta file path
SandraBabirye Sep 18, 2024
6bad994
edited the input channel for the extract annotation statistics local …
SandraBabirye Sep 18, 2024
7c4c496
removed the trailing white space
SandraBabirye Sep 18, 2024
7fe58fc
added -y argument to the GFFREAD process
SandraBabirye Sep 23, 2024
21f95b7
Update CITATION.cff
tkchafin Oct 11, 2024
39b6690
Merge pull request #147 from sanger-tol/spelling
tkchafin Oct 11, 2024
bc7809f
edited files
SandraBabirye Oct 23, 2024
6db6859
edited the file to run busco in protein mode
SandraBabirye Oct 23, 2024
1feb2a6
edited python script to include the busco stats file
SandraBabirye Oct 23, 2024
ee4e2a9
edited files
SandraBabirye Oct 23, 2024
21f2b4d
edited files
SandraBabirye Oct 23, 2024
64a39d5
fixed black linting issues
SandraBabirye Oct 23, 2024
8e5490c
added a busco stats file as parameter
SandraBabirye Oct 23, 2024
208e3ab
edited the arguments
SandraBabirye Oct 23, 2024
b060f6e
fixing black linting issues
SandraBabirye Oct 23, 2024
80301fe
edited the python script to extract busco stats
SandraBabirye Oct 25, 2024
f4d28c7
edited the input channels for buscoproteins process
SandraBabirye Oct 25, 2024
d95f4b5
Changed the order of inputs for ANNOTATION_STATS
SandraBabirye Oct 25, 2024
164b74d
edited the busco output file channel to json
SandraBabirye Oct 25, 2024
99a6e26
edited the python script to extract the one_line_summary information …
SandraBabirye Oct 25, 2024
dfdb0cf
Updated annotation_statistics.nf file
SandraBabirye Oct 25, 2024
fbc3deb
fixing black issues
SandraBabirye Oct 25, 2024
af3a412
fixing linting issues
SandraBabirye Oct 25, 2024
7acb7b8
remove trailing whitespace
SandraBabirye Oct 25, 2024
5ad88dd
removed the left padding space
SandraBabirye Oct 25, 2024
cf1e9d1
removed the left padding space
SandraBabirye Oct 25, 2024
067833c
updated modules.config
SandraBabirye Oct 25, 2024
05114c8
Merge pull request #142 from SandraBabirye/busco_feature
BethYates Oct 31, 2024
0ebe932
Merge branch 'annotation_stats_dev' into public_dev
BethYates Nov 1, 2024
7a08874
Added annotation statistics to the full parameter list and genomenote…
BethYates Nov 13, 2024
37b9838
typo
BethYates Nov 13, 2024
70a80b8
corrected some of the statistics being being returned, we should be …
BethYates Nov 14, 2024
96cf0bd
updated docs to include annotation statistics subworkflow
BethYates Nov 14, 2024
06e2b20
prettier fixes
BethYates Nov 14, 2024
fb20ba4
Query Ensembl's new metadata API to determine if this assembly has be…
BethYates Nov 14, 2024
d2a5b17
documentation updates
BethYates Nov 14, 2024
3f52746
black fix
BethYates Nov 14, 2024
3490b8e
switch to beta url as ensembl rapid site data is now frozen and all d…
BethYates Nov 15, 2024
6a12288
query by taxon_id rather than species name
BethYates Nov 19, 2024
019e9ae
updated comments/usage to refelect change to querying using taxon_id …
BethYates Nov 19, 2024
a343c27
deleted empty file
BethYates Nov 19, 2024
748a325
fixed typo
BethYates Nov 19, 2024
14614fb
modified output file name and directory of annotation_statistics subw…
BethYates Nov 19, 2024
3db5b58
Merge pull request #151 from sanger-tol/add_ensembl_metadata_check
BethYates Nov 20, 2024
5418341
Merge branch 'dev' into public_dev
BethYates Nov 20, 2024
c4f45a0
Merge pull request #150 from sanger-tol/public_dev
BethYates Nov 20, 2024
ae8db5c
Updated docs ahead of release
BethYates Nov 20, 2024
d61696d
updated nf-core/agat modules
tkchafin Nov 22, 2024
6cc12cd
update nf-core/bamtobed
tkchafin Nov 22, 2024
13687d9
nf-core/busco updated to 5.7.1 and patched
tkchafin Nov 22, 2024
798369b
nf-core/cooler updated
tkchafin Nov 22, 2024
c9e4211
nf-core/dumpsoftwareversions updated
tkchafin Nov 22, 2024
061ac2e
nf-core/gffread updated
tkchafin Nov 22, 2024
1f18dad
nf-core module updates
tkchafin Nov 23, 2024
bd8bfaa
nf-core module updates
tkchafin Nov 25, 2024
dd58ec1
remove anaconda references
tkchafin Nov 25, 2024
4a86006
Merge pull request #153 from tkchafin/anaconda_purge
tkchafin Nov 25, 2024
f6220c0
nf-core samtools/view updated
tkchafin Nov 25, 2024
2930baf
Merge pull request #154 from tkchafin/anaconda_purge
tkchafin Nov 25, 2024
a01d592
remote deprecated cleanup=true
tkchafin Dec 5, 2024
2f5b69b
Update CHANGELOG.md
tkchafin Dec 5, 2024
c4e97c3
Update CHANGELOG.md
tkchafin Dec 5, 2024
5f89876
Merge pull request #155 from sanger-tol/anaconda_purge
tkchafin Dec 5, 2024
4eb4aa6
Merge branch 'dev' into Release-2.1.0
BethYates Dec 5, 2024
bbfada7
prettier fix
BethYates Dec 5, 2024
e448f75
Update Utils.groovy
tkchafin Dec 9, 2024
e11e4a8
Merge pull request #152 from sanger-tol/Release-2.1.0
tkchafin Dec 9, 2024
bf9e345
Update CHANGELOG.md
tkchafin Dec 9, 2024
8b9de0f
Merge pull request #158 from sanger-tol/changelog
tkchafin Dec 9, 2024
44f238e
removed merquryfk
tkchafin Dec 10, 2024
8add69e
prettier linting
tkchafin Dec 10, 2024
1092fdc
Merge pull request #160 from tkchafin/merquryfk
tkchafin Dec 13, 2024
974f5cc
Fixed the CI by switching to the v2 of the action
muffato Dec 13, 2024
ee78df8
Merge pull request #162 from sanger-tol/ci
tkchafin Dec 16, 2024
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -31,7 +31,7 @@ jobs:
         uses: actions/checkout@v3

       - name: Install Nextflow
-        uses: nf-core/setup-nextflow@v1
+        uses: nf-core/setup-nextflow@v2
         with:
           version: "${{ matrix.NXF_VER }}"

35 changes: 35 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,41 @@
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+## [[2.1.0](https://github.com/sanger-tol/genomenote/releases/tag/2.1.0)] - Pembroke Welsh Corgi [2024-12-11]
+
+### Enhancements & fixes
+
+- New annotation_statistics subworkflow, which runs BUSCO in protein mode and generates basic statistics on the annotated gene set when a GFF3 file of gene annotations is provided with the `--annotation_set` option.
+- The genome_metadata subworkflow now queries Ensembl's GraphQL API to determine whether Ensembl has released gene annotation for the assembly being processed.
+- Updated modules and removed Anaconda channels.
+- Removed the merquryfk completeness metric.
+
+### Parameters
+
+| Old parameter | New parameter    |
+| ------------- | ---------------- |
+|               | --annotation_set |
+
+> **NB:** Parameter has been **updated** if both old and new parameter information is present. </br> **NB:** Parameter has been **added** if just the new parameter information is present. </br> **NB:** Parameter has been **removed** if new parameter information isn't present.
+
+### Software dependencies
+
+Note, since the pipeline is using Nextflow DSL2, each process will be run with its own [Biocontainer](https://biocontainers.pro/#/registry). This means that on occasion it is entirely possible for the pipeline to be using different versions of the same tool. However, the overall software dependency changes compared to the last release have been listed below for reference. Only `Docker` or `Singularity` containers are supported; `conda` is not supported.
+
+| Dependency  | Old version                              | New version                              |
+| ----------- | ---------------------------------------- | ---------------------------------------- |
+| `agat`      |                                          | 1.4.0                                    |
+| `bedtools`  | 2.30.0                                   | 2.31.1                                   |
+| `busco`     | 5.5.0                                    | 5.7.1                                    |
+| `cooler`    | 0.8.11                                   | 0.9.2                                    |
+| `fastk`     | 427104ea91c78c3b8b8b49f1a7d6bbeaa869ba1c | 666652151335353eef2fcd58880bcef5bc2928e1 |
+| `gffread`   |                                          | 0.12.7                                   |
+| `merquryfk` | d00d98157618f4e8d1a9190026b19b471055b22e |                                          |
+| `multiqc`   | 1.14                                     | 1.25.1                                   |
+| `samtools`  | 1.17                                     | 1.21                                     |
+
+> **NB:** Dependency has been **updated** if both old and new version information is present. </br> **NB:** Dependency has been **added** if just the new version information is present. </br> **NB:** Dependency has been **removed** if new version information isn't present.
+
 ## [[2.0.0](https://github.com/sanger-tol/genomenote/releases/tag/2.0.0)] - English Cocker Spaniel [2024-10-10]

 ### Enhancements & fixes

4 changes: 2 additions & 2 deletions CITATION.cff
@@ -8,8 +8,8 @@ message: >-
   metadata from this file.
 type: software
 authors:
-  - given-names: Sandra
-    family-names: Babiyre
+  - given-names: Sandra Ruth
+    family-names: Babirye
     affiliation: Wellcome Sanger Institute
     orcid: "https://orcid.org/0009-0004-7773-7008"
   - given-names: Tyler

12 changes: 10 additions & 2 deletions CITATIONS.md
@@ -12,6 +12,10 @@

 ## Pipeline tools

+- [AGAT](https://github.com/NBISweden/AGAT)
+
+  > Dainat J. AGAT: Another Gff Analysis Toolkit to handle annotations in any GTF/GFF format. (Version v1.4.0). Zenodo. https://www.doi.org/10.5281/zenodo.3552717
+
 - [BedTools](https://bedtools.readthedocs.io/en/latest/)

   > Quinlan, Aaron R., and Ira M. Hall. “BEDTools: A Flexible Suite of Utilities for Comparing Genomic Features.” Bioinformatics, vol. 26, no. 6, 2010, pp. 841–842., https://doi.org/10.1093/bioinformatics/btq033.
@@ -30,6 +34,10 @@

 - [FastK](https://github.com/thegenemyers/FASTK)

+- [GFFREAD](https://github.com/gpertea/gffread)
+
+  > Pertea G and Pertea M. "GFF Utilities: GffRead and GffCompare [version 1; peer review: 3 approved]". F1000Research 2020, 9:304 https://doi.org/10.12688/f1000research.23297.1
+
 - [MerquryFK](https://github.com/thegenemyers/MERQURY.FK)

 - [MultiQC](https://multiqc.info)
@@ -48,9 +56,9 @@

 ## Software packaging/containerisation tools

-- [Anaconda](https://anaconda.com)
+- [Conda](https://conda.org/)

-  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.
+  > conda contributors. conda: A system-level, binary package and environment manager running on all major operating systems and platforms. Computer software. https://github.com/conda/conda

 - [Bioconda](https://bioconda.github.io)

8 changes: 5 additions & 3 deletions README.md
@@ -4,7 +4,7 @@
 [![Cite with Zenodo](http://img.shields.io/badge/DOI-10.5281/zenodo.7949384-1073c8?labelColor=000000)](https://doi.org/10.5281/zenodo.7949384)

 [![Nextflow](https://img.shields.io/badge/nextflow%20DSL2-%E2%89%A522.10.1-23aa62.svg)](https://www.nextflow.io/)
-[![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/)
+[![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=conda)](https://docs.conda.io/en/latest/)
 [![run with docker](https://img.shields.io/badge/run%20with-docker-0db7ed?labelColor=000000&logo=docker)](https://www.docker.com/)
 [![run with singularity](https://img.shields.io/badge/run%20with-singularity-1d355c.svg?labelColor=000000)](https://sylabs.io/docs/)
 [![Launch on Nextflow Tower](https://img.shields.io/badge/Launch%20%F0%9F%9A%80-Nextflow%20Tower-%234256e7)](https://tower.nf/launch?pipeline=https://github.com/sanger-tol/genomenote)
@@ -13,7 +13,7 @@

 ## Introduction

-**sanger-tol/genomenote** is a bioinformatics pipeline that takes aligned HiC reads, creates contact maps and chromosomal grid using Cooler, and display on a [HiGlass server](https://genome-note-higlass.tol.sanger.ac.uk/app). The pipeline also collates (1) assembly information, statistics and chromosome details from NCBI datasets, (2) genome completeness from BUSCO, (3) consensus quality and k-mer completeness from MerquryFK, and (4) HiC primary mapped percentage from samtools flagstat.
+**sanger-tol/genomenote** is a bioinformatics pipeline that takes aligned HiC reads, creates contact maps and a chromosomal grid using Cooler, and displays them on a [HiGlass server](https://genome-note-higlass.tol.sanger.ac.uk/app). The pipeline also collates (1) assembly information, statistics and chromosome details from NCBI datasets, (2) genome completeness from BUSCO, (3) consensus quality and k-mer completeness from MerquryFK, (4) HiC primary mapped percentage from samtools flagstat and, optionally, (5) annotation statistics from AGAT and BUSCO. The pipeline combines the calculated statistics and collated assembly metadata with a template document to output a genome note document.

 <!--![sanger-tol/genomenote workflow](https://raw.githubusercontent.com/sanger-tol/genomenote/main/docs/images/sanger-tol-genomenote_workflow.png)-->

@@ -25,7 +25,9 @@
 6. Genome completeness ([`NCBI API`](https://www.ncbi.nlm.nih.gov/datasets/docs/v1/reference-docs/rest-api/), [`BUSCO`](https://busco.ezlab.org))
 7. Consensus quality and k-mer completeness ([`FASTK`](https://github.com/thegenemyers/FASTK), [`MERQURY.FK`](https://github.com/thegenemyers/MERQURY.FK))
 8. Collated summary table ([`createtable`](bin/create_table.py))
-9. Present results and visualisations ([`MultiQC`](http://multiqc.info/), [`R`](https://www.r-project.org/))
+9. Optionally calculates annotation statistics and completeness ([`AGAT`](https://github.com/NBISweden/AGAT), [`BUSCO`](https://busco.ezlab.org))
+10. Combines calculated statistics and assembly metadata with a template file to produce a genome note document.
+11. Present results and visualisations ([`MultiQC`](http://multiqc.info/), [`R`](https://www.r-project.org/))

 ## Usage

Binary file modified assets/genome_note_template.docx
Binary file not shown.
2 changes: 2 additions & 0 deletions bin/combine_parsed_data.py
@@ -21,6 +21,7 @@
     ("COPO_BIOSAMPLE_HIC", "copo_biosample_hic_file"),
     ("COPO_BIOSAMPLE_RNA", "copo_biosample_rna_file"),
     ("GBIF_TAXONOMY", "gbif_taxonomy_file"),
+    ("ENSEMBL_ANNOTATION", "ensembl_annotation_file"),
 ]


@@ -42,6 +43,7 @@ def parse_args(args=None):
     parser.add_argument("--copo_biosample_hic_file", help="Input parsed COPO HiC biosample file.", required=False)
     parser.add_argument("--copo_biosample_rna_file", help="Input parsed COPO RNASeq biosample file.", required=False)
     parser.add_argument("--gbif_taxonomy_file", help="Input parsed GBIF taxonomy file.", required=False)
+    parser.add_argument("--ensembl_annotation_file", help="Input parsed Ensembl annotation file.", required=False)
     parser.add_argument("--out_consistent", help="Output file.", required=True)
     parser.add_argument("--out_inconsistent", help="Output file.", required=True)
     parser.add_argument("--version", action="version", version="%(prog)s 1.0")

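The hunk above registers the new Ensembl input in two places: the `(label, attribute)` table and an optional `argparse` flag. A minimal sketch of how such a table can drive processing follows. The shortened `FILES` table and the helpers `parse_args` and `supplied_inputs` are hypothetical, and whether the real script skips unset inputs in this way is not visible in the hunk.

```python
import argparse

# (label, argparse attribute) pairs. The names follow the style used in
# combine_parsed_data.py, but this shortened table is hypothetical.
FILES = [
    ("GBIF_TAXONOMY", "gbif_taxonomy_file"),
    ("ENSEMBL_ANNOTATION", "ensembl_annotation_file"),
]


def parse_args(args=None):
    parser = argparse.ArgumentParser(description="Illustrative parser only.")
    for _label, attr in FILES:
        # Every input is optional, so a missing upstream result is simply None.
        parser.add_argument(f"--{attr}", required=False)
    return parser.parse_args(args)


def supplied_inputs(namespace):
    """Return (label, path) pairs for the inputs that were actually given."""
    return [
        (label, getattr(namespace, attr))
        for label, attr in FILES
        if getattr(namespace, attr) is not None
    ]


if __name__ == "__main__":
    ns = parse_args(["--ensembl_annotation_file", "ensembl.csv"])
    print(supplied_inputs(ns))
```

Keeping the label and the attribute name in one table means a new input needs only one extra tuple and one extra flag, which is the shape of the two-line change in this file.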
18 changes: 14 additions & 4 deletions bin/combine_statistics_data.py
@@ -8,7 +8,8 @@

 files = [
     ("CONSISTENT", "in_consistent"),
-    ("STATISITCS", "in_statistics"),
+    ("GENOME_STATISTICS", "in_genome_statistics"),
+    ("ANNOTATION_STATISITCS", "in_annotation_statistics"),
 ]


@@ -19,7 +20,13 @@ def parse_args(args=None):
     parser = argparse.ArgumentParser(description=Description, epilog=Epilog)
     parser.add_argument("--in_consistent", help="Input consistent params file.", required=True)
     parser.add_argument("--in_inconsistent", help="Input consistent params file.", required=True)
-    parser.add_argument("--in_statistics", help="Input parsed genome statistics params file.", required=True)
+    parser.add_argument("--in_genome_statistics", help="Input parsed genome statistics params file.", required=True)
+    parser.add_argument(
+        "--in_annotation_statistics",
+        help="Input parsed annotation statistics params file.",
+        required=False,
+        default=None,
+    )
     parser.add_argument("--out_consistent", help="Output file.", required=True)
     parser.add_argument("--out_inconsistent", help="Output file.", required=True)
     parser.add_argument("--version", action="version", version="%(prog)s 1.0")
@@ -36,7 +43,7 @@ def process_file(file_in, file_type, params, param_sets):
         reader = csv.reader(infile)

         for row in reader:
-            if row[0] == "#paramName":
+            if row[0].startswith("#"):
                 continue

             key = row.pop(0)
@@ -95,7 +102,10 @@ def main(args=None):
     params_inconsistent = {}

     for file in files:
-        (params, param_sets) = process_file(getattr(args, file[1]), file[0], params, param_sets)
+        if file[0] == "ANNOTATION_STATISITCS" and args.in_annotation_statistics == None:
+            continue
+        else:
+            (params, param_sets) = process_file(getattr(args, file[1]), file[0], params, param_sets)

     for key in params.keys():
         value_set = {v for v in params[key]}

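The switch from `row[0] == "#paramName"` to `row[0].startswith("#")` matters because the new annotation statistics file begins with `# KEY: description` lines ahead of the `#paramName` header; under the old check those description lines would have been read as parameters. The sketch below illustrates the skip rule on hypothetical content. The `read_params` helper and the sample values are assumptions, and the blank-line guard is an addition that the hunk itself does not contain.

```python
import csv
import io


def read_params(handle):
    """Collect key -> [values] from a params CSV, skipping '#' lines."""
    params = {}
    for row in csv.reader(handle):
        # csv.reader yields [] for blank lines, so guard before indexing.
        if not row or row[0].startswith("#"):
            continue
        key = row.pop(0)
        params.setdefault(key, []).extend(row)
    return params


# Hypothetical content in the layout that write_to_csv produces:
# description comments first, then the header row, then the data rows.
SAMPLE = (
    "# PCG: The number of protein coding genes\n"
    "# NCG: The number of non-coding genes\n"
    "#paramName,paramValue\n"
    "PCG,20000\n"
    "NCG,1500\n"
)

if __name__ == "__main__":
    print(read_params(io.StringIO(SAMPLE)))
```

A description line without commas reaches the reader as a single-cell row whose first cell starts with `#`, so one prefix test covers both the descriptions and the header row.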
154 changes: 154 additions & 0 deletions bin/extract_annotation_statistics_info.py
@@ -0,0 +1,154 @@
+#!/usr/bin/env python3
+import re
+import csv
+import sys
+import argparse
+import json
+
+
+# Extract CDS information from mrna and transcript sections
+def extract_cds_info(file):
+    # Define regex patterns for different statistics
+    patterns = {
+        "TRANSC_MRNA": re.compile(r"Number of mrna\s+(\d+)"),
+        "PCG": re.compile(r"Number of gene\s+(\d+)"),
+        "CDS_PER_GENE": re.compile(r"mean mrnas per gene\s+([\d.]+)"),
+        "EXONS_PER_TRANSC": re.compile(r"mean exons per mrna\s+([\d.]+)"),
+        "CDS_LENGTH": re.compile(r"mean mrna length \(bp\)\s+([\d.]+)"),
+        "EXON_SIZE": re.compile(r"mean exon length \(bp\)\s+([\d.]+)"),
+        "INTRON_SIZE": re.compile(r"mean intron in cds length \(bp\)\s+([\d.]+)"),
+    }
+
+    # Initialize a dictionary to store content for different sections
+    section_content = {"mrna": "", "transcript": ""}
+
+    # Variable to keep track of the current section being processed
+    current_section = None
+
+    with open(file, "r") as f:
+        lines = f.read().splitlines()  # read all lines in the file
+
+    for line in lines:
+        line = line.strip()  # Remove any leading/trailing whitespace including newline characters
+
+        if "---------------------------------- mrna ----------------------------------" in line:
+            current_section = "mrna"  # Switch to 'mrna' section
+        elif "---------------------------------- transcript ----------------------------------" in line:
+            current_section = "transcript"  # Switch to 'transcript' section
+        elif "----------" in line:
+            current_section = None  # End of current section
+        elif current_section:
+            section_content[current_section] += (
+                line + " "
+            )  # Accumulate content for the current section, separate lines by a space
+
+    cds_info = {}
+
+    for label, pattern in patterns.items():
+        text_to_search = section_content["mrna"] if label != "EXONS_PER_TRANSC" else section_content["transcript"]
+        match = re.search(pattern, text_to_search)
+        if match:
+            cds_info[label] = match.group(1)
+
+    return cds_info
+
+
+# Function to extract the number of non-coding genes from the second file
+def extract_non_coding_genes(file):
+    non_coding_genes = {"ncrna_gene": 0}
+
+    with open(file, "r") as f:
+        for line in f:
+            parts = line.split()
+            if len(parts) < 2:
+                continue
+
+            gene_type = parts[0]
+            try:
+                count = int(parts[1])
+            except ValueError:
+                continue
+
+            if gene_type in non_coding_genes:
+                non_coding_genes[gene_type] += count
+
+    NCG = sum(non_coding_genes.values())
+    return {"NCG": NCG}
+
+
+# Extract the one_line_summary from a BUSCO JSON file
+def extract_busco_results(busco_stats_file):
+    try:
+        with open(busco_stats_file, "r") as file:
+            busco_data = json.load(file)
+            # Extract the one_line_summary from the results section
+            one_line_summary = busco_data.get("results", {}).get("one_line_summary")
+            if one_line_summary:
+                # Use regex to extract everything after the first colon
+                match = re.search(r':\s*"(.*)"', one_line_summary)
+                if match:
+                    one_line_summary = match.group(1)  # Get text after the colon
+            return {"BUSCO_PROTEIN_SCORES": one_line_summary} if one_line_summary else {}
+    except (json.JSONDecodeError, FileNotFoundError) as e:
+        print(f"Error loading BUSCO JSON file: {e}")
+        return {}
+
+
+# Function to write the extracted data to a CSV file
+def write_to_csv(data, output_file, busco_stats_file):
+    busco_results = extract_busco_results(busco_stats_file)
+
+    descriptions = {
+        "TRANSC_MRNA": "The number of transcribed mRNAs",
+        "PCG": "The number of protein coding genes",
+        "NCG": "The number of non-coding genes",
+        "CDS_PER_GENE": "The average number of coding transcripts per gene",
+        "EXONS_PER_TRANSC": "The average number of exons per transcript",
+        "CDS_LENGTH": "The average length of coding sequence",
+        "EXON_SIZE": "The average length of a coding exon",
+        "INTRON_SIZE": "The average length of coding intron size",
+        "BUSCO_PROTEIN_SCORES": "BUSCO results summary from running BUSCO in protein mode",
+    }
+
+    with open(output_file, "w", newline="") as csvfile:
+        writer = csv.writer(csvfile)
+
+        # Write descriptions at the top of the CSV file
+        for key, description in descriptions.items():
+            csvfile.write(f"# {key}: {description}\n")
+
+        # Write the Variable and Value columns header
+        writer.writerow(["#paramName", "paramValue"])
+
+        # Write the data
+        for key, value in data.items():
+            writer.writerow([key, value])
+
+        # Add the BUSCO results summary
+        for key, value in busco_results.items():
+            writer.writerow([key, value])
+
+
+# Main function to take input files and output file as arguments
+def main():
+    Description = "Parse contents of the agat_spstatistics, buscoproteins and agat_sqstatbasic to extract relevant annotation statistics information."
+    Epilog = (
+        "Example usage: python extract_annotation_statistics_info.py <basic_stats> <other_stats> <busco_stats> <output>"
+    )
+
+    parser = argparse.ArgumentParser(description=Description, epilog=Epilog)
+    parser.add_argument("basic_stats", help="Input txt file with basic_feature_statistics.")
+    parser.add_argument("other_stats", help="Input txt file with other_feature_statistics.")
+    parser.add_argument("busco_stats", help="Input JSON file for the BUSCO statistics.")
+    parser.add_argument("output", help="Output file.")
+    parser.add_argument("--version", action="version", version="%(prog)s 1.0")
+    args = parser.parse_args()
+
+    cds_info = extract_cds_info(args.other_stats)
+    non_coding_genes = extract_non_coding_genes(args.basic_stats)
+    data = {**cds_info, **non_coding_genes}
+    write_to_csv(data, args.output, args.busco_stats)
+
+
+if __name__ == "__main__":
+    sys.exit(main())
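
The script above relies on two parsing ideas: restricting each regex to the text under one banner-delimited section, and reading `results.one_line_summary` from the BUSCO JSON. The sketch below shows both on synthetic inputs. It is not the script's own code: `REPORT` only imitates the banner layout and is not real AGAT output, `BUSCO_JSON` is a hypothetical payload whose summary shape is an assumption, and the banner is matched here with a regex rather than the fixed-width strings the script uses.

```python
import json
import re

# Synthetic text that only imitates the banner-delimited layout the script
# expects; it is not real AGAT output.
REPORT = """\
---------- mrna ----------
Number of gene                 100
Number of mrna                 150
mean mrnas per gene            1.5
---------- transcript ----------
mean exons per mrna            6.2
---------- end ----------
"""

# Hypothetical BUSCO-style payload; the summary shape is an assumption.
BUSCO_JSON = json.dumps(
    {"results": {"one_line_summary": "C:98.0%[S:97.0%,D:1.0%],F:1.0%,M:1.0%,n:255"}}
)

# Swapped-in technique: a regex banner match instead of fixed banner strings.
BANNER = re.compile(r"^-{5,}\s*(\w+)\s*-{5,}$")


def split_sections(text):
    """Map each banner name to the space-joined lines that follow it."""
    sections, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        match = BANNER.match(line)
        if match:
            current = match.group(1)
            sections[current] = ""
        elif current and line:
            sections[current] += line + " "
    return sections


def search_section(sections, name, pattern):
    """Return the first capture group of pattern within one section, or None."""
    found = re.search(pattern, sections.get(name, ""))
    return found.group(1) if found else None


def busco_summary(json_text):
    """Pull results.one_line_summary out of a BUSCO-style JSON document."""
    return json.loads(json_text).get("results", {}).get("one_line_summary")


if __name__ == "__main__":
    sections = split_sections(REPORT)
    print(search_section(sections, "mrna", r"Number of gene\s+(\d+)"))
    print(search_section(sections, "transcript", r"mean exons per mrna\s+([\d.]+)"))
    print(busco_summary(BUSCO_JSON))
```

Scoping the search to one section is what lets the same label, such as "mean exons per mrna", be read from the transcript block without picking up a same-named line elsewhere. If the summary string has the shape assumed here, the pattern `:\s*"(.*)"` in `extract_busco_results` finds no match, because the parsed JSON value contains no double quotes, so the full summary is passed through unchanged.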