Generate gene reference files #47

Merged: 9 commits merged on May 9, 2024

@j23414 j23414 commented May 8, 2024

Description of proposed changes

In order to support gene phylogenetic trees (e.g. E gene trees), add rules to automatically generate gene reference GenBank and FASTA files (e.g. reference_denv4_E.gb and reference_denv4_E.fasta) by following the rules used in RSV.

This is part of a larger, older issue about creating E gene builds, which is being split into smaller PRs to keep QC and review scope manageable. This PR does not generate an E gene phylogenetic tree; subsequent PRs will build on it to generate the trees.

Visual summary (view whole pipeline plan so far)

Related issue(s)

Checklist

nextstrain build phylogenetic results/config/reference_all_E.gb results/config/reference_all_E.fasta
nextstrain build phylogenetic results/config/reference_denv1_E.gb results/config/reference_denv1_E.fasta
nextstrain build phylogenetic results/config/reference_denv2_E.gb results/config/reference_denv2_E.fasta
nextstrain build phylogenetic results/config/reference_denv3_E.gb results/config/reference_denv3_E.fasta
nextstrain build phylogenetic results/config/reference_denv4_E.gb results/config/reference_denv4_E.fasta
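The five checklist commands above share one naming pattern. As a small sketch (the helper names `reference_targets` and `build_command` are hypothetical and not part of the workflow), the targets can be enumerated like this:

```python
# Serotype labels and gene name are taken from the checklist commands above.
SEROTYPES = ["all", "denv1", "denv2", "denv3", "denv4"]


def reference_targets(serotype, gene="E", config_dir="results/config"):
    """Return the GenBank and FASTA target paths for one serotype/gene pair."""
    stem = f"{config_dir}/reference_{serotype}_{gene}"
    return [f"{stem}.gb", f"{stem}.fasta"]


def build_command(serotype, gene="E"):
    """Assemble the command line that requests both reference files."""
    return " ".join(
        ["nextstrain", "build", "phylogenetic", *reference_targets(serotype, gene)]
    )


if __name__ == "__main__":
    for serotype in SEROTYPES:
        print(build_command(serotype))
```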
Example shortened reference_denv2_E.gb
LOCUS       DENV2/THAILAND/REFERENCE/1964 1485 bp    DNA              UNK 01-JAN-1980
DEFINITION  Dengue virus 2, complete genome.
ACCESSION   NC_001474
VERSION     NC_001474.2
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
     CDS             1..1485
                     /gene="E"
                     /db_xref="VBRC:35921"
                     /product="envelope protein E"
                     /protein_id="NP_739583.2"
     source          1..1485
                     /collection_date="1964"
                     /country="Thailand"
                     /db_xref="taxon:11060"
                     /mol_type="genomic RNA"
                     /organism="Dengue virus 2"
                     /strain="16681"
ORIGIN
        1 atgcgttgca taggaatgtc aaatagagac tttgtggaag gggtttcagg aggaagctgg
       61 gttgacatag tcttagaaca tggaagctgt gtgacgacga tggcaaaaaa caaaccaaca
      121 ttggattttg aactgataaa aacagaagcc aaacagcctg ccaccctaag gaagtactgt
      ...
     1381 gtcattatca catggatagg aatgaattca cgcagcacct cactgtctgt gacactagta
     1441 ttggtgggaa ttgtgacact gtatttggga gtcatggtgc aggcc
//
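
To illustrate how a gene-level reference like the one above could be cut out of a genome record, here is a minimal standard-library sketch. It is not the PR's `newreference.py`: the function names, the toy record, and its coordinates are all hypothetical, and it only handles plain `start..end` CDS locations. It raises an error when the requested gene is missing instead of only warning.

```python
import re

# Toy record (made up for illustration): two CDS features over a 30 nt sequence.
TOY_GENBANK = """\
FEATURES             Location/Qualifiers
     CDS             5..13
                     /gene="C"
     CDS             14..25
                     /gene="E"
ORIGIN
        1 aaaacccggg tttatgcgtt gcataggaaa
//
"""


def find_gene_location(genbank_text, gene):
    """Return the 1-based inclusive (start, end) of the CDS annotated with `gene`."""
    location = None
    for line in genbank_text.splitlines():
        cds = re.match(r"^\s{5}CDS\s+(\d+)\.\.(\d+)\s*$", line)
        if cds:
            location = (int(cds.group(1)), int(cds.group(2)))
            continue
        if re.match(r"^\s{5}\S", line):
            # Any other feature key ends the current CDS block.
            location = None
            continue
        qualifier = re.match(r'^\s+/gene="([^"]+)"\s*$', line)
        if qualifier and location and qualifier.group(1) == gene:
            return location
    raise ValueError(f"gene {gene!r} not found in GenBank record")


def read_origin(genbank_text):
    """Return the sequence in the ORIGIN block, without numbering or spaces."""
    block = re.search(r"^ORIGIN[^\n]*\n(.*?)^//", genbank_text, flags=re.S | re.M)
    if block is None:
        raise ValueError("no ORIGIN block found")
    return re.sub(r"[^A-Za-z]", "", block.group(1)).lower()


def extract_gene(genbank_text, gene):
    """Slice the gene's nucleotides out of the record's sequence."""
    start, end = find_gene_location(genbank_text, gene)
    return read_origin(genbank_text)[start - 1:end]


def to_fasta(name, sequence):
    """Format a single-record FASTA string."""
    return f">{name}\n{sequence}\n"


if __name__ == "__main__":
    print(to_fasta("toy_E", extract_gene(TOY_GENBANK, "E")), end="")
```

In a gene-level GenBank file the CDS coordinates are also rebased to start at 1, as in the `1..1485` location shown in the example record; this sketch leaves that step out.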

@j23414 j23414 force-pushed the generate-gene-reference-files branch from 58099ae to 4102012 on May 8, 2024 16:30
@j23414 j23414 requested a review from a team May 8, 2024 18:01
Review thread on phylogenetic/bin/newreference.py (outdated, resolved)
Review thread on phylogenetic/config/reference_denv2_genome.gb (outdated, resolved)
Review thread on phylogenetic/rules/prepare_sequences.smk (outdated, resolved)
j23414 added a commit to nextstrain/rsv that referenced this pull request May 8, 2024
This is a fixup to an earlier commit:

8cd6a13

This updates the docs to reflect that the script will NOT just throw a warning, but actually error out
if the gene is not found in the GenBank file. This was flagged by comment:

nextstrain/dengue#47 (comment)
@j23414 j23414 (Contributor, Author) commented May 8, 2024

I was wondering why the CI was taking so long, then remembered that the example files get connected to "phylogenetic/data".

https://github.com/nextstrain/.github/blob/4f41fa6db826dff3f1eb09f8d2e0a1512c9e358d/.github/workflows/pathogen-repo-ci.yaml#L236-L237

Fixed with: 30b1d5a
CI seems much faster

j23414 and others added 9 commits May 8, 2024 16:19
Adds some wildcard constraints on serotype-gene combinations to avoid
unchecked wildcard matching, such as having {serotype}.fasta match both
"denv1_E.fasta" and "denv1.fasta".
This is in preparation for having separate genome and gene (e.g. E, NS1) reference files.
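
The wildcard ambiguity mentioned in the commit messages above can be reproduced with plain regular expressions. Snakemake treats an unconstrained wildcard as the regex `.+`, so once genome and gene files exist side by side, `{serotype}.fasta` matches both names. The sketch below is an illustration only: the helper `wildcard_pattern` and the constraint values are assumptions, not the rules added in this PR.

```python
import re
from typing import Dict, Optional


def wildcard_pattern(template: str, constraints: Optional[Dict[str, str]] = None):
    """Compile a "{name}" template into a regex, defaulting each wildcard to ".+"."""
    constraints = constraints or {}
    # re.split with a capturing group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", template)
    regex = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex += re.escape(part)
        else:
            regex += f"(?P<{part}>{constraints.get(part, '.+')})"
    return re.compile(regex)


# Hypothetical constraints; the actual values in the workflow may differ.
CONSTRAINTS = {"serotype": "all|denv[1-4]", "gene": "genome|E"}

unconstrained = wildcard_pattern("{serotype}.fasta")
constrained = wildcard_pattern("{serotype}.fasta", CONSTRAINTS)
with_gene = wildcard_pattern("{serotype}_{gene}.fasta", CONSTRAINTS)

if __name__ == "__main__":
    for name in ["denv1.fasta", "denv1_E.fasta"]:
        print(
            name,
            bool(unconstrained.fullmatch(name)),
            bool(constrained.fullmatch(name)),
            bool(with_gene.fullmatch(name)),
        )
```

With the constraints in place, each file name is claimed by exactly one pattern, which is the property the constraints are meant to guarantee.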
This is in preparation for nesting each gene's specific files in
subdirectories (e.g. `results/E/tree.nwk`) as suggested in comment:

* nextstrain/private#102 (comment)
In prep of building "genome" and "E" intermediate and final files for the
phylogenetic pipeline.
Move gene annotation to top of CDS to match other genbank files (denv1,3,4)
This generates the reference_serotype_gene.gb and reference_serotype_gene.fasta
files for each serotype.

These files can subsequently be used in augur align, augur translate, and
optionally nextclade align when building the gene trees.
@j23414 j23414 force-pushed the generate-gene-reference-files branch from 30b1d5a to f5b7bf6 on May 8, 2024 23:33
@j23414 j23414 merged commit e720a96 into main May 9, 2024
41 checks passed
@j23414 j23414 deleted the generate-gene-reference-files branch May 9, 2024 22:56
This was referenced May 23, 2024
j23414 added a commit to nextstrain/rsv that referenced this pull request Jun 5, 2024