-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allows for CDS (as well as gene) features to generate a new gene reference #55
Conversation
…rence The newreference.py script processes a GenBank file, producing new gene reference files (GenBank and FASTA) which are required for creating the Nextstrain gene tree views. As we intend to use this script as a template for other viruses (e.g. dengue, measles), it was necessary to allow the use of CDS features in the GenBank file, not just the gene.
for generating whole genome references, this script might also be useful: We use this to bootstrap nextclade datasets |
for feature in ref.features: | ||
if feature.type == 'source': | ||
ref_source_feature = feature | ||
if feature.type =='gene': | ||
if feature.type =='gene' or feature.type == 'CDS': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we'd use a centralised GenBank parser, such as augur.utils.load_features
. It seem like our parser only considers CDS feature types at the moment, but that could be easily extended.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohh, I'm open to using augur.utils.load_features
. Ah, is there an example script that can be easily referenced?
For extending to "gene feature types" is this basically adding a duplicate code block like the following?
if feat.type=='gene':
# ... copy internal code from "if feat.type=='CDS'"
I assume if both a "gene" and a "CDS" feature are named "E", one won't overwrite the nucleotide coordinates of the other (although hopefully they match)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for nextstrain/augur#1435! Upon review, and in order to not hold up progress here (and measles), I'd suggest using the code as it stands and then we can move to using a util function once we're happy with the changes there.
822d2d6
to
3400636
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, left a single non-blocking code style suggestion.
Currently, if the --gene option is used and the gene name is not found, the script will use the entire genome which may cause some silent undesired behaviors. This commit changes that such that the script will error out if the gene is not found in the GenBank file as this indicates the gene name may be misspelled or the user may be using the wrong GenBank file. If the --gene option is not used, the script will continue to process the entire genome as expected. Co-authored-by: Jover Lee <[email protected]>
95b7446
to
8cd6a13
Compare
Just tagging that this caused some downstream bugs where "CDS" and "gene" coordinates did not match, subsequently fixed by #59 (comment) In the future we should pick a preference for "gene" or "CDS" coordinates and update scripts accordingly. |
Agree, definitely better to make this choice explicit. |
Description of proposed changes
It seems that we intend to use the
newreference.py
script as a template for creating gene trees for other viruses (e.g. dengue, measles). Thenewreference.py
script processes a GenBank file, producing new gene reference files (GenBank and FASTA) which are required for creating the Nextstrain gene tree views.When using the
scripts/newreference.py
to generate an E gene reference for Dengue:Encountered error
The underlying issue was that the script is looking for a "gene" feature" and was ignoring "CDS" features and this PR extends the functionality to CDS.
Looking at the measles genbank reference, it will also require the CDS functionality to pull out the N gene.
Related issue(s)
Checklist