-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ancestral / translate improvements & tests #1348
Conversation
9566edd
to
a529731
Compare
2c201bf
to
357345f
Compare
a529731
to
5b0545f
Compare
357345f
to
84343d8
Compare
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## master #1348 +/- ##
==========================================
+ Coverage 65.94% 66.04% +0.10%
==========================================
Files 68 68
Lines 7156 7172 +16
Branches 1758 1761 +3
==========================================
+ Hits 4719 4737 +18
+ Misses 2173 2171 -2
Partials 264 264 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Skimmed through this quickly - seems good! Just a couple small things.
- Ambiguous nucleotides (there are 3 Ns in sample_C) are inferred (default) | ||
|
||
$ ${AUGUR} ancestral \ | ||
> --tree $TESTDIR/../data/simple-genome/tree.nwk \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(non-blocking, more about consistency)
> --tree $TESTDIR/../data/simple-genome/tree.nwk \ | |
> --tree "$TESTDIR"/../data/simple-genome/tree.nwk \ |
This isn't a dealbreaker because $TESTDIR
is known to not have spaces, but it's used across other tests probably just as good practice. Recommendation applies generally to variables in this test file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah good shout - I think I've fixed these across all the modified tests (well, once the force push comes through I will have)
For my understanding, if we have something like DATA="$TESTDIR/../data"
- do we then need to quote every string which includes $DATA
? For instance, must --tree "${DATA}/tb/tree.nwk"
be quoted like so, or is --tree ${DATA}/tb/tree.nwk
going to be ok? (Assuming of course that $DATA
has a space in it.)
For instance, across the board we do
export AUGUR="${AUGUR:-$TESTDIR/../../../../bin/augur}" # appropriate quoting
$ {AUGUR} translate ... # no quoting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the DATA
example, yes that requires quoting. Here's an example: 4281ed8
For $AUGUR
, if I were writing all the tests from scratch, I would add quotes. I think it's easier to always quote as a rule of thumb, even if the value is known to not have spaces, just so I (and readers) don't have to think about whether it might contain spaces now or in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following up for anyone revisiting this issue
We cannot use quoting for $AUGUR
because our CI does not want it treated as a single word:
augur/.github/workflows/ci.yaml
Lines 67 to 68 in 9cd915a
env: | |
AUGUR: coverage run -a ${{ github.workspace }}/bin/augur |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh that's a good point, thanks for noting! Unquoted should be fine as long as we don't use spaces in our test paths.
While we had tests that (probably?) covered the code here, using a simple genome is much easier to reason with and to validate every mutation. In my opinion it's also important to see tests using "big" datasets (e.g. those bigger than we can manually verify each and every mutation) as not necessarily testing the accuracy of the command, but rather testing for regressions / changes in behaviour. VCF-input tests using the same data will be added in a future commit because (spoiler!) there are bugs which need fixing.
The move to "Creating files in the initial working directory" is motivated by <#1344 (comment)> and <#1176>. Additionally, I remove the pushd commands which were confusing (there were multiple!) and use variables to refer to common directories to improve readability.
Uses the same approach as an earlier commit introduced for `augur ancestral`, but this time for `augur translate`. Tests the `--genes` argument as well for good measure. The exact format of GFF/GenBank files here is extremely important, but future commits will hopefully relax this.
This helps readability of the --help content. We enforce this behaviour by adding a check in the code following the example in `augur ancestral`
Simplifies the reading of the main function, and follows the established pattern in this file of using a pair of functions (one for VCF, one for JSON/FASTA). Added docstrings. The only code change is to throw an AugurError rather than `return 1`, but the result is unchanged.
There should be no functional changes. Co-locating the sequence reading & feature extraction is easier to read, and were Python's scoping to be different it'd be even nicer as we wouldn't leave around variables which are never re-used. As part of this `translate_vcf_feature` has changed from using an undocumented `return None` to raising a (documented) error which (IMO) is easier to reason with.
84343d8
to
03c1965
Compare
Thanks for the review @victorlin Merging despite CI failure, as that's unrelated to this PR - it's failing while running the
|
Unfortunately a few commits were squashed together in this PR. It's now been merged so I'm not going to change anything. 03c1965 contains what was meant to be three commits - a small refactor + the following commits
diff --git a/CHANGES.md b/CHANGES.md
index 64f40df5..1cfe73a2 100644
--- a/CHANGES.md
+++ b/CHANGES.md
@@ -8,7 +8,7 @@
* ancestral: Improvements to command line arguments. [#1344][] (@jameshadfield)
* Incompatible arguments are now checked, especially related to VCF vs FASTA inputs.
* `--vcf-reference` and `--root-sequence` are now mutually exclusive.
-
+* translate: Tree nodes are checked against the node-data JSON input to ensure sequences are present. [#1348][] (@jameshadfield)
* translate: Improvements to command line arguments. [#1348][] (@jameshadfield)
* `--tree` and `--ancestral-sequences` are now required arguments.
* separate VCF-only arguments into their own group
diff --git a/augur/translate.py b/augur/translate.py
index 461738b2..1fc73040 100644
--- a/augur/translate.py
+++ b/augur/translate.py
@@ -20,6 +20,7 @@ from .io.vcf import write_VCF_translation
from .utils import read_node_data, load_features, write_json, get_json_name
from treetime.vcf_utils import read_vcf
from augur.errors import AugurError
+from textwrap import dedent
class MissingNodeError(Exception):
pass
@@ -339,7 +340,7 @@ def sequences_json(node_data_json, tree):
Extract the full nuc sequence for each node in the provided node-data JSON.
Returns a dict, keys are node names and values are a string of the genome sequence (nuc)
"""
- node_data = read_node_data(node_data_json, tree)
+ node_data = read_node_data(node_data_json)
if node_data is None:
raise AugurError("could not read node data (incl sequences)")
# extract sequences from node meta data
@@ -347,6 +348,14 @@ def sequences_json(node_data_json, tree):
for k,v in node_data['nodes'].items():
if 'sequence' in v:
sequences[k] = v['sequence']
+ tree_nodes = {c.name for c in tree.find_clades()}
+ tree_nodes_missing_sequences = tree_nodes - set(sequences.keys())
+ if len(tree_nodes_missing_sequences):
+ raise AugurError(dedent(f"""\
+ {len(tree_nodes_missing_sequences)} nodes on the tree are missing nucleotide sequences in the node-data JSON.
+ These must be present under 'nodes' → <node_name> → 'sequence'.
+ This error may originate from using 'augur ancestral' with VCF input; in this case try using VCF output from that command here.
+ """))
return sequences
def register_parser(parent_subparsers):
@@ -412,7 +421,7 @@ def run(args):
if len(features_without_variation):
print("{} genes had no mutations and so have been be excluded.".format(len(features_without_variation)))
else:
- sequences = sequences_json(args.ancestral_sequences, args.tree)
+ sequences = sequences_json(args.ancestral_sequences, tree)
translations = {fname: translate_feature(sequences, feat) for fname, feat in features.items() if feat.type != 'source'}
## glob the annotations for later auspice export
diff --git a/CHANGES.md b/CHANGES.md
index 1cfe73a2..54b14f8d 100644
--- a/CHANGES.md
+++ b/CHANGES.md
@@ -9,6 +9,7 @@
* Incompatible arguments are now checked, especially related to VCF vs FASTA inputs.
* `--vcf-reference` and `--root-sequence` are now mutually exclusive.
* translate: Tree nodes are checked against the node-data JSON input to ensure sequences are present. [#1348][] (@jameshadfield)
+* translate: The 'source' ID for GFF files is now ignored as a potential gene feature. [#1348][] (@jameshadfield)
* translate: Improvements to command line arguments. [#1348][] (@jameshadfield)
* `--tree` and `--ancestral-sequences` are now required arguments.
* separate VCF-only arguments into their own group
diff --git a/augur/translate.py b/augur/translate.py
index 1fc73040..8c7253c3 100644
--- a/augur/translate.py
+++ b/augur/translate.py
@@ -129,7 +129,7 @@ def translate_feature(aln, feature):
return translations
-def translate_vcf_feature(sequences, ref, feature):
+def translate_vcf_feature(sequences, ref, feature, feature_name):
'''Translates a subsequence of input nucleotide sequences.
Parameters
@@ -168,7 +168,7 @@ def translate_vcf_feature(sequences, ref, feature):
# Need to get ref translation to store. check if multiple of 3 for sanity.
# will be padded in safe_translate if not
if len(refNuc)%3:
- print("Gene length of {} is not a multiple of 3. will pad with N".format(feature.qualifiers['Name'][0]), file=sys.stderr)
+ print(f"Gene length of {feature_name!r} is not a multiple of 3. will pad with N", file=sys.stderr)
ref_aa_seq = safe_translate(refNuc)
prot['reference'] = ref_aa_seq
@@ -409,13 +409,16 @@ def run(args):
print("Read in {} features from reference sequence file".format(len(features)))
## Read in sequences & for each sequence translate each feature _except for_ the source (nuc) feature
+ ## Note that `load_features` _only_ extracts {'gene', 'source'} for GFF files, {'CDS', 'source'} for GenBank.
translations = {}
if is_vcf:
(sequences, ref) = sequences_vcf(args.vcf_reference, args.ancestral_sequences)
features_without_variation = []
for fname, feat in features.items():
+ if feat.type=='source':
+ continue
try:
- translations[fname] = translate_vcf_feature(sequences, ref, feat)
+ translations[fname] = translate_vcf_feature(sequences, ref, feat, fname)
except NoVariationError:
features_without_variation.append(fname)
if len(features_without_variation):
diff --git a/tests/functional/translate/cram/translate-with-gff-and-locus-tag.t b/tests/functional/translate/cram/translate-with-gff-and-locus-tag.t
index f0d25bee..1de2856e 100644
--- a/tests/functional/translate/cram/translate-with-gff-and-locus-tag.t
+++ b/tests/functional/translate/cram/translate-with-gff-and-locus-tag.t
@@ -15,7 +15,7 @@ Translate amino acids for genes using a GFF3 file where the gene names are store
> --output-node-data aa_muts.json \
> --alignment-output translations.vcf \
> --vcf-reference-output translations_reference.fasta
- Gene length of rrs_Rvnr01 is not a multiple of 3. will pad with N
+ Gene length of 'rrs' is not a multiple of 3. will pad with N
Read in 187 specified genes to translate.
Read in 187 features from reference sequence file
162 genes had no mutations and so have been be excluded.
|
@jameshadfield thanks for writing that! I've copied it into a comment on 03c1965. |
Description of proposed changes
A number of small commits which both leave the code in a better place and set the stage for a number of forthcoming PRs which will solve (bigger) bugs. Tests using a small, hand-crafted genome have been added for both ancestral & translate. Command arguments have been improved in both the
--help
messages and the code which checks for incompatible arg combinations. Additionally two small bugs have been fixed - #1345 and #591. See commit messages for more.Re: added tests, I have a summary PDF of the genome I created which I can add to this PR if that would be helpful?
Related issue(s)
#1345
#591
Checklist