-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: explicitly specify how the root and ambiguous states are handled… #1690
base: master
Are you sure you want to change the base?
Conversation
Running this on a mpox clade-I build which currently assigns all root mutations to the basal branch leading to clade Ib. This PR: Branch leading to clade Ia
This PR: Branch leading to clade Ib
Pre-PR behaviour
So it's a lot better - and perhaps this PR is correct within its remit - but something's not quite right. The total number of inferred ATGC-ATGC mutations (73) is unchanged, but this doesn't match the refined branch lengths (65). Perhaps this is due to masking being used in the refine alignment but not considered for the conversion of subs/site to number of mutations? Secondly, and this is just for my own interest, how are the gap mutations being allocated among sister branches? I'd have thought clade IIb gets I've got some tests here which use small contrived (50nt-ish) genomes where we can reason with the mutations so I'll take a look at them later. |
I thought this was simply the subs/site scaled by some scalar, but it's actually doing the full inference and counting mutations, and this inference is modified by this PR (I missed this). Rerunning mpox including the refine step makes the assigned mutations match the mutation count. The allocation of gap-mutations also follow the branch-lengths. |
I looked into the tests, starting with the first test here which infers ancestral mutations given the reference via The results using this PR are stochastic, I observed the following three reconstructions back-to-back-to-back:
For the same test but looking at the case where we don't supply the root-sequence we see similar stochasticity, with differences in root-sequence reconstruction observed at pos 7, 18, 33, 43. |
thanks for digging into this. This makes all sense to me and is the expected result as far as i can tell. providing a |
note that the sample C is a lot closer to the root (0.02) than Node AB (0.06). So in most cases, the root will agree with C. and yes, we infer the actual mutations for the branch length in mutation units. this made sense from the perspective of very similar genomes, where you want branch length to correspond to the mutations in a direct discrete way. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't look into the main issue, but noticed a small Python syntax change to be addressed.
I tried to make the Do folks feel like the |
I think seeding is reasonable. It's also used in many |
Yes, this seems like a good direction, both for testing and facilitating reproducible builds. I'm also happy to skip certain tests (or parts of tests) where expedient. |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #1690 +/- ##
==========================================
+ Coverage 72.79% 72.82% +0.02%
==========================================
Files 79 79
Lines 8271 8272 +1
Branches 1691 1691
==========================================
+ Hits 6021 6024 +3
+ Misses 1961 1960 -1
+ Partials 289 288 -1 ☔ View full report in Codecov by Sentry. |
Thanks, @victorlin and @jameshadfield! I managed to get the remaining broken tests to pass (stably?). If this looks good to you both and @rneher, then I can update the CHANGELOG to reflect the additional |
… during sequence reconstruction and mutation counting
Co-authored-by: Victor Lin <[email protected]>
Adds a new `--seed` argument to the ancestral command, following the existing pattern in the refine command, and passes the user-provided value to TreeTime's `TreeAnc` class to ensure deterministic outcomes for random samples. Since this PR changes ancestral's root sequence inference to be stochastic, this new feature allows us to get the same "random" result with each run of our functional tests. Correspondingly, this commit updates the non-VCF functional tests of ancestral to use a fixed seed. This seed should fix the random behavior to a specific outcome, but we need to update our expected JSON outputs to match that outcome here. In this case, the change is an introduction of a "T" at position 14 in the node_root of the "simple genome" data.
As in the previous commit, this commit updates the expected VCF output to match the fixed "random" outcome for the given seed where position 14 is a T in the root node.
Updates the functional test of multiallele VCF behavior when one of the alleles is a "N" character. The multiallele VCF in the original test (setting sample B to genotype of 30G) caused the inference of alleles at sites 7 and 14 to change substantially in this PR's new implementation. This commit stabilizes the test by assigning both samples A and B to the 30G genotype such that all nodes are inferred to have a 30G and the root node gets a A30G mutation. It should maintain the originally intended behavior of the test [1]. As part of this change, I've replaced the dynamically modified VCF used by this test with a static version of the VCF in version control. This change simplified test debugging by allowing me to inspect the input file outside of the cram temporary environment and it also fixes an issue with the original dynamic VCF where an additional line for site 30 was added but the original line was not being removed. [1] #1380
The change to how augur refine can stochastically assign the root node mutations led to a change in inferred branch lengths for our functional tests. Since the test uses a random seed argument for augur refine, the results of subsequent runs should be deterministically successful after this change.
Updates data for augur translate functional tests to match the updated test data for the ancestral command.
Updates SNPs in manually-curated VCF used to test augur translate so they match the new expected genotype for the "simple genome" data at position 14. This fixes the vcf.t translate test. Also updates the expected amino acid translations JSON for the vcf with root mutation test. As part of this update, I've created a complete copy of the "truth" aa_muts.json with the root mutation and used this instead of the version that was previously created by modifying the original aa_muts file with sed. Although this change duplicates test data, I was able to debug the test more easily with a complete version of the JSON living outside of the cram tests.
Adds random seed argument to all functional tests of the ancestral command. This should hopefully resolve remaining stochastic changes in cram runs during CI.
f556cc9
to
d618f53
Compare
I tried to resolve a conflict in the change log with GitHub's web UI which ended up making a merge commit from |
… during sequence reconstruction and mutation counting.
Fixes #1689
Checklist