qc algorithm for nonACTGNs #1524
Comments
Hi Ryan @ryhisner, Sorry, these things are not documented very well (or at all). The actual logic is as follows (code):
So the threshold acts as a denominator, and not as something you subtract or clamp to, which is perhaps what you expected. The naming of the parameter could have been better, but it's too late at this point :)
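To make the "denominator" behaviour concrete, here is a minimal sketch, not the actual Nextclade code. It assumes the score is the mixed-site count scaled to 100 at the threshold, and that a score of 30 or more is flagged "mediocre" and 100 or more "bad". The function names and both cutoffs are assumptions for illustration and are not stated in this thread.

```python
def mixed_sites_score(total_mixed_sites: int, mix_sites_threshold: int) -> float:
    """Hypothetical scoring: the threshold is a denominator, not a cutoff.

    Assumption: the score reaches 100 when the count equals the threshold.
    """
    return max(0.0, 100 * total_mixed_sites / mix_sites_threshold)


def qc_status(score: float) -> str:
    """Hypothetical status bands (assumed cutoffs: 30 and 100)."""
    if score >= 100:
        return "bad"
    if score >= 30:
        return "mediocre"
    return "good"


if __name__ == "__main__":
    # With a threshold of 10, see where each count of mixed sites lands.
    for count in (0, 2, 3, 10):
        score = mixed_sites_score(count, 10)
        print(count, score, qc_status(score))
```

Under these assumptions a threshold of 10 already yields a score of 30 at three mixed sites, which would be consistent with Nextclade Web flagging sequences with 3 or more non-ACTGN characters even though no separate "typical" value exists.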
No, for "mixed" check there's only

(P.S. Note, if reading the linked code, the values in JSON use the "camelCase" naming convention and the values in the code use the "snake_case" naming convention. This is a historical artifact from the olden days.)

Let us know what you are trying to do exactly, what's not working for you, and whether you have any improvements in mind!
Thank you so much, Ivan! I am in a rush to get to school and will come back to add more later, but briefly, there are several things I'm trying to do at the moment. Here are a few of the main ones.

• Determine the number of independent developments of certain mutations, mostly ones that have only rarely evolved. I've spent a really long time trying to figure out how many times ∆A28271 has arisen, and it's turning out to be extremely frustrating. The privateMutations category is too expansive for this purpose, often counting a fairly large number of related sequences as all having the same private mutation.
• I'm trying to run a nucleotide-context function for all GISAID sequences. There's so much garbage out there that I set a maximum overall qc score of 5 and disallowed all reversions (otherwise the most common mutations all turn out to be artifactual reversions, due to some labs' heinous practice of reverting to reference in areas of dropout). But this filters out 60% of all sequences, which is rather more than I'm comfortable with.
• One other qc filter that I am planning to manually add to some of my functions is one that counts the number of separate areas of dropout. When I was trying to narrow down the number of non-chronic sequences that legitimately had ∆A28271 as a private mutation, I had to repeatedly filter out various categories of artifactual sequences. The most common category was Delta or Alpha sequences that only registered 1-8 substitutions, with the rest simply registering as dropout. The overall amount of dropout in these sequences was low enough that they scored ≤5 on overall qc, but they often had >50 separate areas of dropout. I'm not sure if it would be possible to add something of this sort to the "missing" qc score in Nextclade.
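The "separate areas of dropout" filter from the last point could be sketched roughly as below. This is only an illustration, not part of Nextclade. It assumes dropout shows up as runs of `N` in an aligned sequence string, and the names `missing_runs`, `dropout_summary`, and `max_runs` are hypothetical. The default of 50 echoes the ">50 separate areas" figure mentioned above.

```python
import re


def missing_runs(aligned_seq: str) -> list[tuple[int, int]]:
    """Return half-open (start, end) ranges of consecutive N characters."""
    return [m.span() for m in re.finditer(r"N+", aligned_seq.upper())]


def dropout_summary(aligned_seq: str, max_runs: int = 50) -> dict:
    """Count separate dropout regions and the total number of missing bases.

    `max_runs` is a hypothetical cutoff; a sequence with more separate
    dropout regions than this is marked as too fragmented.
    """
    runs = missing_runs(aligned_seq)
    total_missing = sum(end - start for start, end in runs)
    return {
        "num_runs": len(runs),
        "total_missing": total_missing,
        "too_fragmented": len(runs) > max_runs,
    }


if __name__ == "__main__":
    # Toy sequence with three separate stretches of N.
    print(dropout_summary("ACGTNNNNACGTNNACGNNNT", max_runs=2))
```

Counting runs rather than total missing bases targets the case described above, where the overall amount of dropout is small but it is scattered across many separate regions.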
I'm trying to alter the "mixed sites" qc rating, but I'm having trouble figuring out how.
The pathogen.json file and the documentation on GitHub (both pictured below) only mention a threshold of 10 mixed sites. Yet I know Nextclade Web penalizes for only three mixed sites.
The private nuc mutations rule lists both a threshold and a "typical" value. Is there also a hidden "typical" value for mixed sites? There must be something, since Nextclade Web flags sequences with 3 or more nonACTGNs.