qc algorithm for nonACTGNs #1524

ryhisner · 2024-09-23T02:15:54Z

I'm trying to alter the "mixed sites" qc rating, but I'm having trouble figuring out how.

The pathogen.json file and the documentation on GitHub (both pictured below) only mention a threshold of 10 mixed sites. Yet I know Nextclade Web already penalizes sequences with only three mixed sites.

The private nuc mutations check lists both a threshold and a "typical" value. Is there also a hidden "typical" value for mixed sites? There must be something, since Nextclade Web flags sequences with 3 or more nonACTGNs.

ivan-aksamentov · 2024-09-23T06:24:46Z

Hi Ryan @ryhisner,

Sorry, these things are not documented very well (or at all).

The actual logic is as follows (code):

• take the nucleotide composition of the sequence (the counters that give the number of occurrences of each character in the sequence)
• take the counters for ambiguous nucleotides: everything except ACGTN-
• sum them up
• divide the sum by mixed_sites_threshold from the dataset, and multiply the result by 100
• clamp the minimum value to 0 (I think this is redundant, since the value cannot be negative)
• the result is the QC score of the "M" QC check

So the threshold acts as a denominator, and not as something you subtract or clamp to, which is perhaps what you expected. The naming of the parameter could have been better, but it's too late at this point :)
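To make the steps above concrete, here is a minimal Python sketch of the calculation as described in this comment. It is not the Nextclade source code: the function name `mixed_sites_score`, the constant `NOT_MIXED`, and the uppercasing of the input are assumptions made for illustration.

```python
from collections import Counter

# Characters that do NOT count as mixed sites, per the description above.
NOT_MIXED = frozenset("ACGTN-")


def mixed_sites_score(sequence: str, mixed_sites_threshold: float) -> float:
    """Illustrative re-implementation of the "M" QC score (hypothetical helper)."""
    # Nucleotide composition: number of occurrences of each character.
    # Uppercasing is an assumption of this sketch.
    composition = Counter(sequence.upper())
    # Sum the counters of everything except A, C, G, T, N and the gap character.
    total_mixed = sum(
        count for char, count in composition.items() if char not in NOT_MIXED
    )
    # The threshold is a denominator, not a cutoff.
    score = 100.0 * total_mixed / mixed_sites_threshold
    # Clamp to a minimum of 0 (redundant for non-negative counts).
    return max(0.0, score)


# 3 mixed sites (R, Y, K) with a threshold of 10.
print(mixed_sites_score("ACGTRYKN-", 10))  # prints 30.0
```

With a threshold of 10, three mixed sites give a score of 30 and ten give a score of 100. If a score of 30 is the point at which the check stops being rated "good" (an assumption here, not something stated in this thread), that would explain why three mixed sites already get flagged. Raising the threshold to 20 would halve every score.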

"The private nuc mutations lists both a threshold and a "typical" value. Is there also a hidden value for "typical" for mixed sites?"

No, for the "mixed" check there are only the enabled and mixed_sites_threshold parameters (code).

(P.S. Note, if reading the linked code, the values in JSON use "camelCase" naming convention and the values in the code use "snake_case" naming convention - this is a historical artifact from the olden days)
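As a small illustration of the naming difference, the sketch below converts camelCase keys to snake_case. The JSON fragment is a made-up example: the `qc` and `mixedSites` nesting and the `mixedSitesThreshold` key are assumptions derived from the naming rule just described, so check them against your actual pathogen.json. The helper `camel_to_snake` is hypothetical.

```python
import json
import re


def camel_to_snake(name: str) -> str:
    """Convert a camelCase identifier to snake_case (hypothetical helper)."""
    # Insert an underscore before every uppercase letter except at the start.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


# Made-up fragment; the key layout is an assumption, not copied from a dataset.
example = json.loads(
    '{"qc": {"mixedSites": {"enabled": true, "mixedSitesThreshold": 10}}}'
)

# The same parameters under the names used in the code.
mixed = {camel_to_snake(key): value for key, value in example["qc"]["mixedSites"].items()}
print(mixed)
```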

Let us know what exactly you are trying to do, what's not working for you, and whether you have any improvements in mind!

ryhisner · 2024-09-23T11:03:23Z

Thank you so much, Ivan!

I am in a rush to get to school and will come back to add more later, but briefly, there are several things I'm trying to do at the moment. Here are a few of the main ones.

• Determine the number of independent developments of certain mutations, mostly ones that have only rarely evolved. I've spent a really long time trying to figure out how many times ∆A28271 has arisen, and it's turning out to be extremely frustrating. The privateMutations category is too expansive for this purpose, often counting a fairly large number of related sequences as all having the same private mutation.

• I'm trying to run a nucleotide-context function for all GISAID sequences. There's so much garbage out there that I made an overall qc score maximum of 5 and disallowed all reversions (otherwise the most common mutations turn out to all be artifactual reversions due to some labs' heinous practice of reverting to reference in areas of dropout). But this filters out 60% of all sequences, which is rather more than I'm comfortable with.
I've noticed that a lot of UK sequences have ≥3 mixed nucs despite being otherwise of pretty high quality (S:R403K and M:D3H register as mixed more often than not it seems). In most other labs, which just label such sites as dropout instead of mixed nucs, the qc wouldn't suffer at all from this. So in some sense, the UK sequences are penalized for being more accurate, i.e. registering mixed nucs instead of ignoring them. Maybe it would be enough to exclude super common mixed sites like 22770R and 26610R?
Anyway, I thought I would try loosening the qc contribution of nonACGTNs. I also think I want to slightly loosen the private mutations qc requirements. In my experience, sequences that have ≥8 private mutations in Nextclade Web often turn out to have only 2 or 3 private mutations when uploaded to Usher, so I'm hoping I might be able to exclude fewer good sequences this way (without sweeping in too much trash).

• One other qc filter that I am planning to manually add to some of my functions is one that counts the number of separate areas of dropout. When I was trying to narrow down the number of non-chronic sequences that legitimately had ∆A28271 as a private mutation, I had to repeatedly filter out various categories of artifactual sequences. The most common category was Delta or Alpha sequences that only registered 1-8 substitutions, the rest simply registering as dropout. The overall amount of dropout in these sequences was low enough that they scored ≤5 on overall qc, but they often had >50 separate areas of dropout. I'm not sure if it would be possible to add something of this sort to the "missing" qc score in Nextclade.
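Two of the filters described above could be prototyped outside Nextclade along these lines. This is only a rough sketch under stated assumptions: the input is a sequence already aligned to the reference (so that positions such as 22770 and 26610 are reference coordinates), positions are 1-based, and the helper names `count_mixed_sites` and `count_dropout_runs` are hypothetical, not Nextclade functions.

```python
import re

# Characters that are not counted as mixed sites.
NOT_MIXED = frozenset("ACGTN-")


def count_mixed_sites(aligned_seq: str, ignored_positions=frozenset()) -> int:
    """Count mixed sites, skipping 1-based positions listed in ignored_positions."""
    return sum(
        1
        for pos, char in enumerate(aligned_seq.upper(), start=1)
        if char not in NOT_MIXED and pos not in ignored_positions
    )


def count_dropout_runs(aligned_seq: str, min_length: int = 1) -> int:
    """Count separate stretches of N that are at least min_length long."""
    return sum(
        1
        for match in re.finditer(r"N+", aligned_seq.upper())
        if len(match.group()) >= min_length
    )


# The two commonly mixed positions mentioned above (22770R and 26610R).
COMMON_MIXED_POSITIONS = frozenset({22770, 26610})
```

For example, `count_mixed_sites(seq, COMMON_MIXED_POSITIONS)` would ignore mixed calls at those two positions, and `count_dropout_runs(seq) > 50` would flag the heavily fragmented sequences regardless of how much total dropout they contain.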

ryhisner added the labels "needs triage" (Mark for review and label assignment) and "t:ask" (Type: question, request of information) on Sep 23, 2024
