qc algorithm for nonACTGNs #1524

Open
ryhisner opened this issue Sep 23, 2024 · 2 comments
Labels: needs triage, t:ask

Comments

@ryhisner

I'm trying to alter the "mixed sites" qc rating, but I'm having trouble figuring out how.

The pathogen.json file and the documentation on GitHub (both pictured below) only mention a threshold of 10 mixed sites. Yet I know Nextclade Web already penalizes sequences with only three mixed sites.
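[Two screenshots, not preserved: the mixed sites setting in pathogen.json and the corresponding documentation on GitHub]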

The private nuc mutations check lists both a threshold and a "typical" value. Is there also a hidden "typical" value for mixed sites? There must be something, since Nextclade Web flags sequences with 3 or more nonACGTNs.

@ivan-aksamentov
Member

ivan-aksamentov commented Sep 23, 2024

Hi Ryan @ryhisner,

Sorry, these things are not documented very well (or at all).

The actual logic is as follows (code):

  • take the nucleotide composition of the sequence (the counters which tell the number of occurrences for each character in the sequence)
  • take counters for ambiguous nucleotides: everything except ACGTN-
  • sum them up
  • divide the sum by mixed_sites_threshold, which comes from the dataset, and multiply the result by 100
  • clamp the minimum value to 0 (I think this is redundant - it cannot be negative)
  • this is the QC score of the "M" QC check

So the threshold acts as a denominator, and not as something you subtract or clamp to, which is perhaps what you expected. The naming of the parameter could have been better, but it's too late at this point :)
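The steps above can be sketched in a few lines of Python. This is only an illustration of the described logic, not Nextclade's actual code: the function name `mixed_sites_score` and the upper-casing of the input are assumptions made for the example. It also suggests why three mixed sites already matter with a threshold of 10: the score comes out at 30, which, assuming Nextclade's usual status cutoff of 30 for "mediocre" (not stated in this thread), would be enough to flag the sequence.

```python
from collections import Counter

# Characters that do NOT count as mixed sites: the four bases, N, and the gap.
NON_MIXED = set("ACGTN-")


def mixed_sites_score(sequence: str, mixed_sites_threshold: int = 10) -> float:
    """Sketch of the "M" QC score as described above (hypothetical helper)."""
    # Nucleotide composition: number of occurrences of each character.
    composition = Counter(sequence.upper())
    # Sum the counters of everything except ACGTN-.
    total_mixed = sum(n for char, n in composition.items() if char not in NON_MIXED)
    # The threshold acts as a denominator; clamp at 0 as in the description.
    return max(0.0, 100.0 * total_mixed / mixed_sites_threshold)


# Three ambiguous characters (R, Y, K) with the default threshold of 10.
print(mixed_sites_score("ACGT" * 10 + "RYK"))  # 30.0
```

Doubling the threshold to 20 would halve the score for the same sequence, so raising the threshold is the available way to soften this check.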

> The private nuc mutations check lists both a threshold and a "typical" value. Is there also a hidden "typical" value for mixed sites?

No, for the "mixed" check there are only the enabled and mixed_sites_threshold parameters (code).

(P.S. If you read the linked code, note that the field names in JSON use the "camelCase" naming convention, while the same fields in the code use "snake_case". This is a historical artifact from the olden days.)
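For changing the threshold in a local copy of a dataset, a minimal sketch follows. The key path `qc` → `mixedSites` → `mixedSitesThreshold` is an assumption derived from the camelCase convention mentioned above, and `with_mixed_sites_threshold` is a hypothetical helper; please check the names against your own pathogen.json before relying on them.

```python
import copy


def with_mixed_sites_threshold(pathogen: dict, threshold: int) -> dict:
    """Return a copy of a parsed pathogen.json with a new mixed sites threshold.

    Assumed layout (verify against your dataset):
    {"qc": {"mixedSites": {"enabled": true, "mixedSitesThreshold": 10}}}
    """
    updated = copy.deepcopy(pathogen)  # leave the caller's dict untouched
    mixed = updated.setdefault("qc", {}).setdefault("mixedSites", {})
    mixed["enabled"] = True
    mixed["mixedSitesThreshold"] = threshold
    return updated


# Toy stand-in for the parsed content of pathogen.json.
example = {"qc": {"mixedSites": {"enabled": True, "mixedSitesThreshold": 10}}}
looser = with_mixed_sites_threshold(example, 20)
print(looser["qc"]["mixedSites"])
```

In practice the dict would come from `json.load` on the dataset's pathogen.json and be written back with `json.dump` before running Nextclade against the modified dataset.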

Let us know what exactly you are trying to do, what's not working for you, and whether you have any improvements in mind!

@ryhisner
Author

Thank you so much, Ivan!

I am in a rush to get to school and will come back to add more later, but briefly, there are several things I'm trying to do at the moment. Here are a few of the main ones.

• Determine the number of independent developments of certain mutations, mostly ones that have only rarely evolved. I've spent a really long time trying to figure out how many times ∆A28271 has arisen, and it's turning out to be extremely frustrating. The privateMutations category is too expansive for this purpose, often counting a fairly large number of related sequences as all having the same private mutation.

• I'm trying to run a nucleotide-context function for all GISAID sequences. There's so much garbage out there that I made an overall qc score maximum of 5 and disallowed all reversions (otherwise the most common mutations turn out to all be artifactual reversions due to some labs' heinous practice of reverting to reference in areas of dropout). But this filters out 60% of all sequences, which is rather more than I'm comfortable with.
I've noticed that a lot of UK sequences have ≥3 mixed nucs despite being otherwise of pretty high quality (S:R403K and M:D3H register as mixed more often than not it seems). In most other labs, which just label such sites as dropout instead of mixed nucs, the qc wouldn't suffer at all from this. So in some sense, the UK sequences are penalized for being more accurate, i.e. registering mixed nucs instead of ignoring them. Maybe it would be enough to exclude super common mixed sites like 22770R and 26610R?
Anyway, I thought I would try loosening the qc contribution of nonACGTNs. I also think I want to slightly loosen the private mutations qc requirements. In my experience, sequences that have ≥8 private mutations in Nextclade Web often turn out to have only 2 or 3 private mutations when uploaded to UShER, so I'm hoping I might be able to exclude fewer good sequences this way (without sweeping in too much trash).
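The idea of excluding very common mixed sites could be prototyped outside Nextclade along these lines. This is a sketch under stated assumptions: the sequence is already aligned to the reference with insertions stripped, so that string index + 1 equals the reference position, and `count_mixed_sites` and `IGNORED` are hypothetical names, not anything Nextclade provides.

```python
# Characters that do NOT count as mixed sites: the four bases, N, and the gap.
NON_MIXED = set("ACGTN-")

# Positions mentioned above as very commonly mixed (22770R, 26610R).
IGNORED = frozenset({22770, 26610})


def count_mixed_sites(aligned_seq: str, ignored_positions=frozenset()) -> int:
    """Count ambiguous characters, skipping the given 1-based reference positions."""
    return sum(
        1
        for pos, char in enumerate(aligned_seq.upper(), start=1)
        if char not in NON_MIXED and pos not in ignored_positions
    )


# Toy example: mixed characters at positions 3 (R) and 5 (Y).
toy = "ACRTYG"
print(count_mixed_sites(toy))                                    # 2
print(count_mixed_sites(toy, ignored_positions=frozenset({3})))  # 1
```

For real genomes, passing `IGNORED` as `ignored_positions` gives a count that could then be divided by the threshold in the same way as the built-in check.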

• One other qc filter that I am planning to manually add to some of my functions is one that counts the number of separate areas of dropout. When I was trying to narrow down the number of non-chronic sequences that legitimately had ∆A28271 as a private mutation, I had to repeatedly filter out various categories of artifactual sequences. The most common category was Delta or Alpha sequences that only registered 1-8 substitutions, the rest simply registering as dropout. The overall amount of dropout in these sequences was low enough that they scored ≤5 on overall qc, but they often had >50 separate areas of dropout. I'm not sure if it would be possible to add something of this sort to the "missing" qc score in Nextclade.
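A filter of this kind is easy to sketch as a post-processing step. The example below counts separate runs of N in an aligned sequence. It assumes that dropout is represented by N only (gaps are not counted), and `dropout_regions` and its `min_length` parameter are hypothetical names introduced for illustration.

```python
import re


def dropout_regions(aligned_seq: str, min_length: int = 1) -> list[tuple[int, int]]:
    """Return 1-based inclusive (start, end) for each run of N of at least min_length."""
    return [
        (m.start() + 1, m.end())
        for m in re.finditer(r"N+", aligned_seq.upper())
        if m.end() - m.start() >= min_length
    ]


# Toy example with three separate dropout areas.
toy = "ACNNNGTNACNN"
regions = dropout_regions(toy)
print(regions)       # [(3, 5), (8, 8), (11, 12)]
print(len(regions))  # 3
```

A sequence could then be rejected when `len(dropout_regions(seq))` exceeds some cutoff, for example the >50 separate areas described above, even if its total amount of missing data is low.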
