Mean read quality calculation #376

Open
alvanuffelen opened this issue Jul 10, 2024 · 5 comments

@alvanuffelen
The mean read quality is calculated by:

  1. Converting the mean Phred quality score of each read to a base-calling error probability
  2. Calculating the mean of all the per-read error probabilities from step 1
  3. Converting the mean error probability from step 2 back to a Phred quality score
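
A minimal sketch of these three steps in Python (the example scores here are hypothetical, assuming the per-read mean Phred scores are available as floats):

import numpy as np

# per-read mean Phred quality scores (floats); hypothetical values
mean_phred_scores = np.array([16.8, 12.3, 20.1])

# step 1: convert each mean Phred score to an error probability
error_probs = 10 ** (-mean_phred_scores / 10)

# step 2: average the per-read error probabilities
mean_error_prob = error_probs.mean()

# step 3: convert the mean error probability back to a Phred score
mean_read_quality = -10 * np.log10(mean_error_prob)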

In Nanomath, for step 1, the mean Phred score of each read is cast to an int instead of being kept as a float.

Does this conversion not introduce an error in the mean read quality?
If I have a read with a mean quality of 16.8, the script converts it to 16 and the probability becomes 0.02512.
But according to the formula $P = 10^{-Q/10}$, it should be 0.02089.
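
To make the rounding effect concrete, a quick check in Python:

10 ** (-16.8 / 10)  # 0.02089... (float, correct)
10 ** (-16 / 10)    # 0.02512... (after casting to int)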

Using this test file (columns are read id, length, and mean quality), without converting to int:

import pandas as pd
import numpy as np
from math import log

df = pd.read_table("test.txt",
                   header=None,
                   names=['id', 'length', 'quality'])

# convert each read's mean Phred score to an error probability
convert_to_probs = lambda q: 10 ** (-q / 10)
vfunc = np.vectorize(convert_to_probs)
probs = vfunc(df['quality'])

# average the probabilities and convert back to a Phred score
-10 * log(probs.sum() / len(probs), 10)

The mean read quality is 13.22437.
If I convert the scores to int first:

probs = vfunc(df['quality'].astype(int))
-10 * log(probs.sum() / len(probs), 10)

The mean read quality is 12.71993, the same as NanoPlot reports (see attached file).
NanoStats.txt

Is there a reason to convert the mean score of each read to int before calculating the probabilities?

@wdecoster
Owner

I think you are right and this is an error! I'll do my best to fix this soon, or you are welcome to open a pull request :)

@GZZHY79

GZZHY79 commented Aug 8, 2024

> I think you are right and this is an error! I'll do my best to fix this soon, or you are welcome to open a pull request :)

I notice the latest release was in Oct 2023. Has the error been fixed yet?

@jbh-cas

jbh-cas commented Sep 16, 2024

Wouter, I'm adding this comment here as another item to possibly consider if/when you revisit the mean qscore.

It's not exactly this issue with the mean qscore, but having read your informative blog posts about it, I found an issue on the dorado repo that might explain the difference between the qscore the basecaller reports and the one NanoMath/NanoPlot reports.

Issue is nanoporetech/dorado#937

It explains that, for whatever reason, the first 60 bases are skipped in the mean qual calculation. Doing this does give a qscore in agreement with basecaller values in my modest testing. The default of 60 is set in the function get_mean_qscore_start_pos_by_model_name in dorado/basecall/CRFModelConfig.cpp. Models can override this value but usually don't, so 60 bases are typically skipped.
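
For illustration, a minimal sketch of what skipping leading bases looks like when computing a read's mean qscore from per-base qualities (the function name and the 60-base default here are my own, mirroring the dorado behavior described above; this is not dorado's actual code):

import numpy as np

def mean_qscore(per_base_quals, start_pos=60):
    # per_base_quals: per-base Phred scores for one read
    quals = np.asarray(per_base_quals, dtype=float)[start_pos:]
    if quals.size == 0:  # read shorter than start_pos: fall back to all bases
        quals = np.asarray(per_base_quals, dtype=float)
    probs = 10 ** (-quals / 10)  # per-base error probabilities
    return -10 * np.log10(probs.mean())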

I use your method -- thanks! -- to reattach a mean qscore after trimming, to guide user filtering of reads, and will continue to take all bases into account. These differences would seem to argue more for cropping low-quality beginning bases than for ignoring them in a calculation.

In any event, I thought I'd pass this along since I'd found the difference a puzzler.

best,
jim

@wdecoster
Owner

Hi Jim,

Thanks for letting me know! Skipping some bases sounds entirely reasonable to me, given the issues caused by adapter sequences at the start of the read. I might adopt that :-)

Best,
Wouter

@jbh-cas

jbh-cas commented Sep 16, 2024 via email

wdecoster added a commit that referenced this issue Oct 18, 2024
and also fixing a deprecation warning in timeplots