-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ReadID does not match the read-assignment-distributions #20
Comments
Is it possible that this discrepancy is a due to the different algorithms between blast and minimap2 (and perhaps different databases..i'm unclear here if you are using the same database in both emu and your blast run here. If using a different database, the same species classification may be represented by a different representative sequence). If you were to run minimap2 using just your selected reads against the database you are using for emu, does the discrepancy still appear? |
Thank you for the fast response. My command for emu using the emu-default database: If you need more information or a detailed overview how I extracted the sequences etc. please let me know. And thanks a lot for your help. |
We would expect the read classification distributions to change depending on which reads are included in the input. The algorithm behind emu upweights likelihood for more abundant species, and downweights less abundant species. Thus, the classification distribution for each read varies depending on the other reads in the sample. I ran emu on your latest provided sequences. I choose to focus on the fourth read (97fe77d8-c844-40ff-8e18-a9ae7d3c0047_Rickettsia) since you have labeled this as Rickettsia and emu has labeled it has Wolbachia (split between tax ids 307502 and 66084 per the read-assignment-distribution file). I looked at the minimap2 output (sam file) that is produced as part of emu. When looking at the alignments for this read (attached), indeed the top 8 alignments are of Wolbachia, and the later alignments are for Rickettsia species. You can see in the chaining score (s1), Wolbachia references are in the 1054-1122 range, while Rickettsia has the highest s1 of 691. Minimap2 favors Wolbachia for this read. One thing I noticed about this read is that is that it is over 6kbp. In the provided fasta file, many reads appear to be over 3kbp, especially towards the top of the file. Depending on the type of sequencing done, I suspect these may be chimeric? I'm curious how a chimeric read corrector would perform on these reads. 97fe77d8-c844-40ff-8e18-a9ae7d3c0047_Rickettsia_minimap2.sam.txt |
Hi there!
I utilized EMU to classify 16S and 18S rRNA amplicons sequenced using nanopore sequencing with the SQK-NBD114.96 kit. However, I noticed that some of the taxonomic assignments made by EMU didn't match our expectations, so I wanted to verify some of these assignments manually. I used the read-assignment-distributions file to extract the readIDs for some arbitrarily chosen reads.
I next greped the readIDs in the according to fastq files and extracted the according sequence. When I now blast the sequences, some of them have different assignments by blast than by emu. All assignments from blasting do occur in the taxa assigned by emu, but some of them don't match the readID.
Attached please find the files containing the information to blast readID (c509fbd9-628a-4fd4-b62e-96d6781e5484).
The sequence to readID c509fbd9-628a-4fd4-b62e-96d6781e5484:
ATTACATCTACTTCGTTCAGTTACGTATTGCTAAGGTTAACACAAAGACACCGACAACTTTCTTCAGCACCTTACGGCTACCTTGTTACGACTTCACCCCAGTCGCTAATTTTACCGTGGTTGGCTGCCTCTTGCGTTAGCTCACCACCTTCAGGTAAAACCAACTCCCATGGCGTGACGGGCAGTGTGTACAAGGCCCGAGAACGTATTCACCGCGGCATGCTGATCCGCGATTACTAGCGATTCCAACTTCATGCTCTCGAGTTGCAGAGAACAATCCGAACTGAGATGTCTTTAGGGATTTGCTCCACGTCGCCGTATTGCTTCCCTCTGTAAACACCATTGTAGCACGCGTGTAGCCCAACCCATAAGGGCCATGATGACTTGACGTCGTCCCCACCTTCCTCCGGCTTATCACCGGCAGTTTTCTTATAGTTCCCGGCATTACCCGCTGGCAAATAAGAAGTCGGGGTTGCGCTGAGTTGCGGGACTTAACCCGACATCTCACGACACGAGCTGACGACAGCCATGCAACACCTGTGTGTGGTCCAGCCGAACTGAAGGAAAGCATCTCTGCGATCCGTAACCACCATGTCAAGGGTTGGTAAGGTTTTTCGCGTAACATCGAATTAAACCGCATGCTCCACCGCTTGTGCGAGCCCCCGTCAATTCCTTTGAGTTTTAATCTTGCGACCGTACTCCCCAGGCGGAGTGCTTAATGCGTTAGCTACGAAACCGAAAGAAATTCCTCCGATATCTAGCACTCATCGTTTACGGCGTGACTACCGGGTATCTGATCCTCTTTGCTCCACGCTTTCGTGCATCAGTGTCATTATCCTCTTACCGAGATGACCGCCTTCGCCACCGGTGTTCCTCCTAATATCTAAGAATTTCACCTCTACACTAGGAATTCCATCATCCCCTACTACACTCTAGATTAGTAGTTTTGAAAGCTATTCCGAGGTTAAGCCCCGGGCTTTCACTTTCCAACTTACTAAACCACCTACGCACTCTTTACGCCCAGTAATTCCGAACAACGCTAGCCCCCTCCGTCTTACCGCCGGCTGCTGGCACGGAGTTAGCCGGGGCTTTTTCTGCAGGTAACGTCATTATCTTCCCTGCTAAAAGAGCTTTACAACCCTAAGGCCTTCGTCACTCACTCGGTATTGCTGGATCAGGCTTTCGCCCATTGTCCAATATTCCCCACTGCTGCCTCCAGTAGGAGTCTGGGCCGTGTCTCAGTCCCAGTGTGGCTGATCATCCTCTCAGACCAGCTACAGATCGTAGGCTTGGTAGGCCATTACCCCACCAACTACCTAATCTGACGCGGGCTCATCCATCAGCGATAAATCTTTCCTCCTCGGAGAATATACGGTATTAGCTTTTATTTCTAAAAGTTATTCCGTACTGATGGGCAGATTCCCACGTGTTTCACCAGTCTGCCACTAATTAATTAGAGCAAGCCCTAATTAACTCGTTCGACTTGCATGTGTTAAGCATACCGATAGCGTTCGTTCTGAGCCATGATCAAACTCTATACGGTTACCTTGTTACGACTTCACCCCAGTCATGAATCACAAAGTGGTAAGCGCCCTCCCGAAGGTTGAGCTTACCTACTTCTTTTGCAACCCACTCCCATGGTGTGACGGGCGGTGTTCCAAGGCCCGGGAACGTATTCACCGTAGCATTCTGATCTACGATTACTAGCGATTCCGACTTCACGGAGTCGAGTTGCAGACTCCGATCCGGACTACGACGTACTTTATGAGGTCCGCTTGCTCTCGCGAGTTCGCTTTTCTTTGTATACGCCATTGTAGCACGTGTGTAGCCCTACTCGTAAGGGCCATGATAGCTTGACGTCATACCCACCTTCCTCCGGTTTATCACCGGCAGTCTCCTTTGAGTTCCCACCATTACGTGCTGGCAACAAAGGATAAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATTTCACAACACGAGCTGACGACAGCCATGCAGCACCTGTCTCAGAGTTCCCGAAGGCACTAAGCTATCTCTAGCAAATTCTCTGGATGTCAAGAGTAGGTAAGGTTCTTCGCGTTGCATCGAATTAAACCACATGCTCCACCCGCTTGTGCGGGCCCCCGTCAATTCATTTGAGTTTTAACCTTGCGGCCGTACTCCCCAGGCGGTCGACTTAACGCGTTAGCTCCGGAAGCCACGCCTCGAGGGCACAACCTACAGTCGACATCGTTTACAGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCTACCCACGCTTTCAGCCTGCCTGACGTCAGTCTTTGTCCAGGGGGCCGCCTTCACCACCGGTATTCCTCCCAGATCTCTACGCATTTCACCGCTACACCTGGAATTCTACCCCCCCTCCCCAGACTCTAGCTTGCCAGTTTCAAATGCAGTTCCCACGTTAAGCGCGGGGATTTCACATCTGACTTAACAAACCGCCTGCGTGCGCTTTACGCCCAGTAATTCCGATTAACGCTTGCACCCTCCGTATTACCGCGGCTGCTGGCACGGAGTTAGCCGGTGCTTCTTCTGCGAGTAACGTCAATGAACAGTGCTATTAACACTGAACCCTTCCTCCTCGCTGAAAGTGCTTTACAACCCGAAGGCCTTCTTCACACACGCGGCATGGCTGCATCAGGCTTGCGCCCATTGTGCAATATTCCCCACTGCTGCCTCCCGTAGGAGTCTGGACCGTGTCTCAGTTCCAGTGTGGCTGGTCATCCTATCTTCAGACCAGCTAGGGATCGTCGCCTAGGTGAGCCATTACCCCACCTCTAGCTAATCCCATCTGGGCACATCTGATGGCATGAGCCCAGAAGGTCCCCCACTTTGGTCCGTAGACGTTATGCGGTATTAGCTACCGTTTCCAGTAGTTATCCCCCTCCATCAGGCAGTTTCCCAGACATTACTCACCCGTCCGCCGCTCGTCACCCAGAGAGCAAGCTCTCCTGTGCTACCGCTCGACTTGCATGTGTTAGGCCTGCCGCCAGCGTTCAATCTGAGCCATGATCAAACTCTAGGTGCTGAAGAAAGTTGTCGGTGTCTTTGTGTTAACCTTCACAATACGT
It was assigned to Rickettsia bellii, but the blast search resulted in Serratia liquefaciens.
So far, I have not found any systematic shift in the rowIDs that could explain the inconsistencies. help would be much appreciated.
Kind regards,
Stephan
read-assignments.txt
example_fastq.txt
The text was updated successfully, but these errors were encountered: