Low alignment rates of synthetic reads to gold standard coassembly scaffolds #38

Closed
shw079 opened this issue Jul 10, 2018 · 7 comments

@shw079

shw079 commented Jul 10, 2018

Hi, I'm not sure this is the right place to raise this, but I am using the 2nd CAMI Challenge Human Microbiome Project Toy Dataset. When I mapped the reads to the gold standard co-assembly scaffolds with bowtie2, the alignment rate was very high (>99%) for oral, skin, and airways, but very low (~40%) for gastrointestinal tract and urogenital tract. I am wondering why this is happening for these two body sites. Thanks!

@AlphaSquad
Collaborator

Interesting, thanks for pointing this out. Could you point us to the exact files you used for mapping so we can reproduce this and investigate what might be going on?

@shw079
Author

shw079 commented Jul 10, 2018

I just used the anonymous_reads.fq files in the Illumina synthetic samples. For gastrointestinal tract and urogenital tract, the alignment rates are low for all samples. I trimmed and filtered the reads and reran the alignment, which improved the alignment rates, but they are still low.

For example, after trimming, for 2017.12.04_18.56.22_sample_6 in urogenital tract, the alignment rate is only ~15%.

@AlphaSquad
Collaborator

We are investigating this. It is possible that the gold standards for urogenital and gastrointestinal tract are mixed up with each other.

@AlphaSquad
Collaborator

AlphaSquad commented Jul 18, 2018

Could you please post, or send via email, the exact steps and commands you ran?
The tests I have run so far could not reproduce this problem: the read file short_read/2017.12.04_18.56.22_sample_6/reads/anonymous_reads.fq.gz had a 100% mapping rate to hybrid/pooled/anonymous_gsa.fasta.

@AlphaSquad AlphaSquad self-assigned this Jul 18, 2018
@shw079
Author

shw079 commented Jul 18, 2018

I mapped short_read/2017.12.04_18.56.22_sample_6/reads/anonymous_reads.fq.gz to short_read/gsa.fasta.gz and haven't touched hybrid/pooled/anonymous_gsa.fasta. Is that the problem?

@AlphaSquad
Collaborator

AlphaSquad commented Jul 19, 2018

I tried reproducing this; the exact steps follow.
Downloading the reads from sample 6 with
java -jar camiClient.jar -d https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/CAMI_Urogenital_tract . sample_6/reads/anonymous_reads
then downloading the corresponding gsa with
java -jar camiClient.jar -d https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/CAMI_Urogenital_tract . short_read/gsa
then unzipping with
gunzip gsa.fasta.gz
gunzip anonymous_reads.fq.gz
and mapping with
bwa index gsa.fasta
bwa mem gsa.fasta anonymous_reads.fq > gsa.sam
samtools view -bS gsa.sam > gsa.bam
samtools flagstat gsa.bam
This gave a mapping rate of 100%.
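The mapping rate quoted here is read off the `samtools flagstat` report. For comparing many samples or body sites, a small parser can pull that number out of the saved report text. The following is a minimal sketch, not part of the CAMI tooling: the function name `mapping_rate` is hypothetical, and it assumes the default text layout of flagstat, where each line starts with `<QC-passed> + <QC-failed>` followed by a label such as `in total` or `mapped`.

```python
import re

# Matches flagstat count lines such as "200 + 0 mapped (100.00% : N/A)".
_COUNT_LINE = re.compile(r"^(\d+) \+ (\d+) (.+)$")


def mapping_rate(flagstat_text):
    """Return mapped / total for QC-passed records in flagstat text output.

    Hypothetical helper. Note that flagstat counts alignment records, so
    secondary and supplementary alignments are included in both numbers.
    """
    total = None
    mapped = None
    for line in flagstat_text.splitlines():
        match = _COUNT_LINE.match(line.strip())
        if match is None:
            continue
        passed = int(match.group(1))
        label = match.group(3)
        if label.startswith("in total"):
            total = passed
        elif label.startswith("mapped"):
            # "primary mapped" and "with itself and mate mapped" do not
            # start with "mapped", so only the overall line is picked up.
            mapped = passed
    if total is None or mapped is None:
        raise ValueError("input does not look like samtools flagstat output")
    if total == 0:
        return 0.0
    return mapped / total
```

One way to use it would be to save the report with `samtools flagstat gsa.bam > gsa.flagstat.txt` and pass the file contents to `mapping_rate`; the output filename is only an example.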
Maybe you accidentally overwrote the urogenital/gastrointestinal gold standards when downloading the next set? Anyhow, I will add a new Issue about unique filenames based on the sample name.
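Since every body site ships its gold standard under the same file name, one quick way to check for an accidental overwrite is to hash the downloaded copies and see whether any two are byte-identical. The following is a minimal sketch with hypothetical helper names (`sha256_of`, `identical_files`); it assumes the files were kept in separate local directories, and it compares the files exactly as stored on disk.

```python
import hashlib
from collections import defaultdict


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def identical_files(paths):
    """Group paths by content hash and return groups with more than one file."""
    groups = defaultdict(list)
    for path in paths:
        groups[sha256_of(path)].append(str(path))
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

Calling `identical_files` on, for example, the urogenital and gastrointestinal copies of `gsa.fasta.gz` returns a non-empty list if the two files have the same content, which would point to one download having replaced the other. The directory layout is up to you; none is assumed here.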

@abremges
Member

This seems resolved; if not, please let us know and we'll reopen the issue and look into it further.
I support the idea proposed in #39 for a future version, maybe already for CAMI2 data generation.
