Limited valid data size of filtered BAM file in phased polyploid assembly #86

Open
Axolotl233 opened this issue Oct 30, 2024 · 0 comments


Hello,

Thank you for developing such a good tool; it is very useful for the whole community! I am working on an allotetraploid plant (2n = 4x = AABB) that originated from a hybrid polyploidization event involving two closely related species (A and B, which have different karyotypes) and has not obviously diploidized. Additionally, this plant is outcrossing and heterozygous. Recently we sequenced this plant using PacBio HiFi, Illumina short reads, and Hi-C, and we want to construct a chromosome-level genome and perform genome analysis.

Both homogeneity and heterogeneity exist between the two subgenomes because of the close relationship between the two progenitor genomes. This means some genomic regions are more similar (autopolyploid-like) than others (allopolyploid-like), which makes genome assembly challenging. We finally assembled a draft genome using hifiasm and got almost complete 4:1 genome collinearity when comparing the concatenated haplotype genome (hap1 + hap2) with one progenitor genome (we currently have only one). But when the two haplotypes were separated and compared with the progenitor genome, both showed redundancy or missing sequence in some genomic regions, indicating that the phasing was inaccurate. So I decided to use this concatenated genome for chromosome anchoring with HapHiC.

However, I encountered the same problem as #21: my filtered BAM file is only 3.8G, compared with 117G for the raw one. This limited data caused a weird Hi-C heatmap in quick view mode. My commands are:

bwa index hic.hap12.fa
bwa mem -5SP -t 40 ./hic.hap12.fa  ../../../../z.data/hic_1.fq.gz   ../../../../z.data/hic_2.fq.gz | samblaster | samtools view - -@ 20 -S -h -b -F 3340 -o HiC.bam
/software/HapHiC/utils/filter_bam HiC.bam 1 --nm 3 --threads 14 | samtools view - -b -@ 14 -o HiC.filtered.bam
/software/HapHiC/haphic pipeline hic.hap12.fa HiC.filtered.bam 10 --quick_view --threads 30 --processes 30 --gfa "hic.hap1.p_ctg.gfa,hic.hap2.p_ctg.gfa"

[Image: hicmap]

You suggested in #21 using the p_utg data for analysis in this situation, but I wonder whether it is possible to use the concatenated genome (cat hap1 + hap2) instead. Maybe I could relax the filtering conditions of the BAM filtering step in genomic regions where the two haplotypes have exactly the same sequence?
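
To put a number on the data loss, here is a minimal sketch that compares alignment counts rather than file sizes. It assumes samtools is on the PATH and that HiC.bam is the unfiltered file from the commands above. The helper names count_reads and retained_pct are hypothetical, not part of HapHiC. The sketch only checks the MAPQ cutoff (MAPQ >= 1), not the --nm edit-distance filter. A very low percentage would be consistent with most read pairs mapping equally well to both haplotypes.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical and not part of HapHiC.

# Count alignments in BAM file $1 whose MAPQ is at least $2.
# samtools view: -c prints a count, -q sets the minimum MAPQ.
count_reads() {
    samtools view -c -q "$2" "$1"
}

# Print the percentage of alignments retained, given total ($1) and kept ($2).
retained_pct() {
    awk -v total="$1" -v kept="$2" 'BEGIN {
        if (total == 0) { print "NA"; exit }
        printf "%.2f\n", 100 * kept / total
    }'
}

# Usage (not run automatically; file name assumed from the commands above):
#   total=$(count_reads HiC.bam 0)
#   kept=$(count_reads HiC.bam 1)
#   retained_pct "$total" "$kept"
```

Counting alignments avoids the ambiguity of comparing compressed file sizes, since BAM compression ratios can differ between the raw and filtered files.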
