Support for promer #259

Open
bricoletc opened this issue Jul 3, 2024 · 3 comments

@bricoletc

bricoletc commented Jul 3, 2024

Hello!

Thank you for this brilliant tool.
I've been using it for an application in which the synteny inferred using mummer's nucmer (i.e. at the DNA level) was only partial compared with that inferred using mummer's promer (as assessed using mummerplot). This is unsurprising, as promer aligns at the protein level and so can detect more highly diverged synteny.

I wanted to make a plotsr plot using promer coordinates rather than nucmer coordinates, and I have found a simple, though slightly hacky, way of doing it. I'm happy to share how I did it, or to open a PR adding instructions to your documentation page, if you tell me where to do that.

It involves formatting the output of show-coords on promer .delta files in the same way as on nucmer .delta files, as show-coords produces slightly different .coords files (docs here).
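As an illustration only (this is not the script referred to above, which is not shown in this thread), a reformatting step of this kind might look like the sketch below. It assumes tab-delimited, headerless show-coords output in which a promer row has 13 columns (S1, E1, S2, E2, LEN1, LEN2, %IDY, %SIM, %STP, reference frame, query frame, reference tag, query tag) and the nucmer-style row has 11 columns, with alignment directions (1 or -1) in place of the frames and no %SIM or %STP. The function name `promer_row_to_nucmer` is hypothetical, and the column layout should be checked against the show-coords documentation for the MUMmer version in use.

```python
def promer_row_to_nucmer(line: str) -> str:
    """Convert one promer show-coords row to a nucmer-like row.

    Assumed input layout (tab-delimited, no header), 13 columns:
    S1 E1 S2 E2 LEN1 LEN2 %IDY %SIM %STP FRM_REF FRM_QRY REF_TAG QRY_TAG
    Assumed output layout, 11 columns:
    S1 E1 S2 E2 LEN1 LEN2 %IDY DIR_REF DIR_QRY REF_TAG QRY_TAG
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 13:
        raise ValueError(f"expected 13 columns, got {len(fields)}")
    s1, e1, s2, e2, len1, len2, idy, _sim, _stp, frm_r, frm_q, ref, qry = fields
    # Collapse reading frames (-3..-1, 1..3) to strand directions (-1 or 1).
    dir_r = "1" if int(frm_r) > 0 else "-1"
    dir_q = "1" if int(frm_q) > 0 else "-1"
    return "\t".join([s1, e1, s2, e2, len1, len2, idy, dir_r, dir_q, ref, qry])
```

Applied line by line to a promer .coords file, this would give a file with the same column count as the assumed nucmer layout. Whether the remaining fields (for example %IDY, which for promer refers to amino-acid identity) are interpreted sensibly downstream is a separate question.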

Maybe in the long run you'd want to build support into syri directly; it might not be too difficult.

Best,
Brice

@mnshgl0110
Member

Hi Brice,
That's an interesting idea. If I understand correctly, in the following example, where three proteins (P1, P2, and P3) are in the same order on two chromosomes but one of the chromosomes has a large TE insertion, promer + mummerplot would find these three proteins as syntenic, right?
image

This would be antithetical to syri's design, as syri is meant to find such rearrangements as well. As such, I would prefer not to add "official" support for protein comparisons.

However, as a fan of hacky ways, I think it would be possible to include the script that pre-processes promer's show-coords file in the repository. That would allow experienced users to do the manipulations themselves while keeping things simple for less experienced users.

You are welcome to open a pull request (I think a normal fork and pull should work) and share your script.

Best
Manish

@bricoletc
Author

Hello Manish,

For a PR, where are the documentation files that get published at https://schneebergerlab.github.io/syri?

As for your specific example, I will look at my data in detail and get back to you!

@bricoletc
Author

Hi again Manish,

So in your example, no: nucmer and promer will give the same results, provided P1, P2 and P3 are sufficiently similar to be aligned at the DNA level. That is, both nucmer and promer + mummerplot would show hits for P1, P2 and P3, and no alignment for the TE. (By the way, promer aligns all six-frame translations of the reference and query DNA, so P1/P2/P3 probably don't even have to be true proteins.)
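To make the six-frame idea concrete, here is a minimal sketch of translating a DNA string in all six reading frames. It is an illustration, not promer's implementation: it assumes the standard genetic code (mitochondrial genomes often use alternative codes), and the frame labels 1..3 and -1..-3 are this sketch's own convention. The name `six_frame_translations` is hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna: str) -> dict:
    """Return {frame: protein} for frames 1, 2, 3 and -1, -2, -3."""
    dna = dna.upper()
    reverse_complement = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(reverse_complement[offset:])
    return frames
```

Because every frame on both strands is translated, any stretch of DNA yields candidate amino-acid sequences regardless of whether it encodes a real protein, which is the point made in the aside above.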

However, using promer will increase the sensitivity of alignments for highly diverged sequences. This can affect synteny, but in a good way IMO. Here is a concrete example: I aligned the mitochondrial sequences of two highly diverged species using nucmer and promer. Here are the mummerplots side by side (nucmer left, promer right):

image

For the left-hand plot, syri won't infer any changes in synteny, but that's only because the sequences in the middle are too diverged at the DNA-level to be aligned.
Here's the syri VCF for the nucmer-based alignment:

##fileformat=VCFv4.3
##fileDate=20240705
##source=syri
##contig=<ID=contig_1,length=14620>
##ALT=<ID=SYN,Description="Syntenic region">
##ALT=<ID=INV,Description="Inversion">
##ALT=<ID=TRANS,Description="Translocation">
##ALT=<ID=INVTR,Description="Inverted Translocation">
##ALT=<ID=DUP,Description="Duplication">
##ALT=<ID=INVDP,Description="Inverted Duplication">
##ALT=<ID=SYNAL,Description="Syntenic alignment">
##ALT=<ID=INVAL,Description="Inversion alignment">
##ALT=<ID=TRANSAL,Description="Translocation alignment">
##ALT=<ID=INVTRAL,Description="Inverted Translocation alignment">
##ALT=<ID=DUPAL,Description="Duplication alignment">
##ALT=<ID=INVDPAL,Description="Inverted Duplication alignment">
##ALT=<ID=HDR,Description="Highly diverged regions">
##ALT=<ID=INS,Description="Insertion in non-reference genome">
##ALT=<ID=DEL,Description="Deletion in non-reference genome">
##ALT=<ID=CPG,Description="Copy gain in non-reference genome">
##ALT=<ID=CPL,Description="Copy loss in non-reference genome">
##ALT=<ID=SNP,Description="Single nucleotide polymorphism">
##ALT=<ID=TDM,Description="Tandem repeat">
##ALT=<ID=NOTAL,Description="Not Aligned region">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position on reference genome">
##INFO=<ID=ChrB,Number=1,Type=String,Description="Chromosome ID on the non-reference genome">
##INFO=<ID=StartB,Number=1,Type=Integer,Description="Start position on non-reference genome">
##INFO=<ID=EndB,Number=1,Type=Integer,Description="End position on non-reference genome">
##INFO=<ID=Parent,Number=1,Type=String,Description="ID of the parent SR">
##INFO=<ID=VarType,Number=1,Type=String,Description="SR for structural arrangements, ShV for short variants, missing otherwise">
##INFO=<ID=DupType,Number=1,Type=String,Description="Copy gain or loss in the non-reference genome">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	sample
contig_1	1	NOTAL1	N	<NOTAL>	.	PASS	END=1057;ChrB=.;StartB=.;EndB=.;Parent=.;VarType=.;DupType=.	GT	1
contig_1	1058	SYNAL1	N	<SYNAL>	.	PASS	END=1629;ChrB=contig_1;StartB=1078;EndB=1647;Parent=SYN1;VarType=.;DupType=.	GT	1
contig_1	1058	SYN1	N	<SYN>	.	PASS	END=14596;ChrB=contig_1;StartB=1078;EndB=14858;Parent=.;VarType=SR;DupType=-	GT	1
contig_1	1629	HDR1	N	<HDR>	.	PASS	END=5383;ChrB=contig_1;StartB=1647;EndB=9170;Parent=SYN1;VarType=ShV;DupType=.	GT	1
contig_1	5384	SYNAL2	N	<SYNAL>	.	PASS	END=6252;ChrB=contig_1;StartB=9171;EndB=10041;Parent=SYN1;VarType=.;DupType=.	GT	1
contig_1	6252	HDR2	N	<HDR>	.	PASS	END=12778;ChrB=contig_1;StartB=10041;EndB=13051;Parent=SYN1;VarType=ShV;DupType=.	GT	1
contig_1	12779	SYNAL3	N	<SYNAL>	.	PASS	END=14596;ChrB=contig_1;StartB=13052;EndB=14858;Parent=SYN1;VarType=.;DupType=.	GT	1
contig_1	14597	NOTAL2	N	<NOTAL>	.	PASS	END=14620;ChrB=.;StartB=.;EndB=.;Parent=.;VarType=.;DupType=.	GT	1

Because promer does align the two sequences almost entirely, we can then see a translocation.
Here's the corresponding syri VCF:

##fileformat=VCFv4.3
##fileDate=20240705
##source=syri
##contig=<ID=contig_1,length=14620>
##ALT=<ID=SYN,Description="Syntenic region">
##ALT=<ID=INV,Description="Inversion">
##ALT=<ID=TRANS,Description="Translocation">
##ALT=<ID=INVTR,Description="Inverted Translocation">
##ALT=<ID=DUP,Description="Duplication">
##ALT=<ID=INVDP,Description="Inverted Duplication">
##ALT=<ID=SYNAL,Description="Syntenic alignment">
##ALT=<ID=INVAL,Description="Inversion alignment">
##ALT=<ID=TRANSAL,Description="Translocation alignment">
##ALT=<ID=INVTRAL,Description="Inverted Translocation alignment">
##ALT=<ID=DUPAL,Description="Duplication alignment">
##ALT=<ID=INVDPAL,Description="Inverted Duplication alignment">
##ALT=<ID=HDR,Description="Highly diverged regions">
##ALT=<ID=INS,Description="Insertion in non-reference genome">
##ALT=<ID=DEL,Description="Deletion in non-reference genome">
##ALT=<ID=CPG,Description="Copy gain in non-reference genome">
##ALT=<ID=CPL,Description="Copy loss in non-reference genome">
##ALT=<ID=SNP,Description="Single nucleotide polymorphism">
##ALT=<ID=TDM,Description="Tandem repeat">
##ALT=<ID=NOTAL,Description="Not Aligned region">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position on reference genome">
##INFO=<ID=ChrB,Number=1,Type=String,Description="Chromosome ID on the non-reference genome">
##INFO=<ID=StartB,Number=1,Type=Integer,Description="Start position on non-reference genome">
##INFO=<ID=EndB,Number=1,Type=Integer,Description="End position on non-reference genome">
##INFO=<ID=Parent,Number=1,Type=String,Description="ID of the parent SR">
##INFO=<ID=VarType,Number=1,Type=String,Description="SR for structural arrangements, ShV for short variants, missing otherwise">
##INFO=<ID=DupType,Number=1,Type=String,Description="Copy gain or loss in the non-reference genome">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	sample
contig_1	1	SYNAL1	N	<SYNAL>	.	PASS	END=768;ChrB=contig_1;StartB=25;EndB=792;Parent=SYN1;VarType=.;DupType=.	GT	1
contig_1	1	SYN1	N	<SYN>	.	PASS	END=3555;ChrB=contig_1;StartB=25;EndB=3574;Parent=.;VarType=SR;DupType=-	GT	1
contig_1	768	HDR1	N	<HDR>	.	PASS	END=919;ChrB=contig_1;StartB=792;EndB=972;Parent=SYN1;VarType=ShV;DupType=.	GT	1
contig_1	920	SYNAL2	N	<SYNAL>	.	PASS	END=1474;ChrB=contig_1;StartB=973;EndB=1491;Parent=SYN1;VarType=.;DupType=.	GT	1
contig_1	1474	HDR2	N	<HDR>	.	PASS	END=1656;ChrB=contig_1;StartB=1491;EndB=1681;Parent=SYN1;VarType=ShV;DupType=.	GT	1
contig_1	1657	SYNAL3	N	<SYNAL>	.	PASS	END=3555;ChrB=contig_1;StartB=1682;EndB=3574;Parent=SYN1;VarType=.;DupType=.	GT	1
contig_1	3556	NOTAL1	N	<NOTAL>	.	PASS	END=5311;ChrB=.;StartB=.;EndB=.;Parent=.;VarType=.;DupType=.	GT	1
contig_1	5312	SYNAL4	N	<SYNAL>	.	PASS	END=5782;ChrB=contig_1;StartB=9096;EndB=9569;Parent=SYN2;VarType=.;DupType=.	GT	1
contig_1	5312	SYN2	N	<SYN>	.	PASS	END=9097;ChrB=contig_1;StartB=9096;EndB=12876;Parent=.;VarType=SR;DupType=-	GT	1
contig_1	5743	SYNAL5	N	<SYNAL>	.	PASS	END=5973;ChrB=contig_1;StartB=9531;EndB=9761;Parent=SYN2;VarType=.;DupType=.	GT	1
contig_1	5938	SYNAL6	N	<SYNAL>	.	PASS	END=6279;ChrB=contig_1;StartB=9724;EndB=10068;Parent=SYN2;VarType=.;DupType=.	GT	1
contig_1	6279	HDR3	N	<HDR>	.	PASS	END=6313;ChrB=contig_1;StartB=10068;EndB=10103;Parent=SYN2;VarType=ShV;DupType=.	GT	1
contig_1	6314	SYNAL7	N	<SYNAL>	.	PASS	END=8374;ChrB=contig_1;StartB=10104;EndB=12161;Parent=SYN2;VarType=.;DupType=.	GT	1
contig_1	8021	SYNAL8	N	<SYNAL>	.	PASS	END=9097;ChrB=contig_1;StartB=11812;EndB=12876;Parent=SYN2;VarType=.;DupType=.	GT	1
contig_1	9098	NOTAL2	N	<NOTAL>	.	PASS	END=9163;ChrB=.;StartB=.;EndB=.;Parent=.;VarType=.;DupType=.	GT	1
contig_1	9164	TRANSAL10	N	<TRANSAL>	.	PASS	END=11455;ChrB=contig_1;StartB=3720;EndB=5987;Parent=TRANS4;VarType=.;DupType=.	GT	1
contig_1	9164	TRANS4	N	<TRANS>	.	PASS	END=12522;ChrB=contig_1;StartB=3720;EndB=7077;Parent=.;VarType=SR;DupType=-	GT	1
contig_1	11251	TDM4	N	<TDM>	.	PASS	END=11455;ChrB=contig_1;StartB=5784;EndB=6011;Parent=TRANS4;VarType=ShV;DupType=.	GT	1
contig_1	11251	TRANSAL11	N	<TRANSAL>	.	PASS	END=12522;ChrB=contig_1;StartB=5806;EndB=7077;Parent=TRANS4;VarType=.;DupType=.	GT	1
contig_1	12523	NOTAL3	N	<NOTAL>	.	PASS	END=12779;ChrB=.;StartB=.;EndB=.;Parent=.;VarType=.;DupType=.	GT	1
contig_1	12780	SYN3	N	<SYN>	.	PASS	END=14381;ChrB=contig_1;StartB=13053;EndB=14642;Parent=.;VarType=SR;DupType=-	GT	1
contig_1	12780	SYNAL9	N	<SYNAL>	.	PASS	END=14381;ChrB=contig_1;StartB=13053;EndB=14642;Parent=SYN3;VarType=.;DupType=.	GT	1
contig_1	14382	NOTAL4	N	<NOTAL>	.	PASS	END=14620;ChrB=.;StartB=.;EndB=.;Parent=.;VarType=.;DupType=.	GT	1
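
To compare the two VCFs above without reading them line by line, one could list only the structural-rearrangement records (those with VarType=SR in the INFO column). The sketch below is a minimal illustration that assumes tab-delimited records laid out as in the output above; the function name `structural_records` is hypothetical and not part of syri.

```python
def structural_records(vcf_lines):
    """Return one dict per record whose INFO field has VarType=SR."""

    def to_int(value):
        # Missing values are written as "." in the VCFs above.
        return None if value in (None, ".") else int(value)

    records = []
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta lines, and the column header
        fields = line.rstrip("\n").split("\t")
        pos, rec_id, alt, info = fields[1], fields[2], fields[4], fields[7]
        info_map = dict(
            item.split("=", 1) for item in info.split(";") if "=" in item
        )
        if info_map.get("VarType") != "SR":
            continue
        records.append({
            "id": rec_id,
            "alt": alt,
            "start": int(pos),
            "end": to_int(info_map.get("END")),
            "start_b": to_int(info_map.get("StartB")),
            "end_b": to_int(info_map.get("EndB")),
        })
    return records
```

Run on the nucmer-based VCF above, this keeps only SYN1; on the promer-based VCF it keeps SYN1, SYN2, TRANS4 and SYN3, which makes the additional translocation easy to spot.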
