Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

issue with overlapping amplicons #7

Open
hudenise2 opened this issue Oct 29, 2021 · 5 comments
Open

issue with overlapping amplicons #7

hudenise2 opened this issue Oct 29, 2021 · 5 comments

Comments

@hudenise2
Copy link

I run cojac on wastewater samples which are sequenced using NimagenV3 primer amplicons. It worked well for most variant profile however I encountered this error while using a profile which has co-occurrence in amplicon 120:
File "/usr/local/bin/cooc-mutbamscan2", line 245, in
table[sample]=scanbam(alnfname, amplicons)
File "/usr/local/bin/cooc-mutbamscan2", line 155, in scanbam
amplicon_iter = alnfile.fetch(rq_chr, rq_b, rq_e)
File "pysam/libcalignmentfile.pyx", line 1091, in pysam.libcalignmentfile.AlignmentFile.fetch
File "pysam/libchtslib.pyx", line 688, in pysam.libchtslib.HTSFile.parse_region
ValueError: invalid coordinates: start (22867) > stop (22846)

and here are the corresponding NimagenV3 insert coordinates:
amplicon_119 [22595-22837]
amplicon_120 [22804-22976]
amplicon_121 [22876-23119]
amplicon_122 [23072-23288]
According to the cooc-mutbamscan script, the boundaries are defined by +30 from previous amplicon end (22867) and -30 from next amplicon start (22846) which lead to this error.
Could you implement a fix for it?
Irene Bassano with you had contact with about cojac used to work with me. Thanks Hubert

@ggrimes
Copy link

ggrimes commented Dec 1, 2021

I am also having the same error using the subArtic primers set.

MN908947.3 23193 23400 17 1 +
MN908947.3 23290 23463 18 2 +
MN908947.3 23408 23579 19 1 +

amplicon_18_EU	23430	23378	{23403: 'G'}	Traceback (most recent call last):
  File "/gpfs/igmmfs01/eddie/COVID-WW/scratch/cojac/cojac/./cooc-mutbamscan2", line 323, in <module>
    table[sample]=scanbam(alnfname, amplicons,rq_chr)
  File "/gpfs/igmmfs01/eddie/COVID-WW/scratch/cojac/cojac/./cooc-mutbamscan2", line 168, in scanbam
    amplicon_iter = alnfile.fetch(rq_chr, rq_b, rq_e)
  File "pysam/libcalignmentfile.pyx", line 1082, in pysam.libcalignmentfile.AlignmentFile.fetch
  File "pysam/libchtslib.pyx", line 688, in pysam.libchtslib.HTSFile.parse_region
ValueError: invalid coordinates: start (23430) > stop (23378)

If I change the cooc-mutbamscan (dev branch) script by removing the -30 +30 it runs, but I assume this would lead to incorrect results.

# make query start and query end for each amplicon
amp_bed["qstart"] = pd.concat([amp_bed["start"][0:1], amp_bed["stop"][:-1]]).reset_index(drop=True) + 0
amp_bed["qstop"] = pd.concat([amp_bed["start"][1:], amp_bed["stop"][-1:]]).reset_index(drop=True) - 0

@ggrimes
Copy link

ggrimes commented Dec 16, 2021

the latest patch 247ab05 seems to fix the issue

@DrYak
Copy link
Member

DrYak commented Dec 21, 2021

Yup, it looks like this commit manager to fix it.


A bit of background about this optimization/hack

In samtools' libhts and in pysam if you request between the start and end of a window, the library will return all the read-pairs in that partially overlap that window.
In our case, if we submit a request spanning the whole amplicon's window, we will get all the read-pairs of that amplicon, BUT because the multiplex PCR are tiled, we would get also the read-pairs from neighbouring amplicons.

e.g.: (where b: some base and p: base of a primer)

bbbbbbbbbbppp----------------ppppbbbbbbbbb : amp 119 and 121
----pppbbbbbbbbbbbbbbbbbbbbbbbbbbbbppp---- : amp 120

       |--------------------------| : request

With such a request we would get all the (whole) reads-pairs of all 3 amplicons.
We would need to then check for each read received if it actually covers the mutation position.

Whereas using the neighbouring amplicon start/stop and skipping -30/+30 to skip the primers:

bbbbbbbbbbppp----------------ppppbbbbbbbbb : amp 119 and 121
----pppbbbbbbbbbbbbbbbbbbbbbbbbbbbbppp---- : amp 120
         | + 30            30 - | : skip
               |---------| : request

The only read-pairs that overlap with the request are the ones from amplicon 120 We will get only (whole) read-pairs of that amplicon, so we don't need to filter-out reads that absolutely don't cover the positions.

But in Nimagen, the amplicon are so short that our usual -30/+30 overshoots and crosses.
We fixed it (doing +5/-5 from the centre in that case).

bbbbbbbbbbppp----------------ppppbbbbbbbbb : amp 119 and 121
----pppbbbbbbbbbbbbbbbbbbbbbbbbbbbbppp---- : amp 120
                5 - | + 5  : skip
               |---------| : request

Caveat regarding amplicon lenght

source: doi:10.1101/2021.05.26.21257861

Other groups have shown that the optimal amplicon length for wastewater seem to be around ~400bp as done by, e.g., the Artic protocols.
With the much shorter amplicons like Nimagen, you might have less opportunities for cooccurrences of mutations.
(and with much longer amplicons, there might be fewer fragments as large that manage to survive through the wastewater all the way to the sequencing)

@ggrimes
Copy link

ggrimes commented Jan 7, 2022

Thanks that makes it clear. Unfortunately, my amplicons are very short and this centering method would still retrieve PE reads that overlap multiple amplicons. Maybe I can create a custom insert bed file that has nonoverlapping amplicon regions. Or reduce the window to a single base.

@DrYak
Copy link
Member

DrYak commented Jan 7, 2022

From the theory point of view:

  • the paired end are then scanned for positions present. The accidentally picked up neighbours will show as carrying 0 position of interest and not show up in your statistics.
  • thus they will not perturbate your result, just be more expensive because they'll need to get analysed in detail instead of being filtered out by the "custom request" trick. (You'll just lose some speed benefit of the approach).

BUT! in practice:

  • please check that in such short read pairs you manage to have actually usable cooccurrence.

Since commit d505a5d before Christmas, there are new options to get this requests in a YAML file:

> ./cooc-mutbamscan --help
usage: cooc-mutbamscan [-h] [-s TSV | -a BAM/CRAM [BAM/CRAM ...]] [-/ [BATCHNAME]] [-p PATH] [-r REFID] [-m DIR] [-b BED] [-# COOC] [-Q YAML | -A YAML] [-j JSON] [-y YAML] [-t TSV] [-d]

scan amplicon (covered by long read pairs) for mutation cooccurrence

optional arguments:
  -h, --help            show this help message and exit

  -Q YAML, --in-amp YAML, --amplicons YAML
                        use the supplied YAML file to query amplicons instead of building it from BED + voc's DIR
  -A YAML, --out-amp YAML, --out-amplicons YAML
                        output amplicon query in a YAML file

use the --out-amplicons option and look at the 5th column in this file: it's a map (a dictionary) of all mutation visible on that position.

For example with Artic V4 and Omicron, this is what we get:

75_om: [22428, 22785, 22504, 22647, {22578: A, 22673: C, 22674: T, 22679: C}]
76_om: [22677, 23028, 22815, 22944, {22679: C, 22813: T, 22882: G, 22898: A, 22992: A,
    23013: C}]
77_om: [22974, 23327, 23058, 23216, {22992: A, 23013: C, 23040: G, 23048: A, 23055: G,
    23202: A}]
78_om: [23246, 23611, 23357, 23545, {23525: T, 23599: G}]
88_om: [26277, 26635, 26368, 26557, {26530: G, 26577: G}]

Two amplicons (78, 88) carry two mutation, the three other (75, 76, 77) carry many more.

Another separate thing to watch out for:
beware that your amplicon's primer don't happen to map to a heavily mutated or a deleted part of your variant of interest, otherwise you're going to have drop out. (e.g.: the primer that correspond to Artic V3's amplicons 75 and 76 have trouble binding to omicron and thus we lose information there).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants