load_gff3 miscalculates CDS #60
Comments
I have a few more observations on this; I have no idea if they're expected/normal. I looked into the json that load_gff3 generates as output:
json output attached from following command:
Happy to keep on sleuthing in another direction if needed... Thanks for your help with this! |
Update - after looking at the gffs of a number of models after uploading, it looks like duplicate CDS features are being created. There are always 2x the number of CDS lines in the gff after loading, whether using --disable_cds_recalculation or not. I'm 99% certain it's not a problem with the gffs that I am trying to load - I've tried gffs from multiple sources, and our gff3_QC program doesn't detect any problems. |
And another update - I was able to reproduce this on the Apollo demo site (human, region chr4:111819946..111848259 (28.31 Kb)). Input gff attached. |
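The "2x the number of CDS lines" observation above can be checked mechanically. The following is a minimal sketch, not part of python-apollo: the function names are hypothetical, and it assumes standard nine-column, tab-separated GFF3. It keys CDS rows by coordinates rather than by ID or Parent, since those identifiers may change after loading.

```python
from collections import Counter


def cds_coordinate_counts(gff3_text):
    """Count CDS rows in GFF3 text, keyed by (seqid, start, end, strand)."""
    counts = Counter()
    for line in gff3_text.splitlines():
        # Skip blank lines, directives and comments.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Only well-formed feature rows whose type column is CDS.
        if len(cols) == 9 and cols[2] == "CDS":
            counts[(cols[0], int(cols[3]), int(cols[4]), cols[6])] += 1
    return counts


def duplicated_cds(gff3_text):
    """Return only the CDS coordinates that occur more than once."""
    return {key: n for key, n in cds_coordinate_counts(gff3_text).items() if n > 1}
```

Running `duplicated_cds` on a "before" and an "after" file should show whether the load doubled the CDS rows. Note that alternative transcripts can legitimately share CDS coordinates, so a non-empty result is a hint rather than proof on multi-isoform genes.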
Hi @MonicaPoelchau-USDA , I'll try and find some time this week to look into it. |
Thanks for looking into this, @hexylena ! Let me know if there's anything else I can test/provide. |
Ok, investigating today (sorry again @MonicaPoelchau-USDA). I can reproduce the issue entirely. I'm seeing zero functional differences with and without the no-cds-recalc: the files are functionally identical (modulo UUIDs), so that code path can at least be ignored for now. Arrow sends the exact same data to apollo's addTranscript API.
As for the generated transcripts, I'm getting back duplicated CDSs as well, with only the addition of a number of non-canonical start sites (expected, I'm testing on an ecoli copy).
I can likewise confirm the inability to delete when loading via legacy, that's really ... not great. It would require apollo changes to correct (or in your case maybe manually deleting it from the DB is an option). I've tried deleting the CDSs one by one and got almost to the last one, which now is just stuck there permanently lol.
I can confirm fetching the sequence does nothing; this is due to a 500 internal server error on the apollo side. Genomic and cdna both return sequences, but cds/peptide fail.
Which is seen in the apollo codebase here. I don't understand why this would be an error, having more than 1 child feature. Your gene model is gff3 spec compliant. Uploading the GFF3 and manually creating the annotation works perfectly, but that's not what we want.
Looking at the uploaded feature in JBrowse's view, it looks right: 6 exons and 6 CDSs. Investigating the websocket call that's made when I drag and drop this feature onto the annotation track is fascinating, it's very different! At the top level an mRNA is sent, along with the following:
Notably it's only exons, except for where the CDS doesn't match the parent exon. So that looks like we're calling the
Ok, so, deleting any CDSs which match their parent exons completely before sending produces the thing that looks the closest to what was copied from the native jbrowse track. But that still fails to get the correct sequence from Apollo.
The thing that finally worked was... dropping ALL CDSs. I think in this case we're getting "lucky" that there weren't any manual modifications made to the CDS regions and apollo's algorithm works, but, yikes, that's not good.
Deleting all CDSs from the source gff3 file does not work (weirdly); instead I need this patch to get the loading to work at all:
diff --git a/apollo/util.py b/apollo/util.py
index 802625b..11529cd 100644
--- a/apollo/util.py
+++ b/apollo/util.py
@@ -121,6 +121,28 @@ def _yieldGeneData(gene, disable_cds_recalculation=False, use_name=False):
     # # TODO: handle description
     # # TODO: handle GO, Gene Product, Provenance

+    def __floc(location):
+        return f"{location['fmin']}-{location['fmax']}-{location['strand']}"
+
+    for child1 in current['children']:
+        exon_regions = []
+        for child1 in current['children']:
+            for child in child1['children']:
+                print(child)
+                if child['type']['name'] == 'exon':
+                    exon_regions.append(__floc(child['location']))
+        new_current_children = []
+        for child in child1['children']:
+            if child['type']['name'] == 'CDS':
+                continue
+                nnn = __floc(child['location'])
+                if nnn not in exon_regions:
+                    new_current_children.append(child)
+            else:
+                new_current_children.append(child)
+        child1['children'] = new_current_children
+        print(exon_regions)
+
     if 'children' in current and gene.type == 'gene':
         # Only sending mRNA level as apollo is more comfortable with orphan mRNAs
         return current['children']

I don't love any of this 🙃 I'm not sure what we could change to make Apollo happier here, short of just removing the CDSs blindly. For whatever reason sending CDSs causes problems |
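For readers following along, the two strategies tried in that comment (drop only the CDSs that coincide exactly with an exon, or drop every CDS) can be written as a small standalone function. This is a sketch only, not the code on the branch: `strip_cds` and its `only_exon_matching` parameter are hypothetical names, and the dict shape (`type.name`, `location.fmin/fmax/strand`) is assumed from the patch above.

```python
def strip_cds(transcript, only_exon_matching=False):
    """Return a copy of a transcript dict with CDS children removed.

    With only_exon_matching=True, a CDS is removed only when its location
    is identical to that of one of the transcript's exons.
    """
    def loc_key(feature):
        loc = feature["location"]
        return (loc["fmin"], loc["fmax"], loc["strand"])

    children = transcript.get("children", [])
    exon_keys = {loc_key(c) for c in children if c["type"]["name"] == "exon"}

    kept = []
    for child in children:
        is_cds = child["type"]["name"] == "CDS"
        if is_cds and (not only_exon_matching or loc_key(child) in exon_keys):
            continue  # drop this CDS
        kept.append(child)

    # Shallow copy, so the caller's transcript keeps its original child list.
    return {**transcript, "children": kept}
```

With the default arguments this mirrors the "dropping ALL CDSs" approach, which leaves Apollo to recalculate the coding region on its own; that is only safe when the original CDS boundaries match what Apollo would compute.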
Thanks @hexylena ! I pulled your branch, was able to add the annotations, and they loaded correctly - I was able to retrieve the protein and CDS sequence. I was also able to load them correctly using the main branch and deleting all CDS features from the gff3 file beforehand, as you mentioned. I think this will work for us in most cases since our CDS's should be pretty vanilla (starting in phase 0). |
I wasn't around when this was written, but this is my best guess of what's going on here: What Apollo is expecting here for CDS is a single feature that spans the whole coding region. In the sample GFF3 file, all the CDSs have the same
With this patch, I'm able to load the GFF3 correctly, and it contains the CDS info from the file:
diff --git a/apollo/util.py b/apollo/util.py
--- a/apollo/util.py
+++ b/apollo/util.py
@@ -122,6 +122,25 @@ def _yieldGeneData(gene, disable_cds_recalculation=False, use_name=False):
     # # TODO: handle GO, Gene Product, Provenance

     if 'children' in current and gene.type == 'gene':
+        for mRNA in current['children']:
+            new_mRNA_children = []
+            new_cds = None
+            for feature in mRNA['children']:
+                if feature['type']['name'] == 'CDS':
+                    if new_cds:
+                        new_cds_start = new_cds['location']['fmin']
+                        new_cds_end = new_cds['location']['fmax']
+                        this_cds_start = feature['location']['fmin']
+                        this_cds_end = feature['location']['fmax']
+                        new_cds['location']['fmin'] = min(new_cds_start, this_cds_start)
+                        new_cds['location']['fmax'] = max(new_cds_end, this_cds_end)
+                    else:
+                        new_cds = feature
+                else:
+                    new_mRNA_children.append(feature)
+            if new_cds:
+                mRNA['children'] = new_mRNA_children
+                mRNA['children'].append(new_cds)
         # Only sending mRNA level as apollo is more comfortable with orphan mRNAs
         return current['children']
     else:
|
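The idea in that patch, collapsing all CDS segments of a transcript into one feature that spans the whole coding region, can also be tried outside of python-apollo. Below is a minimal sketch under the same assumed dict shape; `merge_cds` is a hypothetical name, and unlike the patch it works on a copy rather than mutating the input.

```python
import copy


def merge_cds(transcript):
    """Return a copy of a transcript dict whose CDS children are collapsed
    into a single CDS running from the smallest fmin to the largest fmax."""
    merged = None
    others = []
    for child in transcript.get("children", []):
        if child["type"]["name"] == "CDS":
            if merged is None:
                # The first CDS seeds the merged feature; copy it so the
                # original child is left untouched.
                merged = copy.deepcopy(child)
            else:
                loc = merged["location"]
                loc["fmin"] = min(loc["fmin"], child["location"]["fmin"])
                loc["fmax"] = max(loc["fmax"], child["location"]["fmax"])
        else:
            others.append(child)
    if merged is not None:
        others.append(merged)
    return {**transcript, "children": others}
```

For example, CDS segments at 110-200, 300-400 and 500-560 become one CDS at 110-560, while the exons still carry the intron structure. Any attributes other than `location` are taken from the first CDS encountered.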
Oh wow @garrettjstevens thanks for looking into it as well!
that is a fascinating interpretation of the spec.
Ok, we'll merge that in behind a flag then! |
I just encountered the same problem. The fixes you wrote allowed me to properly import the gff, but I now have several exons/transcripts I cannot delete (I get the same error as @mpoelchau). I'd appreciate any pointers on how to solve this! |
As a temporary hack, we have been deleting CDS features from the gff3 before loading them. This is not ideal though (esp if your CDS start and stop don't follow Apollo's assumptions when it recalculates them). |
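The temporary hack described above (removing CDS features from the gff3 before loading) could look roughly like this. It is a sketch that assumes standard tab-separated GFF3; `drop_cds_lines` is a hypothetical helper, not part of arrow or python-apollo.

```python
def drop_cds_lines(gff3_text):
    """Return GFF3 text with every CDS feature row removed.

    Directives, comments, FASTA lines and all other feature rows are kept.
    """
    kept = []
    for line in gff3_text.splitlines():
        cols = line.split("\t")
        if not line.startswith("#") and len(cols) == 9 and cols[2] == "CDS":
            continue  # skip the CDS row
        kept.append(line)
    return "\n".join(kept) + "\n"
```

The filtered text can then be written to a new file and passed to `arrow annotations load_gff3` as usual. As noted in the comment, Apollo then recalculates the CDS itself, so this only suits models whose coding region matches Apollo's assumptions.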
Alright, thanks. Related question: were you able to delete the user-created annotations that you uploaded via |
@aathbt I was not able to remove the undeletable exons. The best result I had was by removing the CDSs; then one could get the exon almost deleted, but something was still wrong in the database. I'm not sure it's something we can fix via the API either |
We've been having trouble with load_gff3 - I initially posted this on the Apollo repo (GMOD/Apollo#2662), reposting it here now. Thanks for considering this, I'd appreciate any pointers!
We are trying to use the python-apollo arrow annotations load_gff3 command to load annotations to the user-created annotations track. It is changing the CDS locations of the model, both with and without the --disable_cds_recalculation option.
Here is what a load without --disable_cds_recalculation looks like; the correct frame can be seen in the track below.
The gff3 that was used to load the annotation has 6 CDS lines; the gff3 for the uploaded annotation has 12 CDS lines (even though the view shows only one CDS segment). Apollo also won't calculate a protein or CDS sequence on the uploaded annotation.
Here is what a load with --disable_cds_recalculation looks like (command: arrow annotations load_gff3 --source https://apollo2-stage-node1-cbo.nal.usda.gov/apollo Anoplophora_glabripennis ~/Downloads/NW_019416298.gff3 --disable_cds_recalculation).
Again, the gff3 for the uploaded annotation in Apollo has 12 CDS lines instead of 6. Apollo also won't calculate a protein or CDS sequence on the uploaded annotation.
I'll note that if you run the same command multiple times, the single CDS will display in a different spot each time.
If you load the annotation by dragging it up, it loads correctly:
This is happening for many (but not all) annotations in multiple assemblies/organisms.
Some other observations:
I've attached "before" and "after" gff3s. (Used .txt extension because GitHub wouldn't let me upload otherwise.)
before.txt
after-nocdsrecalc.txt
after.txt
Provide the javascript console log output generated from the action.
None.
Provide the server log output generated from the action (typically catalina.out).
Nothing is added to catalina.out when I add the annotations.