Issues with multiple transcripts: MEROPs, BUSCO, PFAM and dbCAN/CAZyme annotations missing for T2 transcripts from final annotation file #1055

Open
calizilla opened this issue Jul 12, 2024 · 0 comments


All transcripts in the final annotations.txt file produced by funannotate 'annotate' have been converted to T1 transcripts, even though the input files contain both T1 and T2. In other final output files (e.g. proteins.fa, gff3, gbk) the T2 transcript IDs have been retained. The ID conversion itself is not a huge problem and is easily fixed if desired; the major problem is that all T2 transcripts are missing their annotations from BUSCO, dbCAN/CAZyme, PFAM and MEROPs.

The unifying feature of these 4 database annotations is that they were executed by the funannotate 'annotate' step, and not executed manually (see below).

I ran the funannotate workflow on a fungal genome on HPC. Due to some issues at various steps, I ran some of the annotations manually (i.e. with a separate custom script, not executed as part of funannotate), being careful to follow the same parameters as applied in the funannotate Python code. Manual annotations were run for:

  • coding quarry
  • phobius
  • antismash
  • interproscan
  • eggnog

At the antiSMASH step, I encountered a previously described error caused by multiple transcripts. I followed the suggestion by @sunnycqcn in this antiSMASH issue to use agat to keep only the longest transcript, and then ran 'funannotate fix' to update the gbk and tbl files. After agat, there were 579 T2 transcripts and the remainder were T1.
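A count like the one above can be reproduced with a short script. This is a minimal sketch, assuming the protein FASTA headers begin with the transcript ID and that IDs end in a `-T<n>` suffix (for example `FUN_000001-T2`, a made-up ID); the function name is hypothetical and is not part of funannotate.

```python
import re
from collections import Counter

# Assumption: transcript IDs end in "-T<number>", e.g. "FUN_000001-T2".
SUFFIX = re.compile(r"-(T\d+)$")


def count_transcript_suffixes(fasta_lines):
    """Count FASTA records per transcript suffix (T1, T2, ...)."""
    counts = Counter()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].split()
        if not fields:
            continue  # empty header
        match = SUFFIX.search(fields[0])
        counts[match.group(1) if match else "other"] += 1
    return counts
```

The function accepts any iterable of lines, so an open file handle for proteins.fa can be passed in directly.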

After completing the funannotate workflow, I happened to notice that some genes were missing annotations that were present in the manual annotation output files, and that in all cases these were genes with the 'T2' designation in the proteins.fa file. I wrote a script to check the annotations in the 'annotate_misc' directory against the annotations in the final 'annotate_results/annotations.txt' file for the 579 T2 genes. Every T2 gene that had an annotation in 'annotate_misc' from any of BUSCO, PFAM, MEROPs or dbCAN was missing that annotation in 'annotate_results/annotations.txt'.
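The check described above could look roughly like the following. This is a sketch and not the script I actually ran. It assumes that each 'annotate_misc' file is tab-separated with the transcript ID in the first column, that the final table is tab-separated with a header row, and that transcript IDs are the gene ID plus a `-T<n>` suffix. The column names are passed in as parameters because they are assumptions too. Since the final table only lists T1 IDs, the comparison is made on the gene ID.

```python
import csv
import re

# Assumed ID pattern: gene ID followed by "-T<number>", e.g. "FUN_000001-T2".
TRANSCRIPT_SUFFIX = re.compile(r"-T\d+$")


def gene_id(transcript_id):
    """Strip the transcript suffix to get the gene ID."""
    return TRANSCRIPT_SUFFIX.sub("", transcript_id)


def misc_transcripts(lines):
    """Transcript IDs with at least one row in an annotate_misc-style file."""
    rows = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row[0] for row in rows if row and row[0]}


def annotated_genes(lines, gene_column, value_column):
    """Gene IDs whose value_column is non-empty in the final table."""
    rows = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {
        row[gene_column]
        for row in rows
        if (row.get(value_column) or "").strip()
    }


def missing_from_final(misc_ids, final_genes, suffix="-T2"):
    """Transcripts with the suffix that have a misc hit but no final value."""
    return sorted(
        t for t in misc_ids
        if t.endswith(suffix) and gene_id(t) not in final_genes
    )
```

Running this once per database (one 'annotate_misc' file against the matching column of the final table) gives the list of T2 transcripts whose annotation was dropped.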

Providing pre-computed annotation files to the funannotate 'annotate' step is a valid approach, using the parameters:

  --eggnog             Eggnog-mapper annotations file (if NOT installed)
  --antismash          antiSMASH secondary metabolism results (GBK file from output)
  --iprscan            InterProScan5 XML file
  --phobius            Phobius pre-computed results (if phobius NOT installed)

So while I can't be sure this bug would occur for users relying solely on the funannotate codebase (i.e. not executing some of the annotations manually), it seems likely that it may affect others who have had to perform steps manually, as I did.

One proposed solution would be to adjust the way the 'annotate' step treats the transcript IDs, and not perform a conversion of everything to T1 when compiling all the annotations.
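To illustrate the kind of mismatch I suspect, here is a purely hypothetical sketch. It is not taken from the funannotate source, and I have not confirmed that the code works this way. It assumes annotations are stored in a dict keyed by transcript ID, and contrasts a lookup that hard-codes the `-T1` suffix with one that accepts any transcript of the gene.

```python
def lookup_assuming_t1(annotations, gene):
    """Hypothetical lookup that assumes the transcript is always '<gene>-T1'."""
    return annotations.get(f"{gene}-T1", [])


def lookup_by_gene(annotations, gene):
    """Alternative: collect hits from any transcript belonging to the gene."""
    prefix = f"{gene}-T"
    hits = []
    for transcript_id, values in annotations.items():
        if transcript_id.startswith(prefix):
            hits.extend(values)
    return hits
```

With a gene whose only retained transcript is T2, the first lookup returns nothing while the second still finds the hit, which matches the pattern of missing annotations seen here.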
