BUSCO-mediated training failure in 1.8.17, but not in 1.8.15 #1071

Open
JWDebler opened this issue Oct 7, 2024 · 11 comments

Comments

@JWDebler

JWDebler commented Oct 7, 2024

I just installed 1.8.17 on a new system and am going through the test pipeline fixing errors.

I am not sure what to do about the current one, though.

It happens during BUSCO-mediated training. However, the same step finishes fine on my old machine with 1.8.15 (below)

1.8.17:

Running `funannotate predict` BUSCO-mediated training unit testing
CMD: funannotate predict -i test.softmasked.fa --protein_evidence protein.evidence.fasta -o annotate --cpus 32 --species Awesome busco
#########################################################
-------------------------------------------------------
[Oct 07 05:38 AM]: OS: Ubuntu 24.04, 32 cores, ~ 247 GB RAM. Python: 3.9.19
[Oct 07 05:38 AM]: Running funannotate v1.8.17
[Oct 07 05:38 AM]: Skipping CodingQuarry as no --rna_bam passed
[Oct 07 05:38 AM]: Parsed training data, run ab-initio gene predictors as follows:
  Program      Training-Method
  augustus     busco
  genemark     selftraining
  glimmerhmm   busco
  snap         busco
[Oct 07 05:38 AM]: Loading genome assembly and parsing soft-masked repetitive sequences
[Oct 07 05:38 AM]: Genome loaded: 6 scaffolds; 3,776,588 bp; 19.75% repeats masked
/data/mamba_envs/envs/funannotate/lib/python3.9/site-packages/funannotate/aux_scripts/funannotate-p2g.py:14: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  from pkg_resources import parse_version
[Oct 07 05:38 AM]: Mapping 1,065 proteins to genome using diamond and exonerate
[Oct 07 05:38 AM]: Found 1,505 preliminary alignments with diamond in 0:00:01 --> generated FASTA files for exonerate in 0:00:00
     Progress: 1505 complete, 0 failed, 0 remaining
[Oct 07 05:38 AM]: Exonerate finished in 0:00:10: found 1,270 alignments
[Oct 07 05:38 AM]: Running GeneMark-ES on assembly
[Oct 07 05:39 AM]: 1,566 predictions from GeneMark
[Oct 07 05:39 AM]: Running BUSCO to find conserved gene models for training ab-initio predictors
[Oct 07 05:42 AM]: 370 valid BUSCO predictions found, validating protein sequences
[Oct 07 05:42 AM]: 189 BUSCO predictions validated
[Oct 07 05:42 AM]: Not enough gene models 189 to train Augustus (200 required), exiting
#########################################################
Traceback (most recent call last):
  File "/data/mamba_envs/envs/funannotate/bin/funannotate", line 10, in <module>
    sys.exit(main())
  File "/data/mamba_envs/envs/funannotate/lib/python3.9/site-packages/funannotate/funannotate.py", line 717, in main
    mod.main(arguments)
  File "/data/mamba_envs/envs/funannotate/lib/python3.9/site-packages/funannotate/test.py", line 407, in main
    runBuscoTest(args)
  File "/data/mamba_envs/envs/funannotate/lib/python3.9/site-packages/funannotate/test.py", line 200, in runBuscoTest
    assert 1500 <= countGFFgenes(os.path.join(
  File "/data/mamba_envs/envs/funannotate/lib/python3.9/site-packages/funannotate/test.py", line 45, in countGFFgenes
    with open(input, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'test-busco_07886a94-a47b-4261-a14b-f81f46fe307d/annotate/predict_results/Awesome_busco.gff3'

1.8.15:

#########################################################
Running `funannotate predict` BUSCO-mediated training unit testing
CMD: funannotate predict -i test.softmasked.fa --protein_evidence protein.evidence.fasta -o annotate --cpus 16 --species Awesome busco
#########################################################
-------------------------------------------------------
[Oct 07 05:26 AM]: OS: Ubuntu 18.04, 16 cores, ~ 66 GB RAM. Python: 3.8.15
[Oct 07 05:26 AM]: Running funannotate v1.8.15
[Oct 07 05:26 AM]: Skipping CodingQuarry as no --rna_bam passed
[Oct 07 05:26 AM]: Parsed training data, run ab-initio gene predictors as follows:
  Program      Training-Method
  augustus     busco
  genemark     selftraining
  glimmerhmm   busco
  snap         busco
[Oct 07 05:26 AM]: Loading genome assembly and parsing soft-masked repetitive sequences
[Oct 07 05:26 AM]: Genome loaded: 6 scaffolds; 3,776,588 bp; 19.75% repeats masked
[Oct 07 05:26 AM]: Mapping 1,065 proteins to genome using diamond and exonerate
[Oct 07 05:26 AM]: Found 1,505 preliminary alignments with diamond in 0:00:01 --> generated FASTA files for exonerate in 0:00:00
     Progress: 1505 complete, 0 failed, 0 remaining
[Oct 07 05:26 AM]: Exonerate finished in 0:00:10: found 1,270 alignments
[Oct 07 05:26 AM]: Running GeneMark-ES on assembly
[Oct 07 05:27 AM]: 1,558 predictions from GeneMark
[Oct 07 05:27 AM]: Running BUSCO to find conserved gene models for training ab-initio predictors
[Oct 07 05:30 AM]: 370 valid BUSCO predictions found, validating protein sequences
[Oct 07 05:31 AM]: 367 BUSCO predictions validated
[Oct 07 05:31 AM]: Training Augustus using BUSCO gene models
[Oct 07 05:31 AM]: Augustus initial training results:
  Feature       Specificity   Sensitivity
  nucleotides   99.4%         83.8%
  exons         63.2%         52.6%
  genes         76.7%         51.4%
[Oct 07 05:31 AM]: Running Augustus gene prediction using awesome_busco parameters
     Progress: 11 complete, 0 failed, 0 remaining
[Oct 07 05:31 AM]: 1,284 predictions from Augustus
[Oct 07 05:31 AM]: Pulling out high quality Augustus predictions
[Oct 07 05:31 AM]: Found 306 high quality predictions from Augustus (>90% exon evidence)
[Oct 07 05:31 AM]: Running SNAP gene prediction, using training data: annotate/predict_misc/busco.final.gff3
[Oct 07 05:32 AM]: 1,391 predictions from SNAP
[Oct 07 05:32 AM]: Running GlimmerHMM gene prediction, using training data: annotate/predict_misc/busco.final.gff3
[Oct 07 05:32 AM]: 1,775 predictions from GlimmerHMM
[Oct 07 05:32 AM]: Summary of gene models passed to EVM (weights):
  Source         Weight   Count
  Augustus       1        978
  Augustus HiQ   2        306
  GeneMark       1        1558
  GlimmerHMM     1        1775
  snap           1        1391
  Total          -        6008

As can be seen, both versions find the same number of BUSCO predictions (370), but 1.8.17 validates only 189 of them. Because Augustus training requires at least 200 gene models, `funannotate predict` exits, and the test then fails because the expected GFF3 file was never written.
1.8.15, however, validates 367.
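
The comparison above can be reproduced from the two logs. The following is a minimal sketch, not part of funannotate: the helper name `busco_counts` is hypothetical, and the regular expressions assume the log wording stays exactly as shown in the two runs above.

```python
import re

# Log lines copied from the two runs above.
LOG_1_8_17 = """\
[Oct 07 05:42 AM]: 370 valid BUSCO predictions found, validating protein sequences
[Oct 07 05:42 AM]: 189 BUSCO predictions validated
"""
LOG_1_8_15 = """\
[Oct 07 05:30 AM]: 370 valid BUSCO predictions found, validating protein sequences
[Oct 07 05:31 AM]: 367 BUSCO predictions validated
"""

MIN_MODELS = 200  # threshold quoted in the 1.8.17 log ("200 required")


def busco_counts(log_text):
    """Return (found, validated) parsed from funannotate predict log text.

    Hypothetical helper: it only pattern-matches the two log messages above.
    """
    found = re.search(r"([\d,]+) valid BUSCO predictions found", log_text)
    validated = re.search(r"([\d,]+) BUSCO predictions validated", log_text)
    if not (found and validated):
        raise ValueError("BUSCO count lines not found in log")
    # Counts may contain thousands separators elsewhere in the log, so strip commas.
    return (int(found.group(1).replace(",", "")),
            int(validated.group(1).replace(",", "")))


for label, log in (("1.8.17", LOG_1_8_17), ("1.8.15", LOG_1_8_15)):
    found, validated = busco_counts(log)
    status = "enough" if validated >= MIN_MODELS else "too few"
    print(f"{label}: {validated} of {found} validated, {status} to train Augustus")
```

Running the same function over a saved log file makes it easy to check whether a reinstall changed the validated count without rerunning the whole test suite by eye.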

Versions:

-------------------------------------------------------
Checking dependencies for 1.8.17
-------------------------------------------------------
You are running Python v 3.9.19. Now checking python packages...
biopython: 1.79
goatools: 1.4.12
matplotlib: 3.9.2
natsort: 8.4.0
numpy: 1.26.4
pandas: 2.2.3
psutil: 6.0.0
requests: 2.32.3
scikit-learn: 1.5.2
scipy: 1.13.1
seaborn: 0.13.2
All 11 python packages installed


You are running Perl v b'5.032001'. Now checking perl modules...
Carp: 1.50
Clone: 0.46
DBD::SQLite: 1.72
DBD::mysql: 4.050
DBI: 1.643
DB_File: 1.858
Data::Dumper: 2.183
File::Basename: 2.85
File::Which: 1.24
Getopt::Long: 2.58
Hash::Merge: 0.302
JSON: 4.10
LWP::UserAgent: 6.67
Logger::Simple: 2.0
POSIX: 1.94
Parallel::ForkManager: 2.03
Pod::Usage: 1.69
Scalar::Util::Numeric: 0.40
Storable: 3.15
Text::Soundex: 3.05
Thread::Queue: 3.14
Tie::File: 1.06
URI::Escape: 5.17
YAML: 1.30
local::lib: 2.000029
threads: 2.25
threads::shared: 1.61
All 27 Perl modules installed


Checking Environmental Variables...
$FUNANNOTATE_DB=/data/databases/
$PASAHOME=/data/mamba_envs/envs/funannotate/opt/pasa-2.5.3
$TRINITY_HOME=/data/mamba_envs/envs/funannotate/opt/trinity-2.15.2
$EVM_HOME=/data/mamba_envs/envs/funannotate/opt/evidencemodeler-2.1.0
$AUGUSTUS_CONFIG_PATH=/data/mamba_envs/envs/funannotate/config/
$GENEMARK_PATH=/opt/genemark/current/
All 6 environmental variables are set
-------------------------------------------------------
Checking external dependencies...
PASA: 2.5.3
CodingQuarry: 2.0
Trinity: 2.15.2
augustus: 3.5.0
bamtools: bamtools 2.5.2
bedtools: bedtools v2.31.1
blat: BLAT v39x1
diamond: 2.1.8
emapper.py: 2.1.12
ete3: 3.1.3
exonerate: exonerate 2.4.0
fasta: 36.3.8g
glimmerhmm: 3.0.4
gmap: 2024-09-18
gmes_petap.pl: 4.71_lic
hisat2: 2.2.1
hmmscan: HMMER 3.4 (Aug 2023)
hmmsearch: HMMER 3.4 (Aug 2023)
java: 22.0.1-internal
kallisto: 0.46.1
mafft: v7.526 (2024/Apr/26)
makeblastdb: makeblastdb 2.16.0+
minimap2: 2.28-r1209
pigz: 2.8
proteinortho: 6.3.2
pslCDnaFilter: no way to determine
salmon: salmon 1.10.3
samtools: samtools 1.21
signalp: 6.0
snap: 2006-07-28
stringtie: 2.2.3
tRNAscan-SE: 2.0.12 (Nov 2022)
tantan: tantan 50
tbl2asn: 25.8
tblastn: tblastn 2.16.0+
trimal: trimAl v1.5.rev0 build[2024-05-27]
trimmomatic: 0.39
All 37 external dependencies are installed
-------------------------------------------------------
Checking dependencies for 1.8.15
-------------------------------------------------------
You are running Python v 3.8.15. Now checking python packages...
biopython: 1.81
goatools: 1.2.3
matplotlib: 3.4.3
natsort: 8.3.1
numpy: 1.24.3
pandas: 1.5.3
psutil: 5.9.5
requests: 2.29.0
scikit-learn: 1.2.2
scipy: 1.10.1
seaborn: 0.12.2
All 11 python packages installed


You are running Perl v b'5.032001'. Now checking perl modules...
Carp: 1.50
Clone: 0.46
DBD::SQLite: 1.72
DBD::mysql: 4.046
DBI: 1.643
DB_File: 1.858
Data::Dumper: 2.183
File::Basename: 2.85
File::Which: 1.24
Getopt::Long: 2.54
Hash::Merge: 0.302
JSON: 4.10
LWP::UserAgent: 6.67
Logger::Simple: 2.0
POSIX: 1.94
Parallel::ForkManager: 2.02
Pod::Usage: 1.69
Scalar::Util::Numeric: 0.40
Storable: 3.15
Text::Soundex: 3.05
Thread::Queue: 3.14
Tie::File: 1.06
URI::Escape: 5.12
YAML: 1.30
local::lib: 2.000029
threads: 2.25
threads::shared: 1.61
All 27 Perl modules installed


Checking Environmental Variables...
$FUNANNOTATE_DB=/data/databases/
$PASAHOME=/home/ubuntu/mambaforge/envs/funannotate/opt/pasa-2.5.2
$TRINITY_HOME=/home/ubuntu/mambaforge/envs/funannotate/opt/trinity-2.8.5
$EVM_HOME=/home/ubuntu/mambaforge/envs/funannotate/opt/evidencemodeler-1.1.1
$AUGUSTUS_CONFIG_PATH=/home/ubuntu/mambaforge/envs/funannotate/config/
$GENEMARK_PATH=/opt/genemark/
All 6 environmental variables are set
-------------------------------------------------------
Checking external dependencies...
PASA: 2.5.2
CodingQuarry: 2.0
Trinity: 2.8.5
augustus: 3.5.0
bamtools: bamtools 2.5.1
bedtools: bedtools v2.30.0
blat: BLAT v36x2
diamond: 2.1.6
emapper.py: 2.1.12
ete3: 3.1.2
exonerate: exonerate 2.4.0
fasta: 36.3.8g
glimmerhmm: 3.0.4
gmap: 2023-03-24
gmes_petap.pl: 4.71_lic
hisat2: 2.2.1
hmmscan: HMMER 3.3.2 (Nov 2020)
hmmsearch: HMMER 3.3.2 (Nov 2020)
java: 17.0.3-internal
kallisto: 0.46.1
mafft: v7.520 (2023/Mar/22)
makeblastdb: makeblastdb 2.13.0+
minimap2: 2.26-r1175
pigz: 2.6
proteinortho: 6.2.3
pslCDnaFilter: no way to determine
salmon: salmon 0.14.1
samtools: samtools 1.16.1
signalp: 4.1
snap: 2006-07-28
stringtie: 2.2.1
tRNAscan-SE: 2.0.11 (Oct 2022)
tantan: tantan 40
tbl2asn: 25.8
tblastn: tblastn 2.13.0+
trimal: trimAl v1.4.rev15 build[2013-12-17]
trimmomatic: 0.39
All 37 external dependencies are installed
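
With two dependency reports this long, a small script helps pick out what actually differs. This is a rough sketch under stated assumptions: `parse_versions` and `changed_versions` are hypothetical helpers, the parser assumes every relevant line has the `name: version` shape seen above, and the sample text is a short excerpt of the two reports rather than the full listing.

```python
# Short excerpts copied from the two dependency reports above.
REPORT_1_8_15 = """\
Checking dependencies for 1.8.15
biopython: 1.81
pandas: 1.5.3
augustus: 3.5.0
diamond: 2.1.6
hmmsearch: HMMER 3.3.2 (Nov 2020)
"""
REPORT_1_8_17 = """\
Checking dependencies for 1.8.17
biopython: 1.79
pandas: 2.2.3
augustus: 3.5.0
diamond: 2.1.8
hmmsearch: HMMER 3.4 (Aug 2023)
"""


def parse_versions(report):
    """Parse 'name: version' lines into a dict, skipping all other lines."""
    versions = {}
    for line in report.splitlines():
        name, sep, version = line.partition(": ")
        name = name.strip()
        # Keep only lines that look like 'tool: version' (single-token name).
        if sep and name and " " not in name:
            versions[name] = version.strip()
    return versions


def changed_versions(old, new):
    """Return {name: (old_version, new_version)} for entries that differ."""
    shared = sorted(old.keys() & new.keys())
    return {n: (old[n], new[n]) for n in shared if old[n] != new[n]}


diff = changed_versions(parse_versions(REPORT_1_8_15),
                        parse_versions(REPORT_1_8_17))
for name, (old, new) in diff.items():
    print(f"{name}: {old} -> {new}")
```

Note that a diff like this only compares reported version strings. Two installs that report the same version of a tool will not show up here even if they are different builds.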

Any suggestions?

Cheers.

@nextgenusfs
Owner

Are these two separate Augustus installs? Can you look at the specific build numbers from conda? I'm guessing there is an issue with the build version in the 1.8.17 install. If BUSCO fails, it's almost always an Augustus issue.

@JWDebler
Author

Yes, they're installed on separate virtual machines; both were set up via conda with what was the latest funannotate version at the time.
The one in the 1.8.15 install: augustus 3.5.0 pl5321hf46c7bb_1 bioconda
The one in the 1.8.17 install: augustus 3.5.0 pl5321h95201ac_4 bioconda
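
The two lines above share a version and differ only in the build string. As a minimal sketch, assuming the four-column `name version build channel` layout shown above and the common convention that a conda build string ends in `_<build number>`, the fields can be compared like this (`CondaPackage` and `parse_conda_list_line` are hypothetical names, not part of conda):

```python
from typing import NamedTuple


class CondaPackage(NamedTuple):
    name: str
    version: str
    build: str
    channel: str

    @property
    def build_number(self) -> int:
        # Assumes the build string ends in "_<build number>", as in both lines above.
        return int(self.build.rsplit("_", 1)[1])


def parse_conda_list_line(line):
    """Split one 'name version build channel' line into a CondaPackage."""
    name, version, build, channel = line.split()
    return CondaPackage(name, version, build, channel)


# Lines copied from the comment above.
old = parse_conda_list_line("augustus 3.5.0 pl5321hf46c7bb_1 bioconda")
new = parse_conda_list_line("augustus 3.5.0 pl5321h95201ac_4 bioconda")

print("same version:", old.version == new.version)
print("same build:  ", old.build == new.build)
print("build numbers:", old.build_number, "vs", new.build_number)
```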

@nextgenusfs
Owner

Okay, so the _4 build is the problem. Force-install the _1 build and it should work.

@JWDebler
Author

I just tried that and it complained:

mamba install augustus=3.5.0=pl5321hf46c7bb_1

image

@nextgenusfs
Owner

Would this work? mamba install "augustus==3.5.0,!=3.5.0=pl5321h95201ac_4"

@nextgenusfs
Owner

In your particular case, you can just install funannotate via pip in your v1.8.15 environment, which has a working Augustus installation, i.e.:

python -m pip install "funannotate==1.8.17"

@JWDebler
Author

> Would this work? mamba install "augustus==3.5.0,!=3.5.0=pl5321h95201ac_4"

The following package could not be installed
└─ augustus ==3.5.0,!=3.5.0 pl5321h95201ac_4 does not exist (perhaps a typo or a missing channel).

That syntax is incorrect.

> In your particular case, you can just install funannotate via pip in your v1.8.15 environment which has a working augustus installation. ie

I probably could, but that machine is about to get deleted, which is why I am setting everything up on a new one.

I have tried installing other Augustus builds, but keep running into the same libboost dependency problems.
If I remember correctly, libboost was also a problem when I set up this environment and had to be installed separately after installing funannotate via conda.

I am currently running a few full genomes through the pipeline and everything works fine; it's just the BUSCO validation that returns fewer validated genes than the previous version did.

image

@nextgenusfs
Owner

Frustrating!

You can certainly compile Augustus manually and link it to the conda environment. I actually use a dockerized version locally, as none of them will work on Apple silicon: https://github.com/nextgenusfs/dockerized-augustus. This is hacky, but it works; I just put the dockerized scripts in the PATH.

@mencian

mencian commented Oct 24, 2024

Jumping in here: I've rebuilt Augustus here. Could you check whether the new Augustus build fixes the issue?

@JWDebler
Author

Installed your latest Augustus build, but the test pipeline still fails due to validating too few BUSCO models.

image

@nextgenusfs
Owner

Okay, digging into this more. I set up a Docker install of funannotate v1.8.17, installed via conda, in order to test. The filtering step extracts the protein sequences and then does an all-vs-all comparison to ensure that all gene calls are more than 80% divergent. What I'm seeing in that step is this, which is clearly wrong:

>gene364.t1 gene364
MCGIFAAFKHEDIHNFKPKALQLSKKIRHRGPDWSGNAVMNSTIFVHERLAIVGLDSGAQPITSADGEYMLGVNGEIYNH
IQLREMCSDYKFQTFSDCEPIIPLYLEHDIDAPKYLDGMFAFCLYDSKKDRIVAARDPIGVVTLYMGRSSQSPETVYFAS
ELKCLTDVCDSIISFPPGHVYDSETDKITRYFTPDWLDEKRIPSTPVDYHAIRHSLEKAVRKRLMAEVPYGVLLSGGLDS
SLIAAIAARETEKANADANEDNNVDEKQLAGIDDQGHLHTSGWSRLHSFAIGLPNAPDLQAARKVAKFIGSIHHEHTFTL
QEGLDALDDVIYHLETYDVTTIRASTPMFLLSRKIKAQGVKMVLSGEGSDEIFGGYLYFAQAPSAAEFHTESVQRVKNLH
LADCLRANKSTMAWGLEARVPFLDKDFLQLCMNIDPNEKMIKPKEGRIEKYILRKAFDTTDEPDVKPYLPEEILWRQKEQ
FSDGVGYSWIDGLRDTAERAISDAMFANPKADWGDDIPTTKEAYWYRLKFDAWFPQKTAADTVMRWIPKADWGCAEDPSG
RYAKIHEKHVSA**
>gene365.t1 gene365
N
>gene366.t1 gene366
D
>gene367.t1 gene367
S
>gene368.t1 gene368
MGEKRNRNGKDANSQNRKKFKVSSGFLDPGTSGIYATCSRRHERQAAQELQLLFEEKFQELYGDIKEGEDESENDEKKDL
SIEDQIKKELQELKGEETGKDLSSGETKKKDPLAFIDLNCECVTFCKTRKPIVPEEFVLSIMKDLADPKNMVKRTRYVQK
LTPITYSCNAKMEQLIKLANLVIGPHFHDPSNVKKNYKFAVEVTRRNFNTIERMDIINQVVKLVNKEGSEFNHTVDLKNY
DKLILVECFKSNIGMCVVDGDYKTKYRKYNVQQLYESKFRKDEDKSVKQ**
>gene369.t1 gene369
N
>gene370.t1 gene370
D

Still trying to figure out whether this is related to the Augustus build or, the other possibility, whether it is a Python 3.8 vs. Python 3.9 issue, i.e. in how the code parses the Augustus results.
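
A quick way to see how many extracted proteins are affected is to scan the FASTA for implausibly short records. The sketch below is only a diagnostic aid, not funannotate's filtering code: `read_fasta` and `suspicious_records` are hypothetical helpers, the `min_length` cutoff of 10 residues is an arbitrary assumption, and the sample is an excerpt of the output above (only the first sequence line of gene368 is included, for brevity).

```python
# Excerpt of the protein FASTA shown above (gene368 truncated to its first line).
SAMPLE = """\
>gene365.t1 gene365
N
>gene366.t1 gene366
D
>gene367.t1 gene367
S
>gene368.t1 gene368
MGEKRNRNGKDANSQNRKKFKVSSGFLDPGTSGIYATCSRRHERQAAQELQLLFEEKFQELYGDIKEGEDESENDEKKDL
"""


def read_fasta(text):
    """Yield (record_id, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            # The record ID is the first whitespace-delimited token of the header.
            header, chunks = line[1:].split()[0], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def suspicious_records(text, min_length=10):
    """Return IDs whose sequence, ignoring trailing '*', is shorter than min_length."""
    return [rid for rid, seq in read_fasta(text)
            if len(seq.rstrip("*")) < min_length]


print(suspicious_records(SAMPLE))
```

Comparing the list of flagged IDs between the 1.8.15 and 1.8.17 environments would show whether the single-residue records appear only in one of them.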
