bioinfoUQAM/fungalbgcdata

Fungal BGC datasets

Datasets built to support the development of supervised learning approaches to identify fungal BGCs.

Each dataset file is named after its percentages of positive and negative instances (the pos%neg% pattern).

Each dataset is divided into:

  • train (80% of dataset instances)
  • validation (20% of dataset instances)
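The 80/20 division can be illustrated with a minimal sketch. It assumes a per-class (stratified) random split, which keeps the positive/negative ratio the same in both parts; the repository does not state how the published split was actually drawn. The function name `stratified_split` and the seed are hypothetical, chosen only for this example.

```python
import random

def stratified_split(labels, train_fraction=0.8, seed=0):
    """Split indices into train/validation, class by class.

    Hypothetical helper: assumes a random per-class split, which may
    differ from how the published datasets were produced.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        val_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(val_idx)

# Example: a balanced dataset with 200 positives (1) and 200 negatives (0).
labels = [1] * 200 + [0] * 200
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))  # 320 80
```

With 200 positives and 200 negatives, this yields 160 positives and 160 negatives for training, and 40 of each for validation.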

Reference

To cite our work:
http://arxiv.org/abs/2001.03260
H. Almeida, A. Tsang, A.B. Diallo. Supporting supervised learning in fungal Biosynthetic Gene Cluster discovery: new benchmark datasets. Machine Learning and Artificial Intelligence in Bioinformatics and Medical Informatics (MABM2019) workshop at the International Conference on Bioinformatics and Biomedicine (IEEE BIBM), 2019.

Dataset instances

The instances in the fungal BGC datasets are split as follows:

Dataset    Train Pos   Train Neg   Validation Pos   Validation Neg
50%-50%    160         160         40               40
40%-60%    160         240         40               60
30%-70%    160         373         40               93
20%-80%    160         640         40               160
10%-90%    160         1,440       40               360
05%-95%    160         3,040       40               760
01%-99%    160         15,840      40               3,960
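The negative counts in this table follow from the fixed positive counts (160 for training, 40 for validation) and the target class ratio. The sketch below reproduces them under the assumption that fractional results are rounded down; the repository does not say which rounding was used, although rounding to nearest gives the same values here. The helper `negatives_for` is a hypothetical name introduced for illustration.

```python
def negatives_for(positives, pos_pct):
    """Number of negatives needed so that positives make up pos_pct percent.

    Hypothetical helper; uses integer division (rounds down), which is
    an assumption about how the published counts were derived.
    """
    return positives * (100 - pos_pct) // pos_pct

TRAIN_POS, VAL_POS = 160, 40  # positive counts taken from the table

for pos_pct in (50, 40, 30, 20, 10, 5, 1):
    name = f"{pos_pct:02d}%-{100 - pos_pct}%"
    print(name, negatives_for(TRAIN_POS, pos_pct), negatives_for(VAL_POS, pos_pct))
```

For example, the 30%-70% dataset needs 160 × 70 / 30 ≈ 373.3 negatives for training, which becomes 373, and 40 × 70 / 30 ≈ 93.3 for validation, which becomes 93.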

License

This work is licensed under a Creative Commons Attribution 4.0 International License.
