These datasets were built to support the development of supervised learning approaches for identifying fungal Biosynthetic Gene Clusters (BGCs).
Files are named pos%neg% after their percentages of positive and negative instances.
Each dataset is divided into:
- train (80% of dataset instances)
- validation (20% of dataset instances)
To cite our work:
http://arxiv.org/abs/2001.03260
H. Almeida, A. Tsang, A.B. Diallo. Supporting supervised learning in fungal Biosynthetic Gene Cluster discovery: new benchmark datasets. Machine Learning and Artificial Intelligence in Bioinformatics and Medical Informatics (MABM2019) workshop at the International Conference on Bioinformatics and Biomedicine (IEEE BIBM), 2019.
The instances in each fungal BGC dataset are split as follows:
Dataset | Train Pos | Train Neg | Validation Pos | Validation Neg
---|---|---|---|---
50%-50% | 160 | 160 | 40 | 40
40%-60% | 160 | 240 | 40 | 60
30%-70% | 160 | 373 | 40 | 93
20%-80% | 160 | 640 | 40 | 160
10%-90% | 160 | 1,440 | 40 | 360
05%-95% | 160 | 3,040 | 40 | 760
01%-99% | 160 | 15,840 | 40 | 3,960
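
The negative counts in the table are consistent with keeping the positives fixed (160 for train, 40 for validation) and adding negatives until the positives reach the stated percentage, rounded to the nearest integer. The sketch below reproduces that arithmetic. It is an assumption inferred from the table rather than the authors' documented procedure, and `expected_negatives` is a hypothetical helper name.

```python
def expected_negatives(positives: int, pos_percent: int) -> int:
    """Negatives needed so that `positives` make up `pos_percent` % of the total.

    Assumption: counts are rounded to the nearest integer.
    """
    return round(positives * (100 - pos_percent) / pos_percent)


# Fixed positive counts taken from the table (80% / 20% of 200 positives).
TRAIN_POS, VALID_POS = 160, 40

for pos_percent in (50, 40, 30, 20, 10, 5, 1):
    train_neg = expected_negatives(TRAIN_POS, pos_percent)
    valid_neg = expected_negatives(VALID_POS, pos_percent)
    print(
        f"{pos_percent:02d}%-{100 - pos_percent}%: "
        f"train neg={train_neg}, validation neg={valid_neg}"
    )
```

A check like this can help confirm that a downloaded split has the expected class balance before training.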
This work is licensed under a Creative Commons Attribution 4.0 International License.