A supervised learning framework to predict Biosynthetic Gene Clusters (BGCs) in fungi based on a combination of feature types (k-mers, Pfam protein domains, and GO terms).
Make a copy of /src/config.init.DEFAULT and rename it to /src/config.init. Update the [default] home to the current project root path.
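The copy-and-edit step can also be scripted. The sketch below uses a hypothetical helper, `init_config`, which is not part of the project. It assumes config.init is a standard INI file that Python's `configparser` can read, with `home` stored as a key in the `[default]` section. Rewriting the file through `configparser` drops any comments it contains, so edit by hand if you want to keep them.

```python
import configparser
import shutil
from pathlib import Path


def init_config(project_root):
    """Copy src/config.init.DEFAULT to src/config.init and set [default] home.

    Hypothetical helper: assumes the config file is plain INI.
    """
    root = Path(project_root).resolve()
    template = root / "src" / "config.init.DEFAULT"
    target = root / "src" / "config.init"

    # Keep an existing config.init rather than overwriting local edits.
    if not target.exists():
        shutil.copyfile(template, target)

    # interpolation=None keeps literal '%' characters intact;
    # optionxform=str preserves key case (e.g. feat.minOcc).
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str
    parser.read(target)

    if not parser.has_section("default"):
        parser.add_section("default")
    parser.set("default", "home", str(root))

    with target.open("w") as handle:
        parser.write(handle)
    return target
```

Calling `init_config(".")` from the project root would leave the template untouched and write the resolved root path into the new file.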
In the [prediction] section of the config.init file, set at least the following parameters:
- the task: train, validation, or test
- indicate the corpus location in source.path
- (if using sequences) indicate the source.type: nucleotide or aminoacid
- specify the positive instances % in pos.perc
- indicate the feat.type as kmers, domains or go (if combining multiple features, separate them with a -, as in go-kmers-domains)
- set the minimum occurrences to consider a feature in feat.minOcc
- set the k-mer length in feat.size
- select a classifier: logit, mlp, linearsvc, nusvc, svc, randomforest
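To make the parameter list concrete, here is a minimal sketch of a [prediction] section together with a sanity check for it. The key names and allowed values come from the list above; the values in `EXAMPLE_CONFIG` (path, percentages, sizes) are placeholders, and `check_prediction_section` is a hypothetical helper rather than part of the pipeline. It also assumes the file parses with Python's `configparser` and that feat.size is a whole number.

```python
import configparser

ALLOWED_TASKS = {"train", "validation", "test"}
ALLOWED_SOURCE_TYPES = {"nucleotide", "aminoacid"}
ALLOWED_FEATURES = {"kmers", "domains", "go"}
ALLOWED_CLASSIFIERS = {"logit", "mlp", "linearsvc", "nusvc", "svc", "randomforest"}

# Illustrative values only: the path and the numbers are placeholders.
EXAMPLE_CONFIG = """
[prediction]
task = train
source.path = /path/to/corpus
source.type = nucleotide
pos.perc = 20
feat.type = go-kmers-domains
feat.minOcc = 5
feat.size = 6
classifier = randomforest
"""


def check_prediction_section(text):
    """Return a list of problems found in the [prediction] section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key case, e.g. feat.minOcc
    parser.read_string(text)
    if not parser.has_section("prediction"):
        return ["missing [prediction] section"]

    section = parser["prediction"]
    problems = []
    if section.get("task") not in ALLOWED_TASKS:
        problems.append("task must be one of: " + ", ".join(sorted(ALLOWED_TASKS)))
    if not section.get("source.path"):
        problems.append("source.path is not set")

    # source.type only matters when the corpus holds sequences.
    source_type = section.get("source.type")
    if source_type is not None and source_type not in ALLOWED_SOURCE_TYPES:
        problems.append("source.type must be nucleotide or aminoacid")

    # Combined feature types are joined with '-', e.g. go-kmers-domains.
    features = section.get("feat.type", "").split("-")
    unknown = [name for name in features if name not in ALLOWED_FEATURES]
    if unknown:
        problems.append("feat.type has unknown entries: " + ", ".join(unknown))
    if "kmers" in features and not section.get("feat.size", "").isdigit():
        problems.append("feat.size must be an integer when kmers are used")

    if section.get("classifier") not in ALLOWED_CLASSIFIERS:
        problems.append("classifier is not one of the supported names")
    return problems
```

An empty list from `check_prediction_section(EXAMPLE_CONFIG)` means the section names a known task, feature set and classifier. It does not check that the corpus path exists.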
To run the classification task from the project virtualenv:
(.env) user@foo:~fungalbgcs/src$ python -m pipeprediction.ML
The train task generates a /metrics folder containing:
- the (re-load-able) model file (classifier)_(featuretype).model.pkl
- the feature list file (featuretype).feat
The validation task also writes to the /metrics folder:
- a performance file (classifier)_(featuretype).valid with precision (P), recall (R), F-measure (F-m) and a confusion matrix
- a file (classifier)_(featuretype).IDs.valid listing {valid_instance_IDs, predicted label} pairs
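For readers unfamiliar with the reported scores, the sketch below shows how precision, recall, F-measure and a confusion matrix can be derived from gold and predicted labels. It is not the project's own code. It assumes binary labels, the balanced F1 variant of the F-measure, and a matrix laid out with gold labels as rows and predictions as columns; `binary_metrics` is a hypothetical name.

```python
def binary_metrics(gold, predicted, positive=1):
    """Compute P, R, F1 and a 2x2 confusion matrix for binary labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp = fp = fn = tn = 0
    for g, p in zip(gold, predicted):
        if p == positive and g == positive:
            tp += 1  # BGC correctly predicted
        elif p == positive:
            fp += 1  # non-BGC predicted as BGC
        elif g == positive:
            fn += 1  # BGC missed
        else:
            tn += 1  # non-BGC correctly rejected

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        # Rows are gold labels, columns are predictions: [[tn, fp], [fn, tp]]
        "confusion_matrix": [[tn, fp], [fn, tp]],
    }


gold = [1, 1, 1, 0, 0, 0, 0, 1]
pred = [1, 0, 1, 0, 1, 0, 0, 1]
scores = binary_metrics(gold, pred)
# precision = recall = f_measure = 0.75, confusion_matrix = [[3, 1], [1, 3]]
```

In this toy example three of the four positive instances are recovered and one negative instance is mislabeled, so all three scores come out at 0.75.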
The test task requires either train or validation to have been run first, since it reads the model (*.model.pkl) and feature (*.feat) files. It writes to the /metrics folder:
- a performance file (classifier)_(featuretype).test with precision (P), recall (R), F-measure (F-m) and a confusion matrix
- a file (classifier)_(featuretype)_(testfolder).IDs.test listing {test_instance_IDs, predicted label} pairs, used as input for evaluation against gold clusters
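Because the test task depends on files written by an earlier run, it can help to confirm they exist before launching it. The sketch below is a hypothetical pre-flight check, not part of the pipeline. It assumes the file names follow the patterns listed above exactly, with a combined feature type such as go-kmers-domains used verbatim in the name.

```python
from pathlib import Path


def missing_test_inputs(metrics_dir, classifier, feat_type):
    """Return the names of model/feature files the test task would need
    but that are absent from the metrics folder."""
    metrics = Path(metrics_dir)
    required = [
        metrics / f"{classifier}_{feat_type}.model.pkl",  # trained model
        metrics / f"{feat_type}.feat",                    # feature list
    ]
    return [path.name for path in required if not path.is_file()]
```

An empty list means both inputs are present. Otherwise, run the train or validation task with the same classifier and feat.type first.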
Datasets: Openly available fungal BGC datasets to train and validate models (details here).
External software: To set up Pfam for protein domain annotation locally, please refer to the steps in /extSoftware/.