AMR phenotyping benchmarking tutorial
Our AMR phenotyping benchmarking workflow comprises benchmarking datasets with predefined partitionings, adapted AMR phenotyping methods (each evaluated with nested cross-validation or a derivative thereof, depending on the software's intrinsic design), a set of metrics, and procedures for generating summary tables and visualizations.
In this tutorial, we provide a step-by-step guide for using our workflow to evaluate a model that is not among our benchmarked methods or that may be released in the future by a third party. Two types of machine-learning (ML)-based AMR phenotyping methods are provided as examples.
- The first type is a classic ML model that 1) can use the scikit-learn module for training, 2) builds its feature matrix without phenotype information, and 3) encompasses several classifiers. Among the methods benchmarked in our study, Seq2Geno2Pheno falls into this category. Such methods can be evaluated via nested cross-validation, in which hyperparameter and classifier selection are performed at each iteration.
- The second type of ML model 1) possesses an inherent hyperparameter optimization process and 2) encompasses several classifiers 3) that are not supported by the scikit-learn module. Among the methods benchmarked in our study, Kover falls into this category. Such methods can be evaluated via an iterative evaluation approach, accompanied by cross-validation to select the optimal classifier.
- 1. Setup
- 2. The first example: ML-based methods with nested cross-validation
- 3. The second example: ML-based methods with iterative evaluation
- Download our AMR benchmarking workflow
git clone https://github.com/hzi-bifo/AMR_benchmarking.git
- Set the configuration file
Config.yaml
in the root folder. You only need to set the following items; keep the default values for the rest.
option | action | values ([default]) |
---|---|---|
dataset_location | Where the PATRIC dataset will be downloaded (~246 GB). | /vol/projects/BIFO/patric_genome |
output_path | Where to generate the Results folder for the direct results of each software and further visualization. | ./ |
log_path | Where to generate the log folder for intermediate files (~10 TB, assuming regular cleaning of files related to completed benchmarking species). | ./ |
n_jobs | CPU cores (>1) to use. | 10 |
amr_env_name | Conda environment for general use. | amr_env |
- Requirements: Linux OS and conda (we used Miniconda2 4.8.4)
- Load the options set in
Config.yaml
in the Linux shell. The parse_yaml helper below parses the YAML file and exports each top-level key as a shell variable (e.g., ${output_path}).
function parse_yaml {
local prefix=$2
local s='[[:space:]]*' w='[a-zA-Z0-9_]*' fs=$(echo @|tr @ '\034')
sed -ne "s|^\($s\):|\1|" \
-e "s|^\($s\)\($w\)$s:$s[\"']\(.*\)[\"']$s\$|\1$fs\2$fs\3|p" \
-e "s|^\($s\)\($w\)$s:$s\(.*\)$s\$|\1$fs\2$fs\3|p" $1 |
awk -F$fs '{
indent = length($1)/2;
vname[indent] = $2;
for (i in vname) {if (i > indent) {delete vname[i]}}
if (length($3) > 0) {
vn=""; for (i=0; i<indent; i++) {vn=(vn)(vname[i])("_")}
printf("%s%s%s=\"%s\"\n", "'$prefix'",vn, $2, $3);
}
}'
}
eval $(parse_yaml Config.yaml)
export PATH="$PWD"/src:$PATH
export PYTHONPATH=$PWD
SCRIPT=$(realpath "$0")
SCRIPTPATH=$(dirname "$SCRIPT")
- Create a conda environment and install packages with conda in the Linux shell
conda env create -n ${amr_env_name} -f ./install/amr_env.yml python=3.7
conda activate ${amr_env_name}
- Download genome sequences via Linux shell
bash ./scripts/data_preprocess/retrieve_PATRIC_data.sh ${dataset_location}
- Create a folder under the AMR_software directory, preferably named after the software, e.g. Pseudo in this tutorial
- Copy the Python files benchmarking.py and evaluate_ncv.py from the Pseudo directory to your newly created folder
- Integrate the feature-building procedure into benchmarking.py via the prepare_feature function
def prepare_feature(self):
df_species,antibiotics=extract_infor(self.level,self.f_all,self.s,self.temp_path,self.software_name)
for species, antibiotics in zip(df_species, antibiotics):
antibiotics, _, _ = load_data.extract_info(species, False, self.level)
for anti in antibiotics:
name,_,_ = name_utility.GETname_model(self.software_name,self.level, species, anti,'',self.temp_path)
phenotype_list = pd.read_csv(name, index_col=0, dtype={'genome_id': object}, sep="\t")
phenotype_list['path']=str(self.path_sequence) +'/'+ phenotype_list['genome_id'].astype(str)+'.fna'
################################################################
'''
phenotype_list is pandas matrix with each row representing a genome. E.G.
genome_id resistant_phenotype path
0 470.2428 1 <path_to_sequence>/470.2428.fna
1 480119.5 0 <path_to_sequence>/480119.5.fna
2 470.2433 1 <path_to_sequence>/470.2433.fna
3 470.2159 1 <path_to_sequence>/470.2159.fna
4 470.2166 1 <path_to_sequence>/470.2166.fna
...
...
![Please add your code here to build features for your model based on the information above; a minimal sketch is given after this function.]
'''
## save the feature matrix to a folder under temp_path
data_feature.to_csv('<path_to_feature>', sep="\t")
################################################################
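Below is a minimal, hypothetical sketch of what the feature-building placeholder might contain, assuming a simple k-mer presence/absence encoding built from the FASTA files listed in phenotype_list['path']. The function name build_kmer_features and the k-mer scheme are illustrative only; replace them with your model's own feature construction.
import itertools
import pandas as pd

def build_kmer_features(phenotype_list, k=4):
    ## Return a genomes x k-mers presence/absence matrix (one row per genome).
    kmers = ["".join(p) for p in itertools.product("ACGT", repeat=k)]
    rows = {}
    for _, row in phenotype_list.iterrows():
        parts = []
        with open(row["path"]) as handle:
            for line in handle:
                if not line.startswith(">"):  ## skip FASTA headers
                    parts.append(line.strip().upper())
        seq = "".join(parts)
        present = dict.fromkeys(kmers, 0)
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in present:
                present[kmer] = 1
        rows[row["genome_id"]] = present
    return pd.DataFrame.from_dict(rows, orient="index")[kmers]

## e.g. data_feature = build_kmer_features(phenotype_list)
##      data_feature.to_csv('<path_to_feature>', sep="\t")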
- Integrate the model into benchmarking.py via the nested_cv function
  - Load the feature matrix
  - Specify the classifiers
  - Specify the hyperparameter selection range of the model's classifiers
def nested_cv(self): ### nested CV
df_species,antibiotics=extract_infor(self.level,self.f_all,self.s,self.temp_path,self.software_name)
for species, antibiotics in zip(df_species, antibiotics):
## antibiotics is the python list of benchmarked antibiotics for that species
## ID is the python list of PATRIC ID, e.g. [1352.10013,1352.10014, ..., 1354.10,1366.10]
## Y is the python list of phenotype for each sample in the same order as ID list
antibiotics, ID, Y = load_data.extract_info(species, False, self.level)
i_anti = 0
for anti in antibiotics: ## evaluate each species-antibiotic dataset sequentially
id_all = ID[i_anti]
y_all = Y[i_anti]
i_anti+=1
'''
! [Please specify the model's classifiers and the feature matrix location here.]
For example:
CLASSIFIERS=['svm','lr', 'rf','lsvm']
data_feature=pd.read_csv('<path_to_feature>', index_col=0,sep="\t")
'''
X_all = pd.DataFrame(index=id_all) ## initialize the feature matrix index with this dataset's sample IDs
X_all = pd.concat([X_all, data_feature.reindex(X_all.index)], axis=1) ## load the feature matrix, rows aligned to the sample IDs
id_all = np.array(id_all)
y_all = np.array(y_all)
## load in folds for nested CV
p_names = name_utility.GETname_meta(species,anti,self.level)
folds_txt=name_utility.GETname_folds(species,anti,self.level,self.f_kma,self.f_phylotree)
folders_sample = json.load(open(folds_txt, "rb"))
folders_index=name2index.Get_index(folders_sample,p_names) # CV folds
for chosen_cl in CLASSIFIERS: # evaluate each classifier sequentially
_, _,save_name_score=name_utility.GETname_model(self.software_name, self.level,species, anti,chosen_cl,self.temp_path)
file_utility.make_dir(os.path.dirname(save_name_score))
### metrics of 10 folds
mcc_test = []
f1_test = [] ## F1-macro
## score_report could be used to extract metrics like precision-positive,
## recall-positive, F1-positive, precision-negative, recall-negative, F1-negative, and accuracy
score_report_test = []
aucs_test = []
### other outputs from nested CV
estimator_test=[] ##scikit-learn
hyperparameters_test = [] ## hyperparameters selected from the inner-loop CV for training in each of the 10 outer-loop iterations
score_InnerLoop = [] ## the best metric score of the inner loop for each of the 10 outer-loop iterations
index_InnerLoop=[] ## the position of the best hyperparameters in the grid search
cv_results_InnerLoop=[] ## the cv_results_ attribute of sklearn.model_selection.GridSearchCV
### testing results
sampleNames_test=[] ## sample PATRIC ID for a species-antibiotic combination/dataset
predictY_test=[] ## predicted AMR phenotype for each sample, ordered the same as sampleNames_test
true_Y=[] ## ground truth AMR phenotype for each sample, ordered the same as sampleNames_test
for out_cv in range(self.cv): ## outer loop of nested CV
print(species,anti,'. Starting outer: ', str(out_cv), '; chosen_cl: ', chosen_cl)
test_samples_index = folders_index[out_cv] ## a list of index of the test fold
id_test = id_all[test_samples_index] ## sample name list of the test fold
y_test = y_all[test_samples_index] ## ground truth phenotype
train_val_index =folders_index[:out_cv] +folders_index[out_cv + 1:] ## 9 folds involved in inner loop for CV
id_val_train = id_all[list(itertools.chain.from_iterable(train_val_index))] ## sample name list of the 9 folds
y_val_train = y_all[list(itertools.chain.from_iterable(train_val_index))] ## phenotype of the 9 folds
X_val_train=X_all.loc[id_val_train,:] ## feature matrix of samples from the 9 folds
X_test=X_all.loc[id_test,:] ## feature matrix of samples of the test fold
'''
! [Please specify the model's classifiers' hyperparameter selection range here. ]
For example:
cl = RandomForestClassifier(random_state=1)
hyper_space = [
{
"cl__n_estimators": [100, 200, 500, 1000],
"cl__criterion": ["entropy", "gini"],
"cl__max_features": ["auto"],
"cl__min_samples_split": [2,5,10],
"cl__min_samples_leaf": [1],
"cl__class_weight": ["balanced", None]
}
]
'''
###############################################################################################
## typical procedures for nested CV inner loop hyperparameter selection.
## note: this part can be varied based on specific techniques, e.g. for Kover, this can be replaced with bound selection;
## for neural network models, this should be replaced with a hyperparameter optimization procedure accompanied by early stopping
## (see the function training in AMR_software/AytanAktug/nn/hyperpara.py)
###############################################################################################
### Grid search for hyperparameter selection
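## create_generator (a workflow utility) turns the precomputed fold indices into
## (train, validation) index pairs accepted by GridSearchCV's cv argument;
## a minimal sketch of the idea is given after this function.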
pipe = Pipeline(steps=[('cl', cl)])
search = GridSearchCV(estimator=pipe, param_grid=hyper_space, n_jobs=self.n_jobs,
scoring='f1_macro',
cv=create_generator(train_val_index), refit=True)
search.fit(X_all, y_all)
hyperparameters_test_sub=search.best_estimator_
scores_best=search.best_score_
index_best=search.best_index_
cv_results=search.cv_results_
current_pipe=hyperparameters_test_sub
# retrain on train and validation data using the optimal hyperparameters
current_pipe.fit(X_val_train, y_val_train)
y_test_pred = current_pipe.predict(X_test)
###############################################################################################
# calculate metric scores
f1 = f1_score(y_test, y_test_pred, average='macro')
report = classification_report(y_test, y_test_pred, labels=[0, 1], output_dict=True)
mcc = matthews_corrcoef(y_test, y_test_pred)
fpr, tpr, _ = roc_curve(y_test, y_test_pred, pos_label=1)
roc_auc = auc(fpr, tpr)
f1_test.append(f1)
score_report_test.append(report)
aucs_test.append(roc_auc)
mcc_test.append(mcc)
hyperparameters_test.append(hyperparameters_test_sub)
score_InnerLoop.append(scores_best)
index_InnerLoop.append(index_best)
cv_results_InnerLoop.append(cv_results)
predictY_test.append( y_test_pred.tolist())
true_Y.append(y_test.tolist())
sampleNames_test.append(folders_sample[out_cv])
estimator_test.append(current_pipe)
### Save metric scores and other model evaluation information
score ={'f1_test':f1_test,'score_report_test':score_report_test,'aucs_test':aucs_test,'mcc_test':mcc_test,
'predictY_test':predictY_test,'ture_Y':true_Y,'samples':sampleNames_test}
score2= {'hyperparameters_test':hyperparameters_test,'estimator_test':estimator_test,
'score_InnerLoop':score_InnerLoop,'index_InnerLoop':index_InnerLoop,'cv_results_InnerLoop':cv_results_InnerLoop}
with open(save_name_score + '_KMA_' + str(self.f_kma) + '_Tree_' + str(self.f_phylotree) + '.json',
'w') as f: # overwrite mode
json.dump(score, f)
with open(save_name_score + '_kma_' + str(self.f_kma) + '_tree_' + str(self.f_phylotree) + '_model.pickle',
'wb') as f: # overwrite mode
pickle.dump(score2, f)
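In the grid search above, create_generator converts the precomputed indices of the 9 training/validation folds into (train, validation) index pairs that GridSearchCV accepts as its cv argument. The following is a minimal sketch of the idea, assuming each fold is a list of positional indices into the data passed to fit; the name create_generator_sketch is illustrative and not part of the workflow.
import numpy as np

def create_generator_sketch(nested_folds):
    ## Hold out one fold at a time: yield (train_indices, validation_indices) pairs
    ## usable as the cv argument of sklearn.model_selection.GridSearchCV.
    for i in range(len(nested_folds)):
        validation = np.array(nested_folds[i])
        train = np.concatenate([np.array(fold) for j, fold in enumerate(nested_folds) if j != i])
        yield train, validation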
- Usage
usage: evaluate_ncv.py [-h] -software SOFTWARE_NAME -sequence PATH_SEQUENCE [-temp TEMP_PATH] [-l LEVEL] [-f_all] [-f_phylotree] [-f_kma] [-s SPECIES [SPECIES ...]] [-f_ml] [-cv CV_NUMBER]
[-n_jobs N_JOBS]
options:
-h, --help show this help message and exit
-software SOFTWARE_NAME, --software_name SOFTWARE_NAME
software_name
-sequence PATH_SEQUENCE, --path_sequence PATH_SEQUENCE
Path of the directory with PATRIC sequences
-temp TEMP_PATH, --temp_path TEMP_PATH
Directory to store temporary files, like features.
-l LEVEL, --level LEVEL
Quality control: strict or loose
-f_all, --f_all Benchmarking on all the possible species.
-f_phylotree, --f_phylotree
phylogeny-aware evaluation
-f_kma, --f_kma homology-aware evaluation
-s SPECIES [SPECIES ...], --species SPECIES [SPECIES ...]
species to run: e.g.'Pseudomonas aeruginosa' 'Klebsiella pneumoniae' 'Escherichia coli' 'Staphylococcus aureus' 'Mycobacterium tuberculosis' 'Salmonella enterica'
'Streptococcus pneumoniae' 'Neisseria gonorrhoeae'
-f_ml, --f_ml Perform nested CV on ML models
-cv CV_NUMBER, --cv_number CV_NUMBER
CV splits number. Default=10
-n_jobs N_JOBS, --n_jobs N_JOBS
Number of jobs to run in parallel.
- Example: phylogeny-aware evaluation of the software named "Pseudo" on E. coli in the Linux shell
python AMR_software/Pseudo/evaluate_ncv.py -software 'Pseudo' -s 'Escherichia coli' -sequence '/vol/projects/patric_genome' -temp ${log_path} -f_phylotree -cv 10 -n_jobs 10
- Usage
usage: result_analysis.py [-h] -software SOFTWARENAME [-cl_list CL_LIST [CL_LIST ...]] [-temp TEMP_PATH] [-out OUTPUT_PATH] [-l LEVEL] [-cv CV_NUMBER] [-f_all] [-s SPECIES [SPECIES ...]] [-f_phylotree] [-f_kma]
options:
-h, --help show this help message and exit
-software SOFTWARENAME, --softwareName SOFTWARENAME
Software name.
-cl_list CL_LIST [CL_LIST ...], --cl_list CL_LIST [CL_LIST ...]
classifiers.
-temp TEMP_PATH, --temp_path TEMP_PATH
Directory to store temporary files.
-out OUTPUT_PATH, --output_path OUTPUT_PATH
Directory to store CV scores.
-l LEVEL, --level LEVEL
Quality control: strict or loose
-cv CV_NUMBER, --cv_number CV_NUMBER
CV splits number. Default=10
-f_all, --f_all all the possible species
-s SPECIES [SPECIES ...], --species SPECIES [SPECIES ...]
species to run: e.g. 'Pseudomonas aeruginosa' 'Klebsiella pneumoniae' 'Escherichia coli' 'Staphylococcus aureus' 'Mycobacterium tuberculosis' 'Salmonella enterica'
'Streptococcus pneumoniae' 'Neisseria gonorrhoeae'
-f_phylotree, --f_phylotree
phylogeny-aware evaluation
-f_kma, --f_kma homology-aware evaluation
- Example
- The software is equipped with a support vector machine and a logistic regression.
- Generate a summary table of the software named "Pseudo" on E. coli under phylogeny-aware evaluation.
- The output file Escherichia_coli_kma_False_tree_True_SummaryBenchmarking.txt can be found at <output_path>/Results/software/Pseudo/Escherichia_coli in the form shown below; it can also be loaded back with pandas, as sketched after the table.
python ./src/analysis_utility/result_analysis.py -software 'Pseudo' -f_phylotree -cl_list 'svm' 'lr' -cv ${cv_number} -temp ${log_path} -out ${output_path} -s 'Escherichia coli'
antibiotic | f1_macro | f1_positive | f1_negative | accuracy | precision_macro | recall_macro | precision_negative | recall_negative | precision_positive | recall_positive | clinical_f1_negative | clinical_precision_negative | clinical_recall_negative |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
amoxicillin | 0.86±0.03 | 0.89±0.05 | 0.83±0.05 | 0.87±0.03 | 0.86±0.04 | 0.87±0.04 | 0.80±0.09 | 0.88±0.10 | 0.93±0.05 | 0.85±0.08 | 0.843507214 | 0.798319328 | 0.894117647 | ||
amoxicillin/clavulanic acid | 0.69±0.07 | 0.59±0.14 | 0.79±0.08 | 0.74±0.07 | 0.69±0.07 | 0.69±0.07 | 0.81±0.07 | 0.77±0.11 | 0.57±0.13 | 0.62±0.17 | 0.798613518 | 0.818763326 | 0.779431664 | ||
ampicillin | 0.85±0.08 | 0.91±0.04 | 0.78±0.15 | 0.89±0.05 | 0.88±0.06 | 0.84±0.09 | 0.82±0.11 | 0.79±0.23 | 0.94±0.06 | 0.89±0.08 | 0.84501845 | 0.831215971 | 0.859287054 | ||
aztreonam | 0.74±0.18 | 0.61±0.35 | 0.87±0.22 | 0.90±0.11 | 0.79±0.18 | 0.74±0.17 | 0.93±0.06 | 0.86±0.27 | 0.65±0.38 | 0.62±0.38 | 0.935217903 | 0.956626506 | 0.914746544 | ||
cefotaxime | 0.73±0.19 | 0.40±0.37 | 0.96±0.05 | 0.94±0.06 | 0.73±0.21 | 0.68±0.18 | 0.94±0.08 | 0.98±0.03 | 0.51±0.46 | 0.37±0.39 | 0.964926591 | 0.948676824 | 0.981742739 | ||
ceftazidime | 0.71±0.15 | 0.48±0.31 | 0.95±0.05 | 0.92±0.08 | 0.76±0.17 | 0.71±0.16 | 0.95±0.08 | 0.96±0.04 | 0.58±0.37 | 0.46±0.34 | 0.956521739 | 0.948026316 | 0.965170797 | ||
ceftriaxone | 0.80±0.09 | 0.84±0.07 | 0.76±0.21 | 0.87±0.05 | 0.83±0.10 | 0.80±0.09 | 0.79±0.19 | 0.76±0.24 | 0.88±0.10 | 0.83±0.13 | 0.863157895 | 0.863157895 | 0.863157895 | ||
cefuroxime | 0.58±0.13 | 0.25±0.28 | 0.90±0.05 | 0.84±0.07 | 0.58±0.14 | 0.58±0.12 | 0.89±0.07 | 0.92±0.06 | 0.27±0.30 | 0.24±0.27 | 0.908468777 | 0.892436975 | 0.925087108 | ||
ciprofloxacin | 0.76±0.17 | 0.50±0.38 | 0.92±0.07 | 0.90±0.09 | 0.73±0.19 | 0.70±0.16 | 0.92±0.10 | 0.92±0.08 | 0.55±0.40 | 0.48±0.39 | 0.933075435 | 0.919908467 | 0.946624804 | ||
gentamicin | 0.87±0.15 | 0.76±0.30 | 0.98±0.02 | 0.97±0.03 | 0.90±0.15 | 0.86±0.16 | 0.98±0.02 | 0.98±0.03 | 0.82±0.30 | 0.74±0.32 | 0.983481576 | 0.982857143 | 0.984106802 | ||
piperacillin/tazobactam | 0.53±0.09 | 0.19±0.18 | 0.87±0.14 | 0.80±0.15 | 0.55±0.07 | 0.57±0.10 | 0.93±0.04 | 0.84±0.19 | 0.16±0.15 | 0.30±0.30 | 0.884810621 | 0.933278418 | 0.841128434 | ||
tetracycline | 0.80±0.08 | 0.83±0.10 | 0.77±0.12 | 0.82±0.07 | 0.83±0.07 | 0.81±0.08 | 0.77±0.15 | 0.82±0.19 | 0.90±0.08 | 0.79±0.18 | 0.795081967 | 0.746153846 | 0.850877193 | ||
trimethoprim | 0.84±0.14 | 0.78±0.28 | 0.90±0.06 | 0.90±0.04 | 0.85±0.15 | 0.84±0.13 | 0.89±0.08 | 0.92±0.09 | 0.81±0.31 | 0.76±0.29 | 0.913978495 | 0.894736842 | 0.934065934 |
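For downstream comparison, the summary table can be loaded back into pandas. A small sketch, assuming the output file is tab-separated (adjust the separator and path if your setup differs):
import pandas as pd

## Path assembled from the output location described above.
summary = pd.read_csv(
    "<output_path>/Results/software/Pseudo/Escherichia_coli/"
    "Escherichia_coli_kma_False_tree_True_SummaryBenchmarking.txt",
    sep="\t", index_col=0)
print(summary["f1_macro"])  ## mean±std F1-macro per antibiotic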
- Create a folder under the AMR_software directory, preferably named after the software, e.g. Pseudo in this tutorial
- Copy the files benchmarking.py, evaluate_iter.py, run_iteration.sh, and run_val.sh from the Pseudo directory to your newly created folder
- Integrate the feature-building procedure for the iterative evaluation into benchmarking.py via the prepare_feature function (see 2.1 Feature building above)
- Integrate the feature-building procedure for classifier selection in the CV into benchmarking.py via the prepare_feature_val function
def prepare_feature_val(self):
'''
This feature-preparing procedure is for ML models
1) possessing an inherent hyperparameter optimization process, which means nested CV is not needed;
but 2) encompassing several classifiers, which means we should still design a CV on the 9 training folds at each iteration
to select the optimal classifier
'''
df_species,antibiotics=extract_infor(self.level,self.f_all,self.s,self.temp_path,self.software_name)
for species, antibiotics in zip(df_species, antibiotics):
### load antibiotics for this species; load sample PATRIC IDs for this species-antibiotic combination
antibiotics, ID, _ = load_data.extract_info(species, False, self.level)
i_anti = 0
for anti in antibiotics:
id_all = ID[i_anti] # sample names of all the 10 folds
i_anti += 1
id_all = np.array(id_all)
### 1. load CV folds
p_names = name_utility.GETname_meta(species,anti,self.level)
folds_txt=name_utility.GETname_folds(species,anti,self.level,self.f_kma,self.f_phylotree)
folders_sample = json.load(open(folds_txt, "rb"))
folders_index=name2index.Get_index(folders_sample,p_names) # CV folds
for out_cv in range(self.cv):
test_samples_index = folders_index[out_cv]
train_val_train_index = folders_index[:out_cv] + folders_index[out_cv + 1:]
for inner_cv in range(self.cv-1):
val_index=train_val_train_index[inner_cv]
train_index=train_val_train_index[:inner_cv] + train_val_train_index[inner_cv+1 :]
id_train = id_all[list(itertools.chain.from_iterable(train_index))] ## sample name list of CV training set
id_val = id_all[val_index] ## sample name list of CV test set
#########################################################################################################
## genome sequence location. Please use this information to load in sequence for building feature matrix.
sequence_train=[str(self.path_sequence) +'/'+ genome_id.astype(str)+'.fna' for genome_id in id_train]
sequence_val=[str(self.path_sequence) +'/'+ genome_id.astype(str)+'.fna' for genome_id in id_val]
'''
2. Prepare the meta/feature files for this iteration's training samples,
retaining only those in the training and validation CV folds.
![Please add your code here to build features; a sketch of writing per-iteration split files is given after this function.]
'''
## save the feature matrix to a folder under temp_path
data_feature.to_csv('<path_to_feature>', sep="\t")
#########################################################################################################
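The following is a hypothetical sketch of the per-iteration file preparation, assuming the external tool consumes plain-text lists of genome IDs and sequence paths; the function name write_split_files and the file names are illustrative only.
import os

def write_split_files(out_dir, id_train, sequence_train, id_val, sequence_val):
    ## Write one tab-separated file per split, listing genome ID and FASTA path,
    ## so the external tool can be pointed at the correct samples for this iteration.
    os.makedirs(out_dir, exist_ok=True)
    for split, ids, paths in (("train", id_train, sequence_train), ("val", id_val, sequence_val)):
        with open(os.path.join(out_dir, f"{split}_genomes.tsv"), "w") as f:
            for genome_id, path in zip(ids, paths):
                f.write(f"{genome_id}\t{path}\n")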
- Integrate the model into
run_iteration.sh
#!/bin/bash
species="$1"
feature_path="$2"
n_jobs="$3"
readarray -t Anti_List < "${feature_path}/${species}/anti_list"
for anti in "${Anti_List[@]}";do
for j in {0..9};do
##########################################################################################
#### Software-specific command here to train on the 9 training folds and test on the test fold
##########################################################################################
done
wait
done
- Integrate the model into
run_val.sh
#!/bin/bash
species="$1"
feature_path="$2"
n_jobs="$3"
readarray -t Anti_List < "${feature_path}/${species}/anti_list"
for anti in "${Anti_List[@]}";do
for j in {0..9};do
for i in {0..8};do
##########################################################################################
#### Software-specific command here to train on the 8 training folds and evaluate on the validation fold
##########################################################################################
done
done
done
- Example: phylogeny-aware evaluation of the software named "Pseudo" on E. coli in the Linux shell. The three commands below, in order:
  - Prepare features
  - Run the iterative evaluation over the 10 folds
  - Run cross-validation within the training set at each iteration to select the optimal classifier
python AMR_software/Pseudo/evaluate_iter.py -software 'Pseudo' -s 'Escherichia coli' -sequence '/vol/projects/patric_genome' -temp ${log_path} -f_phylotree -cv 10 -n_jobs ${n_jobs}
bash AMR_software/Pseudo/run_iteration.sh 'Escherichia coli' <feature_path> ${amr_env_name}
bash AMR_software/Pseudo/run_val.sh 'Escherichia coli' <feature_path> ${amr_env_name}
- The evaluation/score-collection code needs to be designed according to the software-specific report format; a sketch is given below.
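As an illustration, here is a hypothetical sketch of collecting per-fold results from software-specific report files into the same JSON score structure used in the first example. The assumed report format (one tab-separated line per genome with the ground-truth and predicted phenotype) and the function name collect_fold_scores must be adapted to the evaluated tool.
import json
from sklearn.metrics import classification_report, f1_score, matthews_corrcoef

def collect_fold_scores(report_files, save_name_score):
    ## report_files: one report per outer-loop iteration (10 files for 10 folds).
    f1_test, mcc_test, score_report_test = [], [], []
    for report_file in report_files:
        y_true, y_pred = [], []
        with open(report_file) as f:
            for line in f:  ## assumed format: "<genome_id>\t<true_phenotype>\t<predicted_phenotype>"
                _, true_label, pred_label = line.strip().split("\t")
                y_true.append(int(true_label))
                y_pred.append(int(pred_label))
        f1_test.append(f1_score(y_true, y_pred, average='macro'))
        mcc_test.append(matthews_corrcoef(y_true, y_pred))
        score_report_test.append(classification_report(y_true, y_pred, labels=[0, 1], output_dict=True))
    with open(save_name_score + '.json', 'w') as f:
        json.dump({'f1_test': f1_test, 'score_report_test': score_report_test, 'mcc_test': mcc_test}, f)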