Dataset overview
Kaixin edited this page Jan 20, 2024 · 1 revision
- Each genome (sample) is represented by its unique PATRIC ID.
- In total, there are 78 datasets, each corresponding to a species-antibiotic combination.
- Sample list of each species-antibiotic combination. The files are named `Data_<species>_<antibiotic>`. Each file contains all the genome samples for a dataset, i.e. each file corresponds to a dataset.
- Sample phenotype metadata of each dataset. The files are named `Data_<species>_<antibiotic>_pheno.txt`. 1 represents the resistance phenotype; 0 represents the susceptibility phenotype. Each file contains all the genome samples for a dataset.
- Single-species-antibiotic evaluation folds in the form of [ [sample list of fold 1], [sample list of fold 2], ..., [sample list of fold 10] ].
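The fold layout above maps directly onto a cross-validation loop. The following is a minimal sketch, assuming the folds have already been loaded into a Python list of lists of PATRIC IDs (the on-disk format of the fold files is not described here, so loading is left out). The function name `cross_validation_splits` and the toy IDs are hypothetical.

```python
from typing import Iterator, List, Tuple


def cross_validation_splits(
    folds: List[List[str]],
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (train_ids, test_ids), holding out one fold at a time."""
    for i, test_ids in enumerate(folds):
        # Every fold except the i-th one contributes to the training set.
        train_ids = [gid for j, fold in enumerate(folds) if j != i for gid in fold]
        yield train_ids, list(test_ids)


# Toy example with made-up IDs (real fold files hold 10 folds of PATRIC IDs).
toy_folds = [["1.1", "1.2"], ["1.3"], ["1.4", "1.5"]]
splits = list(cross_validation_splits(toy_folds))
```

Genome IDs are kept as strings throughout: PATRIC IDs contain a dot, so parsing them as floats could silently merge distinct IDs.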
Table 1 Species and antibiotics
Species | Antibiotics | Number of antibiotics | Number of genomes |
---|---|---|---|
Mycobacterium tuberculosis | amikacin, capreomycin, ethambutol, ethiomide, ethionamide, isoniazid, kanamycin, ofloxacin, pyrazinamide, rifampicin, streptomycin | 11 | 13550 |
Salmonella enterica | amoxicillin/clavulanic acid, ampicillin, cefoxitin, ceftiofur, ceftriaxone, chloramphenicol, gentamicin, nalidixic acid, streptomycin, sulfisoxazole, tetracycline | 11 | 1922 |
Streptococcus pneumoniae | chloramphenicol, erythromycin, penicillin, tetracycline, trimethoprim/sulfamethoxazole | 5 | 5266 |
Neisseria gonorrhoeae | azithromycin, cefixime | 2 | 680 |
Escherichia coli | amoxicillin, amoxicillin/clavulanic acid, ampicillin, aztreonam, cefotaxime, ceftazidime, ceftriaxone, cefuroxime, ciprofloxacin, gentamicin, piperacillin/tazobactam, tetracycline, trimethoprim | 13 | 2493 |
Staphylococcus aureus | cefoxitin, ciprofloxacin, clindamycin, erythromycin, fusidic acid, gentamicin, methicillin, penicillin, tetracycline | 9 | 3325 |
Klebsiella pneumoniae | amikacin, aztreonam, cefepime, cefoxitin, ciprofloxacin, gentamicin, imipenem, levofloxacin, meropenem, piperacillin/tazobactam, tetracycline, tobramycin, trimethoprim/sulfamethoxazole | 13 | 1761 |
Enterococcus faecium | vancomycin | 1 | 277 |
Acinetobacter baumannii | amikacin, ampicillin/sulbactam, imipenem, levofloxacin, meropenem, tobramycin, trimethoprim/sulfamethoxazole | 7 | 1144 |
Pseudomonas aeruginosa | ceftazidime, ciprofloxacin, levofloxacin, meropenem, tobramycin | 5 | 891 |
Campylobacter jejuni | tetracycline | 1 | 395 |
- Each genome (sample) is represented by its unique PATRIC ID.
- In total, there are nine datasets, each corresponding to a species (M. tuberculosis, E. coli, S. aureus, S. enterica, K. pneumoniae, P. aeruginosa, A. baumannii, S. pneumoniae, N. gonorrhoeae).
- Multi-antibiotic evaluation folds in the form of [ [sample list of fold 1], [sample list of fold 2],...[sample list of fold 10] ] for each species.
- For each dataset, refer to the corresponding species' metadata for the phenotype of each genome listed in the folds above. The files are named `Data_<species>_<antibiotic>_pheno.txt`. 1 represents the resistance phenotype; 0 represents the susceptibility phenotype. If a genome (for example, in the E. coli multi-antibiotic dataset) is absent from a `Data_Escherichia_coli_<antibiotic>_pheno.txt` file, there is no phenotype information for that genome and that specific antibiotic.
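Because a genome can be missing from some antibiotics' phenotype files, a multi-antibiotic label vector needs an explicit "no label" value. The sketch below illustrates one way to build such vectors. It assumes each phenotype file holds one genome per line, with the PATRIC ID first and the 0/1 label last, separated by whitespace; that layout is an assumption, not something stated on this page. The helper names `parse_pheno` and `label_matrix` and the toy IDs are hypothetical.

```python
from typing import Dict, Iterable, List, Optional


def parse_pheno(lines: Iterable[str]) -> Dict[str, int]:
    """Parse 'genome_id ... label' lines (assumed layout) into {genome_id: 0 or 1}."""
    labels: Dict[str, int] = {}
    for line in lines:
        parts = line.split()
        # Skip blank lines and anything without a trailing 0/1 label (e.g. a header).
        if len(parts) < 2 or parts[-1] not in ("0", "1"):
            continue
        labels[parts[0]] = int(parts[-1])  # keep the ID as a string
    return labels


def label_matrix(
    genome_ids: List[str],
    pheno_by_antibiotic: Dict[str, Dict[str, int]],
) -> Dict[str, List[Optional[int]]]:
    """Map each genome to its labels, ordered by antibiotic name; None = no phenotype."""
    antibiotics = sorted(pheno_by_antibiotic)
    return {
        gid: [pheno_by_antibiotic[ab].get(gid) for ab in antibiotics]
        for gid in genome_ids
    }


# Toy data with made-up IDs: the second genome has no ciprofloxacin label.
pheno = {
    "ampicillin": parse_pheno(["562.1 1", "562.2 0"]),
    "ciprofloxacin": parse_pheno(["562.1 0"]),
}
matrix = label_matrix(["562.1", "562.2"], pheno)
```

Entries that are `None` can then be masked out of the loss or the evaluation metric rather than being treated as susceptible.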
- Each genome (sample) is represented by its unique PATRIC ID.
- In total, the dataset is composed of 54 species-antimicrobial combinations (see Fig. 1), covering nine species: M. tuberculosis, E. coli, S. aureus, S. enterica, K. pneumoniae, P. aeruginosa, A. baumannii, S. pneumoniae, and C. jejuni.
- Multi-species-antibiotic evaluation folds, used solely for the Aytan-Aktug control multi-species model (see Example 1), in the form of [ [sample list of fold 1], [sample list of fold 2], ..., [sample list of fold 10] ]. Phenotype metadata for each relevant genome can be looked up in the same way as for the multi-antibiotic dataset.
- Leave-one-species-out multi-species-antibiotic evaluation folds. Each file in the folder contains the samples associated with one species and thus makes up one fold. In each round, use all the samples in one file as the test set, and use all the samples in the remaining files, which are associated with the other eight species, as the training set. The training set can be further split into folds for hyperparameter selection in one of two ways:
  - (1) For the multi-species-antibiotic model (see Example 2): in each sample list file, the samples of the corresponding species are already split into 5 folds. Build the 5 folds of a multi-species-antibiotic training set sequentially, each time extracting a different fold from each of the training species to form a new combined multi-species fold.
  - (2) For the multi-species single-antibiotic model (see Example 3): for example, when testing models on M. tuberculosis using the AN dataset (represented by the AN column), genome samples from the K. pneumoniae-AN and A. baumannii-AN combinations are used as the training set. The sample splits for these two species-antibiotic combinations can be found in the single-species-antibiotic evaluation folds, according to the file name. Build the 10 folds of the multi-species training set sequentially, each time extracting a different fold from each of the training species to form a new combined multi-species fold.
Figure 1 Multi-species-antibiotic dataset overview. Green indicates that the corresponding species-antibiotic combination is included in the dataset. The last row counts the number of species-antibiotic combinations w.r.t. the corresponding antibiotic.
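The leave-one-species-out procedure and the fold-combination step can be sketched as follows. This is an illustration only, assuming the per-species folds are already in memory as a dictionary from species name to a list of folds. The function names `combine_folds` and `leave_one_species_out`, the species keys, and the sample IDs are all hypothetical.

```python
from typing import Dict, Iterator, List, Tuple

SpeciesFolds = Dict[str, List[List[str]]]


def combine_folds(species_folds: SpeciesFolds) -> List[List[str]]:
    """Merge the i-th fold of every species into the i-th multi-species fold."""
    fold_counts = {len(folds) for folds in species_folds.values()}
    if len(fold_counts) != 1:
        raise ValueError("all species must provide the same number of folds")
    n_folds = fold_counts.pop()
    return [
        [gid for folds in species_folds.values() for gid in folds[i]]
        for i in range(n_folds)
    ]


def leave_one_species_out(
    species_folds: SpeciesFolds,
) -> Iterator[Tuple[str, List[List[str]], List[str]]]:
    """Yield (held_out_species, training_folds, test_ids) for each species."""
    for held_out, folds in species_folds.items():
        # All samples of the held-out species form the test set.
        test_ids = [gid for fold in folds for gid in fold]
        training = {s: f for s, f in species_folds.items() if s != held_out}
        yield held_out, combine_folds(training), test_ids


# Toy example: three species with two folds each (the real files use 5 folds).
toy = {
    "A": [["a1"], ["a2"]],
    "B": [["b1"], ["b2", "b3"]],
    "C": [["c1"], ["c2"]],
}
rounds = list(leave_one_species_out(toy))
```

The same `combine_folds` helper covers the multi-species single-antibiotic case: pass it the 10-fold splits of the training species-antibiotic combinations instead of the 5-fold per-species splits.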