Dataset overview

Single-species-antibiotic datasets

Each genome (sample) is represented by its unique PATRIC ID.
In total, there are 78 datasets, each corresponding to a species-antibiotic combination.
Sample list for each species-antibiotic combination. The files are named Data_<species>_<antibiotic>. Each file lists all genome samples of one dataset.
Phenotype metadata for each dataset. The files are named Data_<species>_<antibiotic>_pheno.txt. 1 denotes the resistant phenotype; 0 denotes the susceptible phenotype. Each file covers all genome samples of one dataset.
Single-species-antibiotic evaluation folds, in the form [ [sample list of fold 1], [sample list of fold 2], ..., [sample list of fold 10] ].
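
The nested fold list can be turned into train/test splits by holding out one fold at a time. The following is a minimal sketch, assuming the folds and the phenotype metadata have already been loaded into memory; the on-disk layout of the files is not specified here, so parsing is left out. The names `iter_cv_splits`, `labels_for`, and the toy genome IDs are hypothetical and used only for illustration.

```python
from typing import Dict, Iterator, List, Tuple


def iter_cv_splits(folds: List[List[str]]) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (train_ids, test_ids), holding out one fold at a time."""
    for i, test_ids in enumerate(folds):
        # Every sample that is not in the held-out fold goes to the training set.
        train_ids = [gid for j, fold in enumerate(folds) if j != i for gid in fold]
        yield train_ids, list(test_ids)


def labels_for(ids: List[str], pheno: Dict[str, int]) -> List[int]:
    """Look up the 0/1 phenotype (1 = resistant, 0 = susceptible) for each ID."""
    return [pheno[gid] for gid in ids]


# Toy stand-ins (assumption): real folds hold PATRIC IDs and there are 10 of them.
toy_folds = [["g1", "g2"], ["g3"], ["g4", "g5"]]
toy_pheno = {"g1": 1, "g2": 0, "g3": 1, "g4": 0, "g5": 1}

splits = list(iter_cv_splits(toy_folds))
```

Each element of `splits` pairs a training ID list with the held-out test ID list, and `labels_for` maps either list to its phenotype labels.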

Table 1 Species and antibiotics

Species	Antibiotics	Number of antibiotics	Number of genomes
Mycobacterium tuberculosis	amikacin, capreomycin, ethambutol, ethiomide, ethionamide, isoniazid, kanamycin, ofloxacin, pyrazinamide, rifampicin, streptomycin	11	13550
Salmonella enterica	amoxicillin/clavulanic acid, ampicillin, cefoxitin, ceftiofur, ceftriaxone, chloramphenicol, gentamicin, nalidixic acid, streptomycin, sulfisoxazole, tetracycline	11	1922
Streptococcus pneumoniae	chloramphenicol, erythromycin, penicillin, tetracycline, trimethoprim/sulfamethoxazole	5	5266
Neisseria gonorrhoeae	azithromycin, cefixime	2	680
Escherichia coli	amoxicillin, amoxicillin/clavulanic acid, ampicillin, aztreonam, cefotaxime, ceftazidime, ceftriaxone, cefuroxime, ciprofloxacin, gentamicin, piperacillin/tazobactam, tetracycline, trimethoprim	13	2493
Staphylococcus aureus	cefoxitin, ciprofloxacin, clindamycin, erythromycin, fusidic acid, gentamicin, methicillin, penicillin, tetracycline	9	3325
Klebsiella pneumoniae	amikacin, aztreonam, cefepime, cefoxitin, ciprofloxacin, gentamicin, imipenem, levofloxacin, meropenem, piperacillin/tazobactam, tetracycline, tobramycin, trimethoprim/sulfamethoxazole	13	1761
Enterococcus faecium	vancomycin	1	277
Acinetobacter baumannii	amikacin, ampicillin/sulbactam, imipenem, levofloxacin, meropenem, tobramycin, trimethoprim/sulfamethoxazole	7	1144
Pseudomonas aeruginosa	ceftazidime, ciprofloxacin, levofloxacin, meropenem, tobramycin	5	891
Campylobacter jejuni	tetracycline	1	395

Single-species multi-antibiotic datasets

Each genome (sample) is represented by its unique PATRIC ID.
In total, there are nine datasets, each corresponding to a species (M. tuberculosis, E. coli, S. aureus, S. enterica, K. pneumoniae, P. aeruginosa, A. baumannii, S. pneumoniae, N. gonorrhoeae).
Multi-antibiotic evaluation folds in the form of [ [sample list of fold 1], [sample list of fold 2],...[sample list of fold 10] ] for each species.
For each dataset, refer to the corresponding species' phenotype metadata for the phenotype of each genome listed in the folds above. The files are named Data_<species>_<antibiotic>_pheno.txt. 1 denotes the resistant phenotype; 0 denotes the susceptible phenotype. If a genome (for example, in the E. coli multi-antibiotic dataset) is absent from a Data_Escherichia_coli_<antibiotic>_pheno.txt file, no phenotype information is available for that genome and that antibiotic.
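
Because a genome can be missing from some antibiotic files, a multi-antibiotic label table needs an explicit marker for "no phenotype available". The following is a minimal sketch, assuming each per-antibiotic phenotype file has already been parsed into a dictionary from genome ID to 0/1; `build_label_matrix` and the toy IDs are hypothetical names, and `None` is an arbitrary choice for the missing marker.

```python
from typing import Dict, List, Optional, Tuple


def build_label_matrix(
    genome_ids: List[str],
    pheno_by_antibiotic: Dict[str, Dict[str, int]],
) -> Tuple[List[str], List[List[Optional[int]]]]:
    """Return (antibiotics, matrix); matrix[i][j] is the phenotype of genome i
    for antibiotic j, or None when the genome is absent from that file."""
    antibiotics = sorted(pheno_by_antibiotic)
    matrix = [
        [pheno_by_antibiotic[ab].get(gid) for ab in antibiotics]
        for gid in genome_ids
    ]
    return antibiotics, matrix


# Toy stand-ins (assumption): real keys would be PATRIC IDs.
toy_pheno = {
    "ampicillin": {"g1": 1, "g2": 0},
    "tetracycline": {"g2": 1, "g3": 0},
}
antibiotics, matrix = build_label_matrix(["g1", "g2", "g3"], toy_pheno)
```

Entries left as `None` can then be masked out of the loss or the evaluation metric for the affected antibiotic.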

Multi-species-antibiotic dataset

Each genome (sample) is represented by its unique PATRIC ID.
In total, the dataset is composed of 54 species-antibiotic combinations (see Fig. 1). Nine species are involved: M. tuberculosis, E. coli, S. aureus, S. enterica, K. pneumoniae, P. aeruginosa, A. baumannii, S. pneumoniae, and C. jejuni.
Multi-species-antibiotic evaluation folds, used only for the Aytan-Aktug control multi-species model (see Example 1), in the form [ [sample list of fold 1], [sample list of fold 2], ..., [sample list of fold 10] ]. Phenotype metadata for each genome can be looked up in the same way as for the multi-antibiotic datasets.
Leave-one-species-out multi-species-antibiotic evaluation folds. Each file in the folder contains the samples of one species and thus makes up one fold. In each round, use all samples in one file as the test set and all samples in the remaining files, which belong to the other eight species, as the training set. The training set can be further split into folds for hyperparameter selection in one of two ways.
(1) For the multi-species-antibiotic model (see Example 2). In each sample list file, the samples of the corresponding species are already split into 5 folds. The 5 folds of a multi-species-antibiotic training set can be built sequentially, each time taking a different fold from each of the training species and merging them into a new combined multi-species fold.
(2) For the multi-species single-antibiotic model (see Example 3). For example, when testing models on M. tuberculosis using the AN-dataset (represented by the AN column), genome samples from the K. pneumoniae-AN and A. baumannii-AN combinations are used as the training set. The sample splits for these two species-antibiotic combinations can be found in the single-species-antibiotic evaluation folds, according to the file name. The 10 folds of the multi-species training set can then be built sequentially, each time taking a different fold from each of the training species and merging them into a new combined multi-species fold.
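
The leave-one-species-out split and the sequential merging of per-species folds described above can be sketched as follows. This is an illustration under stated assumptions, not the repository's own code: the per-species folds are assumed to be loaded already as a dictionary from species name to a list of folds, and `leave_one_species_out`, `combine_species_folds`, and the toy species and IDs are hypothetical.

```python
from typing import Dict, List, Tuple

FoldsBySpecies = Dict[str, List[List[str]]]


def leave_one_species_out(
    folds_by_species: FoldsBySpecies, test_species: str
) -> Tuple[List[str], List[str]]:
    """Return (train_species, test_ids) with one species held out for testing."""
    # All samples of the held-out species form the test set.
    test_ids = [gid for fold in folds_by_species[test_species] for gid in fold]
    train_species = [sp for sp in folds_by_species if sp != test_species]
    return train_species, test_ids


def combine_species_folds(
    folds_by_species: FoldsBySpecies, train_species: List[str]
) -> List[List[str]]:
    """Merge the k-th fold of every training species into the k-th combined fold."""
    fold_counts = {len(folds_by_species[sp]) for sp in train_species}
    if len(fold_counts) != 1:
        raise ValueError("all training species must have the same number of folds")
    n_folds = fold_counts.pop()
    return [
        [gid for sp in train_species for gid in folds_by_species[sp][k]]
        for k in range(n_folds)
    ]


# Toy stand-ins (assumption): three species with two folds each.
toy = {
    "A": [["a1"], ["a2"]],
    "B": [["b1", "b2"], ["b3"]],
    "C": [["c1"], ["c2"]],
}
train_species, test_ids = leave_one_species_out(toy, "C")
combined = combine_species_folds(toy, train_species)
```

The same `combine_species_folds` helper applies to case (2) if it is given the 10-fold splits of the relevant single-species-antibiotic combinations instead of the 5-fold per-species splits.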

Figure 1 Multi-species-antibiotic dataset overview. Green indicates that the corresponding species-antibiotic combination is included in the dataset. The last row counts the number of species-antibiotic combinations w.r.t. the corresponding antibiotic.
