
Dataset overview

Kaixin edited this page Jan 20, 2024 · 1 revision

Single-species-antibiotic datasets

  • Each genome (sample) is represented by its unique PATRIC ID.
  • In total, there are 78 datasets, each corresponding to a species-antibiotic combination.
  • Sample list of each species-antibiotic combination. The files are named Data_<species>_<antibiotic>. Each file lists all genome samples of one dataset, i.e. there is one file per dataset.
  • Sample phenotype metadata of each dataset. The files are named Data_<species>_<antibiotic>_pheno.txt, where 1 denotes the resistant phenotype and 0 the susceptible phenotype. Each file covers all genome samples of the dataset.
  • Single-species-antibiotic evaluation folds, given as [ [sample list of fold 1], [sample list of fold 2], ..., [sample list of fold 10] ].
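
The fold layout described above maps directly onto a cross-validation loop. The following is a minimal sketch, assuming the folds have already been loaded into a nested Python list (the on-disk format is not specified on this page). The helper name `iter_cv_splits` and the sample IDs are made up for illustration.

```python
from typing import Iterator, List, Tuple


def iter_cv_splits(folds: List[List[str]]) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (train_ids, test_ids), holding out one fold at a time."""
    for i, test_ids in enumerate(folds):
        # Training set = every sample from every fold except fold i.
        train_ids = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train_ids, list(test_ids)


# Toy folds with made-up IDs (three folds instead of ten, for brevity).
toy_folds = [["562.1", "562.2"], ["562.3"], ["562.4", "562.5"]]
splits = list(iter_cv_splits(toy_folds))
```

Each element of `splits` pairs a training list with the held-out fold, so with the real data the loop runs ten times per dataset.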

Table 1 Species and antibiotics

Species | Antibiotics | Number of antibiotics | Number of genomes
Mycobacterium tuberculosis | amikacin, capreomycin, ethambutol, ethiomide, ethionamide, isoniazid, kanamycin, ofloxacin, pyrazinamide, rifampicin, streptomycin | 11 | 13550
Salmonella enterica | amoxicillin/clavulanic acid, ampicillin, cefoxitin, ceftiofur, ceftriaxone, chloramphenicol, gentamicin, nalidixic acid, streptomycin, sulfisoxazole, tetracycline | 11 | 1922
Streptococcus pneumoniae | chloramphenicol, erythromycin, penicillin, tetracycline, trimethoprim/sulfamethoxazole | 5 | 5266
Neisseria gonorrhoeae | azithromycin, cefixime | 2 | 680
Escherichia coli | amoxicillin, amoxicillin/clavulanic acid, ampicillin, aztreonam, cefotaxime, ceftazidime, ceftriaxone, cefuroxime, ciprofloxacin, gentamicin, piperacillin/tazobactam, tetracycline, trimethoprim | 13 | 2493
Staphylococcus aureus | cefoxitin, ciprofloxacin, clindamycin, erythromycin, fusidic acid, gentamicin, methicillin, penicillin, tetracycline | 9 | 3325
Klebsiella pneumoniae | amikacin, aztreonam, cefepime, cefoxitin, ciprofloxacin, gentamicin, imipenem, levofloxacin, meropenem, piperacillin/tazobactam, tetracycline, tobramycin, trimethoprim/sulfamethoxazole | 13 | 1761
Enterococcus faecium | vancomycin | 1 | 277
Acinetobacter baumannii | amikacin, ampicillin/sulbactam, imipenem, levofloxacin, meropenem, tobramycin, trimethoprim/sulfamethoxazole | 7 | 1144
Pseudomonas aeruginosa | ceftazidime, ciprofloxacin, levofloxacin, meropenem, tobramycin | 5 | 891
Campylobacter jejuni | tetracycline | 1 | 395
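
As a quick consistency check, the per-species antibiotic counts in Table 1 should add up to the 78 datasets stated earlier. A small sketch, with the counts copied from the table:

```python
# Number of antibiotics per species, copied from Table 1.
antibiotics_per_species = {
    "Mycobacterium tuberculosis": 11,
    "Salmonella enterica": 11,
    "Streptococcus pneumoniae": 5,
    "Neisseria gonorrhoeae": 2,
    "Escherichia coli": 13,
    "Staphylococcus aureus": 9,
    "Klebsiella pneumoniae": 13,
    "Enterococcus faecium": 1,
    "Acinetobacter baumannii": 7,
    "Pseudomonas aeruginosa": 5,
    "Campylobacter jejuni": 1,
}

# One dataset per species-antibiotic combination.
total_datasets = sum(antibiotics_per_species.values())
print(total_datasets)  # 78
```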

Single-species multi-antibiotic datasets

  • Each genome (sample) is represented by its unique PATRIC ID.
  • In total, there are nine datasets, each corresponding to a species (M. tuberculosis, E. coli, S. aureus, S. enterica, K. pneumoniae, P. aeruginosa, A. baumannii, S. pneumoniae, N. gonorrhoeae).
  • Multi-antibiotic evaluation folds for each species, given as [ [sample list of fold 1], [sample list of fold 2], ..., [sample list of fold 10] ].
  • For each dataset, refer to the corresponding species' metadata for the phenotype of each genome listed in the folds above. The files are named Data_<species>_<antibiotic>_pheno.txt, where 1 denotes the resistant phenotype and 0 the susceptible phenotype. If a genome (for example, one in the E. coli multi-antibiotic dataset) is absent from a Data_Escherichia_coli_<antibiotic>_pheno.txt file, no phenotype information is available for that genome and that antibiotic.
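
The lookup rule above (a genome missing from a phenotype file means no label for that antibiotic) can be sketched as follows. This assumes each phenotype file has already been parsed into a dictionary from genome ID to 0/1, since the file layout is not described on this page. `build_label_matrix` is a hypothetical helper, and the IDs and labels are invented.

```python
from typing import Dict, List, Optional


def build_label_matrix(
    genome_ids: List[str],
    pheno_by_antibiotic: Dict[str, Dict[str, int]],
) -> Dict[str, Dict[str, Optional[int]]]:
    """Map each genome to {antibiotic: 1, 0, or None when no phenotype is recorded}."""
    return {
        g: {ab: labels.get(g) for ab, labels in pheno_by_antibiotic.items()}
        for g in genome_ids
    }


# Made-up IDs and labels for illustration only.
pheno = {
    "ampicillin": {"562.1": 1, "562.2": 0},
    "tetracycline": {"562.2": 1},
}
matrix = build_label_matrix(["562.1", "562.2"], pheno)
```

Keeping the missing entries as `None` rather than 0 avoids treating an untested genome as susceptible when training a multi-antibiotic model.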

Multi-species-antibiotic dataset

  • Each genome (sample) is represented by its unique PATRIC ID.
  • In total, the dataset is composed of 54 species-antibiotic combinations (see Fig. 1). Nine species are involved: M. tuberculosis, E. coli, S. aureus, S. enterica, K. pneumoniae, P. aeruginosa, A. baumannii, S. pneumoniae and C. jejuni.
  • Multi-species-antibiotic evaluation folds, used only for the Aytan-Aktug control multi-species model (see Example 1), given as [ [sample list of fold 1], [sample list of fold 2], ..., [sample list of fold 10] ]. Phenotype metadata for each genome can be looked up in the same way as for the multi-antibiotic datasets.
  • Leave-one-species-out multi-species-antibiotic evaluation folds. Each file in the folder contains the samples of one species and thus forms one fold. In each round, use all samples in one file as the test set and all samples in the remaining files, which belong to the other eight species, as the training set. The training set can be further split into folds for hyperparameter selection in one of two ways:
    (1) For the multi-species-antibiotic model (see Example 2). Each sample list file already splits the samples of its species into 5 folds. Build the 5 folds of a multi-species-antibiotic training set sequentially, each time taking a different fold from each training species to form a new combined multi-species fold.
    (2) For the multi-species single-antibiotic model (see Example 3). For example, when testing models on M. tuberculosis using the AN dataset (represented by the AN column), genome samples from the K. pneumoniae-AN and A. baumannii-AN combinations form the training set. The sample splits for these two species-antibiotic combinations can be found in the single-species-antibiotic evaluation folds, according to the file name. Build the 10 folds of the multi-species training set sequentially, each time taking a different fold from each training species to form a new combined multi-species fold.
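
The leave-one-species-out procedure and the construction of combined training folds can be sketched as below. This is one reading of the description above, in which combined fold k is the union of fold k from every training species. It assumes the per-species folds are already loaded as nested lists. The function names, species keys and sample IDs are hypothetical.

```python
from typing import Dict, List, Tuple

SpeciesFolds = Dict[str, List[List[str]]]  # species -> folds -> sample IDs


def combine_folds(species_folds: SpeciesFolds, n_folds: int) -> List[List[str]]:
    """Merge fold k of every species into combined multi-species fold k."""
    combined: List[List[str]] = [[] for _ in range(n_folds)]
    for folds in species_folds.values():
        if len(folds) != n_folds:
            raise ValueError("every species must provide exactly n_folds folds")
        for k, fold in enumerate(folds):
            combined[k].extend(fold)
    return combined


def leave_one_species_out(
    species_folds: SpeciesFolds, test_species: str, n_folds: int
) -> Tuple[List[List[str]], List[str]]:
    """Return (combined training folds, test sample IDs) for one held-out species."""
    # All samples of the held-out species form the test set.
    test_ids = [s for fold in species_folds[test_species] for s in fold]
    # The remaining species supply the training folds.
    training = {sp: f for sp, f in species_folds.items() if sp != test_species}
    return combine_folds(training, n_folds), test_ids


# Toy example: three species with two folds each (made-up IDs).
toy = {
    "A": [["a1"], ["a2"]],
    "B": [["b1", "b2"], ["b3"]],
    "C": [["c1"], ["c2"]],
}
train_folds, test_ids = leave_one_species_out(toy, "A", n_folds=2)
```

With the real data, `n_folds` would be 5 for method (1) and 10 for method (2), and for method (2) `species_folds` would hold only the species that share the antibiotic in question.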

Figure 1 Multi-species-antibiotic dataset overview. Green indicates that the corresponding species-antibiotic combination is included in the dataset. The last row gives, for each antibiotic, the number of species-antibiotic combinations that include it.