snayfach/UHGV

Unified Human Gut Virome Catalog (UHGV)

The UHGV is a comprehensive genomic resource of viruses from the human microbiome. Genomes were derived from 12 independent data sources and annotated using a uniform bioinformatics pipeline.

Table of contents

  1. Methods
  2. Data availability
  3. Bioinformatics tools that use the UHGV

Methods

Data sources

We constructed the UHGV by integrating gut virome collections from a number of recent studies:

Bioinformatics pipeline

Sequences from these studies were combined and run through the following bioinformatics pipeline:

  • geNomad, viralVerify, and CheckV were used to remove sequences from cellular organisms and plasmids, as necessary
  • CheckV was used to trim remaining bacterial DNA from virus ends, estimate completeness, and identify closed genomes. Sequences >10 kb or >50% complete were retained and classified as complete, high-quality (>90% complete), medium-quality (50-90% complete), or low-quality (<50% complete)
  • BLASTN was used to calculate the average nucleotide identity between viruses using a custom script
  • DIAMOND was used to align proteins between viral genomes. Pairwise alignments were used to calculate a genome-wide protein-based similarity metric
  • MCL was used to cluster genomes into viral operational taxonomic units (vOTUs) at approximately the species, subgenus, genus, subfamily, and family-level ranks using a combination of genome-wide ANI for the species level and genome-wide proteomic similarity for higher ranks
  • A representative genome was selected for each species-level vOTU based on presence of terminal repeats, completeness, and the ratio of viral to non-viral genes
  • ICTV taxonomy was inferred using a best-genome-hit approach to phage genomes from INPHARED and using taxon-specific marker genes from geNomad
  • CRISPR spacer matching and k-mer matching with PHIST were used to connect viruses and host genomes. A voting procedure was then used to identify the host taxon at the lowest taxonomic rank comprising at least 70% of connections
  • HumGut genomes and MAGs from a Hadza hunter-gatherer population were used for host prediction and read mapping (HumGut contains all genomes from the UHGG v1.0 combined with NCBI genomes detected in gut metagenomes)
  • GTDB r207 and GTDB-tk were used to assign taxonomy to all prokaryotic genomes
  • BACPHLIP was used to predict phage lifestyle, together with integrases from the PHROG database and prophage information from geNomad. Note: BACPHLIP tends to over-classify viral genome fragments as lytic
  • Prodigal-gv was used to identify protein-coding genes and alternative genetic codes
  • eggNOG-mapper, PHROGs, KOfam, Pfam, UniRef90, PADLOC, and the AcrCatalog were used for phage gene functional annotation
  • PhANNs was used to infer phage structural genes
  • DGRscan was used to identify diversity-generating retroelements on viruses containing reverse transcriptases
  • Bowtie2 was used to align short reads from 1,798 whole metagenomes and 673 viral-enriched metagenomes against the UHGV and a database of prokaryotic genomes. ViromeQC was used to select human gut viromes. CoverM was used to estimate the breadth of coverage, and a 50% breadth threshold was applied to classify virus presence or absence
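
The host voting step described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: it assumes each virus-host connection carries a GTDB-style lineage string, that the 70% fraction is computed over all connections for a virus, and that unclassified ranks (for example `s__`) never win a vote. The function name `vote_host_taxon` and the example lineages are hypothetical.

```python
from collections import Counter

# GTDB-style ranks, ordered from highest to lowest.
RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]


def vote_host_taxon(lineages, min_fraction=0.7):
    """Return (rank, taxon) for the lowest rank at which a single taxon
    accounts for at least `min_fraction` of all connections, or None.

    lineages: one semicolon-delimited lineage string per virus-host
    connection (an assumed input format).
    """
    if not lineages:
        return None
    split = [lineage.split(";") for lineage in lineages]
    total = len(split)
    # Walk from species up to domain and stop at the first rank that passes.
    for depth in range(len(RANKS) - 1, -1, -1):
        votes = Counter(
            # Key on the full lineage prefix so that identically named taxa
            # under different parents are not merged.
            tuple(parts[: depth + 1])
            for parts in split
            # Skip lineages that are unclassified at this rank (e.g. "s__").
            if len(parts) > depth and parts[depth].split("__", 1)[-1]
        )
        if not votes:
            continue
        best, count = votes.most_common(1)[0]
        if count / total >= min_fraction:
            return RANKS[depth], best[-1]
    return None


prefix = "d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;"
connections = (
    [prefix + "g__Bacteroides;s__Bacteroides uniformis"] * 6
    + [prefix + "g__Bacteroides;s__Bacteroides fragilis"] * 2
    + [prefix + "g__Phocaeicola;s__Phocaeicola vulgatus"] * 2
)
print(vote_host_taxon(connections))
```

In this example no species reaches 70% of the ten connections, but the genus Bacteroides holds eight of them, so the vote settles at genus rank.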

For additional details, please refer to our manuscript: (in preparation).

Data availability

The entire resource is freely available at: https://portal.nersc.gov/UHGV

We provide genomes for three quality tiers:

  • Full: >50% complete or >10 kb, high-confidence & uncertain viral predictions
  • Medium-quality: >50% complete, high-confidence viral predictions
  • High-quality: >90% complete, high-confidence viral predictions
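
As a rough illustration of how the three tiers nest, the sketch below assigns a genome to every tier whose criteria it meets. It is not the code used to build the catalog: the function name `catalog_tiers`, the argument names, the strict `>` comparisons, and the handling of a missing completeness estimate are all assumptions. The tier labels follow the `full`, `mq_plus`, and `hq_plus` file-name suffixes.

```python
def catalog_tiers(length_bp, completeness, high_confidence):
    """Return the catalog tiers a genome would fall into.

    length_bp: sequence length in base pairs.
    completeness: estimated completeness in percent (0-100), or None when
        no estimate is available (assumed to count as 0 here).
    high_confidence: True for a high-confidence viral prediction,
        False for an uncertain one.
    """
    comp = completeness if completeness is not None else 0.0
    tiers = []
    # Full tier: either criterion suffices, regardless of confidence.
    if comp > 50 or length_bp > 10_000:
        tiers.append("full")
    # Medium-quality and above: completeness plus high confidence.
    if comp > 50 and high_confidence:
        tiers.append("mq_plus")
    # High-quality and above.
    if comp > 90 and high_confidence:
        tiers.append("hq_plus")
    return tiers


print(catalog_tiers(45_000, 95.0, True))
print(catalog_tiers(12_000, None, False))
```

Under these assumptions a high-quality genome appears in all three tiers, while a long fragment with an uncertain prediction appears only in the full tier.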

Additionally, we provide data for:

  • vOTU representatives
  • All genomes in each vOTU

Recommended files

For most analyses, we recommend using these files:

All available files:

  • metadata/

    • uhgv_full_metadata.tsv : detailed information on each of the 874,104 UHGV genome sequences
    • votus_full_metadata.tsv : detailed information on each of the 168,570 species-level viral clusters
    • votus_metadata_extended.tsv: additional information on each vOTU
    • host_metadata.tsv : taxonomy and other information for prokaryotic genomes (completeness, contamination, N50)
  • genome_catalogs/

    • uhgv_full.[fna|faa].gz : sequences for all genomes >10kb or >50% completeness
    • uhgv_mq_plus.[fna|faa].gz : sequences for all genomes with >50% completeness
    • uhgv_hq_plus.[fna|faa].gz : sequences for all genomes with >90% completeness
    • votus_full.[fna|faa].gz : sequences for vOTU representatives >10kb or >50% completeness
    • votus_mq_plus.[fna|faa].gz : sequences for vOTU representatives with >50% completeness
    • votus_hq_plus.[fna|faa].gz : sequences for vOTU representatives with >90% completeness
  • votu_reps/

    • [genome_id].fna : DNA sequence FASTA file of the genome assembly of the species representative
    • [genome_id].faa : protein sequence FASTA file of the species representative
    • [genome_id].gff : genome GFF file with various sequence annotations
    • [genome_id]_emapper.tsv : eggNOG-mapper annotations of the protein-coding sequences
    • [genome_id]_annotations.tsv : tab-delimited file containing diverse protein-coding annotations (PHROG, Pfam, UniRef90, eggNOG-mapper, PhANNs, KEGG)
  • host_predictions/

    • crispr_spacers.fna : 5,318,089 CRISPR spacers from UHGG (3,143,456), NCBI (1,568,807), and Hadza genomes (605,826)
    • host_genomes_info.tsv : GTDB r207 taxonomy for genomes from the UHGG (286,387), NCBI (123,500), and Hadza genomes (54,779)
    • host_assignment_crispr.tsv : detailed information for host prediction with CRISPR spacers
    • host_assignment_kmers.tsv : detailed information for host prediction with PHIST kmer matching
  • annotations/

    • functional annotation matrices: vOTUs x functions (PHROG, Pfam, KOfam, PADLOC)
  • read_mapping/

    • metagenomes_prok_vir_counts_matrix.tsv.gz : CoverM mapping statistics for viruses and bacteria across bulk metagenomes
    • viromes_prok_vir_counts_matrix.tsv.gz : CoverM mapping statistics for viruses and bacteria across viral-enriched metagenomes
    • sample_metadata.tsv : human sample metadata (country, lifestyle, age, gender, BMI, study)
    • fastq_summary.tsv : information on sequencing reads (SRA accession, bulk/virome metagenome, ViromeQC enrichment, read counts)
    • study_metadata.tsv : information on individual studies used for read mapping
    • bowtie2_indexes/

      • prokaryote_reps.fna.gz : FASTA of prokaryotic genomes used for read mapping
      • prokaryote_metadata_table.tsv.gz : metadata for the prokaryotic genomes
      • prokaryote_reps.1.bt* : Bowtie2 index files
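
The 50% breadth-of-coverage rule used for presence or absence calls can be sketched as below. This is only an illustration under stated assumptions: the layout of the mapping matrices is not described here, so the function takes a plain nested dictionary of breadth values expressed as fractions between 0 and 1, treats a virus missing from a sample as having zero breadth, and counts a breadth of exactly 50% as present. The function name and the sample and virus identifiers are hypothetical.

```python
def presence_absence(breadth, min_breadth=0.5):
    """Convert breadth-of-coverage values into presence/absence calls.

    breadth: {sample_id: {virus_id: fraction of the genome covered (0-1)}}
    Returns {sample_id: {virus_id: bool}} over the union of all viruses.
    """
    viruses = sorted({v for covered in breadth.values() for v in covered})
    return {
        sample: {v: covered.get(v, 0.0) >= min_breadth for v in viruses}
        for sample, covered in breadth.items()
    }


# Hypothetical identifiers and values, for illustration only.
example = {
    "sample_A": {"virus_1": 0.92, "virus_2": 0.31},
    "sample_B": {"virus_2": 0.50},
}
calls = presence_absence(example)
print(calls["sample_A"])
print(calls["sample_B"])
```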

Code availability

Contig-level taxonomic classification with the UHGV toolkit

  • Code to assign viral genomes to taxonomic groups from the UHGV
  • View the README for download and usage instructions.

Read-level abundance profiling with Phanta

  • Phanta (https://github.com/bhattlab/phanta) is a fast and accurate virus-inclusive profiler of human gut metagenomes based on the classification of short reads with Kraken2.
  • Follow the installation instructions on the Phanta GitHub page
  • Download a custom-built UHGV database for Phanta:
    • HQ plus: wget http://ab_phanta.os.scg.stanford.edu/Phanta_DBs/humgut_uhgv_hqplus_v1.tar.gz
    • MQ plus: wget http://ab_phanta.os.scg.stanford.edu/Phanta_DBs/humgut_uhgv_mqplus_v1.tar.gz
    • These databases are similar to Phanta's default database, as described in Phanta's manuscript, except that the viral portion of the default database is replaced with the UHGV.
  • Phanta can be executed based on the instructions on its GitHub page.
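
For scripted setups, the download and unpacking of a database archive could look roughly like the sketch below, as an alternative to running `wget` by hand. The helper names `fetch` and `unpack` and the destination directories are hypothetical; the URLs are the ones listed above. The internal layout of the archives is not described here, so the sketch only reports what was extracted.

```python
import tarfile
import urllib.request
from pathlib import Path

# Database archives listed above.
PHANTA_DB_URLS = {
    "hq_plus": "http://ab_phanta.os.scg.stanford.edu/Phanta_DBs/humgut_uhgv_hqplus_v1.tar.gz",
    "mq_plus": "http://ab_phanta.os.scg.stanford.edu/Phanta_DBs/humgut_uhgv_mqplus_v1.tar.gz",
}


def fetch(url, dest_dir):
    """Download `url` into `dest_dir` unless the file is already there."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / url.rsplit("/", 1)[-1]
    if not target.exists():
        urllib.request.urlretrieve(url, str(target))
    return target


def unpack(archive, dest_dir):
    """Extract a .tar.gz archive and return the top-level entry names."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions: refuse unsafe archive members.
            tar.extractall(dest_dir, filter="data")
        else:
            tar.extractall(dest_dir)
    return sorted(p.name for p in dest_dir.iterdir())


# Example usage (downloads a large file, so it is left commented out):
# archive = fetch(PHANTA_DB_URLS["hq_plus"], "phanta_dbs")
# print(unpack(archive, "phanta_dbs/humgut_uhgv_hqplus_v1"))
```

The extracted directory is then supplied to Phanta as described in its own documentation.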

Genome visualization