This is a nifH database formated for the dada2 pipeline.
The database is based on the nifH ARB database from the Zehr Lab (Heller 2014) (version June 2017), which was exported from ARB in XML format. The XML file was modified, updated, and reformatted. A log of modifications can be found in this repository (Taxonomic_modifications.pdf). Modifications were based on information available on NCBI.
v1.1.0 has been updated with additional sequences, including nifH homologs from Cluster V (chlL/bchL) (see Documentation).
The modified XML file included in this repository (nifH_mod_v1.xml.zip) was used to generate a dada2 formatted database. The XML file contains detailed information for each entry, including accession number, full taxon name, taxonomy, amino acid and nucleotide sequences. The script for parsing data and generating the fasta file is also included in this repository (nifH_xml_to_dada2_fasta.R).
Taxonomy in the fasta file is in the following format:
>Domain; Phylum; Class; Order; Family; Genus ACCTAGAAAGTCGTAGATCGAAGTTGAAGCATCGCCCGATGATCGTCTGAAGCTGTAGCATGAGTCGATTTTCACATTCAGGGATACCATAGGATAC
Three versions of the database are included:
- **Only sequences idetified to the Phylum level or more (nifH_dada2_phylum_v1.1.0.fasta) (**recommended version)
- Only sequences identified to the Domain level or more (nifH_dada2_domain_v1.1.0.fasta)
- All sequences from the original nifH database (nifH_dada2_all_v1.1.0.fasta)
M. A. Moynihan. 2020. nifHdada2 GitHub repository. Zenodo. http://doi.org/10.5281/zenodo.3958370
Heller, P., Tripp, H. J., Turk-Kubo, K., & Zehr, J. P. (2014). ARBitrator: A software pipeline for on-demand retrieval of auto-curated nifH sequences from GenBank. Bioinformatics, btu417.
ARB nifH database: https://wwwzehr.pmc.ucsc.edu/nifH_Database_Public/
NCBI Resource Coordinators. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2018;46(D1):D8-D13. doi:10.1093/nar/gkx1095