GMSC-mapper is a command line tool to query the Global Microbial smORFs Catalog (GMSC).
GMSC-mapper can be used to:
- Find query smORFs (< 100 aa) that are homologous to sequences in the Global Microbial smORFs Catalog (GMSC), by alignment.
- Support 3 types of input:
- contigs (GMSC-mapper will predict smORFs from contigs first)
- amino acid sequences
- nucleotide gene sequences
- Annotate query or predicted smORFs with manually curated quality, habitat, and taxonomy information.
If you use this tool in a publication, please cite:
Duan, Y., Santos-Júnior, C.D., Schmidt, T.S. et al. A catalog of small proteins from the global microbiome. Nat Commun 15, 7563 (2024). https://doi.org/10.1038/s41467-024-51894-6
Clone GMSC-mapper repository
git clone https://github.com/BigDataBiology/GMSC-mapper.git
Create conda environment
conda create -n gmscmapper python
conda activate gmscmapper
GMSC-mapper requires the external alignment tools MMseqs2 and DIAMOND. The easiest way to install them is with conda:
conda install -c bioconda -c conda-forge mmseqs2
conda install -c bioconda -c conda-forge diamond=2.0.13
Once the dependencies are installed, you can install GMSC-mapper by running:
cd GMSC-mapper
python setup.py install
The whole GMSC database is large and takes several minutes to process. To check that the installation works, you can run a quick test on the mock datasets instead.
Please use the GMSC-mapper directory as your working directory.
cd GMSC-mapper
- Create GMSC database index
The default alignment tool is DIAMOND.
gmsc-mapper createdb -i ./examples/target.faa -o ./examples/ -m diamond
- When input is genome contig sequences:
gmsc-mapper -i ./examples/example.fa -o ./examples_output/ --dbdir ./examples/
- When input is amino acid sequences:
gmsc-mapper --aa-genes ./examples/example.faa -o ./examples_output/ --dbdir ./examples/
- When input is nucleotide gene sequences:
gmsc-mapper --nt-genes ./examples/example.fna -o ./examples_output/ --dbdir ./examples/
- Check another alignment tool: MMseqs2
The default alignment tool is DIAMOND. To use MMseqs2 instead, you first need to create the GMSC database index in MMseqs2 format.
gmsc-mapper createdb -i ./examples/target.faa -o ./examples/ -m mmseqs
After the index is created, set --tool to mmseqs. The rest of the usage is the same as above.
gmsc-mapper -i ./examples/example.fa -o ./examples_output/ --dbdir ./examples/ --tool mmseqs
We recommend using the GMSC-mapper directory as your current working directory, so that you can directly follow the commands below.
cd GMSC-mapper
Download GMSC database
gmsc-mapper downloaddb --dbdir ./db
The default --dbdir is ./db. If you use a custom --dbdir directory, it must be the same directory you pass as -o in the index creation step that follows.
Create GMSC database index
gmsc-mapper createdb -i ./db/GMSC10.90AA.faa.gz -o ./db -m diamond
The input (-i) is the FASTA file (GMSC10.90AA.faa.gz) downloaded to the database directory (default: ./db) in the download step.
The default -o is ./db. If you use a custom -o directory, it must be the same directory you passed as --dbdir in the preceding download step.
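The requirement that --dbdir and -o point to the same directory can be enforced by deriving both commands from a single value. The following is a minimal Python sketch, not part of GMSC-mapper: the helper name database_setup_commands and the constant GMSC_FASTA are hypothetical, and only the flags and the file name shown in the commands above are used.

```python
import os

# File name taken from the createdb example above.
GMSC_FASTA = "GMSC10.90AA.faa.gz"


def database_setup_commands(dbdir="./db", mode="diamond"):
    """Return the download and index-creation commands for one shared directory.

    Hypothetical helper: it only builds argument lists and does not run anything.
    """
    if mode not in ("diamond", "mmseqs"):
        raise ValueError(f"unsupported mode: {mode!r}")
    # Download step: the database files are placed in dbdir.
    download = ["gmsc-mapper", "downloaddb", "--dbdir", dbdir]
    # Index step: read the FASTA from dbdir and write the index to the same place.
    create = [
        "gmsc-mapper", "createdb",
        "-i", os.path.join(dbdir, GMSC_FASTA),
        "-o", dbdir,
        "-m", mode,
    ]
    return download, create


for command in database_setup_commands("./db"):
    print(" ".join(command))
```

Each returned list can be passed to subprocess.run(command, check=True) once GMSC-mapper is installed. Building both commands from one dbdir value makes it impossible for the two steps to disagree on the directory.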
The GMSC database directory (--dbdir) and the output directory (-o) can be set to any location. The defaults are ./db and ./output.
If the GMSC-mapper directory is your current working directory, you can directly follow the commands below. Otherwise, set --dbdir to the directory that contains the database files.
cd GMSC-mapper
- Input is genome contig sequences.
gmsc-mapper -i ./examples/example.fa --dbdir ./db
- Input is amino acid sequences.
gmsc-mapper --aa-genes ./examples/example.faa --dbdir ./db
- Input is nucleotide gene sequences.
gmsc-mapper --nt-genes ./examples/example.fna --dbdir ./db
If you do not want habitat, taxonomy, quality, or conserved domain annotation, use --no-habitat, --no-taxonomy, --no-quality, or --no-domain, respectively.
gmsc-mapper -i ./examples/example.fa --dbdir ./db --no-habitat --no-taxonomy --no-quality --no-domain
The default alignment tool is DIAMOND. To use MMseqs2 instead, you first need to create the GMSC database index in MMseqs2 format.
gmsc-mapper createdb -i ./db/GMSC10.90AA.faa.gz -o ./db -m mmseqs
Then you can set --tool to mmseqs.
gmsc-mapper -i ./examples/example.fa --dbdir ./db --tool mmseqs
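When GMSC-mapper is called from a script, the usage variants above differ only in the input flag, the alignment tool, and the annotations to skip. The following is a minimal Python sketch under that reading. The function build_mapper_command and the names "contigs", "aa", and "nt" are hypothetical labels introduced here; the flags themselves are the ones shown in the commands above.

```python
# Input flag for each input type shown in the usage examples.
INPUT_FLAGS = {
    "contigs": "-i",
    "aa": "--aa-genes",
    "nt": "--nt-genes",
}

# Flags that switch off individual annotation steps.
SKIP_FLAGS = {
    "habitat": "--no-habitat",
    "taxonomy": "--no-taxonomy",
    "quality": "--no-quality",
    "domain": "--no-domain",
}


def build_mapper_command(path, input_type="contigs", dbdir="./db",
                         tool="diamond", skip=()):
    """Build a gmsc-mapper argument list (hypothetical helper, runs nothing)."""
    if input_type not in INPUT_FLAGS:
        raise ValueError(f"unknown input type: {input_type!r}")
    if tool not in ("diamond", "mmseqs"):
        raise ValueError(f"unknown tool: {tool!r}")
    command = ["gmsc-mapper", INPUT_FLAGS[input_type], path, "--dbdir", dbdir]
    # DIAMOND is the default, so --tool is only added for MMseqs2.
    # The MMseqs2 index must already exist in dbdir (see createdb -m mmseqs).
    if tool != "diamond":
        command += ["--tool", tool]
    for name in skip:
        command.append(SKIP_FLAGS[name])
    return command


print(" ".join(build_mapper_command("./examples/example.faa", input_type="aa")))
```

The returned list can be handed to subprocess.run(command, check=True). Keeping the command as a list avoids shell quoting problems with unusual file names.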
The output folder will contain:
- Outputs of smORFs prediction (predicted.filterd.smorf.faa)
  A FASTA file with the sequences of the predicted smORFs. It is generated when the input file is contigs.
- Complete alignment result table (diamond.out.smorfs.tsv / mmseqs.out.smorfs.tsv)
  A file listing all the query hits in GMSC, from DIAMOND or MMseqs2.
  The file format follows this list of keywords:
  - qseqid: Query seq id
  - sseqid: Target seq id (in GMSC)
  - full_qseq: Query sequence
  - full_sseq: Target sequence (in GMSC)
  - qlen: Query sequence length
  - slen: Target sequence length
  - length: Alignment length
  - qstart: Start of alignment in query
  - qend: End of alignment in query
  - sstart: Start of alignment in target
  - send: End of alignment in target
  - bitscore: Bit score
  - pident: Percentage of identical matches
  - evalue: Expect value
  - qcovhsp: Query coverage
  - scovhsp: Target coverage
- Total smORFs homologous to GMSC (mapped.smorfs.faa)
  A FASTA file with the sequences of query/predicted smORFs homologous to GMSC.
- Habitat annotation of smORFs (optional) (habitat.out.smorfs.tsv)
  This file lists the habitat annotations of the query/predicted sequences, where the habitat is taken from the annotations of the homologous sequences in GMSC.
  There are two columns in the file:
  - qseqid: Query seq id
  - habitat: Habitat, ',' separated if the sequence is from multiple habitats
- Taxonomy annotation of smORFs (optional) (taxonomy.out.smorfs.tsv)
  This file lists the taxonomy annotations of the query/predicted sequences, where the taxonomy is taken from the annotations of the homologous sequences in GMSC.
  There are two columns in the file:
  - qseqid: Query seq id
  - taxonomy: Taxonomy, ';' separated between each taxonomy rank
- Quality annotation of smORFs (optional) (quality.out.smorfs.tsv)
  This file lists the quality annotations of the query/predicted sequences, where the quality is taken from the annotations of the homologous sequences in GMSC.
  - qseqid: Query seq id
  - quality: Quality label
- Conserved domain annotation of smORFs (optional) (domain.out.smorfs.tsv)
  This file lists the conserved domain annotations of the query/predicted sequences, where the conserved domain is taken from the annotations of the homologous sequences in GMSC.
  - qseqid: Query seq id
  - cdd: Identifiers from the Conserved Domain Database, ',' separated if the sequence is annotated with multiple conserved domains
- Summary (summary.txt)
  A file providing a human-readable summary of the results.
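The tables described above can be loaded for downstream analysis with the Python standard library. The following is a minimal sketch, not part of GMSC-mapper. It assumes the .tsv files are tab-separated and that the alignment columns appear in the order listed above; both are assumptions that should be checked against your own output. The function names read_alignment_table and read_annotation are hypothetical.

```python
import csv

# Column order as listed in the output description; assumed to match the file.
ALIGNMENT_COLUMNS = [
    "qseqid", "sseqid", "full_qseq", "full_sseq", "qlen", "slen", "length",
    "qstart", "qend", "sstart", "send", "bitscore", "pident", "evalue",
    "qcovhsp", "scovhsp",
]
INT_COLUMNS = {"qlen", "slen", "length", "qstart", "qend", "sstart", "send"}
FLOAT_COLUMNS = {"bitscore", "pident", "evalue", "qcovhsp", "scovhsp"}


def read_alignment_table(lines):
    """Yield one dict per alignment row from an iterable of text lines."""
    for fields in csv.reader(lines, delimiter="\t"):
        # Skip blank lines and a header line, in case the file has one.
        if not fields or fields[0] == "qseqid":
            continue
        if len(fields) < len(ALIGNMENT_COLUMNS):
            raise ValueError(f"expected {len(ALIGNMENT_COLUMNS)} columns, got {len(fields)}")
        row = dict(zip(ALIGNMENT_COLUMNS, fields))
        for name in INT_COLUMNS:
            row[name] = int(row[name])
        for name in FLOAT_COLUMNS:
            row[name] = float(row[name])
        yield row


def read_annotation(lines, separator):
    """Map each query id to the list of values in the second column.

    Use separator="," for the habitat and domain files and ";" for taxonomy.
    """
    annotations = {}
    for fields in csv.reader(lines, delimiter="\t"):
        # Skip incomplete rows and a header line, in case the file has one.
        if len(fields) < 2 or fields[0] == "qseqid":
            continue
        annotations[fields[0]] = [value for value in fields[1].split(separator) if value]
    return annotations
```

Both functions accept any iterable of lines, so an open file works directly, for example open("./output/habitat.out.smorfs.tsv") when the default output directory is used. Returning plain dictionaries keeps the sketch free of third-party dependencies.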
Options of gmsc-mapper:
- -i/--input: Path to the input genome contig sequence FASTA file (possibly .gz compressed).
- --aa-genes: Path to the input amino acid sequence FASTA file (possibly .gz compressed).
- --nt-genes: Path to the input nucleotide gene sequence FASTA file (possibly .gz compressed).
- --dbdir: Path to the GMSC database directory. (default: ./db)
- -o/--output: Output directory (will be created if non-existent). (default: ./output)
- --tool: Sequence alignment tool (Diamond / MMseqs). (default: diamond)
- -s/--sensitivity: Sensitivity. (default: --more-sensitive for Diamond, 5.7 for MMseqs)
- --id: Minimum identity to report an alignment (range 0.0-1.0). (default: 0.0)
- --cov: Minimum coverage to report an alignment (range 0.0-1.0). (default: 0.9)
- -e/--evalue: Maximum e-value to report alignments. (default: 1e-05)
- -t/--threads: Number of CPU threads. (default: 1)
- --filter: Use this to filter <100 aa or <303 nt input sequences. (default: False)
- --no-habitat: Use this if there is no need to annotate habitat. (default: False)
- --no-taxonomy: Use this if there is no need to annotate taxonomy. (default: False)
- --no-quality: Use this if there is no need to annotate quality. (default: False)
- --no-domain: Use this if there is no need to annotate conserved domains. (default: False)
- --quiet: Disable alignment console output. (default: False)
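Because --id and --cov are fractions in the range 0.0-1.0 rather than percentages, it can help to validate them in a wrapper script before launching a long run. The following is a minimal Python sketch. The helper threshold_arguments is hypothetical, and its defaults simply copy the values documented in the option list above.

```python
def threshold_arguments(identity=0.0, coverage=0.9, evalue=1e-05, threads=1):
    """Validate alignment thresholds and return them as command-line arguments.

    Hypothetical helper: the checks mirror the ranges documented for the options.
    """
    for flag, value in (("--id", identity), ("--cov", coverage)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{flag} must be between 0.0 and 1.0, got {value}")
    if evalue < 0:
        raise ValueError(f"-e must not be negative, got {evalue}")
    if threads < 1:
        raise ValueError(f"-t must be at least 1, got {threads}")
    return [
        "--id", str(identity),
        "--cov", str(coverage),
        "-e", str(evalue),
        "-t", str(threads),
    ]


print(threshold_arguments(identity=0.5, threads=4))
```

A call such as threshold_arguments(coverage=90) raises ValueError, which catches the common mistake of passing a percentage where a fraction is expected.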
Subcommands: gmsc-mapper downloaddb
- --dbdir: Path to the GMSC database annotation index files. (default: ./db. If GMSC-mapper is your current working directory, the database files will be downloaded to GMSC-mapper/db)
- --all: Download all database files.
- -f: Force download even if the files exist.
Subcommands: gmsc-mapper createdb
- -i: Path to the GMSC FASTA file.
- -o/--output: Path to database index output of Diamond and MMseqs2. (default: ./db. If GMSC-mapper is your current working directory, the database files will be created at GMSC-mapper/db)
- -m/--mode: Alignment tool (Diamond / MMseqs2).
- --quiet: Disable alignment console output. (default: False)