This software extracts biological sequences (such as spliced transcripts, cds, or peptides) from specific regions of a genomic fasta file based on a given GFF3
file.
- Free software: license
- Incorporation of gff3.py:
gff3.py
is contributed by Han Lin which uses simple data structures to parse aGFF3
file into a structure composed of simple pythondict
andlist
. - Validation: Validate the GFF3 formatting errors utilizing QC methods contributed by the I5K Workspace@NAL team. Provide
WARNING
messages for gene models that may have incorrect biological sequences generated because ofGFF3
formatting errors. - Easy extraction of biological sequences: Provide options for extracting six types of biological sequences.
gene
: Gene sequence for each record in theFASTA
output. Gene or pseudogene features need to be included in the gff fileexon
: Exon sequence for each record in theFASTA
output. Exon features need to be included in the gff filepre_trans
: Genomic region of a transcript model, namely premature transcript (exon and intron regions included), for each record in theFASTA
output. Transcript-level features (such as mRNA, rRNA, pseudogenic transcripts) need to be included in the gff file.trans
: Spliced transcript (only exons included) for each record in theFASTA
output. Exon features are mainly used for splicing. CDS features are used instead if exon features are absent. If both cds and exon features are absent, the transcript is not generated and aWARNING
message is shown with the transcript ID.cds
: Coding sequence (utr exons and introns excluded) for each record in theFASTA
output. CDS features need to be included in the gff file.pep
: Translated peptide sequences (translation based on cds regions) for each record in theFASTA
output. CDS features need to be included in the gff file.
translator
method for universal translation: Thetranslator
method is feasible for- translation from 64 combitions of standard codons (Only standard codons and universal stop condons are considered.)
- translation from codons with IUB Depiction
- translation from mRNA (U contained) or CDS (T, instead of U contained)
Sequence extraction based on genome annotation file in GFF3
format might generate incorrect sequences if the GFF3
file contains certain errors. This program can automatically validate GFF3
format and list the detected errors for users along with the extracted sequences by simple command as follows:
python gff3_to_fasta.py -g example_file/annotations3.gff -f example_file/sample.fa -st cds -d simple -o sample_output
If you would like to ignore the QC step for checking errors in gff file, you can use the below command:
python gff3_to_fasta.py -g example_file/annotations3.gff -f example_file/sample.fa -st cds -d simple -o sample_output -noQC
If you would like to incorporate a specfic method in gff3_to_fasta
into other python script, below is a suggested script.
import gff3_to_fasta
seq = 'GTGGCTCGTTTGATTGAACAAATATGTACTAACCCAGTTGGATTATCTGGATCTGGATTTTTTCTGGTGACAAAGAATTTTCTACTTCAGATGGCAGGAACGATAGTTACATTTGAACTGATGCTGTTTCAATTTGCCCCAGTAAATGCACAGCAAAAACCCATGAAGTCATATGACTGTATTTAA'
pep = gff3_to_fasta.translator(seq)
print 'Input: {0:s}'.format(seq)
print 'Translation: {0:s}'.format(pep)
usage: gff3_to_fasta.py [-h] [-g GFF] [-f FASTA] [-st SEQUENCE_TYPE]
[-d DEFLINE] [-o OUTPUT_PREFIX] [-noQC] [-v]
-h
(--help): Show help message and exit-g
(GFF, --gff GFF): Genome annotation file in GFF3 format. The sequence IDs in the GFF3 file must correspond to the sequence IDs in the given FASTA file.-f
(FASTA, --fasta): Genome sequence file in FASTA format.-st
(SEQUENCE_TYPE, --sequence_type): Type of sequences you would like to extract. It must be one of the options below.- all - FASTA files for all types of sequences listed below
- gene - gene sequence for each record
- exon - exon sequence for each record
- pre_trans - genomic region of a transcript model (premature transcript)
- trans - spliced transcripts (only exons included)
- cds - coding sequences
- pep - peptide seuqences
-d
(DEFLINE, --defline): Specify defline format. It must be one of the options below.- simple - only ID is shown in the defline; eg.
>LDEC008409-PA
- complete - complete information of the feature is shown in the defline; eg.
>Scaffold194:420160..444997:-|peptide|Parent=LDEC008409|ID=LDEC008409-PA|Name=LDEC008409-PA
- simple - only ID is shown in the defline; eg.
-o
(OUTPUT_PREFIX, --output_prefix): Prefix of output file name-noQC
(--quality_control): Specify this option if you do not want to execute quality control of the gff3 file, otherwise QC will be executed while running the program. (default: QC is executed)-v
(--version): Show program's version number and exit