PhyLTR was designed and created mainly in the Der Lab at California State University, Fullerton. The main program is written in Python 3 and tested on Scientific and Ubuntu Linux. Many of the routines in PhyLTR are parallelized, but not for MPI. As the pipeline runs, paths to intermediate results are stored in the file PhyLTR.output/status
. If the execution is interrupted and restarted, this file is used to allow PhyLTR to resume approximately where it left off.
A nucleotide FASTA file (e.g. genome assembly)
- LTR-R annotations in GFF3 format
- Putative solo LTR annotations GFF3 format
- Clusterings
- Rooted, bootstrapped phylogenies as phylograms and ultrametric trees for each cluster in newick format
- LTR divergence estimates
- Gene conversion assessments
- Visualizations and modeling of branching dynamics (external scripts)
PhyLTR populates a directory structure, keeping results for different clusterings separate. The results obtained prior to clutering are used for any post-clustering analysis. Within PhyLTR.output
(default):
LTRharvest/ LTRharvest results
suffixerator/ suffix array used by LTRharvest
LTRdigest/ LTRdigest results
AnnotateORFs/ intermediate files for ORF annotation
Circos/ Circos intermediate files and plots
DfamClassification/ nhmmer search of Dfam for LTR-R homologs
RepbaseClassification/ tblastx search of Repbase for LTR-R homologs
FASTA_output/ some intermediate FASTA files
GFF_output/ various intermediate and final GFF3 files
WickerFamDir/ WickerFam clusterings and downstream analyses
MCL/ MCL clusterings and downstream analyses
Within each clustering directory, e.g MCL/I6/
, are:
Alignments/ All alignments
Clusters/ Results of clustering
FASTAs/ Intermediate FASTA files
GENECONV/ Gene conversion analyses
GFFs/ Intermediate GFF3 files
LTR_divergence/ LTR divergence estimation analyis
Modeltest/ Model testing files
SoloLTRsearch/ "Solo LTR" search files
Trees/ Phylogenetic analyses
If none of the 11 flags below (excluding --fasta and --procs) are specified, then the default settings are used: All 11 flags are included. Other flags may be used to modify settings in either default or specific mode. The processes specified by the flags are explained below with additional optional flags. If any of these 11 flags are included in the call then it and all prerequisite processes are carried out, but not any of the others unless they are also included.
phyltr --fasta <input> --procs <int> \
--ltrharvest \
--ltrdigest \
--wicker \
--mcl \
--geneconvclusters \
--circos \
--sololtrsearch \
--geneconvltrs \
--ltrdivergence \
--phylo \
--LTT
--keep_files Keeps more intermediate files.
--output_dir (PhyLTR.output) Output directory.
--logfile (<output_dir>/log.txt) Desired file for log
--min_clust_size (7) Do not align/infer phylogenies from clusters smaller than this.
--nosmalls Do not combine and assemble all clusters smaller than --min_clust_size
These apply to all steps that do alignments. Aligning can fail to complete if the input data are too big and/or not enough RAM is available, for example, MAFFT exhausted 256 Gb RAM with ~2.7k seqs of length >5kb. You can cap the size of clusters to align using --mafft_largeAln_maxclustsize
. The MAFFT algorthim FFT-NS-2 is used for small and medium clusters and FFT_NS-1, which is much more inaccurate, for large clusters. The alignment of clusters has been the limiting process in terms of computation time in my experience, but aligning can be sped up by reducing the number of improvement iterations performed. You can use the following options to designate ranges of cluster sizes and set the number of alignment improvement iterations for each size (fewer iterations = faster).
--mafft_align_region (entire) Sequence region to align. entire or internal. Internal = between LTRs
--maxiterate_small_clusters (30) MAFFT iterations for small clusters. (more = better alignment = slower)
--maxiterate_medium_clusters (3) MAFFT iterations for medium clusters. (more = better alignment = slower)
--mafft_smallAln_maxclustsize (50) Max elements to consider a cluster small.
--mafft_mediumAln_maxclustsize (500) Max elements to consider a cluster medium.
--mafft_largeAln_maxclustsize (1000) Max elements to consider a cluster large. Clusters larger than this will not be aligned.
LTRharvest finds loci with the expected structure of full-length LTR-Rs, TSD-LTR-LTR-TSD (TSD = targest site duplication). LTRharvest searches a suffix array to make things fast, which PhyLTR creates from the FASTA input with the GenomeTools program suffixerator.
- Annotations of possible LTR-Rs, including the subfeatures TSDs, LTRs, primer-binding sites (GFF3)
--minlenltr (100) Minimum LTR length (bp)
--maxlenltr (1000) Maximum LTR length (bp)
--mindistltr (1000) Minimum distance between LTRs (bp)
--maxdistltr (15000) Maximum distance between LTRs (bp)
--similar (0.0) Minimum % similarity between LTRs
--vic (60) Distance (bp) beyond LTRs to search for TSDs
--mintsd (4) Minimum length for each TSD
--maxtsd (20) Maximum length for each TSD
--xdrop (5) xdropbelow score for extension-alignment
--mat (2) Match score for extension-alignment
--mis (-2) Mismatch score for extension-alignment
--insi (-3) Insertion score for extension-alignment
--del (-3) Deletion score for extension-alignment
- GenomeTools
If both are run, LTRdigest runs first, then the ORF-finding routine.
LTRdigest coordinates HMMER3 searches for transposable element protein coding sequence homologs in the internal regions (between the LTRs) of the putative LTR-Rs from step 1. A set of TE-related pHMMs is included with PhyLTR.
- Modified version of step 1 GFF3 that also has protein-coding domain annotations (GFF3)
- Nucleotide sequences for each domain (FASTA)
- TSD motifs and domain orders for each element (TSV)
--ltrdigest_hmms (PhyLTR/RepeatDatabases/LTRdigest_HMMs/hmms) path to pHMMs
- GenomeTools
- HMMER3
- pHMMs
Internal regions are searched for ORFs that don't overlap any preexisting annotation from step 2A and are longer than a user-defined threshold.
- Modified version of step 2A or 1 output with ORF annotations + translated sequences (GFF3)
--min_orf_len (300) The minimum length (bp) of ORF to annotate
- BEDtools
- EMBOSS
3. Classify elements using homology to LTR-Rs in Dfam and/or Repbase and remove false positives: on by default
Both methods use homology-based evidence for classifying elements as one of the classifications obtained from the database records: BEL, Copia, DIRS, Endogenous Retrovirus, ERV1, ERV2, ERV3, ERV4, Gypsy, Lentivirus (Repbase) and Copia, DIRS, ERV?, ERV1, ERV1?, ERV-Foamy, ERVK, ERVK?, ERVL, ERVL?, ERVL-MaLR, Gypsy, Gypsy?, Ngaro, Pao, Undefined, Unknown (Dfam Superfamily). As currently implemented, Dfam hits trump Repbase because they are expected to be longer, and only the highest scoring hits are considered. This worked well for our test genomes and classifications matched domain annotations (e.g. gypsy elements always and only had gypsy domains).
- LTR-R annotations with false positives removed and Dfam and/or Repbase annotations (GFF3)
- LTR-R annotations separated by classification (GFF3s)
Finds homologs in Dfam using nhmmer
- LTR-R annotations with best Dfam hit in attributes (GFF3)
--keep_no_classifications Retain elements without homology to known LTR-Rs
--nhmmer_reporting_evalue (10) See HMMER3 documentation: nhmmer -E
--nhmmer_inclusion_evalue (1e-2) See HMMER3 documentation: nhmmer -incE
- BEDtools
- HMMER3
- Dfam
Finds homologs in Repbase using tblastx
- LTR-R annotations with best Repbase hit in attributes (GFF3)
--keep_no_classifications Retain elements without homology to known LTR-Rs
--repbase_tblastx_evalue (1e-5) Maximum E-value for tblastx hits
- BEDtools
- NCBI BLAST+
- Repbase
An implementation of the method suggested by Wicker et al. (2007) for circumscribing putative LTR-R families. Elements are assigned to the same cluster if at least A percentage of an alignment of either their LTRs or internal regions at least B bp in length has at least C % identity, with defaults being 80%, 80 bp, and 80%. Separate all-by-all blastns are performed for LTRs and/or internal regions of elements in each classification and used to construct a graph, subject to constraints (see options below). Clusters are assigned the elements in the connected components in the graph which are discovered using depth first search.
- Cluster assignments for each element (TSV)
- Subsets of main annotation file for every cluster (GFF3s)
--wicker_minLen (80) Minimum length of blastn alignment
--wicker_pAln (80) Minimum percent of LTR or internal region required in alignment
--wicker_pId (80) Minimum %identity in alignment
--wicker_no_internals Turns off use of internal region alignments for clustering
--wicker_no_ltrs Turns off use of LTR alignments for clustering
- BEDtools
- NCBI Blast+
An implementation of the protocol Clustering similarity graphs encoded in BLAST results in the protocols section on the MCL website. Separate all-by-all blastns of entire LTR-R sequences are performed for elements in each classification and MCL programs are used to carry out MCL clustering. A single parameter set by --I controls the output. See MCL documentation for a discussion of setting this parameter.
- Cluster assignments for each element (TSV)
- Subsets of main annotation file for every cluster (GFF3s)
--I (6) MCL inflation paramter. Larger values result in smaller clusters.
- BEDtools
- NCBI Blast+
- MCL
Finds evidence of gene conversion between the LTRs of each element using GENECONV as in Cossu et al. (2017). If this step is run, gene conversion estimates will be used to scale LTR divergence estimates as in Casola et al. (2010) in the next step (5B).
- Aggregated GENECONV results (TSV)
--geneconv_g (g1,g2,g3) Comma-separated list, g1, g2, and/or g3. Stringency for mismatch-free gene conversion tracts: g0 > g2 > g1
- BEDtools
- MAFFT
- trimAl
- GENECONV
Estimates sequence divergences (substitutions per site) between LTRs for each element. Models for nucleotide subsitution are selected using jModeltest2. Input for jModeltest2 is a multiple sequence alignment of LTRs (generated by MAFFT) and a tree. Trees are inferred using FastTree2 with GTR from multiple sequence alignments of entire elements from each cluster performed by MAFFT and trimmed by trimAl. PAUP* is used to estimate divergences using the best models and corresponding ML-estimated parameters. Divergences are scaled linearly by the proportion of aligment represented by a g0 GENECONV tract.
- Raw and scaled divergence estimates and best supported models (TSV)
--remove_GC_from_modeltest_aln Remove elements with gene conversion (--geneconvclusters)
- BEDtools
- MAFFT
- trimAl
- FastTree2
- jModelTest2
- PAUP*
Finds candidate solo LTRs in the input fasta by performs blastn of each LTR to the input FASTA and retains the highest scoring hits that do not overlap a full-length LTR-R or other blastn hits. These hits may be referred to as "solo LTRs", but they may represent the following categories of features: (1) LTRs formed by unequal recombination (2) LTRs from truncated elements formed by illegitimate recombination (3) LTRs of elements that were discarded as false positives. (4) LTRs of elements that were missed by LTRharvest. The fraction of candidate solo LTRs in category 3 and 4 is expected to be small. The constraints of this search are consistent with those in the WickerFam classification.
- Summary file with cluster, cluster size (full-length elements), and number of solo LTRs (TSV)
- Separate solo LTR annotation files for each classification and for each cluster (GFF3s)
--soloLTRminPid (80.0) Minimum %identity in blastn alignment to associate LTR with a cluster
--soloLTRminLen (80.0) Minimum % of length of LTR required in alignment to associate LTR with a cluster
--soloLTRmaxEvalue (1e-3) Maximum E-value for blastn
- BEDtools
- NCBI BLAST+
Finds evidence of gene conversion in multiple alignments of entire elements for each cluster using GENECONV.
Aggregated GENECONV results (TSV) Summary file (TSV)
--geneconv_g (g1,g2,g3) Comma-separated list, g1, g2, and/or g3. Stringency for mismatch-free gene conversion tracts: g0 > g2 > g1
- BEDtools
- MAFFT
- trimAl
- GENECONV
Runs Circos to make plots showing gene conversion tracts between elements as links. (g0=orange, g2=blue, g1=purple)
- Circos plots showing the position of LTR-Rs on their parent sequence with GC tracts as links
- Circos plots showing extent of GC tracts along lengths of elements (i.e. the parent sequences are the elements in a cluster.
- BEDtools
- Circos
Infers phylogenies from alignments of entire elements for each cluster, optionally with an outgroup, using FastTree2 w/GTR. The alignment is resampled for bootstrapping using SEQBOOT from PHYLIP and optionally converted into an ultrametric tree using PATHd8.
- Trees (Newick/parenthetical)
--min_clust_size (7) Do not align clusters smaller than this.
--nosmalls Do not analyze clusters smaller than --min_clust_size
--rmhomoflank Exclude elements with non-unique flanking sequences from alignments.
--bpflank (500) Length (bp) of flank searched
--flank_evalue (1e-5) Maximum E-value for blastn for flank search
--flank_pId (70.0) Minimum %identity in blastn alignment for flank search
--flank_plencutoff (70.0) Minimum % of length of flank required in alignment
--convert_to_ultrametric Convert tree to ultrametric (can be used for LTT plots)
--auto_outgroup Include as an outgroup a random element from the largest available other cluster in classification (e.g. gypsy)
--bootstrap_reps (100) Number of bootstrap replicates to perform
--LTT Turns on --rmhomoflank, --convert_to_ultrametric, and --auto_outgroup.
- BEDtools
- MAFFT
- trimAl
- FastTree2
- PATHd8
- PHYLIP
Usage:
------------
domainSearch.py -gff <gff3> -ref <fasta> -prot <fasta> -procs <int>
Description:
------------
This script was written for comparing putative protein-coding domains
and ORFs in a GFF3 as output from PhyLTR (LTRharvest+LTRdigest+AnnotateORFs).
The required programs are bedtools getfasta, and makeblastdb and blastp from
BLAST+. Their paths should be in a text file named CONFIG file located in the
same directory as this script using the same format used for PhyLTR.
Process:
-----------
1. Putative domain sequences encoded in -gff are extracted from -ref and
translated
2. Translated sequences are compared to protein sequences in -prot, which could
be, for example, the protein sequences for a
Mandatory flags:
-----------
-gff <path> GFF3 file with LTRdigest-format LTR retrotransposon features
-ref <path> Nucleotide FASTA file that is the reference associated with -gff
-prot <path> Protein FASTA file to convert to a blast database and
-procs <int> Number of processors to use for blastp
-out <path> Output file path - you may need to give full path
Optional flags:
-----------
-evalue <num> max E-value for blastp. Default 1e-2
- Python 3
- BEDtools
- NCBI BLAST+
Usage:
ete3_tree.py -t <newick_tree> -d <divergences> -g <gff> [options]
Description:
Draws trees for long terminal repeat retrotransposon phylogenies with diagrams of LTR elements' domain
architecture.
-lflabel
Show element IDs (integers) as leaf labels.
-classif
Add the superfamily classification for each element (which is obtained from the LTR divergence file)
above each LTR RT diagram.
-geneconv
Add the word 'Yes' or 'No' to the immediately to the right of each leaf depending on whether intra-element
gene conversion tracts were detected between the LTRs of that element.
-reroot <int>|auto
Two options are possible for -reroot: 'auto', or an <int> corresponding to the element name (i.e.
LTR_retrotransposon<int>) to position as the earliest diverging lineage. Only use -reroot auto if
the newick filename contains the outgroup in the format output by PhyLTR.
-ultrametric
Draw tree after applying ete3's convert_to_ultrametric() function. In my experience using this option, I
have always seen terminal taxa drawn at different horizontal positions making the tree not look ultrametric.
-1, 2, ..., -n <path> <str>
Coloring ORFs. <path> is list of ORF IDs, str is ETE3 color
-transcribed <path>
A file containing a list of element IDs (e.g. LTR_retrotransposon123) to mark with a green asterisk.
Instead of an asterisk, a 'T' is shown as if -classif is used.
-round
Draw ORFs with rounded corners.
-outfmt <str>
One of (pdf, png, svg). Default=pdf
Colors
protease yellow
gag deepskyblue
copia gag mediumblue
DUF4219 lime
rt red
rnaseh darkviolet
int magenta
zf-h2c2 pink
other silver w/black text
orf dimgray
annotated orf custom
- Python 3
- Python 3 modules: ETE3
- Distributions
- Lineage through time plots
- Transposition/deletion rate modeling
- Tree shape analysis
General | Default |
---|---|
-h or --h | |
-help or --help | |
--logfile | log.txt |
--procs | 1 |
--output_dir | PhyLTR.output |
--keep_files | |
--min_clust_size | 7 |
--nosmalls |
Input | |
---|---|
--fasta |
LTRharvest | Default |
---|---|
--ltrharvest | |
--del | -3 |
--ins | -3 |
--mis | -2 |
--mat | 2 |
--xdrop | 5 |
--minlenltr | 100 |
--maxlenltr | 1000 |
--mindistltr | 1000 |
--maxdistltr | 15000 |
--similar | 0.0 |
--vic | 60 |
--mintsd | 4 |
--maxtsd | 20 |
LTRdigest | Default |
---|---|
--ltrdigest | |
--ltrdigest_hmms | PhyLTR/RepeatDatabases/LTRdigest_HMMs/hmms |
ORFs | Default |
---|---|
--findORFs | |
--min_orf_len | 300 |
Classification | Default |
---|---|
--classify | |
--classify_dfam | |
--nhmmer_reporting_evalue | 10 |
--nhmmer_inclusion_evalue | 1e-5 |
--classify_repbase | |
--repbase_tblastx_evalue | 1e-5 |
--keep_conflicting_classifications | |
--keep_no_classification |
Cluster | Default |
---|---|
--wicker | |
--wicker_pId | 80 |
--wicker_pAln | 80 |
--wicker_minLen | 80 |
--wicker_no_ltrs | |
--wicker_no_internals | |
--mcl | |
--I | 6 |
Find solo LTRs | Default |
---|---|
--soloLTRsearch | |
--soloLTRminPid | 80.0 |
--soloLTRminLen | 80.0 |
--soloLTRmaxEvalue | 1e-3 |
Alignment settings | Default |
---|---|
--maxiterate_small_clusters | 20 |
--maxiterate_medium_clusters | 3 |
--mafft_smallAln_maxclustsize | 50 |
--mafft_mediumAln_maxclustsize | 500 |
--mafft_largeAln_maxclustsize | 1000 |
LTR divergence and gene conversion | Default |
---|---|
All alignment settings | |
--geneconvltrs | |
--geneconvclusters | |
--geneconv_g | g0,g1,g2 |
--remove_GC_from_modeltest_aln | |
--ltrdivergence | |
--circos |
Phylogenetics | Default |
---|---|
All alignment settings | |
--phylo | |
--LTT | |
--rmhomoflank | |
--bpflank | 500 |
--flank_evalue | 1e-5 |
--flank_pId | 70.0 |
--flank_plencutoff | 70.0 |
--auto_outgroup | |
--bootstrap_reps | 100 |
--convert_to_ultrametric |
Bao, W., Kojima, K. K., & Kohany, O. (2015). Repbase Update, a database of repetitive elements in eukaryotic genomes. Mobile DNA, 6(1), 11. http://doi.org/10.1186/s13100-015-0041-9
Britton, T., Anderson, C. L., Jacquet, D., Lundqvist, S., & Bremer, K. (2007). Estimating divergence times in large phylogenetic trees. Systematic Biology, 56(5), 741-752. http://doi.org/10.1080/10635150701613783
Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., & Madden, T. L. (2009). BLAST+: architecture and applications. BMC Bioinformatics, 10(1), 421. http://doi.org/10.1186/1471-2105-10-421
Capella-Gutiérrez, S., Silla-Martínez, J. M., & Gabaldón, T. (2009). trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics, 25(15), 1972-1973. http://doi.org/10.1093/bioinformatics/btp348
Casola, C., Ganote, C. L. & Hahn, M. W. Nonallelic Gene Conversion in the Genus Drosophila. Genetics 185, 95-103 (2010).
Cossu, R. M., Casola, C., Giacomello, S., Vidalis, A., Scofield, D. G., & Zuccolo, A. (2017). LTR Retrotransposons Show Low Levels of Unequal Recombination and High Rates of Intraelement Gene Conversion in Large Plant Genomes. Genome Biology and Evolution, 9(12), 3449-3462. http://doi.org/10.1093/gbe/evx260
Darriba, D., Taboada, G. L., Doallo, R., & Posada, D. (2012). jModelTest 2: more models, new heuristics and parallel computing. Nature Methods, 9(8), 772-772. http://doi.org/10.1038/nmeth.2109
van Dongen, Stijn. Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, May 2000.
Eddy, S. R. A new generation of homology search tools based on probabilistic inference. Genome Inform 23, 205–211 (2009).
Ellinghaus, D., Kurtz, S., & Willhoeft, U. (2008). LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics, 9(1), 18-14. http://doi.org/10.1186/1471-2105-9-18
Felsenstein, J. 2005. PHYLIP (Phylogeny Inference Package) version 3.6. Distributed by the author. Department of Genome Sciences, University of Washington, Seattle.
Finn, R. D., Coggill, P., Eberhardt, R. Y., Eddy, S. R., Mistry, J., Mitchell, A. L., et al. (2016). The Pfam protein families database: towards a more sustainable future. Nucleic Acids Research, 44(D1), D279-85. http://doi.org/10.1093/nar/gkv1344
Gremme, G., Steinbiss, S., & Kurtz, S. (2013). GenomeTools: a comprehensive software library for efficient processing of structured genome annotations. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 10(3), 645-656. http://doi.org/10.1109/TCBB.2013.68
Hubley, R., Finn, R. D., Clements, J., Eddy, S. R., Jones, T. A., Bao, W., et al. (2016). The Dfam database of repetitive DNA families. Nucleic Acids Research, 44(D1), D81-9. http://doi.org/10.1093/nar/gkv1272
Huerta-Cepas, J., Serra, F., & Bork, P. (2016). ETE 3: Reconstruction, Analysis, and Visualization of Phylogenomic Data. Molecular Biology and Evolution, 33(6), 1635-1638. http://doi.org/10.1093/molbev/msw046
Katoh, K., & Standley, D. M. (2013b). MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular Biology and Evolution, 30(4), 772-780. http://doi.org/10.1093/molbev/mst010
Krzywinski, M., Schein, J., Birol, I., Connors, J., Gascoyne, R., Horsman, D., et al. (2009). Circos: an information aesthetic for comparative genomics. Genome Research, 19(9), 1639-1645. http://doi.org/10.1101/gr.092759.109
Llorens, C., Futami, R., Covelli, L., Domínguez-Escribá, L., Viu, J. M., Tamarit, D., et al. (2011). The Gypsy Database (GyDB) of mobile genetic elements: release 2.0. Nucleic Acids Research, 39(Database issue), D70-4. http://doi.org/10.1093/nar/gkq1061
Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2--approximately maximum-likelihood trees for large alignments. PLoS ONE 5, e9490 (2010).
Quinlan, A. R., & Hall, I. M. (2010). BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26(6), 841-842. http://doi.org/10.1093/bioinformatics/btq033
Rice, P., Longden, I., & Bleasby, A. (2000). EMBOSS: the European Molecular Biology Open Software Suite. Trends in Genetics, 16(6), 276-277.
Sawyer, S.A. (1999) GENECONV: A computer package for the statistical detection of gene conversion. Distributed by the author, Department of Mathematics, Washington University in St. Louis, available at http://www.math.wustl.edu/~sawyer.
Steinbiss, S., Willhoeft, U., Gremme, G., & Kurtz, S. (2009). Fine-grained annotation and classification of de novo predicted LTR retrotransposons. Nucleic Acids Research, 37(21), 70027013. http://doi.org/10.1093/nar/gkp759
Swofford, D. L. 2003. PAUP*. Phylogenetic Analysis Using Parsimony (* = and Other Methods). Version 4. Sinauer Associates, Sunderland, Massachusetts.
Wicker, T. et al. A unified classification system for eukaryotic transposable elements. Nat Rev Genet 8, 973–982 (2007).