A tutorial on the analysis of hybridization and introgression with whole-chromosome alignments
Phylogenetic approaches based on a large number of sequence alignments from across the genome can serve as a useful complement to SNP-based analyses for the detection of past introgression events. In contrast to SNP-based analyses such as the ABBA-BABA test or the TWISST method, the use of sequence alignments permits the inference of time calibrated phylogenies which then allow flexible tests of introgression based on the estimated divergence times between species. On the other hand, the identification of suitable alignment blocks across the genome can be difficult, for example when chromosome-length alignments were generated by read mapping to a distantly related reference species, a process that may result in a large proportion of missing data. One way to compensate for missing data is the use of particularly long alignment blocks for phylogenetic analyses, given that there is usually no shortage in overall sequence data in genomic studies. However, the use of long alignment blocks can also be problematic because the probability of within-block recombination, which violates the models of most phylogenetic approaches, increases with the length of the alignment block. To account for this, methods can be applied to identify recombination breakpoints and to filter the alignment blocks accordingly. The accuracy of these methods, however, so far remains questionable, and the consequences of undetected recombination have not been tested thoroughly. Despite these potential caveats, a growing number of phylogenomic studies uses time calibrated phylogenies based on long sequence alignments from aross the chromosome to detect introgression and the results of these studies indicate that the methodology works well in general.
- Outline
- Dataset
- Requirements
- Inferring recombination breakpoints with Saguaro
- Identifying alignment blocks for phylogenetic analysis
- Automating BEAST2 analyses
- Disentangling incomplete lineage sorting and introgression
- Simulating recombination with c-genie
In this tutorial I am going to demonstrate how time calibrated phylogenies from different regions of the genome can be used to infer past introgression events. The set of time calibrated phylogenies will be generated by running Bayesian phylogenetic inference with BEAST2 exclusively on the command line to allow a large number of analyses without manual intervention. The alignment blocks used for phylogenetic inference will be extracted from a chromosome-length alignment, and they will be filtered according to missing data and phylogenetic informativeness to identify the most suitable alignment blocks for phylogenetic analysis. Importantly, two different approaches, Saguaro and Phi Test, will be applied to identify recombination breakpoints in the chromosome-length alignment so that the impact of within-blocks recombination on phylogenetic inference can be minimized. Finally, simulations of genome evolution with recombination will be performed with the program c-genie to assess the probability of undetected within-block recombination.
The dataset used in this tutorial will be a chromosome-length alignment with phased sequences for the same 14 cichlid species that were already used in tutorials Species-Tree Inference with SNP Data and Analysis of Introgression with SNP Data. As in these other tutorials, only data homologous to chromosome 5 of the tilapia (Oreochromis niloticus) genome assembly (Conte et al. 2017) will be used to reduce the computational demand of the analyses. The chromosome-length alignment will be generated in the first part of this tutorial based on three different types of input: The VCF file NC_031969.f5.sub1.phased.masked.mod.vcf.gz
with phased SNP data produced in tutorial Analysis of Introgression with SNP Data, the reference sequence for the chromosome from the tilapia assembly, and a set of files in BED format that specify, for each sample, where on the chromosome SNPs could have been detected if there were any of them. Further information on the origin of the genomic dataset can be found in the instructions to tutorial Species-Tree Inference with SNP Data. Note, however, that unlike in that tutorial, only data from a single sample per species will be used here to limit the run time of the analyses. As in tutorial Divergence-Time Estimation with SNP Data, the sample selected per species is the one with the least amount of missing data. These samples are listed in the table shown below:
Sample ID | Species ID | Species name | Tribe |
---|---|---|---|
IZC5 | astbur | Astatotilapia burtoni | Haplochromini |
AUE7 | altfas | Altolamprologus fasciatus | Lamprologini |
JBD6 | telvit | Telmatochromis vittatus | Lamprologini |
JUH9 | neobri | Neolamprologus brichardi | Lamprologini |
LJC9 | neocan | Neolamprologus cancellatus | Lamprologini |
KHA7 | neochi | Neolamprologus chitamwebwai | Lamprologini |
IVE8 | neocra | Neolamprologus crassus | Lamprologini |
JWH2 | neogra | Neolamprologus gracilis | Lamprologini |
JWG9 | neohel | Neolamprologus helianthus | Lamprologini |
JWH4 | neomar | Neolamprologus marunguensis | Lamprologini |
JWH6 | neooli | Neolamprologus olivaceous | Lamprologini |
ISB3 | neopul | Neolamprologus pulcher | Lamprologini |
ISA8 | neosav | Neolamprologus savoryi | Lamprologini |
KFD2 | neowal | Neolamprologus walteri | Lamprologini |
-
Saguaro: Saguaro (Zamani et al. 2013) implements a hidden-Markov model for the detection of recombination breakpoints in chromosome-length alignments. Installation instructions can be found at Saguaro's webpage, and the download is available from the associated Sourceforge repository (click "Download Snapshot" to download). Note that Saguaro runs on Linux, and while in principle the installation should also be possible on a Mac, this does not seem to be easy and is not supported by the authors. The installation is not possible on Windows. If you have access to a Linux server with a Saguaro installation, but you would like to run the rest of the tutorial on your own machine, you can do so by transferring input and ouput files of Saguaro via scp between your machine and the Linux server. To check whether the Saguaro installation succeeded, just type
Saguaro
on the command line. If you should fail to install Saguaro, you can skip the tutorial part with the Saguaro analysis and continue with ready-made Saguaro result files afterwards. -
RAxML: If you followed tutorials Maximum-Likelihood Phylogenetic Inference and Maximum-Likelihood Species-Tree Inference, you should have RAxML (Stamatakis 2014) installed on your machine already. If not, you will find source code for Mac OS X and Linux, as well as precompiled executables for Windows, on RAxML's github page https://github.com/stamatak/standard-RAxML. See tutorial Maximum-Likelihood Phylogenetic Inference for more details on the installation of RAxML.
-
Phi Test: Phy Test (Bruen et al. 2006) implements a statistical test for the presence of recombination in a sequence alignment. Source code is available from David Bryant's software webpage. Note that in order to compile it on a Mac, you may have to replace
CXX = gcc
withCXX = clang
on the third line of the fileMakefile
(but try the compilation without this change first). -
BEAST2: Like RAxML, you likely have the BEAST2 package installed already if you followed other tutorials in this collection. If not, you can downloaded it from the BEAST2 website. As all these programs are written in Java, compilation is not required, and all programs should work on Mac OS X, Linux, and Windows. I recommend downloading the program versions that include the Java Runtime Environment, which may prevent conflicts with Java versions that may already be installed on your machine.
-
Libraries for Python 3.x: The Python libraries dendropy (Sukumaran and Holder 2010) and msprime (Kelleher et al. 2016) will be required for analyses of topology frequencies and for simulations of recombination, respectively. The two libraries can be installed with pip for Python 3, using the following commands:
python3 -m pip install --user msprime python3 -m pip install --user dendropy
The installations can be tested with these commands:
python3 -c 'import msprime' python3 -c 'import dendropy'
-
Libraries for R: Two R libraries will be required in this tutorial; these are ape (Paradis et al. 2004) and coda (Plummer et al. 2006). If you're already familiar with R, just install these packages in the usual way. If not, the easiest way to do so might be via the command line. Type
R
to open the R environment interactively. Then, run the following commands:install.packages("ape", repos="http://ftp.gwdg.de/pub/misc/cran/", dependencies=T) install.packages("coda", repos="http://ftp.gwdg.de/pub/misc/cran/", dependencies=T)
To ensure that both packages have been successfully installed, type these commands:
library(ape) library(coda)
If both commands result in no error messages, the two packages are ready to be used. Then, quit the R environment with
quit(save="no")
.
In this part of the tutorial, the software Saguaro (Zamani et al. 2013) will be used to detect breakpoints between chromosome regions that are characterized by different phylogenetic histories due to recombination. However, for computational reasons, Saguaro does not infer these phylogenetic histories directly. Instead, the analysis performed by Saguaro is based on what the authors call "cacti", sets of distance matrices that describe how different each pair of genomes is relative to all others. These cacti can to some extent be considered as proxies for phylogenetic histories, as the genetic difference between pairs of chromosomes is obviously linked to their phylogenetic relatedness. Assuming (naively) that Saguaro accurately detects all recombination breakpoints, we applied this method in Gante et al. (2016) to select chromosomal blocks of 500 kb that were not broken up by recombination, and we used these blocks for subsequent phylogenetic analysis.
However, it has been shown based on simulations (Martin and van Belleghem 2017) that Saguaro may not always be able to discriminate chromosomal regions characterized by different phylogenetic histories and that subtle changes caused by incomplete lineage sorting may remain undetected. If this was the case for the chromosomal blocks used in Gante et al. (2016) for phylogenetic analysis, each of the inferred trees may have been affected by the issues that potentially result from concatenation, including exaggerated node support (Kubatko and Degnan 2007) and overestimated divergence times (Ogilvie et al. 2017).
To avoid recombination in the alignments used for phylogenetic analysis in this tutorial, we are therefore going to use Saguaro only as a first attempt of identifying recombination breakpoints. The breakpoints identied by Saguaro will be taken into account when generating alignment blocks for phylogenetic analysis; however, each of these alignment blocks will further be filtered according to a second test for recombination with the software Phy Test (Bruen et al. 2006).
At the beginning of the analysis, Saguaro will calculate a single cactus for the entire alignment, and a score will be calcuated for each variable alignment position, describing the fit between this site and the first cactus. Based on these scores, genomic regions with a poor fit to the current cactus are identied with the hidden Markov model implemented in Saguaro, and a new cactus is defined for these. This process is repeated multiple times, thus further partitioning the alignment into segments, and at the same time assigning a cactus out of a growing set of cacti to each segment. Details of this procedure are described in (Zamani et al. 2013).
As in tutorial Species-Tree Inference with SNP Data and other tutorials, we are here going to focus on chromosome 5 of tilapia only to reduce the computational demand of all analyses.
-
To generate a chromosome-length sequence alignment for the 14 cichlid species included in the dataset, we'll need several different input files:
-
The VCF file
NC_031969.f5.sub1.phased.masked.mod.vcf.gz
with phased SNP variation generated in tutorial Analysis of Introgression with SNP Data. Either copy this file from the analysis directory of that tutorial if you followed it already, or download fileNC_031969.f5.sub1.phased.masked.mod.vcf.gz
by clicking on the link and copy it to the analysis directory for this tutorial. -
As the VCF file
NC_031969.f5.sub1.phased.masked.mod.vcf.gz
does not contain information about invariant sites but these are required for phylogenetic inference with BEAST2, we assume that the sites between those included in the VCF file are in fact unchanged between the Lamprologini species and the tilapia reference sequence. Thus, we'll use the reference sequence for chromosome 5 of tilapia to fill the gaps between SNPs included in the VCF. To download the sequence for chromosome 5 of tilapia from GenBank, you can use the following command:wget 'http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?tool=portal&sendto=on&log$=seqview&db=nuccore&dopt=fasta&val=NC_031969&extrafeat=0&maxplex=1' -O NC_031969.fasta
The downloaded sequence will in in Fasta format in a file named
NC_031969.fasta
. -
The assumption that sites between SNPs are invariant is only justified for sites in which SNPs could have been called at all if there had been any. This is not the case for all sites between the SNPs, due to low sequencing coverage in some regions or low mapping quality in repetitive regions. It would also have been impossible to call SNPs very close to indel variation, because these were filtered when the original VCF file
NC_031969.f5.sub1.vcf.gz
(see tutorial Species-Tree Inference with SNP Data) was prepared. Thus, we will take into account information about the chromosomal regions in which SNP variation could not have been called in the first place, and we will conservatively set these chromosomal regions to missing, coded by "N", in the generated alignment. The information about callable chromosomal regions is contained in a set of files in BED format; these files are within a compressed directory namedmasks.tgz
. Download this directory by clicking on the link and copy it to your analysis directory.
-
-
Unless the directory with the files in BED format has already been uncompressed automatically, do so with the following command:
tar -xzf masks.tgz
-
Have a look at the content of one of the files in BED format, for example using the following command:
less masks/AUE7.NC_031969.f5.merged.bed
You'll see that these files contain three columns, as shown in the text below:
NC_031969 0 45144 NC_031969 45169 88297 NC_031969 88299 88363 NC_031969 88365 88370 NC_031969 88372 98367 NC_031969 98368 98403 NC_031969 98405 98409 NC_031969 98411 98429 NC_031969 98431 100023 ...
The second and third of these columns indicate the beginning and the end of chromosomal regions that were masked because SNPs in these regions would not have been detected due to low coverage or other reasons. You'll see that these uncallable regions are quite long compared to the callable regions in between them. In the example shown above, only few callable sites were found within the first 100 kb of the chromosome. Overall, about two thirds of the chromosome are uncallable, which can be explained by a combination of several causes, including mapping to a distant reference, masking of repetitive regions, and rigorous filtering of sites close to insertions and deletions. However, even after conservatively setting all these regions to missing, the remaining sequence information will be more than sufficient for phylogenetic analysis.
-
The three different types of information (the SNP data, the reference sequence, and the information on uncallable regions) will be combined to generate chromosome-length sequences in Fasta format by the Ruby script
fill_seq.rb
. To reduce the memory requirement of this script, we are going to run it separately for each of the samples included in the dataset, and as input for the script we are going to use individual VCF files that contain SNP data only for the respective sample. Prepare these individual VCF files from the full VCF fileNC_031969.f5.sub1.phased.masked.mod.vcf.gz
, using the following command (if you copied the VCF file from the analysis directory of tutorial Analysis of Introgression with SNP Data, it might be uncompressed; if so, just remove the file extension.gz
in this command):for i in IZC5 AUE7 JBD6 JUH9 LJC9 KHA7 IVE8 JWH2 JWG9 JWH4 JWH6 ISB3 ISA8 KFD2 do bcftools view -s ${i} -o ${i}.NC_031969.f5.masked.mod.vcf NC_031969.f5.sub1.phased.masked.mod.vcf.gz done
The above command should have written 14 separate uncompressed VCF files. To make sure that these have been generated, you could use
ls *.NC_031969.f5.masked.mod.vcf
. -
We can now run the script
fill_seq.rb
to combine, for each sample, the SNP data with the reference sequence and the information on callable regions, using the following command (this will take about ten minutes to finish):for i in *.NC_031969.f5.masked.mod.vcf do sample_id=`basename ${i%.NC_031969.f5.masked.mod.vcf}` echo -n "Translating file ${i}..." ruby fill_seq.rb ${i} NC_031969.fasta masks/${sample_id}.NC_031969.f5.merged.bed ${sample_id}.NC_031969.f5.masked.mod.fasta echo " done." done
As you can see from the command above, the script
fill_seq.rb
expects four command-line arguments. These are, in this order,- the VCF file with SNP information,
- the reference sequence in Fasta format,
- the file in BED format containing information about callable regions,
- the name of an output file, which will be written in Fasta format.
The script expects that all input files are for a single chromosome only, so we would have to run it multiple times if we would use information from several chromosomes.
-
Have a look at one of the Fasta files generated by script
fill_seq.rb
, for example using the following command:less -S AUE7.NC_031969.f5.masked.mod.fasta
You'll see that it contains two sequences, one for each haplotype of the phased VCF file.
-
Next, combine all sequences into a single file in Fasta format named
NC_031969.f5.masked.mod.fasta
. Because all sequences are already aligned in the same way to the tilapia reference sequence, the combined file will already be perfectly aligned and no realignment with a tool like MAFFT (see tutorial Multiple Sequence Alignment) will be necessary. The combined Fasta fileNC_031969.f5.masked.mod.fasta
will have a file size of about 1 Gb, thus make sure that you have sufficient disk space left before executing the following command:cat *.NC_031969.f5.masked.mod.fasta > NC_031969.f5.masked.mod.fasta
-
You could now remove the per-sample Fasta files to save space, using the following command:
rm *.NC_031969.f5.masked.mod.fasta
-
As you've already seen, the Fasta file
NC_031969.f5.masked.mod.fasta
contains two sequences for each sample (and, given that we're here using a single sample per species, this means that it also contains two sequences per species). If we would use this dataset for analyses with Saguaro, these analyses might infer recombination breakpoints at which only the relationships between the two sequences per species change but not the relationships among species. While it would be a more conservative approach to also account for these recombination breakpoints, one could argue that their influence on the phylogenetic inference of the species tree is likely minimal. Therefore, and in order to shorten the run time of the Saguaro analysis, we are going to reduce the dataset for the Saguaro analysis to a single sequence per species, meaning that only recombination breakpoints affecting the relationships among species will be detected. To write the first sequence of each sample to a new file namedNC_031969.f5.masked.mod.A.fasta
, use the following command:cat NC_031969.f5.masked.mod.fasta | grep -A 1 "_A" | grep -v "^--$" > NC_031969.f5.masked.mod.A.fasta
-
The next command can only be executed if Saguaro has been installed. Thus, if you could not install Saguaro on your own computer but on a remote Linux server, you should now copy the file
NC_031969.f5.masked.mod.A.fasta
to the Linux server, with thescp
orrsync
command-line tools. -
To prepare the sequence alignment for analysis with Saguaro, it needs to be converted from Fasta format into the "Feature" format of Saguaro. This can be done with the tool Fasta2HMMFeature that is part of the Saguaro installation. To see the available options of Fasta2HMMFeature, just type the program name on the command line:
Fasta2HMMFeature
This will show the following text:
Fasta2HMMFeature: Converts multi-fasta data into a Saguaro-digestable file of features Available arguments: -i<string> : input fasta file (multiple alignment) -o<string> : binary output file -nosame<bool> : skip positions in which all calls are the same (def=0) -m<int> : minimum coverage (def=2) -d<int> : minimum disagreeing (def=2) -n<string> : chromosome name (def=mult)
With the options
-m
and-d
we can specify that Saguaro should only use positions with a minimum coverage (here, this means non-missing sequence information) or to use only those where at least two sequences are different from the others. The default values for both is 2, which means that only parsimony-informative sites are considered. -
Use the Fasta file
NC_031969.f5.masked.mod.A.fasta
as the input for the Saguaro analysis with option-i
and name the outputNC_031969.f5.masked.mod.A.feature
with option-o
. Also make sure to set the chromosome name toNC_031969.f5.masked.mod
(without the.A
at the end) with option-n
; this will be important when the Saguaro results are later used to generate alignment blocks. Thus, execute Fasta2HMMFeature with the following command:Fasta2HMMFeature -i NC_031969.f5.masked.mod.A.fasta -o NC_031969.f5.masked.mod.A.feature -n NC_031969.f5.masked.mod
-
Next, have a look at the available options for Saguaro, by typing the program name on the command line:
Saguaro
This should output the following text:
Saguaro: Smoothly, automatically and generically uncover the ancestry of related organisms. Available arguments: -f<string> : Feature vector (def=) -l<string> : Feature vector list file (def=) -o<string> : output directory -cycle<int> : iterations per cycle (def=2) -iter<int> : iterations with split (def=40) -t<double> : transition penalty (def=150) -resume<int> : resume w/ iteration # (def=0) -neurons<int> : number of neurons in the SOM (def=800)
You'll see that input can either be provided with option
-f
or option-l
. Of these,-f
followed by the name of a Feature file should be used when you have just a single such file (as in our case), and a list of file names can be provided with with-l
if you have multiple Feature files for different chromosomes. The options-cycle
,-iter
, and-neurons
determine settings for the hidden Markov model of Saguaro. The default values for these settings are relatively thorough and usually appropriate. To reduce the run time of Saguaro, we will here, however, use less cycles and iterations as well as a smaller number of neurons. Another parameter that might be worth modifying is the transition penalty, specified with option-t
. By reducing the transition penalty, the sensitivity of Saguaro to detect recombination breakpoints can be increased. This means that Saguaro would then be less likely to miss true recombination breakpoints but it might also lead to the detection of false positives. To run Saguaro withNC_031969.f5.masked.mod.A.feature
as the input file,saguaro_results
as the name of the output directory, 10 iterations with 1 cycle per iterations, and 100 neurons, use the following command:Saguaro -f NC_031969.f5.masked.mod.A.feature -o saguaro_results -iter 10 -cycle 1 -neurons 100
-
Once the Saguaro analysis has finished, results should have been written to directory
saguaro_results
. Have a look at the content of this directory. You'll see the following files:HMMTrain.out.0 HMMTrain.out.1 HMMTrain.out.2 ... LocalTrees.out saguaro.cactus saguaro.config saguaro.garbage saguaro.garbage.vec
Of these, only
LocalTrees.out
andsaguaro.cactus
are relevant for us. If you ran the Saguaro analysis on a Linux server because you could not install Saguaro on your own computer, you could now copy these files back to your own computer withscp
orrsync
. If you were unable to run the Saguaro analysis at all, you can download the ready-made result filesLocalTrees.out
andsaguaro.cactus
with the two links. -
Open file
saguaro.cactus
in a text editor, or from the command line with the following command:less saguaro.cactus
You should see more or less the following content:
cactus0 AUE7_A ISA8_A ISB3_A IVE8_A IZC5_A JBD6_A JUH9_A JWG9_A JWH2_A JWH4_A JWH6_A KFD2_A KHA7_A LJC9_A AUE7_A -0.000000 0.793221 0.872338 0.839333 0.524633 0.687894 0.843126 0.866756 0.819825 0.839036 0.877009 0.834673 0.834047 0.069426 ISA8_A 0.793221 -0.000000 0.304833 0.368500 0.444325 0.384680 0.335022 0.323023 0.322399 0.358284 0.331484 0.486985 0.495132 0.906134 ISB3_A 0.872338 0.304833 -0.000000 0.356743 0.493709 0.430859 0.232075 0.288845 0.309689 0.357865 0.222982 0.526403 0.524468 0.987994 IVE8_A 0.839333 0.368500 0.356743 -0.000000 0.478407 0.410530 0.337957 0.340104 0.314575 0.189210 0.340049 0.490898 0.504986 0.935056 IZC5_A 0.524633 0.444325 0.493709 0.478407 -0.000000 0.390289 0.503307 0.485551 0.459490 0.475014 0.496944 0.486223 0.500198 0.650715 JBD6_A 0.687894 0.384680 0.430859 0.410530 0.390289 -0.000000 0.427936 0.410087 0.411593 0.412935 0.439920 0.422403 0.432646 0.782539 JUH9_A 0.843126 0.335022 0.232075 0.337957 0.503307 0.427936 -0.000000 0.261218 0.301458 0.370035 0.287994 0.537055 0.533601 0.951846 JWG9_A 0.866756 0.323023 0.288845 0.340104 0.485551 0.410087 0.261218 -0.000000 0.299211 0.368540 0.292421 0.512795 0.524726 0.958697 JWH2_A 0.819825 0.322399 0.309689 0.314575 0.459490 0.411593 0.301458 0.299211 -0.000000 0.313155 0.343014 0.504644 0.509835 0.921273 JWH4_A 0.839036 0.358284 0.357865 0.189210 0.475014 0.412935 0.370035 0.368540 0.313155 -0.000000 0.360356 0.475263 0.494985 0.951263 JWH6_A 0.877009 0.331484 0.222982 0.340049 0.496944 0.439920 0.287994 0.292421 0.343014 0.360356 -0.000000 0.525858 0.537397 0.963240 KFD2_A 0.834673 0.486985 0.526403 0.490898 0.486223 0.422403 0.537055 0.512795 0.504644 0.475263 0.525858 -0.000000 0.079350 0.938385 KHA7_A 0.834047 0.495132 0.524468 0.504986 0.500198 0.432646 0.533601 0.524726 0.509835 0.494985 0.537397 0.079350 -0.000000 0.929866 LJC9_A 0.069426 0.906134 0.987994 0.935056 0.650715 0.782539 0.951846 0.958697 0.921273 0.951263 0.963240 0.938385 0.929866 -0.000000 cactus1 ...
This is the distance matrix for the first cactus (cactus0), followed by distance matrices for all other inferred cacti.
-
Next, open file
LocalTrees.out
in a text editor or on the command line with theless
command. You'll see something like this:Reading features... done! Reading models. Dynprog'ing... Setting up HMM... Adding word cactus0 as # 0 ...
Scroll down a bit to this part:
... Processed: 80000 (95.6034 %) REPORTING Traceback and Update cactus7 NC_031969.f5.masked.mod: 204750 - 3479153 length: 3274403 (frames 1-2704 l=2703) score=62.9241 AUE7_A ISA8_A ISB3_A IVE8_A IZC5_A JBD6_A JUH9_A JWG9_A JWH2_A JWH4_A JWH6_A KFD2_A KHA7_A LJC9_A AUE7_A -0.00 0.77 0.92 0.88 0.65 0.69 0.89 0.87 0.87 0.84 0.88 0.80 0.76 0.09 ISA8_A 0.77 -0.00 0.46 0.47 0.59 0.49 0.42 0.42 0.43 0.45 0.42 0.53 0.55 0.87 ISB3_A 0.92 0.46 -0.00 0.34 0.55 0.52 0.30 0.40 0.34 0.40 0.24 0.52 0.55 0.97 IVE8_A 0.88 0.47 0.34 -0.00 0.57 0.49 0.34 0.39 0.36 0.25 0.38 0.56 0.57 0.92 IZC5_A 0.65 0.59 0.55 0.57 -0.00 0.53 0.61 0.63 0.61 0.61 0.62 0.54 0.54 0.74 JBD6_A 0.69 0.49 0.52 0.49 0.53 -0.00 0.52 0.51 0.53 0.52 0.51 0.46 0.48 0.73 JUH9_A 0.89 0.42 0.30 0.34 0.61 0.52 -0.00 0.38 0.33 0.42 0.33 0.53 0.58 0.95 JWG9_A 0.87 0.42 0.40 0.39 0.63 0.51 0.38 -0.00 0.37 0.39 0.39 0.54 0.57 0.94 JWH2_A 0.87 0.43 0.34 0.36 0.61 0.53 0.33 0.37 -0.00 0.42 0.37 0.57 0.56 0.95 JWH4_A 0.84 0.45 0.40 0.25 0.61 0.52 0.42 0.39 0.42 -0.00 0.43 0.53 0.55 0.91 JWH6_A 0.88 0.42 0.24 0.38 0.62 0.51 0.33 0.39 0.37 0.43 -0.00 0.55 0.56 0.94 KFD2_A 0.80 0.53 0.52 0.56 0.54 0.46 0.53 0.54 0.57 0.53 0.55 -0.00 0.16 0.85 KHA7_A 0.76 0.55 0.55 0.57 0.54 0.48 0.58 0.57 0.56 0.55 0.56 0.16 -0.00 0.84 LJC9_A 0.09 0.87 0.97 0.92 0.74 0.73 0.95 0.94 0.95 0.91 0.94 0.85 0.84 -0.00 cactus4 NC_031969.f5.masked.mod: 3481523 - 3581323 length: 99800 (frames 2705-2808 l=103) score=1692.5 ...
These lines contain information on the first segment identified by Saguaro. It is assigned to cactus7, and is located between position 204750 and position 3479153 on the chromosome named "NC_031969.f5.masked.mod". In contrast to distance matrices given in file
saguaro.cactus
, the distance matrices given in this inLocalTrees.out
are calculated per segment, not per cactus. Nevertheless, the distance matrix of the first segment between position 204750 and position 3479153 is likely very similar to that of cactus7, otherwise, this segment would not have been assigned to this cactus.As we're not particularly interested in the distance matrices of segments, but more in the placement of segment boundaries so that we can select alignment regions for phylogenetic analyses that are not broken up by boundaries, the most imporant information for us is in the header lines for each segment.
-
To see only header information for each cactus, type this command:
cat LocalTrees.out | grep length
You should see output like this:
cactus7 NC_031969.f5.masked.mod: 204750 - 3479153 length: 3274403 (frames 1-2704 l=2703) score=62.9241 cactus4 NC_031969.f5.masked.mod: 3481523 - 3581323 length: 99800 (frames 2705-2808 l=103) score=1692.5 cactus7 NC_031969.f5.masked.mod: 3590084 - 6158718 length: 2568634 (frames 2809-8397 l=5588) score=104.99 cactus4 NC_031969.f5.masked.mod: 6158744 - 6166115 length: 7371 (frames 8398-8429 l=31) score=18425 cactus5 NC_031969.f5.masked.mod: 6166197 - 6167838 length: 1641 (frames 8430-8438 l=8) score=65580.2 cactus6 NC_031969.f5.masked.mod: 6167956 - 6182408 length: 14452 (frames 8439-8507 l=68) score=8637.73 cactus5 NC_031969.f5.masked.mod: 6182675 - 6188272 length: 5597 (frames 8508-8523 l=15) score=37313 ...
-
To find out how many segments Saguaro has identified, you could type the following command:
cat LocalTrees.out | grep length | wc -l
-
In order to visualize recombination breakpoints and the cacti assigned to the different chromosomal regions, you can use the Ruby script
paint_chromosomes.rb
. To find out how to run this script, you could type this command:ruby paint_chromosomes.rb
This should show the following help text:
paint_chromosomes.rb This script uses output from the software Saguaro to paint chromosomes according to Saguaro cacti. The output will be written in svg format. This script should be run e.g. with ruby paint_chromosomes.rb LocalTrees.out LocalTrees.svg where 'LocalTrees.out' should be replaced with the actual path to the Saguaro output file.
-
Now, run the script
paint_chromosomes.rb
with the following command:ruby paint_chromosomes.rb LocalTrees.out
This will generate a vector graphic file in SVG format that will be written to the same directory in which
LocalTrees.out
is placed, and it will be namedLocalTrees.svg
. -
Open the vector graphic file
LocalTrees.svg
with a program capable of reading files in SVG format, for example with a browser such as Firefox or with Adobe Illustrator. This plot should look as shown below:In this visualization, chromosomal regions assigned to the most common cactus are drawn in dark gray, and segments assigned to other cacti are shown in red, orange, cyan, and light green, purple (in decreasing frequency). With more than six different cacti, all remaining cacti are shown in light gray. You'll notice that the most frequent cactus (in dark gray) is found in the first part of the chromosome, that the last part of the chromosome is dominated by the third-most frequent cactus (in orange), and that the regions in the center of the chromosome are generally shorter than those at the chromosome ends. This observation disagrees with the pattern expected if recombination was more frequent in in the peripheries of the chromosome, which is commonly observed in animals and plants (Haenel et al. 2018). But as we will see below, the alignment contains a higher amount of missing data towards the chromosome ends, which is likely responsible for the lower number of detected recombination breakpoints those parts of the chromosome. However, keep in mind that these results may not be particularly accurate, as we've only performed a short Saguaro analysis.
After having inferred recombination breakpoints with Saguaro, we can now cut the breakpoint-free parts of the chromosome-length alignment into alignment blocks of a fixed length, and we can try to identify the most suitable of these blocks for subsequent phylogenetic analysis. Ideally, the blocks used for phylogenetic analysis should have as little missing data as possible, be as informative as possible, and show no signs of recombination that may have remained undetected in the Saguaro analysis. In this part of the tutorial, we are going to quantify these characteristics for all alignment blocks to select a set of blocks for phylogenetic analysis with BEAST2 in the next part of the tutorial.
As the length of each block, we here use 50 kb, assuming that this length is a good compromise between increasing probability of undetected recombination with longer blocks and decreasing phylogenetic signal with shorter blocks. For a more thorough analysis, however, it might be worth testing this assumption with blocks of different sizes.
-
To cut the breakpoint-free parts of the alignment into individual alignment blocks, we can use the Ruby script
generate_alignments.rb
. This script was specifically written to read output by Saguaro as well as the input file(s) used in the Saguaro analysis. It expects four command-line arguments; these are- the name of the
LocalTrees.out
output file of Saguaro, - the name of a directory in which the alignment files used for the Saguaro analysis are located,
- the name of a new directory to which all output files will be written,
- the length of the alignment blocks.
Run the script with the following command:
ruby generate_alignments.rb LocalTrees.out . alignment_blocks 50000
This will generate as many non-overlapping alignment blocks of 50 kb as possible without including the recombination breakpoints identified by Saguaro. With the settings used in the command above, the script will read file
LocalTrees.out
and will identify chromosome names based on the information in this file. In our case, it will identify "NC_031969.f5.masked.mod" as a chromosome name, and it will search for fileNC_031969.f5.masked.mod.fasta
in the directory that's specified as the second command line argument, which here simply is.
, the shortcut for the current directory. Output files in Nexus format will be written to a directory namedalignment_blocks
, which will be created inside the current directory. - the name of the
-
Have a look at the new directory named
alignment_blocks
. Note that the files in this directory are named according to the name of the linkage group and the first and the last position of the alignment block:NC_031969.f5.masked.mod_00666951_00716950.nex NC_031969.f5.masked.mod_01316951_01366950.nex NC_031969.f5.masked.mod_01366951_01416950.nex NC_031969.f5.masked.mod_01566951_01616950.nex NC_031969.f5.masked.mod_01716951_01766950.nex NC_031969.f5.masked.mod_01816951_01866950.nex NC_031969.f5.masked.mod_02066951_02116950.nex ...
-
Find out how many alignment blocks were written, using the following command:
ls alignment_blocks/*.nex | wc -l
-
To facilitate the phylogenetic analyses, it will help to simplify the alignment names with the following set of commands.
for i in alignment_blocks/*.nex do new_name=`echo ${i} | sed 's/\.f5\.masked\.mod//g'` mv ${i} ${new_name} done
-
The files in directory
alignment_blocks
should now be named like this:NC_031969_00666951_00716950.nex NC_031969_01316951_01366950.nex NC_031969_01366951_01416950.nex NC_031969_01566951_01616950.nex NC_031969_01716951_01766950.nex NC_031969_01816951_01866950.nex NC_031969_02066951_02116950.nex ...
-
Pick one of the files in this directory at random, and open it either in a text editor or in AliView just to get a feeling for the size of the alignment block, as well as for its sequence variation and the amount of missing data. You should see something like this:
Like the alignment shown above, many alignments contain a very large proportion of missing data due to the strict filtering based on coverage, mapping quality, and indel proximity. Some alignment may even consist exclusively of missing data.
-
We are going to use mean overall node support resulting from maximum-likelihood tree inference as one measure of phylogenetic informativeness. This inference will be done with the software RAxML, which you will be familiar with if you followed the tutorials Maximum-Likelihood Phylogenetic Inference or Maximum-Likelihood Species-Tree Inference. If you didn't, you may want to look up the instructions for these tutorials to learn more about RAxML analyses. However, to avoid RAxML analyses of alignment blocks without sequences data or with very little sequence data, we are first going to remove those alignment blocks with a proportion of missing data above 90%. We can do so with the following set of commands that uses the Ruby script
get_proportion_of_missing_data.rb
. The input and output for this script is quite simple: It the name of a file in Nexus format as input and reports nothing but the proportion of missing data on the screen. With the set of commands below, this proportion is assigned to the variable "p_missing" and if this variable is above 0.9, the Nexus-format alignment file is removed. Execute this set of commands:for i in alignment_blocks/*.nex do p_missing=`ruby get_proportion_of_missing_data.rb ${i}` if [[ ${p_missing} > 0.9 ]] then rm ${i} fi done
-
Now, count the number of alignment files in directory
alignment_blocks
once again with the following command:ls alignment_blocks/*.nex | wc -l
Question 1: How many alignments were removed due to a proportion of missing data greater than 90%? (see answer)
-
Next, we are going to run maximum-likelihood tree inference with the software RAxML for each alignment file. We are going to do so with the following set of commands that first of all uses the Python script
convert.py
to convert each alignment file from Nexus to Phylip format, as RAxML does not accept Nexus format. The RAxML analysis will be performed with the "GTRCAT" model of Stamatakis 2006; however, as were are now only interested in the mean node support and not in the inferred tree itself, the choice of substitution model is not going to have a strong influence on our results. By using the options-f a -x ${RANDOM} -N 100
, we specify that the maximum-likelihood analysis should be followed by bootstrapping to assess node support with 100 replicates. With the commands below, RAxML is set to use four CPUs with the option-T 4
to speed up the analysis, but if you have less CPUs available on your machine, you should change this number to e.g.-T 2
. Then, run this set of commands:for i in alignment_blocks/*.nex do # Get the alignment id. id=`basename ${i%.nex}` # Convert the alignment to Phylip format. python3 convert.py -f phylip ${i} tmp.phy # Run maximum-likelihood tree inference and assess node support by bootstrapping. raxml -T 4 -m GTRCAT -n ${id} -s tmp.phy --silent -p ${RANDOM} -f a -x ${RANDOM} -N 100 # Clean up RAxML output files. rm tmp.phy* rm RAxML_bipartitionsBranchLabels.* rm RAxML_bestTree.* rm RAxML_bootstrap.* mv RAxML_bipartitions.* alignment_blocks/${id}.tre mv RAxML_info.* alignment_blocks/${id}.info done
With the about 470 alignment blocks and bootstrapping to be performed for each of them, these RAxML analyses will take a rather long time, around 7 hours. Thus, unless you have the chance to run these analyses overnight, you'll probably rather want to continue the rest of this tutorial with the ready-made RAxML results from my analysis. You can find these in the compressed directory
alignment_blocks.tgz
, which you can download by clicking on the link. This compressed directory contains not only the RAxML output but also the corresponding alignment files. As these alignment files may not be identical to those resulting from your Saguaro analysis, you should replace youralignment_blocks
directory with the downloaded version of it or the RAxML tree files would not fit with your alignment files. To uncompress directoryalignment_blocks.tgz
, use this command (but your browser might have automatically uncompressed it after downloading):tar -xzf alignment_blocks.tgz
-
Now, have a quick look at the files in the
alignment_blocks
directory. You'll see that for each alignment file named, e.g.NC_031969_00666951_00716950.nex
, there is now also a tree file namedNC_031969_00666951_00716950.tre
, as well as a file with the extension.info
. The tree file contains nothing more than the tree inferred by RAxML, in Newick format and with bootstrap support values, and the file with the.info
extension contains the screen output of RAxML for this analysis. -
The alignment files and the tree files will now be used jointly to assess how well each alignment block is suited for phylogenetic analysis. For this, we are going to use several scripts at once to calculate the proportion of missing data and the number of parsimony-informative sites for each alignment, as well as the mean bootstrap support for the corresponding tree and the probability of undetected within-block recombination. The following scripts will be required for this; thus, make sure that they are all located in your analysis directory before you continue:
- The Ruby script
get_proportion_of_missing_data.rb
, which was used already before in this tutorial, will be used once again to calculate the proportion of missing data for each alignment block. - The Ruby script
get_number_of_pi_sites.rb
will be used to calculate the number of parsimony-informative sites for each alignment; these are sites where at least two sequences differ from all others so that they support a certain grouping of taxa. - The Python script
get_mean_node_support.py
will be used to calculate the mean bootstrap support for the trees corresponding to the alignments. - The Python script
convert.py
will once again be required to convert from one alignment-file format into another; in this case from Nexus to Fasta format.
In addition to the scripts listed above, we are going to also use the program Phy Test (Bruen et al. 2006) to test for recombination within each alignment. If it is correctly installed, you should be able to start the Phi Test program simply by typing
Phi
on the command line; this should output a help text with the available options. You'll see that the input in Fasta format can be specified with the-f
option. We can use default values for all other options. To perform a test run with the Phi Test program, use it to test for recombination in the first alignment file, with the following commands:python3 convert.py alignment_blocks/NC_031969_00666951_00716950.nex -f fasta > tmp.fasta Phi -f tmp.fasta
This should result in the following screen output:
Reading sequence file tmp.fasta Found 28 sequences of length 50000 Alignment looks like a valid DNA alignment. Estimated diversity is (pairwise deletion - ignoring missing/ambig): 0.8% Found 299 informative sites. Writing alignment of informative sites to: Phi.inf.sites Writing list of informative sites to: Phi.inf.list Calculating all pairwise incompatibilities... Done: 100.0% Using a window size of 100 with k as 1 Calculating analytical mean and variance **p-Value(s)** ---------- PHI (Normal): 2.75e-01
The most important part of this output is the p-value in the last line, indicating the probability of the null hypothesis that recombination is absent within the alignment. In this case, this p-value is 0.275, meaning that the absence of recombination can be assumed.
- The Ruby script
-
To extract only the p-value from the output of the Phi Test program, you could try running this command:
Phi -f tmp.fasta | tail -n 2 | head -n 1 | cut -d ":" -f 2 | tr -d '[[:space:]]'
You should once again see the number 2.75e-01, this time without the rest of the program output.
-
Remove the file
tmp.fasta
that was temporarily needed for the test run with the program Phi Test:rm tmp.fasta
-
To execute all the above-named scripts as well as the Phi Test program and generate a table with the different measurements for each alignment, we'll use the set of commands given below. You could copy all of these commands into a text file named, for example,
run.sh
, and the execute these commands withbash run.sh
. Alternatively, you could copy the block and paste it on the command line as you probably did for the shorter code blocks given above.echo -e "block_id\tp_missing\tn_pi_sites\tmean_bootstrap_support\tphi_p" > block_stats.txt for i in alignment_blocks/*.nex do # Get the block ID. block_id=`basename ${i%.nex}` # Provide feedback on screen. echo -n "Analyzing file ${i}..." # Get the proportion of missing data. proportion_of_missing_data=`ruby get_proportion_of_missing_data.rb $i` # Get the number of parsimony-informative sites. number_of_pi_sites=`ruby get_number_of_pi_sites.rb $i` # Get the mean bootstrap support. bootstrap_value=`python3 get_mean_node_support.py ${i%.nex}.tre` bootstrap_value_strip=`echo -n $bootstrap_value` # Run PhiPack to test for recombination and get the resulting p-value. mkdir ${block_id} python3 convert.py ${i} -f fasta > ${block_id}/tmp.fasta cd ${block_id} p_value=`Phi -f tmp.fasta | tail -n 2 | head -n 1 | cut -d ":" -f 2 | tr -d '[[:space:]]'` cd ../ rm -r ${block_id} # Print all output. echo -e -n ${block_id} >> block_stats.txt echo -e -n "\t" >> block_stats.txt printf "%.3f" ${proportion_of_missing_data} >> block_stats.txt echo -e -n "\t" >> block_stats.txt echo -n ${number_of_pi_sites} >> block_stats.txt echo -e -n "\t" >> block_stats.txt printf "%.1f" ${bootstrap_value_strip} >> block_stats.txt echo -e -n "\t" >> block_stats.txt echo -e -n ${p_value} >> block_stats.txt echo >> block_stats.txt # Provide feedback on screen. echo " done." done
-
Have a brief look at the file written by the above code, named
block_stats.txt
. You'll see that it contains five columns for- the ID of the alignment block ("block_id"),
- the proportion of missing data ("p_missing"),
- the number of parsimony-informative sites ("n_pi_sites"),
- the mean bootstrap support value ("mean_bootstrap_support")
- the p-value resulting from the test for recombination with Phi Test ("phi_p").
The first lines of this file are shown below:
block_id p_missing n_pi_sites mean_bootstrap_support phi_p NC_031969_00666951_00716950 0.852 298 59.5 2.75e-01 NC_031969_01316951_01366950 0.792 202 75.1 5.82e-01 NC_031969_01366951_01416950 0.859 122 65.1 5.30e-01 NC_031969_01566951_01616950 0.864 221 71.4 4.34e-01 NC_031969_01716951_01766950 0.871 157 65.0 2.56e-01 NC_031969_01816951_01866950 0.842 149 43.7 1.13e-03 ...
-
To identify the most suitable alignment blocks for phylogenetic analysis based on these criteria, first make some exploratory plots in the R environment to find possible correlations with chromosomal positions or with other parameters. We'll use R to generate these plots. If you're familiar with R, you could use the software R Studio or another GUI program for R, but you could also run R on the command line as shown below. To start R interactively on the command line, simply type
R
. Then, write the following commands to read fileblock_stats.txt
and to plot the proportion of missing data in an alignment against the chromosomal position of the center of the alignment:t <- read.table("block_stats.txt", header=T) get_third_as_num <- function(x, split="_"){ return(as.numeric(strsplit(x, split=split)[[1]][3])) } get_fourth_as_num <- function(x, split="_"){ return(as.numeric(strsplit(x, split=split)[[1]][4])) } froms <- unlist(lapply(as.character(t$block_id), get_third_as_num)) tos <- unlist(lapply(as.character(t$block_id), get_fourth_as_num)) block_centers <- (froms + tos)/2 pdf("alignment_blocks_missing.pdf", height=7, width=7) plot(block_centers, t$p_missing, xlab="Position", ylab="Proportion missing", main="NC_031969") abline(h=0.8) rect(-5000000, 0, 40000000, 0.8, col=rgb(0.164,0.629,0.594,alpha=0.25)) dev.off()
The above commands should have written the plot to a file named
As shown in the above plot, the proportion of missing data in the alignment is generally larger towards the ends of the chromosome, and there is also a weak increase in missing data in the very center of the chromosome, leading to a "W"-shaped pattern in the plot. This could possibly be explained by repetitive sequences that are more abundant in the terminal regions of the chromosome and perhaps by an effect caused by the centromere in the middle. We are not going to investigate these causes in more detail but patterns like this should be kept in mind when interpreting the results of analyses based on chromosome-length alignments. For example, the increased amount of missing data towards the chromosome ends could have caused the apparently longer distance between recombination breakpoints in the Saguaro plot shown at the beginning of this tutorial. The area highlighted in blue in the above plot indicates an amount of missing data below 80%; this value will be used for filtering the alignment blocks for further phylogenetic analysis.alignment_blocks_missing.pdf
in the analysis directory. This plot should look as shown below: -
In a second plot we are going to compare the amount of missing data with the number of parsimony-informative sites to test if more missing data actually translates to a lower phylogenetic informativeness. With the R environment still open, now type the following commands:
pdf("alignment_blocks_pi_sites.pdf", height=7, width=7) plot(t$p_missing, t$n_pi_sites, xlab="Proportion missing", ylab="Number PI sites", main="NC_031969") abline(h=250) abline(v=0.8) rect(0, 250, 0.8, 1000, col=rgb(0.164,0.629,0.594,alpha=0.25)) dev.off()
This should have written the plot shown below to a new file named
alignment_blocks_pi_sites.pdf
.As we can see from this plot, there is clearly a negative correlation between the amount of missing data and the phylogenetic informativeness. The blue area highlight alignment blocks with less than 80% missing data and more than 250 parsimony-informative sites; these values will be used for filtering below.
-
Finally, we are going to repeat the last plot with the mean bootstrap support instead of the number of parsimony-informative sites as an alternative measure of the phylogenetic informativeness. To do so, use the following code:
pdf("alignment_blocks_bootstrap.pdf", height=7, width=7) plot(t$p_missing, t$mean_bootstrap_support, xlab="Proportion missing", ylab="Mean Bootstrap support", main="NC_031969") abline(h=70) abline(v=0.8) rect(0, 70, 0.8, 1000, col=rgb(0.164,0.629,0.594,alpha=0.25)) dev.off()
The resulting plot should have been written to file
alignment_blocks_bootstrap.pdf
and is shown below:As expected, an obvious negative correlation exists between the proportion of missing data in the alignments and the mean node support of the corresponding phylogenies. This correlation is similar to that observed between missing data and parsimony-informative sites, but appears to be less strong. The area in blue now indicates alignment blocks that have less than 80% missing data and a mean node support above 0.7 in their RAxML phylogeny. These values will be used as thresholds in the identification of the most suitable alignments for phylogenetic inference with BEAST2.
-
To exit the interactive R environment and return to the command line, type this command:
quit(save="no")
-
To filter the alignment blocks according to the criteria calculated above, we can use the Ruby script
filter_blocks.rb
. This script expects the following five command-line arguments:- the name of the
block_stats.txt
file, - the maximum proportion of missing data,
- the minimum number of parsimony-informative sites,
- the minimum bootstrap node-support value,
- the minimum p-value for rejecting the null hypothesis of no recombination within the alignment.
- the name of the
Test a couple of parameter combinations for filtering with the script filter_blocks.rb
to see how these affect the number of alignment blocks remaining in the filtered dataset. For example, test the following combinations:
ruby filter_blocks.rb block_stats.txt 0.8 400 80 0.05 | wc -l
ruby filter_blocks.rb block_stats.txt 0.9 100 60 0.01 | wc -l
ruby filter_blocks.rb block_stats.txt 0.8 250 70 0.05 | wc -l
As mentioned above, we are goint to use 0.8, 250, and 70 as threshold values for the proportion of missing data, the number of parsimony-informative sites, and the mean node support. In addition we will use the standard alpha value of 0.05 as a threshold for the p-value for absence of recombination. Thus, the parameter combination for filtering corresponds to that used in the last of the three examples above. The number of alignment blocks passing this filter should be around 140.
-
Generate a reduced version of file
block_stats.txt
, containing only those alignments that pass the applied filters:ruby filter_blocks.rb block_stats.txt 0.8 250 70 0.05 > block_stats_filtered.txt
-
Next, make a new directory named
alignment_blocks_filtered
and copy those alignment that pass the filters into separate subdirectories of this new directory. This can be done with the following set of commands:for i in `cat block_stats_filtered.txt | tail -n +2 | cut -f 1` do mkdir -p alignment_blocks_filtered/${i} cp alignment_blocks/${i}.nex alignment_blocks_filtered/${i} done
The directory
alignment_blocks_filtered
should now contain about 140 subdirectories named according to alignment-block IDs (e.g.NC_031969_02066951_02116950
), and in each of these subdirectories should be just a single alignment file with the same ID (e.g.NC_031969_02066951_02116950.nex
).
While you may have run BEAST2 on the command line in some of the other tutorials (for example in tutorial Bayesian Species-Tree Inference), we always used the graphical user interface of BEAUti to generate the XML input files for BEAST2. However, to estimate time calibrated phylogenies for each of the ~140 alignment blocks, setting up all the XML files for these analyses individually with BEAUti would be far too time-consuming as well as error-prone; therefore, it would be convenient if this could also be done on the command line. With the complex settings required for most BEAST2 analyses, command-line tools to generate the BEAST2 XML files are not widely used; however, some such tools exist, including the BEASTmasteR R package by Nicholas Matzke and the babette R package by Bilderbeek and Etienne (2018). In this part of the tutorial we are going to use another command-line utility to produce XML input files for BEAST2, the Ruby script beauti.rb
. In addition, we are also going to use the R package coda (Plummer et al. 2006) as a command-line tool replacing Tracer to assess stationarity of the MCMC chain run by BEAST2.
-
Have a look at the options available for script
beauti.rb
, by typing the following command:ruby beauti.rb -h
You'll see that a wide range of analyses can be set up with this script, including analyses with StarBEAST2 (Ogilvie et al. 2017) and model comparison based on path sampling (Baele et al. 2012). For some of these settings, the XML files produced with
beauti.rb
may not work with the latest versions of the BEAST2 add-on packages, but the XML files for the standard analyses of BEAST2 do work as expected. Of the options available forbeauti.rb
, only few will need to be specified for our analyses. These are- an ID for the analyses, to be specified with
-id
, - the name of a directory with input files in Nexus format, specified with
-n
, - the name of an output directory to which the XML file will be written, specified with
-o
, - the name of a file with constraints in XML format, to be specified with option
-c
.
Read the help text of
beauti.rb
for these four options. - an ID for the analyses, to be specified with
-
The file with age constraints required by
beauti.rb
is somewhat equivalent to the file used to specify age constraints for scriptsnapp_prep.rb
in tutorial Divergence-Time Estimation with SNP Data; however, instead of the relatively simple format used for scriptsnapp_prep.rb
, the more complicated XML format is used. The code in XML format is exactly as it would be if it was part of a larger XML file written by the GUI version of BEAUti; thus, one way to generate code in this format would be to write an XML file with the GUI interface and remove all parts of that XML file except the part for the age constraint. Here, however, this is not necessary because you can use the ready-made XML code shown in the box below. Copy this code to a new file namedconstraints.xml
:<distribution id="All.prior" monophyletic="true" spec="beast.math.distributions.MRCAPrior" tree="@tree.t:Species"> <taxonset id="All" spec="TaxonSet"> <taxon idref="IZC5_A"/> <taxon idref="IZC5_B"/> <taxon idref="AUE7_A"/> <taxon idref="AUE7_B"/> <taxon idref="JBD6_A"/> <taxon idref="JBD6_B"/> <taxon idref="JUH9_A"/> <taxon idref="JUH9_B"/> <taxon idref="LJC9_A"/> <taxon idref="LJC9_B"/> <taxon idref="KHA7_A"/> <taxon idref="KHA7_B"/> <taxon idref="IVE8_A"/> <taxon idref="IVE8_B"/> <taxon idref="JWH2_A"/> <taxon idref="JWH2_B"/> <taxon idref="JWG9_A"/> <taxon idref="JWG9_B"/> <taxon idref="JWH4_A"/> <taxon idref="JWH4_B"/> <taxon idref="JWH6_A"/> <taxon idref="JWH6_B"/> <taxon idref="ISB3_A"/> <taxon idref="ISB3_B"/> <taxon idref="ISA8_A"/> <taxon idref="ISA8_B"/> <taxon idref="KFD2_A"/> <taxon idref="KFD2_B"/> </taxonset> <LogNormal meanInRealSpace="true" name="distr" offset="0.0"> <parameter estimate="false" name="M">6.2666</parameter> <parameter estimate="false" name="S">0.13</parameter> </LogNormal> </distribution> <distribution id="Outgroup.prior" monophyletic="true" spec="beast.math.distributions.MRCAPrior" tree="@tree.t:Species"> <taxonset id="Outgroup" spec="TaxonSet"> <taxon idref="IZC5_A"/> <taxon idref="IZC5_B"/> </taxonset> </distribution> <distribution id="Ingroup.prior" monophyletic="true" spec="beast.math.distributions.MRCAPrior" tree="@tree.t:Species"> <taxonset id="Ingroup" spec="TaxonSet"> <taxon idref="AUE7_A"/> <taxon idref="AUE7_B"/> <taxon idref="JBD6_A"/> <taxon idref="JBD6_B"/> <taxon idref="JUH9_A"/> <taxon idref="JUH9_B"/> <taxon idref="LJC9_A"/> <taxon idref="LJC9_B"/> <taxon idref="KHA7_A"/> <taxon idref="KHA7_B"/> <taxon idref="IVE8_A"/> <taxon idref="IVE8_B"/> <taxon idref="JWH2_A"/> <taxon idref="JWH2_B"/> <taxon idref="JWG9_A"/> <taxon idref="JWG9_B"/> <taxon idref="JWH4_A"/> <taxon idref="JWH4_B"/> <taxon idref="JWH6_A"/> <taxon idref="JWH6_B"/> <taxon idref="ISB3_A"/> <taxon idref="ISB3_B"/> <taxon idref="ISA8_A"/> <taxon idref="ISA8_B"/> <taxon idref="KFD2_A"/> <taxon idref="KFD2_B"/> </taxonset> </distribution>
As you recognize, the above code defines three different sets of taxa, all of which are constrained to be monophyletic. The three sets of taxa contain
- the IDs of all phased sequences included in the dataset,
- only the IDs of the phased sequences of the Astatotilapia burtoni sample "IZC5" ("IZC5_A" and "IZC5_B"), which will be used as an outgroup,
- the IDs of all phased sequences of all other species; these will be used as the ingroup.
The definition of the latter two taxon sets would not have been necessary as BEAST2 would likely have inferred an outgroup position for the Astatotilapia burtoni sequences anyway. However, defining these two taxon sets may help to reach MCMC stationarity slightly faster than without. The definition of the first taxon set including all sequences is also not necessary to constrain their monophyly, as this most inclusive clade is going to be monophyletic anyway. However, the specification of this taxon set also allows placing an age constraint on the age of the most recent common ancestor of the group, which in our case is the most practical way to time calibrate the phylogeny. As an age constraint for the divergence of Astatotilapia burtoni, the above code specifies the same prior density that was already used for time calibration in tutorial Divergence-Time Estimation with SNP Data according to the results of the analysis with the multi-species coalescent model in tutorial Bayesian Species-Tree Inference: a lognormally-distributed prior density with a mean (in real space) of 6.2666 Ma and a standard deviation of 0.13. This prior density was defined in this part of the above XML code:
<LogNormal meanInRealSpace="true" name="distr" offset="0.0"> <parameter estimate="false" name="M">6.2666</parameter> <parameter estimate="false" name="S">0.13</parameter> </LogNormal>
If you would use the script
beauti.rb
for other analyses and you would like to recycle the above XML code for the constraint definitions, you would have to change- the taxon labels (obviously),
- the age constraint,
- and probably also the IDs of the individuals constraints and taxon sets, which are labelled "All", "All.prior", "Outgroup", "Outgroup.prior", "Ingroup", and "Ingroup.prior" in the code given above.
-
In addition to the options
-id
,-n
,-o
, and-c
explained above, we are also going to specify an MCMC chain length of 500,000 iterations with option-l
(the default would be ten time longer), and we will specify that the HKY substitution model without among-site rate heterogeneity should be used with option-m
. We use both of these settings to reduce the run time of the BEAST2 analysis, but for a more thorough analysis, you should probably run a longer chain with a more complex substitution model (ideally the bModelTest model of Bouckaert and Drummond 2017; see tutorial Bayesian Phylogenetic Inference). Importantly, you should also run replicate analyses for each dataset to check whether both MCMC chains not only reach stationarity but also converge to the same posterior distribution.
To generate XML files for all alignment blocks with the settings and constraints described above, use the following set of commands:for i in alignment_blocks_filtered/* do run_id=`basename ${i}` ruby beauti.rb -id ${run_id} -n ${i} -o ${i} -c constraints.xml -l 500000 -m HKY done
-
XML input files for BEAST2 should now have been added to each subdirectory of directory
alignment_blocks_filtered
. To check if this is the case, use thels
command:ls alignment_blocks_filtered/*/*.xml
-
We are now ready to run BEAST2 analyses with all of the XML files written in the last step. To automatically run one analysis after another, we also use the command-line version of BEAST2 instead of the GUI version that was used in other tutorials. You will find the command-line version of BEAST2 in the
bin
subdirectory of the BEAST2 directory. Depending on the path of the BEAST2 directory on your computer, you will probably have to replace/Applications/Beast/2.5.0/
in the commands below with the actual path to your BEAST directory. Then, execute this set of commands:home=`pwd` for i in alignment_blocks_filtered/* do block_id=`basename ${i}` cd ${i} /Applications/Beast/2.5.0/bin/beast ${block_id}.xml cd ${home} done
These BEAST2 analyses should take about one to two hours. To shorten the run time and distribute the analyses to four different CPUs, you could open four different Terminal windows, and execute the above code in each window separately after replacing this line
for i in alignment_blocks_filtered/*
with always a different one of these four lines:
for i in alignment_blocks_filtered/NC_031969_0* for i in alignment_blocks_filtered/NC_031969_1* for i in alignment_blocks_filtered/NC_031969_2* for i in alignment_blocks_filtered/NC_031969_3*
This will cause all alignment blocks starting between position 1 and 9,999,999 to be analyzed in the first Terminal window, those starting between positions 10,000,000 and 19,999,999 to be analysed in the second Terminal window, aligment blocks starting between positions 20,000,000 and 29,999,999 will be analyzed in the third Terminal window, and those starting between position 30,000,000 and 34,628,617 (the end of the chromosome) will be analyzed at the same time in the fourth Terminal window. Assuming that alignment blocks are more or less evenly distributed across the chromosome, this would split the set of input files roughly into four equally-sized sets. There are certainly smarter ways to distribute analyses to different CPUs, but the simple parallelization that we're doing here should at least divide the overall run time by three.
-
While the BEAST2 analysis is running, monitor the screen output, particularly the the third column named "ESS(posterior)".
Question 2: Do most or all analyses reach ESS values above 200 for the posterior probability? (see answer)
-
Even though the screen output indicated stationarity of the MCMC chains in most analyses, this should be confirmed with a closer look at at least one of the log files written by BEAST2, and by an analysis of the ESS values of all model parameters in all other log files. However, if we would use the program Tracer for this assessment of stationarity as in tutorial Bayesian Phylogenetic Inference and other tutorials, we would have to manually open and investigate all ~140 log files. Therefore, we are once again going to use a command-line tool instead of GUI programs, so that the assessment of MCMC chain stationarity can be at least partly automated. An excellent tool that allows this assessment from the command line is the R package coda (Plummer et al. 2006), which is here going to be our tool of choice. To see how coda works, and to investigate the MCMC chain of one analysis in more detail, start the R environment on the command line by typing
R
, or by opening R Studio or a similar tool if you are familiar with these. Then, use the following set of commands, to load the coda library, read the log fileNC_031969_02066951_02116950.log
from directoryalignment_blocks_filtered/NC_031969_02066951_02116950
, and obtain a summary of the estimates for all model parameters:# Load the coda library. library(coda) # Read the log file as a table. t <- read.table("alignment_blocks_filtered/NC_031969_02066951_02116950/NC_031969_02066951_02116950.log", header=T) # Get the chain length as the number of rows in the table. chain_length = dim(t)[1] # Convert the table into a mcmc object. MCMCdata = mcmc(data=t, start=1, end=chain_length, thin=1) # Make a subset of the mcmc object, removing the burning and some parameter traces. MCMCsub <- as.mcmc(MCMCdata[(chain_length/10):chain_length, c(2, 3, 4, 8, 9, 10)]) # Show a summary of the parameter estimates. summary(MCMCsub)
Question 3: Can you identify, for example, the mean estimate for the mutation rate? (see answer)
-
Next, calculate the ESS values for all parameters with the following command:
# Calculate effective sample sizes for all parameters. effectiveSize(MCMCsub)
-
The smalles of all these ESS values can be obtained with this command:
# Find the lowest effective sample size. min(effectiveSize(MCMCsub))
Question 4: Based on the ESS values, would you say that the MCMC chain of this analysis has reached stationarity? (see answer)
-
To also assess stationarity of this MCMC chain visually, we can also generate trace plots of the individual parameters, similar to those produced by Tracer. Use the following set of commands to write trace plots in pdf format to a new file named
mcmc_traces.pdf
:# Make a pdf plot of the selected parameter traces. pdf("mcmc_traces.pdf", height=7, width=7) plot(MCMCsub, trace=TRUE, density=FALSE, smooth=TRUE, auto.layout=TRUE) dev.off()
These trace plots should look as in the figure below:
As you'll see from the above figure, all trace plots take the shape of the "hairy caterpillar" and thus confirm that the MCMC chain has reached stationarity.
-
Then, quit the R environment with this command:
quit(save="no")
-
To find the minimum ESS value in all log files, we'll write some of the R code used above to a new file, and we will then execute this script from the command line. Write the following code to a new file named
get_min_ess.r
:# Load the coda library. library(coda) # Get the command-line arguments. args <- commandArgs(trailingOnly = TRUE) log_file_name <- args[1] # Read the log file as a table. t <- read.table(log_file_name, header=T) # Get the chain length as the number of rows in the table. chain_length = dim(t)[1] # Convert the table into a mcmc object. MCMCdata = mcmc(data=t, start=1, end=chain_length, thin=1) # Make a subset of the mcmc object, removing the burning and some parameter traces. MCMCsub <- as.mcmc(MCMCdata[(chain_length/10):chain_length, c(2, 3, 4, 8, 9, 10)]) # Find the lowest effective sample size. cat(min(effectiveSize(MCMCsub)), "\n")
-
Test the R script
get_min_ess.r
with the log fileNC_031969_02066951_02116950.log
that was already used above:Rscript get_min_ess.r alignment_blocks_filtered/NC_031969_02066951_02116950/NC_031969_02066951_02116950.log
This should again result in the same mimimum ESS value, 674.6717, as before.
-
Then, use script
get_min_ess.r
to find the lowest ESS value of each log file, with the following set of commands:for i in alignment_blocks_filtered/*/*.log do echo -n -e "${i}\t" Rscript get_min_ess.r ${i} | tail -n 1 done
You should see that the minimum ESS value of only few analyses is below 200.
-
For a more thorough investigation, we could selectively resume those MCMC chains that had ESS values below 200. A simpler solution, however, that we are going to use here, is to completely remove those alignment blocks with particularly low ESS values. To remove the directories of all blocks with ESS values below 100, use the following set of commands:
for i in alignment_blocks_filtered/*/*.log do block_id=`basename ${i%.log}` min_ess=`Rscript get_min_ess.r ${i} | tail -n 1 | cut -d "." -f 1` if [[ ${min_ess} -lt 100 ]] then rm -r alignment_blocks_filtered/${block_id} echo "Deleted directory alignment_blocks_filtered/${block_id}." fi done
Question 5: How many alignment blocks remain in directory
alignment_blocks_filtered
? (see answer) -
For the analyses of introgression based on divergence times and tree topologies, we are going to account for phylogenetic uncertainty by using the posterior tree distributions resulting from BEAST2 analyses instead of only summary trees. For the purpose of visualizing the phylogenetic variation among all alignment blocks, however, we will now also generate maximum-clade-credibility summary trees for each alignment block. If you already followed tutorial Bayesian Species-Tree Inference or Bayesian Analysis of Species Networks, you may be familiar with the use of the command-line version of TreeAnnotator. Here, we will again use this command-line version to generate summary trees, with a burnin percentage of 10% and mean ages as node heights, as specified in the following set of commands (make sure to replace
/Applications/Beast/2.5.0
with the actual BEAST2 directory on your computer):for i in alignment_blocks_filtered/*/*.trees do /Applications/Beast/2.5.0/bin/treeannotator -burnin 10 -heights mean ${i} ${i%.trees}.tre done
-
Before plotting the summary trees, first have a look at how well supported each of these trees is. To do so, we can calculate once again the mean node support of each tree with the Python script
get_mean_node_support.py
:for i in alignment_blocks_filtered/*/*.tre do echo -ne "${i}\t" python3 get_mean_node_support.py ${i} done
You should see that most trees are extremely well supported, with mean Bayesian posterior probabilities above 0.9 in almost all cases. Question 6: How can we explain that these support values are so high whereas hardly any bootstrap support values in the RAxML analyses were above 90% (see the plots shown above in the tutorial part Identifying alignment blocks for phylogenetic analysis)? (see answer)
-
To plot all maximum-clade-credibility trees jointly, we'll use the program DensiTree that has already been used in tutorial Bayesian Analysis of Species Networks and other tutorials. But since DensiTree requires that all trees are included in a single input file, we'll first use the Python script
logcombiner.py
to combine all trees into one file, using the following commands:ls alignment_blocks_filtered/*/*.tre > mcc_trees.txt python3 logcombiner.py mcc_trees.txt mcc_trees.trees
-
Then, open file
mcc_trees.trees
in DensiTree. You should see a plot of the maximum-clade-credibility trees for each alignment block, as shown in this figure:As you can see from this DensiTree plot, different regions of the chromosome have apparently diverged at different times and also in different orders, which could be due to either incomplete lineage sorting or introgression. We are going to attempt to disentangle these two processes in the next part of the tutorial.
Both incomplete lineage sorting and introgression lead to variation among the phylogenies inferred from different parts of the genome. However, the patterns of variation produced by the two processes differ in several ways, which therefore allow to discriminate between them with a large number of local phylogenies. For example, the relationships among three taxa would be assumed to be "symmetric" in the absence of introgression, meaning that the species-tree topology should be the one that is most frequently observed and that the two possible alternative rooted topologies should have similar frequencies. In contrast, introgression should lead to asymmetry in the frequencies of the three possible topologies, meaning that one of the two alternative topologies should have a higher frequency than the other one (the logic behind these assumptions about symmetric and asymmetric topology frequencies is similar to that of the ABBA-BABA test; see tutorial Analysis of Introgression with SNP Data)
In addition to different patterns of topology frequencies, incomplete lineage sorting and introgression also have a different influence on the divergence times of local phylogenies. With incomplete lineage sorting alone, the divergence times of certain regions of the genome will always be older than the divergence times of the species tree. In contrast, introgression can only occur after species have already diverged and thus leads to divergence times in certain regions of the genome that are younger than the divergence times in the species tree.
Based on the expected decrease in divergence times, we developed a method to identify putative introgression events from sets of time-calibrated phylogenies in Meyer et al. (2017; Fig. 1). This method measures the mean divergence time between two species in sets of time-calibrated local phylogenies, and the phylogenetic uncertainty can be taken into account by using the posterior tree distributions instead of summary trees. When pairwise divergence times are measured in this way for a set of three species, say "A", "B", and "C", then three different estimates are obtained; the divergence of "A" and "B", the divergence of "A" and "C", and the divergence of "B" and "C". With incomplete lineage sorting alone, the two older of these estimates are expected to be similar. Thus, if "A" and "B" are sisters in the species tree, then the mean divergence time of "A" and "C" should be similar to that of "B" and "C". If on the other hand, introgression occurred from "C" to "A", then some regions of the genome of these species should have a more recent divergence time than the same regions in species "B" and "C", and the mean divergence time between "A" and "C" should also be reduced compared to "B" and "C". In the method of Meyer et al. (2017), the absolute reduction between the mean divergence time of "B" and "C" and the mean divergence time of "A" and "C" is calculated for all possible ways in which the species included in the dataset can be assigned to the positions of "A", "B", and "C". Thus, a three-dimensional matrix is calculated where each cell of the matrix contains this absolute age reduction for a particular combination of three species. The three-dimensional matrix is then reduced to a two-dimensional matrix by keeping only the maximum of all cells that have the same position in the first and second dimension but differ in their position in third dimension. This means that for a certain combination of species "A" and "C", all other species of the dataset are considered as a possible species "B", and only the maximum absolute age reduction obtained with any species "B" is recorded. The two-dimensional matrix then has on the x-axis the species used as species "A", the species used as "C" are listed on the y-axis, and the values in the matrix cells are the maximum age reductions observed with any of the other species as "B". Finally, the matrix is plotted as a heatmap, where the darker areas indicate a greater age reduction and thus possible introgression from species "C" into species "A".
In this part of the tutorial, we are going to apply the method of Meyer et al. (2017) to the set of time-calibrated phylogenies obtained for the 14 cichlid species to identify putative past introgression events. We will then assess the support for these events in more detail based on asymmetry of topology frequencies.
-
As a first step, to reduce the computational demand of the analyses, we are going to reduce the posterior tree distributions for each alignment block by sampling 100 out of the 2,000 trees stored in the tree log files. This can be done with the Python script
logcombiner.py
, using the following command:for i in alignment_blocks_filtered/*/*.trees do python3 logcombiner.py -b 10 -n 100 --remove-comments ${i} ${i%.trees}.100.trees done
For each of the original tree log files in directory
alignment_blocks_filtered
, there should now be a second file ending in.100.trees
. -
Then, we use the R script
get_mrca_table.r
to write a text file for each tree file, with separate tables for the the node ages of the 100 trees included in the tree file. All these text files will be written to a new directory namednode_ages
. To make this directory and run th R scriptget_mrca_table.r
, use the following set of commands:mkdir node_ages for i in alignment_blocks_filtered/*/*.100.trees do block_id=`basename ${i%.100.trees}` echo -n "Reading node ages from file ${block_id}.100.trees..." Rscript get_mrca_table.r ${i} > node_ages/${block_id}.txt echo " done." echo "Wrote file node_ages/${block_id}.txt." done
These commands might take a few minutes to finish.
-
Have a look at the content of the new directory named
node_ages
, for example using thels
command:ls node_ages
-
Also have a look at the format of these text files by opening one of them in a text editor, or by using the
less -S
command:less -S node_ages/NC_031969_02066951_02116950.txt
You'll see that for each of the 100 trees in the corresponding tree file, this text file actually contains two tables, the first of which is labelled "#ancestors" and the second is labelled "node_ages". The "#ancestors" specifies the node ID for the most recent common ancestor of each pair of taxa, and the second table contains the node ages for these common ancestors.
-
To summarize the information from all files in directory
node_ages
and all tables in each of these files into one single table with only the pairwise mean node ages, we can use the Ruby scriptsummarize_mrca_tables.rb
. This script expects two arguments:- the name of a directory with text files written by script
get_mrca_table.r
, - the name of an output file to which the summary table will be written.
Thus, use the following command to read analyze all files in directory
node_ages
and write the resulting matrix to a new file namemean_node_ages.txt
:ruby summarize_mrca_tables.rb node_ages mean_node_ages.txt
- the name of a directory with text files written by script
-
Have a quick look at the pairwise mean divergence times recorded in file
mean_node_ages.txt
. -
Next, we can visualize the pairwise mean divergence times in the form of a heatmap by typing the following set of
R
commands:R table <- read.table("mean_node_ages.txt") matrix <- as.matrix(table) col_palette <- colorRampPalette(c("#848e79","#6b857a","#537b7c","#3b727d","#2a667b","#255771","#204768","#1b385f","#162955","#111a4c","#0d133d","#0a0e2d"),space="rgb")(n = 25) col_breaks=seq(0,max(matrix),length=26) pdf("mean_node_ages.pdf", height=7, width=7) heatmap(matrix, Rowv=T, symm=T, scale="none", col=col_palette, breaks=col_breaks) dev.off() quit(save="no")
The above commands should have written a new file named
In this plot, the darkest cells represent the oldest pairwise mean divergence time (6.64 Ma), which is found (unsurprisingly) between the *Astatotilapia burtoni* sequences "IZC5_A" and "IZC5_B" and all other sequences. The cells in lighter colors indicate more recent divergence times, which are found mainly within four clusters of sequences. The dendrograms shown at the top and on the left were computed with a simple clustering algorithm by the R function `heatmap()`. These are only included for orientation and should not be seen as actual phylogenies.mean_node_ages.pdf
. This file should contain a heatmap plot as shown below: -
As you will have noticed, the file
mean_node_ages.txt
as well as the heatmap generated from it in the last step use the individual sequences as units rather than the species. In order to investigate species-level introgression, it would, however, be more convenient to average over the different sequences of each species. To do so, we can use the Ruby scriptshrink_matrix.rb
. This script expects the following three arguments:- the name of an input text file containing a matrix,
- the name of a file assigning the units used in the matrix (e.g. sequence or sample IDs) to larger units (e.g. species IDs),
- the name of a new output file to which the reduced matrix will be written.
We therefore first need to write a file assigning sequence IDs to species IDs. To do so, copy the below text to a new file, and name this new file
species.txt
:IZC5_A astbur IZC5_B astbur AUE7_A altfas AUE7_B altfas JBD6_A telvit JBD6_B telvit JUH9_A neobri JUH9_B neobri LJC9_A neocan LJC9_B neocan KHA7_A neochi KHA7_B neochi IVE8_A neocra IVE8_B neocra JWH2_A neogra JWH2_B neogra JWG9_A neohel JWG9_B neohel JWH4_A neomar JWH4_B neomar JWH6_A neooli JWH6_B neooli ISB3_A neopul ISB3_B neopul ISA8_A neosav ISA8_B neosav KFD2_A neowal KFD2_B neowal
-
Then, we can apply the script
shrink_matrix.rb
with the input filesmean_node_ages.txt
andspecies.txt
, and naming the output filemean_node_ages_species.txt
:ruby shrink_matrix.rb mean_node_ages.txt species.txt mean_node_ages_species.txt
-
For comparison with the first heatmap, we can generate another heatmap for the pairwise mean divergence times averaged per species comparison, using the following set of
R
commands:R table <- read.table("mean_node_ages_species.txt") matrix <- as.matrix(table) col_palette <- colorRampPalette(c("#848e79","#6b857a","#537b7c","#3b727d","#2a667b","#255771","#204768","#1b385f","#162955","#111a4c","#0d133d","#0a0e2d"),space="rgb")(n = 25) col_breaks=seq(0,max(matrix),length=26) pdf("mean_node_ages_species.pdf", height=7, width=7) heatmap(matrix, Rowv=T, symm=T, scale="none", col=col_palette, breaks=col_breaks) dev.off() quit(save="no")
These commands should have written the heatmap plot to the new file
mean_node_ages_species.pdf
. The heatmap should look as shown in the figure below:You may notice that one of the cells on the diagonal is much darker than all others. This cell represents the mean divergence time between two sequences that are both assigned to the species Neolamprologus cancellatus ("neocan"), indicating that the within-species genetic variation in Neolamprologus cancellatus is much greater than in other species.
-
Have a look at the text file from which the last heatmap was generated, the file
mean_node_ages_species.txt
.Question 7: What is the age estimate for the divergence of the sequences within Neolamprologus cancellatus ("neocan")? (see answer)
-
As you certainly noticed, the heatmaps generated above are not the type of heatmap used by Meyer et al. (2017; Fig. 1) to identify introgression events, instead they rather represent its raw material as they visualize pairwise mean divergence times instead of the age reductions in species trios that were used by Meyer et al. (2017; Fig. 1). The maximum age reductions in species trios, as described above, can, however, easily be calculated from the pairwise mean divergence times, using the Ruby script
get_age_reduction.rb
. This script expects two arguments; these are- only the name of an input file with a matrix of divergence times,
- the name of a new file to which the matrix with maximum age reductions will be written.
Thus, to use file
mean_node_ages_species.txt
as input and write the output to a new file namedage_reductions.txt
, run the scriptget_age_reduction.rb
with the following command:ruby get_age_reduction.rb mean_node_ages_species.txt age_reductions.txt
-
Have a look at the content of file
age_reductions.txt
by opening it in a text editor or with theless
command:less age_reductions.txt
The content of this file should look more or less like this:
altfas astbur neobri neocan neochi neocra neogra neohel neomar neooli neopul neosav neowal telvit altfas 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 astbur 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 neobri 0.0 0.0 0.0 0.0 0.0 0.065 0.0975 0.05 0.13 0.0475 0.0 0.1625 0.0 0.0 neocan 0.0 0.0 0.685 0.0 0.72 0.685 0.685 0.69 0.685 0.685 0.685 0.69 0.72 1.455 neochi 0.02 0.0 0.0 0.055 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.08 neocra 0.0 0.0 0.145 0.0 0.01 0.0 0.0775 0.135 0.0 0.0675 0.1275 0.055 0.01 0.0 neogra 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0825 0.01 0.0 0.0 0.0 0.0 neohel 0.0 0.0 0.0 0.005 0.01 0.01 0.0375 0.0 0.085 0.0 0.0 0.165 0.01 0.01 neomar 0.0 0.0 0.0 0.0 0.01 0.0 0.0 0.0 0.0 0.0 0.0 0.02 0.01 0.0 neooli 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0575 0.0 0.0 0.13 0.0 0.0 neopul 0.0 0.0 0.24 0.0 0.0 0.035 0.0375 0.095 0.1175 0.0 0.0 0.185 0.0 0.0 neosav 0.0 0.0 0.0 0.005 0.03 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.03 0.02 neowal 0.02 0.0 0.0 0.055 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.08 telvit 0.06 0.0 0.685 0.0 0.71 0.685 0.685 0.69 0.685 0.685 0.685 0.7 0.71 0.0
As you'll see, most of the cells in this matrix are zero, meaning regardless of which species is used as the third species, the older two of the three mean divergence times are identical. For example consider the two species Astatotilapia burtoni ("astbur") and Altolamprologus fasciatus ("altfas"), for which a maximum age reduction of zero (thus, no age reduction at all) was inferred. This means that regardless of which species is used as the third species ("B"), the mean divergence time between Astatotilapia burtoni ("C") and Altolamprologus fasciatus ("A") is identical to the mean divergence time between Astatotilapia burtoni ("C") and that third species. Thus, no signs of introgression are found between the two species Astatotilapia burtoni and Altolamprologus fasciatus (which does not mean that introgression can be excluded though).
-
For a better overview of the information in file
age_reductions.txt
, also generate a heatmap for it with the following commands:R table <- read.table("age_reductions.txt") matrix <- as.matrix(table) max(matrix) col_palette <- colorRampPalette(c("#848e79","#6b857a","#537b7c","#3b727d","#2a667b","#255771","#204768","#1b385f","#162955","#111a4c","#0d133d","#0a0e2d"),space="rgb")(n = 25) col_breaks=seq(0,max(matrix),length=26) pdf("age_reductions.pdf", height=7, width=7) heatmap(matrix, Rowv=NA, Colv=NA, symm=F, scale="none", col=col_palette, breaks=col_breaks, xlab="Taxon C (donor)", ylab="Taxon A (recipient)", margins = c(6, 6)) dev.off() quit(save="no")
The above commands should have written the heatmap plot to a new file named
age_reductions.pdf
. The plot should look as shown in the figure below:As you can see, the heatmap is dominated by two dark bars that span most of the comparisons involving the species Neolamprologus cancellatus ("neocan") or Telmatochromis vittatus ("telvit") in the position of species "A". The darkest cell of the heatmap, however, is found for Telmatochromis vittatus ("telvit") in position of the donor species "C" and Neolamprologus cancellatus ("neocan") as the recipient of introgression, species "A". The maximum age reduction for this comparison is 1.455 million years, as we can see from file
age_reductions.txt
.Question 8: Using the information on pairwise mean divergence times in file
mean_node_ages_species.txt
, can you figure out which third species ("B"), together with Telmatochromis vittatus ("telvit") as species "C" and Neolamprologus cancellatus ("neocan") as species "A" is included in the species trio for which this maximum age reduction of 1.455 million years was found? (see answer) -
Even though the strongest signal of introgression is thus found between Telmatochromis vittatus ("telvit") and Neolamprologus cancellatus ("neocan"), the two dark bars in the heatmap seems to additionally support introgression from most Neolamprologus species into both Telmatochromis vittatus ("telvit") and Neolamprologus cancellatus ("neocan"). However, a comparison with the pairwise mean divergence times in file
mean_node_ages_species.txt
reveals that this is an artifact resulting from the introgression between Telmatochromis vittatus ("telvit") and Neolamprologus cancellatus ("neocan"): For example, in the species trio Telmatochromis vittatus ("telvit"; as species "A"), Neolamprologus cancellatus ("neocan"; as species "B"), and Neolamprologus walteri ("neowal"; as species "C") (see the second cell in the top right of the heatmap), the mean divergence time between species the two Neolamprologus species ("C" and "B") is 3.78 Ma, whereas it is 3.07 and 3.005 Ma between these two species and Telmatochromis vittatus ("telvit"; species "A"). This could be explained by a sister-group relationship between Telmatochromis vittatus ("telvit") and Neolamprologus cancellatus ("neocan") in combination with introgression from Neolamprologus walteri ("neowal") into Telmatochromis vittatus ("telvit"); however, given that an even stronger signal of supports introgression from a more diverged species (Altolamprologus fasciatus; "altfas") into Neolamprologus cancellatus ("neocan"), it is likely that the mean divergence times for Neolamprologus cancellatus ("neocan") in these comparisons are the result of its Altolamprologus fasciatus ancestry.This shows how the use of maximum age reductions in species trios can be misleading if one of these species is influenced by introgression from outside the trio. However, a good way to account for this potentially misleading influence is to iteratively remove those species with the strongest signals of introgression. To do so, we can run the Ruby script
get_age_reduction.rb
once more, this time with the species ID of Neolamprologus cancellatus, "neocan", as an optional third argument. The script will then calculate the maximum age reduction in all other trios while ignoring Neolamprologus cancellatus ini all comparisons (also as species "B"). Thus, use the following command to recalculate maximum age reductions without Neolamprologus cancellatus ("neocan"):ruby get_age_reduction.rb mean_node_ages_species.txt age_reductions_sub1.txt neocan
The new matrix without Neolamprologus cancellatus ("neocan") should then be written to a file named
age_reductions_sub1.txt
-
Make a new heatmap plot for the matrix without Neolamprologus cancellatus ("neocan"), using the following set of
R
commands:R table <- read.table("age_reductions_sub1.txt") matrix <- as.matrix(table) max(matrix) col_palette <- colorRampPalette(c("#848e79","#6b857a","#537b7c","#3b727d","#2a667b","#255771","#204768","#1b385f","#162955","#111a4c","#0d133d","#0a0e2d"),space="rgb")(n = 25) col_breaks=seq(0,max(matrix),length=26) pdf("age_reductions_sub1.pdf", height=7, width=7) heatmap(matrix, Rowv=NA, Colv=NA, symm=F, scale="none", col=col_palette, breaks=col_breaks, xlab="Taxon C (donor)", ylab="Taxon A (recipient)", margins = c(6, 6)) dev.off() quit(save="no")
The new heatmap plot should then be written to file
age_reductions_sub1.pdf
and it should look as shown in the figure below:As you can see, the dark bar in the top row for Telmatochromis vittatus ("telvit") in the position of the recipient has now completely disappeared, further suggesting that its presence in the previous heatmap was an artifact caused by introgression into Neolamprologus cancellatus ("neocan"). Other cells in the heatmap now appear darker; this, however is mostly due to adjusting the color scheme to the new maximal value.
The darkest cell in the heatmap is now the one for the comparison of Neolamprologus brichardi ("neobri") as the donor species "C" and Neolamprologus pulcher ("neopul") as the recipient species "A", with a maximum age reduction of 0.24 million years according to the information in file
age_reductions_sub1.txt
. This observation agrees with the resuls of Gante et al. (2016), where we also inferred introgression from Neolamprologus brichardi ("neobri") into Neolamprologus pulcher ("neopul"). -
To see how removing Neolamprologus brichardi ("neobri"), in addition to Neolamprologus cancellatus ("neocan"), influences the signals of introgression among the other species, we'll run the script
get_age_reduction.rb
once more, this time with "neocan,neopul" as the third argument (the script will recognize that multiple species IDs are specified):ruby get_age_reduction.rb mean_node_ages_species.txt age_reductions_sub2.txt neocan,neopul
The script should now have written another matrix of maximum age reductions in species trios, to a new file named
age_reductions_sub2.txt
. -
Use the new file
age_reductions_sub2.txt
for a third heatmap plot of maximum age reductions, this time without both Neolamprologus brichardi ("neobri") and Neolamprologus cancellatus ("neocan"):R table <- read.table("age_reductions_sub2.txt") matrix <- as.matrix(table) max(matrix) col_palette <- colorRampPalette(c("#848e79","#6b857a","#537b7c","#3b727d","#2a667b","#255771","#204768","#1b385f","#162955","#111a4c","#0d133d","#0a0e2d"),space="rgb")(n = 25) col_breaks=seq(0,max(matrix),length=26) pdf("age_reductions_sub2.pdf", height=7, width=7) heatmap(matrix, Rowv=NA, Colv=NA, symm=F, scale="none", col=col_palette, breaks=col_breaks, xlab="Taxon C (donor)", ylab="Taxon A (recipient)", margins = c(6, 6)) dev.off() quit(save="no")
The new plot should be written to file
age_reductions_sub2.pdf
, and it should look as shown below:A comparison of this heatmap with the previous one shows that removing Neolamprologus brichardi ("neobri") did not affect the pattern for other comparisons strongly. The largest value now found in the matrix in file
age_reductions_sub2.txt
corresponds to a maximum age reduction of 0.165 million years in the comparison with Neolamprologus savoryi ("neosav") as the donor species "C" and Neolamprologus helianthus ("neohel") as the recipient species "A"; however, whether this value still supports introgression or whether it results from stochasticity in the phylogenetic inference may be questioned.In any case, because the approach of Meyer et al. (2017) based on mean divergence times in species trios does not represent a formal test of introgression, the results obtained with this approach should generally be seen as indicating putative introgression events rather than providing strong support for these. This approach is thus very useful to build hypotheses of where in the phylogeny introgression events could have occurred so that these hypotheses can be further investigated with, e.g., the ABBA-BABA test, or with approaches for species-network reconstruction such as the Species Network model for BEAST2 (see tutorial Analysis of Introgression with SNP Data) or similar models implemented in the programs PhyloNet (Than et al. 2008; Wen et al. 2016) or SNAQ (Solís-Lemus and Ané 2016).
-
Additional evidence for putative introgression events can also be obtained from patterns of symmetry or assymetry in the topologies of species trios. To further investigate for example the putative introgression event from Neolamprologus brichardi ("neobri") into Neolamprologus pulcher ("neopul") we could test if this introgression event is also supported by asymmetry in the topologies of the two species plus a third species that is phylogenetically closer to the recipient species than the presumed donor. Based on the species-tree analysis with SNAPP in tutorial Divergence-Time Estimation with SNP Data, a suitable third species for this comparison would be Neolamprologus olivaceous ("neooli"), which appeared as the sister to Neolamprologus pulcher ("neopul") in the resulting species tree.
Frequencies of tree topologies can be extracted from sets of tree files in a convenient way with the Python script
get_topologies.py
. Have a look at the help text of this script by typing the following command:python3 get_topologies.py -h
As you can see from this help text, the script uses as input one or more tree files specified with option
-t
as well as one or more species IDs specified with option-s
. Instead of species IDs, the name of a file assigning sequence or sample IDs to species IDs can also be specified with the-s
option; in that case the frequencies of species topologies will be calculated by weighing the topologies of the individual sequences. -
Write the following text assigning sample IDs to species IDs for the three species Neolamprologus brichardi ("neobri"), Neolamprologus olivaceous ("neooli"), and Neolamprologus pulcher ("neopul") to a new file named
species_trio1.txt
:JUH9_A neobri JUH9_B neobri JWH6_A neooli JWH6_B neooli ISB3_A neopul ISB3_B neopul
-
Then, execute the script
get_topologies.py
to count the frequencies of the three different possible topologies for the species trio Neolamprologus brichardi ("neobri"), Neolamprologus olivaceous ("neooli"), and Neolamprologus pulcher ("neopul") in the set of reduced tree files in directoryalignment_blocks_filtered
:python3 get_topologies.py -t alignment_blocks_filtered/*/*.100.trees -s species_trio1.txt
Question 9: Are the topology frequencies asymmetric, supporting introgression between Neolamprologus brichardi and Neolamprologus pulcher? (see answer)
-
Repeat the same with another species trio to test another putative introgression event indicated by the heatmap in file
age_reductions_sub2.pdf
. When selecting a third species for the comparison, try to use one that is phylogenetically closer to the putative recipient of introgression than it is to the donor, based on the species tree inferred with SNAPP in tutorial Divergence-Time Estimation with SNP Data. This species tree is shown in the figure below:For this new text, you will need to write file similar to
species_trio1.txt
but with different species and sequence IDs. This new file could be namedspecies_trio2.txt
.Question 10: Did you identify further examples of asymmetry in the topology frequencies? (see answer)
As we have seen in this part of the tutorial, both divergence times in local phylogenies as well as topology frequencies support several instances of introgression among the cichlid species included in our dataset. In the extreme case of Neolamprologus cancellatus, the hypothesis of introgression from Telmatochromis vittatus was strongly supported by the results obtained in tutorial Analysis of Introgression with SNP Data, which even suggested that the Neolamprologus cancellatus samples are first-generation hybrids between Telmatochromis vittatus and Altolamprologus fasciatus. However, some of the other introgression events suggested here by patterns of mean divergence times and and topology frequencies have not yet been explicitly tested (with e.g. the ABBA-BABA test) and might need to be further investigated. Considering that we only used data from a single chromosome and just a single sample per species, more definite conclusions could certainly be gained from more extensive analyses!
One assumption underlying all phylogenetic analyses in this tutorial is that the alignment blocks used for tree inference are free of recombination. If this assumption should be violated, the consequences for the accuracy of the phylogenetic analyses as as well as all analyses of introgression based on these phylogenies would be difficult to predict. Given that we performed two different tests for recombination, with Saguaro and Phi Test, and that the alignment blocks used for phylogenetic analyses passed both tests, one could probably argue that we minimized the possible effect of recombination sufficiently so that our conclusions based on the inferred phylogenies may well be reliable. However, how certain can we be that recombination is really absent from the alignments? And how would our inference be influenced if it were not?
These questions are so far rather poorly answered, even though they are actively debated in the recent literature (e.g. Springer and Gatesy 2016; Edwards et al. 2016). One way to approach this question is with simulations: We could simulate how probable it would be to actually have not even a single recombination breakpoint in an alignment of a certain length, and simulated sequences could also be used to test the reliability of phylogenetic inference under these conditions. In one such study, Lanier and Knowles (2012) concluded that within-alignment recombination had no noticeable effect on species-tree reconstruction; however, the simulations used in this study were for a rather small set of species, and it is not clear if the results would hold for larger phylogenies. Furthermore, the possible effects of within-alignment recombination on phylogenomic tests for introgression have so far not been tested.
To address at least the question of how probable the absence of recombination is in alignment blocks of 50 kb, we can use the Python program c-genie (Malinsky and Matschiner; unpublished). This program simulates phylogenetic histories with recombination and calculates the average length of "c-genes" ("coalescent genes"; Doyle 1995), genomic regions uninterrupted by recombination. In addition, c-genie also calculates the lengths of "single-topology tracts", fragments sharing the same tree topology even though recombination within these fragments might have led to variable branch lengths in different parts of the fragment. The length of these single-topology tracts might be more relevant for phylogenetic analysis, because one could argue that only those recombination breakpoints that change the tree topology within an aligment have negative consequence for phylogenetic inference while those breakpoints that only change the branch lengths are safe to ignore.
As the probability of recombination depends on many factors, including the recombination rate, the generation time, the length of branches in the species tree, and the population size, c-genie requires estimates for these parameters as input. Conveniently, we can use the species tree as well as the population size estimate inferred by SNAPP in tutorial Divergence-Time Estimation with SNP Data for the simulations with c-genie. In addition, we'll assume (as in other tutorials) again a generation time of 3 years for cichlid fishes as well as a recombination rate of 2×10-8 per generation.
-
Start by download the c-genie program from its github repository, using the following command:
wget https://raw.githubusercontent.com/mmatschiner/c-genie/master/c-genie
-
Make c-genie executable with this command:
chmod +x c-genie
-
Then, have a look at the help text of c-genie with this command:
./c-genie -h
You'll see that besides calculating the lengths of c-genes and single-topology tracts, c-genie also allows the simulation of alignments with recombination, which can then be used to test the accuracy of phylogenetic methods in the presence of within-alignment recombination. Here, however, we will use c-genie only to find out how plausible the assumed absence of recombination in our 50 kb-alignment blocks really is.
-
You may also have noticed from c-genie's help text that a generation time of 3 years and a recombination rate of 2×10-8 are already the default values for these two parameters, thus we only need to specify the species tree and the population size estimate from the SNAPP analysis in tutorial Divergence-Time Estimation with SNP Data. Thus, copy the species tree named
snapp.tre
from the analysis directory of the other tutorial if you followed it, or download the tree file by clicking on the link. -
Now, run c-genie with file
snapp.tre
as the species tree, "lamprologini" as the prefix for the file, and an assumed population size of 100,000, according to the estimate from the SNAPP analysis:./c-genie snapp.tre lamprologini -n 100000
This anaysis should finish within a few minutes.
-
As you will see from the screen output of c-genie, an output file in HTML format has been written to file
lamprologini.html
. Open this HTML file in a web browser such as Firefox and scroll through the plots included in this file.Question 11: What are the mean lengths of c-genes and single-topology tracts? What is the probability that an alignment of 50 kb includes just one c-gene or one single-topology tract? (see answer)
-
Repeat the simulations with different assumptions for the population size to see how these assumptions inluence the resulting lengths of c-genes and single-topology tracts.
There are some reasons why the lengths of c-genes and single-topology tracts may not be as short in practice as the results of c-genie may suggest. For example, undetected past population bottlenecks would reduce the amount of incomplete lineage sorting, which could extend the length of both types of fragments. In addition, the presence of "recombination deserts" (Yu et al. 2001) could lead to some very large c-genes and single-topology tracts. Nevertheless, it is important to be aware of the possible presence of recombination within the alignments used for phylogenetic inference. While the study of Lanier and Knowles (2012) as well as preliminary analyses with c-genie suggest that within-alignment recombination may in fact not have a strong influence on the inference of the species-tree topology, the consequences for divergence-time estimates and downstream analyses of introgression are less clear and should be investigated in future studies.
- Question 1: The filtering of alignment blocks based on the proportion of missing data should have removed about 200 alignment files from directory
alignment_blocks
. In my analysis, there were 672 alignment files before filtering and 477 files remained after removing those that had more than 90% missing data.
- Question 2: Almost all BEAST2 analyses should reach rather high ESS values for the posterior probability, usually between 300 and 1,000. This is one indication that the MCMC chains of these analyses have reached stationarity, but it does not tell us if all model parameters have similarly high ESS values.
- Question 3: With the command that we used,
summary()
, displays two tables. The first of these contains the mean and standard deviation for the estimates of all model parameters, and the second table shows the 2.5%, 25%, 50% (the median), 75%, and 97.5% quantiles. The mutation-rate parameter is named "mutationRate.s.NC_031969_02066951_02116950" and its mean estimate should be around 7.5e-04 (7.547e-04 in my analysis).
- Question 4: Given that all ESS values are above 600, the MCMC chain of this analysis appears to have reached a high degree of stationarity. Nevertheless, for a more thorough investigation at least on replicate analysis of the same XML file should be performed to test if multiple chains also converge to the same posterior distribution.
- Question 5: After removing alignment blocks with low ESS values, about 130 blocks should be left in directory
alignment_blocks_filtered
(131 in my analysis).
- Question 6: Good question! The difference between Bayesian posterior probabilities and Bootstrap support values could on the one hand indicate that the HKY model used for BEAST2 analyses is too constrained and that summary trees would be less strongly supported if we had used a more flexible substitution model instead. Alternatively, this difference could be due to the different ways in which Bayesian posterior probabilities and Boostrap support values are calculated. Recall that bootstrapping is based on randomly drawing (with replacement) alignment sites from the original alignment to fill a new alignment, followed by phylogenetic analysis with this new alignment. With the large proportion of missing data in all alignment blocks (up to 80%), this means that many of the alignments generated by bootstrapping have an even higher proportion of missing data. As a result, many of the 100 bootstrap phylogenies used to assess node support will have a weaker phylogenetic signal, which could reduce the inferred node support. In contrast, sites that are completely missing are simply ignored in analyses with BEAST2.
- Question 7: The age estimate for the divergence of the two Neolamprologus cancellatus sequences ("LJC9_A" and "LJC9_B") is over 2 Ma (2.23 Ma in my analysis). This is remarkable given that the two sequences were sampled from one and the same individual fish ("LJC9").
- Question 8: This third species is Altolamprologus fasciatus ("altfas"): The mean divergence time between Telmatochromis vittatus ("telvit"; species "C") and Altolamprologus fasciatus ("altfas", species "B") is 4.46 Ma, whereas the mean divergence times between Neolamprologus cancellatus ("neocan"; species "A") and both of these species are 3.005 Ma and 2.6 Ma, respectively. Thus, the second-oldest mean divergence time in this trio (3.005 Ma) is 1.455 million years younger than the oldest divergence time (4.46 Ma). This confirms the results obtained in tutorial Analysis of Introgression with SNP Data according to which Neolamprologus cancellatus is formed by hybridization between Altolamprologus fasciatus and Telmatochromis vittatus.
-
Question 9: The topology frequencies for the three species Neolamprologus brichardi ("neobri"), Neolamprologus olivaceous ("neooli"), and Neolamprologus pulcher ("neopul") are in fact very asymmetric according to the output of the script
get_topologies.py
shown below:(neobri,(neooli,neopul)): 61.2825 (neooli,(neobri,neopul)): 58.9650 (neopul,(neobri,neooli)): 10.7525
This shows that about 60 of the 131 alignment blocks (the numbers are weighed to account phylogenetic uncertainty) support a sister-group relationship between Neolamprologus olivaceous ("neooli") and Neolamprologus pulcher ("neopul"), and about the same number also supports a sister group relationship between Neolamprologus brichardi ("neobri") and Neolamprologus pulcher ("neopul"). In contrast, only about ten alignment blocks support a sister-group relationship between Neolamprologus brichardi ("neobri") and Neolamprologus olivaceous ("neooli"). Given that with incomplete lineage sorting alone, the topology of the true species tree would be expected to be most frequent while the other two topologies should have similar frequencies, the pattern observed here can not be explained with incomplete lineage sorting. Therefore, this pattern supports introgression between Neolamprologus brichardi ("neobri") and Neolamprologus pulcher ("neopul"), corroborating the results based on mean divergence times.
-
Question 10: You might have found further examples of asymmetry in topology frequencies. One such example is the species trio Neolamprologus savoryi ("neosav"), Neolamprologus helianthus ("neohel"), and Neolamprologus gracilis ("neogra"), the topology frequencies of which are shown below:
(neosav,(neogra,neohel)): 65.4775 (neogra,(neohel,neosav)): 48.8475 (neohel,(neogra,neosav)): 16.6750
- Question 11: Unfortunately, the mean lengths of c-genes and single-topology tracts are extremely short: between 13 and 26 bp. Consequently, the probability that a 50 kb alignment contains just one c-gene or single-topology tract is 0. Instead, alignments of this length contain on average around 3680 c-genes and about 1850 single-topology tracts. This means that if our assumptions for the species tree, the recombination rate, and the population size were correct, all the phylogenies inferred for block alignments in fact averaged over a large number of different topologies.