Table of Contents
This project focuses on the study of three extinct feline species that once inhabited the American continent: Smilodon (saber-toothed tiger), Homotherium (scimitar-toothed tiger), and Miracinonyx trumani (American cheetah). These species became extinct approximately 13,000 years ago, at the end of the last glacial period. Ancient DNA sequences of the cytochrome b protein from these species have been successfully sequenced, enabling the investigation of their evolutionary relationships with contemporary feline species.
The objective is to reconstruct the phylogenetic tree of these feline species using a method based on calculating evolutionary distances between DNA sequences. The dataset includes sequences from domestic cats, lions, leopards, tigers, pumas, cheetahs, and wild cats from Africa, China, and Europe. Additionally, sequences from non-feline species are present in the dataset.
# Clone the repository to your local machine
git clone https://github.com/Jaffar-Hussein/PhyloExplora.git
# Change directory to the cloned repository
cd PhyloExplora
# Compile the project using the makefile
make
# Run all sections of the project
./phylo all
# Run a specific section of the project (replace {section name} with the name of the section, e.g., nj)
#./phylo {section name}
Here's a preview of what you can expect:
The method you will use to calculate the phylogenetic tree requires calculating the evolutionary distance between the sequences. Before you can calculate them, you first need to align the sequences considering three types of mutations:
- Substitutions (one nucleotide is replaced by another)
- Insertions
- Deletions
For example, the sequences "ACTCCTGA" and "ATCTCGTGA" have several possible alignments (there are others):
-ACTCCTGA ATCTCGTGA A-CTCCTGA ATCTCGTGA AC-TCCTGA ATCTCGTGA
It is also possible that a deletion is not a mutation, but actually due to poor sequencing or DNA degradation as is often the case for sequencing DNA from extinct species.
The "-" denotes a gap, which is a "blank" in the alignment that was caused by an insertion or a deletion. These two types of mutations are grouped under the term indel.
These alignments correspond to a multitude of different phylogenetic histories. To select the best alignment, we therefore need to introduce the maximum parsimony hypothesis, which favors the phylogenetic history that involves the fewest assumptions and therefore, the fewest evolutionary changes. For example, among the three alignments above, we will prefer alignment 2 because it corresponds to the scenario with the fewest mutations:
- Alignment 1 implies at least 1 indel and 2 substitutions
- Alignment 2 implies at least 1 indel and 1 substitution
The Needleman-Wunsch algorithm is a dynamic programming algorithm used for sequence alignment in bioinformatics. It finds the optimal global alignment between two sequences. The algorithm operates in three main steps:
-
Initialization of a score matrix (M) and a traceback matrix (T) of size (n+1) x (m+1), where n and m are the lengths of the two sequences to be aligned.
-
Filling of the matrices M and T using the following formula:
where s is a function that calculates the similarity score between two nucleotides. The T matrix is filled with directions ('d' for diagonal, 'l' for left, 'u' for up) based on which value was used to calculate M[i,j].
- Traceback from the final element of the T matrix, T[n,m], to the start ('o'), following the directions. This gives the optimal global alignment between the two sequences.
The final element of the M matrix, M[n,m], gives the score of the alignment.
To construct phylogenetic trees, we need a set of aligned sequences (multiple alignment). Therefore, we cannot use the Needleman-Wunsch algorithm as it only allows for pairwise alignment (between two sequences). As a result, the "cat_dna_aligne.fasta" file contains the sequences already aligned.
In the case of very close sequences (which is the case here), we estimate that the real evolutionary distance between the sequences is close to the p-distance
, which is simply the number of substitutions in the alignment over the total number of nucleotides. To simplify, we ignore positions aligned to gaps.
We then apply the Jukes-Cantor correction to take into account the phenomenon of saturation (the same site can mutate several times over time). Its formula is:
where d
is the Jukes-Cantor distance and p-distance
is the proportion of differing nucleotide sites.
UPGMA is a simple hierarchical clustering method used to construct phylogenetic trees. The method operates on a distance matrix, which provides the distances between each pair of sequences. The steps of the UPGMA method are as follows:
- Start with a forest of trees, each consisting of a single leaf for each sequence.
- Find the minimum distance in the matrix, join the two corresponding nodes into a single node, and update the distance matrix.
- Repeat step 2 until only a single node remains.
The distance between two clusters is calculated as the average distance between all pairs of sequences in the two clusters. This is where the "unweighted" and "arithmetic mean" parts of the name come from.
The result of the UPGMA method is a rooted tree, which means that it assumes a constant rate of evolution (molecular clock hypothesis).
Here's an example of how the UPGMA method works:
Initial distance matrix: A B C D A - 5 4 7 B 5 - 7 10 C 4 7 - 7 D 7 10 7 - Step 1: Join A and C (smallest distance), update distance matrix: AC B D AC - 6 5.5 B 6 - 10 D 5.5 10 - Step 2: Join AC and D, update distance matrix: ACD B ACD - 8 B 8 - Step 3: Join ACD and B: ACD-B ACD-B -
The final tree is (((A,C),D),B).
Neighbor Joining is another algorithm for calculating a phylogenetic tree from a distance matrix. It has the advantage of making fewer assumptions than UPGMA on the data (they are not necessarily ultrametric) and therefore gives better trees in almost all cases.
The steps of the Neighbor Joining method are as follows:
- For each pair of taxa, calculate the sum of the distances to all other taxa.
- Calculate the matrix of "neighbor-joining distances" based on the original distances and the sums calculated in step 1.
- Find the pair of taxa with the smallest neighbor-joining distance and join them into a single node, updating the distance matrix.
- Repeat steps 1-3 until only a single node remains.
Note: To simplify life when building our tree, we ignore the distance between the branches. We are only interested in the links of kinship.
You can find an example of applying this algorithm here.
The following resources were used in the creation of this project:
-
Project Description by George Marchment: The project description provided by the professor heavily influenced the structure and content of this README. The project description can be found here.
-
Wikipedia: For the description of the algorithm and examples
-
Youtube: For learning more about binary trees and Linked Lists.
-
AI: For advice on creation of ASCII art in C .
-
README Inspiration: Github Awesome README's awesome readme
This project would not have been possible without the contributions of the following individuals:
George Marchment - For providing the project and guidance in its development.
Made with ❤️ by :
And lots of ☕!