diff --git a/2024/alignment.md b/2024/alignment.md new file mode 100644 index 0000000..797699f --- /dev/null +++ b/2024/alignment.md @@ -0,0 +1,1109 @@ +![](https://i.imgur.com/2dwPOHb.png) + +# Sequence alignment + +In the previous lecture we have seen the principle behind dynamic programming. This approach is extremely useful for comparing biological sequences, which is coincidentally one of the main points of this course. This lecture explain how this is done. In writing this text I heavily relied on wonderful [course](http://www.langmead-lab.org/teaching-materials/) taught by Ben Langmead at Johns Hopkins. The cover image shows pairwise alignments for human, mouse, and dog *KIF3* locus from [Dubchak et al. 2000](http://genome.cshlp.org/content/10/9/1304.long). + +## How different are two sequences? + +Suppose you have two sequences of the same length: + + +``` +A C A T G C C T A +A C T G C C T A C +``` + +How different are they? In other words, how many bases should be change to turn one sequence onto the other: + + +``` +A C A T G C C T A +| | * * * | * * * +A C T G C C T A C +``` + +the vertical lines above indicate positions that are identical between the two sequences, while asterisks show differences. It will take six substitutions to turn one sequence into the other. This number -- six substitutions -- is called [*Hamming distance*](https://en.wikipedia.org/wiki/Hamming_distance) or the *minimal* number of substitutions required to turn one string into another. The code below computes the Hamming distance. Try it. You can change `seq1` and `seq2` into whatever you want except that they should have the same length. + + + + + +Now in addition to *substitutions* (i.e., changing one character into another) let's allow **insertions** and **deletions**. This will essentially allow us to insert dashes (gaps) between characters: + +``` +A C A T G C C T A - +| | * | | | | | | * +A C - T G C C T A C +``` + +In this case the [**edit distance**](https://en.wikipedia.org/wiki/Edit_distance) between two sequences is 2. It is defined as the minimum number of operations (substitutions, insertions, and deletions) requited to turn one string into another. The compared strings do not have to be of the same length to be able to compute the edit distance as we can compensate for length differences using deletions and insertions. While the situation above (where we inserted two dashes) is biologically much more meaningful (and realistic), it is much more difficult to find. + +# Generalizing the problem + +Before we can develop an algorithm that will help us to compute the edit distance let's develop a general framework that would allow us to think about the problem in exact terms. Let's look at a pair of VERY long sequences. So long, that we do not even see the left end -- it disappears into $\infty$: + +$$ + \color{red}{\texttt{.....A C T G C C T A}}\texttt{ G}\\ + \color{red}{\texttt{.....A C T G C C T A}}\texttt{ C}\\ +$$ + +the red parts of the two sequences represent **prefixes** for the last nucleotides shown in black. Let's assume that the edit distance between the two prefixes is known (don't ask how we know, we just do). For simplicity let's "compact" the prefix of the first sequence into $\alpha$ and the prefix of the second sequence into $\beta$: + + +$$ + \alpha \texttt{G}\\ + \beta \texttt{C} +$$ + + +again, the edit distance between $\alpha$ and $\beta$ is known to us. The three possibilities for computing the edit distance between $\alpha G$ and $\beta C$ are: + + + +$$Edit\ Distance(\alpha\texttt{G},\beta\texttt{C}) = min\begin{cases} + Edit\ Distance(\alpha,\beta) + 1 \leftarrow\ because\ they\ do\ not\ match& \\ + Edit\ Distance(\alpha\texttt{G},\beta) + 1 \leftarrow\ because\ we\ are\ adding\ a\ gap& \\ + Edit\ Distance(\alpha,\beta\texttt{C}) + 1 \leftarrow\ because\ we\ are\ adding\ a\ gap + \end{cases}$$ + + + +but we not always have $\texttt{G}$ and $\texttt{C}$ as two last nucleotiodes, so for the general case: + + + +$$Edit\ Distance(\alpha\texttt{x},\beta\texttt{y}) = min\begin{cases} + Edit\ Distance(\alpha,\beta) + \delta(x,y) & \\ + Edit\ Distance(\alpha\texttt{x},\beta) + 1 & \\ + Edit\ Distance(\alpha,\beta\texttt{y}) + 1 + \end{cases}$$ + + +where $\delta(x,y) = 0$ if $x = y$ (nucleotides match) and $\delta(x,y) = 1$ if $x \neq y$ (nucleotides do not match). The $\delta(x,y)$ has a particular meaning. If the two nucleotides at the end are equal to each other, there is nothing to be done -- no substitution operation is necessary. If a substitution is necessary however, we will record this by adding 1. When we will be talking about global and local alignment below the power of $\delta(x,y)$ will become even more clear. + +We can write an algorithm that would exhaustively evaluate the above expression for all possible suffixes. This algorithm is below. Try executing it. It will take roughly ~30 seconds to find the edit distance between the two sequences used in the beginning of this lecture: + + + +The take-home-message here is that it takes a very long time to compute the edit distance between two sequences that are only **nine** nucleotides long! Why is this happening? Figure 1 below shows a small subset of situations the algorithm is evaluating for two very short strings $\texttt{TAG}$ and $\texttt{TAC}$: + +> ![](http://www.bx.psu.edu/~anton/bioinf-images/editDist.png) +> +> **Figure 1** | A fraction of situations evaluated by the naïve algorithm for computing the edit distance. Just like in the case of the change problem discussed in the previous lecture a lot of time is wasted on computing distances between suffixes that has been seen more than once (shown in red). + +To understand the magnitude of this problem let's look at slightly modified version of the previous Python code below. All we do here is keeping track how many times a particular pair of suffixes (in this case $\texttt{AC}$ and $\texttt{AC}$) are seen by the program. The number is staggering: 48,639. So this algorithm is **extremely** wasteful. + + + +While this approach to the edit distance problem is correct, it will hardly help us on the genome-wide scale. Just like in case of the change problem and Manhattan tourist problem dynamic programming is going to save the day. + + +------ + +# Dynamic programming to the rescue + +It turns out that in order to find the alignment we first need to learn how to compute edit distances between sequences efficiently. So, suppose we have two sequences that deliberately have different lengths: + +$\texttt{G C T A T A C}$ + +and + +$\texttt{G C G T A T G C}$ + +Let's represent them as the following matrix where the first character $\epsilon$ represents an empty string: + +$$ +\begin{array}{ c | c | c | c | c | c | c} +& \epsilon & G & C & T & A & T & A & C\\ +\hline +\epsilon\\ +\hline +G\\ +\hline + C\\ + \hline +G\\ +\hline +T\\ +\hline +A\\ + \hline +T\\ +\hline +G\\ +\hline +C +\end{array} +\\ +\textbf{Note}: sequence\ \texttt{X}\ is\ vertical\ and\ sequence\ \texttt{Y}\ is\ horizontal. +$$ + + +Let's fill the first column and first row of the matrix. Because the distance between a string and an empty string is equal to the length of the string (e.g., a distance between string $\texttt{TCG}$ and an empty string is 3) this resulting matrix will look like this: + + + +$$ +\begin{array}{ c | c | c | c | c | c | c} +& \epsilon & G & C & T & A & T & A & C\\ + \hline + \epsilon & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7\\ + \hline + G & 1\\ + \hline + C & 2\\ + \hline + G & 3\\ + \hline + T & 4\\ + \hline + A & 5\\ + \hline + T & 6\\ + \hline + G & 7\\ + \hline + C & 8 +\end{array} +\\ +\textbf{Note}: sequence\ \texttt{X}\ is\ vertical\ and\ sequence\ \texttt{Y}\ is\ horizontal. +$$ + + +Now, let's fill in the cell between $\texttt{G}$ and $\texttt{G}$. Recall that: + +$$ +Edit\ Distance(\alpha\texttt{x},\beta\texttt{y}) = min\begin{cases} + \color{red}{Edit\ Distance(\alpha,\beta) + \delta(x,y)} & \\ + \color{blue}{Edit\ Distance(\alpha\texttt{x},\beta) + 1} & \\ + \color{green}{Edit\ Distance(\alpha,\beta\texttt{y}) + 1} + \end{cases} +$$ + + +where $\delta(x,y) = 0$ if $x = y$ and $\delta(x,y) = 1$ if $x \neq y$. And now let's color each of the cells corresponding to each part of the above expression: + +$$ +\begin{array}{ c | c | c | c | c | c | c} +& \epsilon & G & C & T & A & T & A & C\\ +\hline +\epsilon & \color{red}0 & \color{green}1 & 2 & 3 & 4 & 5 & 6 & 7\\ +\hline +G & \color{blue}1\\ + \hline + C & 2\\ + \hline + G & 3\\ + \hline + T & 4\\ + \hline + A & 5\\ + \hline + T & 6\\ + \hline + G & 7\\ + \hline + C & 8 +\end{array} +\\ +\textbf{Note}: sequence\ \texttt{X}\ is\ vertical\ and\ sequence\ \texttt{Y}\ is\ horizontal. +$$ + +In this specific case: + +$$Edit\ Distance(\epsilon\texttt{G},\epsilon\texttt{G}) = min\begin{cases} + \color{red}{Edit\ Distance(\epsilon,\epsilon) + \delta(G,G)\ or\ 0\ +\ 0\ =\ 0} & \\ + \color{blue}{Edit\ Distance(\epsilon\texttt{G},\epsilon) + 1\ or\ 1\ +\ 1\ =\ 2} & \\ + \color{green}{Edit\ Distance(\epsilon,\epsilon\texttt{G}) + 1\ or\ 1\ +\ 1\ =\ 2} + \end{cases} +$$ + + +This minimum of 0, 2, and 2 will be 0, so we are putting zero into that cell: + +$$ +\begin{array}{ c | c | c | c | c | c | c} +& \epsilon & G & C & T & A & T & A & C\\ +\hline +\epsilon & \color{red}0 & \color{green}1 & 2 & 3 & 4 & 5 & 6 & 7\\ +\hline +G & \color{blue}1 & \color{red}0\\ + \hline + C & 2\\ + \hline + G & 3\\ + \hline + T & 4\\ + \hline + A & 5\\ + \hline + T & 6\\ + \hline + G & 7\\ + \hline + C & 8 +\end{array} +\\ +\textbf{Note}: sequence\ \texttt{X}\ is\ vertical\ and\ sequence\ \texttt{Y}\ is\ horizontal. +$$ + +Using this uncomplicated logic we can fill the entire matrix like this: + +$$ +\begin{array}{ c | c | c | c | c | c | c} +& \epsilon & G & C & T & A & T & A & C\\ +\hline +\epsilon & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7\\ +\hline + G & 1 & 0 & 1 & 2 & 3 & 4 & 5 & 6\\ + \hline + C & 2 & 1 & 0 & 1 & 2 & 3 & 4 & 5\\ + \hline + G & 3 & 2 & 1 & 1 & 2 & 3 & 4 & 5\\ + \hline + T & 4 & 3 & 2 & 1 & 2 & 2 & 3 & 4\\ + \hline + A & 5 & 4 & 3 & 2 & 1 & 2 & 2 & 3\\ + \hline + T & 6 & 5 & 4 & 3 & 2 & 1 & 2 & 3\\ + \hline + G & 7 & 6 & 5 & 4 & 3 & 2 & 2 & 3\\ + \hline + C & 8 & 7 & 6 & 5 & 4 & 3 & 3 & \color{red}2 +\end{array} +\\ +\textbf{Note}: sequence\ \texttt{X}\ is\ vertical\ and\ sequence\ \texttt{Y}\ is\ horizontal. +$$ + + +The lower rightmost cell highlighted in red is special. It contains the value for the edit distance between the two strings. The following Python script implements this idea. You can see that it is essentially instantaneous: + + + +------ + +# From edit distance to alignment + +In the previous section we have filled the dynamic programming matrix to find out that the edit distance between the sequences is 2. But in biological applications we are most often interested not in edit distance *per se* but in the actual **alignment** between two sequences. + +## The traceback + +We can use the dynamic programming matrix to reconstruct the alignment. This is done using **traceback** procedure. Let's look at the rightmost bottom cell of the matrix highlighted in **bold**: + +$$ +\begin{array}{ c | c | c | c | c | c | c} +& \epsilon & G & C & T & A & T & A & C\\ +\hline +\epsilon & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7\\ + \hline +G & 1 & 0 & 1 & 2 & 3 & 4 & 5 & 6\\ +\hline +C & 2 & 1 & 0 & 1 & 2 & 3 & 4 & 5\\ +\hline +G & 3 & 2 & 1 & 1 & 2 & 3 & 4 & 5\\ +\hline +T & 4 & 3 & 2 & 1 & 2 & 2 & 3 & 4\\ +\hline +A & 5 & 4 & 3 & 2 & 1 & 2 & 2 & 3\\ +\hline +T & 6 & 5 & 4 & 3 & 2 & 1 & 2 & 3\\ +\hline +G & 7 & 6 & 5 & 4 & 3 & 2 & \color{red}2 & \color{green}3\\ +\hline +C & 8 & 7 & 6 & 5 & 4 & 3 & \color{blue}3 & \textbf{2} +\end{array} +$$ + +When we were filling this matrix did we come to this point from above ($\color{green}3$), from the left ($\color{blue}3$), or diagonally ($\color{red}2$): + +$$ + \begin{array}{ | c | c } + \hline + \color{red}2 & \color{green}3\\ + \hline + \color{blue}3 & \textbf{2} + \end{array} +$$ + + +Remembering the fact that when filling the matrix we are trying to minimize the expression for edit distance, we clearly arrived to this point diagonally from $\color{red}2$. This because arriving from top ($\color{green}3$) or left ($\color{blue}3$) would add 1. So we highlight diagonal cell in red: + + +$$ +\begin{array}{ c | c | c | c | c | c | c} +& \epsilon & G & C & T & A & T & A & C\\ +\hline +\epsilon & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7\\ +\hline +G & 1 & 0 & 1 & 2 & 3 & 4 & 5 & 6\\ +\hline +C & 2 & 1 & 0 & 1 & 2 & 3 & 4 & 5\\ +\hline +G & 3 & 2 & 1 & 1 & 2 & 3 & 4 & 5\\ +\hline +T & 4 & 3 & 2 & 1 & 2 & 2 & 3 & 4\\ +\hline +A & 5 & 4 & 3 & 2 & 1 & 2 & 2 & 3\\ +\hline +T & 6 & 5 & 4 & 3 & 2 & 1 & 2 & 3\\ +\hline +G & 7 & 6 & 5 & 4 & 3 & 2 & \color{red}2 & 3\\ +\hline +C & 8 & 7 & 6 & 5 & 4 & 3 & 3 & \color{red}2 +\end{array} +$$ + + +Continuing this idea we will draw a trace like the one below until we hit an interesting point: + +$$ +\begin{array}{ c | c | c | c | c | c | c} +& \epsilon & G & C & T & A & T & A & C\\ +\hline +\epsilon & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7\\ +\hline +G & 1 & 0 & 1 & 2 & 3 & 4 & 5 & 6\\ +\hline +C & 2 & 1 & 0 & 1 & 2 & 3 & 4 & 5\\ +\hline +G & 3 & 2 & \color{red}1 & 1 & 2 & 3 & 4 & 5\\ +\hline +T & 4 & 3 & 2 & \color{red}1 & 2 & 2 & 3 & 4\\ +\hline +A & 5 & 4 & 3 & 2 & \color{red}1 & 2 & 2 & 3\\ +\hline +T & 6 & 5 & 4 & 3 & 2 & \color{red}1 & 2 & 3\\ +\hline +G & 7 & 6 & 5 & 4 & 3 & 2 & \color{red}2 & 3\\ +\hline +C & 8 & 7 & 6 & 5 & 4 & 3 & 3 &\color{red}2 +\end{array} +$$ + + +At this point we have arrived to this value from the top because 0 + 1 = 1. If we were arriving diagonally it would be 1 + 1 = 2 since $\texttt{G}\ \neq\ \texttt{C}$, so we are turning traceback upward and then again diagonally: + + +$$ +\begin{array}{ c | c | c | c | c | c | c} +& \epsilon & G & C & T & A & T & A & C\\ +\hline +\epsilon & \color{red}0 & 1 & 2 & 3 & 4 & 5 & 6 & 7\\ +\hline +G & 1 & \color{red}0 & 1 & 2 & 3 & 4 & 5 & 6\\ +\hline +C & 2 & 1 & \color{red}0 & 1 & 2 & 3 & 4 & 5\\ +\hline +G & 3 & 2 & \color{red}1 & 1 & 2 & 3 & 4 & 5\\ +\hline +T & 4 & 3 & 2 & \color{red}1 & 2 & 2 & 3 & 4\\ +\hline +A & 5 & 4 & 3 & 2 & \color{red}1 & 2 & 2 & 3\\ +\hline +T & 6 & 5 & 4 & 3 & 2 & \color{red}1 & 2 & 3\\ +\hline +G & 7 & 6 & 5 & 4 & 3 & 2 & \color{red}2 & 3\\ +\hline +C & 8 & 7 & 6 & 5 & 4 & 3 & 3 & \color{red}2 +\end{array} +$$ + + +Now going through traceback line from the top we are getting the following alignment between two two sequences: + +``` +G C - T A T A C +| | * | | | * | +G C G T A T G C +``` + +## Approximate matching + +Let's now get a bit more practical. In many real biological applications we are trying to see if one sequence is contained within another. So, how can we use dynamic programming to find if there is an approximate match between two sequences $\it\texttt{P}$ and $\texttt{T}$? + +Suppose we have two strings: + +$\it{T} = \texttt{A A C C C T A T G T C A T G C C T T G G A}$ + +and + +$\it{P} = \texttt{T A C G T C A G C}$ + +Let's construct the following matrix: + +$$ +\begin{array}{ c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c } +& \epsilon & A & A & C & C & C & T & A & T & G & T & C & A & T & G & C & C & T & T & G & G & A\\ +\hline +\\ +\hline +T\\ +\hline + A\\ +\hline +C\\ +\hline +G\\ +\hline +T\\ +\hline +C\\ +\hline +A\\ +\hline +G\\ +\hline +C\\ +\end{array} +\\ +\textbf{Note}: sequence\ \texttt{T}\ is\ horizontal\ while\ \texttt{P}\ is\ vertical. +$$ + + + +Let me remind you that our goal is to find where $\it\texttt{P}$ matches $\texttt{T}$. *A priori* we do not know when it may be, so we will start by filling the entire first row with zeroes. This is because the match between $\it\texttt{P}$ and $\texttt{T}$ may start at any point up there. The first column we will initialize the same way we did previously: with increasing sequence of numbers: + + +$$ +\begin{array}{ c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c } +& \epsilon & A & A & C & C & C & T & A & T & G & T & C & A & T & G & C & C & T & T & G & G & A\\ +\hline +& 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\ +\hline +T & 1\\ +\hline + A & 2\\ +\hline +C & 3\\ +\hline +G & 4\\ +\hline +T & 5\\ +\hline +C & 6\\ +\hline +A & 7\\ +\hline +G & 8\\ +\hline +C & 9\\ +\end{array} +$$ +Now let's fill this matrix in using the same logic we used before: +$$ +\begin{array}{ c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c } +& \epsilon & A & A & C & C & C & T & A & T & G & T & C & A & T & G & C & C & T & T & G & G & A\\ +\hline +& 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\ +\hline +T & 1 & 1 & 1 & 1 & 1 & 1 & 0 & 1 & 0 & 1 & 0 & 1 & 1 & 0 & 1 & 1 & 1 & 0 & 0 & 1 & 1 & 1\\ +\hline +A & 2 & 1 & 1 & 2 & 2 & 2 & 1 & 0 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 2 & 2 & 1 & 1 & 1 & 2 & 1\\ +\hline +C & 3 & 2 & 2 & 1 & 2 & 2 & 2 & 1 & 1 & 2 & 2 & 1 & 2 & 2 & 2 & 1 & 2 & 2 & 2 & 2 & 2 & 2\\ +\hline +G & 4 & 3 & 3 & 2 & 2 & 3 & 3 & 2 & 2 & 1 & 2 & 2 & 2 & 3 & 2 & 2 & 2 & 3 & 3 & 2 & 2 & 3\\ +\hline +T & 5 & 4 & 4 & 3 & 3 & 3 & 3 & 3 & 2 & 2 & 1 & 2 & 3 & 2 & 3 & 3 & 3 & 2 & 3 & 3 & 3 & 3\\ +\hline +C & 6 & 5 & 5 & 4 & 3 & 3 & 4 & 4 & 3 & 3 & 2 & 1 & 2 & 3 & 3 & 3 & 3 & 3 & 3 & 4 & 4 & 4\\ +\hline +A & 7 & 6 & 5 & 5 & 4 & 4 & 4 & 4 & 4 & 4 & 3 & 2 & 1 & 2 & 3 & 4 & 4 & 4 & 4 & 4 & 5 & 4\\ +\hline +G & 8 & 7 & 6 & 6 & 5 & 5 & 5 & 5 & 5 & 4 & 4 & 3 & 2 & 2 & 2 & 3 & 4 & 5 & 5 & 4 & 4 & 5\\ +\hline +C & 9 & 8 & 7 & 6 & 6 & 5 & 6 & 6 & 6 & 5 & 5 & 4 & 3 & 3 & 3 & 2 & 3 & 4 & 5 & 5 & 5 & 5 +\end{array} +$$ + +Now that we have filled in the complete matrix let's look at the bottom row. Instead of using the rightmost cell we will find a cell with the lowest number. That would be 2 (highlighted in red): + + +$$ +\begin{array}{ c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c } + & \epsilon & A & A & C & C & C & T & A & T & G & T & C & A & T & G & C & C & T & T & G & G & A\\ + \hline + & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\ + \hline + T & 1 & 1 & 1 & 1 & 1 & 1 & 0 & 1 & 0 & 1 & 0 & 1 & 1 & 0 & 1 & 1 & 1 & 0 & 0 & 1 & 1 & 1\\ + \hline + A & 2 & 1 & 1 & 2 & 2 & 2 & 1 & 0 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 2 & 2 & 1 & 1 & 1 & 2 & 1\\ + \hline + C & 3 & 2 & 2 & 1 & 2 & 2 & 2 & 1 & 1 & 2 & 2 & 1 & 2 & 2 & 2 & 1 & 2 & 2 & 2 & 2 & 2 & 2\\ + \hline + G & 4 & 3 & 3 & 2 & 2 & 3 & 3 & 2 & 2 & 1 & 2 & 2 & 2 & 3 & 2 & 2 & 2 & 3 & 3 & 2 & 2 & 3\\ + \hline + T & 5 & 4 & 4 & 3 & 3 & 3 & 3 & 3 & 2 & 2 & 1 & 2 & 3 & 2 & 3 & 3 & 3 & 2 & 3 & 3 & 3 & 3\\ + \hline + C & 6 & 5 & 5 & 4 & 3 & 3 & 4 & 4 & 3 & 3 & 2 & 1 & 2 & 3 & 3 & 3 & 3 & 3 & 3 & 4 & 4 & 4\\ + \hline + A & 7 & 6 & 5 & 5 & 4 & 4 & 4 & 4 & 4 & 4 & 3 & 2 & 1 & 2 & 3 & 4 & 4 & 4 & 4 & 4 & 5 & 4\\ + \hline + G & 8 & 7 & 6 & 6 & 5 & 5 & 5 & 5 & 5 & 4 & 4 & 3 & 2 & 2 & 2 & 3 & 4 & 5 & 5 & 4 & 4 & 5\\ + \hline + C & 9 & 8 & 7 & 6 & 6 & 5 & 6 & 6 & 6 & 5 & 5 & 4 & 3 & 3 & 3 & \color{red}2 & 3 & 4 & 5 & 5 & 5 & 5 +\end{array}$$ + + +Starting already familiar traceback procedure at that cell we will get the following path through the matrix: + + +$$ +\begin{array}{ c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c | c } + & \epsilon & A & A & C & C & C & T & A & T & G & T & C & A & T & G & C & C & T & T & G & G & A\\ + \hline + & 0 & 0 & 0 & 0 & 0 & \color{red}0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\ + \hline + T & 1 & 1 & 1 & 1 & 1 & 1 & \color{red}0 & 1 & 0 & 1 & 0 & 1 & 1 & 0 & 1 & 1 & 1 & 0 & 0 & 1 & 1 & 1\\ + \hline + A & 2 & 1 & 1 & 2 & 2 & 2 & 1 & \color{red}0 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 2 & 2 & 1 & 1 & 1 & 2 & 1\\ + \hline + C & 3 & 2 & 2 & 1 & 2 & 2 & 2 & 1 & \color{red}1 & 2 & 2 & 1 & 2 & 2 & 2 & 1 & 2 & 2 & 2 & 2 & 2 & 2\\ + \hline + G & 4 & 3 & 3 & 2 & 2 & 3 & 3 & 2 & 2 & \color{red}1 & 2 & 2 & 2 & 3 & 2 & 2 & 2 & 3 & 3 & 2 & 2 & 3\\ + \hline + T & 5 & 4 & 4 & 3 & 3 & 3 & 3 & 3 & 2 & 2 & \color{red}1 & 2 & 3 & 2 & 3 & 3 & 3 & 2 & 3 & 3 & 3 & 3\\ + \hline + C & 6 & 5 & 5 & 4 & 3 & 3 & 4 & 4 & 3 & 3 & 2 & \color{red}1 & 2 & 3 & 3 & 3 & 3 & 3 & 3 & 4 & 4 & 4\\ + \hline + A & 7 & 6 & 5 & 5 & 4 & 4 & 4 & 4 & 4 & 4 & 3 & 2 & \color{red}1 & \color{red}2 & 3 & 4 & 4 & 4 & 4 & 4 & 5 & 4\\ + \hline + G & 8 & 7 & 6 & 6 & 5 & 5 & 5 & 5 & 5 & 4 & 4 & 3 & 2 & 2 & \color{red}2 & 3 & 4 & 5 & 5 & 4 & 4 & 5\\ + \hline + C & 9 & 8 & 7 & 6 & 6 & 5 & 6 & 6 & 6 & 5 & 5 & 4 & 3 & 3 & 3 & \color{red}2 & 3 & 4 & 5 & 5 & 5 & 5 +\end{array} + $$ + + + +for a corresponding alignment: + +``` +A A C C C T A T G T C A T G C C T T G G A + | | * | | | | * | | + T A C G T C A - G C +``` + +------ + +# Global alignment + +So far in filling the dynamic programming matrix we were using the following expression to compute the number within each cell: + +