Let $A=\{a_1, ..., a_n\}$ and $B=\{b_1, ..., b_m\}$ be the sequences being aligned. To find the best alignment, we first construct a matrix $M$ . Let $M_{i,j}$ be the value of cell $(i, j)$ of matrix $M$. Let $S(a_i,b_j)$ be the score of nucleotides $a_i$ and $b_j$. Let $d$ be the gap penalty. We fill in the matrix $M$ by the following scheme,
$$M_{i,j} = \max\left\{M_{i−1,j−1}+S(a_i,b_j),M_{i−1,j}+d,M_{i,j−1}+d\right\}$$
After the matrix $M$ is filled by the optimal scores $M_{i,j}$, the alignment with the highest score can be reconstructed by tracing back through the lower-right corner to the upper-left corner of matrix $M$.
Algorithm
- Two sequences are arranged in a matrix form with $(n+1)$ columns and $(m+1)$ rows
- Calculating $M_{i,j}$ for each cell. The values in the first row and first column are set to be the scores for gaps
- Tracing back from the lower-right corner to the upper-left corner
Example
Suppose two sequences are ACGTC and AGTC and the score function is $\begin{align} S(a,b)=\begin{cases} 1, &a=b \\ -1, &a \ne b \\ 0, &gap\end{cases} \end{align}$. The matrix $M$ is filled with scores from upper-left corner to lower-right corner as follows.
A | G | T | C | ||
0 | 0 | 0 | 0 | 0 | |
A | 0 | 1 | 1 | 1 | 1 |
C | 0 | 1 | 1 | 1 | 2 |
G | 0 | 1 | 2 | 2 | 2 |
T | 0 | 1 | 2 | 3 | 3 |
C | 0 | 1 | 2 | 3 | 4 |
Then, we trace back from lower-right corner ($4$) to upper-left corner ($0$). The similarity score of the best alignment is $4$ and the alignment is given below
A C G T C
A - G T C