By lliu | Fri, 04/23/2021 - 15:41

Let $A=\{a_1, ..., a_n\}$ and $B=\{b_1, ..., b_m\}$ be the sequences being aligned. To find the best alignment, we first construct a matrix $M$ . Let $M_{i,j}$ be the value of cell $(i, j)$ of matrix $M$. Let $S(a_i,b_j)$ be the score of nucleotides $a_i$ and $b_j$. Let $d$ be the gap penalty. We fill in the matrix $M$ by the following scheme,

$$M_{i,j} = \max\left\{M_{i−1,j−1}+S(a_i,b_j),M_{i−1,j}+d,M_{i,j−1}+d\right\}$$

After the matrix $M$ is filled by the optimal scores $M_{i,j}$, the alignment with the highest score can be reconstructed by tracing back through the lower-right corner to the upper-left corner of matrix $M$.

Algorithm

  • Two sequences are arranged in a matrix form with $(n+1)$ columns and $(m+1)$ rows
  • Calculating $M_{i,j}$ for each cell. The values in the first row and first column are set to be the scores for gaps
  • Tracing back from the lower-right corner to the upper-left corner

Example

Suppose two sequences are ACGTC and AGTC and the score function is $\begin{align} S(a,b)=\begin{cases} 1, &a=b \\ -1, &a \ne b \\ 0, &gap\end{cases} \end{align}$. The matrix $M$ is filled with scores from upper-left corner to lower-right corner as follows.

    A G T C
  0 0 0 0 0
A 0 1 1 1 1
C 0 1 1 1 2
G 0 1 2 2 2
T 0 1 2 3 3
C 0 1 2 3 4

Then, we trace back from lower-right corner ($4$) to upper-left corner ($0$). The similarity score of the best alignment is $4$ and the alignment is given below

A C G T C

A -  G T C

 R code