Needleman–Wunsch algorithm

Let $A=\{a_1, ..., a_n\}$ and $B=\{b_1, ..., b_m\}$ be the sequences being aligned. To find the best alignment, we first construct a matrix $M$ . Let $M_{i,j}$ be the value of cell $(i, j)$ of matrix $M$. Let $S(a_i,b_j)$ be the score of nucleotides $a_i$ and $b_j$. Let $d$ be the gap penalty. We fill in the matrix $M$ by the following scheme,

$$M_{i,j} = \max\left\{M_{i−1,j−1}+S(a_i,b_j),M_{i−1,j}+d,M_{i,j−1}+d\right\}$$

After the matrix $M$ is filled by the optimal scores $M_{i,j}$, the alignment with the highest score can be reconstructed by tracing back through the lower-right corner to the upper-left corner of matrix $M$.

Algorithm

Two sequences are arranged in a matrix form with $(n+1)$ columns and $(m+1)$ rows
Calculating $M_{i,j}$ for each cell. The values in the first row and first column are set to be the scores for gaps
Tracing back from the lower-right corner to the upper-left corner

Example

Suppose two sequences are ACGTC and AGTC and the score function is $\begin{align} S(a,b)=\begin{cases} 1, &a=b \\ -1, &a \ne b \\ 0, &gap\end{cases} \end{align}$. The matrix $M$ is filled with scores from upper-left corner to lower-right corner as follows.

		A	G	T	C
	0	0	0	0	0
A	0	1	1	1	1
C	0	1	1	1	2
G	0	1	2	2	2
T	0	1	2	3	3
C	0	1	2	3	4

Then, we trace back from lower-right corner ($4$) to upper-left corner ($0$). The similarity score of the best alignment is $4$ and the alignment is given below

A C G T C

A - G T C

R code

Needleman–Wunsch algorithm

Phylogenomics