By lliu | Sat, 04/24/2021 - 11:25

Dynamic programming for pairwise alignment can be extended to multiple sequences. Let $A_k=\{a_{1,k}, ..., a_{n_k,k}\}$ for $k=1,...,K$ be the $K$ sequences being aligned, where $n_k$ is the length of sequence $A_k$. To find the best alignment, we first construct a $K$-dimensional array $M$. We use $M_{i_1,...,i_K}$ to denote the value of cell $(i_1,...,i_K)$ of $M$. We fill in the matrix by the following scheme,

$$M_{i_1,...,i_K} = \max\left\{M_{i_1',...,i_K'}+S(a_{i_1',1},...,a_{i_K',K})+d\times (K-1)\times \sum_j I_{(i_j'=i_j)} \right\}$$

Note that $ \sum_j I_{(i_j'=i_j)}$ is the count of gaps. The similarity score $S(a_{i_1',1},...,a_{i_K',K})$ of multiple nucleotides is calcualted by the following formula,

$$S(a_{i_1',1},...,a_{i_K',K})=\sum_{j=1}^{K-1}\sum_{k=j+1}^K I_{(i_j'=i_j-1)}I_{(i_k'=i_k-1)}S(a_{i_j,j},a_{i_k,k})$$

Here $I_{(i_j'=i_j-1)}$ is an indicator funciton for the corresponding nucleotide $a_{i_j,j}$. After the array $M$ is filled by the optimal scores $M_{i_1,...,i_K}$, the alignment with the highest score can be reconstructed by tracing back through the lower-right corner to the upper-left corner of $M$.

Algorithm

  • $K$ sequences are arranged in an array form $M$
  • Calculating $M_{i_1,...,i_K}$ for each cell
  • Tracing back from the lower-right corner to the upper-left corner

The search space thus increases exponentially with $K$ and also increases linearly with sequence length. To find the global optimum for $K$ sequences is an NP-complete problem.