SP alignment and Maximum Weight Trace:

Next: Step 2 Up: Near Optimal Multiple Sequence Previous: Tree Alignment (TA) and

SP alignment and Maximum Weight Trace:

The MSA and maximum weight trace methods [5,14,15,24] differ from the others because they compute an optimal alignment with respect to a well-defined MSA scoring (sum-of-pairs). This is an exact algorithm and the running time is typically O(kⁿ) which is only practical for a very small number (n) and length (k) of sequences. In addition, the scoring function that is used does not reflect the structure of the underlying evolutionary tree, unless the scoring function is weighted, in which case a tree must be known (or constructed). A better and more detailed overview of multiple sequence alignment methods is presented in [27,19,38].
Error Bounds and Performance
Guarantees There is a bounded-error approximation method by Gusfield [16] for the sum of pairs (SP) measure that calculates an MSA with a score that is never more than twice that of the optimal SP score. Here a scoring method is used where lower scores correspond to better alignments. Another bound is presented by Bafna, Lawler and Pevzner [2]. Their method aligns n strings and produces an MSA whose SP score can never exceed $2-\frac{q}{n}$ times the optimal SP score, for any chosen q. For any fixed value of q, the running time of the method is polynomial in k (the assumed string length of each string) and n. As q increases, so does the provable accuracy of the result, but the worst-case running time also increases.
In this extended abstract we present an algorithm that runs in O(n² k²) time and produces an MSA whose score is at least $\frac{n-1}{n}\cdot opt$ . The score of the alignment represents the probability of the entire evolutionary configuration. In the following sections we briefly explain the scoring function. See [11] for a more detailed exposition. We then describe the algorithm for the calculation of MSAs and present experimental results. Methods Sum of Pairs versus Circular Sum measure

Definition 0.6 The tree T(S)=(V, E, L) is a binary, leaf labeled tree with leafset $L = \{s_1, .., s_n\}$ .

In our context a tree T(S) associated with a set of sequences $S = \{s_1, .., s_n\}$ is the tree that corresponds to the evolutionary history of the sequences of S. The internal nodes V represent (usually unknown) ancestor sequences. To calculate the score with the SP measure [5], all $n \choose{2}$ scores of the pairwise alignments are added up. SP methods are obviously deficient from an evolutionary perspective. Consider a tree (Figure 2) constructed for a family containing five proteins. The score of a pairwise alignment $\langle A, B \rangle$ evaluates the probability of evolutionary events on edges (u, A) and (u, B) of the tree; that is, the edges that represent the evolutionary distance between sequence A and sequence B. Likewise, the score of a pairwise alignment $\langle C, D \rangle$ evaluates the likelihood of evolutionary events on edges (C, w), (w, v) and (v, D) of the tree.

**Figure 2:** Traversal of trees using the SP measure. Some edges are traversed more often than others.
$\begin{figure} \begin{center} \mbox{\psfig{file=treeSP2.eps,height=0.12\textheight,angle=0} } \end{center} \end{figure}$

By adding ``ticks'' to the evolutionary tree that are drawn each time an edge is evaluated when calculating the SP score (Figure 2), it is readily seen that with the SP method different edges of the evolutionary tree of the protein family are counted a different numbers of times. In the example tree on the left side that corresponds to the MSA of Figure 1, edges (r, u), (r, w) and (w, v) are each counted six times by the SP method, while edges (u, A), (u, B), (v, D), (v, E), and (w, C) are each counted four times. It gets worse as the tree grows (see tree on the right). There is no theoretical justification to suggest that some evolutionary events are more important than other ones. Thus, SP methods are intrinsically problematic from an evolutionarily perspective for scoring MSAs. This was the motivation to developed a scoring method that evaluates each edge equally. In addition, we wanted a scoring function that does not depend on the actual tree structure. Traveling Salesman

Definition 0.7 A Circular order $C(\mbox{\rm A})$ of an MSA $\mbox{\rm A}$ is any tour through a tree $T(\mbox{\rm A})$ where each edge is traversed exactly twice, and each leaf is visited once.

When a tree is traversed in a circular order, that is, from leaf A to B, from B to C, from C to D, from D to E and then back from E to leaf A (Figure 3), all edges are counted exactly twice, independent of the tree structure. A circular order is the shortest possible tour through a tree where each leaf is visited once (shortest means smallest sum of edge lengths)(see Figure 3) [11]. Since we do not want to rely on any tree, the problem we consider is to find such an order without having any information about the tree structure. To solve this problem we reduce it to the symmetric Traveling Salesman Problem (TSP): given is a matrix M that contains the $n \choose{2}$ distances of n cities [23,22]. The problem is to find the shortest tour where each city is visited once. We use a modified version of the problem: in our case, the cities correspond to the sequences and the distances are the scores of the pairwise alignments. In addition, we are interested in the longest, not shortest tour, as our scores represent probabilities and we are interested in the maximum probability ¹.

Problem 0.2 (TSP problem) Given is a set of sequences $S = \{s_1, .., s_n\}$ and the corresponding scores of the optimal pairwise alignments. The problem is to find the longest tour where each sequence is visited once.

**Figure 3:** Traversal of a tree in circular order
$\begin{figure} \begin{center} \mbox{\psfig{file=treeCS.eps,height=0.15\textheight,angle=0} } \end{center} \end{figure}$

Both the construction of optimal evolutionary trees [8,20] and the Traveling Salesman Problem are NP complete. But the TSP is very well studied and optimal solutions can be calculated within a few hours for up to 1000 cities and in a few seconds for up to 100 cities, whereas the construction of evolutionary trees is still a big problem. There are heuristics for large scale problems that calculate near optimal solutions that are within 1% to 2% of the optimum [29,13]. For real applications we have seldom more than 100 sequences to compare simultaneously, and the calculation of the optimal TSP solution usually takes only a small fraction of the time it takes to compute all pairwise alignments to derive the scores.

Definition 0.8 The TSP order $TSP(\mbox{\rm A})$ of an MSA $\mbox{\rm A}$ is the order of the sequences that is derived from the optimal solution of a TSP.

In summary, we can compute a circular order $C(\mbox{\rm A})$ without knowing $T(\mbox{\rm A})$ . For a more detailed explanation see [11]. MSA scoring function

Definition 0.9 The function $\underline{S}: \Sigma' \times \Sigma' \rightarrow I \!\! R$ is defined as follows:
$S(x, y) = \left\{ { \begin{tabular}{l} $DM(x, y)$ , if $x \neq ''\_''$\space ... ...at\\ depends on the gap length \cite{BennerIndels93} \end{tabular}} \right.$
DM is an entry in the Dayhoff matrix [10].

Our scoring function $F(\mbox{\rm A})$ of an MSA is similar to the sum-of-pairs measure, except that we do not add all scores of all pairwise alignments, but only the scores of the pairwise alignments in the circular order C divided by two. It is the sum-of-pairs in circular order:

Definition 0.10 The scoring function $F(\mbox{\rm A})$ of an MSA $\mbox{\rm A}= \{a_1,..,a_n\}$ is:
$F(\mbox{\rm A}) = \frac{1}{2}\sum\limits_{m=1}^{k} \sum\limits_{i=1}^{n} S(a_{C_{i}}[m], a_{C_{i+1}}[m]),$ with C_n+1 = C₁

where k is the length of a sequence in the MSA, n is the number of sequences, C is the TSP ordering, a_{C_i} is a sequence in the MSA, and so the a_{C_i}[m] are amino acids or gap characters within that sequence. S(x, y) is the score of aligning the symbol $x \in \Sigma'$ with the symbol $y \in \Sigma'$ . The function S(x, y) is defined above (see Definition 2.4). Circular Tours and Gaps Gaps play a special role in MSAs: the process of calculating an MSA can be viewed as inserting gaps at the correct places. Assume we are given an MSA $\mbox{\rm A}$ and we know the correct tree $T(\mbox{\rm A})$ . Insertions and deletions (indels) are evolutionary events that can be placed in the evolutionary tree (see Figure 5). The result of a deletion event are gaps that appear in all sequences of the subtree where the event took place. In contrast, when the event was an insertion, gaps appear in all sequences that are not in the subtree below the event.

Definition 0.11 A gap block B is a set of gaps $\{g_1 \in a_1, .., g_k \in a_k \}$ where the gaps have the following properties: $\forall g_i, g_j \in B: pos(g_i) = pos(g_j)$ and len(g_i) = len(g_j).

Example: In the alignment of Figure 4, there are 4 indel events and therefore 4 gap blocks. The first gap block consists of two gaps that appear in sequences A and B. We assume we have the correct tree and we also assume that we know the place of the root. In this case the event must be a deletion in the ancestor of sequences A and B. Accordingly, event 2 is an insertion in sequence D, event 3 represents a deletion in the ancestor of D and E, and finally event 4 is an insertion in sequence C.

Figure 4: An MSA and a table containing the pairwise scores


Event     1         2        3         4
A: RPCVCP___VLRQAAQ__QVLQRQIIQGPQQLRRLF_AA
B: RPCACP___VLRQVVQ__QALQRQIIQGPQQLRRLF_AA
C: KPCLCPKQAAVKQAAH__QQLYQGQLQGPKQVRRAFRLL
D: KPCVCPRQLVLRQAAHLAQQLYQGQ____RQVRRAF_VA
E: KPCVCPRQLVLRQAAH__QQLYQGQ____RQVRRLF_AA

	A	B	C	D	E
A	0	336	27	44	110
B	336	0	79	56	99
C	27	79	0	171	176
D	44	56	171	0	327
E	110	99	176	327	0

**Figure 5:** Insertion and deletion events are shown in the evolutionary tree.
$\begin{figure} \begin{center} \mbox{\psfig{file=treegaps.eps,height=0.12\textheight,angle=0} } \end{center} \end{figure}$

The table to the right shows the pairwise scores of the sequences. It is easy to verify that the shortest tour is (A, B, C, D, E, A) and that it is a circular tour in the corresponding tree (see Figure 1). The score of the MSA is the sum of pairwise alignments in circular order divided by two, so $F(\mbox{\rm A}) = (336 + 79 + 171 + 327 + 110) /2 = 512$ . Here the MSA has the optimal score. Note that the sequences above are ordered in a circular way $C(\mbox{\rm A})$ . Whenever sequences are in a circular order with respect to the evolutionary tree, the gaps that belong to the same evolutionary event appear in a gap block (ignoring the breaking of the circularity). Results The fact that gaps originating from the same event form gap blocks if a circular tour is known simplifies the task of calculating an MSA.

**Figure 6:** Schematic representation of an MSA in a circular order
$\begin{figure} \begin{center} \mbox{\psfig{file=zylindergaps.EPS,height=0.2\textheight,angle=0} } \end{center} \end{figure}$

In Figure 6, a schematic representation of an MSA is shown, where the vertical lines are sequences in the alignment. The rectangles represent gap blocks, and the cylinder itself represents the circular order of the tree. Once a circular order $C(\mbox{\rm A})$ is known, we have to select a starting point in the cycle. In principle it does not matter where we start, but as we will see later it is simpler if we cut the cycle in a place where the sequences are farther apart. This is the two sequences in the cycle with the smallest similarity score. We then renumber the sequences with respect to the cycle starting from that point (see Figure 6). Algorithm

Definition 0.12 Similar to the definition of the score of an MSA we define the partial score $P(\mbox{\rm A})$ of an MSA $\mbox{\rm A}$ : it is the sum-of-pairs in circular order except for the last term:
$P(\mbox{\rm A}) = \frac{1}{2}\sum\limits_{m=1}^{k} \sum\limits_{i=1}^{n-1} S(a_{C_{i}}[m], a_{C_{i+1}}[m])$
(see 1.3).

The partial score of the MSA in Figure 4 is $P(\mbox{\rm A}) = (336 + 79 + 171 + 327) /2 = 456.5$ .
Assume we are given n sequences $\{s_1, .. ,s_n\}$ which are ordered with respect to a circular order C. Assume further that the first k sequences are already aligned and build an MSA $\mbox{\rm A}' = \langle a_1, .., a_{k-1} \rangle$ with optimal partial score (see Figure 7). We refer to the unaligned input sequences with s_i, to the sequences in the MSA with a_i and to the sequences in an optimal pairwise alignment with t_i.

**Figure:** Sequences $\langle a_1, .., a_{k}\rangle$ are aligned and build an MSA $\mbox{\rm A}'$ with optimal partial score.
$\begin{figure} \begin{center} \mbox{\psfig{file=align1.EPS,height=0.18\textheight,angle=0} } \end{center} \end{figure}$

The algorithm has four steps:

1.: take sequence s_k and calculate an optimal pairwise alignment with the next sequence s_k+1. The sequences in this alignment with the gaps are called t_k and t_k+1 (see Figure 8).
2.: insert all gaps from t_k that are not already present in a_k into all previously aligned sequences $\langle a_1, .., a_{k}\rangle$ (see Figure 9).
3.: take all the gaps that were present in a_k before step 2), and insert them in both, t_k and t_k+1, except for the gaps that are already present in t_k (see Figure 9).
4.: at this point a_k = t_k and a_k+1 = t_k+1. We add a_k+1 to the alignment and increment k by one (see Figure 9).

**Figure 8:** Step 1:
$\begin{figure} \begin{center} \mbox{\psfig{file=align2.EPS,height=0.04\textheight,angle=0} } \end{center} \end{figure}$

We now discuss steps 2 to 4 and show that in each step the MSA has an optimal partial score:

Next: Step 2 Up: Near Optimal Multiple Sequence Previous: Tree Alignment (TA) and

Chantal Korostensky
1999-07-14