
Scoring Pairwise Sequence Alignments

To determine whether two sequences $s_1, s_2 \in \Sigma$ are related, that is, whether they have a common ancestor, the sequences are usually aligned. The problem is then to find the alignment that maximizes the probability that the two sequences are related. To calculate these probabilities, one applies a Markovian model of sequence evolution [Krogh et al., 1994,Baldi et al., 1994]. This begins with an alignment of the two sequences, e.g.

    VNRLQQNIVSL____________EVDHKVANYKPQVEPFGHGPIFMATALVPGLYLLPL
    VNRLQQSIVSLRDAFNDGTKLLEELDHRVLNYKPQANPFGNGPIFMVTAIVPGLHLLPI

The gaps arise from insertions (or, equivalently, deletions in the other sequence) during divergent evolution. The alignment is normally computed by a dynamic programming (DP) algorithm using Dayhoff matrices [Gotoh, 1982,Smith and Waterman, 1981,Needleman and Wunsch, 1970,Altschul and Erickson, 1986], which finds the alignment that maximizes the probability that the two sequences evolved from an ancestral sequence as opposed to being random sequences. Gaps are penalized with an affine gap cost $a + l \cdot b$, where $a$ is a fixed gap cost, $l$ is the length of the gap and $b$ is the incremental cost [Altschul and Erickson, 1986,Benner et al., 1993]. More precisely, we are comparing two possibilities:
a)
that the two sequences arose independently of each other (implying that the alignment is entirely arbitrary, with amino acid $i$ in one protein aligned to amino acid $j$ in the other no more frequently than expected by chance, i.e. with a probability equal to the product of the individual frequencies with which amino acids $i$ and $j$ occur in the database)

\begin{displaymath}Pr \{ \mbox{$i$ and $j$ are independent} \} = f_i f_j
\end{displaymath} (1)

b)
that the two sequences have evolved from some common ancestral sequence after $t$ units of evolution, where $t$ is measured in PAM units [Gonnet, 1994b].
[Figure: align.ps]

\begin{displaymath}Pr \{ \mbox{$i$ and $j$ descended from some $x$} \} = \sum_x f_x Pr\{x \rightarrow i\} Pr\{x \rightarrow j\}
\end{displaymath} (2)
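
As a purely illustrative check of Equations (1) and (2), assume a hypothetical two-letter alphabet $\{A, B\}$ with $f_A = 0.6$, $f_B = 0.4$, $Pr\{A \rightarrow B\} = 0.1$ and $Pr\{B \rightarrow A\} = 0.15$. These numbers are invented for the example and are not taken from any real mutation matrix. For the pair $i = A$, $j = B$ the two possibilities give

\begin{displaymath}f_A f_B = 0.6 \cdot 0.4 = 0.24 \end{displaymath}

\begin{displaymath}\sum_x f_x Pr\{x \rightarrow A\} Pr\{x \rightarrow B\} = 0.6 \cdot 0.9 \cdot 0.1 + 0.4 \cdot 0.15 \cdot 0.85 = 0.105 \end{displaymath}

In this toy model, aligning $A$ with $B$ is less probable under common ancestry ($0.105$) than under independence ($0.24$), whereas for the identical pair $i = j = A$ the corresponding values are $0.495$ and $0.36$.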

Definition 1.7   A 1-PAM unit is the amount of evolution which will change, on average, $1\%$ of the amino acids. In mathematical terms, this is expressed as a matrix M such that

\begin{displaymath}\sum_{i=1}^{20} f_i (1 - M_{ii} ) = 0.01 \end{displaymath}

where $f_i$ is the frequency of the $i$th amino acid and $M_{ii}$ is the probability that amino acid $i$ remains unchanged.
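
The normalization in Definition 1.7 can be checked numerically. The following is a minimal sketch, assuming a hypothetical two-letter alphabet instead of the 20 amino acids and a mutation matrix stored as nested dictionaries with `mut[x][y]` read as $Pr\{x \rightarrow y\}$. The function name `pam_distance` and all numbers are illustrative assumptions, not part of the original text.

```python
def pam_distance(freq, mut):
    """Return 100 * sum_i f_i * (1 - M_ii), the expected percentage of changed residues."""
    return 100.0 * sum(freq[i] * (1.0 - mut[i][i]) for i in freq)


# Hypothetical two-letter model (assumption), constructed so that the
# weighted fraction of changed residues is 0.005 + 0.005 = 0.01.
freq = {"A": 0.6, "B": 0.4}
mut = {
    "A": {"A": 1 - 0.005 / 0.6, "B": 0.005 / 0.6},  # each row sums to 1
    "B": {"A": 0.0125, "B": 0.9875},
}

one_pam = pam_distance(freq, mut)
```

A matrix that satisfies the definition yields a value of 1 (up to rounding); any other value indicates that the matrix corresponds to a different amount of evolution.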

Definition 1.8   The score of an optimal pairwise alignment $OPA(s_1, s_2)$ of two sequences $s_1, s_2$ is the score of an alignment with maximum score under a probabilistic scoring method [Dayhoff et al., 1978,Gonnet et al., 1992]. We denote a pairwise alignment of two sequences $s_1, s_2$ by $\langle s_1, s_2 \rangle$.


\begin{displaymath}D_{ij} = 10 \log_{10} \left ( \frac
{ Pr \{ \mbox{$i$ and $j$ descended from some $x$} \} }
{ Pr \{ \mbox{$i$ and $j$ are independent} \} } \right )
\end{displaymath} (3)

The entries of the Dayhoff matrix are thus ten times the base-10 logarithm of the quotient of these two probabilities (Equation 3). Note that the scores are log-odds values rather than probabilities themselves: a positive score means that common ancestry explains the aligned pair better than chance, and a negative score means the opposite. The larger the score, the more likely it is that the two sequences are homologous, i.e. that they have a common ancestor.
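
To make the scoring concrete, the sketch below computes Dayhoff entries from Equations (1) to (3) and uses them to score a given gapped alignment. It is an illustration under stated assumptions, not the scoring code used by the author: the two-letter alphabet, the frequencies, the mutation probabilities and the gap parameters `fixed` ($a$) and `incremental` ($b$) are all hypothetical, the gap cost $a + l \cdot b$ is subtracted from the score, and the total is taken as the sum of per-column scores.

```python
import math

# Hypothetical two-letter model; all numbers are illustrative assumptions.
FREQ = {"A": 0.6, "B": 0.4}
# MUT[x][y] = Pr{x -> y}; each row sums to 1.
MUT = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.15, "B": 0.85}}
GAP = "_"  # gap character, as in the example alignment


def dayhoff_entry(i, j, freq=FREQ, mut=MUT):
    """D_ij = 10 * log10(Pr{common ancestor} / Pr{independent})."""
    related = sum(freq[x] * mut[x][i] * mut[x][j] for x in freq)  # Equation (2)
    independent = freq[i] * freq[j]                               # Equation (1)
    return 10.0 * math.log10(related / independent)               # Equation (3)


def alignment_score(row1, row2, fixed=8.0, incremental=1.0):
    """Sum D_ij over aligned pairs; subtract fixed + l * incremental per gap of length l."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    score = 0.0
    gap_row = 0  # 0: not inside a gap, 1 or 2: row that holds the current gap
    gap_len = 0
    for c1, c2 in zip(row1, row2):
        current = 1 if c1 == GAP else 2 if c2 == GAP else 0
        if current != gap_row and gap_len:
            # The previous gap run has ended: charge its affine cost.
            score -= fixed + gap_len * incremental
            gap_len = 0
        gap_row = current
        if current:
            gap_len += 1
        else:
            score += dayhoff_entry(c1, c2)
    if gap_len:
        # Charge a gap run that extends to the end of the alignment.
        score -= fixed + gap_len * incremental
    return score
```

With these toy numbers, `dayhoff_entry("A", "A")` is positive and `dayhoff_entry("A", "B")` is negative, so `alignment_score("A__B", "AABB")` equals the two match scores minus a single gap cost of `fixed + 2 * incremental`. Finding the optimal alignment additionally requires the DP algorithm; this sketch only scores an alignment that is already given.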
Chantal Korostensky
1999-07-14