Gap Probabilities and Penalties

Our algorithms for aligning amino acid sequences are based on the test of whether the two sequences evolved from a common ancestral sequence or not. This model allowed us to derive scores (used by the dynamic programming algorithm) from the mutation probabilities. In this section, we derive gap penalty costs from the probabilities of a gap occuring.

The gap model
we use here is the one which is implicit
in the vast majority of the present literature. This
assumes that gaps are accidents at the DNA level which
produce a random insertion or deletion of *k* amino acids
with probability *q*(*k*).
This probability is independent of the amino acids being
inserted or deleted and of any neighbouring amino acids.
It is a natural assumption that this probability also
depends on the PAM distance *t* between the two sequences,
in particular, gaps in similar sequences are less frequent than gaps
in distant ones.
Then

Gap penalties used with dynamic programming are typically of the form

The linear gap penalty function is very popular, in part because it is possible to compute dynamic programming alignments very efficiently using Gotoh's algorithm[18].

The relation between *d*(*k*,*t*) and *q*(*k*,*t*) is simply

This relation is not well known, so we will prove it here. Readers not interested in the mathematical theory of alignments should skip the rest of this section.

The purpose of aligning sequences with dynamic programming
is to find the alignment which maximizes the probability
of the two sequences having evolved
from a common ancestor
as opposed to being just random sequences.
Let *A* and *B* be two sequences with lengths *n*_{A} and
*n*_{B}.
Let *A*_{i} be the *i*^{th} amino acid in *A* (similarly
for *B*).
Let *f*_{i} be the frequency of amino acid *i* as defined in the
previous section.
The probability that the two sequences are random is

For example, given an alignment

where

for all the possible

where this term denotes the probability of a gap of length 3 on one sequence times the probability of the sequence

for the alignment of

once property grouped and simplified. To find the alignment which maximizes this probability we transform the maximization of a product (of all positive values) into the maximization of a sum by computing the logarithm of each term. To find the alignment which maximizes a sum, we use the dynamic programming algorithm. For historical reasons and to work with simpler numbers, these logarithms are multiplied by 10, which has no effect on the maximization process. Notice that the terms like

are exactly the values of the standard Dayhoff matrices. The deletion cost for the above example, under dynamic programming, must be

It is obvious how to generalize this result for any number or length of gaps.

The relationship can be applied in two directions:
either we can estimate the probabilities from a sample and
compute the appropriate gap penalties or we can compute
the probabilities that are implied by a gap penalty
function.
A linear gap penalty function implies an exponential
distribution of probabilities.

implies

For example, if

This further implies that the probability of a gap, of any length, is

in our example, the probability of having a gap would be . Finally, the expected length of a gap is

in our example, the expected length of a gap would be 2.0011.