
The Original Dayhoff Matrices

In the late 1960s and 1970s, Dayhoff et al. published a series of papers containing the first similarity matrices. As more sequence data (a larger statistical sample) became available, they repeated the construction, obtaining increasingly accurate estimates. In the 1978 paper, Dayhoff, Schwartz and Orcutt[9] examined 1572 accepted mutations in 34 superfamilies of closely related sequences.

We can recompute the original Dayhoff matrix using the function CreateOrigDayMatrix.

Calling Sequences:
CreateOrigDayMatrix(mutations, counts, PAM)
CreateOrigDayMatrix(mutations, counts, 1..UpperPam)
Parameters:

mutations	:	`array(real, real)`
counts	:	`array(real)`
PAM, UpperPam	:	`real`

Returns: DayMatrix

Synopsis: This function returns the Dayhoff matrix (structured type DayMatrix) computed from an observed mutation matrix mutations, a frequency vector counts, and a PAM distance PAM (or a range of PAM distances beginning at 1).

The Darwin built-in matrix Mutations1978 contains the observed mutation counts reported by Dayhoff et al. The vector OrigFreq contains entries proportional to the reported frequencies.

> print(Mutations1978);
> OrigTot := [87, 41, 40, 47, 33, 38, 50, 89, 34, 37, 
>              85, 81, 15, 40, 51, 70, 58, 10, 30, 65];
> OrigFreq := OrigTot/sum(OrigTot);
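The normalization in the last Darwin statement can be checked outside Darwin. The following is a minimal Python sketch (not part of the Darwin session); the helper name `normalize` is hypothetical, and the counts are copied from OrigTot above.

```python
# Counts copied from the OrigTot vector in the Darwin session above.
orig_tot = [87, 41, 40, 47, 33, 38, 50, 89, 34, 37,
            85, 81, 15, 40, 51, 70, 58, 10, 30, 65]


def normalize(counts):
    """Scale a list of counts so that its entries sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


orig_freq = normalize(orig_tot)

# The first entry (Alanine) as a percentage, rounded to two decimals.
print(round(100 * orig_freq[0], 2))
```

The counts sum to 1001, so each frequency is simply the count divided by 1001.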

It is not clear whether the vector OrigFreq or the amino acid frequencies for the entire database should be used in the computation of the Dayhoff matrix. To see how much the choice matters, we compare OrigFreq with the frequencies for Swiss-Prot version 33. The function GetAaCount(DB) returns a list containing the number of occurrences of each amino acid in the database. Several frequencies differ substantially: cysteine, for example, falls from 3.30% to 1.70%, and isoleucine rises from 3.70% to 5.73%.

> DB := ReadDb('~cbrg/DB/SwissProt'):
> SP33Totals := GetAaCount(DB):
> SP33Freq := SP33Totals/sum(SP33Totals):
> printf('                   Orig   SP33\n');
> for i to 20 do
>   printf('%15.15s%7.2f%%%7.2f%%\n', IntToAmino(i), 
>          100*OrigFreq[i], 100*SP33Freq[i]);
> od;

        Alanine   8.69%   7.55%
       Arginine   4.10%   5.16%
     Asparagine   4.00%   4.55%
  Aspartic acid   4.70%   5.30%
       Cysteine   3.30%   1.70%
      Glutamine   3.80%   4.03%
  Glutamic acid   5.00%   6.32%
        Glycine   8.89%   6.86%
      Histidine   3.40%   2.23%
     Isoleucine   3.70%   5.73%
        Leucine   8.49%   9.32%
         Lysine   8.09%   5.95%
     Methionine   1.50%   2.36%
  Phenylalanine   4.00%   4.07%
        Proline   5.09%   4.92%
         Serine   6.99%   7.19%
      Threonine   5.79%   5.77%
     Tryptophan   1.00%   1.26%
       Tyrosine   3.00%   3.21%
         Valine   6.49%   6.52%
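To see which amino acids differ most between the two columns, one can rank them by relative change. This is a minimal Python sketch (not Darwin); the percentages are copied from the table above, and the function name `rank_by_fold_change` and the choice of |log2(ratio)| as the measure are assumptions made for illustration.

```python
import math

# Names and percentages copied from the table above (Alanine ... Valine).
names = ["Alanine", "Arginine", "Asparagine", "Aspartic acid", "Cysteine",
         "Glutamine", "Glutamic acid", "Glycine", "Histidine", "Isoleucine",
         "Leucine", "Lysine", "Methionine", "Phenylalanine", "Proline",
         "Serine", "Threonine", "Tryptophan", "Tyrosine", "Valine"]
orig = [8.69, 4.10, 4.00, 4.70, 3.30, 3.80, 5.00, 8.89, 3.40, 3.70,
        8.49, 8.09, 1.50, 4.00, 5.09, 6.99, 5.79, 1.00, 3.00, 6.49]
sp33 = [7.55, 5.16, 4.55, 5.30, 1.70, 4.03, 6.32, 6.86, 2.23, 5.73,
        9.32, 5.95, 2.36, 4.07, 4.92, 7.19, 5.77, 1.26, 3.21, 6.52]


def rank_by_fold_change(labels, a, b):
    """Return labels sorted by |log2(a/b)|, largest relative change first."""
    scored = [(abs(math.log2(x / y)), label)
              for label, x, y in zip(labels, a, b)]
    return [label for _, label in sorted(scored, reverse=True)]


ranked = rank_by_fold_change(names, orig, sp33)
print(ranked[:4])
```

Under this measure cysteine ranks first, with a roughly twofold difference between the two data sets.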

We compute the Dayhoff matrices both with OrigFreq and SP33Freq at a PAM distance of 250 (a long distance).

> OrigDM := CreateOrigDayMatrix(Mutations1978, OrigFreq, 250);
> SPDM := CreateOrigDayMatrix(Mutations1978, SP33Freq, 250);

Table 1.1 contains a list of selectors for the DayMatrix structured type.

**Table:** The selectors for the `DayMatrix` structured type.
Selector	Description
`FixedDel`	Adjusted deletion penalty.
`DelFixedLog`	Logarithmic deletion penalty.
`IncDel`	Deletion penalty increment.
`MaxOffDiag`	Max. value not on the diagonal.
`MaxSim`	Max. value in the Dayhoff matrix.
`MinSim`	Min. value in the Dayhoff matrix.
`PamNumber`	PAM distance for which this matrix was computed.
`Sim, i, j`	The similarity score for amino acid i and amino acid j.
`type`	The type of the Dayhoff matrix, i.e. `Peptide`, `DNA`.

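Comparing the two matrices shows some significant differences. For example, the histidine diagonal score falls from 6.6 to 4.8, the methionine diagonal score rises from 6.6 to 8.5, and the cysteine/serine score changes from -0.0 to 1.4. Such differences will change the results of our alignment algorithms to some degree.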

> print(OrigDM);
DayMatrix(Peptide, pam=250, Sim: max=17.302, min=-7.510, max offdiag=6.951,
 del=-19.814-1.396*(k-1))
C  12.0
S  -0.0  1.6
T  -2.2  1.3  2.6
P  -2.7  0.9  0.3  5.9
A  -2.0  1.1  1.2  1.1  1.8
G  -3.3  1.1 -0.0 -0.5  1.3  4.8
N  -3.6  0.7  0.4 -0.5  0.2  0.4  2.0
D  -5.1  0.3 -0.1 -1.0  0.3  0.6  2.1  3.9
E  -5.3 -0.0 -0.4 -0.6  0.3  0.2  1.4  3.4  3.9
Q  -5.3 -0.5 -0.8  0.2 -0.4 -1.2  0.8  1.6  2.5  4.1
H  -3.4 -0.8 -1.3 -0.3 -1.4 -2.1  1.6  0.7  0.6  2.9  6.6
R  -3.6 -0.3 -0.9 -0.2 -1.6 -2.6 -0.0 -1.3 -1.1  1.2  1.5  6.1
K  -5.4 -0.2 -0.0 -1.2 -1.2 -1.7  1.0  0.1 -0.1  0.7 -0.1  3.4  4.7
M  -5.2 -1.6 -0.6 -2.1 -1.2 -2.8 -1.8 -2.6 -2.2 -1.0 -2.2 -0.5  0.4  6.6
I  -2.3 -1.4  0.1 -2.0 -0.5 -2.6 -1.8 -2.4 -2.0 -2.0 -2.5 -2.0 -1.9  2.2  4.6
L  -6.0 -2.8 -1.7 -2.6 -1.9 -4.0 -2.9 -4.0 -3.3 -1.8 -2.1 -3.0 -2.9  3.7  2.4  6.0
V  -1.9 -1.0  0.3 -1.2  0.2 -1.4 -1.8 -2.2 -1.8 -1.9 -2.3 -2.5 -2.5  1.8  3.7  1.8  4.3
F  -4.3 -3.2 -3.1 -4.6 -3.5 -4.8 -3.5 -5.6 -5.4 -4.7 -1.8 -4.5 -5.3  0.2  1.0  1.8 -1.2  9.1
Y   0.4 -2.8 -2.8 -5.0 -3.5 -5.3 -2.1 -4.3 -4.3 -4.0 -0.1 -4.2 -4.5 -2.5 -1.0 -0.9 -2.5  7.0 10.2
W  -7.5 -2.3 -5.0 -5.5 -5.6 -6.8 -3.9 -6.6 -6.8 -4.6 -2.5  2.3 -3.3 -4.1 -5.0 -1.7 -6.1  0.5  0.0 17.3

> print(SPDM);
DayMatrix(Peptide, pam=250, Sim: max=16.847, min=-8.222, max offdiag=6.785,
 del=-19.814-1.396*(k-1))
C  12.2
S   1.4  1.8
T  -0.4  1.5  2.6
P  -0.8  1.1  0.5  5.8
A  -0.2  1.2  1.3  1.3  1.7
G  -1.0  1.4  0.5  0.1  1.6  4.4
N  -1.8  0.9  0.5 -0.4  0.3  0.7  2.4
D  -3.3  0.2 -0.2 -1.0  0.3  0.8  2.2  4.1
E  -3.6 -0.2 -0.6 -0.7  0.2  0.3  1.3  3.5  4.3
Q  -3.4 -0.5 -0.8  0.3 -0.4 -0.8  0.7  1.5  2.5  4.2
H  -1.3 -0.3 -0.8  0.1 -0.7 -1.1  1.6  0.8  0.8  2.8  4.8
R  -2.1 -0.4 -1.0 -0.3 -1.6 -2.3 -0.2 -1.8 -1.7  1.2  1.6  6.7
K  -2.9  0.1  0.2 -0.7 -0.7 -0.8  1.1  0.2 -0.0  0.9  0.6  3.4  4.0
M  -4.2 -2.2 -1.2 -2.8 -1.7 -3.1 -2.5 -3.7 -3.2 -1.6 -2.4 -1.2  0.2  8.5
I  -1.2 -2.0 -0.3 -2.7 -1.0 -2.8 -2.5 -3.3 -3.0 -2.9 -2.7 -2.9 -2.3  1.3  5.5
L  -4.5 -3.1 -2.1 -2.8 -2.1 -3.9 -3.3 -4.7 -4.1 -2.1 -2.0 -3.7 -2.9  3.2  1.7  5.8
V  -0.4 -1.0  0.1 -1.2  0.1 -1.0 -1.9 -2.5 -2.3 -2.1 -2.0 -3.0 -2.3  1.1  3.5  1.4  4.1
F  -2.6 -3.3 -3.4 -4.6 -3.6 -4.5 -3.7 -6.1 -6.1 -4.9 -1.6 -5.0 -5.1 -0.7  0.4  1.4 -1.7  9.0
Y   1.8 -2.8 -2.8 -4.9 -3.4 -4.7 -2.1 -4.7 -4.8 -4.2  0.1 -4.6 -4.2 -3.7 -1.8 -1.4 -2.9  6.8 10.2
W  -6.5 -3.1 -5.9 -6.3 -6.3 -7.1 -4.7 -7.7 -8.2 -5.6 -2.9  1.3 -3.9 -6.0 -6.9 -2.8 -7.3 -0.4 -0.9 16.8

> OrigDM[MaxSim];      SPDM[MaxSim];
     17.3021             16.8467
> OrigDM[MinSim];      SPDM[MinSim];
    -7.5098             -8.2217
> OrigDM[MaxOffDiag];  SPDM[MaxOffDiag];
     6.9511              6.7851
> OrigDM[FixedDel];    SPDM[FixedDel];
    -19.8137            -19.8137

Matchings performed via dynamic programming apply a penalty of FixedDel + (k-1) * IncDel to a deletion of length k. This gap penalty implies that a gap of length k occurs with probability $0.00076 \cdot (0.7251)^{k}$. The chapter on insertions and deletions describes this scoring function in more depth.
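The affine deletion penalty can be tabulated directly. The sketch below is Python rather than Darwin; `gap_penalty` is a hypothetical helper, FIXED_DEL is the value printed for OrigDM[FixedDel], and INC_DEL is read off the printed header `del=-19.814-1.396*(k-1)`. The last line assumes that scores are expressed in units of 10·log10, which is an assumption not stated in this section; under it, the per-residue increment corresponds to the base 0.7251 in the probability above. The prefactor 0.00076 is not derived here.

```python
FIXED_DEL = -19.8137  # OrigDM[FixedDel] as printed above
INC_DEL = -1.396      # increment from the header "del=-19.814-1.396*(k-1)"


def gap_penalty(k, fixed_del=FIXED_DEL, inc_del=INC_DEL):
    """Score of a deletion of length k: FixedDel + (k-1) * IncDel."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return fixed_del + (k - 1) * inc_del


# Penalties for a few gap lengths.
for k in (1, 2, 5, 10):
    print(k, round(gap_penalty(k), 3))

# Assumption: scores are 10*log10 of a probability ratio, so each extra
# deleted residue multiplies the probability by this factor.
extension_factor = 10 ** (INC_DEL / 10)
print(round(extension_factor, 4))
```

Each additional residue lowers the score by the same amount, which is why the corresponding probability decays geometrically in k.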


Gaston Gonnet
1998-09-15