
The Original Dayhoff Matrices

In the late 1960s and the 1970s, Dayhoff et al. published a series of papers containing the first similarity matrices. As more sequence data (a larger statistical sample) became available, they repeated the construction, obtaining increasingly accurate estimates. In the 1978 paper, Dayhoff, Schwartz and Orcutt[9] examined 1572 accepted mutations in 34 superfamilies of closely related sequences.

We can recompute the original Dayhoff matrix using the function CreateOrigDayMatrix.

Calling Sequences:
CreateOrigDayMatrix(mutations, counts, PAM)
CreateOrigDayMatrix(mutations, counts, 1..UpperPam)
Parameters:
mutations : array(real, real)
counts : array(real)
PAM, UpperPam : real

Returns: DayMatrix

Synopsis: This function returns the Dayhoff matrix (structured type DayMatrix) computed from an observed mutation matrix mutations, a frequency vector counts, and a PAM distance PAM (or a range of PAM distances beginning at 1).

The Darwin built-in matrix Mutations1978 contains the observed mutation counts reported by Dayhoff et al. The vector OrigFreq contains entries proportional to the reported frequencies.

> print(Mutations1978);
> OrigTot := [87, 41, 40, 47, 33, 38, 50, 89, 34, 37, 
>              85, 81, 15, 40, 51, 70, 58, 10, 30, 65];
> OrigFreq := OrigTot/sum(OrigTot);
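
The normalization in the last line can be checked outside Darwin. The following is a minimal Python sketch (not Darwin code) that repeats the arithmetic on the same counts; the variable names are illustrative only.

```python
# Counts copied from the OrigTot vector in the Darwin session above.
orig_tot = [87, 41, 40, 47, 33, 38, 50, 89, 34, 37,
            85, 81, 15, 40, 51, 70, 58, 10, 30, 65]

# Dividing every count by the total turns the counts into relative
# frequencies that sum to 1, which is what OrigTot/sum(OrigTot) does.
total = sum(orig_tot)
orig_freq = [count / total for count in orig_tot]

print("total count:", total)
print("sum of frequencies:", round(sum(orig_freq), 12))
# First entry of the vector, printed as a percentage with two decimals.
print(f"first entry: {100 * orig_freq[0]:.2f}%")
```

Because only the ratios matter, any vector proportional to the reported frequencies gives the same OrigFreq.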

It is not clear whether the vector OrigFreq or the amino acid frequencies of the entire database should be used to compute the Dayhoff matrix. We therefore compare OrigFreq with the frequencies in Swiss-Prot version 33. The function GetAaCount(DB) returns a list containing the number of occurrences of each amino acid in the database. Several frequencies differ markedly, for example lysine (8.09% vs. 5.95%), glycine (8.89% vs. 6.86%), isoleucine (3.70% vs. 5.73%) and cysteine (3.30% vs. 1.70%).

> DB := ReadDb('~cbrg/DB/SwissProt'):
> SP33Totals := GetAaCount(DB):
> SP33Freq := SP33Totals/sum(SP33Totals):
> printf('                   Orig   SP33\n');
> for i to 20 do
>   printf('%15.15s%7.2f%%%7.2f%%\n', IntToAmino(i), 
>          100*OrigFreq[i], 100*SP33Freq[i]);
> od;

                   Orig   SP33
        Alanine   8.69%   7.55%
       Arginine   4.10%   5.16%
     Asparagine   4.00%   4.55%
  Aspartic acid   4.70%   5.30%
       Cysteine   3.30%   1.70%
      Glutamine   3.80%   4.03%
  Glutamic acid   5.00%   6.32%
        Glycine   8.89%   6.86%
      Histidine   3.40%   2.23%
     Isoleucine   3.70%   5.73%
        Leucine   8.49%   9.32%
         Lysine   8.09%   5.95%
     Methionine   1.50%   2.36%
  Phenylalanine   4.00%   4.07%
        Proline   5.09%   4.92%
         Serine   6.99%   7.19%
      Threonine   5.79%   5.77%
     Tryptophan   1.00%   1.26%
       Tyrosine   3.00%   3.21%
         Valine   6.49%   6.52%
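
To see at a glance which amino acids differ most between the two frequency vectors, the printed percentages can be ranked by absolute difference. This is a small Python sketch (not Darwin code) that uses only the numbers from the listing above; it measures the difference in percentage points, which is one possible choice among several.

```python
# (amino acid, OrigFreq in %, SP33Freq in %) copied from the listing above.
rows = [
    ("Alanine", 8.69, 7.55), ("Arginine", 4.10, 5.16),
    ("Asparagine", 4.00, 4.55), ("Aspartic acid", 4.70, 5.30),
    ("Cysteine", 3.30, 1.70), ("Glutamine", 3.80, 4.03),
    ("Glutamic acid", 5.00, 6.32), ("Glycine", 8.89, 6.86),
    ("Histidine", 3.40, 2.23), ("Isoleucine", 3.70, 5.73),
    ("Leucine", 8.49, 9.32), ("Lysine", 8.09, 5.95),
    ("Methionine", 1.50, 2.36), ("Phenylalanine", 4.00, 4.07),
    ("Proline", 5.09, 4.92), ("Serine", 6.99, 7.19),
    ("Threonine", 5.79, 5.77), ("Tryptophan", 1.00, 1.26),
    ("Tyrosine", 3.00, 3.21), ("Valine", 6.49, 6.52),
]

# Sort by the absolute difference in percentage points, largest first.
ranked = sorted(rows, key=lambda r: abs(r[1] - r[2]), reverse=True)

for name, orig, sp33 in ranked[:5]:
    print(f"{name:>15s}{orig:7.2f}%{sp33:7.2f}%{sp33 - orig:+7.2f}")
```

A relative measure (the ratio of the two frequencies) would rank rare amino acids such as cysteine higher, since its frequency nearly halves.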

We compute the Dayhoff matrices with both OrigFreq and SP33Freq at a PAM distance of 250 (a long distance).

> OrigDM := CreateOrigDayMatrix(Mutations1978, OrigFreq, 250);
> SPDM := CreateOrigDayMatrix(Mutations1978, SP33Freq, 250);

The following table lists the selectors of the DayMatrix structured type.



Table: The selectors for the DayMatrix structured type.

Selector      Description
FixedDel      Adjusted deletion penalty.
DelFixedLog   Logarithmic deletion penalty.
IncDel        Deletion penalty increment.
MaxOffDiag    Maximum value not on the diagonal.
MaxSim        Maximum value in the Dayhoff matrix.
MinSim        Minimum value in the Dayhoff matrix.
PamNumber     PAM distance for which this matrix was computed.
Sim, i, j     The similarity score for amino acid i and amino acid j.
type          The type of the Dayhoff matrix, i.e. Peptide or DNA.

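Comparing the two matrices shows some significant differences. For example, the H-H score drops from 6.6 to 4.8, the M-M score rises from 6.6 to 8.5, and the C-G score changes from -3.3 to -1.0. Differences of this size can change the results of our alignment algorithms.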

> print(OrigDM);
DayMatrix(Peptide, pam=250, Sim: max=17.302, min=-7.510, max offdiag=6.951,
 del=-19.814-1.396*(k-1))
C  12.0
S  -0.0  1.6
T  -2.2  1.3  2.6
P  -2.7  0.9  0.3  5.9
A  -2.0  1.1  1.2  1.1  1.8
G  -3.3  1.1 -0.0 -0.5  1.3  4.8
N  -3.6  0.7  0.4 -0.5  0.2  0.4  2.0
D  -5.1  0.3 -0.1 -1.0  0.3  0.6  2.1  3.9
E  -5.3 -0.0 -0.4 -0.6  0.3  0.2  1.4  3.4  3.9
Q  -5.3 -0.5 -0.8  0.2 -0.4 -1.2  0.8  1.6  2.5  4.1
H  -3.4 -0.8 -1.3 -0.3 -1.4 -2.1  1.6  0.7  0.6  2.9  6.6
R  -3.6 -0.3 -0.9 -0.2 -1.6 -2.6 -0.0 -1.3 -1.1  1.2  1.5  6.1
K  -5.4 -0.2 -0.0 -1.2 -1.2 -1.7  1.0  0.1 -0.1  0.7 -0.1  3.4  4.7
M  -5.2 -1.6 -0.6 -2.1 -1.2 -2.8 -1.8 -2.6 -2.2 -1.0 -2.2 -0.5  0.4  6.6
I  -2.3 -1.4  0.1 -2.0 -0.5 -2.6 -1.8 -2.4 -2.0 -2.0 -2.5 -2.0 -1.9  2.2  4.6
L  -6.0 -2.8 -1.7 -2.6 -1.9 -4.0 -2.9 -4.0 -3.3 -1.8 -2.1 -3.0 -2.9  3.7  2.4  6.0
V  -1.9 -1.0  0.3 -1.2  0.2 -1.4 -1.8 -2.2 -1.8 -1.9 -2.3 -2.5 -2.5  1.8  3.7  1.8  4.3
F  -4.3 -3.2 -3.1 -4.6 -3.5 -4.8 -3.5 -5.6 -5.4 -4.7 -1.8 -4.5 -5.3  0.2  1.0  1.8 -1.2  9.1
Y   0.4 -2.8 -2.8 -5.0 -3.5 -5.3 -2.1 -4.3 -4.3 -4.0 -0.1 -4.2 -4.5 -2.5 -1.0 -0.9 -2.5  7.0 10.2
W  -7.5 -2.3 -5.0 -5.5 -5.6 -6.8 -3.9 -6.6 -6.8 -4.6 -2.5  2.3 -3.3 -4.1 -5.0 -1.7 -6.1  0.5  0.0 17.3

> print(SPDM);
DayMatrix(Peptide, pam=250, Sim: max=16.847, min=-8.222, max offdiag=6.785,
 del=-19.814-1.396*(k-1))
C  12.2
S   1.4  1.8
T  -0.4  1.5  2.6
P  -0.8  1.1  0.5  5.8
A  -0.2  1.2  1.3  1.3  1.7
G  -1.0  1.4  0.5  0.1  1.6  4.4
N  -1.8  0.9  0.5 -0.4  0.3  0.7  2.4
D  -3.3  0.2 -0.2 -1.0  0.3  0.8  2.2  4.1
E  -3.6 -0.2 -0.6 -0.7  0.2  0.3  1.3  3.5  4.3
Q  -3.4 -0.5 -0.8  0.3 -0.4 -0.8  0.7  1.5  2.5  4.2
H  -1.3 -0.3 -0.8  0.1 -0.7 -1.1  1.6  0.8  0.8  2.8  4.8
R  -2.1 -0.4 -1.0 -0.3 -1.6 -2.3 -0.2 -1.8 -1.7  1.2  1.6  6.7
K  -2.9  0.1  0.2 -0.7 -0.7 -0.8  1.1  0.2 -0.0  0.9  0.6  3.4  4.0
M  -4.2 -2.2 -1.2 -2.8 -1.7 -3.1 -2.5 -3.7 -3.2 -1.6 -2.4 -1.2  0.2  8.5
I  -1.2 -2.0 -0.3 -2.7 -1.0 -2.8 -2.5 -3.3 -3.0 -2.9 -2.7 -2.9 -2.3  1.3  5.5
L  -4.5 -3.1 -2.1 -2.8 -2.1 -3.9 -3.3 -4.7 -4.1 -2.1 -2.0 -3.7 -2.9  3.2  1.7  5.8
V  -0.4 -1.0  0.1 -1.2  0.1 -1.0 -1.9 -2.5 -2.3 -2.1 -2.0 -3.0 -2.3  1.1  3.5  1.4  4.1
F  -2.6 -3.3 -3.4 -4.6 -3.6 -4.5 -3.7 -6.1 -6.1 -4.9 -1.6 -5.0 -5.1 -0.7  0.4  1.4 -1.7  9.0
Y   1.8 -2.8 -2.8 -4.9 -3.4 -4.7 -2.1 -4.7 -4.8 -4.2  0.1 -4.6 -4.2 -3.7 -1.8 -1.4 -2.9  6.8 10.2
W  -6.5 -3.1 -5.9 -6.3 -6.3 -7.1 -4.7 -7.7 -8.2 -5.6 -2.9  1.3 -3.9 -6.0 -6.9 -2.8 -7.3 -0.4 -0.9 16.8

> OrigDM[MaxSim];      SPDM[MaxSim];
     17.3021             16.8467
> OrigDM[MinSim];      SPDM[MinSim];
    -7.5098             -8.2217
> OrigDM[MaxOffDiag];  SPDM[MaxOffDiag];
     6.9511              6.7851
> OrigDM[FixedDel];    SPDM[FixedDel];
    -19.8137            -19.8137

Matchings performed via dynamic programming penalize a deletion of length k by FixedDel + (k-1) * IncDel. This gap penalty implies that a gap of length k occurs with probability $0.00076 \cdot (0.7251)^{k}$. The chapter Insertions and Deletions describes this scoring function in more depth.
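
The affine penalty can be tabulated with a few lines of code. The sketch below is Python, not Darwin. It takes FixedDel from the session output above and reads IncDel off the `del=-19.814-1.396*(k-1)` line printed with the matrix. It also assumes that scores are 10 times the base-10 logarithm of a probability ratio; under that assumption each additional gap position multiplies the probability by 10^(IncDel/10), which reproduces the factor 0.7251 quoted above. The constant 0.00076 is not derived here.

```python
FIXED_DEL = -19.8137  # OrigDM[FixedDel] as printed in the session above
INC_DEL = -1.396      # increment read off the "del=..." line of print(OrigDM)


def gap_penalty(k: int) -> float:
    """Score of a deletion of length k: FixedDel + (k-1) * IncDel."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return FIXED_DEL + (k - 1) * INC_DEL


def extension_factor() -> float:
    """Probability factor contributed by each additional gap position.

    ASSUMPTION: scores are 10 * log10 of a probability ratio, so a score
    increment s corresponds to a multiplicative factor 10 ** (s / 10).
    """
    return 10 ** (INC_DEL / 10)


for k in (1, 2, 5, 10):
    print(f"k={k:2d}  penalty={gap_penalty(k):8.3f}")

print(f"factor per extra position: {extension_factor():.4f}")
```

Since FixedDel is identical for OrigDM and SPDM, the same table applies to both matrices.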


Gaston Gonnet
1998-09-15