In the late 1960s and 1970s, Dayhoff et al. published a series of papers containing the first similarity matrices. As more sequence data (a larger statistical sample) became available, they repeated their construction, producing increasingly accurate estimates. In the 1978 paper, Dayhoff, Schwartz and Orcutt[9] examined 1572 accepted mutations in 34 superfamilies of closely related sequences.
We can recompute the original Dayhoff matrix using the function CreateOrigDayMatrix.
mutations      : array(real, real)
counts         : array(real)
PAM, UpperPam  : real
Returns: DayMatrix
Synopsis: This function returns the Dayhoff matrix (structured type DayMatrix) computed from a given observed mutation matrix mutations, a frequency vector counts and a PAM distance PAM (or range of PAM distances beginning at 1).
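The idea behind such a construction can be sketched in a few lines of Python. This is a minimal illustration of the textbook Dayhoff procedure on a made-up three-letter alphabet, not Darwin's actual implementation: the helper names (pam1_matrix, mat_pow, dayhoff_scores), the toy counts and frequencies, and the simplification that the substitution rate out of a letter is proportional to its counts divided by its frequency are all assumptions made for the example. Scores are taken to be 10 log10 of the ratio between the k-PAM transition probability and the background frequency.

```python
import math

def pam1_matrix(counts, freqs):
    """Column-stochastic 1-PAM matrix: m[i][j] = P(letter j is replaced by i)."""
    n = len(freqs)
    off_diag_total = sum(counts[i][j] for i in range(n) for j in range(n) if i != j)
    lam = 0.01 / off_diag_total  # scale so that 1% of positions change per step
    m = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(n):
            if i != j:
                m[i][j] = lam * counts[i][j] / freqs[j]
        m[j][j] = 1.0 - sum(m[i][j] for i in range(n) if i != j)
    return m

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, power):
    """Matrix power by repeated squaring."""
    n = len(m)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    while power > 0:
        if power & 1:
            result = mat_mul(result, m)
        m = mat_mul(m, m)
        power >>= 1
    return result

def dayhoff_scores(counts, freqs, pam):
    """Log-odds scores 10*log10(M^pam[i][j] / freqs[i])."""
    n = len(freqs)
    mk = mat_pow(pam1_matrix(counts, freqs), pam)
    return [[10.0 * math.log10(mk[i][j] / freqs[i]) for j in range(n)]
            for i in range(n)]

# Hypothetical symmetric mutation counts and frequencies for three letters.
toy_counts = [[0, 30, 10], [30, 0, 20], [10, 20, 0]]
toy_freqs = [0.5, 0.3, 0.2]

for row in dayhoff_scores(toy_counts, toy_freqs, 250):
    print(" ".join(f"{score:6.2f}" for score in row))
```

Because the toy counts are symmetric, the resulting score matrix is symmetric as well, which is why only a lower triangle needs to be displayed for matrices of this kind.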
The Darwin built-in matrix Mutations1978 contains the observed mutation
counts reported by Dayhoff et al. The vector OrigFreq contains
entries proportional to the reported frequencies.
> print(Mutations1978);
> OrigTot := [87, 41, 40, 47, 33, 38, 50, 89, 34, 37,
>             85, 81, 15, 40, 51, 70, 58, 10, 30, 65];
> OrigFreq := OrigTot/sum(OrigTot);
It is not clear whether the vector OrigFreq or the amino acid frequencies for the entire database should be used in the computation of the Dayhoff matrix. We therefore compare OrigFreq with the frequencies for Swiss-Prot version 33. The function GetAaCount(DB) returns a list containing the number of appearances of each amino acid in the database. Many frequencies differ significantly; cysteine, for example, drops from 3.30% to 1.70%, while isoleucine rises from 3.70% to 5.73%.
> DB := ReadDb('~cbrg/DB/SwissProt'):
> SP33Totals := GetAaCount(DB):
> SP33Freq := SP33Totals/sum(SP33Totals):
> printf(' Orig SP33\n');
> for i to 20 do
>   printf('%15.15s%7.2f%%%7.2f%%\n', IntToAmino(i),
>          100*OrigFreq[i], 100*SP33Freq[i]);
> od;
        Alanine   8.69%   7.55%
       Arginine   4.10%   5.16%
     Asparagine   4.00%   4.55%
  Aspartic acid   4.70%   5.30%
       Cysteine   3.30%   1.70%
      Glutamine   3.80%   4.03%
  Glutamic acid   5.00%   6.32%
        Glycine   8.89%   6.86%
      Histidine   3.40%   2.23%
     Isoleucine   3.70%   5.73%
        Leucine   8.49%   9.32%
         Lysine   8.09%   5.95%
     Methionine   1.50%   2.36%
  Phenylalanine   4.00%   4.07%
        Proline   5.09%   4.92%
         Serine   6.99%   7.19%
      Threonine   5.79%   5.77%
     Tryptophan   1.00%   1.26%
       Tyrosine   3.00%   3.21%
         Valine   6.49%   6.52%

We compute the Dayhoff matrices both with OrigFreq and SP33Freq at a PAM distance of 250 (a long distance).
> OrigDM := CreateOrigDayMatrix(Mutations1978, OrigFreq, 250);
> SPDM := CreateOrigDayMatrix(Mutations1978, SP33Freq, 250);
Comparing the two matrices shows some significant differences; for example, the C-S score is -0.0 in OrigDM but 1.4 in SPDM. Such differences will somewhat change the results of our alignment algorithms.
> print(OrigDM);
DayMatrix(Peptide, pam=250, Sim: max=17.302, min=-7.510, max offdiag=6.951, del=-19.814-1.396*(k-1))
C  12.0
S  -0.0   1.6
T  -2.2   1.3   2.6
P  -2.7   0.9   0.3   5.9
A  -2.0   1.1   1.2   1.1   1.8
G  -3.3   1.1  -0.0  -0.5   1.3   4.8
N  -3.6   0.7   0.4  -0.5   0.2   0.4   2.0
D  -5.1   0.3  -0.1  -1.0   0.3   0.6   2.1   3.9
E  -5.3  -0.0  -0.4  -0.6   0.3   0.2   1.4   3.4   3.9
Q  -5.3  -0.5  -0.8   0.2  -0.4  -1.2   0.8   1.6   2.5   4.1
H  -3.4  -0.8  -1.3  -0.3  -1.4  -2.1   1.6   0.7   0.6   2.9   6.6
R  -3.6  -0.3  -0.9  -0.2  -1.6  -2.6  -0.0  -1.3  -1.1   1.2   1.5   6.1
K  -5.4  -0.2  -0.0  -1.2  -1.2  -1.7   1.0   0.1  -0.1   0.7  -0.1   3.4   4.7
M  -5.2  -1.6  -0.6  -2.1  -1.2  -2.8  -1.8  -2.6  -2.2  -1.0  -2.2  -0.5   0.4   6.6
I  -2.3  -1.4   0.1  -2.0  -0.5  -2.6  -1.8  -2.4  -2.0  -2.0  -2.5  -2.0  -1.9   2.2   4.6
L  -6.0  -2.8  -1.7  -2.6  -1.9  -4.0  -2.9  -4.0  -3.3  -1.8  -2.1  -3.0  -2.9   3.7   2.4   6.0
V  -1.9  -1.0   0.3  -1.2   0.2  -1.4  -1.8  -2.2  -1.8  -1.9  -2.3  -2.5  -2.5   1.8   3.7   1.8   4.3
F  -4.3  -3.2  -3.1  -4.6  -3.5  -4.8  -3.5  -5.6  -5.4  -4.7  -1.8  -4.5  -5.3   0.2   1.0   1.8  -1.2   9.1
Y   0.4  -2.8  -2.8  -5.0  -3.5  -5.3  -2.1  -4.3  -4.3  -4.0  -0.1  -4.2  -4.5  -2.5  -1.0  -0.9  -2.5   7.0  10.2
W  -7.5  -2.3  -5.0  -5.5  -5.6  -6.8  -3.9  -6.6  -6.8  -4.6  -2.5   2.3  -3.3  -4.1  -5.0  -1.7  -6.1   0.5   0.0  17.3
> print(SPDM);
DayMatrix(Peptide, pam=250, Sim: max=16.847, min=-8.222, max offdiag=6.785, del=-19.814-1.396*(k-1))
C  12.2
S   1.4   1.8
T  -0.4   1.5   2.6
P  -0.8   1.1   0.5   5.8
A  -0.2   1.2   1.3   1.3   1.7
G  -1.0   1.4   0.5   0.1   1.6   4.4
N  -1.8   0.9   0.5  -0.4   0.3   0.7   2.4
D  -3.3   0.2  -0.2  -1.0   0.3   0.8   2.2   4.1
E  -3.6  -0.2  -0.6  -0.7   0.2   0.3   1.3   3.5   4.3
Q  -3.4  -0.5  -0.8   0.3  -0.4  -0.8   0.7   1.5   2.5   4.2
H  -1.3  -0.3  -0.8   0.1  -0.7  -1.1   1.6   0.8   0.8   2.8   4.8
R  -2.1  -0.4  -1.0  -0.3  -1.6  -2.3  -0.2  -1.8  -1.7   1.2   1.6   6.7
K  -2.9   0.1   0.2  -0.7  -0.7  -0.8   1.1   0.2  -0.0   0.9   0.6   3.4   4.0
M  -4.2  -2.2  -1.2  -2.8  -1.7  -3.1  -2.5  -3.7  -3.2  -1.6  -2.4  -1.2   0.2   8.5
I  -1.2  -2.0  -0.3  -2.7  -1.0  -2.8  -2.5  -3.3  -3.0  -2.9  -2.7  -2.9  -2.3   1.3   5.5
L  -4.5  -3.1  -2.1  -2.8  -2.1  -3.9  -3.3  -4.7  -4.1  -2.1  -2.0  -3.7  -2.9   3.2   1.7   5.8
V  -0.4  -1.0   0.1  -1.2   0.1  -1.0  -1.9  -2.5  -2.3  -2.1  -2.0  -3.0  -2.3   1.1   3.5   1.4   4.1
F  -2.6  -3.3  -3.4  -4.6  -3.6  -4.5  -3.7  -6.1  -6.1  -4.9  -1.6  -5.0  -5.1  -0.7   0.4   1.4  -1.7   9.0
Y   1.8  -2.8  -2.8  -4.9  -3.4  -4.7  -2.1  -4.7  -4.8  -4.2   0.1  -4.6  -4.2  -3.7  -1.8  -1.4  -2.9   6.8  10.2
W  -6.5  -3.1  -5.9  -6.3  -6.3  -7.1  -4.7  -7.7  -8.2  -5.6  -2.9   1.3  -3.9  -6.0  -6.9  -2.8  -7.3  -0.4  -0.9  16.8
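One quick way to see where the two matrices disagree most is to compare their diagonals, the scores for aligning an amino acid with itself. The Python sketch below does this with the diagonal values copied by hand from the two printed matrices; the helper name largest_shifts is hypothetical, and the sketch only ranks the printed, rounded values rather than querying Darwin.

```python
# Row order used by the printed matrices.
ORDER = "CSTPAGNDEQHRKMILVFYW"

# Diagonal entries transcribed from print(OrigDM) and print(SPDM).
orig_diag = [12.0, 1.6, 2.6, 5.9, 1.8, 4.8, 2.0, 3.9, 3.9, 4.1,
             6.6, 6.1, 4.7, 6.6, 4.6, 6.0, 4.3, 9.1, 10.2, 17.3]
sp_diag = [12.2, 1.8, 2.6, 5.8, 1.7, 4.4, 2.4, 4.1, 4.3, 4.2,
           4.8, 6.7, 4.0, 8.5, 5.5, 5.8, 4.1, 9.0, 10.2, 16.8]

def largest_shifts(labels, before, after, top=5):
    """Return the `top` (label, after - before) pairs with the largest magnitude."""
    diffs = [(label, b - a) for label, a, b in zip(labels, before, after)]
    return sorted(diffs, key=lambda pair: abs(pair[1]), reverse=True)[:top]

for label, delta in largest_shifts(ORDER, orig_diag, sp_diag):
    print(f"{label}: {delta:+.1f}")
```

On the printed values, the largest shifts are for methionine (6.6 to 8.5) and histidine (6.6 to 4.8), followed by isoleucine, lysine and arginine.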
> OrigDM[MaxSim]; SPDM[MaxSim];
17.3021
16.8467
> OrigDM[MinSim]; SPDM[MinSim];
-7.5098
-8.2217
> OrigDM[MaxOffDiag]; SPDM[MaxOffDiag];
6.9511
6.7851
> OrigDM[FixedDel]; SPDM[FixedDel];
-19.8137
-19.8137
Matchings performed via dynamic programming will apply penalties for
deletions of length k according to FixedDel + (k-1) * IncDel. Since scores are expressed as 10 log10 of a probability, this gap penalty implies that a gap of length k occurs
with probability 10^((FixedDel + (k-1) * IncDel)/10).
The chapter Insertions and Deletions describes this scoring function in more depth.
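The affine penalty is easy to tabulate. The Python sketch below uses the FixedDel value returned in the session above and takes the incremental term from the printed summary del=-19.814-1.396*(k-1). It assumes that scores are on a 10 log10 scale, so that a score s corresponds to a probability of 10^(s/10); the function names gap_score and gap_probability are hypothetical.

```python
FIXED_DEL = -19.8137  # OrigDM[FixedDel] from the session above
INC_DEL = -1.396      # incremental term read off the printed del=... summary

def gap_score(k, fixed_del=FIXED_DEL, inc_del=INC_DEL):
    """Affine penalty FixedDel + (k-1)*IncDel for a gap of length k >= 1."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return fixed_del + (k - 1) * inc_del

def gap_probability(k, fixed_del=FIXED_DEL, inc_del=INC_DEL):
    """Probability implied by the penalty, assuming a 10*log10 score scale."""
    return 10.0 ** (gap_score(k, fixed_del, inc_del) / 10.0)

for k in (1, 2, 5, 10):
    print(f"k={k:2d}  score={gap_score(k):8.3f}  probability={gap_probability(k):.5f}")
```

Under this assumption, opening a gap of length 1 at PAM 250 corresponds to a probability of roughly 1%, and each additional position multiplies it by a constant factor of 10^(IncDel/10).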