One way to break a protein into smaller pieces according to a fixed pattern is to digest it with an enzyme. For example, trypsin cleaves a protein after every arginine (R), and after every lysine (K) that is not followed by a proline (P). Given these rules, it is not difficult to write a function that performs the theoretical digestion of a sequence. The function for trypsin is:
> DigestTrypsin := proc( s:string )
>   description 'break a peptide sequence as if digested by trypsin';
>   ls := length(s);
>   res := NULL;
>   i := 1;

The fragments will be defined between i and j:

>   for j to ls-1 do
>     if s[j] = 'R' or s[j] = 'K' and s[j+1] <> 'P' then
>       res := res, s[i..j];
>       i := j+1
>     fi
>   od;

Collect the last fragment:

>   [res, s[i..ls]]
> end:

If we subject the protein

> p := 'YKVTLVDQRREGDIAEDQGLDLKPYSCRAGACSTCAGKIVSGDLDDDQIEKG':

to the action of trypsin, we obtain 7 fragments:

> dp := DigestTrypsin(p);
dp := [YK, VTLVDQR, R, EGDIAEDQGLDLKPYSCR, AGACSTCAGK, IVSGDLDDDQIEK, G]

The molecular weight of the fragments can be measured experimentally by mass spectrometry with good accuracy. More importantly, these methods typically require very small samples, on the order of fractions of a picomole.
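For readers who want to experiment with the cleavage rule outside Darwin, the trypsin digestion above can be sketched in Python. This is only an illustrative translation of the rule as stated in this section (cut after R, and after K unless a P follows). The name `digest_trypsin` is hypothetical and not part of Darwin.

```python
def digest_trypsin(seq):
    """Split a peptide sequence as if digested by trypsin.

    Rule taken from the text: cut after every R, and after every K
    that is not immediately followed by P.
    """
    fragments = []
    start = 0  # index of the first residue of the current fragment
    # The last residue is never a cut point: nothing follows it.
    for j in range(len(seq) - 1):
        cut_after_r = seq[j] == "R"
        cut_after_k = seq[j] == "K" and seq[j + 1] != "P"
        if cut_after_r or cut_after_k:
            fragments.append(seq[start:j + 1])
            start = j + 1
    # Collect the last fragment.
    fragments.append(seq[start:])
    return fragments


p = "YKVTLVDQRREGDIAEDQGLDLKPYSCRAGACSTCAGKIVSGDLDDDQIEKG"
fragments = digest_trypsin(p)
```

Applied to the sequence `p` used above, this sketch yields the same seven fragments as the Darwin session.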
In Darwin we can compute the theoretical molecular mass of a protein sequence by using the function GetMolWeight.

> print(GetMolWeight);
GetMolWeight: Usage: GetMolWeight( s:{string,array(string)} )
Compute the molecular weight of an amino acid sequence.
> GetMolWeight(dp);
[309.3440, 829.8990, 174.1880, 2009.1640, 867.9860, 1446.5000, 75.0520]
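To make the computation concrete, here is a minimal Python sketch of how a theoretical peptide mass is usually obtained: sum the average masses of the residues and add one water molecule for the free termini. The table below holds commonly tabulated approximate values, not Darwin's internal table, and `mol_weight` is a hypothetical name. The results differ from the GetMolWeight output above by a few hundredths of a dalton, presumably because Darwin uses slightly different atomic masses.

```python
# Approximate average residue masses in daltons (amino acid minus one water).
# Assumption: commonly tabulated values, NOT the table used by Darwin.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167,
    "V": 99.1326, "T": 101.1051, "C": 103.1388, "L": 113.1594,
    "I": 113.1594, "N": 114.1038, "D": 115.0886, "Q": 128.1307,
    "K": 128.1741, "E": 129.1155, "M": 131.1926, "H": 137.1411,
    "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.0153  # one H2O is added back for the free N- and C-termini


def mol_weight(seq):
    """Approximate average molecular weight of a peptide sequence."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


def mol_weights(fragments):
    """Apply mol_weight to each fragment of a digest."""
    return [mol_weight(f) for f in fragments]
```

Like GetMolWeight, the sketch accepts either a single sequence (`mol_weight`) or a list of fragments (`mol_weights`), here as two separate functions.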
The problem of identifying a sampled protein can thus be reduced to digesting the protein with an enzyme, measuring the molecular weight of each piece, and comparing this set of weights with the set that would be obtained by digesting each protein in the database. The process can be repeated with several different enzymes to increase its selectivity.
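The comparison described above can be illustrated with a deliberately naive Python sketch: for each database entry, count how many observed weights fall within a tolerance of some theoretical fragment weight, then rank the entries by that count. This is not the algorithm behind SearchMassDb. The names `count_matches` and `rank_database`, the tolerance value, and the toy database layout (a dictionary from protein name to fragment weights) are all assumptions made for the example.

```python
from bisect import bisect_left


def count_matches(observed, theoretical, tol=1.0):
    """Count observed weights lying within tol of some theoretical weight."""
    theo = sorted(theoretical)
    hits = 0
    for w in observed:
        # First theoretical weight that is >= w - tol, if any.
        k = bisect_left(theo, w - tol)
        if k < len(theo) and theo[k] <= w + tol:
            hits += 1
    return hits


def rank_database(observed, database, tol=1.0):
    """Return (hits, name) pairs, best-matching entries first."""
    scores = [(count_matches(observed, weights, tol), name)
              for name, weights in database.items()]
    return sorted(scores, reverse=True)


# Toy data: "target" holds the fragment weights computed earlier in this
# section; "decoy" is an invented entry used only for contrast.
database = {
    "target": [309.3440, 829.8990, 174.1880, 2009.1640,
               867.9860, 1446.5000, 75.0520],
    "decoy": [500.0, 829.5, 1200.0],
}
observed = [309.3, 829.9, 174.2]
ranking = rank_database(observed, database)
```

A raw hit count says nothing about whether a match could have arisen by chance, which is why estimating the significance of a match is treated as a separate question.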
The purpose of this chapter is to describe an algorithm that performs this matching against the database efficiently. We are also interested in estimating when a match of weights is significant. The algorithm is available in Darwin under the name SearchMassDb. Readers interested only in its use should skip to the example section. As we have done with other algorithms, the next sections describe the algorithm and its theory.