Molecular Weight Traces

In some cases, a protein can be recognized by fragmenting it according to a certain pattern and using the molecular weights of the fragments as a trace. This method cannot determine the composition of an unknown protein, but it is effective for identifying an unknown sample whose sequence is already recorded in a protein database.

One way of breaking a protein into smaller pieces according to a certain pattern is to use enzymes which digest the protein. For example, trypsin breaks a protein after every Arginine (R) or after every Lysine (K) not followed by a Proline (P). Given the rules, it is not difficult to write a function that performs the theoretical digestion of a sequence. The function for trypsin is

> DigestTrypsin := proc( s:string )
> description 'break a peptide sequence as if digested by trypsin';
> ls := length(s);
> res := NULL;
> i := 1;
> # each fragment spans positions i..j
> for j to ls-1 do
>     # cut after R, or after K when the next residue is not P
>     if s[j] = 'R' or (s[j] = 'K' and s[j+1] <> 'P') then
>         res := res, s[i..j];
>         i := j+1
>     fi
> od;
> # collect the last fragment
> [res, s[i..ls]]
> end:

If we subjected the protein

> p := 'YKVTLVDQRREGDIAEDQGLDLKPYSCRAGACSTCAGKIVSGDLDDDQIEKG':

to the action of trypsin, we would obtain 7 fragments:

> dp := DigestTrypsin(p);
dp := [YK, VTLVDQR, R, EGDIAEDQGLDLKPYSCR, AGACSTCAGK, IVSGDLDDDQIEK, G]

The molecular weights of the fragments can be measured experimentally by mass spectrometry to a good level of accuracy. More importantly, these methods typically require very small samples, on the order of fractions of a picomole.

In Darwin we can compute the theoretical molecular mass of a protein sequence by using the function GetMolWeight.

> print(GetMolWeight);
GetMolWeight: Usage: GetMolWeight( s:{string,array(string)} )
Compute the molecular weight of an amino acid sequence.

> GetMolWeight(dp);
[309.3440, 829.8990, 174.1880, 2009.1640, 867.9860, 1446.5000, 75.0520]
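To make the origin of these numbers concrete, the following is a minimal sketch in Python (not Darwin code) of how such a weight can be computed: sum the average masses of the residues and add one water molecule for the free ends of the peptide. The residue mass table, the water mass, and the name peptide_mass are assumptions introduced for illustration; they are standard textbook averages, not the table GetMolWeight uses internally, so the results should be close to, but need not exactly equal, the values printed above.

```python
# Average residue masses in daltons. These are standard textbook values and an
# assumption of this sketch; they are not taken from Darwin's internal table.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}

# One water per free peptide: H on the N-terminus plus OH on the C-terminus.
WATER_MASS = 18.0153


def peptide_mass(seq: str) -> float:
    """Return the sum of the residue masses of seq plus one water molecule."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER_MASS


# The seven tryptic fragments of the example protein p.
fragments = ["YK", "VTLVDQR", "R", "EGDIAEDQGLDLKPYSCR",
             "AGACSTCAGK", "IVSGDLDDDQIEK", "G"]

for frag in fragments:
    print(f"{frag:20s} {peptide_mass(frag):10.3f}")
```

Small discrepancies against the Darwin output are expected, since the two mass tables are not guaranteed to be identical.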

The problem of identifying a sampled protein can thus be reduced to digesting the protein with an enzyme, measuring the molecular weight of each piece, and comparing this set of weights with the set that would be obtained from digesting each protein in the database. The process can be repeated with several different enzymes to increase its selectivity.
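The comparison step can be illustrated with a naive sketch in Python (not Darwin code, and not the SearchMassDb algorithm). It assumes the theoretical fragment weights of every database entry are already available, counts how many measured weights fall within a tolerance of some theoretical weight, and ranks the entries by that count. The names count_matches and rank_database, the tolerance of 0.5, the measured values, and the two "hypothetical" entries are all invented for illustration; only the weights of example_p come from the GetMolWeight output above.

```python
def count_matches(measured, theoretical, tol=0.5):
    """Count measured weights lying within tol of a not-yet-used theoretical weight."""
    remaining = sorted(theoretical)
    hits = 0
    for w in sorted(measured):
        for k, t in enumerate(remaining):
            if abs(w - t) <= tol:
                hits += 1
                del remaining[k]  # each theoretical fragment explains one measurement
                break
    return hits


def rank_database(measured, database, tol=0.5):
    """Return (name, hits) pairs sorted by number of hits, best first."""
    scores = [(name, count_matches(measured, weights, tol))
              for name, weights in database.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy database: entry name -> theoretical fragment weights after digestion.
database = {
    "example_p": [309.3440, 829.8990, 174.1880, 2009.1640,
                  867.9860, 1446.5000, 75.0520],
    "hypothetical_a": [500.1, 829.9, 1200.4, 75.05],  # made-up entry
    "hypothetical_b": [310.9, 640.2, 2100.7],         # made-up entry
}

# Made-up "experimental" weights: five fragments of example_p with small errors.
measured = [829.7, 1446.8, 309.1, 2009.5, 868.2]

print(rank_database(measured, database))
```

A linear scan like this grows with the size of the database and says nothing about whether a given number of hits could have arisen by chance.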

The purpose of this chapter is to describe an algorithm that performs this matching against the database efficiently. We are also interested in estimating when a match of weights is significant. The algorithm is available in Darwin under the name SearchMassDb. Readers interested only in its use should skip to the example section. As with other algorithms, the next sections describe the algorithm and its theory.

Gaston Gonnet
1998-09-15