
The comparison algorithm

If our mass measurements were perfect and our sequence database contained every searched sequence, this would be an almost trivial problem: search a vector of weights against all the vectors of weights computed from the sequence database. This problem is known in computer science as multidimensional search. Unfortunately, neither assumption holds, for the following reasons:
(a)
The recorded molecular mass is subject to a relative error, in general less than 1%, which is still not precise enough to identify even very short sequences of amino acids.
(b)
The searched sequence may not appear verbatim in the database; perhaps only a close relative of it does. In this case the searched sequence and the target may differ by mutations, insertions and deletions, which will cause some molecular weights to differ.
(c)
Mutations can also change the digestion itself, so that the database sequence splits into more or fewer fragments than the searched one. This will cause a complete mismatch of the weights involving such fragments.
(d)
Impurities in the sample and in the digesters may produce spurious data in the searched sample.
(e)
The fragmentation (digestion), although in general accurate, is not 100% deterministic. Partial or incorrect digestion is also possible.
For all these reasons we have to choose a matching method which will tolerate errors both in the sample and in the database.
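
Reason (a) can be made concrete with a small sketch. The function name `within_tolerance`, the choice to measure the error relative to the computed weight, and the rounded average masses below are assumptions made for illustration; they are not taken from the text.

```python
def within_tolerance(measured, computed, rel_err=0.01):
    """Return True if the two weights agree within the relative error.

    Assumption: the error is taken relative to the computed weight.
    """
    return abs(measured - computed) <= rel_err * computed


# Approximate average masses in daltons: residue masses plus one water.
gly_lys = 57.05 + 128.17 + 18.02  # dipeptide Gly-Lys
gly_gln = 57.05 + 128.13 + 18.02  # dipeptide Gly-Gln

print(within_tolerance(gly_lys, gly_gln))
```

The two dipeptides differ by about 0.04 Da, far inside a 1% window of roughly 2 Da, so the call reports a match even though the sequences differ. A single weight therefore cannot settle the identity of a fragment, which is why the matching has to be judged over the whole set of weights.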

The algorithm we will use, for a single digestive enzyme, can be stated in relatively simple terms:

(i)
Obtain the set of molecular weights of the fragments of the digested protein (usually by experimental means).
(ii)
Digest (theoretically) every sequence in the database and find the molecular weights of the fragments.
(iii)
Compute the probability that a match of the given weights against the computed ones happens at random.
(iv)
Record the m lowest probabilities.
This algorithm returns the m most likely candidate sequences from the database. Analysis of these sequences and their probabilities will normally reveal whether we have found a match, a hint or just random noise.
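
Steps (i) to (iv) might be sketched as follows. Everything named here is hypothetical: the trypsin-like cleavage rule (cut after K or R unless followed by P), the rounded average residue masses, the mass range `lo` to `hi`, and above all the binomial model in `random_match_probability`, which is only a stand-in for the probability computation and not the one the text relies on.

```python
import heapq
from math import comb

# Approximate average residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.11, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}
WATER = 18.02  # one water molecule per free peptide


def digest(seq):
    """Step (ii): cut after K or R unless the next residue is P (assumed rule)."""
    fragments, start = [], 0
    for i, aa in enumerate(seq):
        nxt = seq[i + 1] if i + 1 < len(seq) else ""
        if aa in "KR" and nxt != "P":
            fragments.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        fragments.append(seq[start:])
    return fragments


def weight(fragment):
    """Molecular weight of one fragment."""
    return sum(RESIDUE_MASS[aa] for aa in fragment) + WATER


def random_match_probability(measured, computed, rel_err=0.01,
                             lo=200.0, hi=5000.0):
    """Step (iii), stand-in model: chance of at least k hits out of n."""
    n = len(measured)
    # k = number of measured weights that fall near some computed weight.
    k = sum(any(abs(w - c) <= rel_err * c for c in computed) for w in measured)
    # Assumed chance that one random weight in [lo, hi] lands in a window.
    p = min(1.0, sum(2 * rel_err * c for c in computed) / (hi - lo))
    # Binomial tail: probability of k or more hits purely by chance.
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def best_candidates(measured, database, m=10):
    """Step (iv): keep the m sequences with the lowest probabilities."""
    scored = (
        (random_match_probability(measured, [weight(f) for f in digest(seq)]),
         name)
        for name, seq in database.items()
    )
    return heapq.nsmallest(m, scored)


# Made-up two-entry database; step (i) is simulated by perturbing the
# fragment weights of protA by 0.2%.
database = {
    "protA": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "protB": "GSSHHHHHHSSGLVPRGSHMASMTGGQQMGR",
}
measured = [weight(f) * 1.002 for f in digest(database["protA"])]
ranked = best_candidates(measured, database, m=2)
print(ranked)
```

In this toy run `protA` should come first with a very small probability, while `protB`, whose two large fragments match none of the measured weights, should score close to 1. Using `heapq.nsmallest` avoids sorting the whole database when m is small.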
Gaston Gonnet
1998-09-15