Next: The Reference Guide Up: Molecular Weight Traces Previous: Probability foundations

Example and probability estimates

Suppose we are given the molecular weights:

> ws := [ 511.3, 563.1, 717.2, 743.2, 836.4, 842.5, 1014.4,
>         1169.4, 1387.5, 1509.7, 1524.0 ]:

from the results of digesting an unknown protein with trypsin. To compute the 5 most significant matches we run the command

> res := SearchMassDb( ws, DigestTrypsin, 5 );
res := [[69.1127, 25358, 3, 5], [70.1624, 779, 28, 5], [70.3967, 8895, 5
, 6], [71.4526, 11384, 11, 4], [148.9464, 11424, 15, 9]]

Each element of res has 4 components: the similarity score, the entry number, the number of selected weights from the database (n) and the number of matched weights from the sample. We can print the above in a more readable format:

> for m in res do
>     printf('Sim %.1f,   %s,   %s\n\n', m[1],
>         GetEntryInfo(Entry(m[2]),'DE')[2],
>         GetEntryInfo(Entry(m[2]),'OS')[2] ) od;
Sim 69.1,   PROPHAGE P2 OGR PROTEIN.,   ESCHERICHIA COLI.

Sim 70.2,   ANGIOTENSIN-CONVERTING ENZYME PRECURSOR, TESTIS-SPECIFIC
 (EC 3.4.15.1) (ACE-T) (DIPEPTIDYL CARBOXYPEPTIDASE I) (KININASE II).,  
 HOMO SAPIENS (HUMAN).

Sim 70.4,   DNA-BINDING PROTEIN II (HB) (HU).,   BACILLUS SUBTILIS,
 AND BACILLUS GLOBIGII.

Sim 71.5,   COAGULATION FACTOR IX (EC 3.4.21.22) (CHRISTMAS FACTOR)
 (FRAGMENT).,   OVIS ARIES (SHEEP).

Sim 148.9,   FATTY ACID-BINDING PROTEIN, LIVER (L-FABP).,  
 GINGLYMOSTOMA CIRRATUM (NURSE SHARK).

How can we determine if the above similarities are significant or not? Doing this directly is far too complicated, it would require to consider the total number of sequences and the total number of choices of fragments to match. A Montecarlo method is much more appropriate in this case. We will run the search for various randomly generated sequences and tabulate their values. For this example we first generate as many random weights as we had in the sample, uniformly distributed in the same range

> rw := copy(ws):
> wmin := min(ws) * 0.98;  wmax := max(ws) * 1.02;
wmin := 501.0740
wmax := 1554.4800

set counters to collect statistics

> BestSim := Stat():

and run 5 experiments

> to 5 do
>     for i to length(rw) do rw[i] := wmin + Rand()*(wmax-wmin) od;
>     res2 := SearchMassDb(rw,DigestTrypsin,1);
>     UpdateStat(BestSim,res2[1,1]);
>     od:

The results of this simulation are

> print(BestSim);

 number of sample points=5
 mean = 77.3 +- 5.5
 variance = 39 +- 61
 skewness=1.45134,   excess=0.375441
 minimum=72.0419,     maximum=87.6283

Now we can state that for this database, similarity scores smaller than

> BestSim[Mean] + 1.96*sqrt(BestSim[Variance]);
89.5599

are not very surprising or interesting. Our most significant match in the previous example is

> (res[5,1] - BestSim[Mean]) / sqrt(BestSim[Variance]);
11.4762

standard deviations away from the average, and that is very significant. Hence we could conclude that the weights 511.3, 563.1, 717.2, etc. come from the digestion of the fatty acid-binding liver protein of sharks or similar sequences.

Next: The Reference Guide Up: Molecular Weight Traces Previous: Probability foundations

Gaston Gonnet
1998-09-15