> ws := [ 511.3, 563.1, 717.2, 743.2, 836.4, 842.5, 1014.4, > 1169.4, 1387.5, 1509.7, 1524.0 ]:from the results of digesting an unknown protein with trypsin. To compute the 5 most significant matches we run the command
> res := SearchMassDb( ws, DigestTrypsin, 5 ); res := [[69.1127, 25358, 3, 5], [70.1624, 779, 28, 5], [70.3967, 8895, 5 , 6], [71.4526, 11384, 11, 4], [148.9464, 11424, 15, 9]]Each element of
res
has 4 components:
the similarity score, the entry number, the number of
selected weights from the database (n) and the
number of matched weights from the sample.
We can print the above in a more readable format:> for m in res do > printf('Sim %.1f, %s, %s\n\n', m[1], > GetEntryInfo(Entry(m[2]),'DE')[2], > GetEntryInfo(Entry(m[2]),'OS')[2] ) od; Sim 69.1, PROPHAGE P2 OGR PROTEIN., ESCHERICHIA COLI. Sim 70.2, ANGIOTENSIN-CONVERTING ENZYME PRECURSOR, TESTIS-SPECIFIC (EC 3.4.15.1) (ACE-T) (DIPEPTIDYL CARBOXYPEPTIDASE I) (KININASE II)., HOMO SAPIENS (HUMAN). Sim 70.4, DNA-BINDING PROTEIN II (HB) (HU)., BACILLUS SUBTILIS, AND BACILLUS GLOBIGII. Sim 71.5, COAGULATION FACTOR IX (EC 3.4.21.22) (CHRISTMAS FACTOR) (FRAGMENT)., OVIS ARIES (SHEEP). Sim 148.9, FATTY ACID-BINDING PROTEIN, LIVER (L-FABP)., GINGLYMOSTOMA CIRRATUM (NURSE SHARK).
How can we determine if the above similarities are significant or not? Doing this directly is far too complicated, it would require to consider the total number of sequences and the total number of choices of fragments to match. A Montecarlo method is much more appropriate in this case. We will run the search for various randomly generated sequences and tabulate their values. For this example we first generate as many random weights as we had in the sample, uniformly distributed in the same range
> rw := copy(ws): > wmin := min(ws) * 0.98; wmax := max(ws) * 1.02; wmin := 501.0740 wmax := 1554.4800set counters to collect statistics
> BestSim := Stat():and run 5 experiments
> to 5 do > for i to length(rw) do rw[i] := wmin + Rand()*(wmax-wmin) od; > res2 := SearchMassDb(rw,DigestTrypsin,1); > UpdateStat(BestSim,res2[1,1]); > od:The results of this simulation are
> print(BestSim); number of sample points=5 mean = 77.3 +- 5.5 variance = 39 +- 61 skewness=1.45134, excess=0.375441 minimum=72.0419, maximum=87.6283
Now we can state that for this database, similarity scores smaller than
> BestSim[Mean] + 1.96*sqrt(BestSim[Variance]); 89.5599are not very surprising or interesting. Our most significant match in the previous example is
> (res[5,1] - BestSim[Mean]) / sqrt(BestSim[Variance]); 11.4762standard deviations away from the average, and that is very significant. Hence we could conclude that the weights 511.3, 563.1, 717.2, etc. come from the digestion of the fatty acid-binding liver protein of sharks or similar sequences.