    Next: Computing expected scores Up: Searching with Fragment Sequences Previous: Searching with Fragment Sequences

# Generating random scores

The first approach consists of matching random sequences against the fragment and studying the distribution of the similarity score. A large number (say 100) random amino acid strings are generated, the best possible match between the given fragment and the test sequence is computed, and then the mean and standard deviation of the similarities are calculated.

Simple univariate statistics  can be collected using internal Darwin functions. To collect these statistics we must initialize the corresponding counters with the structure `Stat`.

```> ByRefShake := Stat('similarity score with LocalAlign'):
> ByDynProgr := Stat('similarity score with DynProg'):
```
The argument to `Stat` can be any name, a printable definition is convenient.

The following statements produce the random test sequences, match them to the given fragment, and update the statistical information using the `updatew`  function for both a refined match against a typical random sequence and exactly against a random sequence of the same length.

```> SeqPos := GetOffset(SearchSeq):
Error, DB must be assigned a database

> for i to 100 do
```
Set `SeqPos` to be like a database pointer to the given fragment and run a loop 100 times. The next statements compute a random sequence. The length of this random sequence affects the quality of the result. The longer the sequence, the better the chances of finding a good match. To solve this problem we will select a random entry in the database and use its length as the length of the random sequence. In this way, our random sequences will have the same length distribution as the database.
```>     RandEntry := round( Rand()*DB[TotEntries] + 0.5 );
>     RandLen := length( Strings(Sequence(Entry(RandEntry))));
>     RandSeq := CreateRandSeq(RandLen,AF);
```
Now we can compute the refined match and collect its similarity
```>     m := LocalAlign( Match(SeqPos,GetOffset(RandSeq)), DM);
>     update(ByRefShake,m[Sim]);
```
and secondly, with a sequence of the same length, we compute an exact match
```>     n := DynProg(SearchSeq,CreateRandSeq(Len,AF),DM,Len,Len);
>     update(ByDynProgr,n);
>     od:
Error, cannot apply selector on object
```

We can display or use the information in two different ways. First we use the `stats`  function which prints all the information. Internal constants define the intervals to be at the 95% confidence level.  For the refined-shaked matches, the similarity score is characterized by

```> print(ByRefShake);
Error, (in Stats_print) division by zero
```

The second method uses selectors on the `Stat` structure. `Mean`,  `Variance`,  `VarVariance`  (the variance of the reported variance), `Skewness`,  `Excess`,  `Number`,  `MeanVar`  (the mean and its 95% confidence interval as a name), `Minimum`,  and `Maximum`.  The function `ProbStdDev` takes a probability as an argument and returns the number of standard deviations  away from the mean that will give this probability. The function `ProbStdDev` can be defined in terms of the inverse complementary error function  `erfcinv`  as:

```> ProbStdDev :=  p -> sqrt(2)*erfcinv(2*p):
```
The values for 0.025 and 0.001 are well known:
```> ProbStdDev(0.025), ProbStdDev(0.001);
1.9600, 3.0902
```
To expect 10 random matches, we should set the probability to be 10/n, where n is the number of entries in the database. The goal should then be
```> Z := ProbStdDev( 10 / DB[TotEntries] );
Error, cannot apply selector on object
```
standard deviations away from the mean random score.

From this data we can compute an approximation to the goal:

```> Goal := ByRefShake[Mean] + Z*sqrt(ByRefShake[Variance]);
Error, (in Stats_select) division by zero
```

The last piece of information needed is the maximum possible similarity score. This can be computed by adding all the corresponding diagonal entries of the Dayhoff matrix for each amino acid in the fragment:

```> MaxGoal := 0:
> for i to Len do
>     MaxGoal := MaxGoal + DM[Sim,AToInt(SearchSeq[i]),AToInt(SearchSeq[i])]
>     od:
> MaxGoal;
45.9342
```
Since the goal is relatively far from the maximum possible score, we are assured that the interval is not too narrow.    Next: Computing expected scores Up: Searching with Fragment Sequences Previous: Searching with Fragment Sequences
Gaston Gonnet
1998-09-15