Next: Generating random scores Up: Darwin and Problems from Biochemistry Previous: Random Sequence Generation

Searching with Fragment Sequences

One of the services that computational biochemistry provides is the searching of databases based on small sequence fragments. This chapter is named after one sequence fragment we were once requested to search. We will discuss how to find all the approximate matches to a given (short) sequence and how to estimate the score boundaries where these matches become significant. As a running example we will try to retrieve all significant matches to the fragment GKSKGFGFV.

The function AlignOneAll can be used to compare a given sequence with an entire sequence database. It takes for its arguments

> print(AlignOneAll);
AlignOneAll: Usage: AlignOneAll( str:{string,posint}, F:database,
 D:DayMatrix, goal:real, entries:posint..posint )
Match a given name (or entry number) against the F database, report
  all matchings which reach the goal using the D similarity matrix.
  Optionally match only against a range of database entries.

and returns a vector of Match structures. For example, suppose we want to search in the SwissProt database. The database needs to be loaded

> ReadDb('SwissProt23'):
ReadDb: attempting to read file SwissProt23
Error, file does not exist or is too small

and the standard set of Dayhoff matrices calculated .

> CreateDayMatrices();

And then the sequence to be searched and it's length must be set.

> SearchSeq := 'GKSKGFGFV':
> Len := length(SearchSeq):

We will describe two methods for computing a boundary that will separate significant from insignificant matches . This value will later be used as the argument goal in AlignOneAll. It is clear that if we set this goal too high we will miss some relevant homologies, and if we set it too low we will retrieve too many irrelevant ones. The first method to estimate this goal is based on generating random amino acid sequences, scoring the given fragment to these random sequences, collecting the resulting scores and estimating the average and expected value of this random score. The second method computes the exact expected value and variance of a random match without deletions.

Finally we need to set a criteria for the selectivity. The more tolerant the criteria is, the more irrelevant matches will be selected, at the other extreme we will be missing some marginal homologies. To set these bounds we will use the following goal: if we were retrieving from a random database, we want to be selective enough as to retrieve only the top 10 matches. In other words, when searching we should expect about 10 spurious matches caused by random coincidences but not due to homology.

Similarity scores are basically sums of Dayhoff matrices entries. It is natural to assume, that by the law of large numbers, the distribution of a random similarity score converges to a normal distribution. We will use this assumption to estimate probabilities.

Next: Generating random scores Up: Darwin and Problems from Biochemistry Previous: Random Sequence Generation

Gaston Gonnet
1998-09-15