GKSKGFGFV
.
The function AlignOneAll
can be used to
compare a given sequence
with an entire sequence database.
It takes for its arguments
> print(AlignOneAll); AlignOneAll: Usage: AlignOneAll( str:{string,posint}, F:database, D:DayMatrix, goal:real, entries:posint..posint ) Match a given name (or entry number) against the F database, report all matchings which reach the goal using the D similarity matrix. Optionally match only against a range of database entries.and returns a vector of
Match
structures.
For example, suppose we want to search in the
SwissProt database.
The database needs to be loaded > ReadDb('SwissProt23'): ReadDb: attempting to read file SwissProt23 Error, file does not exist or is too smalland the standard set of Dayhoff matrices calculated .
> CreateDayMatrices();And then the sequence to be searched and it's length must be set.
> SearchSeq := 'GKSKGFGFV': > Len := length(SearchSeq):We will describe two methods for computing a boundary that will separate significant from insignificant matches . This value will later be used as the argument
goal
in AlignOneAll
.
It is clear that if we set this goal too high we will miss
some relevant homologies, and if we set it
too low we will retrieve too many irrelevant ones.
The first method to estimate this goal is based on generating
random amino acid sequences, scoring the given fragment
to these random sequences, collecting the resulting scores
and estimating the average and expected value of this
random score.
The second method computes the exact expected value and
variance of a random match without deletions.
Finally we need to set a criteria for the selectivity. The more tolerant the criteria is, the more irrelevant matches will be selected, at the other extreme we will be missing some marginal homologies. To set these bounds we will use the following goal: if we were retrieving from a random database, we want to be selective enough as to retrieve only the top 10 matches. In other words, when searching we should expect about 10 spurious matches caused by random coincidences but not due to homology.
Similarity scores are basically sums of Dayhoff matrices
entries.
It is natural to assume, that by the law of large numbers,
the distribution of a random similarity score converges
to a normal distribution.
We will use this assumption to estimate probabilities.