Darwin and Problems from Biochemistry

At the top of our tower lies secondary structure prediction. This, in some sense, is the ultimate goal of our system and the end of the bioinformatician's job. Our secondary structure predictions are based on multiple sequence alignments. These alignments indicate conserved and non-conserved regions of a protein and highlight structural units such as alpha helices and beta sheets. This implies that the accuracy of our structure predictions depends on the accuracy of our multiple sequence alignments. In turn, the accuracy of our alignments depends on how accurately our phylogenetic trees represent the true ancestral relationships between the species from which the sequences are taken. Our phylogenetic trees are constructed from the pairwise distances and variances derived from the pairwise comparison of protein sequences. And, at the bottom of our tower, the protein sequences are extracted from the raw DNA or RNA supplied to us by the biochemist.

Of course, any mistake at any level of this tower percolates upwards. But, conversely, any improvement to an algorithm does too.
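The dependency chain described above can be pictured as an ordered list of stages, where a change at one level reaches every level above it. The following is only an illustrative sketch in Python, not Darwin code; the stage labels and the function name `affected_by` are made up for this example and do not correspond to anything in the Darwin libraries.

```python
# Layers of the "tower", listed from bottom to top, as described in the text.
# The stage names are labels chosen for this sketch, not Darwin identifiers.
TOWER = [
    "raw DNA/RNA",
    "protein sequences",
    "pairwise distances and variances",
    "phylogenetic trees",
    "multiple sequence alignments",
    "secondary structure prediction",
]


def affected_by(stage):
    """Return the given stage and every stage above it in the tower.

    A mistake (or an improvement) introduced at `stage` percolates
    upwards, so these are the levels whose results may change.
    """
    index = TOWER.index(stage)
    return TOWER[index:]


if __name__ == "__main__":
    # Which results are touched if the tree-building step changes?
    for name in affected_by("phylogenetic trees"):
        print(name)
```

Under this picture, a change to the tree-building step touches the trees, the alignments, and the structure predictions, while a change at the bottom level touches everything.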

We do not claim that the solutions presented herein are the only or the best way to solve any particular bioinformatics problem. The algorithms we have chosen to include in the Darwin libraries have strong arguments, both mathematical and biological, suggesting they will perform well in practice. However, there are other methods (possibly with unrealistic resource demands) that may be more pertinent to your particular situation and data. The strength of Darwin lies in the fact that any method (assuming it is algorithmic) can be programmed in the language.

Each of the following chapters contains:

- a statement of the problem at hand,
- a discussion concerning any biological assumptions we make about the data,
- an explanation of how we model the problem mathematically,
- a description of the algorithm,
- a Darwin implementation,
- a discussion about the accuracy and efficiency of our algorithm, and
- a short guide to the literature.

Beyond an understanding of the Darwin libraries, we hope such a presentation gives users:

- an understanding of some of the classic problems from bioinformatics,
- an understanding of the underlying biochemistry involved in these problems,
- an understanding of the mathematical model upon which these algorithms are predicated,
- an understanding of how the algorithms work, and
- a conceptual overview of how to structure programs in Darwin.

- Point Accepted Mutations and Dayhoff Matrices
- Insertions and Deletions
- The Pairwise Comparison of Amino Acid Sequences
- Searching for Genes
- All versus All
- Phylogenetic Trees
- Multiple Sequence Alignments
- Probabilistic Ancestral Sequences
- Predicting Secondary Structure
- Random Sequence Generation
- Searching with Fragment Sequences
- Molecular Weight Traces