G. Gonnet - M. Hallett
29 August 1997
The Evolution of Darwin
This section is reserved for a short description, written by Gonnet and Benner, of the early beginnings of Darwin: the motivation, who worked on it, etc.
The following can be completed at the end.
Darwin was designed to be a workbench where biochemists could forge tools to explore genetic data quickly and easily. The workbench is a partially interpreted, general-purpose language containing many built-in routines and data structures (tools from which tools can be built) tailored to the needs of the bioinformatics community.
Darwin provides a flexible structure to hold a complete or partial genetic database. It is general enough to hold DNA, RNA or amino acid sequences and allows an unlimited amount of annotation information to be kept alongside each entry.
(Example of a SwissProt entry in Darwin) (Example of an EMBL entry in Darwin)
Dayhoff matrices
(Example of a Dayhoff matrix)
PAM distance
One of the first motivations for the Darwin system was to allow for the complete "self-matching" of the entire SwissProt database. The goal was to compare every entry (some 30,000 at the time) against every other entry. It is a simple task to create pairwise comparisons of entries in Darwin.
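To give a feeling for the size of such a self-matching, here is a minimal sketch in Python (not Darwin code). It assumes that "every entry against every other entry" means each unordered pair is compared once and that an entry is not compared with itself. The names `count_pairs`, `all_against_all` and `shared_positions` are hypothetical, and `shared_positions` is only a toy stand-in for a real alignment score.

```python
from itertools import combinations


def count_pairs(n):
    """Number of unordered pairs among n entries (no self-pairs)."""
    return n * (n - 1) // 2


def all_against_all(entries, compare):
    """Apply compare to every unordered pair of entries exactly once."""
    return {
        (i, j): compare(entries[i], entries[j])
        for i, j in combinations(range(len(entries)), 2)
    }


def shared_positions(a, b):
    """Toy stand-in for an alignment score: number of identical positions."""
    return sum(x == y for x, y in zip(a, b))


# Three made-up peptide fragments, used only to exercise the loop.
toy = ["MKV", "MKL", "AKV"]
scores = all_against_all(toy, shared_positions)

print(count_pairs(30000))  # 449985000 comparisons for 30,000 entries
print(scores)              # {(0, 1): 2, (0, 2): 2, (1, 2): 1}
```

Under these assumptions, 30,000 entries give roughly 450 million pairwise comparisons, which is why the cost of each individual comparison matters so much.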
(Example of an AA-AA comparison)
(AA-AA comparisons with PAM distances)
(Example of PAM distance related to Dayhoff matrices)
The built-in Darwin pairwise comparison algorithms are based on the method of maximum likelihood. In broad terms, this methodology is based on the belief that we can test whether two sequences are truly homologous by testing the hypothesis
Gene finding
(Example of a DNA-AA comparison)
Phylogenetic Trees
(Example of a Phylogenetic tree - rooted)
Multiple Sequence alignments
(Example of an MSA)
Probabilistic ancestral sequences
(Example of a PAS)
Secondary Structure
(Example from SAINT)
Molecular Weight Traces
(Example of a Molecular weight trace)
Programming language
(Examples of the programming language)
End of Sample Session
This book is divided into three parts. Part - An
Introduction to Darwin has been designed to familiarize even the least
computer-literate among us with the basic Darwin environment.
We have tried to write for the biologist or biochemist who has perhaps
taken a first-year Introduction to Computer Science course and who has
distant memories of for-loops and recursion stored somewhere deep in
the recesses of their subconscious.
We have tried to use simple
terminology, defining only those terms we deem important in later
chapters. For new users, we recommend that this part be read
"in one sitting", beginning at Chapter - Exploring the Basics and
following the discussion and examples through to the end of
Chapter - A Guide to Debugging.
An experienced programmer may find that this part need only be
skimmed in order to become familiar with the peculiarities of the
Darwin system. Once comfortable with the language, users may find that
it acts as a short reference guide for looking up commands "on the
fly". To this end, we have attempted to make
each chapter self-contained.
Chapter - Exploring the Basics
provides a basic session with Darwin designed to
give new users a feeling of how to interact with the system.
Chapter through to Chapter
provide a more in-depth tour of the basic
Darwin language, with a focus on the most commonly used commands and
routines built into the kernel. We recommend that new
users become familiar with the topics covered in these chapters before
venturing into Part.
The more esoteric and application-specific routines
are discussed in Chapter - Genetic Databases,
Chapter - Randomization, Statistics and Visualization,
Chapter - Producing HTML Code,
Chapter - Darwin's Interprocessor Skills,
and Chapter - Calling External Functions.
Chapter - Genetic Databases provides an in-depth look at how
Darwin builds, stores and manipulates genetic databases. In some
sense, these data structures are the
cornerstone of the system, and fluency in manipulating them
makes programming in Darwin considerably easier.
Chapter - Randomization, Statistics and Visualization
contains an overview of the randomization and statistics functions,
followed by an explanation of the basic primitives available for graphing and
plotting information. These are used extensively throughout later
chapters, most notably
Chapter - Dayhoff Matrices and Mutation Matrices,
Chapter - Coping with Insertions and Deletions,
Chapter - Generating Random Sequences,
and Chapter - Phylogenetic Trees.
An in-depth reading of Chapter - Overloading, Polymorphism and
Object Orientation,
Chapter - Measuring Performance and
Chapter - Producing HTML Code should be postponed
until the reader feels particularly comfortable with the system.
Chapter - Darwin's Interprocessor Skills
explains the mechanisms built into
Darwin for interprocessor communication. These routines allow users
to split large, computationally intensive jobs into smaller pieces
which can be distributed automatically to other processors. A
complete understanding of this topic is not necessary in order to
proceed to later chapters, with the exception of the latter half of
Chapter - All against All,
where a program is given which performs an exhaustive matching of a set of
amino acid sequences.
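The idea of breaking an exhaustive matching into smaller pieces can be sketched as follows. This is an illustrative Python fragment, not Darwin's interprocessor mechanism: it assumes the work is divided by cutting the list of index pairs into fixed-size chunks, and the name `pair_chunks` and the chunk size are hypothetical.

```python
from itertools import combinations, islice


def pair_chunks(n, chunk_size):
    """Yield lists of at most chunk_size index pairs covering all pairs of n items."""
    pairs = combinations(range(n), 2)
    while True:
        chunk = list(islice(pairs, chunk_size))
        if not chunk:
            return
        yield chunk


# Five sequences give ten pairs; with four pairs per job we get three jobs.
chunks = list(pair_chunks(5, 4))
for job_id, chunk in enumerate(chunks):
    print(job_id, chunk)
```

Each chunk is independent of the others, so in principle every one of them could be handed to a different processor and the results merged afterwards.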
Each chapter in Part - Darwin and Problems from
Biochemistry
examines a different bioinformatics problem. Every chapter contains
(1) a statement of the problem, (2) a discussion of any
biological assumptions we make about the data,
(3) an explanation of how we
model the problem mathematically, (4) a description of the algorithm,
(5) a Darwin implementation, (6) a discussion of the accuracy and
efficiency of our algorithm, and (7) a short guide to the literature.
In this manner, we tour the Darwin libraries, motivating each routine
and data structure with a concrete example.
Beyond the understanding of the Darwin libraries, we hope such a
presentation gives users
The appendices contain some general material, including a short introduction to statistics and dynamic programming. For those readers unfamiliar with the mathematics underlying the models we use, these appendices will provide a deeper understanding of our methods. All of the examples and programs used throughout this manual are available via the World Wide Web (WWW) or by ftp (file transfer protocol). The Computational Biochemistry Research Group at ETH-Zürich maintains a web site at:
Our group at ETH-Zürich and the University of Florida at Gainesville continue to add code to the Darwin system, and we regularly make this new code available via the above web site. Readers are encouraged to submit their code to our algorithms repository. If you feel you have a particularly useful, novel or simply better algorithm for a problem, please send us e-mail at the address below.
If you should have any comments, suggestions or questions about
Darwin, we can be reached by e-mail at