A growing number of sequence databases are being made available to the general public. Each of these databases has its own raison d'être: some contain only protein sequences, others only nucleotide sequences, and others only the sequences specific to a single organism. Unfortunately, there are as many formats for databases as there are databases. However, most of the formats are well defined, and most sites offer short manuals detailing the tagging conventions and overall layout of their database.
Historically, the Computational Biochemistry Research Group has focused mainly on two databases: the annotated protein database Swiss-Prot [3,4] and the nucleotide sequence database EMBL [10].
Figures and contain an example of a Swiss-Prot entry and an EMBL entry, respectively.
Each line (excluding the sequence data that follows SQ) begins with a
two-letter code which indicates the type of the line. In both
Swiss-Prot and EMBL, ID indicates the identification
of the sequence, AC is the accession number, DE is
the description, SQ is the actual sequence, and so
forth. Unfortunately, the two sets of line codes do not always
correspond: each database has several types of lines which have no
analog in the other.
For this reason, we must reformat all of the sequence information into a
normalized form that Darwin understands. We show how to perform this
normalization using the Swiss-Prot database, but note that the
normalization of other databases proceeds in much the same way.
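
To make the two-letter line codes described above concrete, the following is a minimal Python sketch (not Darwin code, and not the conversion procedure used by Darwin itself) that groups the lines of a single flat-file entry by their line code. The function name `parse_flat_entry` and the sample entry are hypothetical and serve only as illustration. The sketch assumes the usual flat-file layout, in which the data starts in column 6 and the entry ends with a `//` line.

```python
from collections import defaultdict


def parse_flat_entry(text):
    """Group the lines of one flat-file entry by their two-letter line code.

    Hypothetical helper for illustration; assumes data starts in column 6
    and that the entry is terminated by a line starting with "//".
    """
    fields = defaultdict(list)
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-entry marker
            break
        if in_sequence:
            # Sequence data lines carry no line code; keep letters only,
            # dropping the spacing (and any position counts).
            fields["SQ"].append("".join(ch for ch in line if ch.isalpha()))
            continue
        code, rest = line[:2], line[5:].rstrip()
        if code == "SQ":
            # The SQ header line holds only summary information; the
            # sequence itself follows on the next lines.
            in_sequence = True
            continue
        if code.strip() and rest:          # skip blank and spacer lines
            fields[code].append(rest)
    # Join continuation lines: no separator for sequence data, a single
    # space for everything else.
    return {
        code: ("" if code == "SQ" else " ").join(parts)
        for code, parts in fields.items()
    }


# A made-up entry, for illustration only (not a real database record).
sample = """\
ID   TEST_HUMAN     STANDARD;      PRT;    10 AA.
AC   P00000;
DE   HYPOTHETICAL EXAMPLE PROTEIN
DE   (ILLUSTRATION ONLY).
SQ   SEQUENCE   10 AA;
     MGDVEKGKKI
//
"""

entry = parse_flat_entry(sample)
for code, value in entry.items():
    print(code, "=>", value)
```

Because the result is keyed by line code, codes that exist in only one of the two databases simply appear as extra keys, which is one reason a normalization step is needed before entries from different sources can be treated uniformly.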