There is a growing number of sequence databases being made available to the general public. Each of these databases has its own raison d'être: some contain only protein sequences, some only nucleotide sequences, and some only the sequences specific to a single organism. Unfortunately, there are as many formats for databases as there are databases. However, most of the formats are well-defined and most sites offer short manuals detailing the tagging conventions and overall layout of their database.
Historically, the Computational Biochemistry Research Group has focused mainly on two databases: the annotated protein database Swiss-Prot [3,4] and the nucleotide sequence database EMBL.
The accompanying figures contain an example of a Swiss-Prot entry and an EMBL entry respectively. Each line (excluding the lines of sequence data that follow the SQ line) begins with a two-letter code which indicates the type of the line. In both Swiss-Prot and EMBL, ID indicates the identification of the sequence, AC is the accession number, DE is the description, SQ introduces the actual sequence, and so forth. Unfortunately, the two formats do not always correspond this neatly: several types of lines in one database have no analog in the other. For this reason, we must reformat all of the sequence information into a normalized form that Darwin understands. We show how to perform this normalization using the Swiss-Prot database, but note that the normalization of other databases follows, more or less, the same lines.
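To make the role of the two-letter line codes concrete, the following is a minimal sketch in Python (not Darwin) of how one entry in this style of flat file might be split into fields. It is an illustration only and is not the normalization procedure used by Darwin. The function name parse_entry, the layout of the returned dictionary, and the sample entry (including its identifier, accession number and sequence) are all invented for this example. The sketch assumes that the sequence data follows the SQ line and that the entry ends with a line beginning with "//".

```python
from collections import defaultdict


def parse_entry(text):
    """Group the lines of one flat-file entry by their two-letter line code.

    Hypothetical helper for illustration; not part of Darwin.
    """
    fields = defaultdict(list)
    sequence = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):
            # Assumed end-of-entry marker.
            break
        if in_sequence:
            # Sequence lines carry no line code. Keep letters only, which
            # drops spacing and any position counts at the end of a line.
            sequence.append("".join(ch for ch in line if ch.isalpha()))
            continue
        code, content = line[:2], line[2:].strip()
        if not code.strip():
            # Skip blank lines or lines without a code.
            continue
        fields[code].append(content)
        if code == "SQ":
            in_sequence = True
    # Join repeated line types (for example several DE lines) with spaces.
    record = {code: " ".join(parts) for code, parts in fields.items()}
    record["sequence"] = "".join(sequence)
    return record


# An invented entry, shaped like the format described above.
SAMPLE = """\
ID   EXAMPLE_ENTRY   Reviewed;   12 AA.
AC   X00000;
DE   Hypothetical example protein.
SQ   SEQUENCE   12 AA;
     MKTAYIAKQR QI
//
"""

record = parse_entry(SAMPLE)
print(record["AC"], record["sequence"])
```

Because the fields are keyed by line code, a line type that exists in only one database simply appears as an extra key. A real normalization step would still have to decide how each such key maps onto the common form.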