Next: Building a Darwin Sequence Up: Genetic Databases Previous: Sequence Databases

Darwin Sequence Databases

To represent sequence databases internally, Darwin uses the ISO SGML (International Organization for Standardization, Standard Generalized Markup Language) tagging convention [21]. SGML is now used extensively, most notably through HTML on the World Wide Web (WWW). SGML offers several advantages:

An SGML tag pair consists of an opening tag <tag> and a closing tag </tag>, where tag is a name consisting of any sequence of letters and numbers. The opening and closing tags surround a field of information. This provides an essentially infinite name space for tags.
SGML tags can be nested to conform to any structure; this allows for a very rich and flexible structuring of data.
Compared with other formats which use blank lines and spacing patterns, SGML tags are quite economical in terms of storage.

A Darwin sequence database has three simple conventions:

1.: Each entry must begin with the tag <E> and end with the tag </E>.
2.: The sequence (whether it consists of nucleotides, ribonucleotides or peptides) must begin with the tag <SEQ> and end with the tag </SEQ>.
3.: No character other than those representing a nucleotide (or a ribonucleotide or peptide, as the case may be) is allowed between the opening <SEQ> and closing </SEQ> tags.

Beyond these, users may include as many fields as they like in each entry, and the choice of tag names for these fields is entirely up to the user.

**Table:** A table showing the different types of Darwin sequence databases and the legal bases for each.
Type	Legal Characters
DNA	A,C,G,T
RNA	A,C,G,U
Mixed	A,C,G,T,U
Peptide	A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V,X

The table above lists the types of sequence databases allowed in Darwin, with the legal characters associated with each type.
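The three conventions and the table of legal characters can be checked mechanically. The following is a minimal sketch in Python rather than Darwin; the names LEGAL and check_entry are hypothetical and are not part of Darwin. It assumes an entry is held as a single string.

```python
import re

# Legal characters per database type, copied from the table above.
LEGAL = {
    "DNA": set("ACGT"),
    "RNA": set("ACGU"),
    "Mixed": set("ACGTU"),
    "Peptide": set("ARNDCQEGHILKMFPSTWYVX"),
}

def check_entry(entry, db_type):
    """Return a list of convention violations (empty if none are found)."""
    problems = []
    text = entry.strip()
    # Convention 1: the entry is delimited by <E> ... </E>.
    if not (text.startswith("<E>") and text.endswith("</E>")):
        problems.append("entry must begin with <E> and end with </E>")
    # Convention 2: the sequence is delimited by <SEQ> ... </SEQ>.
    match = re.search(r"<SEQ>(.*?)</SEQ>", text, re.DOTALL)
    if match is None:
        problems.append("missing <SEQ>...</SEQ> field")
    else:
        # Convention 3: only legal characters may appear inside the sequence.
        illegal = sorted(set(match.group(1)) - LEGAL[db_type])
        if illegal:
            problems.append("illegal characters in sequence: " + "".join(illegal))
    return problems

print(check_entry("<E><ID>X</ID><SEQ>ACGT</SEQ></E>", "DNA"))  # []
print(check_entry("<E><SEQ>ACGU</SEQ></E>", "DNA"))
```

Note that whitespace inside the sequence is reported as illegal, which follows the third convention literally.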

The following variable ProtoOncogene contains what a Darwin entry might look like for the Swiss-Prot entry from Figure . The backslash symbol (\) allows us to split entries across lines.

> ProtoOncogene := '<E>\
> <ID>FYN_HUMAN</ID>\
> <AC>P06241;</AC>\
> <DE>PROTO-ONCOGENE TYROSINE-PROTEIN KINASE FYN (EC 2.7.1.112) (P59-FYN)\
> (SYN) (SLK).</DE>\
> <OS>HOMO SAPIENS (HUMAN).</OS>\
> <OC>EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA;\
> EUTHERIA; PRIMATES.</OC>\
> <DR>PDB; 1SHF;</DR>\
> <KW>PROTO-ONCOGENE; TRANSFERASE; TYROSINE-PROTEIN KINASE; PHOSPHORYLATION;\
> ATP-BINDING; MYRISTYLATION; SH3 DOMAIN; SH2 DOMAIN; 3D-STRUCTURE.</KW>\
> <FT>ACT_SITE 389 389</FT>\
> <SEQ>GCVQCKDKEATKLTEERDGSLNQSSGYRYGTDPTPQHYPSFGVTSIPNYNNFHAAGGQGLTVFGGVNS\
> SSHTGTLRTRGGTGVTLFVALYDYEARTEDDLSFHKGEKFQILNSSEGDWWEARSLTTGETGYIPSNY\
> VAPVDSIQAEEWYFGKLGRKDAERQLLSFGNPRGTFLIRESETTKGAYSLSIRDWDDMKGDHVKHYKI\
> RKLDNGGYYITTRAQFETLQQLVQHYSERAAGLCCRLVVPCHKGMPRLTDLSVKTKDVWEIPRESLQL\
> IKRLGNGQFGEVWMGTWNGNTKVAIKTLKPGTMSPESFLEEAQIMKKLKHDKLVQLYAVVSEEPIYIV\
> TEYMNKGSLLDFLKDGEGRALKLPNLVDMAAQVAAGMAYIERMNYIHRDLRSANILVGNGLICKIADF\
> GLARLIEDNEYTARQGAKFPIKWTAPEAALYGRFTIKSDVWSFGILLTELVTKGRVPYPGMNNREVLE\
> QVERGYRMPCPQDCPISLHELMIHCWKKDPEERPTFEYLQSFLEDYFTATEPQYQPGENL</SEQ>\
> </E>';

Notice that not all line types from the Swiss-Prot entry are present in the Darwin version. Depending on the user's objectives, some information kept in the original database may be discarded to reduce the memory and storage space the database requires. In this example, we have chosen to keep only the ID (identification), AC (accession number), DE (description), OS (organism species), OC (organism classification), DR (database cross-reference), KW (keyword), and part of the FT (feature table) fields, together with the sequence itself (SEQ).

Now, via the SearchTag command introduced in §, we can extract information in any of the fields.

> SearchTag('DE', ProtoOncogene);
> SearchTag('FT', ProtoOncogene);
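To make the idea of tag-based lookup concrete, here is a rough Python analogue. It is a sketch under the assumption that a lookup returns the text between the first matching opening tag and its closing tag; the function search_tag is hypothetical and is not Darwin's SearchTag implementation. The sequence in the sample entry is truncated for brevity.

```python
def search_tag(tag, entry):
    """Return the text between the first <tag> and its </tag>, or '' if absent."""
    open_tag, close_tag = f"<{tag}>", f"</{tag}>"
    start = entry.find(open_tag)
    if start == -1:
        return ""
    start += len(open_tag)
    end = entry.find(close_tag, start)
    return entry[start:end] if end != -1 else ""

# Abbreviated version of the ProtoOncogene entry (sequence truncated).
entry = (
    "<E><ID>FYN_HUMAN</ID><AC>P06241;</AC>"
    "<FT>ACT_SITE 389 389</FT><SEQ>GCVQCKDKEATKLTEERDGS</SEQ></E>"
)
print(search_tag("ID", entry))  # FYN_HUMAN
print(search_tag("FT", entry))  # ACT_SITE 389 389
```

Because each field carries its own opening and closing tag, the lookup needs no knowledge of field order or line layout.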

The general procedure for building a Darwin sequence database is as follows:

1.: For each entry of the sequence database, convert the entry to the Darwin SGML format.
2.: Concatenate the converted entries into a single external file.
3.: Load the file into Darwin via the ReadDb command.

The next subsection presents a simple Darwin program to perform steps 1 and 2 above for the Swiss-Prot database.
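As an illustration of steps 1 and 2 only, the following Python sketch converts a Swiss-Prot flat-file entry into the tagged form and concatenates the results. It is not the Darwin program referred to above. It assumes the usual Swiss-Prot layout (a two-letter line code, data starting at column 6, sequence lines following the SQ line, and // closing the entry). The names KEEP, swissprot_to_darwin and concatenate are hypothetical, and the sample entry is an abbreviated toy input, not the full record.

```python
KEEP = ("ID", "AC", "DE", "OS", "KW")  # line types to retain (an assumption)

def swissprot_to_darwin(entry_text, keep=KEEP):
    """Step 1: convert one Swiss-Prot flat-file entry to the tagged format."""
    fields, seq_parts, in_seq = {}, [], False
    for line in entry_text.splitlines():
        code, data = line[:2], line[5:].strip()
        if code == "SQ":
            in_seq = True            # residues follow on the next lines
        elif code == "//":
            break                    # end of entry
        elif in_seq:
            seq_parts.append("".join(line.split()))  # drop spacing
        elif code in keep:
            if code == "ID" and data:
                data = data.split()[0]  # keep only the entry name
            fields.setdefault(code, []).append(data)
    body = "".join(f"<{c}>{' '.join(fields[c])}</{c}>" for c in keep if c in fields)
    return f"<E>{body}<SEQ>{''.join(seq_parts)}</SEQ></E>"

def concatenate(entries):
    """Step 2: join the converted entries, one per line."""
    return "".join(swissprot_to_darwin(e) + "\n" for e in entries)

# Abbreviated toy entry (description shortened, sequence truncated).
sample = "\n".join([
    "ID   FYN_HUMAN      STANDARD;      PRT;   20 AA.",
    "AC   P06241;",
    "DE   PROTO-ONCOGENE TYROSINE-PROTEIN KINASE FYN.",
    "OS   HOMO SAPIENS (HUMAN).",
    "SQ   SEQUENCE   20 AA;",
    "     GCVQCKDKEA TKLTEERDGS",
    "//",
])
print(concatenate([sample]), end="")
```

The string returned by concatenate can be written to a file for step 3; the exact file layout that ReadDb expects is not covered by this sketch.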


Gaston Gonnet
1998-09-15