To represent sequence databases internally, Darwin uses the ISO SGML (International Organization for Standardization, Standard Generalized Markup Language) tagging convention. SGML is now widely used, most notably through HTML and the World Wide Web (WWW). SGML offers many advantages:
Each field of information is surrounded by an opening tag <tag> and a closing tag </tag>, where tag is a name consisting of any sequence of letters and numbers. This provides an essentially unlimited name space for tags.
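As a rough illustration of this tagging convention, the sketch below pulls tagged fields out of a string. It is written in Python rather than Darwin and is not how Darwin itself parses entries; the function name `tagged_fields` and the regular expression are assumptions made for this example.

```python
import re

# A tag name is any run of letters and digits; the backreference \1 forces
# the closing tag to carry the same name as the opening tag.
FIELD = re.compile(r"<([A-Za-z0-9]+)>(.*?)</\1>", re.DOTALL)

def tagged_fields(text):
    """Return (tag, content) pairs for each top-level tagged field in text."""
    return [(m.group(1), m.group(2)) for m in FIELD.finditer(text)]

sample = "<ID>FYN_HUMAN</ID><AC>P06241;</AC>"
print(tagged_fields(sample))  # [('ID', 'FYN_HUMAN'), ('AC', 'P06241;')]
```

Because matches do not overlap, a field that wraps other fields is returned whole, with its inner tags left in the content string.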
A Darwin sequence database has three simple conventions:
Entries must begin with the tag <E> and end with the tag </E>.
Sequences must begin with the tag <SEQ> and end with the tag </SEQ>.
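A minimal Python sketch of a check for the two conventions visible here, assuming an entry is wrapped in <E>...</E> and its sequence in <SEQ>...</SEQ>. The function name `follows_visible_conventions` is hypothetical, and the check is an illustration rather than Darwin's own validation.

```python
def follows_visible_conventions(entry):
    """Check that an entry is wrapped in <E>...</E> and holds a <SEQ>...</SEQ> field."""
    entry = entry.strip()
    wrapped = entry.startswith("<E>") and entry.endswith("</E>")
    start = entry.find("<SEQ>")
    end = entry.find("</SEQ>")
    # The closing sequence tag must exist and come after the opening one.
    has_sequence = start != -1 and end > start
    return wrapped and has_sequence

print(follows_visible_conventions("<E><ID>X</ID><SEQ>ACGT</SEQ></E>"))  # True
print(follows_visible_conventions("<ID>X</ID><SEQ>ACGT</SEQ>"))         # False
```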
Table lists the types of sequence databases allowed in Darwin, with the legal characters associated with each type.
The following variable ProtoOncogene contains what a Darwin
entry might look like for the Swiss-Prot entry from
The backslash symbol (\) allows us to split entries across lines.
> ProtoOncogene := <E>\
> <ID>FYN_HUMAN</ID>\
> <AC>P06241;</AC>\
> <DE>PROTO-ONCOGENE TYROSINE-PROTEIN KINASE FYN (EC 126.96.36.199) (P59-FYN)\
> (SYN) (SLK).</DE>\
> <OS>HOMO SAPIENS (HUMAN).</OS>\
> <OC>EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA;\
> EUTHERIA; PRIMATES.</OC>\
> <DR>PDB; 1SHF;</DR>\
> <KW>PROTO-ONCOGENE; TRANSFERASE; TYROSINE-PROTEIN KINASE; PHOSPHORYLATION;\
> ATP-BINDING; MYRISTYLATION; SH3 DOMAIN; SH2 DOMAIN; 3D-STRUCTURE.</KW>\
> <FT>ACT_SITE 389 389</FT>\
> <SEQ>GCVQCKDKEATKLTEERDGSLNQSSGYRYGTDPTPQHYPSFGVTSIPNYNNFHAAGGQGLTVFGGVNS\
> SSHTGTLRTRGGTGVTLFVALYDYEARTEDDLSFHKGEKFQILNSSEGDWWEARSLTTGETGYIPSNY\
> VAPVDSIQAEEWYFGKLGRKDAERQLLSFGNPRGTFLIRESETTKGAYSLSIRDWDDMKGDHVKHYKI\
> RKLDNGGYYITTRAQFETLQQLVQHYSERAAGLCCRLVVPCHKGMPRLTDLSVKTKDVWEIPRESLQL\
> IKRLGNGQFGEVWMGTWNGNTKVAIKTLKPGTMSPESFLEEAQIMKKLKHDKLVQLYAVVSEEPIYIV\
> TEYMNKGSLLDFLKDGEGRALKLPNLVDMAAQVAAGMAYIERMNYIHRDLRSANILVGNGLICKIADF\
> GLARLIEDNEYTARQGAKFPIKWTAPEAALYGRFTIKSDVWSFGILLTELVTKGRVPYPGMNNREVLE\
> QVERGYRMPCPQDCPISLHELMIHCWKKDPEERPTFEYLQSFLEDYFTATEPQYQPGENL</SEQ>\
> </E>
Notice that not all line types from the Swiss-Prot entry are present in the Darwin version. Depending on the user's objectives, some information kept in the original database may be discarded to reduce the memory and storage space the database requires. In particular, we have chosen to keep only the ID (identification), AC (accession number), DE (description), OS (organism species), KW (keyword), a part of the FT (feature table), and SEQ (sequence entry) fields.
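The selection of fields described above can be sketched in Python as follows. This is an illustration under stated assumptions, not Darwin's own conversion procedure: the function name `swissprot_to_tagged` is hypothetical, the input is assumed to follow the Swiss-Prot flat-file layout (a two-letter line code, data starting in column 6, an SQ header followed by residue lines, and a `//` terminator), and the FT field is left out because the text does not say which part of it is kept. The record in the example is an invented toy, not a real Swiss-Prot entry.

```python
KEEP = ("ID", "AC", "DE", "OS", "KW")  # FT omitted in this sketch

def swissprot_to_tagged(record):
    """Convert one flat-file record into a single tagged entry string."""
    fields = {}
    residues = []
    in_sequence = False
    for line in record.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        code, data = line[:2], line[5:].strip()
        if in_sequence:
            residues.append(line.replace(" ", ""))  # drop column spacing
        elif code == "SQ":
            in_sequence = True             # residue lines follow this header
        elif code in KEEP:
            if code == "ID":
                data = data.split()[0]     # keep only the entry name
            fields.setdefault(code, []).append(data)
    parts = ["<E>"]
    for code in KEEP:
        if code in fields:
            parts.append("<%s>%s</%s>" % (code, " ".join(fields[code]), code))
    parts.append("<SEQ>%s</SEQ>" % "".join(residues))
    parts.append("</E>")
    return "".join(parts)

# Invented toy record, used only to exercise the function.
toy = "\n".join([
    "ID   TOY_HUMAN      STANDARD;      PRT;     8 AA.",
    "AC   X00000;",
    "DE   TOY PROTEIN",
    "DE   (EXAMPLE).",
    "OS   HOMO SAPIENS (HUMAN).",
    "RN   [1]",
    "SQ   SEQUENCE   8 AA;",
    "     GCVQ CKDK",
    "//",
])
print(swissprot_to_tagged(toy))
```

Line types outside `KEEP`, such as the RN line in the toy record, are simply skipped, which is where the saving in memory and storage comes from.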
Now, via the SearchTag command introduced in §, we can extract information in any of the fields.
> SearchTag('DE', ProtoOncogene);
> SearchTag('FT', ProtoOncogene);
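To make the idea of tag-based lookup concrete, here is a small Python analogue. It is a sketch only: the name `search_tag` is hypothetical, and returning an empty string for a missing tag is an assumption of this example, not documented Darwin behaviour.

```python
def search_tag(tag, text):
    """Return the content of the first <tag>...</tag> field, or '' if absent."""
    opening, closing = "<%s>" % tag, "</%s>" % tag
    start = text.find(opening)
    if start == -1:
        return ""
    start += len(opening)
    end = text.find(closing, start)
    return text[start:end] if end != -1 else ""

entry = "<E><ID>FYN_HUMAN</ID><FT>ACT_SITE 389 389</FT></E>"
print(search_tag("FT", entry))  # ACT_SITE 389 389
print(search_tag("ID", entry))  # FYN_HUMAN
```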
The general procedure for building a Darwin sequence database is as follows: