To represent sequence databases internally, Darwin uses the ISO SGML (International Organization for Standardization, Standard Generalized Markup Language) tagging convention [21]. SGML is now used extensively, most notably through HTML and the World Wide Web (WWW). SGML offers many advantages:
Each field of information is enclosed by an opening <tag> and a closing </tag>, where tag is a name consisting of any sequence of letters and numbers. This provides an essentially infinite name space for tags.
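As a rough illustration of this tagging convention outside Darwin, the Python sketch below lists the tagged fields found in a string. The function name list_fields and the regular expression are assumptions made for this example; they are not part of Darwin.

```python
import re

# Matches <tag>...</tag> where tag consists of letters and digits only.
# The backreference \1 forces the closing tag to repeat the opening name.
FIELD = re.compile(r"<([A-Za-z0-9]+)>(.*?)</\1>", re.DOTALL)

def list_fields(text):
    """Return (tag, content) pairs for each top-level field in text."""
    return [(m.group(1), m.group(2)) for m in FIELD.finditer(text)]

pairs = list_fields("<ID>FYN_HUMAN</ID><AC>P06241;</AC>")
print(pairs)
```

Because any alphanumeric name is accepted, the same pattern handles tags that were never anticipated when the sketch was written.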
A Darwin sequence database has three simple conventions:
Each entry must begin with the tag <E> and end with the tag </E>.
Each sequence must begin with the tag <SEQ> and end with the tag </SEQ>.
[...] <SEQ>, </SEQ> tags.
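The tag conventions that delimit an entry and its sequence can be checked mechanically. The Python sketch below is one possible way to do so outside Darwin. The name check_entry is hypothetical, and the function tests only for the presence and order of the <E> and <SEQ> tag pairs; it makes no attempt to validate the characters of the sequence itself.

```python
def check_entry(entry):
    """Return a list of problems found; an empty list means both checks pass."""
    problems = []
    text = entry.strip()
    # The entry as a whole must be wrapped in <E> ... </E>.
    if not (text.startswith("<E>") and text.endswith("</E>")):
        problems.append("entry must begin with <E> and end with </E>")
    # The sequence must be wrapped in <SEQ> ... </SEQ>, in that order.
    start = text.find("<SEQ>")
    end = text.find("</SEQ>")
    if start == -1 or end == -1 or end < start:
        problems.append("sequence must begin with <SEQ> and end with </SEQ>")
    return problems

print(check_entry("<E><ID>FYN_HUMAN</ID><SEQ>GCVQCKDKEA</SEQ></E>"))
```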
Table lists the types of sequence databases allowed in Darwin, together with the legal characters associated with each type.
The variable ProtoOncogene below shows what a Darwin entry might look like for the Swiss-Prot entry from Figure . The backslash symbol (\) allows us to split entries across lines.
> ProtoOncogene := <E>\
> <ID>FYN_HUMAN</ID>\
> <AC>P06241;</AC>\
> <DE>PROTO-ONCOGENE TYROSINE-PROTEIN KINASE FYN (EC 2.7.1.112) (P59-FYN)\
> (SYN) (SLK).</DE>\
> <OS>HOMO SAPIENS (HUMAN).</OS>\
> <OC>EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA;\
> EUTHERIA; PRIMATES.</OC>\
> <DR>PDB; 1SHF;</DR>\
> <KW>PROTO-ONCOGENE; TRANSFERASE; TYROSINE-PROTEIN KINASE; PHOSPHORYLATION;\
> ATP-BINDING; MYRISTYLATION; SH3 DOMAIN; SH2 DOMAIN; 3D-STRUCTURE.</KW>\
> <FT>ACT_SITE 389 389</FT>\
> <SEQ>GCVQCKDKEATKLTEERDGSLNQSSGYRYGTDPTPQHYPSFGVTSIPNYNNFHAAGGQGLTVFGGVNS\
> SSHTGTLRTRGGTGVTLFVALYDYEARTEDDLSFHKGEKFQILNSSEGDWWEARSLTTGETGYIPSNY\
> VAPVDSIQAEEWYFGKLGRKDAERQLLSFGNPRGTFLIRESETTKGAYSLSIRDWDDMKGDHVKHYKI\
> RKLDNGGYYITTRAQFETLQQLVQHYSERAAGLCCRLVVPCHKGMPRLTDLSVKTKDVWEIPRESLQL\
> IKRLGNGQFGEVWMGTWNGNTKVAIKTLKPGTMSPESFLEEAQIMKKLKHDKLVQLYAVVSEEPIYIV\
> TEYMNKGSLLDFLKDGEGRALKLPNLVDMAAQVAAGMAYIERMNYIHRDLRSANILVGNGLICKIADF\
> GLARLIEDNEYTARQGAKFPIKWTAPEAALYGRFTIKSDVWSFGILLTELVTKGRVPYPGMNNREVLE\
> QVERGYRMPCPQDCPISLHELMIHCWKKDPEERPTFEYLQSFLEDYFTATEPQYQPGENL</SEQ>\
> </E>
Notice that not all line types from the Swiss-Prot entry are present in the Darwin version. Depending on the user's objectives, some information kept in the original database may be discarded to reduce the memory and storage space the database requires. In particular, we have chosen to keep only the ID (identification), AC (accession number), DE (description), OS (organism species), KW (keyword), part of the FT (feature table), and SEQ (sequence) fields.
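To make the idea of discarding line types concrete, here is a minimal Python sketch, written outside Darwin, that turns one Swiss-Prot flat-file entry into a tagged string and keeps only selected line types. It assumes the usual flat-file layout: a two-letter line code, data starting at column 6, an SQ header line followed by residue lines, and a terminating // line. The names KEEP and swissprot_to_tagged are hypothetical, the sample entry is an abbreviated toy input, and the sketch copies each kept line verbatim rather than trimming it as the example above does.

```python
KEEP = ("ID", "AC", "DE", "OS", "KW", "FT")  # line types to retain (assumed)

def swissprot_to_tagged(lines, keep=KEEP):
    """Build an <E>...</E> string from the lines of one flat-file entry."""
    fields = {}          # line code -> list of data strings, first-seen order
    residues = []        # sequence chunks collected after the SQ header
    in_sequence = False
    for line in lines:
        if line.startswith("//"):        # end-of-entry marker
            break
        code, data = line[:2], line[5:].strip()
        if code == "SQ":                 # header line; residue lines follow
            in_sequence = True
        elif in_sequence:
            residues.append("".join(line.split()))  # drop spaces in residues
        elif code in keep:
            fields.setdefault(code, []).append(data)
    body = "".join(f"<{c}>{' '.join(v)}</{c}>" for c, v in fields.items())
    return f"<E>{body}<SEQ>{''.join(residues)}</SEQ></E>"

# Abbreviated toy entry: the CC line is not in KEEP and is discarded.
sample = [
    "ID   FYN_HUMAN",
    "AC   P06241;",
    "CC   a comment line that this sketch discards",
    "OS   HOMO SAPIENS (HUMAN).",
    "SQ   SEQUENCE",
    "     GCVQCKDKEA TKLTEERDGS",
    "//",
]
print(swissprot_to_tagged(sample))
```

Changing the keep argument is all that is needed to retain a different set of line types, which mirrors the trade-off between completeness and database size described above.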
Now, via the SearchTag command introduced in §, we can extract the information in any of the fields.
> SearchTag('DE', ProtoOncogene);
> SearchTag('FT', ProtoOncogene);
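For readers who want to see the kind of lookup involved, the following Python sketch imitates a tag search outside Darwin. It assumes that the lookup returns the text between the first matching pair of tags. The name search_tag is hypothetical, and returning an empty string for a missing tag is a choice made for this sketch, not documented Darwin behaviour.

```python
def search_tag(tag, entry):
    """Return the text between the first <tag> and its </tag>, or '' if absent."""
    opening, closing = f"<{tag}>", f"</{tag}>"
    start = entry.find(opening)
    if start == -1:
        return ""
    start += len(opening)                # skip past the opening tag itself
    end = entry.find(closing, start)
    return entry[start:end] if end != -1 else ""

entry = "<E><ID>FYN_HUMAN</ID><FT>ACT_SITE 389 389</FT></E>"
print(search_tag("FT", entry))
```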
The general procedure for building a Darwin sequence database is as follows: