The previous section showed what a typical Darwin entry might look like. In this section, we build a rudimentary parsing engine that converts the entire contents of Swiss-Prot into a Darwin-friendly format. The Darwin library offers the function SpToDarwin, which performs a slightly extended version of the conversion developed here.
| Parameter | Type |
| --- | --- |
| flatfile | name |
| darwinfile | name |
| descr | name |
| compressed | boolean |
SpToDarwin converts a Swiss-Prot flat file (flatfile) into a Darwin loadable file (darwinfile). The parameter descr should contain a name item with the DESCR, DBNAME and DBRELEASE tags. If compressed is specified and true, flatfile is read using the Unix command zcat.
Our procedure, ConvertDB, accepts either three or four parameters. The first argument, flatfile, is the flat file (raw data file) containing all the Swiss-Prot entries we would like to convert. The second argument, darwinfile, is the destination file where the converted entries are stored. The third argument, descript, specifies the name of our database; its value is written at the top of darwinfile wrapped in <DBDESCR> and </DBDESCR> tags, together with the conversion date. If the optional boolean argument compressed is specified and true, flatfile is first decompressed via the UNIX zcat command.
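The set-up step described above (open the flat file, optionally decompressing it, then write the database descriptor) can be sketched outside Darwin as follows. This is a Python illustration only, not the Darwin implementation: the helper names open_flatfile and write_descriptor are hypothetical, and Python's gzip module stands in for the zcat pipe, so it assumes a gzip-compressed file.

```python
import gzip
from datetime import date


def open_flatfile(path, compressed=False):
    """Return a text-mode handle on the flat file.

    Assumption: gzip.open replaces the zcat pipe, so only gzip-compressed
    input is handled in this sketch.
    """
    if compressed:
        return gzip.open(path, "rt", encoding="ascii", errors="replace")
    return open(path, "r", encoding="ascii", errors="replace")


def write_descriptor(out, descript, today=None):
    """Write the database descriptor line that precedes all entries."""
    stamp = today or date.today().isoformat()
    out.write(f"<DBDESCR>{descript}<CONVDATE>{stamp}</CONVDATE></DBDESCR>\n")
```

Passing the date explicitly (the hypothetical today argument) keeps the output reproducible, which is convenient when comparing two conversions of the same release.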
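After establishing a pipe between Darwin and the flat file flatfile (see §) and writing the database descriptor to the file darwinfile, we can begin parsing the Swiss-Prot flat file. The parser is a simple state machine, shown schematically in the figure, with six states: Outside Entry (0), Inside Entry (1), FT Tag Initial (2), FT Tag Latter (3), SQ Tag (4) and Finish (5). The first two characters of each line give its line type (tag); the tag and the current state together determine the next action.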
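To make the control flow easier to follow, here is a minimal Python sketch of one possible reading of the state machine, using the six states listed in the comment at the top of ConvertDB. It models transitions only and produces no output. The names step and trace are hypothetical, and the handling of sequence lines (staying in the SQ state while the tag is SQ or blank) is an assumption of this sketch rather than a description of the Darwin code.

```python
OUTSIDE, INSIDE, FT_FIRST, FT_MORE, SQ, FINISH = range(6)
KEEP = {"ID", "AC", "DE", "OS", "OC", "FT", "SQ"}


def step(state, tag):
    """Return (new_state, consumed) for one line tag; tag None means end of file.

    consumed is False when the same line must be re-examined in the new
    state, which mirrors the use of `next` in the Darwin loop.
    """
    if tag is None:
        return FINISH, False
    if state == OUTSIDE:
        return (INSIDE, False) if tag in KEEP else (OUTSIDE, True)
    if state == INSIDE:
        if tag == "//":
            return OUTSIDE, True          # entry closed
        if tag == "FT":
            return FT_FIRST, False
        if tag == "SQ":
            return SQ, False
        return INSIDE, True               # kept tags are emitted, others skipped
    if state == FT_FIRST:
        return FT_MORE, False             # buffer reset happens here
    if state == FT_MORE:
        return (FT_MORE, True) if tag == "FT" else (INSIDE, False)
    if state == SQ:
        return (SQ, True) if tag in ("SQ", "  ") else (INSIDE, False)
    return FINISH, False


def trace(tags):
    """Run the machine over a list of line tags and return the states visited."""
    state, visited = OUTSIDE, []
    stream = iter(list(tags) + [None])
    tag = next(stream)
    while state != FINISH:
        state, consumed = step(state, tag)
        visited.append(state)
        if consumed:
            tag = next(stream)
    return visited
```

For example, trace(["ID", "FT", "FT", "SQ", "  ", "//"]) walks through one complete entry and ends in the Finish state.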
    ConvertDB := proc( flatfile : name, darwinfile : name, descript : name,
                       compressed : boolean )
      global sec_struct, length_struct;
      local state, TagsToKeep, t, tag, i;

      sec_struct := CreateString(1..5000);   # temporary holder for FT tag
      length_struct := 0;
      state := 0;
      # state -> 0 : Outside Entry;  1 : Inside Entry;  2 : FT Tag Initial
      #          3 : FT Tag Latter;  4 : SQ Tag;        5 : Finish
      TagsToKeep := {'ID', 'AC', 'DE', 'OS', 'OC', 'FT', 'SQ'};

      if nargs > 3 and compressed then
        OpenPipe('zcat '.flatfile);   # establish a pipe between Darwin and flatfile
      else
        readlines(flatfile);
      fi;

      WriteFile(darwinfile);
      printf('<DBDESCR>%s<CONVDATE>%s</CONVDATE></DBDESCR>\n', descript, date());

      t := ReadLine();     # get the first line from the file
      tag := t[1..2];      # first two characters contain the line type
      while true do        # we use break to get out of the loop
        if state = 0 then
          if t = EOF then
            state := 5;  next;                    # go to state Finish
          elif member(tag, TagsToKeep) then
            state := 1;  printf('\n<E>');  next;  # open an entry, go to Inside Entry
          fi;                                     # otherwise ignore the line
        elif state = 1 then
          if tag = '//' then
            state := 0;  printf('\n</E>');        # close entry
          elif tag = 'FT' then
            state := 2;  next;
          elif tag = 'SQ' then
            state := 4;  next;
          elif member(tag, TagsToKeep) then
            t := ParseTag(t);                     # returns the first line after the tag
            tag := t[1..2];
            next;
          fi;
        elif state = 2 then
          # first FT line of the entry: clear the secondary structure buffer
          for i to length(sec_struct) do sec_struct[i] := ' ' od;
          length_struct := 0;
          state := 3;  next;
        elif state = 3 then
          if tag = 'FT' then
            ParseFTLine(t);
          else
            if length_struct > 0 then
              printf('\n<FT>%s</FT>', sec_struct[1..length_struct]);
            fi;
            state := 1;  next;                    # go back to Inside Entry
          fi;
        elif state = 4 then
          # ParseSequence reads up to and including the '//' line ending the entry
          printf('\n<SEQ>%s</SEQ>', ParseSequence());
          printf('\n</E>');
          state := 0;
        elif state = 5 then
          break;
        fi;
        t := ReadLine();   # get the next line from the pipe
        tag := t[1..2];    # get the next tag
      od;
      WriteFile(terminal);
    end:

All that remains to complete our parser is to write the procedures ParseTag, ParseSequence, and ParseFTLine. ParseTag merges the continuation lines of a tag and returns the first line that does not belong to it, so that the main loop can continue from that line. ParseSequence collects the sequence lines and stops at the first line whose tag is not blank, which in Swiss-Prot is the // terminator. ParseFTLine records the TURN, HELIX and STRAND features in the global buffer sec_struct.

    ParseTag := proc( t : string )
      local tag, line, p;
      tag := t[1..2];
      line := t;
      printf('\n<%s>', tag);
      while true do
        p := 3;
        while p <= length(line) and line[p] = ' ' do p := p+1 od;
        if p <= length(line) then
          printf('%s ', line[p..length(line)]);
        fi;
        line := ReadLine();
        if line = EOF or line[1..2] <> tag then
          printf('</%s>\n', tag);
          break;
        fi;
      od;
      return(line);
    end:

    ParseSequence := proc( )
      local seq, t, tag, p;
      seq := '';                  # seq holds the partial sequence
      while true do
        t := ReadLine();          # get the next line from the pipe
        tag := t[1..2];
        if tag = '  ' then        # sequence lines start with blanks
          for p from 3 to length(t) do
            if t[p] > ' ' then
              seq := seq . If(AToInt(t[p]) = 0, 'X', t[p]);
            fi;
          od;
        else
          return(seq);
        fi;
      od;
    end:

    ParseFTLine := proc( t : string )
      global sec_struct, length_struct;
      local temp, i;
      temp := sscanf(2+t, '%s %d %d');
      if length(temp) = 3 and
         (temp[1] = 'TURN' or temp[1] = 'HELIX' or temp[1] = 'STRAND') then
        if temp[3] > length_struct then length_struct := temp[3] fi;
        for i from temp[2] to temp[3] do
          if temp[1] = 'TURN' then sec_struct[i] := 't';
          elif temp[1] = 'HELIX' then sec_struct[i] := 'h';
          elif temp[1] = 'STRAND' then sec_struct[i] := 's';
          fi;
        od;
      fi;
    end:
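As a cross-check on what ParseFTLine and ParseSequence are meant to compute, the following Python sketch reproduces the two transformations on lists of lines. It is illustrative only: the function names are hypothetical, the feature lines are assumed to use the column layout "FT   KEY   FROM   TO", and the set of 20 standard amino acid letters is an assumption standing in for Darwin's AToInt test.

```python
def secondary_structure(ft_lines):
    """Build a per-residue string: h = HELIX, s = STRAND, t = TURN, blank otherwise."""
    codes = {"HELIX": "h", "STRAND": "s", "TURN": "t"}
    marks = {}
    for line in ft_lines:
        fields = line[2:].split()          # drop the two-character line tag
        if len(fields) < 3 or fields[0] not in codes:
            continue                       # other feature keys are ignored
        try:
            start, end = int(fields[1]), int(fields[2])
        except ValueError:
            continue                       # uncertain positions such as '<1' are skipped
        for pos in range(start, end + 1):
            marks[pos] = codes[fields[0]]
    if not marks:
        return ""
    return "".join(marks.get(pos, " ") for pos in range(1, max(marks) + 1))


def read_sequence(lines):
    """Concatenate residues from the blank-tagged lines that follow an SQ header."""
    valid = set("ACDEFGHIKLMNPQRSTVWY")    # assumption: anything else becomes X
    residues = []
    for line in lines:
        if not line.startswith("  "):
            break                          # e.g. the '//' terminator
        residues.extend(ch if ch in valid else "X"
                        for ch in line if not ch.isspace())
    return "".join(residues)
```

Unlike the fixed 5000-character buffer used in the Darwin code, the dictionary in this sketch grows with the feature positions, so long proteins need no special handling.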