
Indexing a Darwin Sequence Database

The SearchDb function can search the entire DB sequence database only if the whole database is loaded into memory. With extremely large databases this can be a problem, particularly with DNA databases such as EMBL, which are on the order of hundreds of megabytes. Even with moderate-sized databases, searches via SearchDb can be rather slow. Total CPU times were recorded for the following searches of the Swiss-Prot database via SearchDb.

> rtime(SearchDb('hello'));                      # a pattern that is not a DNA, RNA or AA sequence
> rtime(SearchDb('aaa'));                        # a common pattern takes a long time
> rtime(SearchDb('SLVHLRIKDRIPANNDIYVLKGDLY'));  # an AA sequence
> rtime(SearchDb('P30376'));                     # searching for an accession number
> rtime(SearchDb('143Z_SHEEP'));                 # searching for an identification name
For frequently searched strings, such as accession numbers and entry IDs, this sluggishness can be an annoyance.

To circumvent these problems, Darwin offers routines to create grid files. A grid file $\ldots$ DESCRIPTION GOES HERE

We present a short Darwin program that creates a grid file indexed by the AC (accession number) and ID (identification) fields of the Swiss-Prot database. An extended version of what follows can be found in the Darwin library function CreateSpGrid.






Calling Sequence:
CreateSpGrid(filename)
Parameters:
filename : name

Returns: Builds a grid file indexed by the ID (identification) and AC (accession number) fields of the database referenced by the system variable DB, and stores this grid file in the external file filename.



  IndexDB := proc( filename : name)
    
    # create a structure of type GridfileFormat describing each record
    format := SetGridFile(ID=string, AC=string, Start=integer, End=integer);

    gf := CreateGrid(filename, format);
    
    for i from 1 to DB[TotEntries] do
      entry := GFstructure();
      holder := op(String(Entry(i)));          # text of the i-th entry
      entry['ID'] := SearchTag('ID', holder);
      AC := SearchTag('AC', holder);
      # keep only the first accession number, i.e. the text before the first ';'
      entry['AC'] := AC[1..CaseSearchString(';', AC)];
      # offsets of the entry in DB; field names match those declared in format
      entry['Start'] := DB[Entry, i];
      entry['End'] := If( i < DB[TotEntries], DB[Entry, i+1]-1, DB[TotChars] );
      AddGrid( gf, entry );
    od;
    CompressGrid(gf);
    CloseGrid(gf);
  end:
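
The offset bookkeeping in the loop above can be illustrated outside Darwin. The following is a minimal Python sketch, not Darwin's grid-file implementation: a plain dictionary stands in for the grid file, the tagged entry layout and the helper names search_tag, build_index and fetch are assumptions made for illustration, and the two sample entries are invented placeholders. It shows how each entry's end offset is derived from the start of the next entry, and how only the first accession number (the text before the first ';') is kept as a key.

```python
import re

def search_tag(tag, entry):
    """Return the text between <tag> and </tag>, or '' if absent (assumed layout)."""
    m = re.search(r"<%s>(.*?)</%s>" % (tag, tag), entry, re.S)
    return m.group(1) if m else ""

def build_index(text, entry_starts):
    """Map each ID and first AC to the inclusive (start, end) offsets of its entry."""
    index = {}
    for i, start in enumerate(entry_starts):
        # An entry ends just before the next one starts; the last entry
        # runs to the end of the database text.
        if i + 1 < len(entry_starts):
            end = entry_starts[i + 1] - 1
        else:
            end = len(text) - 1
        entry = text[start:end + 1]
        # Keep only the first accession number: the text before the first ';'.
        accession = search_tag("AC", entry).split(";", 1)[0]
        index[search_tag("ID", entry)] = (start, end)
        index[accession] = (start, end)
    return index

def fetch(text, index, key):
    """Return the entry stored under an ID or accession number."""
    start, end = index[key]
    return text[start:end + 1]

# Invented placeholder entries, not real database records.
entries = [
    "<E><ID>ENTRY_ONE</ID><AC>X00001;</AC><SEQ>AAAA</SEQ></E>\n",
    "<E><ID>ENTRY_TWO</ID><AC>X00002; X00003;</AC><SEQ>CCCC</SEQ></E>\n",
]
text = "".join(entries)

# Record the 0-based offset at which each entry begins.
starts, pos = [], 0
for e in entries:
    starts.append(pos)
    pos += len(e)

index = build_index(text, starts)
```

With such an index, fetch(text, index, 'X00002') retrieves the second entry by a single dictionary lookup rather than by scanning the whole text, which is the kind of saving the ID and AC keys are meant to provide.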


Gaston Gonnet
1998-09-15