If a sequence database (in the correct SGML/Darwin format) has never been loaded before, Darwin loads the entire contents of the file into memory, where it is analyzed and reorganized into a format more convenient for the system. Using the name you chose for darwinfile in the previous section, or the file Sample/SH2, load the database with the ReadDb command.
> ReadDb('Sample/SH2');
Reading 76825 characters from file Sample/SH2
Pre-processing input (peptides)
78 sequences within 78 entries considered
Building new Pat index in file Sample/SH2.tree with 43582 entries
Pat index with 43582 entries sorted, from "A</SEQ></E>\n\n<E><ID" to "YYYSPASRAEGPPQCEQVA"
Peptide file(Sample/SH2(76825), 78 entries, 43582 aminoacids)
If the size of the file is at least the value assigned to the system variable mapsize (see the section Internal Variables), a special file is created with the extension .map added to the name of your flat file. This file contains information about the contents of the database, including sequence type, sequence size, location of the sequences, and the number of entries.
If the size of your file does not reach the lower bound defined by mapsize, Darwin recalculates this information each time the database is loaded.
If you would like to force Darwin to build such a map, you need only lower mapsize to a value no greater than the number of characters contained in the database.
> ReadDb('Sample/SH2');
Reading 76825 characters from file Sample/SH2
Pre-processing input (peptides)
78 sequences within 78 entries considered
Building new Pat index in file Sample/SH2.tree with 43582 entries
Pat index with 43582 entries sorted, from "A</SEQ></E>\n\n<E><ID" to "YYYSPASRAEGPPQCEQVA"
Peptide file(Sample/SH2(76825), 78 entries, 43582 aminoacids)
> Set(mapsize=76825);
131072
> ReadDb('Sample/SH2');
Reading 76825 characters from file Sample/SH2
Pre-processing input (peptides)
78 sequences within 78 entries considered
Creating file Sample/SH2.map for mapping
Peptide file(Sample/SH2(76825), 78 entries, 43582 aminoacids)
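The threshold rule described above can be restated as a small sketch. This is Python, not Darwin, and it does not reproduce Darwin's internals: the function name map_file_is_written is hypothetical, and the value 131072 is only assumed to be the earlier mapsize setting because it is what Set printed in the session above.

```python
# Illustrative only: restates the documented mapsize rule, not Darwin's code.
# 131072 is the value Set(mapsize=...) printed in the session above; treating
# it as the earlier mapsize setting is an assumption.
ASSUMED_PREVIOUS_MAPSIZE = 131072


def map_file_is_written(file_size, mapsize=ASSUMED_PREVIOUS_MAPSIZE):
    """Return True when a .map file would be created under the stated rule.

    The rule from the text: a .map file is created if the size of the
    database file is at least mapsize; otherwise the information is
    recalculated on every load.
    """
    return file_size >= mapsize


sh2_size = 76825  # characters in Sample/SH2, as reported by ReadDb

# Below the threshold: no .map file, information recalculated on each load.
before = map_file_is_written(sh2_size)

# After Set(mapsize=76825) the file size equals the threshold.
after = map_file_is_written(sh2_size, mapsize=76825)
```

Under this reading, before is False and after is True, which matches the session: the .map file appears only once mapsize has been lowered to the size of the file.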
Darwin also places the contents of the SEQ field of each entry into a special data structure called a Patricia tree. The ReadDb command reports this in the lines:
Building new Pat index in file Sample/SH2.tree with 43582 entries
Pat index with 43582 entries
This structure facilitates fast searching and matching operations on the database. Darwin stores the Patricia tree in a file named with a .tree extension added to the flat file name. From now on, every time this database is loaded, the .tree file is loaded as well.
With large databases, the Patricia tree may take a long time to build, and the resulting .tree file may be much larger than the database itself. Although the Patricia tree speeds up searching and matching operations performed on the database, it is optional. If you decide that you do not want Darwin to build such a structure, create an empty file named database.tree in the same directory as your sequence database, where database stands for the name of your database file. The easiest way to do this in Unix is with the touch command:
% touch database.tree
Now, when you reload your database, Darwin acknowledges the existence of the empty .tree file and does not attempt to rebuild the tree.
The Patricia tree itself is a type of binary tree with properties that allow fast and efficient searches over very long or unbounded-length strings. Most standard textbooks on algorithms and data structures give an introduction to it; see, among others, [14] for further information.
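To give a feel for why an index over sequence positions speeds up searching, here is a minimal sketch in Python (not Darwin). It is a deliberately simplified stand-in: a sorted list of suffix start positions searched by binary search, not a real Patricia tree and not Darwin's implementation. The function names and the peptide string are made up for illustration. The idea of one sorted entry per position is at least consistent with the ReadDb output, which reports as many index entries as amino acids and describes them as sorted.

```python
# Simplified stand-in for a Patricia tree: a sorted list of suffix positions.
# All names and the sample sequence are hypothetical.


def build_suffix_index(text):
    """Return all start positions of text, ordered by the suffix that
    begins at each position (lexicographic order)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, index, pattern):
    """Return the sorted start positions at which pattern occurs in text.

    Because the suffixes are sorted, all suffixes that start with pattern
    are adjacent in the index, so a binary search locates the first one.
    """
    k = len(pattern)
    lo, hi = 0, len(index)
    while lo < hi:
        mid = (lo + hi) // 2
        # Compare only the first k characters of the suffix with the pattern.
        if text[index[mid]:index[mid] + k] < pattern:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    # Collect the run of adjacent suffixes that begin with the pattern.
    while lo < len(index) and text[index[lo]:index[lo] + k] == pattern:
        hits.append(index[lo])
        lo += 1
    return sorted(hits)


sequence = "MKVLAAGVKL"  # made-up peptide, not taken from Sample/SH2
index = build_suffix_index(sequence)
positions_of_L = find_occurrences(sequence, index, "L")
```

Building the index costs time and space up front, which mirrors the trade-off described above for the .tree file; once it exists, each lookup inspects only a logarithmic number of entries before reading off the matches. A real Patricia tree achieves comparable lookups without storing or comparing whole suffixes.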
The ReadDb command returns an object of the built-in structured type database. By default, this structure is assigned to the system variable DB but we may assign it to any Darwin name.
> ReadDb('Sample/SH2');
> type(DB, database);
> SH2 := ReadDb('Sample/SH2');
> type(SH2, database);