Map Files and Patricia Trees

If a sequence database (in the correct SGML/Darwin format) has never been loaded before, Darwin reads the entire file into memory, analyzes it, and reorganizes it into a format that is more convenient for the system. Load either the file you named darwinfile in the previous section or the file Sample/SH2 with the ReadDb command.

> ReadDb('Sample/SH2');

Reading 76825 characters from file Sample/SH2
Pre-processing input (peptides)
78 sequences within 78 entries considered
Building new Pat index in file Sample/SH2.tree with 43582 entries
Pat index with 43582 entries
 sorted, from "A</SEQ></E>\n\n<E><ID" to "YYYSPASRAEGPPQCEQVA"
Peptide file(Sample/SH2(76825), 78 entries, 43582 aminoacids)

If the size of the file is at least the value of the system variable mapsize (see the section on Internal Variables), Darwin creates a special file whose name is the name of your flat file with the extension .map appended. This file records information about the contents of the database, including the sequence type, the sequence size, the location of the sequences, and the number of entries. If your file is smaller than mapsize, Darwin recalculates this information each time the database is loaded. To force Darwin to build a map, lower mapsize to a value no greater than the number of characters in the database. In the session below, the first ReadDb builds no map because the file is smaller than mapsize; once mapsize is set to the file size, the next ReadDb creates Sample/SH2.map.

> ReadDb('Sample/SH2');
Reading 76825 characters from file Sample/SH2
Pre-processing input (peptides)
78 sequences within 78 entries considered
Building new Pat index in file Sample/SH2.tree with 43582 entries
Pat index with 43582 entries
 sorted, from "A</SEQ></E>\n\n<E><ID" to "YYYSPASRAEGPPQCEQVA"
Peptide file(Sample/SH2(76825), 78 entries, 43582 aminoacids)

> Set(mapsize=76825);
131072

> ReadDb('Sample/SH2');

Reading 76825 characters from file Sample/SH2
Pre-processing input (peptides)
78 sequences within 78 entries considered
Creating file Sample/SH2.map for mapping
Peptide file(Sample/SH2(76825), 78 entries, 43582 aminoacids)
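
The threshold rule illustrated by this session can be summarized in a short Python sketch. It is only an illustration of the rule as described above, not Darwin code: the function name needs_map_file is hypothetical, and the sketch assumes that the 131072 printed by Set is the value mapsize held before it was changed.

```python
def needs_map_file(file_size, mapsize):
    """Return True when a .map file would be created under the stated rule.

    Assumption: the rule is simply 'file size is at least mapsize',
    as described in the text; this is not Darwin's own code.
    """
    return file_size >= mapsize


SH2_SIZE = 76825          # characters reported for Sample/SH2
ASSUMED_DEFAULT = 131072  # value printed by Set, assumed to be the old mapsize

before = needs_map_file(SH2_SIZE, ASSUMED_DEFAULT)  # file smaller than mapsize
after = needs_map_file(SH2_SIZE, 76825)             # mapsize lowered to file size
```

Under these assumptions, before is False and after is True, which matches the absence and then the appearance of the "Creating file Sample/SH2.map for mapping" line.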

Darwin also places the contents of the SEQ field of each entry into a special data structure called a Patricia tree. The ReadDb command reports this in the lines:

Building new Pat index in file Sample/SH2.tree with 43582 entries
Pat index with 43582 entries

This structure allows fast searching and matching operations on the database. Darwin stores the Patricia tree in a file whose name is the flat file name with the extension .tree appended. From now on, every time this database is loaded, the .tree file is loaded as well.

With large databases, the Patricia tree may take a long time to build, and the resulting .tree file may be much larger than the database itself. Although the Patricia tree speeds up searching and matching operations on the database, it is optional. If you do not want Darwin to build it, create an empty file named database.tree in the same directory as your sequence database, where database stands for the name of your database file. The easiest way to do this in Unix is with the touch command:

% touch database.tree

Now, when you reload your database, Darwin detects the empty .tree file and does not attempt to rebuild the tree.
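
On systems without the touch command, the same empty file can be created in other ways. The following Python sketch is one possibility; the helper name create_empty_tree_file and the placeholder file are hypothetical, and the demonstration runs in a temporary directory rather than next to a real database.

```python
import tempfile
from pathlib import Path


def create_empty_tree_file(database_path):
    """Create an empty '<database_path>.tree' file and return its path.

    Like touch, this leaves an already existing file untouched.
    """
    tree_path = Path(str(database_path) + ".tree")
    tree_path.touch(exist_ok=True)
    return tree_path


# Demonstration in a throwaway directory with a placeholder database file.
with tempfile.TemporaryDirectory() as workdir:
    db = Path(workdir) / "database"
    db.write_text("placeholder, not a real Darwin database\n")
    tree = create_empty_tree_file(db)
    created_name = tree.name            # name of the new file
    created_size = tree.stat().st_size  # size in bytes, 0 for an empty file
```

As with touch, the file must be in place before the database is loaded for the first time.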

The Patricia tree is a type of binary tree whose properties allow fast and efficient searches, even for very long strings or strings of unbounded length. Most standard textbooks on algorithms and data structures give an introduction to it; see, among others, [14] for further information.
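
To give a feel for how such an index answers substring queries, here is a minimal Python sketch. It is a simplification and not Darwin's implementation: it builds a character-level compressed trie over all suffixes of a single string, whereas a true Patricia tree is binary and branches on individual bits. The names Node, build_index and search, and the sample string, are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}   # first character of edge -> (edge label, child node)
        self.positions = []  # start offsets of suffixes that end exactly here


def insert(root, text, pos):
    """Insert the suffix text[pos:] into the compressed trie."""
    node, s, i = root, text[pos:], 0
    while i < len(s):
        c = s[i]
        if c not in node.children:
            leaf = Node()
            leaf.positions.append(pos)
            node.children[c] = (s[i:], leaf)
            return
        label, child = node.children[c]
        k = 0  # length of the common prefix of the edge label and s[i:]
        while k < len(label) and i + k < len(s) and label[k] == s[i + k]:
            k += 1
        if k < len(label):
            # Split the edge at the point where the two strings diverge.
            mid = Node()
            mid.children[label[k]] = (label[k:], child)
            node.children[c] = (label[:k], mid)
            child = mid
        node, i = child, i + k
    node.positions.append(pos)


def build_index(text):
    root = Node()
    for pos in range(len(text)):
        insert(root, text, pos)
    return root


def search(root, pattern):
    """Return the sorted start offsets of every occurrence of pattern."""
    node, i = root, 0
    while i < len(pattern):
        c = pattern[i]
        if c not in node.children:
            return []
        label, child = node.children[c]
        n = min(len(label), len(pattern) - i)
        if label[:n] != pattern[i:i + n]:
            return []
        node, i = child, i + n
    # Every suffix stored below this node starts with the pattern.
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        found.extend(current.positions)
        stack.extend(child for _, child in current.children.values())
    return sorted(found)


index = build_index("GATTACA")
hits = search(index, "A")  # offsets where "A" occurs
```

The walk from the root inspects at most as many characters as the pattern contains, regardless of how long the indexed text is, which is the property that makes this family of structures attractive for searching long sequences.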

The ReadDb command returns an object of the built-in structured type database. By default, this structure is assigned to the system variable DB, but you may assign it to any Darwin name.

> ReadDb('Sample/SH2');
> type(DB, database);
> SH2 := ReadDb('Sample/SH2');
> type(SH2, database);

Gaston Gonnet
1998-09-15