Next: Scoring Pairwise Sequence Alignments Up: Evaluation Measures of Multiple Previous: Evaluation Measures of Multiple

Introduction

One of the most important applications of the data being generated from genome sequencing projects is to reconstruct the evolutionary histories of proteins, nucleic acids, and the organisms that carry them. A model for an evolutionary history typically consists of three parts: (a) an evolutionary tree, which shows the relationships of evolutionary objects (proteins, for example), (b) a multiple sequence alignment (MSA), which shows the evolutionary relationships between parts of these objects (in this case, individual amino acids in the protein sequence), and (c) reconstructed ancestral sequences, models for objects that were intermediates in the evolution of the family. Evolutionary histories are important in deducing biological function in biomolecules, predicting the folded conformation of protein sequences, and reconstructing the history of life on Earth.

Figure: The sequences $\{A, B, C, D, E\}$ of the MSA on the left are related by the evolutionary tree on the right.


A: RPCVCP___VLRQAAQ__QVLQRQIIQGPQQLRRLF_AA
B: RPCACP___VLRQVVQ__QALQRQIIQGPQQLRRLF_AA
C: KPCLCPKQAAVKQAAH__QQLYQGQLQGPKQVRRAFRLL
D: KPCVCPRQLVLRQAAHLAQQLYQGQ____RQVRRAF_VA
E: KPCVCPRQLVLRQAAH__QQLYQGQ____RQVRRLF_AA

$\epsfig{file=tree1.eps,height=0.08\textheight,angle=0}$

Example: In the MSA in Figure 1 five sequences are aligned (n=5). In this example we assume that the sequences are related and that we have the correct tree . The internal nodes represent unknown ancestor sequences.

Definition 1.1 Given is a set of sequences $S = \{s_1, .., s_n\}$ with $s_i \in \Sigma^*$ where $\Sigma$ is a finite alphabet. A Multiple Sequence Alignment (MSA) consists of a set of sequences $\mbox{\rm A} = \langle a_1, a_2, .., a_n\rangle$ with $a_i \in \Sigma^{'*}$ where $\Sigma' = \Sigma \cup \{''\_''\} \not \in \Sigma$ . $\forall a_i \in \mbox{\rm A}: \vert a_i\vert=k$ . The sequence obtained from $a_i \in \mbox{\rm A}$ by removing all $''\_''$ gap characters is equal to s_i.

Definition 1.2 The character $''\_''$ or any contiguous sequence of $''\_''$ within an aligned sequence $a_i \in \mbox{\rm A}$ is called a gap. A gap corresponds to an insertion or deletion event (indel).

Definition 1.3 The tree T(S)=(V, E, S) is a binary, leaf labeled tree with leafset $S = \{s_1, .., s_n\}$ .

Definition 1.4 An tree scoring function is a function $F: T \rightarrow I \!\! R$ .

In our context a tree T(S) associated with a set of sequences $S = \{s_1, .., s_n\}$ is the tree that corresponds to the evolutionary history of the sequences of S. The internal nodes V represent (usually unknown) ancestor sequences. Constructing trees, MSAs, and ancestral sequences encounters three sorts of problems. First, all are dependent on a model for evolutionary processes. At the level of the protein (which is our exclusive concern here), modeling the evolutionary history of a family must begin with assumptions about the frequencies of amino acid mutations, insertions, and deletions. These frequencies are now available from empirical studies of protein sequences [Gonnet et al., 1992]. In this work, a simple model, derived originally from work of Margaret Dayhoff [Dayhoff et al., 1978] and subsequently amplified, are used [Benner et al., 1993]. Second, constructing models for evolutionary histories encounters problems of mathematical complexity. Most versions of the MSA problem [Carillo and Lipman, 1988,Gupta et al., 1995,Gupta et al., 1996,Kececioglu, 1993,Wang and Gusfield, 1996,Roui and Kececioglu, 1998,Jiang et al., 1996,Sankoff and Cedergren, 1983] and the exact construction of evolutionary trees [Sankoff, 1975,Dress and Steel, 1993,Estabrook et al., 1975,Estabrook et al., 1976,Felsenstein, 1973,Felsenstein, 1981,Thorne et al., 1993] are NP-complete [Foulds and Graham, 1982,Jiang and Wang, 1994], and becomes computationally expensive for many real protein families encountered in a contemporary database. This requires that virtually all MSAs and evolutionary trees that will be used in the post-genomic era will be constructed using approximate heuristics.

Definition 1.5 An MSA scoring function is a function $F: \mbox{\rm A}\rightarrow I \!\! R$ .

Definition 1.6 Let $\cal{A}$ be the set of all possible MSAs that can be generated for a given set of sequences $S=\{s_1, s_2, .., s_n\}$ . The optimal MSA $\bar{\mbox{\rm A}} \in \cal{A}$ is an MSA such that w.l.o.g. $F(\bar{\mbox{\rm A}}) = \max\limits_{\mbox{\rm A}\in \cal{A}}$ $F(\mbox{\rm A})$ . In some scoring functions the minimum is the optimal.

Problem 1.1 (MSA problem) Given is a set of sequences $S = \{s_1, .., s_n\}$ . Find the optimal MSA $\mbox{\rm A}$ for S.

The use of heuristics in constructing MSAs creates a third problem, one centering on evaluation. Before using an MSA or tree built by a heuristic, one would like to know approximately how closely the heuristic has approximated an optimum MSA or tree. Even today, it is common for biochemists to evaluate by eye (and adjust by hand) the output of MSA tools. This is a clearly inadequate approach for any systematic reconstruction of natural history using genomic sequence data. A formal method for judging the quality of an MSA is needed. Accordingly, a variety of groups have proposed or used scoring functions [Sankoff and Cedergren, 1983,Altschul, 1989,Thompson et al., 1994,Higgins and Sharp, 1989,Carillo and Lipman, 1988,Gupta et al., 1996] that assess the quality of an MSA. In this paper we are interested in MSAs when no tree is available. In this case, the most commonly used function follows a simple approach that examines every pair of proteins in the family, generates a score for each pairwise alignment using a Dayhoff matrix [Dayhoff et al., 1978], and creates a score for the MSA by summing each of the scores of the pairwise alignments. We shall call these ``sum of pairs'' (SP) methods.

Next: Scoring Pairwise Sequence Alignments Up: Evaluation Measures of Multiple Previous: Evaluation Measures of Multiple

Chantal Korostensky
1999-07-14