How to compute this probability efficiently.

Next: Independence of the position Up: Probability of an evolutionary Previous: Probability of an evolutionary

How to compute this probability efficiently.

Let O be the point which identifies the root of the EPT. To compute the probability of a given position, we use the same idea as Dayhoff et al. [6] i.e. sum over all possible unknown amino acids o, w, x and y and evaluate all the mutation probabilities.

$\begin{displaymath}Pr\{ \mbox{EC} \} = \sum_o f_o \sum_x M^{d_X}_{xo} M^{d_4}_{... ...M^{d_3}_{C y} \sum_w M_{wy}^{d_W} M^{d_1}_{K w} M^{d_2}_{R w} \end{displaymath}$

The interpretation of each part of the product is simple. For each branch between $U \rightarrow V$ we will have a term of the form M^d_V_{v u}, for each unknown internal amino acid at node U we will have a sum for all its possibilities ( $\sum_u$ ) and for the root we have the sum of each possible amino acid times its natural frequency of occurrence. Lower case letters denote the particular amino acid at a given position, e.g. x is the amino acid in a given position of sequence X. The upper case K, R and C indicate the known amino acids at the leaves.

As written, this formula is very expensive to compute, it requires $\vert\Sigma\vert^{k-1}$ products, where k is the number of sequences. We can reorganize this summation by the introduction of new vectors T^X and S^X for each internal or external node X. The value of the vector depends on the position of the root of the tree and it is computed as follows

if X is an external leaf, i.e. a known amino acid, x

$\begin{displaymath}T_{i}^X = \left \{ \begin{array}{ll} 1 & i = x \\ 0 & i \neq x \end{array} \right. \end{displaymath}$

$\begin{displaymath}S^X = M^{d_X} \cdot T^X = M_{*x}^{d_X} \end{displaymath}$

(i.e. S^X is just the $x^{\mbox{th}}$ column of M^d_X).
if X is an internal node, then it is a ternary node. Let Y and Z be the adjacent nodes whose subtrees do not include the root of the tree (away from the root). Then

$\begin{displaymath}T_i^X = \left ( \sum_y M^{d_Y}_{y i} T_{y}^Y \right ) \left ( \sum_z M^{d_Z}_{z i} T_{z}^Z \right ) = S_i^Y \cdot S_i^Z \end{displaymath}$

$\begin{displaymath}S_i^X = \sum_x M_{xi}^{d_X} T_x^X \end{displaymath}$

$\begin{displaymath}S^X = \left ( M^{d_X} \right )^{'} T^X \end{displaymath}$

These recursive definitions can be used if we do the computation from the leaves towards the root. Each S corresponding to an internal node requires $\vert\Sigma\vert^2$ multiplications, and each T requires $\vert\Sigma\vert$ multiplications. For k sequences, each of length m, this requires $m(k-1)\vert\Sigma\vert(\vert\Sigma\vert+1)$ multiplications (420m(k-1) for proteins), which is not inexpensive, but perfectly feasible.

The computation of the probability of the EC is done with the two immediate descendants of the root, call them X and Y.

$\begin{displaymath}Pr\{ \mbox{EC} \} = \sum_i f_i \left ( \sum_x M^{d_X}_{x i} T... ...^{d_Y}_{y i} T^Y_{y} \right ) = \sum_i f_i T_i^O = f \cdot T^O \end{displaymath}$

The PAS at the root of the tree can be computed as follows: for each amino acid at the root compute the probability of such a configuration

$\begin{displaymath}Pr\{ \mbox{aa $i$\space at the root} \} = \mbox{PAS}_i = \lef... ...\right ) \left ( \sum_y M^{d_Y}_{y i} T^Y_{y} \right ) = T_i^O \end{displaymath}$

where X and Y are the descendants of the root. These are probabilities of the entire EC, to find the relative probabilities for each amino acid the PAS is normalized to sum 1, i.e.

$\begin{displaymath}\mbox{PAS} = T^O / \sum_i T_i^O \end{displaymath}$

Next: Independence of the position Up: Probability of an evolutionary Previous: Probability of an evolutionary

Gaston Gonnet
1998-07-14