Next: All versus All Up: Searching for Genes Previous: Searching for hypothetical peptides

Direct nucleotide versus peptide alignment

To find the probable location of sequencing errors, the gap between the chunks and the most similar protein, we want to align the RNA sequence built from both chunks directly with all the proteins that match both chunks, using nucleotide versus peptide dynamic programming.

While peptide versus peptide functions deal with a single peptide database in variable DB, all nucleotide versus peptide (NP) functions require two databases to be loaded: a nucleotide database, accessible by variable NucDB, and a peptide database, accessible by variable PepDB.

Most peptide versus peptide functions have a corresponding NP function. The structure that holds an NP match is called NucPepMatch, finding the best possible alignment for a given pam distance is called LocalAlignNucPep, finding the best possible alignment at the optimal pam distance is called BestLocalAlignNucPep.

> print (NucPepMatch, LocalAlignNucPep, BestLocalAlignNucPep);
NucPepMatch: Usage: NucPepMatch(  )
Structure to hold nucleotide/peptide matches.
- Selectors:
  Sim: real, NucOffset: integer, PepOffset: integer,
  NucLength: integer, PepLength: integer, PamNumber: real,
  PamVariance: real, IntronScoring: {function, string, 0},
  NucGaps: list(posint..posint), PepGaps: list(posint..posint),
  Introns: list([real, posint..posint])
- Formats:
  NucPepMatch (NucEntries: structure, PepEntries: structure)
  NucPepMatch (NucOffset,PepOffset)
  NucPepMatch (Sim,NucOffset,PepOffset,NucLen,PepLen)
  NucPepMatch (Sim,NucOffset,PepOffset,NucLen,PepLen,PamNumber)
  NucPepMatch (Sim,NucOffset,PepOffset,NucLen,PepLen,PamNumber,PamVariance)
  NucPepMatch (Sim,NucOffset,PepOffset,NucLen,PepLen,PamNumber,PamVariance,
          IntronScoring())
  NucPepMatch (Sim,NucOffset,PepOffset,NucLen,PepLen,PamNumber,PamVariance,
          IntronScoring(),NucGaps,PepGaps,Introns).
LocalAlignNucPep: Usage: LocalAlignNucPep( npm:NucPepMatch, D:DayMatrix )
Return the NucPepMatch between the nucleotide and the peptide of npm with the 
  highest score.
BestLocalAlignNucPep: Usage: BestLocalAlignNucPep( m:NucPepMatch )
Apply LocalAlignNucPep and FindNucPepPam until a maximum is found.

To do nucleotide versus peptide dynamic programming, the actual steps are:

1.

Load a peptide database into variable PepDB. Since we already loaded SwissProt into DB, we can simply assign it.

> PepDB := DB:

2.

Find the offset of the peptide sequence to be aligned. As an example we want to align chunk1 with the peptide sequence that matches its first reading frame only.

> first := e[1]: for s in e[2..6] do first := first minus s od:
> pepOffset := Sequence (Entry(first[1]))[1]:

3.

Load a nucleotide database into variable NucDB. Even if we will not use a sequence from it right now, we load the funghi division of the EMBL database. We will use funghi sequences later on in this Darwin session.

> NucDB := ReadDb('DB/EMBL.fun'):

4.

Find the offset of the nucleotide sequence to be aligned. In our example, the sequence is in chunk1, which is not in the database. We find the offset of this chunk with respect to EMBL.fun.

> nucOffset := GetOffset (chunk1, NucDB):

5.

Create an NucPepMatch from these offsets.

> m := NucPepMatch (nucOffset, pepOffset);
m := NucPepMatch(-940568940,25135271)

6.

Use one of the NP dynamic programming functions to align the sequences. The most convenient way to create the Dayhoff matrices required for those is

> CreateDayMatrices ();

To align the sequences at 250 pam, use

> LocalAlignNucPep (m, DM);
NucPepMatch(65.8523,-940568787,25136030,150,50,250,0)

To get the optimal alignment at optimal pam, use

> m1 := BestLocalAlignNucPep(m);
m1 := NucPepMatch(66.4941,-940568787,25136030,150,50,235.6046,2687.1169,0,[],[
],[])

Note that nucleotide versus peptide dynamic programming aligned only 50 amino acids yielding a similarity score of 66.5, while 59 amino acids of the hypothetical sequence were aligned with a similarity score of 113.7. This is because the hypothetical peptide contains an X that aligns at zero cost where the nucleotide contains a stop codon that would contribute

> m1[StopSimil];
Error, (in NPMatch_select) Invalid selector, StopSimil

to the total score. This is an example how aligning hypothetical peptides fails in handling stop codons. In fact, this doubtful match would not have been considered as a candidate by NP dynamic programming because it would not have reached the similarity threshold used above.

7.

Print the NucPepMatch.

> print (m1);

See figure

. Both percent identity and similarity are rather small, the estimates of pam distance and variance are large. All this indicates a very doubtful match.

The first two lines of a printed NP match contain information about the alignment, the next lines information about the sequences in the database:

offsets are the database offsets of the sequences,
lengths are the lengths of the aligned nucleotide and peptide sequences,
simil is the similarity score obtained from dynamic programming,
pam is an estimate of the PAM distance between the two sequences,
var is an estimate of the variance of the PAM distance,
StopSimil is the similarity score used for aligning stop codons,
codons is the number of amino acids/codons that were aligned,
stops is the number of stop codons that were aligned,
identity is the percentage of aligned positions which are identical,
similarity is the percentage of aligned positions not identical but with a positive value in the Dayhoff matrix.

Then the alignment is shown in five rows:

the coordinates of the original nucleotide sequence (first base is 1),
the nucleotide sequence itself,
the hypothetical peptide sequence imposed by the alignment, marking the three bases of the codon with a <, the amino acid itself (or $ for a stop codon), and a >,
the quality of the match indicated by a character just like in peptide alignments:

| exact match

! very good match

: good match

. poor match

very poor match
the peptide sequence from the database.

For nucleotide indels longer than 4 bases, the length is shown in parentheses.

We now want to find the optimal alignment/pam of the concatenated chunks with all the entries in me.

> chunk := chunk1.chunk2:
> chunkOffset := GetOffset (chunk, NucDB):

The second statement again finds the offset of our chunk with respect to NucDB. We now are ready to BestLocalAlignNucPep our RNA sequence with all SwissProt entries we are interested in. For each sequence offset s defined by the list of entries in me, we create an NucPepMatch m of the nucleotide at offset chunkOffset (our two concatenated chunks) with the peptide at offset s, then BestLocalAlignNucPep the match and append it to the list of results. For Sequence to work on entries in PepDB, we need to assign it to DB:

> DB := PepDB:
> res := []:
> for s in Sequence (me) do
>   m := NucPepMatch (chunkOffset, s);
>   res := append (res, BestLocalAlignNucPep(m))
> od:

We sort the matches by similarity score (highest first):

> res := sort (res, x -> -x[Sim]);
res := [NucPepMatch(1504.9172,-940149966,8954634,423,150,4.1652,3.0586,0,[100
..-1, 198..-1, 275..-1, 313..-1, 390..390],[106..113],[]), NucPepMatch(
376.3618,-940149966,8946239,423,152,88.2429,122.6461,0,[100..-1, 198..-1, 
275..-1, 313..-1, 393..393],[86..87, 108..115],[]), NucPepMatch(376.3590,
-940149966,8945323,423,152,87.5727,121.5655,0,[100..-1, 198..-1, 275..-1, 
313..-1, 387..387],[86..87, 108..115],[]), NucPepMatch(350.2784,-940149966,
8944425,423,152,92.4499,133.0105,0,[100..-1, 198..-1, 239..-1, 313..-1, 
390..390],[81..81, 100..100, 108..115],[]), NucPepMatch(350.1464,-940149966,
8953764,420,152,94.6645,140.3467,0,[100..-1, 198..-1, 275..-1, 313..-1, 
393..393],[86..88, 109..116],[]), NucPepMatch(336.6448,-940149966,8950944,423,
153,84.4001,115.0486,0,[70..-1, 76..78, 86..86, 100..-1, 198..-1, 233..235
, 275..-1, 313..-1, 390..390],[24..25, 87..89, 109..116],[]), NucPepMatch(
330.6828,-940149966,8947208,423,153,88.6275,124.4639,0,[100..-1, 198..-1, 
233..235, 275..-1, 313..-1, 390..390],[22..22, 87..89, 109..116],[]), 
NucPepMatch(307.8176,-940149966,8949040,420,157,87.1001,118.8091,0,[79..-1, 86
..86, 100..-1, 198..-1, 243..244, 259..-2, 275..-1, 313..-1, 387..387],[27
..27, 74..77, 92..94, 114..121],[]), NucPepMatch(290.7973,-940149966,8948096,
423,155,102.2532,156.4797,0,[100..-1, 208..209, 275..-1, 313..-1, 393..393
],[74..77, 89..90, 111..118],[])]

By printing the best match, we see that there are probably sequencing errors at positions 102-103, 200-201, 277-278, 392 and that about 25 bases are missing between the two chunks. See figure

> print (res[1]);

Of course, we could have been searching the database with LocalAlignNucPep only, without first looking for hypothetical peptides. However, scanning an entire protein database with NP dynamic programming is more expensive than scanning it with peptide versus peptide dynamic programming over the hypothetical proteins. The following simple experiment reveals this: We first measure the time it requires to align the three hypothetical peptides with the peptide of the best match above. To make it use a significant amount of time, we repeat it 10 times.

> ofs := Sequence (me)[1]:
> PepTime := time ():
> for rep to 10 do
>   for i to 3 do
>     LocalAlign (Match (GetOffset (pep[i]), ofs), DM)
>   od
> od:
> PepTime := time () - PepTime;
PepTime := 0.6900

We then measure the time it requires to do the same with direct NP alignment.

> ofs1 := GetOffset (chunk1, NucDB):
> NucTime := time ():
> for rep to 10 do
>   LocalAlignNucPep (NucPepMatch (ofs1, ofs), DM)
> od:
> NucTime := time () - NucTime;
NucTime := 15.5600

NP dynamic programming is therefore about 23 times slower than standard peptide versus peptide dynamic programming. Of course, it is much more accurate and has the following advantages:

Sequencing errors do not damage the entire alignment.
Nucleotide indels need not be aligned to codon boundaries.
Stop codons are handled properly. In NP alignments, the similarity score for matching an amino acid with a stop codon is a parameter of a match (called StopSimil).
Introns causing frame shifts can be detected.

As an example for intron detection, we align the gene for ligninase isozyme H8 of Phanerochaete chrysosporium (in EMBL.fun) with the corresponding protein (in SwissProt).

> DB := NucDB:  SearchDb ('ligninase', 'h8', 'chrysosporium');
Entry(2170,2171)

> nucOffset := Sequence (")[1]:
> DB := PepDB:  SearchDb ('ligninase', 'h8', 'chrysosporium');
Entry(20766)

> pepOffset := Sequence (")[1]:

> print (BestLocalAlignNucPep (NucPepMatch ((nucOffset, pepOffset))));

The result is shown in figure

. The GT-AG rule for introns states that introns start with the dinucleotide GT and end with AG. Keeping this in mind, we immediately see where exactly the introns are. Note that 4 of the 7 introns cause frame shifts and hence frame encoding and peptide versus peptide alignment will never work.

In our current model, introns are scored as indels. If an intron becomes too long, the scoring for it as an indel may avoid inclusion of an exon and shorten the match. An example is the gene for yeast's ribosomal protein L29. We want to align this gene with the protein it encodes. We first search for the gene in NucDB.

> DB := NucDB:  SearchDb ('yeast', 'ribosom', 'L29');
Entry(1823,1824,3231,3361,4639,4640,6187)

> PrintInfo (", 'DE', 'AC');
Entry 1823:
  DE=Neurospora crassa crp-1 gene for ribosomal protein homologous to
     yeast cyH2 gene encoding L29
  AC=X06320;
Entry 1824:
  DE=Neurospora crassa crp-1 mRNA for ribosomal protein homologous to
     yeast cyH2 gene encoding L29
  AC=X13254;
Entry 3231:
  DE=S.cerevisiae chromosome III complete DNA sequence
  AC=X59720; S43845; S49180; S58084; S93798;
Entry 3361:
  DE=Yeast CYH2 gene for ribosomal protein L29
  AC=X01573; K01162;
Entry 4639:
  DE=S.cerevisiae CYH2 gene encoding ribosomal protein L29, 5' end.
  AC=M19490;
Entry 4640:
  DE=yeast (S.cerevisiae) ribosomal protein L29, (CYH2) gene, complete
     cds.
  AC=K01162;
Entry 6187:
  DE=S.pombe rpgL29 gene for ribosomal protein L29
  AC=X57207;

Since this gene is contained several times in the database, we choose the one with accession number X01573.

> nucEntry := Entry(3361):
> nucOffset := Sequence (nucEntry)[1]:

To find the SwissProt entry expressed by this gene, we look at the DR tag of the EMBL entry. It contains references to other databases.

> PrintInfo (nucEntry, 'DR');
Entry 3361:
  DR=SWISS-PROT; P02406; RL2A_YEAST.

> DB := PepDB:  pepEntry := SearchDb ('P02406;');
pepEntry := Entry(31443)

> pepOffset := Sequence (pepEntry)[1]:
> m := NucPepMatch (nucOffset, pepOffset):

Our first attempt is to get a single best alignment at distance 1 pam (remember that we are aligning a protein with its gene).

> pam1 := SearchDayMatrix (1, DMS);
pam1 := DayMatrix(Peptide, pam=1.00432, Sim: max=18.820, min=-37.514,
 max offdiag=-11.346, del=-37.626-1.396*(k-1))

> m1 := LocalAlignNucPep (m, pam1);
m1 := NucPepMatch(1648.1674,7042031,21681479,399,133,1.0043,0)

Next: All versus All Up: Searching for Genes Previous: Searching for hypothetical peptides

Gaston Gonnet
1998-09-15

`\|`	exact match
`!`	very good match
`:`	good match
`.`	poor match
	very poor match