To find the probable location of sequencing errors, the gap between the chunks and the most similar protein, we want to align the RNA sequence built from both chunks directly with all the proteins that match both chunks, using nucleotide versus peptide dynamic programming.
While peptide versus peptide functions deal with a single peptide database
in variable DB
,
all nucleotide versus peptide (NP) functions require two databases to be
loaded: a nucleotide database, accessible by variable NucDB
, and
a peptide database, accessible by variable PepDB
.
Most peptide versus peptide functions have a corresponding NP function.
The structure that holds an NP match is called NucPepMatch
,
finding the best possible alignment for a given pam distance is called
LocalAlignNucPep
, finding the best possible alignment at the
optimal pam distance is called BestLocalAlignNucPep
.
> print (NucPepMatch, LocalAlignNucPep, BestLocalAlignNucPep); NucPepMatch: Usage: NucPepMatch( ) Structure to hold nucleotide/peptide matches. - Selectors: Sim: real, NucOffset: integer, PepOffset: integer, NucLength: integer, PepLength: integer, PamNumber: real, PamVariance: real, IntronScoring: {function, string, 0}, NucGaps: list(posint..posint), PepGaps: list(posint..posint), Introns: list([real, posint..posint]) - Formats: NucPepMatch (NucEntries: structure, PepEntries: structure) NucPepMatch (NucOffset,PepOffset) NucPepMatch (Sim,NucOffset,PepOffset,NucLen,PepLen) NucPepMatch (Sim,NucOffset,PepOffset,NucLen,PepLen,PamNumber) NucPepMatch (Sim,NucOffset,PepOffset,NucLen,PepLen,PamNumber,PamVariance) NucPepMatch (Sim,NucOffset,PepOffset,NucLen,PepLen,PamNumber,PamVariance, IntronScoring()) NucPepMatch (Sim,NucOffset,PepOffset,NucLen,PepLen,PamNumber,PamVariance, IntronScoring(),NucGaps,PepGaps,Introns). LocalAlignNucPep: Usage: LocalAlignNucPep( npm:NucPepMatch, D:DayMatrix ) Return the NucPepMatch between the nucleotide and the peptide of npm with the highest score. BestLocalAlignNucPep: Usage: BestLocalAlignNucPep( m:NucPepMatch ) Apply LocalAlignNucPep and FindNucPepPam until a maximum is found.To do nucleotide versus peptide dynamic programming, the actual steps are:
PepDB
. Since we
already loaded SwissProt into DB
, we can simply assign it.> PepDB := DB:
chunk1
with the peptide sequence
that matches its first reading frame only. > first := e[1]: for s in e[2..6] do first := first minus s od: > pepOffset := Sequence (Entry(first[1]))[1]:
NucDB
. Even if
we will not use a sequence from it right now, we load the funghi division
of the EMBL database. We will use funghi sequences later on in
this Darwin session.> NucDB := ReadDb('DB/EMBL.fun'):
chunk1
, which is not in the database.
We find the offset of this chunk with respect to EMBL.fun
.> nucOffset := GetOffset (chunk1, NucDB):
NucPepMatch
from these offsets.> m := NucPepMatch (nucOffset, pepOffset); m := NucPepMatch(-940568940,25135271)
> CreateDayMatrices ();To align the sequences at 250 pam, use
> LocalAlignNucPep (m, DM); NucPepMatch(65.8523,-940568787,25136030,150,50,250,0)To get the optimal alignment at optimal pam, use
> m1 := BestLocalAlignNucPep(m); m1 := NucPepMatch(66.4941,-940568787,25136030,150,50,235.6046,2687.1169,0,[],[ ],[])Note that nucleotide versus peptide dynamic programming aligned only 50 amino acids yielding a similarity score of 66.5, while 59 amino acids of the hypothetical sequence were aligned with a similarity score of 113.7. This is because the hypothetical peptide contains an X that aligns at zero cost where the nucleotide contains a stop codon that would contribute
> m1[StopSimil]; Error, (in NPMatch_select) Invalid selector, StopSimilto the total score. This is an example how aligning hypothetical peptides fails in handling stop codons. In fact, this doubtful match would not have been considered as a candidate by NP dynamic programming because it would not have reached the similarity threshold used above.
NucPepMatch
.> print (m1);See figure
<
, the amino acid
itself (or $
for a stop codon), and a >
,
| |
exact match |
! |
very good match |
: |
good match |
. |
poor match |
very poor match |
We now want to find the optimal alignment/pam of the concatenated chunks
with all the entries in me
.
> chunk := chunk1.chunk2: > chunkOffset := GetOffset (chunk, NucDB):The second statement again finds the offset of our chunk with respect to
NucDB
.
We now are ready to BestLocalAlignNucPep
our RNA sequence with all
SwissProt entries we are interested in. For each sequence offset s
defined by the list of entries in me
,
we create an NucPepMatch
m
of the nucleotide at offset
chunkOffset
(our two concatenated chunks) with the peptide at offset
s
, then BestLocalAlignNucPep
the match and append it to the list of
results. For Sequence
to work on entries in PepDB
, we need
to assign it to DB
:> DB := PepDB: > res := []: > for s in Sequence (me) do > m := NucPepMatch (chunkOffset, s); > res := append (res, BestLocalAlignNucPep(m)) > od:We sort the matches by similarity score (highest first):
> res := sort (res, x -> -x[Sim]); res := [NucPepMatch(1504.9172,-940149966,8954634,423,150,4.1652,3.0586,0,[100 ..-1, 198..-1, 275..-1, 313..-1, 390..390],[106..113],[]), NucPepMatch( 376.3618,-940149966,8946239,423,152,88.2429,122.6461,0,[100..-1, 198..-1, 275..-1, 313..-1, 393..393],[86..87, 108..115],[]), NucPepMatch(376.3590, -940149966,8945323,423,152,87.5727,121.5655,0,[100..-1, 198..-1, 275..-1, 313..-1, 387..387],[86..87, 108..115],[]), NucPepMatch(350.2784,-940149966, 8944425,423,152,92.4499,133.0105,0,[100..-1, 198..-1, 239..-1, 313..-1, 390..390],[81..81, 100..100, 108..115],[]), NucPepMatch(350.1464,-940149966, 8953764,420,152,94.6645,140.3467,0,[100..-1, 198..-1, 275..-1, 313..-1, 393..393],[86..88, 109..116],[]), NucPepMatch(336.6448,-940149966,8950944,423, 153,84.4001,115.0486,0,[70..-1, 76..78, 86..86, 100..-1, 198..-1, 233..235 , 275..-1, 313..-1, 390..390],[24..25, 87..89, 109..116],[]), NucPepMatch( 330.6828,-940149966,8947208,423,153,88.6275,124.4639,0,[100..-1, 198..-1, 233..235, 275..-1, 313..-1, 390..390],[22..22, 87..89, 109..116],[]), NucPepMatch(307.8176,-940149966,8949040,420,157,87.1001,118.8091,0,[79..-1, 86 ..86, 100..-1, 198..-1, 243..244, 259..-2, 275..-1, 313..-1, 387..387],[27 ..27, 74..77, 92..94, 114..121],[]), NucPepMatch(290.7973,-940149966,8948096, 423,155,102.2532,156.4797,0,[100..-1, 208..209, 275..-1, 313..-1, 393..393 ],[74..77, 89..90, 111..118],[])]By printing the best match, we see that there are probably sequencing errors at positions 102-103, 200-201, 277-278, 392 and that about 25 bases are missing between the two chunks. See figure
> print (res[1]);Of course, we could have been searching the database with
LocalAlignNucPep
only, without first looking for hypothetical
peptides.
However, scanning an entire protein database with NP dynamic
programming is more expensive than scanning it with peptide versus
peptide dynamic programming over the hypothetical proteins.
The following simple experiment reveals this:
We first measure the time it requires to align the three hypothetical
peptides with the peptide of the best match above. To make it use a
significant amount of time, we repeat it 10 times.> ofs := Sequence (me)[1]: > PepTime := time (): > for rep to 10 do > for i to 3 do > LocalAlign (Match (GetOffset (pep[i]), ofs), DM) > od > od: > PepTime := time () - PepTime; PepTime := 0.6900We then measure the time it requires to do the same with direct NP alignment.
> ofs1 := GetOffset (chunk1, NucDB): > NucTime := time (): > for rep to 10 do > LocalAlignNucPep (NucPepMatch (ofs1, ofs), DM) > od: > NucTime := time () - NucTime; NucTime := 15.5600NP dynamic programming is therefore about 23 times slower than standard peptide versus peptide dynamic programming. Of course, it is much more accurate and has the following advantages:
StopSimil
).
EMBL.fun
) with the corresponding
protein (in SwissProt).> DB := NucDB: SearchDb ('ligninase', 'h8', 'chrysosporium'); Entry(2170,2171) > nucOffset := Sequence (")[1]: > DB := PepDB: SearchDb ('ligninase', 'h8', 'chrysosporium'); Entry(20766) > pepOffset := Sequence (")[1]:
> print (BestLocalAlignNucPep (NucPepMatch ((nucOffset, pepOffset))));The result is shown in figure
In our current model, introns are scored as indels. If an intron becomes too
long, the scoring for it as an indel may avoid inclusion of an exon and
shorten the match. An example is the gene for yeast's ribosomal protein L29.
We want to align this gene with the protein it encodes. We first search
for the gene in NucDB
.
> DB := NucDB: SearchDb ('yeast', 'ribosom', 'L29'); Entry(1823,1824,3231,3361,4639,4640,6187) > PrintInfo (", 'DE', 'AC'); Entry 1823: DE=Neurospora crassa crp-1 gene for ribosomal protein homologous to yeast cyH2 gene encoding L29 AC=X06320; Entry 1824: DE=Neurospora crassa crp-1 mRNA for ribosomal protein homologous to yeast cyH2 gene encoding L29 AC=X13254; Entry 3231: DE=S.cerevisiae chromosome III complete DNA sequence AC=X59720; S43845; S49180; S58084; S93798; Entry 3361: DE=Yeast CYH2 gene for ribosomal protein L29 AC=X01573; K01162; Entry 4639: DE=S.cerevisiae CYH2 gene encoding ribosomal protein L29, 5' end. AC=M19490; Entry 4640: DE=yeast (S.cerevisiae) ribosomal protein L29, (CYH2) gene, complete cds. AC=K01162; Entry 6187: DE=S.pombe rpgL29 gene for ribosomal protein L29 AC=X57207;Since this gene is contained several times in the database, we choose the one with accession number X01573.
> nucEntry := Entry(3361): > nucOffset := Sequence (nucEntry)[1]:To find the SwissProt entry expressed by this gene, we look at the DR tag of the EMBL entry. It contains references to other databases.
> PrintInfo (nucEntry, 'DR'); Entry 3361: DR=SWISS-PROT; P02406; RL2A_YEAST. > DB := PepDB: pepEntry := SearchDb ('P02406;'); pepEntry := Entry(31443) > pepOffset := Sequence (pepEntry)[1]: > m := NucPepMatch (nucOffset, pepOffset):Our first attempt is to get a single best alignment at distance 1 pam (remember that we are aligning a protein with its gene).
> pam1 := SearchDayMatrix (1, DMS); pam1 := DayMatrix(Peptide, pam=1.00432, Sim: max=18.820, min=-37.514, max offdiag=-11.346, del=-37.626-1.396*(k-1)) > m1 := LocalAlignNucPep (m, pam1); m1 := NucPepMatch(1648.1674,7042031,21681479,399,133,1.0043,0)