Next: Direct nucleotide versus peptide Up: Searching for Genes Previous: Searching for Genes

Searching for hypothetical peptides

Since we do not know the correct reading frame, we have to translate both chunks in all three reading frames, resulting in six hypothetical peptides to be searched for.

To do the translation, we write a function NucToPep. It takes a nucleotide sequence as input and returns the peptide sequence it encodes. The universal genetic code is built into Darwin as the function GenCodeToInt .

> print (GenCodeToInt);
GenCodeToInt: Usage: GenCodeToInt( UUU:string )
Genetic code function. Translates 3 bases (three letters or a name of
  length 3) to the corresponding amino acid number (1 to 20, unknown=21,
  stop=22).

Because we want to align all amino acids derived from all three reading frames, even if there are sequencing errors that invalidate parts of the hypothetical peptides, the function continues encoding even if it encounters a stop codon. The stop codon is translated to the unknown amino acid X.

> NucToPep := proc (nuc: string)

Create an empty peptide name of appropriate length.

>   pep := CreateString(trunc (length (nuc) / 3));
>   for i to length (pep) do

Use the genetic code function to translate the codon; translate stop codons to X.

>     p := GenCodeToInt (nuc[3*i-2..3*i]);
>     pep[i] := If (p = 22, 'X', IntToA (p))
>   od;
>   pep
> end:

We now can create an array pep containing all hypothetical peptide sequences our chunks code for.

> pep := [NucToPep(chunk1), NucToPep(1+chunk1), NucToPep(2+chunk1),
>         NucToPep(chunk2), NucToPep(1+chunk2), NucToPep(2+chunk2)]:

The hypothetical peptide derived from the third reading frame of chunk2 is:

> pep[6];
NIGGRYSLWSSIGFPAALALGWEGFQPASRRWRGYGS

Next we search SwissProt for similar proteins and store all the matches with pep[i] in ms[i]. Since we have to assume that there are some sequencing errors, none of our six hypothetical proteins will be real, and we do not expect very good similarity to any known protein. We therefore set the similarity score threshold to 70 at pam 250.

> DB := ReadDb('DB/SwissProt'): 
> DM := CreateDayMatrix (NewLogPAM1, 250):
> ms := CreateArray(1..6):
> for i to 6 do
>   ms[i] := AlignOneAll (pep[i], DB, DM, 70)
> od:

The matches found are best shown in a table with one row per database entry and one column per hypothetical peptide. For each entry and each hypothetical peptide, the table lists the range of aligned positions in the SwissProt protein.

The following code makes use of the fact that the matches in ms[i] are sorted by database offset (and entry number).

Create six counters for the six match lists.

> k := CreateArray(1..6, 1):
> do

Find the current minimum offset of all match lists.

>   ofs := DB[TotChars];
>   for i to 6 do
>     if k[i] <= length (ms[i]) then
>       ofs := min (ofs, ms[i,k[i],Offset1])
>     fi
>   od;

If there is no minimum, we have reached the end of all lists.

>   if ofs = DB[TotChars] then break fi;

Get the entry corresponding to that offset and the offset of its first amino acid.

>   entry := GetEntryNumber(ofs);
>   seq := Sequence (Entry(entry));

If at the beginning, print two title lines.

>   if sum (k) = 6 then
>     printf ('Entry| 0+chunk1| 1+chunk1| 2+chunk1|');
>     printf (' 0+chunk2| 1+chunk2| 2+chunk2|\n');
>   fi;
>   printf ('%5d|', entry);

Print amino acid position range for all matches of this entry.

>   for i to 6 do
>     if k[i] <= length (ms[i]) and
>        GetEntryNumber(ms[i,k[i],Offset1]) = entry then
>       m := ms[i,k[i]];  p := m[Offset1] - seq[1];
>       printf ('%4d-%4d|', p+1, p+m[Length1]);
>       k[i] := k[i] + 1
>     else
>       printf ('         |')
>     fi
>   od;
>   printf ('\n')
> od:
Entry| 0+chunk1| 1+chunk1| 2+chunk1| 0+chunk2| 1+chunk2| 2+chunk2|
 2614|         |   3-  64|         |         |         |         |
 3503|         | 110- 133|         |         |         |         |
 3703|         | 472- 527|         |         |         |         |
 8669|         |   2-  53|         |         |         |         |
 9200|         |         |  27-  79|         |         |         |
10083|         |         |1348-1428|         |         |         |
12881|         | 188- 235|         |         |         |         |
12899|         |         |         |         |         |  58-  82|
12900|         | 183- 221| 151- 184|         |         | 267- 291|
12901|         | 183- 221| 151- 184|         |         | 267- 290|
12902|         | 197- 235| 165- 198|         |         | 281- 306|
12903|         | 188- 224|         |         |         | 270- 294|
12904|         |         | 163- 196|         |         | 282- 307|
12905|         | 226- 278|         |         |         | 322- 345|
12906|         |         |         |         |         | 269- 293|
12907|         | 187- 223|         |         |         | 269- 293|
12909|         |         |         |         |         | 323- 346|
12910|         | 195- 231| 161- 194|         |         | 278- 303|
12911| 170- 228| 168- 240| 136- 239|         |         | 249- 274|
13427|         |         |   1-  42|         |         |         |
14063|         |         |  97- 150|         |         |         |
19427|         |  71- 152|         |         |         |         |
23318|         | 289- 331|         |         |         |         |
23351|         | 289- 331|         |         |         |         |
23900|         | 799- 869|         |         |         |         |
24199|         |1383-1414|         |         |         |         |
25060|         |         |2214-2289|         |         |         |
33357|         |  19-  77|         |         |         |         |
33358|         |  19-  77|         |         |         |         |
33359|         |  19-  77|         |         |         |         |
35217|         |         |         |         |         | 252- 287|
36131|         |         | 198- 291|         |         |         |
36132|         |         |  54- 147|         |         |         |
36680| 750- 809|         |         |         |         |         |
37919|         | 281- 357|         |         |         |         |
37920|         | 281- 357|         |         |         |         |
40388|         |         | 261- 310|         |         |         |
40696|         |         | 589- 670|         |         |         |
43854|         | 738- 787|         |         |         |         |

The table shows us that the hypothetical peptides derived from the second and third reading frame of the first chunk and from the third reading frame of the second chunk are similar to known proteins.

We want to have a closer look at those proteins which match both chunks. The set of entries which match both chunks is the intersection of the set of entries which matches the first chunk and the set of entries which matches the second chunk. The set of matched entries for each of the six hypothetical peptides can be extracted from the matches.

The Darwin function zip manipulates elements of lists directly and returns a list with the manipulated elements. It avoids creating empty arrays and filling them with for loops. The function f uses zip to extract the entry numbers of Offset1 from a list of matches and converts them to a set.

> f := x -> {op (zip ((y->GetEntryNumber(y[Offset1]))(x)))}:

Remember that ms contains a list of lists of matches. If we apply f on each element of ms, we get a list of sets of SwissProt entries. e[i] will contain the set of all SwissProt entries referenced by ms[i] and hence being similar to pep[i].

> e := zip (f (ms));
e := [{12911,36680}, {2614,3503,3703,8669,12881,12900,12901,12902,12903,
12905,12907,12910,12911,19427,23318,23351,23900,24199,33357,33358,33359,
37919,37920,43854}, {9200,10083,12900,12901,12902,12904,12910,12911,13427,
14063,25060,36131,36132,40388,40696}, {}, {}, {12899,12900,12901,12902,
12903,12904,12905,12906,12907,12909,12910,12911,35217}]

me will hold all entries which match both chunks by matching any hypothetical peptide of each chunk.

> me := (e[1] union e[2] union e[3]) intersect
>       (e[4] union e[5] union e[6]);
me := {12900,12901,12902,12903,12904,12905,12907,12910,12911}

Convert this set to an Entry data structure and print the description and species of all these entries:

> me := Entry(op (me));
me := Entry(12900,12901,12902,12903,12904,12905,12907,12910,12911)

> PrintInfo (me, 'DE', 'OS');
Entry 12900:
  DE=GLUCOSE-6-PHOSPHATE ISOMERASE, CHLOROPLAST (GPI) (EC 5.3.1.9)
     (PHOSPHOGLUCOSE ISOMERASE) (PGI) (PHOSPHOHEXOSE ISOMERASE) (PHI).
  OS=CLARKIA UNGUICULATA.
Entry 12901:
  DE=GLUCOSE-6-PHOSPHATE ISOMERASE (GPI) (EC 5.3.1.9) (PHOSPHOGLUCOSE
     ISOMERASE) (PGI) (PHOSPHOHEXOSE ISOMERASE) (PHI).
  OS=ESCHERICHIA COLI.
Entry 12902:
  DE=GLUCOSE-6-PHOSPHATE ISOMERASE (GPI) (EC 5.3.1.9) (PHOSPHOGLUCOSE
     ISOMERASE) (PGI) (PHOSPHOHEXOSE ISOMERASE) (PHI).
  OS=HAEMOPHILUS INFLUENZAE.
Entry 12903:
  DE=GLUCOSE-6-PHOSPHATE ISOMERASE (GPI) (EC 5.3.1.9) (PHOSPHOGLUCOSE
     ISOMERASE) (PGI) (PHOSPHOHEXOSE ISOMERASE) (PHI) (NEUROLEUKIN) (NLK).
  OS=HOMO SAPIENS (HUMAN).
Entry 12904:
  DE=GLUCOSE-6-PHOSPHATE ISOMERASE (GPI) (EC 5.3.1.9) (PHOSPHOGLUCOSE
     ISOMERASE) (PGI) (PHOSPHOHEXOSE ISOMERASE) (PHI).
  OS=KLUYVEROMYCES LACTIS (YEAST).
Entry 12905:
  DE=GLUCOSE-6-PHOSPHATE ISOMERASE (GPI) (EC 5.3.1.9) (PHOSPHOGLUCOSE
     ISOMERASE) (PGI) (PHOSPHOHEXOSE ISOMERASE) (PHI).
  OS=LEISHMANIA MEXICANA.
Entry 12907:
  DE=GLUCOSE-6-PHOSPHATE ISOMERASE (GPI) (EC 5.3.1.9) (PHOSPHOGLUCOSE
     ISOMERASE) (PGI) (PHOSPHOHEXOSE ISOMERASE) (PHI).
  OS=SUS SCROFA (PIG).
Entry 12910:
  DE=GLUCOSE-6-PHOSPHATE ISOMERASE (GPI) (EC 5.3.1.9) (PHOSPHOGLUCOSE
     ISOMERASE) (PGI) (PHOSPHOHEXOSE ISOMERASE) (PHI).
  OS=SACCHAROMYCES CEREVISIAE (BAKER'S YEAST).
Entry 12911:
  DE=GLUCOSE-6-PHOSPHATE ISOMERASE (GPI) (EC 5.3.1.9) (PHOSPHOGLUCOSE
     ISOMERASE) (PGI) (PHOSPHOHEXOSE ISOMERASE) (PHI).
  OS=ZYMOMONAS MOBILIS.

The above tables lead to the following conlusions:

Because there are similarities for more than just one reading frame, the chunks most likely contain sequencing errors.
Both chunks seem to belong to the same mRNA sequence, and chunk2 obviously follows chunk1 in that sequence.
The mRNA seems to be transcribed from some gene for phosphoglucose isomerase.

Next: Direct nucleotide versus peptide Up: Searching for Genes Previous: Searching for Genes

Gaston Gonnet
1998-09-15