Nonrandom Tripeptide Sequence Distributions at Protein Carboxyl Termini

Table 2.

Most Frequent Tripeptide Sequences Observed Within the Genomes Studied

Organism N(Occ) N(Seq) (expected) N(Seq) (observed) Sequences observed
M. jannaschii 12 0.1 ± 0.3 2 KEE (4.0), LKK (1.8)
 (1773 ORFs) 10 0.4 ± 0.6 1 KKL (1.9)
8 1.4 ± 1.1 2 KKK (1.0), LKE (1.6)
7 2.7 ± 1.6 4 IIK (2.1), KKE (1.1), LNK (3.0), RLL (4.8)
6 5.4 ± 2.2 9 EKE (1.6), EKL (1.8), IKK (1.0), KIE (1.8), KKD (3.3), KKI (1.3), RKK (1.8), VKE (2.6), VKK (2.0)
E. coli 11 0.0 ± 0.2 1 AKK (3.5)
 (4290 ORFs) 10 0.1 ± 0.3 3 KKK (3.5), RSH (9.7), RSR (4.8)
9 0.4 ± 0.6 2 EAK (2.5), RLK (2.5)
8 1.2 ± 1.1 7 AAQ (3.9), EEA (3.5), EVK (3.3), GLL (4.3), LEI (8.0), LLG (2.9), RRG (4.7)
7 3.6 ± 1.8 8 DGE (7.4), EEV (6.5), GGK (4.1), KLA (2.4), LAS (2.8), NLA (4.0), RRR (3.1), SEE (5.4)
S. cerevisiae 21 0.0 ± 0.0 1 SKK (3.3)
 (6215 ORFs) 19 0.0 ± 0.0 1 KKK (2.8)
17 0.0 ± 0.1 1 VGE (16.5)
16 0.0 ± 0.1 2 AKK (4.3), WIH (120.2)
15 0.0 ± 0.2 2 DEL (7.3), IAN (12.0)
13 0.1 ± 0.3 1 SKL (2.2)
12 0.3 ± 0.5 4 EKK (1.5), LKK (1.5), LLL (1.9), LSK (1.9)
11 0.6 ± 0.7 2 KKE (2.9), LLK (1.6)
10 1.3 ± 1.1 1 GKK (2.8)
9 2.9 ± 1.7 7 DEE (7.1), DSK (2.7), FWC (158.6), LSI (2.3), MLL (6.5), QKI (4.6), SSS (3.0)
8 6.1 ± 2.4 12 DDE (7.0), EVD (8.8), IPK (5.8), KEK (2.2), KKD (2.6), KKN (1.9), LDL (2.3), LLV (2.6), RRK (3.5), SLA (3.2), SSL (1.7), TKK (2.2)
A. thaliana 99 0.0 ± 0.0 1 SSS (3.4)
 (25561 ORFs) 54 0.0 ± 0.0 1 DYW (95.3)
43 0.0 ± 0.1 1 SSL (1.4)
40 0.0 ± 0.2 1 ASS (2.6)
39 0.1 ± 0.2 2 DEL (5.2), TSS (2.6)
38 0.1 ± 0.3 1 SKL (1.8)
36 0.2 ± 0.4 1 LKL (1.8)
34 0.2 ± 0.5 1 LLS (1.6)
32 0.3 ± 0.6 1 EEE (8.2)
31 0.4 ± 0.6 1 SST (2.3)
30 0.5 ± 0.7 2 LSS (1.1), STS (2.0)
29 0.6 ± 0.8 4 KKK (3.4), LLL (1.3), PSS (2.1), RRR (3.8)
28 0.8 ± 0.9 2 SSI (1.8), VSS (1.7)
26 1.2 ± 1.1 3 DSD (4.3), GSS (1.9), LVF (3.2)
25 1.6 ± 1.3 2 DEE (6.7), KKR (2.7)
24 1.9 ± 1.3 4 FLL (2.1), FSS (1.7), LSL (0.9), RRS (2.1)
23 2.5 ± 1.5 8 DDE (7.5), EED (6.8), SFL (1.7), SLL (1.0), SSR (1.2), SSV (1.1), VSA (2.3), VTL (2.7)
C. elegans 70 0.0 ± 0.0 1 KKK (4.6)
 (19833 ORFs) 45 0.0 ± 0.0 1 LCE (20.4)
38 0.0 ± 0.0 2 SKL (2.3), YNP (33.7)
36 0.0 ± 0.0 1 PGY (20.8)
32 0.0 ± 0.0 2 GKK (4.3), TKY (5.8)
30 0.0 ± 0.0 1 SSK (2.2)
28 0.0 ± 0.1 3 DDE (11.5), KKN (2.2), SKK (1.8)
26 0.1 ± 0.2 2 DSD (7.7), RRK (3.8)
24 0.2 ± 0.4 5 AKK (2.8), DEE (8.4), KKL (1.5), KRK (2.4), LKK (1.6)
23 0.3 ± 0.5 5 AKL (2.6), DEL (4.9), GRK (4.6), KKE (2.3), KKI (2.1)
22 0.4 ± 0.6 3 EKK (2.3), SKN (1.6), TNS (3.9)
21 0.7 ± 0.8 1 TRR (5.4)
20 1.1 ± 1.1 4 ERA (5.3), KKQ (2.3), RKL (1.8), RRR (5.7)
19 1.8 ± 1.3 7 DKE (3.6), FGK (4.3), INY (5.4), LGL (2.8), NKK (3.1), SSF (1.5), VSS (2.9)
18 2.6 ± 1.5 9 EKL (1.8), FGG (12.2), KSE (2.1), LFN (2.5), LKI (1.6), RIC (9.5), SRR (3.3), SSS (1.8), VKK (1.8)
H. sapiens 32 0.0 ± 0.0 1 DEL (6.3)
 (14760 ORFs) 31 0.0 ± 0.0 1 EKK (5.3)
28 0.0 ± 0.0 1 KKK (4.5)
25 0.0 ± 0.1 1 LKF (5.1)
22 0.1 ± 0.3 1 EEE (6.3)
21 0.2 ± 0.4 2 LLL (1.6), SDQ (6.0)
20 0.3 ± 0.5 2 LAL (2.2), SSK (1.9)
19 0.4 ± 0.6 3 EEL (2.5), LLK (2.2), WNK (28.0)
18 0.7 ± 0.8 3 ASS (2.1), TRL (2.7), TSL (1.8)
17 1.0 ± 1.0 6 KGK (3.4), KRK (3.3), LGL (1.6), LLS (1.6), RKK (3.5), SLL (1.2)
16 1.6 ± 1.3 5 EDD (7.1), RRR (5.8), SES (1.7), SKL (1.2), TEL (2.2)
15 2.7 ± 1.6 9 GSS (1.9), KRR (4.2), NKI (8.5), PSS (1.8), RRK (3.8), SSL (1.0), SSS (1.2), TKL (1.8), TVV (5.0)
14 4.2 ± 2.0 9 APL (2.2), EKP (3.2), ERA (4.1), GKK (2.6), KSS (1.5), LVS (2.2), PGP (4.4), SCC (11.1), TEV (3.3)
13 6.5 ± 2.4 13 AKL (1.6), CGF (12.8), DSD (4.7), DTM (18.3), EDL (2.3), KKN (3.9), LEA (2.6), PPQ (4.8), SHL (2.8), SSP (1.7), SVS (1.9), TSI (3.3), VSS (2.0)
12 10.1 ± 3.0 20 AAS (2.2), EED (3.8), EKL (1.4), EVD (5.4), FGG (9.4), KAK (2.5), LKL (1.0), LPQ (3.0), LSL (0.9), LSS (1.0), PAS (2.3), QGL (2.6), RPY (7.6), SEI (2.6), SLS (1.0), SLT (2.0), SSV (1.4), TAL (1.9), TTV (3.8), VLL (1.7)
For each organism, the number of ORFs used in the analysis is given in parentheses. N(Occ) is the number of times a particular sequence occurs in the genome; N(Seq) is the number of distinct sequences that occur exactly N(Occ) times. The expected value of N(Seq) is derived from the genome jumbling method (uncertainties are one standard deviation). The value in parentheses after each sequence is the ratio of the number of times that sequence is observed to the number of times it is expected from positional amino acid frequencies. Sequences in boldface are known recognition motifs; italicized sequences belong, entirely or in part, to highly repeated sequences (e.g., homologous proteins or transposon ORFs); and underlined sequences take the form XKK (XSS in A. thaliana).
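
The quantities defined in the footnote can be sketched in Python. This is a minimal illustration, not the authors' implementation. It assumes that positional amino acid frequencies are measured over the last three residues of every ORF, and it assumes that "genome jumbling" means shuffling the residues at each C-terminal position independently across ORFs, which the footnote does not spell out. The function names, the default trial count, and the use of the population standard deviation are illustrative choices.

```python
import random
from collections import Counter
from statistics import mean, pstdev


def observed_expected_ratios(ctermini):
    """Ratio of observed to expected counts for each C-terminal tripeptide.

    ctermini: list of 3-letter strings, one per ORF.
    Expected count = N * f1(a) * f2(b) * f3(c), where f_i is the frequency
    of a residue at position i (assumed definition).
    """
    n = len(ctermini)
    observed = Counter(ctermini)
    pos_freq = [Counter(t[i] for t in ctermini) for i in range(3)]
    ratios = {}
    for tri, obs in observed.items():
        expected = n
        for i, aa in enumerate(tri):
            expected *= pos_freq[i][aa] / n
        ratios[tri] = obs / expected
    return ratios


def n_seq_histogram(ctermini):
    """Map N(Occ) -> N(Seq): number of distinct tripeptides seen exactly N(Occ) times."""
    return Counter(Counter(ctermini).values())


def jumbled_expectation(ctermini, trials=200, seed=0):
    """Mean and standard deviation of N(Seq) per N(Occ) over jumbled genomes.

    Assumed jumbling scheme: each of the three positions is shuffled
    independently across ORFs, preserving positional composition.
    """
    rng = random.Random(seed)
    columns = [[t[i] for t in ctermini] for i in range(3)]
    histograms = []
    for _ in range(trials):
        shuffled = [rng.sample(col, len(col)) for col in columns]
        jumbled = ["".join(tri) for tri in zip(*shuffled)]
        histograms.append(n_seq_histogram(jumbled))
    all_occ = sorted({k for h in histograms for k in h})
    result = {}
    for k in all_occ:
        counts = [h.get(k, 0) for h in histograms]
        result[k] = (mean(counts), pstdev(counts))
    return result
```

Given the C-terminal tripeptides of a genome, `jumbled_expectation` would return, for each N(Occ), a mean and spread of the same form as the expected column, and `observed_expected_ratios` would return values of the same form as those in parentheses after each sequence.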

Genome Res. 13: 617-623