Markup | Genome Research

Table 2.

Most Frequent Tripeptide Sequences Observed Within the Genomes Studied

Organism	N(Occ)	N(Seq) (expected)	N(Seq) (observed)	Sequences observed
M. jannaschii	12	0.1 ± 0.3	2	KEE (4.0), LKK (1.8)
(1773 ORFs)	10	0.4 ± 0.6	1	KKL (1.9)
	8	1.4 ± 1.1	2	KKK (1.0), LKE (1.6)
	7	2.7 ± 1.6	4	IIK (2.1), KKE (1.1), LNK (3.0), RLL (4.8)
	6	5.4 ± 2.2	9	EKE (1.6), EKL (1.8), IKK (1.0), KIE (1.8), KKD (3.3), KKI (1.3), RKK (1.8), VKE (2.6), VKK (2.0)
E. coli	11	0.0 ± 0.2	1	AKK (3.5)
(4290 ORFs)	10	0.1 ± 0.3	3	KKK (3.5), RSH (9.7), RSR (4.8)
	9	0.4 ± 0.6	2	EAK (2.5), RLK (2.5)
	8	1.2 ± 1.1	7	AAQ (3.9), EEA (3.5), EVK (3.3), GLL (4.3), LEI (8.0), LLG (2.9), RRG (4.7)
	7	3.6 ± 1.8	8	DGE (7.4), EEV (6.5), GGK (4.1), KLA (2.4), LAS (2.8), NLA (4.0), RRR (3.1), SEE (5.4)
S. cerevisiae	21	0.0 ± 0.0	1	SKK (3.3)
(6215 ORFs)	19	0.0 ± 0.0	1	KKK (2.8)
	17	0.0 ± 0.1	1	VGE (16.5)
	16	0.0 ± 0.1	2	AKK (4.3), WIH (120.2)
	15	0.0 ± 0.2	2	DEL (7.3), IAN (12.0)
	13	0.1 ± 0.3	1	SKL (2.2)
	12	0.3 ± 0.5	4	EKK (1.5), LKK (1.5), LLL (1.9), LSK (1.9)
	11	0.6 ± 0.7	2	KKE (2.9), LLK (1.6)
	10	1.3 ± 1.1	1	GKK (2.8)
	9	2.9 ± 1.7	7	DEE (7.1), DSK (2.7), FWC (158.6), LSI (2.3), MLL (6.5), QKI (4.6), SSS (3.0)
	8	6.1 ± 2.4	12	DDE (7.0), EVD (8.8), IPK (5.8), KEK (2.2), KKD (2.6), KKN (1.9), LDL (2.3), LLV (2.6), RRK (3.5), SLA (3.2), SSL (1.7), TKK (2.2)
A. thaliana	99	0.0 ± 0.0	1	SSS (3.4)
(25561 ORFs)	54	0.0 ± 0.0	1	DYW (95.3)
	43	0.0 ± 0.1	1	SSL (1.4)
	40	0.0 ± 0.2	1	ASS (2.6)
	39	0.1 ± 0.2	2	DEL (5.2), TSS (2.6)
	38	0.1 ± 0.3	1	SKL (1.8)
	36	0.2 ± 0.4	1	LKL (1.8)
	34	0.2 ± 0.5	1	LLS (1.6)
	32	0.3 ± 0.6	1	EEE (8.2)
	31	0.4 ± 0.6	1	SST (2.3)
	30	0.5 ± 0.7	2	LSS (1.1), STS (2.0)
	29	0.6 ± 0.8	4	KKK (3.4), LLL (1.3), PSS (2.1), RRR (3.8)
	28	0.8 ± 0.9	2	SSI (1.8), VSS (1.7)
	26	1.2 ± 1.1	3	DSD (4.3), GSS (1.9), LVF (3.2)
	25	1.6 ± 1.3	2	DEE (6.7), KKR (2.7)
	24	1.9 ± 1.3	4	FLL (2.1), FSS (1.7), LSL (0.9), RRS (2.1)
	23	2.5 ± 1.5	8	DDE (7.5), EED (6.8), SFL (1.7), SLL (1.0), SSR (1.2), SSV (1.1), VSA (2.3), VTL (2.7)
C. elegans	70	0.0 ± 0.0	1	KKK (4.6)
(19833 ORFs)	45	0.0 ± 0.0	1	LCE (20.4)
	38	0.0 ± 0.0	2	SKL (2.3), YNP (33.7)
	36	0.0 ± 0.0	1	PGY (20.8)
	32	0.0 ± 0.0	2	GKK (4.3), TKY (5.8)
	30	0.0 ± 0.0	1	SSK (2.2)
	28	0.0 ± 0.1	3	DDE (11.5), KKN (2.2), SKK (1.8)
	26	0.1 ± 0.2	2	DSD (7.7), RRK (3.8)
	24	0.2 ± 0.4	5	AKK (2.8), DEE (8.4), KKL (1.5), KRK (2.4), LKK (1.6)
	23	0.3 ± 0.5	5	AKL (2.6), DEL (4.9), GRK (4.6), KKE (2.3), KKI (2.1)
	22	0.4 ± 0.6	3	EKK (2.3), SKN (1.6), TNS (3.9)
	21	0.7 ± 0.8	1	TRR (5.4)
	20	1.1 ± 1.1	4	ERA (5.3), KKQ (2.3), RKL (1.8), RRR (5.7)
	19	1.8 ± 1.3	7	DKE (3.6), FGK (4.3), INY (5.4), LGL (2.8), NKK (3.1), SSF (1.5), VSS (2.9)
	18	2.6 ± 1.5	9	EKL (1.8), FGG (12.2), KSE (2.1), LFN (2.5), LKI (1.6), RIC (9.5), SRR (3.3), SSS (1.8), VKK (1.8)
H. sapiens	32	0.0 ± 0.0	1	DEL (6.3)
(14760 ORFs)	31	0.0 ± 0.0	1	EKK (5.3)
	28	0.0 ± 0.0	1	KKK (4.5)
	25	0.0 ± 0.1	1	LKF (5.1)
	22	0.1 ± 0.3	1	EEE (6.3)
	21	0.2 ± 0.4	2	LLL (1.6), SDQ (6.0)
	20	0.3 ± 0.5	2	LAL (2.2), SSK (1.9)
	19	0.4 ± 0.6	3	EEL (2.5), LLK (2.2), WNK (28.0)
	18	0.7 ± 0.8	3	ASS (2.1), TRL (2.7), TSL (1.8)
	17	1.0 ± 1.0	6	KGK (3.4), KRK (3.3), LGL (1.6), LLS (1.6), RKK (3.5), SLL (1.2)
	16	1.6 ± 1.3	5	EDD (7.1), RRR (5.8), SES (1.7), SKL (1.2), TEL (2.2)
	15	2.7 ± 1.6	9	GSS (1.9), KRR (4.2), NKI (8.5), PSS (1.8), RRK (3.8), SSL (1.0), SSS (1.2), TKL (1.8), TVV (5.0)
	14	4.2 ± 2.0	9	APL (2.2), EKP (3.2), ERA (4.1), GKK (2.6), KSS (1.5), LVS (2.2), PGP (4.4), SCC (11.1), TEV (3.3)
	13	6.5 ± 2.4	13	AKL (1.6), CGF (12.8), DSD (4.7), DTM (18.3), EDL (2.3), KKN (3.9), LEA (2.6), PPQ (4.8), SHL (2.8), SSP (1.7), SVS (1.9), TSI (3.3), VSS (2.0)
	12	10.1 ± 3.0	20	AAS (2.2), EED (3.8), EKL (1.4), EVD (5.4), FGG (9.4), KAK (2.5), LKL (1.0), LPQ (3.0), LSL (0.9), LSS (1.0), PAS (2.3), QGL (2.6), RPY (7.6), SEI (2.6), SLS (1.0), SLT (2.0), SSV (1.4), TAL (1.9), TTV (3.8), VLL (1.7)

[i] For each organism, the number of ORFs used in the analysis is indicated. N(Occ) is the number of times a particular sequence occurs in the genome; N(Seq) is the number of distinct sequences that occur exactly N(Occ) times. The expected value of N(Seq) is derived from the genome jumbling method (uncertainties are one standard deviation). The value in parentheses after each sequence is the ratio of the number of times that sequence is observed to the number of times it is expected based on positional amino acid frequencies. Sequences in boldface are known recognition motifs; italicized sequences belong, entirely or in part, to highly repeated sequences (e.g., homologous proteins or transposon ORFs); and underlined sequences take the form XKK (XSS in A. thaliana).
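
The quantities defined in this note can be illustrated with a minimal Python sketch. It rests on several assumptions that the table itself does not confirm: that one tripeptide is taken per ORF, that the "expected" count for a sequence is the number of ORFs multiplied by the frequency of each residue at its position, and that "genome jumbling" can be modelled by shuffling the residues at each position independently across ORFs. The function names (`occurrence_histogram`, `observed_to_expected`, `jumbled_expectation`) are hypothetical and chosen for illustration only.

```python
import random
from collections import Counter
from statistics import mean, stdev


def occurrence_histogram(tripeptides):
    """Map N(Occ) -> N(Seq): how many distinct tripeptides occur exactly N(Occ) times."""
    per_sequence = Counter(tripeptides)
    return Counter(per_sequence.values())


def observed_to_expected(tripeptides):
    """Ratio of each tripeptide's observed count to the count expected
    from per-position residue frequencies (assumed model)."""
    n = len(tripeptides)
    observed = Counter(tripeptides)
    # Residue counts at each of the three positions.
    position_counts = [Counter(t[i] for t in tripeptides) for i in range(3)]
    ratios = {}
    for seq, count in observed.items():
        expected = n
        for i, residue in enumerate(seq):
            expected *= position_counts[i][residue] / n
        ratios[seq] = count / expected
    return ratios


def jumbled_expectation(tripeptides, trials=100, seed=0):
    """Mean and standard deviation of N(Seq) for each N(Occ) over jumbled data.

    Assumption: jumbling shuffles each position's residues independently
    across ORFs, which preserves the positional residue frequencies.
    Requires trials >= 2 so that a standard deviation is defined.
    """
    rng = random.Random(seed)
    columns = [[t[i] for t in tripeptides] for i in range(3)]
    histograms = []
    for _ in range(trials):
        shuffled = [rng.sample(col, len(col)) for col in columns]
        jumbled = ["".join(residues) for residues in zip(*shuffled)]
        histograms.append(occurrence_histogram(jumbled))
    all_occ = sorted(set().union(*histograms))
    return {
        occ: (mean([h[occ] for h in histograms]),
              stdev([h[occ] for h in histograms]))
        for occ in all_occ
    }
```

Given a list with one tripeptide string per ORF, `occurrence_histogram` yields the observed N(Seq) column, `jumbled_expectation` yields a mean and standard deviation comparable in form to the expected N(Seq) column, and `observed_to_expected` yields ratios comparable in form to the values in parentheses. The numbers will match the table only if these assumptions agree with the procedure actually used.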