Identification and Characterization of the Potential Promoter Regions of 1031 Kinds of Human Genes

Table 1. Predicted TF Binding Sites and CpG Islands in the 1031 PPRs

TF definition   Matrix ID    Hit No. (%)   Preferred region   Searched region   Cutoff value   Consensus sequence
TATA box        V$TATA_01    329 (32%)     −40 ∼ −23          −90 ∼ +27         0.77           STATAAAWRNNNNNN
Initiator       V$CAP_01     872 (85%)     −5 ∼ +6            −55 ∼ +56         0.87           NCANNNNN
GC box          V$GC_01      999 (97%)     −74 ∼ −45          −124 ∼ +5         0.78           NRGGGGCGGGGCNK
CAAT box        V$CAAT_01    663 (64%)     −105 ∼ −70         −155 ∼ −20        0.78           NNNRRCCAATSA

                Hit No. (%)   Length (bp)   CpG ratio   GC content (%)
CpG island      493 (48%)     >200          0.6         50

  • Each TF binding site was searched for in a window derived from the preferred region of its binding motif, extended by 50 bases at both ends. For example, because the preferred region of the TATA box is −40 to −23, the region from −90 to +27 was searched. The margins were added because multiple mRNA start sites were observed in some cases.

  • Matrix IDs are given in TRANSFAC notation, which starts with an identifier that indicates vertebrates (V$), followed by an acronym for the factor (for more details, see http://transfac.gbf.de/TRANSFAC/doc/site3.html).

  • The symbols used in addition to A, C, G, and T are: W = A or T; S = C or G; R = A or G; Y = C or T; K = G or T; M = A or C; B = C, G, or T; D = A, G, or T; H = A, C, or T; V = A, C, or G; N = A, C, G, or T.
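
The footnotes above describe two mechanical steps: widening each preferred region by 50 bases on both sides, and reading the consensus sequences with the degenerate-base symbols. The Python sketch below illustrates both. It is only an illustration under stated assumptions: the function names (`searched_region`, `consensus_to_regex`, `find_consensus`) are hypothetical, and exact consensus matching is a simplification that does not reproduce the TRANSFAC matrix scoring and cutoff values listed in the table.

```python
import re

# Degenerate-base symbols as listed in the footnote (Y = C or T).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "AT", "S": "CG", "R": "AG", "Y": "CT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

MARGIN = 50  # bases added at each end of the preferred region


def searched_region(preferred_start, preferred_end, margin=MARGIN):
    """Widen a preferred region by `margin` bases on both sides."""
    return preferred_start - margin, preferred_end + margin


def consensus_to_regex(consensus):
    """Translate a consensus string into a compiled regular expression."""
    parts = []
    for symbol in consensus:
        bases = IUPAC[symbol]
        # A single base stays literal; several bases become a character class.
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return re.compile("".join(parts))


def find_consensus(sequence, consensus):
    """Return 0-based start offsets of all (possibly overlapping) exact matches.

    Assumption: exact consensus matching stands in for matrix scoring here.
    """
    pattern = consensus_to_regex(consensus).pattern
    lookahead = re.compile("(?=" + pattern + ")")
    return [m.start() for m in lookahead.finditer(sequence.upper())]


if __name__ == "__main__":
    print(searched_region(-40, -23))  # TATA box window: (-90, 27)
    print(consensus_to_regex("NNNRRCCAATSA").pattern)
    print(find_consensus("ACTTTGGCCAATCAGG", "NNNRRCCAATSA"))
```

The lookahead in `find_consensus` is a design choice so that overlapping occurrences are all reported; a plain `finditer` on the pattern would skip matches that overlap an earlier one.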

Genome Res. 11: 677-684