A code for transcription initiation. (A) Dinucleotide frequencies at fixed distances from dominant transcription start sites (TSSs) in HepG2 cells. Dinucleotide counts in the −3/+3 region around 7734 TSSs are shown as a table. These frequencies are highly non-random; each dinucleotide has a P-value describing its over-representation, where low P-values correspond to high over-representation. Dinucleotides are shaded according to the P-value range they fall in, with red and blue representing the most and least significant categories, respectively (see legend at right of table). In general, oligonucleotide frequencies in the −50/+50 region constitute a code for TSS selection. The most frequent motifs in this region are shown in B. (B) Over-represented k-mers at fixed distances from dominant TSSs in HepG2 cells. This is a graphical representation of the same type of data as in A, but extended to all over-represented DNA words (k-mers) in the −50/+50 region around dominant TSSs. Statistically over-represented k-mers are displayed at the positions where they occur relative to the dominant TSS, whose first transcribed nucleotide is at +1. As in A, k-mers are colored according to their over-representation P-value. From left to right, the word columns can be described as SP1-like (−50/−37), TATA-box (−32/−25), Inr/Pyrimidine-Purine (−2/+3), gcg-motif (+12/+21), and gcg echo (+25/+32). Each column (motif) is sorted by P-values independently of the other columns; for instance, the words in the Inr column are all more significantly over-represented than those in the gcg column. See Supplemental Figure S2 with legend for a more detailed description of each motif with corresponding statistics, sorted by overall P-value, and Supplemental Figures S3–S7 for corresponding figures using other cell lines from human and mouse.
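The kind of tabulation described in panel A can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the caption does not state which statistical test or background model was used, so the one-sided binomial tail and the uniform background frequency of 1/16 per dinucleotide are assumptions made purely for illustration. The function names, the toy sequences, and the `shift` convention (shift 0 is the +1 nucleotide, shift −1 is position −1) are likewise hypothetical.

```python
from collections import Counter
from math import exp, lgamma, log


def positional_dinucleotide_counts(windows, plus_one_index, shifts):
    """Count dinucleotides starting at fixed shifts from the +1 nucleotide.

    windows: sequences aligned so that seq[plus_one_index] is the first
    transcribed nucleotide (+1). A shift of 0 starts at +1, a shift of -1
    starts at position -1 (TSS numbering has no position 0).
    """
    counts = {s: Counter() for s in shifts}
    for seq in windows:
        for s in shifts:
            start = plus_one_index + s
            # Skip dinucleotides that would run off either end of the window.
            if start >= 0 and start + 2 <= len(seq):
                counts[s][seq[start:start + 2].upper()] += 1
    return counts


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1.

    Terms are computed in log space so that thousands of sequences
    do not overflow the binomial coefficient.
    """
    if k <= 0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        log_term = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                    + i * log(p) + (n - i) * log(1.0 - p))
        total += exp(log_term)
    return min(total, 1.0)


# Toy input (hypothetical sequences, NOT data from the figure).
# Index 4 of each window is taken to be the +1 nucleotide.
windows = ["GGTCATTC", "AATCAGTC", "CCTCACGG", "TTGCATAA"]
counts = positional_dinucleotide_counts(windows, plus_one_index=4, shifts=[-1, 0])

# ASSUMPTION: uniform background, each of the 16 dinucleotides has p = 1/16.
background = 1.0 / 16.0
n_sites = len(windows)
p_values = {
    (shift, dinuc): binomial_upper_tail(count, n_sites, background)
    for shift, table in counts.items()
    for dinuc, count in table.items()
}

for (shift, dinuc), pv in sorted(p_values.items(), key=lambda item: item[1]):
    print(f"shift {shift:+d}  {dinuc}  count={counts[shift][dinuc]}  P={pv:.3g}")
```

In this toy input every window carries CA spanning −1/+1, so that dinucleotide receives the smallest P-value. A real analysis would use a background estimated from genomic or flanking sequence composition and would correct for multiple testing across positions and words.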
